Abstract
This article studies deep neural network expression rates for optimal stopping problems of discrete-time Markov processes on high-dimensional state spaces. A general framework is established in which the value function and continuation value of an optimal stopping problem can be approximated with error at most \(\varepsilon \) by a deep ReLU neural network of size at most \(\kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\). The constants \(\kappa ,\mathfrak{q},\mathfrak{r} \geq 0\) do not depend on the dimension \(d\) of the state space or the approximation accuracy \(\varepsilon \). This proves that deep neural networks do not suffer from the curse of dimensionality when employed to approximate solutions of optimal stopping problems. The framework covers for example exponential Lévy models, discrete diffusion processes and their running minima and maxima. These results mathematically justify the use of deep neural networks for numerically solving optimal stopping problems and pricing American options in high dimensions.
1 Introduction
In the past few years, neural-network-based methods have become ubiquitous across science, technology, economics and finance. In particular, such methods have been applied to various problems in mathematical finance such as pricing, hedging and calibration. We refer for instance to the articles Buehler et al. [14], Becker et al. [7, 8], Cuchiero et al. [17] and to the survey papers Ruf and Wang [44], Germain et al. [22], Beck et al. [5] for an overview and further references. The striking computational performance of these methods has also raised questions regarding their theoretical foundations. Towards a complete theoretical understanding, there have been recent results in the literature which prove that deep neural networks are able to approximate option prices in various models without the curse of dimensionality. For deep neural network expressivity results for option prices and associated PDEs, we refer for instance to Elbrächter et al. [19], Grohs et al. [28] for European options in Black–Scholes models, to Hutzenthaler et al. [33], Cioica-Licht et al. [15] for certain semilinear PDEs, to Grohs and Herrmann [27] for certain Hamilton–Jacobi–Bellman equations, to Reisinger and Zhang [41], Jentzen et al. [35], Takahashi and Yamada [48] for diffusion models and game-type options, and to Gonon and Schwab [26] for certain path-dependent options in jump–diffusion models. A few works are also concerned with generalisation errors (Berner et al. [11]) and learning errors (Gonon [23]).
The goal of this article is to analyse deep neural network expressivity for American option prices and general optimal stopping problems in discrete time. An optimal stopping problem consists in selecting a stopping time \(\tau \) such that the expected reward \(\mathbb{E}[g_{d}(\tau ,X_{\tau}^{d})]\) is maximised. Here \(X^{d}\) is a given stochastic process taking values in \(\mathbb{R}^{d}\) and \(g_{d}(t,x)\) is the reward obtained if the process is stopped at time \(t\) in state \(x\). Optimal stopping problems arise in a wide range of contexts in statistics, operations research, economics and finance. In mathematical finance, arbitrage-free prices of American and Bermudan options are given as solutions of optimal stopping problems. The solution of an optimal stopping problem can be described by the so-called Snell envelope or, equivalently, by a backward recursion (discrete time) or a free-boundary PDE (continuous time) in the case when \(X^{d}\) is a Markov process.
In recent years, a wide range of computational methods have been developed to numerically solve optimal stopping problems also in high-dimensional situations, i.e., when the dimension \(d\) of the state space is large. For regression-based algorithms, we refer e.g. to Tsitsiklis and Van Roy [49], Longstaff and Schwartz [38]; for duality-based methods, we refer e.g. to Rogers [43], Andersen and Broadie [1], Haugh and Kogan [31], Belomestny et al. [10]; for stochastic grid methods, we refer e.g. to Broadie and Glasserman [13], Jain and Oosterlee [34]; and for methods based on approximating the exercise boundary, we refer e.g. to Garcia [21]. See for instance also the overview in Bouchard and Warin [12]. Recently proposed methods include signature-based methods (Bayer et al. [4]) and regression trees (Ech-Chafiq et al. [18]). Furthermore, various methods based on deep neural network approximations of the value function, the continuation value or the exercise boundary of the optimal stopping problem have been proposed; see for instance Kohler et al. [36], Becker et al. [6, 7, 8], Herrera et al. [32], Lapeyre and Lelong [37], Reppen et al. [42], and for methods for continuous-time free boundary problems Sirignano and Spiliopoulos [47], Wang and Perdikaris [50]. For many of these methods, also theoretical convergence results or even convergence rates (cf. e.g. Clément et al. [16], Belomestny [9], Bayer et al. [3]) for a fixed dimension \(d\) have been established.
In this article, we are interested in mathematically analysing the high-dimensional situation, i.e., in explicitly controlling the dependence on the dimension \(d\). We analyse deep neural network approximations for the value function of optimal stopping problems. We provide general conditions on the reward function \(g_{d}\) and the stochastic process \(X^{d}\) which ensure that the value function (and the continuation value) of an optimal stopping problem can be approximated by deep ReLU neural networks without the curse of dimensionality, i.e., that an approximation error of size at most \(\varepsilon \) can be achieved by a deep ReLU neural network of size \(\kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) for constants \(\kappa ,\mathfrak{q},\mathfrak{r} \geq 0\) which do not depend on the dimension \(d\) or the accuracy \(\varepsilon \). The framework in particular provides deep neural network expressivity results for prices of American and Bermudan options. Our conditions cover most practically relevant payoffs and many popular models such as exponential Lévy models and discrete diffusion processes. The constants \(\kappa \), \(\mathfrak{q}\), \(\mathfrak{r}\) are explicit, and thus the obtained results yield bounds for the approximation error component in any algorithm for optimal stopping and American option pricing in high dimensions which is based on approximating the value function or the continuation value by deep neural networks.
The remainder of the paper is organised as follows. In Sect. 2, we formulate the optimal stopping problem, recall its solution by dynamic programming and introduce the notation for deep neural networks. In Sect. 3, we formulate the assumptions and main results. Specifically, in Sect. 3.1, we formulate a basic framework, Assumptions 3.1 and 3.4, and prove that the value function can be approximated by deep neural networks without the curse of dimensionality; see Theorem 3.6. In Sect. 3.3, we then provide more refined assumptions on the considered Markov processes and extend the approximation result to this refined framework; see Theorem 3.12 which is the main result of the article. In Sects. 3.4–3.6, we then apply this result to exponential Lévy models and discrete diffusion processes and show that also barrier options can be covered via the running maximum or minimum of such processes. In order to make the presentation more streamlined, most proofs, in particular the proofs of Theorems 3.6 and 3.12, are postponed to Sect. 4.
1.1 Notation
Throughout, we fix a time horizon \(T \in \mathbb{N}\) and a probability space \((\Omega ,\mathcal {F},\mathbb{P})\) on which all random variables and processes are defined. For \(d \in \mathbb{N}\), \(x \in \mathbb{R}^{d}\), \(A \in \mathbb{R}^{d \times d}\), we denote by \(|x|\) the Euclidean norm of \(x\) and by \(|A|_{F}\) the Frobenius norm of \(A\). For \(f \colon \mathbb{R}^{d_{0}} \times \mathbb{R}^{d_{1}} \to \mathbb{R}^{d_{2}}\), we let
2 Preliminaries
In this section, we first formulate the optimal stopping problem and recall its solution in terms of the value function. Then we introduce the required notation for deep neural networks.
2.1 The optimal stopping problem
For each \(d \in \mathbb{N}\), consider a function \(g_{d} \colon \{0,\ldots ,T\} \times \mathbb{R}^{d} \to \mathbb{R}\) and a discrete-time \(\mathbb{R}^{d}\)-valued Markov process \(X^{d}=(X_{t}^{d})_{t \in \{0,\ldots ,T\}}\). Assume for each \(t \in \{0,\ldots ,T\}\) that \(\mathbb{E}[|g_{d}(t,X_{t}^{d})|]< \infty \) and let \(\mathbb{F}=(\mathcal{F}_{t})_{t \in \{ {0,\ldots ,T}\}}\) be the filtration generated by \(X^{d}\). Denote by \(\mathcal{T}\) the set of \(\mathbb{F}\)-stopping times \(\tau \colon \Omega \to \{0,\ldots ,T\}\) and by \(\mathcal{T}_{t}\) the set of all \(\tau \in \mathcal{T}\) with \(\tau \geq t\). For notational simplicity, we omit the dependence on \(d\) in \(\mathbb{F}\), \(\mathcal{T}\) and \(\mathcal{T}_{t}\).
The optimal stopping problem consists in computing
\[ \sup _{\tau \in \mathcal{T}} \mathbb{E}\big[g_{d}(\tau ,X_{\tau}^{d})\big]. \qquad (2.1) \]
Consider the value function \(V_{d}\) defined by \(V_{d}(T,x) = g_{d}(T,x)\) and the backward recursion
\[ V_{d}(t,x) = \max \big( g_{d}(t,x), \mathbb{E}[V_{d}(t+1,X_{t+1}^{d}) \,|\, X_{t}^{d}=x] \big) \qquad (2.2) \]
for \(t=T-1,\ldots ,0\) and \(\mathbb{P}\circ (X_{t}^{d})^{-1}\)-a.e. \(x \in \mathbb{R}^{d}\). Then knowledge of \(V_{d}\) also allows computing a stopping time \(\tau ^{*} \in \mathcal{T}\) which is a maximiser in (2.1): the stopping time
\[ \tau ^{*} = \min \{ t \geq 0 \colon V_{d}(t,X_{t}^{d}) = g_{d}(t,X_{t}^{d}) \} \]
satisfies \(\mathbb{E}[g_{d}(\tau ^{*},X_{\tau ^{*}}^{d})] = \sup _{\tau \in \mathcal{T}} \mathbb{E}[g_{d}(\tau ,X^{d}_{\tau})]\). Indeed, by backward induction and the Markov property, we obtain that \(V_{d}(t,X^{d}_{t})=U^{d}_{t}\) ℙ-a.s., where \(U^{d}\) is the Snell envelope defined at \(T\) by \(U_{T}^{d} = g_{d}(T,X_{T}^{d})\) and for \(t=T-1,\ldots ,0\) by the backward recursion \(U_{t}^{d} = \max (g_{d}(t,X_{t}^{d}),\mathbb{E}[U_{t+1}^{d}|\mathcal {F}_{t}])\). Then for instance by Föllmer and Schied [20, Theorem 6.18], we have ℙ-a.s. for all \(t \in \{0,\ldots ,T\}\) that
\[ U_{t}^{d} = \mathbb{E}\big[ g_{d}(\tau _{\min}^{(t)},X_{\tau _{\min}^{(t)}}^{d}) \,\big|\, \mathcal{F}_{t} \big] = \operatorname*{ess\,sup}_{\tau \in \mathcal{T}_{t}} \mathbb{E}\big[ g_{d}(\tau ,X_{\tau}^{d}) \,\big|\, \mathcal{F}_{t} \big], \]
where \(\tau _{\min}^{(t)} = \min \{s \geq t \colon U_{s}^{d} = g_{d}(s,X_{s}^{d}) \}\). In particular, \(\tau _{\min}^{(0)}=\tau ^{*}\) is a maximiser of (2.1), and if \(X^{d}_{0}\) is constant, \(V_{d}(0,X_{0}^{d})\) is the optimal value in (2.1).
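The backward recursion for the value function can be made concrete in a toy setting where the conditional expectation is computable exactly: a finite-state Markov chain, for which \(\mathbb{E}[V_{d}(t+1,X_{t+1}^{d})\,|\,X_{t}^{d}=i]\) is a matrix-vector product. The following sketch is purely illustrative (the chain, payoff and horizon below are not from the text); it implements \(V(T,\cdot )=g(T,\cdot )\) and \(V(t,\cdot )=\max (g(t,\cdot ),PV(t+1,\cdot ))\).

```python
import numpy as np

def value_function(g, P, T):
    """Backward recursion V(T,.) = g(T,.) and
    V(t,.) = max(g(t,.), P @ V(t+1,.)) for t = T-1,...,0,
    for a finite-state Markov chain with one-step transition matrix P."""
    V = [None] * (T + 1)
    V[T] = g(T)
    for t in range(T - 1, -1, -1):
        continuation = P @ V[t + 1]        # E[V(t+1, X_{t+1}) | X_t = i]
        V[t] = np.maximum(g(t), continuation)
    return V

# Illustrative 3-state chain with a time-independent payoff g(t, i) = payoff[i].
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
payoff = np.array([0.0, 1.0, 2.0])
V = value_function(lambda t: payoff, P, T=2)
```

By construction, \(V(t,\cdot )\) dominates the payoff at every time, which is the discrete analogue of the Snell envelope property discussed above.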
The idea of our approach is as follows: In many situations, the function \(g_{d}\) is in fact a neural network or can be approximated well by a deep neural network. Then the recursion (2.2) also yields a recursion for a neural network approximation. This argument will be made precise in the proof of Theorem 3.6 below.
Remark 2.1
Alternatively, we could also define
Then under the strong Markov property, it holds that \(\mathbb{E}[g_{d}(\tau ,X_{\tau}^{d}) | \mathcal {F}_{t}] = h_{\tau}^{d}(X_{t}^{d})\) ℙ-a.s., for each \(\tau \in \mathcal{T}_{t}\), where \(h_{\tau}^{d}(x)=\mathbb{E}[g_{d}(\tau ,X_{\tau}^{d}) |X_{t}^{d}=x]\). The definition of the essential supremum then implies that for each \(\tau \in \mathcal{T}_{t}\), it holds ℙ-a.s. that \(h_{\tau _{ \min}^{(t)}}^{d}(X_{t}^{d}) \geq h_{\tau}^{d}(X_{t}^{d})\). This implies for \(\mathbb{P}\circ (X_{t}^{d})^{-1}\)-a.e. \(x \in \mathbb{R}^{d}\) and all \(\tau \in \mathcal{T}_{t}\) that \(h_{\tau _{\min}^{(t)}}^{d}(x) \geq h_{\tau}^{d}(x)\), hence \(h_{\tau _{\min}^{(t)}}^{d}(x) \geq \sup _{\tau \in \mathcal{T}_{t}} h_{ \tau}^{d}(x)\) for each such \(x\). Combining this and [20, Theorem 6.18] yields that ℙ-a.s.,
By the definition of the Snell envelope, this then yields the recursion (2.2) for the value function.
Remark 2.2
The conditional expectation in (2.3) is defined in terms of the transition kernels \(\mu ^{d}_{s,t}\), \(0\leq s < t \leq T\), of the Markov process \(X^{d}\). In fact, we formally start with transition kernels \(\mu ^{d}\) on \(\mathbb{R}^{d}\) from which we then construct a family of probability measures \(\mathbb{P}_{x}\) on the canonical path space \(((\mathbb{R}^{d})^{T+1},\mathcal{B}((\mathbb{R}^{d})^{T+1}))\) such that under \(\mathbb{P}_{x}\), the coordinate process is a Markov process starting at \(x\) and with transition kernels \(\mu ^{d}\).
2.2 Deep neural networks
In this article, we consider neural networks with the ReLU (rectified linear unit) activation function \(\varrho \colon \mathbb{R}\to \mathbb{R}\) given by \(\varrho (x)=\max (x,0)\) for \(x \in \mathbb{R}\). For each \(d \in \mathbb{N}\), we also denote by \(\varrho \colon \mathbb{R}^{d} \to \mathbb{R}^{d} \) the componentwise application of the ReLU activation function. Let \(L,d \in \mathbb{N}\), \(N_{0}:=d\), \(N_{1},\ldots ,N_{L} \in \mathbb{N}\) and \(A^{\ell} \in \mathbb{R}^{N_{\ell }\times N_{\ell -1}}\), \(b^{\ell }\in \mathbb{R}^{N_{\ell}}\) for \(\ell = 1,\ldots ,L\). A deep neural network with \(L\) layers, \(d\)-dimensional input, activation function \(\varrho \) and parameters \(((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\) is the function \(\phi \colon \mathbb{R}^{d} \to \mathbb{R}^{N_{L}}\) given by
\[ \phi (x) = \mathcal{W}_{L}\big( \varrho ( \mathcal{W}_{L-1}( \cdots \varrho (\mathcal{W}_{1}(x)) \cdots ) ) \big), \qquad x \in \mathbb{R}^{d}, \qquad (2.4) \]
where \(\mathcal{W}_{\ell }\colon \mathbb{R}^{N_{\ell -1}} \to \mathbb{R}^{N_{\ell}}\) denotes the (affine) function \(\mathcal{W}_{\ell}(z) = A^{\ell }z + b^{\ell}\) for \(z \in \mathbb{R}^{N_{\ell -1}}\) and \(\ell = 1,\ldots ,L\). We let
\[ \mathrm{size}(\phi ) = \sum _{\ell =1}^{L} \big( \#\{(i,j) \colon A^{\ell}_{i,j} \neq 0\} + \#\{i \colon b^{\ell}_{i} \neq 0\} \big) \]
denote the total number of non-zero entries in the parameter matrices and vectors of the neural network.
In most cases, the number of layers, the activation function and the parameters of the network are not mentioned explicitly and we simply speak of a deep neural network \(\phi \colon \mathbb{R}^{d} \to \mathbb{R}^{N_{L}}\). We say that a function \(f \colon \mathbb{R}^{d} \to \mathbb{R}^{m}\) can be realised by a deep neural network if there exists a deep neural network \(\phi \colon \mathbb{R}^{d} \to \mathbb{R}^{m}\) such that \(f(x) = \phi (x)\) for all \(x \in \mathbb{R}^{d}\). In the literature, a deep neural network is often defined as the collection of parameters \(\Phi = ((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\), and \(\phi \) in (2.4) is then called the realisation of \(\Phi \); see for instance Petersen and Voigtlaender [40], Opschoor et al. [39], Gonon and Schwab [25]. In order to simplify the notation, we do not distinguish between the neural network realisation and its parameters here, since the parameters are always (at least implicitly) part of the definition. Note that in general, a function \(f\) may admit several different realisations by deep neural networks, i.e., several different choices of parameters may result in the same realisation. However, in the present article, this is not an issue because pathological cases are excluded by bounds on the size of the networks.
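The distinction between a network's parameters \(\Phi = ((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\), its realisation \(\phi \), and its size can be sketched in a few lines. The particular two-layer network below (realising the identity on ℝ via \(x = \varrho (x) - \varrho (-x)\)) is an illustration chosen for this sketch, not an object from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def realise(params, x):
    """Realisation phi(x) = W_L( rho( W_{L-1}( ... rho(W_1(x)) ... ) ) )
    for params = [(A_1, b_1), ..., (A_L, b_L)], with affine maps
    W_l(z) = A_l @ z + b_l and ReLU applied after all but the last layer."""
    *hidden, (A_L, b_L) = params
    z = np.asarray(x, dtype=float)
    for A, b in hidden:
        z = relu(A @ z + b)
    return A_L @ z + b_L

def size(params):
    """Total number of non-zero entries in all parameter matrices/vectors."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in params)

# A 2-layer ReLU network realising f(x) = relu(x) - relu(-x) = x:
params = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
          (np.array([[1.0, -1.0]]), np.zeros(1))]
```

Note that the same realisation (here, the identity) could be obtained from many other parameter choices; the size functional counts non-zero parameters of the particular choice, which is why bounds on the size refer to a specific parametrisation.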
3 DNN approximations for optimal stopping problems
This section contains the main results of the article, the deep neural network approximation results for optimal stopping problems. We start by formulating in Assumption 3.1 a general Markovian framework. In Assumption 3.4, we introduce the hypotheses on the reward function. We then formulate in Theorem 3.6 the approximation result for this basic framework. Subsequently, we provide a more refined framework, see Assumption 3.9 below, and prove the main result of the article in Theorem 3.12. This proves that the value function can be approximated by deep neural networks without the curse of dimensionality. Corollary 3.13 shows that an analogous approximation result also holds for the continuation value. Subsequently, in Sects. 3.4–3.6, we specialise the result to exponential Lévy models and discrete diffusion processes and show that also barrier options can be covered by including the running maximum or minimum.
3.1 Basic framework
Let \(p \geq 0\) be a fixed rate of growth. For instance, in financial applications, typically \(p=1\). We start by formulating in Assumption 3.1 a collection of hypotheses on the Markov process \(X^{d}\). These hypotheses will be weakened later in Assumption 3.9.
Assumption 3.1
(i) For each \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), there exist a measurable function \(f^{d}_{t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) and a random vector \(Y^{d}_{t}\) such that
\[ X_{t+1}^{d} = f^{d}_{t}(X_{t}^{d},Y_{t}^{d}) \qquad \mathbb{P}\text{-a.s.} \qquad (3.1) \]
(ii) For each \(d \in \mathbb{N}\), the random vectors \(X_{0}^{d},Y^{d}_{0},\ldots ,Y^{d}_{T-1}\) are independent and \(\mathbb{E}[|X_{0}^{d}|]<\infty \).
(iii) There exist constants \(c>0\), \(q\geq 0\), \(\alpha \geq 0\) such that for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), there exists a neural network \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) with
(iv) There exist constants \(c>0\), \(q\geq 0\) such that for all \(d \in \mathbb{N}\) and for all \(t \in \{0,\ldots ,T-1\}\), we have \(|f_{t}^{d}(0,0)|\leq c d^{q}\) and \(\mathbb{E}[|Y_{t}^{d}|^{2\max{(2,p)}}] \leq c d^{q}\).
Assumption 3.1 (i) requires a recursive updating of the Markov process \(X^{d}\) according to update functions \(f^{d}_{t}\) and a noise process \(Y^{d}\). Assumption 3.1 (ii) asks that the noise random variables and the initial condition are independent. Assumption 3.1 (iii) imposes that the updating functions \(f^{d}_{t}\) can be approximated well by deep neural networks. Finally, Assumption 3.1 (iv) requires that certain moments of the noise random variables and the “constant parts” of the update functions exhibit at most polynomial growth.
Remark 3.2
In Assumption 3.1 (iii), (iv), we could also put different constants \(c\) and \(q\) in each of the hypotheses. But then Assumption 3.1 (iii), (iv) still hold with \(c\) and \(q\) chosen as the respective maximum, and so for notational simplicity, we choose to directly work with the same constants for all these hypotheses.
Remark 3.3
For \(s\geq t\), consider a function \(\bar{g}_{d,s} \colon \mathbb{R}^{d} \to \mathbb{R}\) with \(\mathbb{E}[|\bar{g}_{d,s}(X_{s}^{d})|]< \infty \). Then under Assumption 3.1 (i), (ii),
for \(\mathbb{P}\circ (X_{t}^{d})^{-1}\)-a.e. \(x \in \mathbb{R}^{d}\). But the right-hand side of (3.5) is defined for any \(x \in \mathbb{R}^{d}\) for which the expectation is finite, and so in what follows, we also consider the conditional expectation \(\mathbb{E}[\bar{g}_{d,s}(X_{s}^{d})|X_{t}^{d}=x]\) to be defined for all such \(x \in \mathbb{R}^{d}\) (by (3.5)). Note that also
and so by backward induction, this reasoning allows us to define, within our framework, the value function \(V_{d}(t,\,\cdot \,)\) on all of \(\mathbb{R}^{d}\) for each \(t\).
Next, we formulate a collection of hypotheses on the reward (or payoff) function \(g_{d}\).
Assumption 3.4
There exist constants \(c>0\), \(q\geq 0\), \(\alpha \geq 0\) such that
(i) for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), there exists a (deep) neural network \(\phi _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \to \mathbb{R}\) with
(ii) for all \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), it holds that \(|g_{d}(t,0)|\leq c d^{q}\).
Assumption 3.4 (i) means that \(g_{d}(t,\,\cdot \,) \colon \mathbb{R}^{d} \to \mathbb{R}\) can be approximated well by neural networks for any \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\). Assumption 3.4 (ii) imposes that the “constant part” of the payoff grows at most polynomially in \(d\). Lemma 4.7 below shows that the framework indeed ensures that \(\mathbb{E}[|g_{d}(t,X_{t}^{d})|]< \infty \), as required in Sect. 2.1.
Example 3.5
Assumption 3.4 is satisfied in typical applications from mathematical finance. For instance, fix a strike price \(K>0\), an interest rate \(r\geq 0\) and consider the payoff of a max-call option, \(g_{d}(t,x) = e^{-r t}(\max _{i=1,\ldots ,d} x_{i} - K)^{+}\). Then (see e.g. Grohs et al. [28, Lemma 4.12]) for each \(t\), the payoff \(g_{d}(t,\,\cdot \,)\) can be realised exactly by a neural network \(\phi _{d,t}\) with \(\mathrm{size}(\phi _{d,t}) \leq 6d^{3}\). In addition, \(\mathrm{Lip}(g_{d}(t,\,\cdot \,))=1\) and therefore, setting \(\phi _{\varepsilon ,d,t} = \phi _{d,t}\) for all \(\varepsilon \in (0,1]\), we get that Assumption 3.4 is satisfied with \(c=6\), \(\alpha =0\), \(q=3\). Further examples include basket call options, basket put options, call-on-min options and, by similar techniques, also put-on-min options, put-on-max options and many related payoffs.
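The exact-realisation claim for the max-call payoff rests on the ReLU identity \(\max (a,b) = b + \varrho (a-b)\), applied in a pairwise reduction over the \(d\) coordinates. The following numerical sketch checks this construction (the values of \(d\), \(K\) and \(r\) below are illustrative).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_max(v):
    """Compute max(v_1, ..., v_d) using only affine operations and ReLU,
    via pairwise reductions max(a, b) = b + relu(a - b)."""
    v = list(v)
    while len(v) > 1:
        paired = [v[i + 1] + relu(v[i] - v[i + 1])
                  for i in range(0, len(v) - 1, 2)]
        if len(v) % 2 == 1:          # odd length: carry the last entry over
            paired.append(v[-1])
        v = paired
    return v[0]

def max_call(t, x, K=100.0, r=0.05):
    """Discounted max-call payoff e^{-rt} (max_i x_i - K)^+ built from ReLUs."""
    return np.exp(-r * t) * relu(relu_max(x) - K)

x = [90.0, 120.0, 105.0]
```

Since each pairwise maximum halves the number of entries, the reduction uses \(O(\log d)\) ReLU layers and \(O(d)\) units per layer, consistent with a network size polynomial in \(d\).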
We now state the main deep neural network approximation result under the assumptions introduced above.
Theorem 3.6
Suppose Assumptions 3.1 and 3.4 hold. Let \(c >0\), \(q \geq 0\) and assume for all \(d \in \mathbb{N}\) that \(\rho ^{d}\) is a probability measure on \(\mathbb{R}^{d}\) with \(\int _{\mathbb{R}^{d}} |x|^{2\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\). Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), the number of neural network weights grows at most polynomially and the approximation error between the neural network \(\psi _{\varepsilon ,d,t}\) and the value function is at most \(\varepsilon \), that is, \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) and
The proof of Theorem 3.6 is given in Sect. 4.4 below. For the reader’s convenience, we also provide a sketch of the proof in Sect. 3.2 below.
Theorem 3.6 shows that under Assumptions 3.1 and 3.4, the value function \(V_{d}\) can be approximated by deep neural networks without the curse of dimensionality: An approximation error of at most \(\varepsilon \) can be achieved by a deep neural network whose size is at most polynomial in \(\varepsilon ^{-1}\) and \(d\). The approximation error in Theorem 3.6 is measured in the \(L^{2}(\rho ^{d})\)-norm.
Remark 3.7
Theorem 3.6 holds for any fixed time horizon \(T\in \mathbb{N}\). The constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) are explicitly chosen in the proof of Theorem 3.6, and so also their dependence on \(T\) can be examined. Due to the recursive procedure used in the proof, the constants \(\kappa \), \(\mathfrak{q}\), \(\mathfrak{r}\) chosen in the proof of Theorem 3.6 in general tend to \(\infty \) as \(T \to \infty \). In special situations, this may be partially avoided (e.g. if \(q=0\)); see also Remark 3.11 below. However, these observations indicate that in order to address deep neural network expressivity for continuous-time optimal stopping problems, one cannot directly combine the recursive procedure employed here with a limiting argument letting \(T\to \infty \).
Theorem 3.6 can also be used to deduce further properties of \(V_{d}\). In the basic framework, we obtain for instance the following corollary, which shows that under Assumptions 3.1 and 3.4, the value function satisfies for each \(t\) a certain average Lipschitz property with a constant growing at most polynomially in \(d\).
Corollary 3.8
Suppose Assumptions 3.1 and 3.4 are satisfied. Let \(\nu _{0}^{d}\) be the standard Gaussian measure on \(\mathbb{R}^{d}\). Then for any \(R>0\), there exist constants \(\kappa ,\mathfrak{q} \in [0,\infty )\) such that for any \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\) and \(h \in [-R,R]^{d}\), the value function satisfies
The proof of Corollary 3.8 is given at the end of Sect. 4.4.
3.2 Sketch of the proof of Theorem 3.6
In this section, we provide a brief sketch of the proof of Theorem 3.6. The proof proceeds by backward induction. This entails some subtleties regarding the probability measure \(\rho ^{d}\), which we do not discuss here. We refer to the proof below (see Sect. 4.4) for details. Here we rather provide an easy-to-follow overview.
The starting point is the backward recursion (2.2). Our goal is to provide a neural network approximation of the right-hand side in (2.2). At time \(t\), we first aim to derive a bound on the \(L^{2}(\rho ^{d})\)-approximation error \(E^{d}_{t}\) between the continuation value \(\mathbb{E}[V_{d}(t+1,X_{t+1}^{d})|X_{t}^{d}=x]\) and the random function \(\Gamma _{\varepsilon ,d,t}\) defined for \(x \in \mathbb{R}^{d}\) by \(\Gamma _{\varepsilon ,d,t}(x) = \frac{1}{N}\sum _{i=1}^{N} \hat{v}_{ \varepsilon ,d,t+1}(\eta _{\varepsilon ,d,t}(x,Y^{d,i}_{t}))\), where \(\hat{v}_{\varepsilon ,d,t+1}\) is a neural network approximating the value function at time \(t+1\) and \(Y^{d,1}_{t}, \ldots , Y^{d,N}_{t}\) are i.i.d. copies of \(Y^{d}_{t}\). The existence and suitable properties of \(\hat{v}_{\varepsilon ,d,t+1}\) follow from the induction hypothesis. We derive a bound on \(\mathbb{E}[E^{d}_{t}]\) which we can then use to obtain existence of a realisation \(\gamma _{\varepsilon ,d,t}\) of \(\Gamma _{\varepsilon ,d,t}\) satisfying a slightly worse bound and such that the realisation of \(\max _{i=1,\ldots ,N} |Y^{d,i}_{t}|\) can also be bounded suitably. This last point is necessary to control the growth of \(\gamma _{\varepsilon ,d,t}(x)\). Then \(\gamma _{\varepsilon ,d,t}(x)\) is an approximation of the continuation value, and so we naturally define the approximate value function at time \(t\) by
for a suitably chosen \(\delta \) (depending on \(\varepsilon \)). We then consider the continuation region
and the approximate continuation region
Then we may decompose
The \(L^{2}(\rho ^{d})\)-error of the last term has already been analysed, and it remains to treat the remaining terms. The first term is small due to Assumption 3.4. The second and third terms need not be small individually, but we are able to show that the probabilities \(\rho ^{d}(C_{t} \cap \hat{C}_{t}^{c})\) and \(\rho ^{d}(C_{t}^{c} \cap \hat{C}_{t})\) are small. Hence the overall \(L^{2}(\rho ^{d})\)-error can be controlled. The proof is then completed by showing that the neural network (3.10) satisfies the growth, size and Lipschitz properties required to carry out the induction argument.
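The Monte Carlo construction of \(\Gamma _{\varepsilon ,d,t}\) in the sketch above can be written down directly. In the following sketch, the update function, the noise distribution and the time-\((t+1)\) approximation \(\hat{v}\) are placeholders chosen so that the exact continuation value is known in closed form; they stand in for the objects \(\eta _{\varepsilon ,d,t}\), \(Y^{d}_{t}\) and \(\hat{v}_{\varepsilon ,d,t+1}\) of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma(x, v_hat, eta, noise_samples):
    """Monte Carlo approximation Gamma(x) = (1/N) sum_i v_hat(eta(x, Y^i))
    of the continuation value E[ v_hat(X_{t+1}) | X_t = x ], where the
    Markov update is X_{t+1} = eta(X_t, Y_t); vectorised over the samples."""
    return np.mean(v_hat(eta(x, noise_samples)))

# Placeholder dynamics X_{t+1} = x + Y with Y ~ N(0,1) and v_hat(x) = x^2,
# so the exact continuation value at x is x^2 + 1 and the Monte Carlo
# error can be checked directly.
eta = lambda x, y: x + y
v_hat = lambda x: x ** 2
approx = gamma(1.0, v_hat, eta, rng.standard_normal(200_000))
```

In the proof, \(\Gamma _{\varepsilon ,d,t}\) is a random function of the sample \(Y^{d,1}_{t},\ldots ,Y^{d,N}_{t}\); the argument then fixes a realisation \(\gamma _{\varepsilon ,d,t}\) for which both the error bound and a growth bound hold simultaneously.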
3.3 Refined framework
We now introduce a refined framework in which the approximation hypothesis (3.2) and the Lipschitz condition (3.4) in Assumption 3.1 (iii) are weakened; see (3.11) and (3.12) below. Due to these weaker hypotheses, we need to introduce potentially stronger moment assumptions on the noise variables \(Y^{d}_{t}\). Note that the additional growth conditions (3.13) and (3.14) are satisfied automatically under Assumption 3.1 (see Lemma 4.5 and Remark 3.10 below).
Assumption 3.9
Assume that (i), (ii) and (iv) in Assumption 3.1 are satisfied. Furthermore, assume that there exist constants \(c>0\), \(h >0\), \(q,\bar{q} \geq 0\), \(\alpha \geq 0\), \(\beta >0\), \(m \in \mathbb{N}\), \(\theta \geq 0\) and \(\zeta \geq 0\) such that for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), there exists a neural network \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) with
and for all \(x,y \in \mathbb{R}^{d}\),
and \(\mathbb{E}[|Y_{t}^{d}|^{2m\max{(2,p)}}] \leq c d^{q}\).
Remark 3.10
A sufficient condition for (3.13) is that there exist \(\tilde{c}>0\) and \(\tilde{q}\geq 0\) such that \(\mathbb{E}[|Y_{t}^{d}|^{2m\max{(2,p)}}] \leq \tilde{c} d^{\tilde{q}}\) and \(|f_{t}^{d}(x,y)| \leq \tilde{c} d^{\tilde{q}} (1+|x|+|y|)\) for all \(d\in \mathbb{N}\), \(x,y \in \mathbb{R}^{d}\). Then
Remark 3.11
While in many relevant applications, the number \(T\) of time steps is only moderate (e.g. around 10 in Becker et al. [6, Sects. 4.1 and 4.2]), it is also important to analyse the situation when \(T\) is large. To this end, we have introduced in Assumption 3.9 the constants \(h\) and \(\bar{q}\) instead of using the common upper bounds \(c\), \(q\). This makes it possible to get first insights about the situation in which \(T\) is large from the proofs in Sect. 4. Indeed, if \(h=1+\tilde{h}\) and \(\tilde{h}\) is sufficiently small (as is the case for instance in certain discretised diffusion models), then the constants in Lemmas 4.6 and 4.8 are also small for large \(T\).
Examples of processes that satisfy Assumption 3.9 are provided further below. These include in particular the Black–Scholes model, more general exponential Lévy processes and discrete diffusions.
We now state the main theorem of the article.
Theorem 3.12
Suppose Assumptions 3.9 and 3.4 hold. Let \(c >0\), \(q \geq 0\) and for each \(d \in \mathbb{N}\), let \(\rho ^{d}\) be a probability measure on \(\mathbb{R}^{d}\) with \(\int _{\mathbb{R}^{d}} |x|^{2m\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\). Furthermore, assume that \(\zeta < \frac{\min (1,\beta m-\theta )}{T-1}\), where \(m\), \(\beta \), \(\zeta \), \(\theta \) are the constants appearing in Assumption 3.9. Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), the number of neural network weights grows at most polynomially and the approximation error between the neural network \(\psi _{\varepsilon ,d,t}\) and the value function is at most \(\varepsilon \), that is, \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) and
The proof of Theorem 3.12 is given in Sect. 4.5 below.
Theorem 3.12 shows that for Markov processes satisfying Assumption 3.9 and for reward functions satisfying Assumption 3.4, the value function of the associated optimal stopping problem can be approximated by deep neural networks without the curse of dimensionality. In other words, an approximation error of at most \(\varepsilon \) can be achieved by a deep neural network whose size is at most polynomial in \(\varepsilon ^{-1}\) and \(d\).
The condition \(\zeta < \frac{\min (1,\beta m-\theta )}{T-1}\) in Theorem 3.12 can be viewed as a condition on \(m\), which needs to be sufficiently large. This means that sufficiently high moments of \(Y^{d}_{t}\) need to exist and grow only polynomially in \(d\).
A key step in the proof consists in constructing a deep neural network approximating the continuation value. Hence we immediately obtain the following corollary.
Corollary 3.13
Consider the setting of Theorem 3.12. Then for each \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), there exists a neural network \(\gamma _{\varepsilon ,d,t}\) with \(\mathrm{size}(\gamma _{\varepsilon ,d,t}) \leq \kappa d^{ \mathfrak{q}}\varepsilon ^{-\mathfrak{r}}\) and
Finally, using Theorem 3.12, one may also obtain a lower bound on the probability that the approximate stopping time coincides with the optimal stopping time. Denote by \(\bar{\rho}_{t}^{d}\) the law of \(X^{d}_{t}\) and suppose that \(\int _{\mathbb{R}^{d}} |x|^{2m\max (p,2)} \bar{\rho}_{t}^{d}(dx) \leq c d^{q}\) for each \(t\). For a given probability measure \(\rho ^{d}\) with respect to which we aim to measure the error (3.15), we now define \(\bar{\rho} = \frac{1}{T+1}(\rho ^{d}+\sum _{t=0}^{T-1} \bar{\rho}_{t}^{d})\). Applying Theorem 3.12 to \(\bar{\rho}\), we obtain that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), there exists a neural network \(\psi _{\varepsilon ,d,t}\) satisfying \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) and
Now let \(\bar{\varepsilon}>0\), \(\delta >0\) be chosen as in the proof of Theorem 3.12 below and consider the optimal stopping time \(\tau ^{*}_{d}\) and its neural network approximation \(\hat{\tau}_{\varepsilon ,d}\) given by
Then we may estimate
where \(C_{t}\) is the continuation region and \(\hat{C}_{t}\) the approximate continuation region (see (4.43) and (4.44) in the proof below). A key step in the proof of Theorem 3.12 consists in deriving upper bounds on these probabilities, and recalling that \(\bar{\rho}\) is the probability measure to which we are applying Theorem 3.12, the bounds in the proof of Theorem 3.12 thus yield
In particular, choosing \(\varepsilon = \tilde{\varepsilon}(T+1)^{-1}\) for sufficiently small \(\tilde{\varepsilon}>0\), we thus obtain neural networks for which the approximation error (3.16) for the value function is smaller than \(\tilde{\varepsilon}\) and, simultaneously, the approximate stopping time equals the optimal stopping time with probability at least \(1-\tilde{\varepsilon}\).
3.4 Exponential Lévy models
In this subsection, we apply Theorem 3.12 to exponential Lévy models. Recall that an \(\mathbb{R}^{d}\)-valued stochastic process \(L^{d} = (L^{d}_{t})_{t \geq 0}\) is called a (\(d\)-dimensional) Lévy process if it is stochastically continuous, its sample paths are almost surely right-continuous with left limits, it has stationary and independent increments and \(\mathbb{P}[L^{d}_{0}=0]=1\). A Lévy process \(L^{d}\) is fully characterised by its Lévy triplet \((A^{d},\gamma ^{d}, \nu ^{d})\), where the matrix \(A^{d} \in \mathbb{R}^{d\times d}\) is symmetric and nonnegative definite, \(\gamma ^{d} \in \mathbb{R}^{d}\) and \(\nu ^{d}\) is a measure on \(\mathbb{R}^{d}\) describing the jump structure of \(L^{d}\). A stochastic process \(X^{d}\) is said to follow an exponential Lévy model if
for a \(d\)-dimensional Lévy process \(L^{d} = (L^{d}_{t})_{t \geq 0}\) and \(x^{d} \in \mathbb{R}^{d}\).
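To make the dynamics concrete, an exponential Lévy path can be sampled directly on the exercise grid. The following sketch restricts to the Black–Scholes special case \(\nu^{d}=0\), in which the increments of \(L^{d}\) are Gaussian with covariance determined by \(A^{d}\) and drift \(\gamma^{d}\); the function name and arguments are illustrative, not from the article.

```python
import numpy as np

def simulate_exp_levy_bs(x0, A, gamma, times, rng):
    """One path of X^d_t = x^d * exp(L^d_t) on a time grid, in the
    Black-Scholes special case nu^d = 0: L^d is a Brownian motion
    with covariance matrix A and drift gamma."""
    d = len(x0)
    L_chol = np.linalg.cholesky(A)              # A = L_chol @ L_chol.T
    X = [np.asarray(x0, dtype=float)]
    for dt in np.diff(times):
        # stationary, independent Gaussian increment of L^d
        dL = gamma * dt + np.sqrt(dt) * L_chol @ rng.standard_normal(d)
        X.append(X[-1] * np.exp(dL))            # componentwise, as in (3.17)
    return np.array(X)

rng = np.random.default_rng(0)
path = simulate_exp_levy_bs(np.ones(3), np.eye(3), np.zeros(3),
                            [0.0, 0.5, 1.0], rng)
```

The componentwise multiplicative update is exactly the structure \(X_{t+1,i}^{d} = X_{t,i}^{d}\exp (L_{t+1,i}^{d}-L_{t,i}^{d})\) exploited in Lemma 4.2.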
From Theorem 3.12, we now obtain the following deep neural network approximation result. This result includes the case of a Black–Scholes model (\(\nu ^{d}=0\)) as well as pure jump models (\(A^{d}_{i,j} = 0\)) with sufficiently integrable tails. In particular, Corollary 3.14 applies to prices of American/Bermudan basket put options and put-on-min or put-on-max options in such models (cf. Example 3.5 for payoffs that satisfy Assumption 3.4).
Corollary 3.14
Let \(X^{d}\) follow an exponential Lévy model with underlying triplet \((A^{d},\gamma ^{d},\nu ^{d})\) and assume the triplets are bounded in the dimension, that is, there exists \(B > 0\) such that for any \(d \in \mathbb{N}\), \(i,j=1,\ldots ,d\),
Suppose the payoff functions \(g_{d}\) satisfy Assumption 3.4. Let \(c >0\), \(q \geq 0\) and for each \(d \in \mathbb{N}\), let \(\rho ^{d}\) be a probability measure on \(\mathbb{R}^{d}\) with
Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), we have
Proof
This follows directly from Theorem 3.12 and Lemma 4.2 with the choices \(\zeta = \theta = \beta = \frac{1}{T}\), \(m=T+1\), which ensures that \(\zeta < \frac{1}{T-1} = \frac{\min (1,\beta m-\theta )}{T-1}\). □
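For completeness, the parameter choice in the proof can be checked directly:

```latex
% With \zeta=\theta=\beta=\tfrac{1}{T} and m=T+1,
\beta m-\theta \;=\; \frac{T+1}{T}-\frac{1}{T} \;=\; 1,
\qquad\text{so}\qquad
\frac{\min (1,\beta m-\theta )}{T-1} \;=\; \frac{1}{T-1} \;>\; \frac{1}{T} \;=\; \zeta .
```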
3.5 Discrete diffusion models
Let \(\mathfrak{t}_{T}>0\) and let \(X^{d}\) follow a discrete diffusion model with coefficient functions \(\mu ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d}\), \(\sigma ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d \times d}\), i.e., \(X^{d}\) satisfies \(X_{0}^{d}=x^{d}\) and
for some \(0 \leq \mathfrak{t}_{0}<\mathfrak{t}_{1}<\cdots <\mathfrak{t}_{T}\), \(x^{d} \in \mathbb{R}^{d}\) and a \(d\)-dimensional Brownian motion \(W^{d}\). Consider the following assumption on the drift and diffusion coefficients:
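The discrete diffusion recursion can be sketched as follows, with hypothetical drift and diffusion functions `mu` and `sigma` standing in for \(\mu^{d}\) and \(\sigma^{d}\):

```python
import numpy as np

def simulate_discrete_diffusion(x0, mu, sigma, times, rng):
    """One path of the recursion
    X_{t+1} = X_t + mu(t_t, X_t)(t_{t+1} - t_t) + sigma(t_t, X_t) dW,
    where dW = W_{t_{t+1}} - W_{t_t} ~ N(0, (t_{t+1} - t_t) I_d)."""
    d = len(x0)
    X = [np.asarray(x0, dtype=float)]
    for t in range(len(times) - 1):
        dt = times[t + 1] - times[t]
        dW = np.sqrt(dt) * rng.standard_normal(d)  # Y_t^d in (3.1)
        x = X[-1]
        X.append(x + mu(times[t], x) * dt + sigma(times[t], x) @ dW)
    return np.array(X)

rng = np.random.default_rng(1)
mu = lambda t, x: -0.5 * x                  # hypothetical drift
sigma = lambda t, x: 0.2 * np.eye(len(x))   # hypothetical diffusion
path = simulate_discrete_diffusion(np.ones(4), mu, sigma,
                                   [0.0, 0.25, 0.5, 1.0], rng)
```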
Assumption 3.15
Assume there exist constants \(C>0\), \(q,\tilde{\alpha},\tilde{\zeta} \geq 0\) and for each \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T-1\}\) and \(\varepsilon \in (0,1]\), there exist neural networks \(\mu _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \to \mathbb{R}^{d}\) and \(\sigma _{\varepsilon ,d,t,i} \colon \mathbb{R}^{d} \to \mathbb{R}^{d}\), \(i=1,\ldots ,d\), with the property that for all \(d \in \mathbb{N}\), \(\varepsilon \in (0,1]\), \(t \in \{0,\ldots ,T-1\}\) and \(x \in \mathbb{R}^{d}\), it holds that
Here we denote by \(\sigma _{\varepsilon ,d,t}(x) \in \mathbb{R}^{d \times d}\) the matrix with the \(i\)th row \(\sigma _{\varepsilon ,d,t,i}(x)\).
Corollary 3.16
Let \(X^{d}\) follow a discrete diffusion model with coefficients satisfying Assumption 3.15 with \(\tilde{\zeta}<\frac{1}{T-1}\). Suppose \(p \geq 2\) and the payoff functions \(g_{d}\) satisfy Assumption 3.4. Let \(c >0\), \(q \geq 0\) and assume for each \(d \in \mathbb{N}\) that \(\rho ^{d}\) is a probability measure on \(\mathbb{R}^{d}\) with \(\int _{\mathbb{R}^{d}} |x|^{2m\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\) for \(m=\lceil \frac{2(1+\tilde{\zeta})}{\frac{1}{T-1}-\tilde{\zeta}} +1 \rceil \). Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), we have
Proof
By Lemma 4.3 below, it follows that Assumption 3.9 is satisfied. In addition, the constant \(\beta >0 \) in Assumption 3.9 may be chosen arbitrarily and \(\zeta = \theta = \beta + \tilde{\zeta}\). Thus we may select \(\beta = \frac{1}{T-1}-\tilde{\zeta}-\delta \) for some \(\delta >0\) and then \(\beta >0\) and \(\zeta = \theta = \frac{1}{T-1}-\delta \). Choosing \(\delta = \frac{1}{2}(\frac{1}{T-1}-\tilde{\zeta})\), \(m=\lceil \frac{1+\tilde{\zeta}}{\beta} +1 \rceil \) then ensures that \(\zeta < \frac{1}{T-1} = \frac{\min (1,\beta m-\theta )}{T-1}\). Theorem 3.12 hence implies (3.20). □
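For completeness, the choices in the proof can be verified term by term:

```latex
% With \beta=\delta=\tfrac12\big(\tfrac{1}{T-1}-\tilde\zeta\big) and
% m=\big\lceil \tfrac{1+\tilde\zeta}{\beta}+1\big\rceil
%   =\big\lceil \tfrac{2(1+\tilde\zeta)}{\frac{1}{T-1}-\tilde\zeta}+1\big\rceil,
\beta m \;\geq\; \beta\Big(\frac{1+\tilde\zeta}{\beta}+1\Big) \;=\; 1+\tilde\zeta+\beta,
\qquad
\beta m-\theta \;\geq\; 1+\tilde\zeta+\beta-(\beta+\tilde\zeta) \;=\; 1,
% hence
\zeta \;=\; \frac{1}{T-1}-\delta \;<\; \frac{1}{T-1} \;=\; \frac{\min (1,\beta m-\theta )}{T-1}.
```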
Remark 3.17
Assumption 3.15 means that neural networks are able to approximate the drift and diffusion coefficients in (3.19) without the curse of dimensionality. In addition, Assumption 3.15 also requires that the Lipschitz constants of the approximating neural networks grow at most polynomially in \(d\) and \(\varepsilon ^{-1}\). The hypothesis \(\tilde{\zeta }<\frac{1}{T-1}\) in Corollary 3.16 imposes an upper bound on the rate of growth with respect to \(\varepsilon ^{-1}\) and thereby asks that the Lipschitz constants of the approximating neural networks only grow slowly with respect to \(\varepsilon ^{-1}\).
For example, in the special case when \(\mu ^{d}\) and \(\sigma ^{d}\) can be represented exactly by ReLU neural networks, we may choose \(\tilde{\alpha }=\tilde{\zeta }=0\) and Assumption 3.15 reduces to a hypothesis on the growth with respect to the dimension \(d\); see e.g. (3.18). More generally, there are several settings in which deep neural networks have been shown to exhibit approximation rates free from the curse of dimensionality; see for instance Barron [2], Shaham et al. [46] and the overview in Gühring et al. [30]. An approximation result which allows simultaneously bounding the approximation error and the Lipschitz constant of the approximating network has been proved in Gühring et al. [29]. Assumption 3.15 entails in addition growth hypotheses with respect to the dimension \(d\).
Remark 3.18
The approximation hypothesis on the drift and diffusion coefficients in Assumption 3.15, that is,
for all \(d \in \mathbb{N}\), \(\varepsilon \in (0,1]\), \(t \in \{0,\ldots ,T-1\}\) and \(x \in \mathbb{R}^{d}\), can be replaced by the weaker assumption that there exists \(\beta >0\) such that (3.21) holds for all \(d \in \mathbb{N}\), \(\varepsilon \in (0,1]\), \(t \in \{0,\ldots ,T-1\}\) and \(x \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\). Under this weaker assumption, the proof of Lemma 4.3 below shows that Assumption 3.9 is satisfied with the constant \(m \in \mathbb{N}\) in Assumption 3.9 chosen arbitrarily and with \(\zeta = \theta = \beta + \tilde{\zeta}\). Theorem 3.12 hence allows us to deduce a statement very similar to Corollary 3.16, where the only additional difference lies in the fact that the hypotheses \(\tilde{\zeta}<\frac{1}{T-1}\) and \(m=\lceil \frac{2(1+\tilde{\zeta})}{\frac{1}{T-1}-\tilde{\zeta}} +1 \rceil \) need to be replaced by the assumption that
3.6 Running minimum and maximum
In this subsection, we show that our framework can also cover barrier options. This follows from the next proposition, which proves that if a process satisfies Assumption 3.9, then the process augmented by its running maximum or minimum satisfies Assumption 3.9 as well.
Proposition 3.19
Suppose Assumption 3.9 holds. Let \(\bar{X}^{1} = X^{1}\) and for \(d \in \mathbb{N}\), \(d \geq 2\) and \(t \in \{0,\ldots ,T\}\), consider the \(\mathbb{R}^{d}\)-valued process \(\bar{X}_{t}^{d} = (X^{d-1}_{t},M^{d}_{t})\), where \(M^{d}\) is either the running minimum, \(M^{d}_{t} = \min _{i=1,\ldots ,d-1} \min _{s=0,\ldots ,t} X^{d-1}_{s,i}\), or the running maximum, \(M^{d}_{t} = \max _{i=1,\ldots ,d-1} \max _{s=0,\ldots ,t} X^{d-1}_{s,i}\). Then \(\bar{X}^{d}\), \(d \in \mathbb{N}\), satisfy Assumption 3.9.
The proof is given at the end of Sect. 4.2 below.
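As an illustration of the augmentation in Proposition 3.19 (not part of the proof), the running minimum of all components up to time \(t\) can be appended to a simulated path as follows:

```python
import numpy as np

def augment_running_min(path):
    """Given a path of shape (T+1, d-1) for X^{d-1}, return the
    augmented (T+1, d) path barX^d_t = (X^{d-1}_t, M^d_t), where
    M^d_t = min over all components and all times s <= t."""
    T1, _ = path.shape
    M = np.minimum.accumulate(path.min(axis=1))  # running min over s and i
    return np.hstack([path, M.reshape(T1, 1)])

path = np.array([[2.0, 3.0],
                 [1.0, 4.0],
                 [5.0, 0.5]])
aug = augment_running_min(path)   # last column: [2.0, 1.0, 0.5]
```

Replacing `np.minimum` by `np.maximum` and `min` by `max` gives the running-maximum case.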
4 Proofs
This section contains the remaining proofs of the results in Sect. 3. The section is split into several subsections. In Sect. 4.1, we provide a refined result on deep neural network approximations of the product function \(\mathbb{R}\times \mathbb{R}\to \mathbb{R}\), \((x,y) \mapsto xy\). Section 4.2 then contains two lemmas in which this approximation result is applied to verify that suitable exponential Lévy and discrete diffusion models satisfy Assumption 3.9. Subsequently, Sect. 4.3 contains auxiliary results needed for the proofs of Theorems 3.6 and 3.12. The proofs of these two results are then in Sects. 4.4 and 4.5.
4.1 Deep neural network approximation of the product
Based on Yarotsky [51, Proposition 3], we provide here a refined result regarding deep neural network approximations of the product function \(\mathbb{R}\times \mathbb{R}\to \mathbb{R}\), \((x,y) \mapsto xy\).
Lemma 4.1
There exists \(c>0\) such that for any \(\varepsilon \in (0,1]\) and \(M\geq 1\), there exists a neural network \({\mathfrak{n}}_{\varepsilon ,M} \colon \mathbb{R}\times \mathbb{R}\to \mathbb{R}\) with
\(\mathrm{size}({\mathfrak{n}}_{\varepsilon ,M}) \leq c(\log ( \varepsilon ^{-1})+\log M +1)\) and for all \(x,x',y,y' \in \mathbb{R}\),
Proof
By Grohs and Herrmann [27, Lemma 4.2] or Opschoor et al. [39, Proposition 4.1] (based on Yarotsky [51, Proposition 3]), there exists a constant \(c>0\) such that for any \({\bar{\varepsilon}} \in (0,\frac{1}{2})\), there exists a neural network \(\mathfrak{n}_{{\bar{\varepsilon}}} \colon \mathbb{R}\times \mathbb{R}\to \mathbb{R}\) satisfying \(\sup _{x,y \in [-1,1]} |\mathfrak{n}_{\bar{\varepsilon}}(x,y) - x y| \leq {\bar{\varepsilon}}\), \(\mathrm{size}(\mathfrak{n}_{\bar{\varepsilon}}) \leq c(\log ({ \bar{\varepsilon}}^{-1})+1)\) and
Consider now the capped neural network \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) = \mathfrak{n}_{ \bar{\varepsilon}}(\pi _{1}(x),\pi _{1}(y))\), where we set \(\pi _{1}(z) = \max (-1,\min (z,1))\). Define the cap function by \(\mathrm{cap}(x,y)=(\pi _{1}(x),\pi _{1}(y))\). Then \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) = \mathfrak{n}_{ \bar{\varepsilon}} \circ \mathrm{cap}(x,y)\) and it can be verified that \(\mathrm{cap}\) is again a neural network and for \(x,y \in [-1,1]\), we have \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) = \mathfrak{n}_{ \bar{\varepsilon}}(x,y) \). The fact that the composition of two ReLU neural networks can again be realised by a ReLU neural network with size bounded by twice the sum of the respective sizes (see e.g. Opschoor et al. [39, Proposition 2.2]) hence proves that there exists \(\tilde{c}\geq c\) such that for all \({\bar{\varepsilon}} \in (0,\frac{1}{2})\), we have \(\mathrm{size}(\bar{\mathfrak{n}}_{\bar{\varepsilon}}) \leq \tilde{c}( \log ({\bar{\varepsilon}}^{-1})+1)\). Moreover, \(\sup _{x,y \in [-1,1]} |\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) - x y| \leq {\bar{\varepsilon}}\) and for all \(x,x',y,y' \in \mathbb{R}\),
Now let \(\varepsilon \in (0,1]\) and \(M \geq 1\) be given, choose \(\bar{\varepsilon} = 3^{-1} M^{-2} \varepsilon \) and define the rescaled network \(\mathfrak{n}_{{{\varepsilon}},M}(x,y) = M^{2} \bar{\mathfrak{n}}_{ \bar{\varepsilon}}(\frac{x}{M},\frac{y}{M})\). Then
\(\mathrm{size}(\mathfrak{n}_{{{\varepsilon}},M}) \leq \tilde{c}(\log ({ \bar{\varepsilon}}^{-1})+1)\) and for all \(x,x',y,y' \in \mathbb{R}\),
□
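The two elementary constructions in the proof above, the cap \(\pi_{1}\) realised with ReLUs and the rescaling from \([-1,1]^{2}\) to \([-M,M]^{2}\), can be sketched numerically. Here `n_bar` is a stand-in for the approximate product network \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}\); for the sanity check we substitute the exact product.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pi1(z):
    """Cap pi_1(z) = max(-1, min(z, 1)), written with two ReLUs so
    that the cap is itself a small ReLU network."""
    return relu(z + 1.0) - relu(z - 1.0) - 1.0

def product_network(n_bar, M):
    """Rescaled network n_{eps,M}(x, y) = M^2 * n_bar(x/M, y/M), as in
    the proof of Lemma 4.1 (with the cap applied to the scaled inputs)."""
    return lambda x, y: M**2 * n_bar(pi1(x / M), pi1(y / M))

# sanity check with the exact product in place of n_bar
n = product_network(lambda x, y: x * y, M=10.0)
```

For \(|x|,|y| \leq M\) the cap is inactive and `n(x, y)` reduces to \(M^{2}\,\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x/M, y/M)\), matching the error bound \(M^{2}\bar{\varepsilon} = \varepsilon /3\) for the choice \(\bar{\varepsilon} = 3^{-1}M^{-2}\varepsilon \).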
4.2 Sufficient conditions
In this subsection, we prove Lemmas 4.2 and 4.3, which show that the exponential Lévy and discrete diffusion models considered above satisfy Assumption 3.9. We also provide a proof of Proposition 3.19.
Lemma 4.2
Let \(X^{d}\) follow an exponential Lévy model (cf. (3.17)) for each \(d \in \mathbb{N}\) and assume that the Lévy triplets \((A^{d},\gamma ^{d},\nu ^{d})\) are bounded in the dimension, that is, there exists \(B > 0\) such that for any \(d \in \mathbb{N}\), \(i,j=1,\ldots ,d\),
where \(\bar{p} = 2m\max{(2,p)}\). Then Assumption 3.9 is satisfied with the constant \(\beta >0 \) in Assumption 3.9 chosen arbitrarily and with \(\zeta = \theta = \beta \).
Proof
Firstly, for each \(t \in \{0,\ldots ,T-1\}\) and \(d \in \mathbb{N}\), the exponential Lévy structure (3.17) implies \(X_{t+1,i}^{d} = X_{t,i}^{d} \exp (L_{t+1,i}^{d}-L_{t,i}^{d})\) for \(i=1,\ldots , d\). Therefore \(X_{t+1}^{d} = f^{d}_{t}(X_{t}^{d},Y^{d}_{t})\) with \(f^{d}_{t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) given by \(f^{d}_{t}(x,y) = (x_{1} y_{1},\ldots ,x_{d} y_{d})\) for \(x,y \in \mathbb{R}^{d}\) and with \(Y_{t,i}^{d} = \exp (L_{t+1,i}^{d}-L_{t,i}^{d})\), i.e., (3.1) is satisfied. Since \(L^{d}\) has independent increments, it follows that Assumption 3.1 (ii) is satisfied. Next, we can employ an argument from the proof of Gonon and Schwab [25, Theorem 5.1] (which uses Sato [45, Theorem 25.17] and (4.2)) to obtain for any \(d \in \mathbb{N}\) and \(i=1,\ldots ,d\) that
Combined with Minkowski’s inequality and the stationary increments property of \(L^{d}\), this yields
Furthermore, \(f_{t}^{d}(0,0) = 0\) and hence Assumption 3.1 (iv) is satisfied. Moreover, for \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), let \(M=\varepsilon ^{-\beta}\) and let \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) be the \(d\)-fold parallelisation of \({\mathfrak{n}}_{\varepsilon ,M}\) from Lemma 4.1. Then for \(x,y \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\), we obtain
\(\mathrm{size}(\eta _{\varepsilon ,d,t}) \leq d \mathrm{size}({ \mathfrak{n}}_{\varepsilon ,M}) \leq cd((\beta +1)\log (\varepsilon ^{-1})+1) \leq c_{1} d \varepsilon ^{-1}\) with \(c_{1} = c(\beta +2)\) and for all \(x,x',y,y' \in \mathbb{R}^{d}\),
Finally, for all \(x,y \in \mathbb{R}^{d}\),
and Minkowski’s integral inequality and (4.3) imply
□
Lemma 4.3
Assume \(p \geq 2\), let \(X^{d}\) follow a discrete diffusion model with coefficients \(\mu ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d}\), \(\sigma ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d \times d}\) and suppose Assumption 3.15 holds. Then Assumption 3.9 is satisfied with the constants \(m \in \mathbb{N}\) and \(\beta >0 \) in Assumption 3.9 chosen arbitrarily and with \(\zeta = \theta = \beta + \tilde{\zeta}\).
Proof
First, (3.1) holds with \(f^{d}_{t}(x,y) = x + \mu ^{d}(\mathfrak{t}_{t},x) (\mathfrak{t}_{t+1}- \mathfrak{t}_{t}) + \sigma ^{d}(\mathfrak{t}_{t},x)y\) and \(Y_{t}^{d}= W^{d}_{\mathfrak{t}_{t+1}}-W^{d}_{\mathfrak{t}_{t}}\). Assumption 3.1 (ii) is satisfied by the independent increment property of Brownian motion. Furthermore, we obtain for each \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\) that \(|f_{t}^{d}(0,0)| = |\mu ^{d}(\mathfrak{t}_{t},0) (\mathfrak{t}_{t+1}- \mathfrak{t}_{t}) | \leq C \mathfrak{t}_{T} d^{q}\), and with \(\bar{p}= m\max{(2,p)}\), we have \(\mathbb{E}[|Y_{t}^{d}|^{2\bar{p}}] \leq (\mathfrak{t}_{T})^{\bar{p}} \mathbb{E}[|Z|^{2\bar{p}}]\), where \(Z\) is a standard normal random vector in \(\mathbb{R}^{d}\). The fact that \(|Z|^{2} \sim \chi ^{2}(d)\) and the upper and lower bounds for the Gamma function (see e.g. Gonon et al. [24, Lemma 2.4]) thus yield
with \(c_{\bar{p}} = \max _{n \in \mathbb{N}} (1+\frac{2\bar{p}}{n})^{\frac{n}{2}} < \infty \).
Next, for \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), let \(M=4 \max (C,1) d^{q+\frac{1}{2}} \varepsilon ^{-\beta}\) and consider \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) given by
for \(i=1,\ldots ,d\) with \({\mathfrak{n}}_{\varepsilon ,M}\) from Lemma 4.1. By using the operations of parallelisation and concatenation, we can realise \((x,y)\mapsto {\mathfrak{n}}_{\varepsilon ,M}(\sigma _{\varepsilon ,d,t,i,j}(x),y_{j})\) by a neural network of size \(\mathfrak{s}_{i,j} := 2(\mathrm{size}(\mathfrak{n}_{\varepsilon ,M})+2+ \mathrm{size}(\sigma _{\varepsilon ,d,t,i,j}))\); see e.g. Opschoor et al. [39, Propositions 2.2 and 2.3]. Recall that the identity on ℝ can be realised by a ReLU deep neural network of arbitrary depth \(\ell \) and size \(2 \ell \) (see Petersen and Voigtlaender [40, Remark 2.4], Opschoor et al. [39, Proposition 2.4]). Thus we may insert identity networks in (4.5) to ensure that all summands can be realised by networks of the same depth, which is at most \(m_{i} = \max (\mathrm{size}(\mu _{\varepsilon ,d,t,i}),\mathfrak{s}_{i,1}, \ldots ,\mathfrak{s}_{i,d})\) due to the fact that the depth of a network is bounded by its size. By applying the summing operation for neural networks of equal depth (see e.g. Gonon and Schwab [25, Lemma 3.2]), it follows that \(\eta _{{\varepsilon},d,t,i}\) can be realised by a deep neural network with
Next, we use Assumption 3.15 to estimate
and so it follows for \(x \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\) that \(|\sigma _{\varepsilon ,d,t}(x)|_{F} \leq 2 C d^{q}(1+\sqrt{d}{ \varepsilon ^{-\beta}}) \leq M\). Hence Assumption 3.15 and (4.1) imply for \(x,y \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\) that
Moreover, Assumption 3.15 and the Lipschitz property of \({\mathfrak{n}}_{\varepsilon ,M}\) yield that for all \(x,x',y,y' \in \mathbb{R}^{d}\),
In addition, for all \(x,y \in \mathbb{R}^{d}\),
so that (4.4) implies the polynomial growth bound (3.13). Finally, the estimate
combined with the Lipschitz, growth and approximation properties that we already established implies the polynomial bound (3.14) with \(\theta = \beta + \tilde{\zeta}\). Altogether, this proves that Assumption 3.9 is satisfied with the claimed choices of \(\zeta \) and \(\theta \). □
Proof of Proposition 3.19
Consider first the case of the running minimum. Any \(z \in \mathbb{R}^{d}\) can be partitioned as \(z = (z_{1:d-1},z_{d})\) into its first \(d-1\) components and its last component. Define the transition map \(\bar{f}^{d}_{t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) for the augmented process by
and \(\bar{Y}^{d}_{t} = (Y^{d-1}_{t},0)\). Then \(\bar{X}^{d}_{0} =(X^{d-1}_{0},\min _{i=1,\ldots ,d-1}X^{d-1}_{0,i}) \) and so the independence and moment conditions on \(\bar{Y}^{d}\) are satisfied and \(|\bar{f}_{t}^{d}(0,0)| \leq 2|f^{d-1}_{t}(0,0)|\). Thus (i), (ii) and (iv) in Assumption 3.1 are satisfied.
Furthermore, by the identity \(x=x^{+} - (-x)^{+}\) and Grohs et al. [28, Lemma 4.12] the function \(\mathfrak{min}_{k} \colon \mathbb{R}^{k} \to \mathbb{R}\), \(z \mapsto \min _{j=1,\ldots ,k} z_{j}\) can be realised by a deep neural network with size at most \(12k^{3}\). We now set
Then the 1-Lipschitz property of \(\mathfrak{min}_{k}\), which follows from the fact that the pointwise minimum of 1-Lipschitz functions is again 1-Lipschitz, implies the bounds \(\mathrm{Lip}(\bar{\eta}_{\varepsilon ,d,t}) \leq \sqrt{2} \mathrm{Lip}({\eta}_{\varepsilon ,d-1,t})\) and
The bound on \(\mathrm{size}(\bar{\eta}_{\varepsilon ,d,t})\) follows from the bound on \(\mathrm{size}(\eta _{\varepsilon ,d-1,t})\) and bounds for the operations composition, parallelisation and the realisation of the identity (which yields a bound for the size of the neural network realising \(x \mapsto x_{1:d}\)). Finally, it is straightforward to deduce the inequalities \(|\bar{f}_{t}^{d}(x,y)| \leq \sqrt{2} |f_{t}^{d-1}(x_{1:d-1},y_{1:d-1})|\) and \(|\bar{\eta}_{\varepsilon ,d,t}(x,y)| \leq \sqrt{2} |\eta _{ \varepsilon ,d-1,t}(x_{1:d-1},y_{1:d-1})|\) so that all the required bounds follow from the corresponding properties of \(X^{d-1}\).
In the case of the running maximum, one proceeds analogously except that the growth bounds obtained at the end are now a bit different. More specifically, in this case, we may deduce that \(|\bar{f}_{t}^{d}(x,y)| \leq d |f_{t}^{d-1}(x_{1:d-1},y_{1:d-1})|+|x|\) and in addition, \(|\bar{\eta}_{\varepsilon ,d,t}(x,y)| \leq d |\eta _{\varepsilon ,d-1,t}(x_{1:d-1},y_{1:d-1})|+|x|\), which still allows us to deduce the claimed statement. □
4.3 Auxiliary results
This section contains auxiliary results that are needed for the proofs of Theorems 3.6 and 3.12. We start with Lemma 4.4, which establishes growth properties of the payoff function and its neural network approximation.
Lemma 4.4
Suppose Assumption 3.4 holds. Then for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(x \in \mathbb{R}^{d}\) and \(t \in \{0,\ldots ,T\}\), it holds that
Proof
First note that from (3.6), (3.8) and the growth assumption on \(g_{d}\), we obtain for every \(\bar{\varepsilon} \in (0,1]\) that
Letting \(\bar{\varepsilon}\) tend to 0 gives (4.6). Moreover, the same properties of \(g_{d}\) and \(\phi _{{\varepsilon},d,t}\) imply
□
The next result establishes growth properties of the Markov update function and its neural network approximation.
Lemma 4.5
Suppose Assumption 3.1 is satisfied. Then for each \(d \in \mathbb{N}\), \(x,y \in \mathbb{R}^{d}\), \(t \in \{0,\ldots ,T-1\}\) and \(\varepsilon \in (0,1]\), it holds that
Proof
The proof is a straightforward consequence of (3.2), Assumption 3.1 (iv) and (3.4). Indeed, these hypotheses imply for every \(\bar{\varepsilon} \in (0,1]\) that
Letting \(\bar{\varepsilon}\) tend to 0 yields (4.8). In addition, the same hypotheses yield
□
In the next lemma, we establish a bound on the conditional moments of \(X^{d}\). The proof, together with \(\mathbb{E}[|X_{0}^{d}|]<\infty \), also yields \(\mathbb{E}[|X_{t}^{d}|]<\infty \) for all \(t\), so that the conditional expectation in (4.10) is well defined for all \(x \in \mathbb{R}^{d}\); cf. Remark 3.3.
Lemma 4.6
Suppose Assumption 3.1 or 3.9 is satisfied. Then for all \(d \in \mathbb{N}\), \(x \in \mathbb{R}^{d}\) and \(s,t \in \{0,\ldots ,T\}\) with \(s \geq t\), it holds that
with \(\tilde{c}_{1} = 2\max (c,1)^{T+1} T\), \(\tilde{q}_{1} = q(T+1)\) in the case of Assumption 3.1 and with \(\tilde{c}_{1} = T\max (h,1)^{\frac{T}{2m\max (p,2)}}\), \(\tilde{q}_{1} = \frac{\bar{q}T}{2m\max (p,2)}\) in the case of Assumption 3.9.
Proof
Assume without loss of generality that \(c \geq 1\). Consider first the case when Assumption 3.1 holds. Then (4.8) can be used to prove inductively that for all \(s \geq t\),
Indeed, for \(s=t\), this directly follows from the definition. Assume now \(s>t\) and (4.11) for \(s-1,s-2,\ldots ,t\). Then (4.8) and independence yield
as claimed. This shows that (4.11) holds for all \(s\geq t\). From (4.11) and Assumption 3.1 (iv), we obtain
If Assumption 3.9 holds, we first note that independence, Jensen’s inequality and (3.13) yield
We can now apply this estimate instead of (4.8) to get from the first to the second line in (4.12) and arrive at
Hence the conclusion follows. □
The next result ensures that the optimal value (2.1) is finite in our setting.
Lemma 4.7
Suppose Assumption 3.4 holds and Assumption 3.1 or 3.9 is satisfied. Then \(\mathbb{E}[|g_{d}(t,X_{t}^{d})|]< \infty \) for all \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\).
Proof
Let \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\). Then Lemmas 4.4 and 4.6 and Assumption 3.1 or 3.9 ensure that
□
The next lemma proves that the value function grows at most linearly. Recall from Remark 3.3 that Lemma 4.7 allows us to recursively define the value function for all \(x \in \mathbb{R}^{d}\) as the right-hand side of (2.2).
Lemma 4.8
Suppose Assumption 3.4 holds and Assumption 3.1 or 3.9 is satisfied. Then for all \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\) and \(x \in \mathbb{R}^{d}\), it holds that
where \(\hat{c}_{t} = \max (c,1) (3\max (c,1)^{2})^{T-t}\), \(\hat{q}_{t} = q + 2q(T-t)\) in the case of Assumption 3.1 and \(\hat{c}_{t} = c (T+1) \max (h,1)^{\frac{T-t}{2m\max (p,2)}}\), \(\hat{q}_{t} = q + \frac{\bar{q}(T-t)}{2m\max (p,2)}\) in the case of Assumption 3.9.
Proof
Consider first the case of Assumption 3.1. The proof proceeds by backward induction. For \(t=T\), the statement directly follows from (4.6). Assume now the statement holds for \(t+1\). Then (2.2), (4.6), the induction hypothesis and (4.11) yield
Hence (4.14) also holds at \(t\) and the statement follows by induction.
In the case of Assumption 3.9, we aim to provide a tighter estimate and instead inductively prove that \(|V_{d}(t,x)| \leq \hat{a}_{t} + \hat{b}_{t}|x|\), where \(\hat{a}_{t} = \hat{a}_{t+1} + \hat{b}_{t+1} (h d^{\bar{q}})^{ \frac{1}{2m\max (p,2)}}\), \(\hat{a}_{T} = c d^{q}\), \(\hat{b}_{t} = \hat{b}_{t+1} (\max (h,1) d^{\bar{q}})^{ \frac{1}{2m\max (p,2)}}\), \(\hat{b}_{T} = c d^{q}\). Indeed, using (4.13) instead of (4.11), we analogously obtain
from which the statement follows by noting \(\hat{b}_{t} = cd^{q} (\max (h,1) d^{\bar{q}})^{ \frac{T-t}{2m\max (p,2)}}\),
□
The next result formalises the intuitive fact that a neural network in which some input arguments are held at fixed values is still a neural network, with at most as many non-zero parameters as the original network.
Lemma 4.9
Let \(d_{0},d_{1},m \in \mathbb{N}\), \(y \in \mathbb{R}^{d_{1}}\) and let \(\phi \colon \mathbb{R}^{d_{0}+d_{1}} \to \mathbb{R}^{m}\) be a neural network. Then \(\Phi _{y} \colon \mathbb{R}^{d_{0}} \to \mathbb{R}^{m}\), \(x \mapsto \phi ((x,y))\) can again be realised by a neural network \(\phi _{y}\) with \(\mathrm{size}(\phi _{y}) \leq \mathrm{size}(\phi )\).
Proof
Let us denote by \(((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\) the parameters of \(\phi \) for some \(L \in \mathbb{N}\), \(N_{0}= d_{0}+d_{1}\), \(N_{1},\ldots ,N_{L-1} \in \mathbb{N}\), \(N_{L}=m\) and \(A^{\ell} \in \mathbb{R}^{N_{\ell }\times N_{\ell -1}}\), \(b^{\ell }\in \mathbb{R}^{N_{\ell}}\), \(\ell =1,\ldots ,L\). Denote by \(A^{1,0} \in \mathbb{R}^{N_{1} \times d_{0}}\) and \(A^{1,1} \in \mathbb{R}^{N_{1} \times d_{1}}\) the first \(d_{0}\) and the remaining \(d_{1}\) columns of \(A^{1}\), respectively. Consider the neural network \(\phi _{y}\) defined by the parameter choice \(((A^{1,0},A^{1,1}y+b^{1}),(A^{2},b^{2}),\ldots ,(A^{L},b^{L}))\). Then
for all \(x \in \mathbb{R}^{d_{0}}\) and \(\mathrm{size}(\phi _{y}) \leq \mathrm{size}(\phi )\), as claimed. □
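The bias-absorption step of Lemma 4.9 is easy to check numerically: the columns of \(A^{1}\) acting on the frozen inputs are folded into the first-layer bias, while all deeper layers stay untouched. The following sketch (with illustrative shapes) verifies the first-layer pre-activations agree.

```python
import numpy as np

def fix_inputs(A1, b1, y, d0):
    """Freeze the last d1 inputs of the first layer at y:
    split A1 = [A^{1,0} | A^{1,1}] and absorb A^{1,1} @ y into the
    bias, as in Lemma 4.9; deeper layers are unchanged."""
    A10, A11 = A1[:, :d0], A1[:, d0:]
    return A10, A11 @ y + b1

# check: first-layer pre-activations agree on the input (x, y)
rng = np.random.default_rng(2)
A1, b1 = rng.standard_normal((5, 7)), rng.standard_normal(5)
x, y = rng.standard_normal(4), rng.standard_normal(3)
A10, b1y = fix_inputs(A1, b1, y, d0=4)
assert np.allclose(A1 @ np.concatenate([x, y]) + b1, A10 @ x + b1y)
```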
The next lemma allows us to construct a realisation of a random neural network and at the same time obtain a bound on the neural network weights.
Lemma 4.10
Let \(d, N \in \mathbb{N}\), let \(M_{1},M_{2} >0\), let \(U\) be a nonnegative random variable and let \(Y_{1},\ldots ,Y_{N}\) be i.i.d. \(\mathbb{R}^{d}\)-valued random variables. Suppose \(\mathbb{E}[U] \leq M_{1}\) and \(\mathbb{E}[|Y_{1}|]\leq M_{2}\). Then
Proof
First, by the i.i.d. assumption, it follows that
Next note that Bernoulli’s inequality implies \((\frac{2}{3})^{1/N} \leq 1 - \frac{1}{3N}\) and therefore, by Markov’s inequality,
Thus we obtain \((1-\mathbb{P}[|Y_{1}| > 3 N M_{2}])^{N} \geq \frac{2}{3}\), and inserting this in (4.15) yields
Furthermore, Markov’s inequality implies that
Combining now (4.16) and (4.17) with the elementary fact that for \(A, B \in \mathcal {F}\), we have \(\mathbb{P}[A\cap B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cup B] \geq \mathbb{P}[A] + \mathbb{P}[B] -1\) yields, as claimed, that
□
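For the reader's convenience, the Bernoulli and Markov steps in the proof above read as follows:

```latex
% Bernoulli's inequality (1+u)^N \geq 1+Nu with u=-\tfrac{1}{3N} gives
\Big(1-\frac{1}{3N}\Big)^{N} \;\geq\; 1-N\cdot \frac{1}{3N} \;=\; \frac{2}{3},
\qquad\text{hence}\qquad
\Big(\frac{2}{3}\Big)^{1/N} \;\leq\; 1-\frac{1}{3N},
% and Markov's inequality yields
\mathbb{P}\big[|Y_{1}| > 3NM_{2}\big]
\;\leq\; \frac{\mathbb{E}[|Y_{1}|]}{3NM_{2}}
\;\leq\; \frac{M_{2}}{3NM_{2}} \;=\; \frac{1}{3N}.
```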
4.4 Proof of Theorem 3.6 and Corollary 3.8
With these preparations, we are now ready to prove Theorem 3.6. The proof is divided into several steps which are highlighted in bold in order to facilitate reading. We also refer to Sect. 3.2 for a sketch of the proof.
Proof of Theorem 3.6
1. Preliminaries. Without loss of generality, we may assume that the constants \(c>0\), \(q,\alpha \geq 0 \) in the statement of the theorem and in Assumptions 3.1 and 3.4 coincide; otherwise we replace each of them by the respective maximum and all the assumptions are still satisfied. We may also assume that \(c \geq 1\).
Further, if for each fixed \(t \in \{0,\ldots ,T\}\), there exist constants \(\kappa _{t},\mathfrak{q}_{t},\mathfrak{r}_{t} \in [0,\infty )\) and a neural network \(\psi _{\varepsilon ,d,t}\) such that \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa _{t} d^{ \mathfrak{q}_{t}}\varepsilon ^{-\mathfrak{r}_{t}}\) and (3.9) hold for all \(\varepsilon \in (0,1]\) and \(d \in \mathbb{N}\), then also the statement of the theorem follows by choosing \(\kappa \), \(\mathfrak{q}\), \(\mathfrak{r}\) as the respective maxima over \(t \in \{0,\ldots ,T\}\).
Next, let \(c_{0},\ldots ,c_{T}\) satisfy \(c_{0} = c\), \(c_{t+1} = \max (3c,1)^{2\max (p,2)}(1+c_{t}+c)\) and set \(q_{t} = (2\max (p,2)+1)qt+q\). Then
for all \(t \in \{0,\ldots ,T\}\), and \(c_{t}\) does not depend on \(d\).
2. Stronger statement. We now proceed to prove the following stronger statement, which shows that the constants \(\kappa _{t}\), \(\mathfrak{q}_{t}\), \(\mathfrak{r}_{t}\) can be chosen essentially independently of the probability measure \(\rho ^{d}\) and in addition, \(\rho ^{d}\) may be allowed to depend on \(t\). Specifically, we prove that for any \(t \in \{0,\ldots ,T\}\), there exist constants \(\kappa _{t},\mathfrak{q}_{t},\mathfrak{r}_{t} \in [0,\infty )\) such that for any family of probability measures \(\rho _{t}^{d}\) on \(\mathbb{R}^{d}\), \(d \in \mathbb{N}\), with
and for all \(d \in \mathbb{N}\) and \(\varepsilon \in (0,1]\), there exists a neural network \(\psi _{\varepsilon ,d,t}\) such that
and
Choosing \(\rho _{t}^{d} = \rho ^{d}\) for all \(t\) and noting that (4.18) is satisfied due to \(q \leq q_{t}\), \(c \leq c_{t}\), the statement of Theorem 3.6 then follows.
In order to prove the stronger statement for each fixed \(t\), we now proceed by backward induction.
3. Base case of backward induction. For \(t=T\), we have \(V_{d}(T,x) = g_{d}(T,x)\). Choose \(\psi _{\varepsilon ,d,T} = \phi _{\tilde{\varepsilon},d,T}\) with \(\tilde{\varepsilon} = \varepsilon (c d^{q} (1+ (c_{T}d^{q_{T}} )^{ \frac{p}{2\max (p,2)}}))^{-1}\). Then (3.6), Jensen’s inequality and (4.18) imply
Furthermore, (3.7) implies that the neural network chosen as \(\psi _{\varepsilon ,d,T} = \phi _{\tilde{\varepsilon },d,T}\) satisfies \(\mathrm{size}(\psi _{\varepsilon ,d,T}) \leq c d^{q} \varepsilon ^{- \alpha} (c d^{q} (1+ (c_{T}d^{q_{T}} )^{\frac{p}{2\max (p,2)}}))^{ \alpha}\). Combining this with (3.8) and (4.7), it follows that there exist \(\kappa _{T},\mathfrak{q}_{T},\mathfrak{r}_{T} \in [0,\infty )\) such that for any family of probability measures \(\rho _{T}^{d}\) on \(\mathbb{R}^{d}\), \(d \in \mathbb{N}\), with (4.18) and for all \(d \in \mathbb{N}\) and \(\varepsilon \in (0,1]\), there exists a neural network \(\psi _{\varepsilon ,d,T}\) such that (4.19)–(4.22) hold, that is, the statement in Step 2 follows in the case \(t=T\).
4. Start of the induction step. The remainder of the proof is now dedicated to the induction step. To improve readability, we again divide it into several steps.
For the induction step, we now assume that the stronger statement in Step 2 above holds for time \(t+1\) and aim to prove it for time \(t\). To this end, let \(\rho _{t}^{d}\) be a probability measure satisfying (4.18) and denote by \(\nu ^{d}_{t}\) the distribution of \(Y_{t}^{d}\).
5. Induction hypothesis. Let \(\kappa _{t+1},\mathfrak{q}_{t+1},\mathfrak{r}_{t+1} \in [0,\infty )\) denote the constants with which the stronger statement in Step 2 above holds for time \(t+1\).
Consider the probability measure \(\rho _{t+1}^{d} = (\rho _{t}^{d} \otimes \nu _{t}^{d}) \circ (f_{t}^{d})^{-1}\) given as the image measure of \(\rho _{t}^{d} \otimes \nu _{t}^{d}\) under \(f_{t}^{d}\), where we recall that \(\nu ^{d}_{t}\) is the distribution of \(Y_{t}^{d}\). Then using the change-of-variables formula, (4.8), (4.18) and Assumption 3.1 (iv) and writing \(\bar{p}=2\max (p,2)\), this measure satisfies
Hence by the induction hypothesis, for any \(\varepsilon \in (0,1]\) and \(d \in \mathbb{N}\), there exists a neural network \(\psi _{\varepsilon ,d,t+1}\) such that
and
Now let \(\varepsilon \in (0,1]\) and \(d \in \mathbb{N}\) be given. The remainder of the proof consists in selecting \(\kappa _{t}\), \(\mathfrak{q}_{t}\), \(\mathfrak{r}_{t}\) (only depending on \(c\), \(\alpha \), \(p\), \(q\), \(t\), \(T\), \(\kappa _{t+1}\), \(\mathfrak{q}_{t+1}\), \(\mathfrak{r}_{t+1}\)) and constructing a neural network \(\psi _{\varepsilon ,d,t}\) such that (4.19)–(4.22) are satisfied. This will complete the proof.
In what follows, we fix \(\bar{\varepsilon} \in (0,1)\) and choose
The value of \(\bar{\varepsilon}\) will be chosen later (depending on \(\varepsilon \) and \(d\)).
6. Approximation of the continuation value. Let \(Y^{d,i}_{t}\), \(i \in \mathbb{N}\), be i.i.d. copies of \(Y^{d}_{t}\) and set \(\hat{v}_{\bar{\varepsilon},d,t+1} = \psi _{\bar{\varepsilon},d,t+1} \). Define the (random) function
Note that \(\Gamma _{\bar{\varepsilon},d,t}\) is a random function since the samples \(Y^{d,i}_{t}\) are random.
We now estimate the expected \(L^{2}(\rho ^{d}_{t})\)-error that arises when \(\Gamma _{\bar{\varepsilon},d,t}\) is used to approximate the continuation value. Let \(Z^{\bar{\varepsilon},d,i}(x)= \hat{v}_{\bar{\varepsilon},d,t+1}( \eta _{\bar{\varepsilon},d,t}(x,Y^{d,i}_{t}))\) and recall that Assumption 3.1 (i), (ii) implies that \(Y^{d}_{t}\) is independent of \(X^{d}_{t}\). With this notation, \(\Gamma _{\bar{\varepsilon},d,t}(x) = \frac{1}{N} \sum _{i=1}^{N} Z^{ \bar{\varepsilon},d,i}(x)\) and thus the bias–variance decomposition and the fact that \(Y^{d,i}_{t}\), \(i \in \mathbb{N}\), are i.i.d. show that
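The bias–variance mechanism behind this estimate can be illustrated numerically. The following sketch uses toy stand-ins for \(\hat{v}_{\bar{\varepsilon},d,t+1}\) and \(\eta _{\bar{\varepsilon},d,t}\) (the functions `v`, `eta` and all numerical choices below are hypothetical, not the paper's construction) and checks that the mean-squared error of the Monte Carlo average \(\Gamma \) decays as the number \(N\) of i.i.d. copies grows:

```python
import numpy as np

# Illustrative sketch only: a Monte Carlo estimator
#   Gamma(x) = (1/N) * sum_i v(eta(x, Y_i))
# of a continuation value E[v(eta(x, Y))].  The functions v, eta and
# all numbers are hypothetical stand-ins, not the paper's objects.
rng = np.random.default_rng(0)

def v(z):
    # toy stand-in for the network psi_{eps,d,t+1} (1-Lipschitz)
    return np.maximum(z, 0.0)

def eta(x, y):
    # toy stand-in for the approximate transition map eta_{eps,d,t}
    return x + y

def gamma(x, N):
    # Monte Carlo average over N i.i.d. copies of Y_t^d
    y = rng.standard_normal(N)
    return np.mean(v(eta(x, y)))

x = 0.3
# high-accuracy reference value for E[v(eta(x, Y))]
truth = np.mean(v(eta(x, rng.standard_normal(10**6))))

def mse(N, reps=2000):
    # empirical mean-squared error of Gamma; by the bias-variance
    # decomposition this behaves like bias^2 + Var(v(...))/N
    return np.mean([(gamma(x, N) - truth) ** 2 for _ in range(reps)])

m10, m1000 = mse(10), mse(1000)
```

Increasing \(N\) from 10 to 1000 shrinks the empirical MSE by roughly two orders of magnitude, matching the \(1/N\) variance term in the decomposition.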
The first integral in the last line of (4.29) can be estimated as
6.a) Applying the error estimate from \(t+1\). Now consider the first term on the right-hand side of (4.30) and recall that \(\rho _{t+1}^{d} = (\rho _{t}^{d} \otimes \nu _{t}^{d}) \circ (f_{t}^{d})^{-1}\) is the image measure of \(\rho _{t}^{d} \otimes \nu _{t}^{d}\) under \(f_{t}^{d}\). Then Jensen’s inequality, (3.1), Assumption 3.1 (ii) and (4.24) yield
6.b) Applying the Lipschitz property of the network at \(t+1\). For the second term on the right-hand side of (4.30), note that by the induction hypothesis (4.27), we have \(\mathrm{Lip}(\hat{v}_{\bar{\varepsilon},d,t+1}) \leq \kappa _{t+1} d^{ \mathfrak{q}_{t+1}}\). Hence (3.2), the assumption \(\mathbb{E}[|Y^{d}_{t}|^{p}] \leq c d^{q}\) and (4.18) imply that
6.c) Applying the growth property of the network at \(t+1\). For the last term in (4.29), note that \(|\hat{v}_{\bar{\varepsilon},d,t+1}(x)| \leq \kappa _{t+1} d^{ \mathfrak{q}_{t+1}} \bar{\varepsilon}^{-\mathfrak{r}_{t+1}} (1+|x|)\) by the induction hypothesis. Combining this with (4.9), \(\mathbb{E}[|Y_{t}^{d}|^{2}] \leq c d^{q}\), Hölder’s inequality and (4.18) yields
6.d) Bounding the overall error and constructing a realisation. We can now insert the estimates from (4.31) and (4.32) into (4.30) and subsequently insert the resulting bound and (4.33) into (4.29). We obtain
with constants chosen as \(\tilde{c}_{2} = 2+8\max (c,1)^{2}\kappa _{t+1}^{2}(4+\max (c,1)^{2} + \max (c_{t},1))\) and \(\tilde{q}_{2} = 2(q+\mathfrak{q}_{t+1})+\max (2q,q_{t})\). But (4.34) implies that
Therefore Assumption 3.1 (iv), the fact that \(Y^{d,1}_{t},\ldots ,Y^{d,N}_{t}\) are i.i.d. copies of \(Y^{d}_{t}\) and Lemma 4.10 prove that there exists \(\omega \in \Omega \) such that the function defined via \(\gamma _{\bar{\varepsilon},d,t}(x) = \frac{1}{N}\sum _{i=1}^{N} \hat{v}_{\bar{\varepsilon},d,t+1}(\eta _{\bar{\varepsilon},d,t}(x,Y^{d,i}_{t}( \omega )))\) (i.e., the realisation of \(\Gamma _{\bar{\varepsilon},d,t}\) at \(\omega \)) satisfies
and
We now define
and claim that \(\psi _{\varepsilon ,d,t}=\hat{v}_{\bar{\varepsilon},d,t}\) satisfies all the properties required in (4.19)–(4.22).
7. Growth bound on the constructed network. Let us first verify (4.20). Indeed, the growth bound on \(\phi _{\bar{\varepsilon},d,t}\) in (4.7), the induction hypothesis (4.25), the growth bound (4.9) on \(\eta _{\bar{\varepsilon},d,t}\), the bound (4.36) on \(|Y_{t}^{d,i}(\omega )|\) and the choice of \(N\) in (4.28) imply for all \(x \in \mathbb{R}^{d}\) that
with \(\tilde{c}_{3} = 18 \max (c,1,\kappa _{t+1})\max (c,1)^{2}\), \(\tilde{q}_{3} = \mathfrak{q}_{t+1}+2q\) and \(\tilde{r}_{3} = 3\mathfrak{r}_{t+1}+2\).
8. Bounding the size of the constructed network. Next we verify (4.21). To achieve this, we first observe that for each \(i \in \{1,\ldots ,N\}\), Lemma 4.9 shows that the map \(x \mapsto \eta _{\bar{\varepsilon},d,t}(x,Y^{d,i}_{t}(\omega ))\) can be realised as a neural network with size at most \(\mathrm{size}(\eta _{\bar{\varepsilon},d,t})\). Next, the composition of two ReLU neural networks \(\phi _{1}\), \(\phi _{2}\) can again be realised by a ReLU neural network with size at most \(2 (\mathrm{size}(\phi _{1}) + \mathrm{size}(\phi _{2}))\) (see e.g. Opschoor et al. [39, Proposition 2.2]). Finally, Gonon and Schwab [25, Lemma 3.2] shows that the weighted sum of deep neural networks \(\phi _{1},\ldots ,\phi _{N}\) with the same number of layers, the same input dimension and the same output dimension can be realised by another deep neural network with size at most \(\sum _{i=1}^{N} \mathrm{size}(\phi _{i})\). Therefore \(\gamma _{\bar{\varepsilon},d,t}\) can be realised as a deep neural network with
where the last step follows from the induction hypothesis (4.26) and the bound (3.3) on the size of \(\eta _{\bar{\varepsilon},d,t}\). Next, subtracting a constant corresponds to a change of the “bias” \(b^{L}\) in the last layer, and therefore also \(\phi _{\bar{\varepsilon},d,t} - \delta \) is a neural network satisfying \(\mathrm{size}(\phi _{\bar{\varepsilon},d,t} - \delta ) = \mathrm{size}(\phi _{\bar{\varepsilon},d,t})\). Now define the neural network \(\mathfrak{m}\colon \mathbb{R}^{2} \to \mathbb{R}\) by \(\mathfrak{m}(x,y) = A^{2} \varrho (A^{1}(x,y)^{\top})\) with
Then \(\mathfrak{m}(x,y)= \max (x-y,0)+\max (y,0)-\max (-y,0) = \max (x,y)\). Thus \(\hat{v}_{\bar{\varepsilon},d,t}\) in (4.37) can be realised as a neural network by the composition of \(\mathfrak{m}\) with the parallelisation of \(\phi _{\bar{\varepsilon},d,t} - \delta \) and \(\gamma _{\bar{\varepsilon},d,t}\) (see e.g. Opschoor et al. [39, Proposition 2.3]), and the size of the parallelisation is bounded by \(\mathrm{size}(\phi _{\bar{\varepsilon},d,t}) + \mathrm{size}(\gamma _{ \bar{\varepsilon},d,t})\). This and the bound (3.7) on the size of \(\phi _{\bar{\varepsilon},d,t}\), the choice of \(N\) in (4.28) and the bound (4.39) on the size of \(\gamma _{\bar{\varepsilon},d,t}\) imply that
with \(\tilde{c}_{4} = 2(7+5c + 4\kappa _{t+1})\), \(\tilde{q}_{4} = \max (q,\mathfrak{q}_{t+1})\) and \(\tilde{r}_{4} =2 \mathfrak{r}_{t+1}+2+\max (\alpha ,\mathfrak{r}_{t+1})\).
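The identity \(\mathfrak{m}(x,y)=\max (x,y)\) from the previous step can be checked directly. The weight matrices \(A^{1}\), \(A^{2}\) below are one choice consistent with the displayed formula \(\max (x-y,0)+\max (y,0)-\max (-y,0)\) (the paper's exact matrices appear in the omitted display):

```python
import numpy as np

# One choice of weights consistent with
#   m(x,y) = max(x-y,0) + max(y,0) - max(-y,0)
A1 = np.array([[1.0, -1.0],    # row producing x - y
               [0.0,  1.0],    # row producing  y
               [0.0, -1.0]])   # row producing -y
A2 = np.array([1.0, 1.0, -1.0])

def m(x, y):
    # one hidden ReLU layer, then an affine output layer (zero bias)
    h = np.maximum(A1 @ np.array([x, y]), 0.0)
    return float(A2 @ h)
```

For instance, `m(3.0, 5.0)` and `m(5.0, 3.0)` both return `5.0`, as the identity predicts.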
9. Lipschitz constant of the constructed network. Next we verify (4.22). To do this, we note that the induction hypothesis (4.27) and the Lipschitz property (3.4) of \(\eta _{\bar{\varepsilon},d,t}\) imply for all \(x,y \in \mathbb{R}^{d}\) that
In addition, (3.8) implies \(\mathrm{Lip}(\phi _{\bar{\varepsilon},d,t} - \delta ) = \mathrm{Lip}( \phi _{\bar{\varepsilon},d,t}) \leq c d^{q}\), and the pointwise maximum of two Lipschitz-continuous functions is again Lipschitz-continuous with Lipschitz constant given by the maximum of the two Lipschitz constants. Combining this with (4.41) yields
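The elementary fact about the pointwise maximum invoked here can be written out as follows: if \(f\) and \(g\) are Lipschitz with constants \(L_{f}\) and \(L_{g}\), then for all \(x,y\),

```latex
\bigl|\max(f,g)(x) - \max(f,g)(y)\bigr|
\;\le\; \max\bigl(|f(x)-f(y)|,\; |g(x)-g(y)|\bigr)
\;\le\; \max(L_f, L_g)\, |x-y|.
```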
10. Bounding the overall approximation error. We now work towards verifying the approximation error bound (4.19). To achieve this, let
be the continuation region and
the approximate continuation region. Then (2.2), (4.37), (4.43) and (4.44) imply that
We now estimate (the integral of) each of these four terms separately. For the first term, we directly get from (3.6) that
and so we proceed with analysing the second term.
10.a) Bounding the approximation error on \(C_{t} \cap \hat{C}_{t}^{c}\). From Lemma 4.8, we have the growth bound \(|V_{d}(t+1,x)| \leq \hat{c}_{t+1} d^{\hat{q}_{t+1}} (1+|x|)\) and so, using (4.7), the second term in (4.45) can be estimated as
Combining this with (4.18), (4.10) in Lemma 4.6 and Hölder’s inequality, we obtain with \(C_{\mathrm{aux}} = 4 \max (\hat{c}_{t+1},c) d^{\hat{q}_{t+1}}\) that
10.b) Estimating \(\rho ^{d}_{t}({C_{t} \cap \hat{C}_{t}^{c}})\). Next we estimate \(\rho ^{d}_{t}({C_{t} \cap \hat{C}_{t}^{c}})\). To do this, set
and note that employing (4.43), (4.44), (4.47) and (4.48) to verify the inclusion \(A^{c} \cap B^{c} \cap C_{t} \subseteq \hat{C}_{t}\) yields
Indeed, \(|g_{d}(t,x) - \phi _{\bar{\varepsilon},d,t}(x)|\leq \frac{\delta}{2}\), \(|\mathbb{E}[V_{d}(t+1,X_{t+1}^{d})|X_{t}^{d}=x]- \gamma _{ \bar{\varepsilon},d,t}(x)|\leq \frac{\delta}{2} \) and \(g_{d}(t,x) < \mathbb{E}[V_{d}(t+1,X_{t+1}^{d})|X_{t}^{d}=x]\) lead in combination to
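Written as a chain (assuming, as in (4.44), that \(\hat{C}_{t}\) is the set where \(\phi _{\bar{\varepsilon},d,t} - \delta < \gamma _{\bar{\varepsilon},d,t}\)), the three bounds combine to

```latex
\gamma_{\bar{\varepsilon},d,t}(x)
\;\ge\; \mathbb{E}\bigl[V_{d}(t+1,X_{t+1}^{d}) \,\big|\, X_{t}^{d}=x\bigr] - \tfrac{\delta}{2}
\;>\; g_{d}(t,x) - \tfrac{\delta}{2}
\;\ge\; \phi_{\bar{\varepsilon},d,t}(x) - \delta,
```

so that \(x \in \hat{C}_{t}\).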
Furthermore, (4.47), Markov’s inequality, (4.18) and (3.6) imply that
Similarly, (4.48), Markov’s inequality and (4.35) yield
Putting together (4.46), (4.49)–(4.51) and inserting the choices (4.28), we obtain
with constants chosen as \(\tilde{c}_{5} = 4 \max (\hat{c}_{t+1},c) (1+ 2\tilde{c}_{1} (1+ c_{t}^{ \frac{1}{4}})) (8(1+c_{t})c^{2} + 24 \tilde{c}_{2} )^{\frac{1}{4}}\) and \(\tilde{q}_{5} =\hat{q}_{t+1}+\tilde{q}_{1}+\frac{q_{t}}{4}+ \frac{1}{4}\max (2q+q_{t},\tilde{q}_{2}) \).
10.c) Bounding the approximation error on \(C_{t}^{c} \cap \hat{C}_{t}\). We are now concerned with the third term in (4.45). Observe that
For \(x \in A^{c} \cap \hat{C}_{t}\), we use (4.44) and (4.47) to obtain
Together with (4.43) and (4.48), this therefore implies for \(x \in A^{c} \cap B^{c} \cap C_{t}^{c} \cap \hat{C}_{t}\) that
Combining this with (4.53), we obtain
and consequently the growth bounds (4.6), (4.10) and (4.14) on \(g_{d}\), on the conditional moments and on \(V_{d}\), Hölder’s inequality and the approximation error bound (4.35) for the continuation value yield
Inserting the bound (4.18) on the moments of \(\rho ^{d}_{t}\), the upper bounds (4.50), (4.51) on \(\rho ^{d}_{t}(A)\), \(\rho ^{d}_{t}(B)\) and the choices (4.28) for \(N\) and \(\delta \) thus shows that
where \(\tilde{c}_{6} = 2 \hat{c}_{t+1} (1+ 2\tilde{c}_{1} (1+c_{t}^{ \frac{1}{4}})) ((8(1+c_{t})c^{2} )^{\frac{1}{4}}+(24 \tilde{c}_{2} )^{ \frac{1}{4}}) + 2 (6 \tilde{c}_{2} )^{1/2} + \frac{7}{2}\) and \(\tilde{q}_{6} = \hat{q}_{t+1}+\frac{1}{2}q+\tilde{q}_{1}+ \frac{q_{t}}{2}+\frac{\tilde{q}_{2}}{2}\).
10.d) Combining the individual error estimates. Finally, note that the second and last line of (4.50) yield
Consequently, combining the decomposition (4.45) with the estimates (4.35), (4.52), (4.55) and (4.56), we obtain
with \(\tilde{c}_{7} = (2(1+c_{t}))^{\frac{1}{2}} c + 1 + \tilde{c}_{5} + \tilde{c}_{6} + (6 \tilde{c}_{2} )^{\frac{1}{2}}\) and \(\tilde{q}_{7} = \max (q+\frac{q_{t}}{2},\tilde{q}_{5},\tilde{q}_{6}, \frac{\tilde{q}_{2}}{2})\). Now choose
Inserting (4.58) in (4.57) proves that (4.19) is satisfied. Furthermore, choosing
we obtain from (4.38), (4.40) and (4.42) that (4.20)–(4.22) are satisfied. This completes the induction step, and hence the statement follows. □
Proof of Corollary 3.8
Fix \(d \in \mathbb{N}\) and \(h \in [-R,R]^{d}\) and set \(\rho ^{d} = \frac{1}{2}\nu ^{d}_{0} + \frac{1}{2} \nu ^{d}_{h} \), where \(\nu ^{d}_{x}\) denotes a multivariate normal distribution on \(\mathbb{R}^{d}\) with mean \(x\) and identity covariance. Then we estimate
and so \(|h|\leq R d^{1/2} \) and (4.4) show that there exist \(c >0\) and \(q \geq 0\) only depending on \(p\) and \(R\) such that the bound \(\int _{\mathbb{R}^{d}} |x|^{2\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\) holds. Hence we can apply Theorem 3.6 and obtain for all \(\varepsilon \in (0,1]\) and \(t \in \{0,\ldots ,T\}\) the existence of a neural network \(\psi _{\varepsilon ,d,t}\) such that (3.9) holds. From the proof of Theorem 3.6, we obtain that these networks satisfy the Lipschitz condition (4.22). Therefore, for any \(\varepsilon >0\), we may use Minkowski’s inequality, the bound \(\|\cdot \|_{L^{2}(\nu _{y}^{d})} \leq \sqrt{2}\|\cdot \|_{L^{2}( \rho ^{d})}\) for \(y \in \{0,h\}\), the approximation error bound (3.9) and the Lipschitz property (4.22) to estimate
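The norm comparison \(\|\cdot \|_{L^{2}(\nu _{y}^{d})} \leq \sqrt{2}\|\cdot \|_{L^{2}(\rho ^{d})}\) used in this estimate is immediate from the mixture form \(\rho ^{d} = \frac{1}{2}\nu ^{d}_{0} + \frac{1}{2} \nu ^{d}_{h}\): for any measurable \(f\) and \(y \in \{0,h\}\),

```latex
\int_{\mathbb{R}^{d}} |f|^{2}\, d\nu_{y}^{d}
\;\le\; \int_{\mathbb{R}^{d}} |f|^{2}\, d\nu_{0}^{d}
      + \int_{\mathbb{R}^{d}} |f|^{2}\, d\nu_{h}^{d}
\;=\; 2 \int_{\mathbb{R}^{d}} |f|^{2}\, d\rho^{d}.
```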
This holds for any \(\varepsilon >0 \), and from the statement in Step 2 of the proof of Theorem 3.6, the constants \(\kappa _{t}\), \(\mathfrak{q}_{t}\) do not depend on \(d\), \(\varepsilon \) and \(h\) (but they depend on \(R\)). Letting \(\varepsilon \) tend to 0 therefore yields the claimed statement. □
4.5 Proof of Theorem 3.12
This subsection is devoted to the proof of Theorem 3.12. It is based on the proof of Theorem 3.6 given in Sect. 4.4.
Proof of Theorem 3.12
Proving this result requires only slight modifications to the proof of Theorem 3.6. Without loss of generality, we may assume \(c\geq h\) and \(q \geq \bar{q}\). In Step 1, we only need to choose \(c_{0},\ldots ,c_{T}\) differently. Indeed, let \(c_{0} = c\), \(c_{t+1} = h(1+c_{t})\) and set \(q_{t} = \bar{q}(t+1)\). In Step 2, the stronger statement is modified accordingly: we prove that for any \(t \in \{0,\ldots ,T\}\), there exist constants \(\kappa _{t}, \mathfrak{q}_{t},\mathfrak{r}_{t} \in [0,\infty )\) such that for any family of probability measures \(\rho _{t}^{d}\) on \(\mathbb{R}^{d}\), \(d \in \mathbb{N}\), with
and for all \(d \in \mathbb{N}\) and \(\varepsilon \in (0,1]\), there exists a neural network \(\psi _{\varepsilon ,d,t}\) such that the approximation error estimate (4.19) holds and \(\psi _{\varepsilon ,d,t}\) satisfies the growth and size conditions (4.20) and (4.21) and the modified Lipschitz condition
For \(t=T\), condition (4.63) coincides with (4.22) and so the base case (Step 3) remains the same as in the proof of Theorem 3.6. Due to the assumption (3.13), also Steps 4 and 5 only require slight modifications; indeed (4.23) becomes
the Lipschitz condition (4.27) is replaced by
and we modify the choice of \(N\) in (4.28) to \(N =\lceil \bar{\varepsilon}^{-2 \mathfrak{r}_{t+1}-2-2\theta} \rceil \).
For the beginning of Step 6 and for Step 6.a), we proceed precisely as above and obtain the error estimates (4.29)–(4.31). In Step 6.b), the Lipschitz property (4.64) of the network now yields the additional factor \(\bar{\varepsilon}^{-2\zeta (T-t-1)}\), and the approximation property (3.2) only holds on \([-(\bar{\varepsilon}^{-\beta}),\bar{\varepsilon}^{-\beta}]^{d}\); see (3.11). Hence we estimate
The first term can be bounded as before. For the second, note that Jensen’s inequality and (3.13) imply that \(\mathbb{E}[|f_{t}^{d}(x,Y_{t}^{d})|] \leq c d^{q} (1+|x|)\). Hence we may apply Hölder’s inequality, the growth bound (3.14) on \(\eta _{\bar{\varepsilon},d,t}\), Markov’s inequality and (4.62) to obtain, with \(|x|_{\infty }= \max _{i=1,\ldots ,d}|x_{i}|\), that
For the last term in (4.65), we note that (3.13) and Jensen’s inequality give
Using this, Hölder’s inequality, (3.14) and Markov’s inequality, we obtain
Together, this yields
Similarly, in Step 6.c), the factor \(\bar{\varepsilon}^{-2\mathfrak{r}_{t+1}}\) is replaced by \(\bar{\varepsilon}^{-2\mathfrak{r}_{t+1}-2\theta}\) due to the growth bound (3.14). This and the modified bound (4.66) (as opposed to (4.32)) then also lead to a different estimate in Step 6.d) where we obtain
with slightly different choices of constants given by \(\tilde{c}_{2} = 96 c^{2}\kappa _{t+1}^{2}(4+c^{2} + c_{t})c_{t}\) and \(\tilde{q}_{2} = 4q+2\mathfrak{q}_{t+1}+\max (2q,q_{t})\). Thus the same argument as before yields the existence of \(\omega \in \Omega \) such that \(\gamma _{\bar{\varepsilon},d,t}\), the realisation of \(\Gamma _{\bar{\varepsilon},d,t}\) at \(\omega \), satisfies
and (4.36) holds. In Step 7, the modified growth bound (3.14) and the modified choice \(N =\lceil \bar{\varepsilon}^{-2 \mathfrak{r}_{t+1}-2-2\theta} \rceil \) lead to an additional factor \(\bar{\varepsilon}^{-3\theta}\) in (4.38) and to a modified choice \(\tilde{r}_{3} = 3\mathfrak{r}_{t+1}+2+3\theta \). Similarly, in Step 8, the modified choice of \(N\) leads to an additional factor \(\bar{\varepsilon}^{-2\theta}\) and to a modified choice \(\tilde{r}_{4} = 2 \mathfrak{r}_{t+1}+2+\max (\alpha ,\mathfrak{r}_{t+1})+2 \theta \). Next, for the Lipschitz constant of the constructed network, i.e., Step 9, we need to verify (4.63). To do this, we note that the induction hypothesis (4.64) and the Lipschitz property (3.12) of \(\eta _{\bar{\varepsilon},d,t}\) imply for all \(x,y \in \mathbb{R}^{d}\) that
Thus we obtain
Step 10 requires a different choice of \(\delta \) and otherwise only minor modifications. We choose \(\delta = \bar{\varepsilon}^{\frac{1}{2}(\min (1,\beta m-\theta ) - \zeta (T-1))} \). Then in Step 10.b), the new bound (4.67) leads to a slightly different bound than in (4.51) and (4.52). We obtain
with slightly modified constant \(\tilde{c}_{5} = 4 \hat{c}_{t+1} (1+ 2\tilde{c}_{1} (1+ c_{t}^{ \frac{1}{4}})) (8(1+c_{t})c^{2} + 36 \tilde{c}_{2} )^{\frac{1}{4}}\) and \(\tilde{q}_{5}\) as before. In Step 10.c), (4.67) leads to analogous modifications in (4.54) and (4.55), yielding
with \(\tilde{c}_{6} = 2 \hat{c}_{t+1} (1+ 2\tilde{c}_{1} (1+c_{t}^{ \frac{1}{4}})) ((8(1+c_{t})c^{2} )^{\frac{1}{4}}+(36 \tilde{c}_{2} )^{ \frac{1}{4}}) + 2 (9 \tilde{c}_{2} )^{1/2} + \frac{7}{2}\) and \(\tilde{q}_{6}\) as before. Combining (4.67)–(4.69) and (4.56) thus gives for Step 10.d) the bound
with \(\tilde{c}_{7} = (2(1+c_{t}))^{\frac{1}{2}} c + 1 + \tilde{c}_{5} + \tilde{c}_{6} + (9 \tilde{c}_{2})^{\frac{1}{2}} \) and \(\tilde{q}_{7}\) as before. Choose now
and note that \(\bar{\varepsilon} \in (0,1)\) because \(\tilde{c}_{7}>1\) and \(\frac{\min (1,\beta m-\theta )}{T-1} >\zeta \). By inserting this choice of \(\bar{\varepsilon}\) in the bounds for the growth, size and Lipschitz constants of \(\psi _{\varepsilon ,d,t}\), we may then appropriately choose \(\kappa _{t}\), \(\mathfrak{q}_{t}\), \(\mathfrak{r}_{t}\) (analogously to (4.59)–(4.61)) and complete the proof. □
References
Andersen, L., Broadie, M.: Primal–dual simulation algorithm for pricing multidimensional American options. Manag. Sci. 50, 1222–1234 (2004)
Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993)
Bayer, C., Belomestny, D., Hager, P., Pigato, P., Schoenmakers, J.: Randomized optimal stopping algorithms and their convergence analysis. SIAM J. Financ. Math. 12, 1201–1225 (2021)
Bayer, C., Hager, P.P., Riedel, S., Schoenmakers, J.: Optimal stopping with signatures. Ann. Appl. Probab. 33, 238–273 (2023)
Beck, C., Hutzenthaler, M., Jentzen, A., Kuckuck, B.: An overview on deep learning-based approximation methods for partial differential equations. Discrete Contin. Dyn. Syst., Ser. B 28, 3697–3746 (2023)
Becker, S., Cheridito, P., Jentzen, A.: Deep optimal stopping. J. Mach. Learn. Res. 20, Paper No. 74, 1–25 (2019)
Becker, S., Cheridito, P., Jentzen, A.: Pricing and hedging American-style options with deep learning. J. Risk Financ. Manag. 13, Paper No. 158, 1–12 (2020)
Becker, S., Cheridito, P., Jentzen, A., Welti, T.: Solving high-dimensional optimal stopping problems using deep learning. Eur. J. Appl. Math. 32, 470–514 (2021)
Belomestny, D.: On the rates of convergence of simulation-based optimization algorithms for optimal stopping problems. Ann. Appl. Probab. 21, 215–239 (2011)
Belomestny, D., Bender, C., Schoenmakers, J.: True upper bounds for Bermudan products via non-nested Monte Carlo. Math. Finance 19, 53–71 (2009)
Berner, J., Grohs, P., Jentzen, A.: Analysis of the generalization error: empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM J. Math. Data Sci. 2, 631–657 (2020)
Bouchard, B., Warin, X.: Monte-Carlo valuation of American options: facts and new algorithms to improve existing methods. In: Carmona, R.A., et al. (eds.) Numerical Methods in Finance, pp. 215–255. Springer, Berlin (2012)
Broadie, M., Glasserman, P.: A stochastic mesh method for pricing high-dimensional American options. J. Comput. Finance 7(4), 35–72 (2004)
Buehler, H., Gonon, L., Teichmann, J., Wood, B.: Deep hedging. Quant. Finance 19, 1271–1291 (2019)
Cioica-Licht, P.A., Hutzenthaler, M., Werner, P.T.: Deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear partial differential equations. Preprint (2022). Available online at https://arxiv.org/abs/2205.14398
Clément, E., Lamberton, D., Protter, P.: An analysis of a least squares regression method for American option pricing. Finance Stoch. 6, 449–471 (2002)
Cuchiero, C., Khosrawi, W., Teichmann, J.: A generative adversarial network approach to calibration of local stochastic volatility models. Risks 8, Paper No. 101, 1–31 (2020)
Ech-Chafiq, Z.E.F., Labordère, P.H., Lelong, J.: Pricing Bermudan options using regression trees/random forests. SIAM J. Financ. Math. 14, 1113–1139 (2023)
Elbrächter, D., Grohs, P., Jentzen, A., Schwab, C.: DNN expression rate analysis of high-dimensional PDEs: application to option pricing. Constr. Approx. 55, 3–71 (2022)
Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time, 4th revised edn. de Gruyter, Berlin (2016)
Garcia, D.: Convergence and biases of Monte Carlo estimates of American option prices using a parametric exercise rule. J. Econ. Dyn. Control 27, 1855–1879 (2003)
Germain, M., Pham, H., Warin, X.: Neural networks-based algorithms for stochastic control and PDEs in finance. In: Capponi, A., Lehalle, C.A. (eds.) Machine Learning and Data Sciences for Financial Markets: A Guide to Contemporary Practices, pp. 426–452. Cambridge University Press, Cambridge (2023)
Gonon, L.: Random feature neural networks learn Black–Scholes type PDEs without curse of dimensionality. J. Mach. Learn. Res. 24, Paper No. 189, 1–51 (2023)
Gonon, L., Grohs, P., Jentzen, A., Kofler, D., Šiška, D.: Uniform error estimates for artificial neural network approximations for heat equations. IMA J. Numer. Anal. 42, 1991–2054 (2021)
Gonon, L., Schwab, C.: Deep ReLU network expression rates for option prices in high-dimensional, exponential Lévy models. Finance Stoch. 25, 615–657 (2021)
Gonon, L., Schwab, C.: Deep ReLU neural networks overcome the curse of dimensionality for partial integrodifferential equations. Anal. Appl. (Singap.) 21, 1–47 (2023)
Grohs, P., Herrmann, L.: Deep neural network approximation for high-dimensional parabolic Hamilton–Jacobi–Bellman equations. Preprint (2021). Available online at https://arxiv.org/abs/2103.05744
Grohs, P., Hornung, F., Jentzen, A., von Wurstemberger, P.: A Proof That Artificial Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Black–Scholes Partial Differential Equations. Am. Math. Soc., Providence (2023)
Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in \(W^{s,p}\) norms. Anal. Appl. (Singap.) 18, 803–859 (2020)
Gühring, I., Raslan, M., Kutyniok, G.: Expressivity of deep neural networks. In: Grohs, P., Kutyniok, G. (eds.) Mathematical Aspects of Deep Learning, pp. 149–199. Cambridge University Press, Cambridge (2023)
Haugh, M.B., Kogan, L.: Pricing American options: a duality approach. Oper. Res. 52, 258–270 (2004)
Herrera, C., Krach, F., Ruyssen, P., Teichmann, J.: Optimal stopping via randomized neural networks. Front. Math. Finance 3, 31–77 (2024)
Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A.: A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. Part. Differ. Equ. Appl. 1, Paper No. 10, 1–34 (2020)
Jain, S., Oosterlee, C.W.: The stochastic grid bundling method: efficient pricing of Bermudan options and their Greeks. Appl. Math. Comput. 269, 412–431 (2015)
Jentzen, A., Salimova, D., Welti, T.: A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients. Commun. Math. Sci. 19, 1167–1205 (2021)
Kohler, M., Krzyżak, A., Todorovic, N.: Pricing of high-dimensional American options by neural networks. Math. Finance 20, 383–410 (2010)
Lapeyre, B., Lelong, J.: Neural network regression for Bermudan option pricing. Monte Carlo Methods Appl. 27, 227–247 (2021)
Longstaff, F.A., Schwartz, E.S.: Valuing American options by simulation: a simple least-squares approach. Rev. Financ. Stud. 14, 113–147 (2001)
Opschoor, J.A.A., Petersen, P.C., Schwab, C.: Deep ReLU networks and high-order finite element methods. Anal. Appl. (Singap.) 18, 715–770 (2020)
Petersen, P., Voigtlaender, F.: Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108, 296–330 (2018)
Reisinger, C., Zhang, Y.: Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiff systems. Anal. Appl. (Singap.) 18, 951–999 (2020)
Reppen, A.M., Soner, H.M., Tissot-Daguette, V.: Neural optimal stopping boundary. Preprint (2022). Available online at https://arxiv.org/abs/2205.04595
Rogers, L.C.G.: Monte Carlo valuation of American options. Math. Finance 12, 271–286 (2002)
Ruf, J., Wang, W.: Neural networks for option pricing and hedging: a literature review. J. Comput. Finance 24(1), 1–46 (2020)
Sato, K.I.: Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press, Cambridge (1999)
Shaham, U., Cloninger, A., Coifman, R.R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44, 537–557 (2018)
Sirignano, J., Spiliopoulos, K.: DGM: a deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 1339–1364 (2018)
Takahashi, A., Yamada, T.: Solving Kolmogorov PDEs without the curse of dimensionality via deep learning and asymptotic expansion with Malliavin calculus. Part. Differ. Equ. Appl. 4, Paper No. 27, 1–31 (2023)
Tsitsiklis, J., Van Roy, B.: Regression methods for pricing complex American-style options. IEEE Trans. Neural Netw. Learn. Syst. 12, 694–703 (2001)
Wang, S., Perdikaris, P.: Deep learning of free boundary and Stefan problems. J. Comput. Phys. 428, Paper No. 109914, 1–24 (2021)
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)
Ethics declarations
Competing Interests
The author declares no competing interests.
Gonon, L. Deep neural network expressivity for optimal stopping problems. Finance Stoch 28, 865–910 (2024). https://doi.org/10.1007/s00780-024-00538-0
Keywords
- Deep neural network
- Optimal stopping problem
- Markov process
- Expression rate
- Approximation error bound
- Curse of dimensionality