Abstract
This article studies deep neural network expression rates for optimal stopping problems of discrete-time Markov processes on high-dimensional state spaces. A general framework is established in which the value function and continuation value of an optimal stopping problem can be approximated with error at most \(\varepsilon \) by a deep ReLU neural network of size at most \(\kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\). The constants \(\kappa ,\mathfrak{q},\mathfrak{r} \geq 0\) do not depend on the dimension \(d\) of the state space or the approximation accuracy \(\varepsilon \). This proves that deep neural networks do not suffer from the curse of dimensionality when employed to approximate solutions of optimal stopping problems. The framework covers for example exponential Lévy models, discrete diffusion processes and their running minima and maxima. These results mathematically justify the use of deep neural networks for numerically solving optimal stopping problems and pricing American options in high dimensions.
1 Introduction
In the past few years, neural-network-based methods have become ubiquitous across science, technology, economics and finance. In particular, such methods have been applied to various problems in mathematical finance such as pricing, hedging and calibration. We refer for instance to the articles Buehler et al. [14], Becker et al. [7, 8], Cuchiero et al. [17] and to the survey papers Ruf and Wang [44], Germain et al. [22], Beck et al. [5] for an overview and further references. The striking computational performance of these methods has also raised questions regarding their theoretical foundations. Towards a complete theoretical understanding, there have been recent results in the literature which prove that deep neural networks are able to approximate option prices in various models without the curse of dimensionality. For deep neural network expressivity results for option prices and associated PDEs, we refer for instance to Elbrächter et al. [19], Grohs et al. [28] for European options in Black–Scholes models, to Hutzenthaler et al. [33], Cioica-Licht et al. [15] for certain semilinear PDEs, to Grohs and Herrmann [27] for certain Hamilton–Jacobi–Bellman equations, to Reisinger and Zhang [41], Jentzen et al. [35], Takahashi and Yamada [48] for diffusion models and game-type options, and to Gonon and Schwab [26] for certain path-dependent options in jump–diffusion models. A few works are also concerned with generalisation errors (Berner et al. [11]) and learning errors (Gonon [23]).
The goal of this article is to analyse deep neural network expressivity for American option prices and general optimal stopping problems in discrete time. An optimal stopping problem consists in selecting a stopping time \(\tau \) such that the expected reward \(\mathbb{E}[g_{d}(\tau ,X_{\tau}^{d})]\) is maximised. Here \(X^{d}\) is a given stochastic process taking values in \(\mathbb{R}^{d}\) and \(g_{d}(t,x)\) is the reward obtained if the process is stopped at time \(t\) in state \(x\). Optimal stopping problems arise in a wide range of contexts in statistics, operations research, economics and finance. In mathematical finance, arbitrage-free prices of American and Bermudan options are given as solutions of optimal stopping problems. The solution of an optimal stopping problem can be described by the so-called Snell envelope or, equivalently, by a backward recursion (discrete time) or a free-boundary PDE (continuous time) in the case when \(X^{d}\) is a Markov process.
In recent years, a wide range of computational methods have been developed to numerically solve optimal stopping problems also in high-dimensional situations, i.e., when the dimension \(d\) of the state space is large. For regression-based algorithms, we refer e.g. to Tsitsiklis and Van Roy [49], Longstaff and Schwartz [38]; for duality-based methods, we refer e.g. to Rogers [43], Andersen and Broadie [1], Haugh and Kogan [31], Belomestny et al. [10]; for stochastic grid methods, we refer e.g. to Broadie and Glasserman [13], Jain and Oosterlee [34]; and for methods based on approximating the exercise boundary, we refer e.g. to Garcia [21]. See for instance also the overview in Bouchard and Warin [12]. Recently proposed methods include signature-based methods (Bayer et al. [4]) and regression trees (Ech-Chafiq et al. [18]). Furthermore, various methods based on deep neural network approximations of the value function, the continuation value or the exercise boundary of the optimal stopping problem have been proposed; see for instance Kohler et al. [36], Becker et al. [6, 7, 8], Herrera et al. [32], Lapeyre and Lelong [37], Reppen et al. [42], and for methods for continuous-time free boundary problems Sirignano and Spiliopoulos [47], Wang and Perdikaris [50]. For many of these methods, also theoretical convergence results or even convergence rates (cf. e.g. Clément et al. [16], Belomestny [9], Bayer et al. [3]) for a fixed dimension \(d\) have been established.
In this article, we are interested in mathematically analysing the high-dimensional situation, i.e., in explicitly controlling the dependence on the dimension \(d\). We analyse deep neural network approximations for the value function of optimal stopping problems. We provide general conditions on the reward function \(g_{d}\) and the stochastic process \(X^{d}\) which ensure that the value function (and the continuation value) of an optimal stopping problem can be approximated by deep ReLU neural networks without the curse of dimensionality, i.e., that an approximation error of size at most \(\varepsilon \) can be achieved by a deep ReLU neural network of size \(\kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) for constants \(\kappa ,\mathfrak{q},\mathfrak{r} \geq 0\) which do not depend on the dimension \(d\) or the accuracy \(\varepsilon \). The framework in particular provides deep neural network expressivity results for prices of American and Bermudan options. Our conditions cover most practically relevant payoffs and many popular models such as exponential Lévy models and discrete diffusion processes. The constants \(\kappa \), \(\mathfrak{q}\), \(\mathfrak{r}\) are explicit, and thus the obtained results yield bounds for the approximation error component in any algorithm for optimal stopping and American option pricing in high dimensions which is based on approximating the value function or the continuation value by deep neural networks.
The remainder of the paper is organised as follows. In Sect. 2, we formulate the optimal stopping problem, recall its solution by dynamic programming and introduce the notation for deep neural networks. In Sect. 3, we formulate the assumptions and main results. Specifically, in Sect. 3.1, we formulate a basic framework, Assumptions 3.1 and 3.4, and prove that the value function can be approximated by deep neural networks without the curse of dimensionality; see Theorem 3.6. In Sect. 3.3, we then provide more refined assumptions on the considered Markov processes and extend the approximation result to this refined framework; see Theorem 3.12 which is the main result of the article. In Sects. 3.4–3.6, we then apply this result to exponential Lévy models and discrete diffusion processes and show that also barrier options can be covered via the running maximum or minimum of such processes. In order to make the presentation more streamlined, most proofs, in particular the proofs of Theorems 3.6 and 3.12, are postponed to Sect. 4.
1.1 Notation
Throughout, we fix a time horizon \(T \in \mathbb{N}\) and a probability space \((\Omega ,\mathcal {F},\mathbb{P})\) on which all random variables and processes are defined. For \(d \in \mathbb{N}\), \(x \in \mathbb{R}^{d}\), \(A \in \mathbb{R}^{d \times d}\), we denote by \(|x|\) the Euclidean norm of \(x\) and by \(|A|_{F}\) the Frobenius norm of \(A\). For \(f \colon \mathbb{R}^{d_{0}} \times \mathbb{R}^{d_{1}} \to \mathbb{R}^{d_{2}}\), we let
2 Preliminaries
In this section, we first formulate the optimal stopping problem and recall its solution in terms of the value function. Then we introduce the required notation for deep neural networks.
2.1 The optimal stopping problem
For each \(d \in \mathbb{N}\), consider a function \(g_{d} \colon \{0,\ldots ,T\} \times \mathbb{R}^{d} \to \mathbb{R}\) and a discrete-time \(\mathbb{R}^{d}\)-valued Markov process \(X^{d}=(X_{t}^{d})_{t \in \{0,\ldots ,T\}}\). Assume for each \(t \in \{0,\ldots ,T\}\) that \(\mathbb{E}[|g_{d}(t,X_{t}^{d})|]< \infty \) and let \(\mathbb{F}=(\mathcal{F}_{t})_{t \in \{ {0,\ldots ,T}\}}\) be the filtration generated by \(X^{d}\). Denote by \(\mathcal{T}\) the set of \(\mathbb{F}\)-stopping times \(\tau \colon \Omega \to \{0,\ldots ,T\}\) and by \(\mathcal{T}_{t}\) the set of all \(\tau \in \mathcal{T}\) with \(\tau \geq t\). For notational simplicity, we omit the dependence on \(d\) in \(\mathbb{F}\), \(\mathcal{T}\) and \(\mathcal{T}_{t}\).
The optimal stopping problem consists in computing
\[ \sup _{\tau \in \mathcal{T}} \mathbb{E}\big[g_{d}(\tau ,X_{\tau}^{d})\big]. \qquad (2.1) \]
Consider the value function \(V_{d}\) defined by \(V_{d}(T,x) = g_{d}(T,x)\) and the backward recursion
\[ V_{d}(t,x) = \max \big( g_{d}(t,x), \mathbb{E}[V_{d}(t+1,X_{t+1}^{d}) \,|\, X_{t}^{d}=x] \big) \qquad (2.2) \]
for \(t=T-1,\ldots ,0\) and \(\mathbb{P}\circ (X_{t}^{d})^{-1}\)-a.e. \(x \in \mathbb{R}^{d}\). Then knowledge of \(V_{d}\) also allows computing a stopping time \(\tau ^{*} \in \mathcal{T}\) which is a maximiser in (2.1): the stopping time
\[ \tau ^{*} = \min \{ t \geq 0 \colon V_{d}(t,X_{t}^{d}) = g_{d}(t,X_{t}^{d}) \} \]
satisfies \(\mathbb{E}[g_{d}(\tau ^{*},X_{\tau ^{*}}^{d})] = \sup _{\tau \in \mathcal{T}} \mathbb{E}[g_{d}(\tau ,X^{d}_{\tau})]\). Indeed, by backward induction and the Markov property, we obtain that \(V_{d}(t,X^{d}_{t})=U^{d}_{t}\) ℙ-a.s., where \(U^{d}\) is the Snell envelope defined at \(T\) by \(U_{T}^{d} = g_{d}(T,X_{T}^{d})\) and for \(t=T-1,\ldots ,0\) by the backward recursion \(U_{t}^{d} = \max (g_{d}(t,X_{t}^{d}),\mathbb{E}[U_{t+1}^{d}|\mathcal {F}_{t}])\). Then for instance by Föllmer and Schied [20, Theorem 6.18], we have ℙ-a.s. for all \(t \in \{0,\ldots ,T\}\) that
\[ U_{t}^{d} = \mathbb{E}\big[ g_{d}(\tau _{\min}^{(t)},X_{\tau _{\min}^{(t)}}^{d}) \,\big|\, \mathcal{F}_{t} \big] = \operatorname*{ess\,sup}_{\tau \in \mathcal{T}_{t}} \mathbb{E}\big[ g_{d}(\tau ,X_{\tau}^{d}) \,\big|\, \mathcal{F}_{t} \big], \]
where \(\tau _{\min}^{(t)} = \min \{s \geq t \colon U_{s}^{d} = g_{d}(s,X_{s}^{d}) \}\). In particular, \(\tau _{\min}^{(0)}=\tau ^{*}\) is a maximiser of (2.1), and if \(X^{d}_{0}\) is constant, \(V_{d}(0,X_{0}^{d})\) is the optimal value in (2.1).
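The backward recursion for the value function can be made concrete in a toy setting where the conditional expectation is computable exactly: a finite-state Markov chain, for which \(\mathbb{E}[V_{d}(t+1,X_{t+1}^{d})\,|\,X_{t}^{d}=i]\) is a matrix-vector product. The following sketch is purely illustrative (the chain, payoff and horizon below are not from the text); it implements \(V(T,\cdot )=g(T,\cdot )\) and \(V(t,\cdot )=\max (g(t,\cdot ),PV(t+1,\cdot ))\).

```python
import numpy as np

def value_function(g, P, T):
    """Backward recursion V(T,.) = g(T,.) and
    V(t,.) = max(g(t,.), P @ V(t+1,.)) for t = T-1,...,0,
    for a finite-state Markov chain with one-step transition matrix P."""
    V = [None] * (T + 1)
    V[T] = g(T)
    for t in range(T - 1, -1, -1):
        continuation = P @ V[t + 1]        # E[V(t+1, X_{t+1}) | X_t = i]
        V[t] = np.maximum(g(t), continuation)
    return V

# Illustrative 3-state chain with a time-independent payoff g(t, i) = payoff[i].
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
payoff = np.array([0.0, 1.0, 2.0])
V = value_function(lambda t: payoff, P, T=2)
```

By construction, \(V(t,\cdot )\) dominates the payoff at every time, which is the discrete analogue of the Snell envelope property discussed above.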
The idea of our approach is as follows: In many situations, the function \(g_{d}\) is in fact a neural network or can be approximated well by a deep neural network. Then the recursion (2.2) also yields a recursion for a neural network approximation. This argument will be made precise in the proof of Theorem 3.6 below.
Remark 2.1
Alternatively, we could also define
Then under the strong Markov property, it holds that \(\mathbb{E}[g_{d}(\tau ,X_{\tau}^{d}) | \mathcal {F}_{t}] = h_{\tau}^{d}(X_{t}^{d})\) ℙ-a.s., for each \(\tau \in \mathcal{T}_{t}\), where \(h_{\tau}^{d}(x)=\mathbb{E}[g_{d}(\tau ,X_{\tau}^{d}) |X_{t}^{d}=x]\). The definition of the essential supremum then implies that for each \(\tau \in \mathcal{T}_{t}\), it holds ℙ-a.s. that \(h_{\tau _{ \min}^{(t)}}^{d}(X_{t}^{d}) \geq h_{\tau}^{d}(X_{t}^{d})\). This implies for \(\mathbb{P}\circ (X_{t}^{d})^{-1}\)-a.e. \(x \in \mathbb{R}^{d}\) and all \(\tau \in \mathcal{T}_{t}\) that \(h_{\tau _{\min}^{(t)}}^{d}(x) \geq h_{\tau}^{d}(x)\), hence \(h_{\tau _{\min}^{(t)}}^{d}(x) \geq \sup _{\tau \in \mathcal{T}_{t}} h_{ \tau}^{d}(x)\) for each such \(x\). Combining this and [20, Theorem 6.18] yields that ℙ-a.s.,
By the definition of the Snell envelope, this then yields the recursion (2.2) for the value function.
Remark 2.2
The conditional expectation in (2.3) is defined in terms of the transition kernels \(\mu ^{d}_{s,t}\), \(0\leq s < t \leq T\), of the Markov process \(X^{d}\). In fact, we formally start with transition kernels \(\mu ^{d}\) on \(\mathbb{R}^{d}\) from which we then construct a family of probability measures \(\mathbb{P}_{x}\) on the canonical path space \(((\mathbb{R}^{d})^{T+1},\mathcal{B}((\mathbb{R}^{d})^{T+1}))\) such that under \(\mathbb{P}_{x}\), the coordinate process is a Markov process starting at \(x\) and with transition kernels \(\mu ^{d}\).
2.2 Deep neural networks
In this article, we consider neural networks with the ReLU (rectified linear unit) activation function \(\varrho \colon \mathbb{R}\to \mathbb{R}\) given by \(\varrho (x)=\max (x,0)\) for \(x \in \mathbb{R}\). For each \(d \in \mathbb{N}\), we also denote by \(\varrho \colon \mathbb{R}^{d} \to \mathbb{R}^{d} \) the componentwise application of the ReLU activation function. Let \(L,d \in \mathbb{N}\), \(N_{0}:=d\), \(N_{1},\ldots ,N_{L} \in \mathbb{N}\) and \(A^{\ell} \in \mathbb{R}^{N_{\ell }\times N_{\ell -1}}\), \(b^{\ell }\in \mathbb{R}^{N_{\ell}}\) for \(\ell = 1,\ldots ,L\). A deep neural network with \(L\) layers, \(d\)-dimensional input, activation function \(\varrho \) and parameters \(((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\) is the function \(\phi \colon \mathbb{R}^{d} \to \mathbb{R}^{N_{L}}\) given by
\[ \phi (x) = \mathcal{W}_{L}\big( \varrho ( \mathcal{W}_{L-1}( \cdots \varrho (\mathcal{W}_{1}(x)) \cdots ) ) \big), \qquad x \in \mathbb{R}^{d}, \qquad (2.4) \]
where \(\mathcal{W}_{\ell }\colon \mathbb{R}^{N_{\ell -1}} \to \mathbb{R}^{N_{\ell}}\) denotes the (affine) function \(\mathcal{W}_{\ell}(z) = A^{\ell }z + b^{\ell}\) for \(z \in \mathbb{R}^{N_{\ell -1}}\) and \(\ell = 1,\ldots ,L\). We let
\[ \mathrm{size}(\phi ) = \sum _{\ell =1}^{L} \big( \#\{(i,j) \colon A^{\ell}_{i,j} \neq 0\} + \#\{i \colon b^{\ell}_{i} \neq 0\} \big) \]
denote the total number of non-zero entries in the parameter matrices and vectors of the neural network.
In most cases, the number of layers, the activation function and the parameters of the network are not mentioned explicitly and we simply speak of a deep neural network \(\phi \colon \mathbb{R}^{d} \to \mathbb{R}^{N_{L}}\). We say that a function \(f \colon \mathbb{R}^{d} \to \mathbb{R}^{m}\) can be realised by a deep neural network if there exists a deep neural network \(\phi \colon \mathbb{R}^{d} \to \mathbb{R}^{m}\) such that \(f(x) = \phi (x)\) for all \(x \in \mathbb{R}^{d}\). In the literature, a deep neural network is often defined as the collection of parameters \(\Phi = ((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\), and \(\phi \) in (2.4) is then called the realisation of \(\Phi \); see for instance Petersen and Voigtlaender [40], Opschoor et al. [39], Gonon and Schwab [25]. In order to simplify the notation, we do not distinguish between the neural network realisation and its parameters here, since the parameters are always (at least implicitly) part of the definition. Note that in general, a function \(f\) may admit several different realisations by deep neural networks, i.e., several different choices of parameters may result in the same realisation. However, in the present article, this is not an issue because pathological cases are excluded by bounds on the size of the networks.
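The distinction between a network's parameters \(\Phi = ((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\), its realisation \(\phi \), and its size can be sketched in a few lines. The particular two-layer network below (realising the identity on ℝ via \(x = \varrho (x) - \varrho (-x)\)) is an illustration chosen for this sketch, not an object from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def realise(params, x):
    """Realisation phi(x) = W_L( rho( W_{L-1}( ... rho(W_1(x)) ... ) ) )
    for params = [(A_1, b_1), ..., (A_L, b_L)], with affine maps
    W_l(z) = A_l @ z + b_l and ReLU applied after all but the last layer."""
    *hidden, (A_L, b_L) = params
    z = np.asarray(x, dtype=float)
    for A, b in hidden:
        z = relu(A @ z + b)
    return A_L @ z + b_L

def size(params):
    """Total number of non-zero entries in all parameter matrices/vectors."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in params)

# A 2-layer ReLU network realising f(x) = relu(x) - relu(-x) = x:
params = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
          (np.array([[1.0, -1.0]]), np.zeros(1))]
```

Note that the same realisation (here, the identity) could be obtained from many other parameter choices; the size functional counts non-zero parameters of the particular choice, which is why bounds on the size refer to a specific parametrisation.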
3 DNN approximations for optimal stopping problems
This section contains the main results of the article, the deep neural network approximation results for optimal stopping problems. We start by formulating in Assumption 3.1 a general Markovian framework. In Assumption 3.4, we introduce the hypotheses on the reward function. We then formulate in Theorem 3.6 the approximation result for this basic framework. Subsequently, we provide a more refined framework, see Assumption 3.9 below, and prove the main result of the article in Theorem 3.12. This proves that the value function can be approximated by deep neural networks without the curse of dimensionality. Corollary 3.13 shows that an analogous approximation result also holds for the continuation value. Subsequently, in Sects. 3.4–3.6, we specialise the result to exponential Lévy models and discrete diffusion processes and show that also barrier options can be covered by including the running maximum or minimum.
3.1 Basic framework
Let \(p \geq 0\) be a fixed rate of growth. For instance, in financial applications, typically \(p=1\). We start by formulating in Assumption 3.1 a collection of hypotheses on the Markov process \(X^{d}\). These hypotheses will be weakened later in Assumption 3.9.
Assumption 3.1
(i) For each \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), there exist a measurable function \(f^{d}_{t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) and a random vector \(Y^{d}_{t}\) such that
\[ X_{t+1}^{d} = f^{d}_{t}(X_{t}^{d},Y_{t}^{d}) \qquad \mathbb{P}\text{-a.s.} \qquad (3.1) \]
(ii) For each \(d \in \mathbb{N}\), the random vectors \(X_{0}^{d},Y^{d}_{0},\ldots ,Y^{d}_{T-1}\) are independent and \(\mathbb{E}[|X_{0}^{d}|]<\infty \).
(iii) There exist constants \(c>0\), \(q\geq 0\), \(\alpha \geq 0\) such that for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), there exists a neural network \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) with
(iv) There exist constants \(c>0\), \(q\geq 0\) such that for all \(d \in \mathbb{N}\) and for all \(t \in \{0,\ldots ,T-1\}\), we have \(|f_{t}^{d}(0,0)|\leq c d^{q}\) and \(\mathbb{E}[|Y_{t}^{d}|^{2\max{(2,p)}}] \leq c d^{q}\).
Assumption 3.1 (i) requires a recursive updating of the Markov process \(X^{d}\) according to update functions \(f^{d}_{t}\) and a noise process \(Y^{d}\). Assumption 3.1 (ii) asks that the noise random variables and the initial condition are independent. Assumption 3.1 (iii) imposes that the updating functions \(f^{d}_{t}\) can be approximated well by deep neural networks. Finally, Assumption 3.1 (iv) requires that certain moments of the noise random variables and the “constant parts” of the update functions exhibit at most polynomial growth.
Remark 3.2
In Assumption 3.1 (iii), (iv), we could also put different constants \(c\) and \(q\) in each of the hypotheses. But then Assumption 3.1 (iii), (iv) still hold with \(c\) and \(q\) chosen as the respective maximum, and so for notational simplicity, we choose to directly work with the same constants for all these hypotheses.
Remark 3.3
For \(s\geq t\), consider a function \(\bar{g}_{d,s} \colon \mathbb{R}^{d} \to \mathbb{R}\) with \(\mathbb{E}[|\bar{g}_{d,s}(X_{s}^{d})|]< \infty \). Then under Assumption 3.1 (i), (ii),
for \(\mathbb{P}\circ (X_{t}^{d})^{-1}\)-a.e. \(x \in \mathbb{R}^{d}\). But the right-hand side of (3.5) is defined for any \(x \in \mathbb{R}^{d}\) for which the expectation is finite, and so in what follows, we also consider the conditional expectation \(\mathbb{E}[\bar{g}_{d,s}(X_{s}^{d})|X_{t}^{d}=x]\) to be defined for all such \(x \in \mathbb{R}^{d}\) (by (3.5)). Note that also
and so by backward induction, this reasoning allows us to define, within our framework, the value function \(V_{d}(t,\,\cdot \,)\) on all of \(\mathbb{R}^{d}\) for each \(t\).
Next, we formulate a collection of hypotheses on the reward (or payoff) function \(g_{d}\).
Assumption 3.4
There exist constants \(c>0\), \(q\geq 0\), \(\alpha \geq 0\) such that
(i) for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), there exists a (deep) neural network \(\phi _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \to \mathbb{R}\) with
(ii) for all \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), it holds that \(|g_{d}(t,0)|\leq c d^{q}\).
Assumption 3.4 (i) means that \(g_{d}(t,\,\cdot \,) \colon \mathbb{R}^{d} \to \mathbb{R}\) can be approximated well by neural networks for any \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\). Assumption 3.4 (ii) imposes that the “constant part” of the payoff grows at most polynomially in \(d\). Lemma 4.7 below shows that the framework indeed ensures that \(\mathbb{E}[|g_{d}(t,X_{t}^{d})|]< \infty \), as required in Sect. 2.1.
Example 3.5
Assumption 3.4 is satisfied in typical applications from mathematical finance. For instance, fix a strike price \(K>0\), an interest rate \(r\geq 0\) and consider the payoff of a max-call option, \(g_{d}(t,x) = e^{-r t}(\max _{i=1,\ldots ,d} x_{i} - K)^{+}\). Then (see e.g. Grohs et al. [28, Lemma 4.12]) for each \(t\), the payoff \(g_{d}(t,\,\cdot \,)\) can be realised exactly by a neural network \(\phi _{d,t}\) with \(\mathrm{size}(\phi _{d,t}) \leq 6d^{3}\). In addition, \(\mathrm{Lip}(g_{d}(t,\,\cdot \,))=1\) and therefore, setting \(\phi _{\varepsilon ,d,t} = \phi _{d,t}\) for all \(\varepsilon \in (0,1]\), we get that Assumption 3.4 is satisfied with \(c=6\), \(\alpha =0\), \(q=3\). Further examples include basket call options, basket put options, call-on-min options and, by similar techniques, also put-on-min options, put-on-max options and many related payoffs.
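The exact-realisation claim for the max-call payoff rests on the ReLU identity \(\max (a,b) = b + \varrho (a-b)\), applied in a pairwise reduction over the \(d\) coordinates. The following numerical sketch checks this construction (the values of \(d\), \(K\) and \(r\) below are illustrative).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_max(v):
    """Compute max(v_1, ..., v_d) using only affine operations and ReLU,
    via pairwise reductions max(a, b) = b + relu(a - b)."""
    v = list(v)
    while len(v) > 1:
        paired = [v[i + 1] + relu(v[i] - v[i + 1])
                  for i in range(0, len(v) - 1, 2)]
        if len(v) % 2 == 1:          # odd length: carry the last entry over
            paired.append(v[-1])
        v = paired
    return v[0]

def max_call(t, x, K=100.0, r=0.05):
    """Discounted max-call payoff e^{-rt} (max_i x_i - K)^+ built from ReLUs."""
    return np.exp(-r * t) * relu(relu_max(x) - K)

x = [90.0, 120.0, 105.0]
```

Since each pairwise maximum halves the number of entries, the reduction uses \(O(\log d)\) ReLU layers and \(O(d)\) units per layer, consistent with a network size polynomial in \(d\).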
We now state the main deep neural network approximation result under the assumptions introduced above.
Theorem 3.6
Suppose Assumptions 3.1 and 3.4 hold. Let \(c >0\), \(q \geq 0\) and assume for all \(d \in \mathbb{N}\) that \(\rho ^{d}\) is a probability measure on \(\mathbb{R}^{d}\) with \(\int _{\mathbb{R}^{d}} |x|^{2\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\). Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), the number of neural network weights grows at most polynomially and the approximation error between the neural network \(\psi _{\varepsilon ,d,t}\) and the value function is at most \(\varepsilon \), that is, \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) and
The proof of Theorem 3.6 is given in Sect. 4.4 below. For the reader’s convenience, we also provide a sketch of the proof in Sect. 3.2 below.
Theorem 3.6 shows that under Assumptions 3.1 and 3.4, the value function \(V_{d}\) can be approximated by deep neural networks without the curse of dimensionality: An approximation error of at most \(\varepsilon \) can be achieved by a deep neural network whose size is at most polynomial in \(\varepsilon ^{-1}\) and \(d\). The approximation error in Theorem 3.6 is measured in the \(L^{2}(\rho ^{d})\)-norm.
Remark 3.7
Theorem 3.6 holds for any fixed time horizon \(T\in \mathbb{N}\). The constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) are explicitly chosen in the proof of Theorem 3.6, and so also their dependence on \(T\) can be examined. Due to the recursive procedure used in the proof, the constants \(\kappa \), \(\mathfrak{q}\), \(\mathfrak{r}\) chosen in the proof of Theorem 3.6 in general tend to \(\infty \) as \(T \to \infty \). In special situations, this may be partially avoided (e.g. if \(q=0\)); see also Remark 3.11 below. However, these observations indicate that in order to address deep neural network expressivity for continuous-time optimal stopping problems, one cannot directly combine the recursive procedure employed here with a limiting argument letting \(T\to \infty \).
Theorem 3.6 can also be used to deduce further properties of \(V_{d}\). In the basic framework, we obtain for instance the following corollary, which shows that under Assumptions 3.1 and 3.4, the value function satisfies for each \(t\) a certain average Lipschitz property with a constant growing at most polynomially in \(d\).
Corollary 3.8
Suppose Assumptions 3.1 and 3.4 are satisfied. Let \(\nu _{0}^{d}\) be the standard Gaussian measure on \(\mathbb{R}^{d}\). Then for any \(R>0\), there exist constants \(\kappa ,\mathfrak{q} \in [0,\infty )\) such that for any \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\) and \(h \in [-R,R]^{d}\), the value function satisfies
The proof of Corollary 3.8 is given at the end of Sect. 4.4.
3.2 Sketch of the proof of Theorem 3.6
In this section, we provide a brief sketch of the proof of Theorem 3.6. The proof proceeds by backward induction. This entails some subtleties regarding the probability measure \(\rho ^{d}\), which we do not discuss here. We refer to the proof below (see Sect. 4.4) for details. Here we rather provide an easy-to-follow overview.
The starting point is the backward recursion (2.2). Our goal is to provide a neural network approximation of the right-hand side in (2.2). At time \(t\), we first aim to derive a bound on the \(L^{2}(\rho ^{d})\)-approximation error \(E^{d}_{t}\) between the continuation value \(\mathbb{E}[V_{d}(t+1,X_{t+1}^{d})|X_{t}^{d}=x]\) and the random function \(\Gamma _{\varepsilon ,d,t}\) defined for \(x \in \mathbb{R}^{d}\) by \(\Gamma _{\varepsilon ,d,t}(x) = \frac{1}{N}\sum _{i=1}^{N} \hat{v}_{ \varepsilon ,d,t+1}(\eta _{\varepsilon ,d,t}(x,Y^{d,i}_{t}))\), where \(\hat{v}_{\varepsilon ,d,t+1}\) is a neural network approximating the value function at time \(t+1\) and \(Y^{d,1}_{t}, \ldots , Y^{d,N}_{t}\) are i.i.d. copies of \(Y^{d}_{t}\). The existence and suitable properties of \(\hat{v}_{\varepsilon ,d,t+1}\) follow from the induction hypothesis. We derive a bound on \(\mathbb{E}[E^{d}_{t}]\) which we can then use to obtain existence of a realisation \(\gamma _{\varepsilon ,d,t}\) of \(\Gamma _{\varepsilon ,d,t}\) satisfying a slightly worse bound and such that the realisation of \(\max _{i=1,\ldots ,N} |Y^{d,i}_{t}|\) can also be bounded suitably. This last point is necessary to control the growth of \(\gamma _{\varepsilon ,d,t}(x)\). Then \(\gamma _{\varepsilon ,d,t}(x)\) is an approximation of the continuation value, and so we naturally define the approximate value function at time \(t\) by
for a suitably chosen \(\delta \) (depending on \(\varepsilon \)). We then consider the continuation region
and the approximate continuation region
Then we may decompose
The \(L^{2}(\rho ^{d})\)-error of the last term has already been analysed, and it remains to treat the remaining terms. The first term is small due to Assumption 3.4. The second and third terms need not be small individually, but we are able to show that the probabilities \(\rho ^{d}(C_{t} \cap \hat{C}_{t}^{c})\) and \(\rho ^{d}(C_{t}^{c} \cap \hat{C}_{t})\) are small. Hence the overall \(L^{2}(\rho ^{d})\)-error can be controlled. The proof is then completed by showing that the neural network (3.10) satisfies the growth, size and Lipschitz properties required to carry out the induction argument.
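The Monte Carlo construction of \(\Gamma _{\varepsilon ,d,t}\) in the sketch above can be written down directly. In the following sketch, the update function, the noise distribution and the time-\((t+1)\) approximation \(\hat{v}\) are placeholders chosen so that the exact continuation value is known in closed form; they stand in for the objects \(\eta _{\varepsilon ,d,t}\), \(Y^{d}_{t}\) and \(\hat{v}_{\varepsilon ,d,t+1}\) of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma(x, v_hat, eta, noise_samples):
    """Monte Carlo approximation Gamma(x) = (1/N) sum_i v_hat(eta(x, Y^i))
    of the continuation value E[ v_hat(X_{t+1}) | X_t = x ], where the
    Markov update is X_{t+1} = eta(X_t, Y_t); vectorised over the samples."""
    return np.mean(v_hat(eta(x, noise_samples)))

# Placeholder dynamics X_{t+1} = x + Y with Y ~ N(0,1) and v_hat(x) = x^2,
# so the exact continuation value at x is x^2 + 1 and the Monte Carlo
# error can be checked directly.
eta = lambda x, y: x + y
v_hat = lambda x: x ** 2
approx = gamma(1.0, v_hat, eta, rng.standard_normal(200_000))
```

In the proof, \(\Gamma _{\varepsilon ,d,t}\) is a random function of the sample \(Y^{d,1}_{t},\ldots ,Y^{d,N}_{t}\); the argument then fixes a realisation \(\gamma _{\varepsilon ,d,t}\) for which both the error bound and a growth bound hold simultaneously.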
3.3 Refined framework
We now introduce a refined framework in which the approximation hypothesis (3.2) and the Lipschitz condition (3.4) in Assumption 3.1 (iii) are weakened; see (3.11) and (3.12) below. Due to these weaker hypotheses, we need to introduce potentially stronger moment assumptions on the noise variables \(Y^{d}_{t}\). Note that the additional growth conditions (3.13) and (3.14) are satisfied automatically under Assumption 3.1 (see Lemma 4.5 and Remark 3.10 below).
Assumption 3.9
Assume that (i), (ii) and (iv) in Assumption 3.1 are satisfied. Furthermore, assume that there exist constants \(c>0\), \(h >0\), \(q,\bar{q} \geq 0\), \(\alpha \geq 0\), \(\beta >0\), \(m \in \mathbb{N}\), \(\theta \geq 0\) and \(\zeta \geq 0\) such that for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), there exists a neural network \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) with
and for all \(x,y \in \mathbb{R}^{d}\),
and \(\mathbb{E}[|Y_{t}^{d}|^{2m\max{(2,p)}}] \leq c d^{q}\).
Remark 3.10
A sufficient condition for (3.13) is that there exist \(\tilde{c}>0\) and \(\tilde{q}\geq 0\) such that \(\mathbb{E}[|Y_{t}^{d}|^{2m\max{(2,p)}}] \leq \tilde{c} d^{\tilde{q}}\) and \(|f_{t}^{d}(x,y)| \leq \tilde{c} d^{\tilde{q}} (1+|x|+|y|)\) for all \(d\in \mathbb{N}\), \(x,y \in \mathbb{R}^{d}\). Then
Remark 3.11
While in many relevant applications, the number \(T\) of time steps is only moderate (e.g. around 10 in Becker et al. [6, Sects. 4.1 and 4.2]), it is also important to analyse the situation when \(T\) is large. To this end, we have introduced in Assumption 3.9 the constants \(h\) and \(\bar{q}\) instead of using the common upper bounds \(c\), \(q\). This makes it possible to get first insights about the situation in which \(T\) is large from the proofs in Sect. 4. Indeed, if \(h=1+\tilde{h}\) and \(\tilde{h}\) is sufficiently small (as is the case for instance in certain discretised diffusion models), then the constants in Lemmas 4.6 and 4.8 are also small for large \(T\).
Examples of processes that satisfy Assumption 3.9 are provided further below. These include in particular the Black–Scholes model, more general exponential Lévy processes and discrete diffusions.
We now state the main theorem of the article.
Theorem 3.12
Suppose Assumptions 3.9 and 3.4 hold. Let \(c >0\), \(q \geq 0\) and for each \(d \in \mathbb{N}\), let \(\rho ^{d}\) be a probability measure on \(\mathbb{R}^{d}\) with \(\int _{\mathbb{R}^{d}} |x|^{2m\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\). Furthermore, assume that \(\zeta < \frac{\min (1,\beta m-\theta )}{T-1}\), where \(m\), \(\beta \), \(\zeta \), \(\theta \) are the constants appearing in Assumption 3.9. Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), the number of neural network weights grows at most polynomially and the approximation error between the neural network \(\psi _{\varepsilon ,d,t}\) and the value function is at most \(\varepsilon \), that is, \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) and
The proof of Theorem 3.12 is given in Sect. 4.5 below.
Theorem 3.12 shows that for Markov processes satisfying Assumption 3.9 and for reward functions satisfying Assumption 3.4, the value function of the associated optimal stopping problem can be approximated by deep neural networks without the curse of dimensionality. In other words, an approximation error of at most \(\varepsilon \) can be achieved by a deep neural network whose size is at most polynomial in \(\varepsilon ^{-1}\) and \(d\).
The condition \(\zeta < \frac{\min (1,\beta m-\theta )}{T-1}\) in Theorem 3.12 can be viewed as a condition on \(m\), which needs to be sufficiently large. This means that sufficiently high moments of \(Y^{d}_{t}\) need to exist and grow only polynomially in \(d\).
A key step in the proof consists in constructing a deep neural network approximating the continuation value. Hence we immediately obtain the following corollary.
Corollary 3.13
Consider the setting of Theorem 3.12. Then for each \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), there exists a neural network \(\gamma _{\varepsilon ,d,t}\) with \(\mathrm{size}(\gamma _{\varepsilon ,d,t}) \leq \kappa d^{ \mathfrak{q}}\varepsilon ^{-\mathfrak{r}}\) and
Finally, using Theorem 3.12, one may also obtain a lower bound on the probability that the approximate stopping time coincides with the optimal stopping time. Denote by \(\bar{\rho}_{t}^{d}\) the law of \(X^{d}_{t}\) and suppose that \(\int _{\mathbb{R}^{d}} |x|^{2m\max (p,2)} \bar{\rho}_{t}^{d}(dx) \leq c d^{q}\) for each \(t\). For a given probability measure \(\rho ^{d}\) with respect to which we aim to measure the error (3.15), we now define \(\bar{\rho} = \frac{1}{T+1}(\rho ^{d}+\sum _{t=0}^{T-1} \bar{\rho}_{t}^{d})\). Applying Theorem 3.12 to \(\bar{\rho}\), we obtain that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), there exists a neural network \(\psi _{\varepsilon ,d,t}\) satisfying \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa d^{\mathfrak{q}} \varepsilon ^{-\mathfrak{r}}\) and
Now let \(\bar{\varepsilon}>0\), \(\delta >0\) be chosen as in the proof of Theorem 3.12 below and consider the optimal stopping time \(\tau ^{*}_{d}\) and its neural network approximation \(\hat{\tau}_{\varepsilon ,d}\) given by
Then we may estimate
where \(C_{t}\) is the continuation region and \(\hat{C}_{t}\) the approximate continuation region (see (4.43) and (4.44) in the proof below). A key step in the proof of Theorem 3.12 consists in deriving upper bounds on these probabilities, and recalling that \(\bar{\rho}\) is the probability measure to which we are applying Theorem 3.12, the bounds in the proof of Theorem 3.12 thus yield
In particular, choosing \(\varepsilon = \tilde{\varepsilon}(T+1)^{-1}\) for sufficiently small \(\tilde{\varepsilon}>0\), we thus obtain neural networks for which the approximation error (3.16) for the value function is smaller than \(\tilde{\varepsilon}\) and, simultaneously, the approximate stopping time equals the optimal stopping time with probability at least \(1-\tilde{\varepsilon}\).
3.4 Exponential Lévy models
In this subsection, we apply Theorem 3.12 to exponential Lévy models. Recall that an \(\mathbb{R}^{d}\)-valued stochastic process \(L^{d} = (L^{d}_{t})_{t \geq 0}\) is called a (\(d\)-dimensional) Lévy process if it is stochastically continuous, its sample paths are almost surely right-continuous with left limits, it has stationary and independent increments and \(\mathbb{P}[L^{d}_{0}=0]=1\). A Lévy process \(L^{d}\) is fully characterised by its Lévy triplet \((A^{d},\gamma ^{d}, \nu ^{d})\), where the matrix \(A^{d} \in \mathbb{R}^{d\times d}\) is symmetric and nonnegative definite, \(\gamma ^{d} \in \mathbb{R}^{d}\) and \(\nu ^{d}\) is a measure on \(\mathbb{R}^{d}\) describing the jump structure of \(L^{d}\). A stochastic process \(X^{d}\) is said to follow an exponential Lévy model if
for a \(d\)-dimensional Lévy process \(L^{d} = (L^{d}_{t})_{t \geq 0}\) and \(x^{d} \in \mathbb{R}^{d}\).
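To make the dynamics concrete, an exponential Lévy path can be sampled directly on the exercise grid. The following sketch restricts to the Black–Scholes special case \(\nu^{d}=0\), in which the increments of \(L^{d}\) are Gaussian with covariance determined by \(A^{d}\) and drift \(\gamma^{d}\); the function name and arguments are illustrative, not from the article.

```python
import numpy as np

def simulate_exp_levy_bs(x0, A, gamma, times, rng):
    """One path of X^d_t = x^d * exp(L^d_t) on a time grid, in the
    Black-Scholes special case nu^d = 0: L^d is a Brownian motion
    with covariance matrix A and drift gamma."""
    d = len(x0)
    L_chol = np.linalg.cholesky(A)              # A = L_chol @ L_chol.T
    X = [np.asarray(x0, dtype=float)]
    for dt in np.diff(times):
        # stationary, independent Gaussian increment of L^d
        dL = gamma * dt + np.sqrt(dt) * L_chol @ rng.standard_normal(d)
        X.append(X[-1] * np.exp(dL))            # componentwise, as in (3.17)
    return np.array(X)

rng = np.random.default_rng(0)
path = simulate_exp_levy_bs(np.ones(3), np.eye(3), np.zeros(3),
                            [0.0, 0.5, 1.0], rng)
```

The componentwise multiplicative update is exactly the structure \(X_{t+1,i}^{d} = X_{t,i}^{d}\exp (L_{t+1,i}^{d}-L_{t,i}^{d})\) exploited in Lemma 4.2.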
From Theorem 3.12, we now obtain the following deep neural network approximation result. This result includes the case of a Black–Scholes model (\(\nu ^{d}=0\)) as well as pure jump models (\(A^{d}_{i,j} = 0\)) with sufficiently integrable tails. In particular, Corollary 3.14 applies to prices of American/Bermudan basket put options and put-on-min or put-on-max options in such models (cf. Example 3.5 for payoffs that satisfy Assumption 3.4).
Corollary 3.14
Let \(X^{d}\) follow an exponential Lévy model with underlying triplet \((A^{d},\gamma ^{d},\nu ^{d})\) and assume the triplets are bounded in the dimension, that is, there exists \(B > 0\) such that for any \(d \in \mathbb{N}\), \(i,j=1,\ldots ,d\),
Suppose the payoff functions \(g_{d}\) satisfy Assumption 3.4. Let \(c >0\), \(q \geq 0\) and for each \(d \in \mathbb{N}\), let \(\rho ^{d}\) be a probability measure on \(\mathbb{R}^{d}\) with
Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), we have
Proof
This follows directly from Theorem 3.12 and Lemma 4.2 with the choices \(\zeta = \theta = \beta = \frac{1}{T}\), \(m=T+1\), which ensures that \(\zeta < \frac{1}{T-1} = \frac{\min (1,\beta m-\theta )}{T-1}\). □
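For completeness, the parameter choice in the proof can be checked directly:

```latex
% With \zeta=\theta=\beta=\tfrac{1}{T} and m=T+1,
\beta m-\theta \;=\; \frac{T+1}{T}-\frac{1}{T} \;=\; 1,
\qquad\text{so}\qquad
\frac{\min (1,\beta m-\theta )}{T-1} \;=\; \frac{1}{T-1} \;>\; \frac{1}{T} \;=\; \zeta .
```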
3.5 Discrete diffusion models
Let \(\mathfrak{t}_{T}>0\) and let \(X^{d}\) follow a discrete diffusion model with coefficient functions \(\mu ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d}\), \(\sigma ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d \times d}\), i.e., \(X^{d}\) satisfies \(X_{0}^{d}=x^{d}\) and
for some \(0 \leq \mathfrak{t}_{0}<\mathfrak{t}_{1}<\cdots <\mathfrak{t}_{T}\), \(x^{d} \in \mathbb{R}^{d}\) and a \(d\)-dimensional Brownian motion \(W^{d}\). Consider the following assumption on the drift and diffusion coefficients:
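The discrete diffusion recursion can be sketched as follows, with hypothetical drift and diffusion functions `mu` and `sigma` standing in for \(\mu^{d}\) and \(\sigma^{d}\):

```python
import numpy as np

def simulate_discrete_diffusion(x0, mu, sigma, times, rng):
    """One path of the recursion
    X_{t+1} = X_t + mu(t_t, X_t)(t_{t+1} - t_t) + sigma(t_t, X_t) dW,
    where dW = W_{t_{t+1}} - W_{t_t} ~ N(0, (t_{t+1} - t_t) I_d)."""
    d = len(x0)
    X = [np.asarray(x0, dtype=float)]
    for t in range(len(times) - 1):
        dt = times[t + 1] - times[t]
        dW = np.sqrt(dt) * rng.standard_normal(d)  # Y_t^d in (3.1)
        x = X[-1]
        X.append(x + mu(times[t], x) * dt + sigma(times[t], x) @ dW)
    return np.array(X)

rng = np.random.default_rng(1)
mu = lambda t, x: -0.5 * x                  # hypothetical drift
sigma = lambda t, x: 0.2 * np.eye(len(x))   # hypothetical diffusion
path = simulate_discrete_diffusion(np.ones(4), mu, sigma,
                                   [0.0, 0.25, 0.5, 1.0], rng)
```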
Assumption 3.15
Assume there exist constants \(C>0\), \(q,\tilde{\alpha},\tilde{\zeta} \geq 0\) and for each \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T-1\}\) and \(\varepsilon \in (0,1]\), there exist neural networks \(\mu _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \to \mathbb{R}^{d}\) and \(\sigma _{\varepsilon ,d,t,i} \colon \mathbb{R}^{d} \to \mathbb{R}^{d}\), \(i=1,\ldots ,d\), with the property that for all \(d \in \mathbb{N}\), \(\varepsilon \in (0,1]\), \(t \in \{0,\ldots ,T-1\}\) and \(x \in \mathbb{R}^{d}\), it holds that
Here we denote by \(\sigma _{\varepsilon ,d,t}(x) \in \mathbb{R}^{d \times d}\) the matrix with the \(i\)th row \(\sigma _{\varepsilon ,d,t,i}(x)\).
Corollary 3.16
Let \(X^{d}\) follow a discrete diffusion model with coefficients satisfying Assumption 3.15 with \(\tilde{\zeta}<\frac{1}{T-1}\). Suppose \(p \geq 2\) and the payoff functions \(g_{d}\) satisfy Assumption 3.4. Let \(c >0\), \(q \geq 0\) and assume for each \(d \in \mathbb{N}\) that \(\rho ^{d}\) is a probability measure on \(\mathbb{R}^{d}\) with \(\int _{\mathbb{R}^{d}} |x|^{2m\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\) for \(m=\lceil \frac{2(1+\tilde{\zeta})}{\frac{1}{T-1}-\tilde{\zeta}} +1 \rceil \). Then there exist constants \(\kappa ,\mathfrak{q},\mathfrak{r} \in [0,\infty )\) and neural networks \(\psi _{\varepsilon ,d,t}\), \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\), such that for any \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\), we have
Proof
By Lemma 4.3 below, it follows that Assumption 3.9 is satisfied. In addition, the constant \(\beta >0 \) in Assumption 3.9 may be chosen arbitrarily and \(\zeta = \theta = \beta + \tilde{\zeta}\). Thus we may select \(\beta = \frac{1}{T-1}-\tilde{\zeta}-\delta \) for some \(\delta >0\) and then \(\beta >0\) and \(\zeta = \theta = \frac{1}{T-1}-\delta \). Choosing \(\delta = \frac{1}{2}(\frac{1}{T-1}-\tilde{\zeta})\), \(m=\lceil \frac{1+\tilde{\zeta}}{\beta} +1 \rceil \) then ensures that \(\zeta < \frac{1}{T-1} = \frac{\min (1,\beta m-\theta )}{T-1}\). Theorem 3.12 hence implies (3.20). □
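For completeness, the choices in the proof can be verified term by term:

```latex
% With \beta=\delta=\tfrac12\big(\tfrac{1}{T-1}-\tilde\zeta\big) and
% m=\big\lceil \tfrac{1+\tilde\zeta}{\beta}+1\big\rceil
%   =\big\lceil \tfrac{2(1+\tilde\zeta)}{\frac{1}{T-1}-\tilde\zeta}+1\big\rceil,
\beta m \;\geq\; \beta\Big(\frac{1+\tilde\zeta}{\beta}+1\Big) \;=\; 1+\tilde\zeta+\beta,
\qquad
\beta m-\theta \;\geq\; 1+\tilde\zeta+\beta-(\beta+\tilde\zeta) \;=\; 1,
% hence
\zeta \;=\; \frac{1}{T-1}-\delta \;<\; \frac{1}{T-1} \;=\; \frac{\min (1,\beta m-\theta )}{T-1}.
```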
Remark 3.17
Assumption 3.15 means that neural networks are able to approximate the drift and diffusion coefficients in (3.19) without the curse of dimensionality. In addition, Assumption 3.15 also requires that the Lipschitz constants of the approximating neural networks grow at most polynomially in \(d\) and \(\varepsilon ^{-1}\). The hypothesis \(\tilde{\zeta }<\frac{1}{T-1}\) in Corollary 3.16 imposes an upper bound on the rate of growth with respect to \(\varepsilon ^{-1}\) and thereby asks that the Lipschitz constants of the approximating neural networks only grow slowly with respect to \(\varepsilon ^{-1}\).
For example, in the special case when \(\mu ^{d}\) and \(\sigma ^{d}\) can be represented exactly by ReLU neural networks, we may choose \(\tilde{\alpha }=\tilde{\zeta }=0\) and Assumption 3.15 reduces to a hypothesis on the growth with respect to the dimension \(d\); see e.g. (3.18). More generally, there are several settings in which deep neural networks have been shown to exhibit approximation rates free from the curse of dimensionality; see for instance Barron [2], Shaham et al. [46] and the overview in Gühring et al. [30]. An approximation result which allows simultaneously bounding the approximation error and the Lipschitz constant of the approximating network has been proved in Gühring et al. [29]. Assumption 3.15 entails in addition growth hypotheses with respect to the dimension \(d\).
Remark 3.18
The approximation hypothesis on the drift and diffusion coefficients in Assumption 3.15, that is,
for all \(d \in \mathbb{N}\), \(\varepsilon \in (0,1]\), \(t \in \{0,\ldots ,T-1\}\) and \(x \in \mathbb{R}^{d}\), can be replaced by the weaker assumption that there exists \(\beta >0\) such that (3.21) holds for all \(d \in \mathbb{N}\), \(\varepsilon \in (0,1]\), \(t \in \{0,\ldots ,T-1\}\) and \(x \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\). Under this weaker assumption, the proof of Lemma 4.3 below shows that Assumption 3.9 is satisfied with the constant \(m \in \mathbb{N}\) in Assumption 3.9 chosen arbitrarily and with \(\zeta = \theta = \beta + \tilde{\zeta}\). Theorem 3.12 hence allows us to deduce a statement very similar to Corollary 3.16, where the only additional difference lies in the fact that the hypotheses \(\tilde{\zeta}<\frac{1}{T-1}\) and \(m=\lceil \frac{2(1+\tilde{\zeta})}{\frac{1}{T-1}-\tilde{\zeta}} +1 \rceil \) need to be replaced by the assumption that
3.6 Running minimum and maximum
In this subsection, we show that our framework can also cover barrier options. This follows from the next proposition, which proves that if a process satisfies Assumption 3.9, then the process augmented by its running maximum or minimum satisfies Assumption 3.9 as well.
Proposition 3.19
Suppose Assumption 3.9 holds. Let \(\bar{X}^{1} = X^{1}\) and for \(d \in \mathbb{N}\), \(d \geq 2\) and \(t \in \{0,\ldots ,T\}\), consider the \(\mathbb{R}^{d}\)-valued process \(\bar{X}_{t}^{d} = (X^{d-1}_{t},M^{d}_{t})\), where \(M^{d}\) is either the running minimum, \(M^{d}_{t} = \min _{i=1,\ldots ,d-1} \min _{s=0,\ldots ,t} X^{d-1}_{s,i}\), or the running maximum, \(M^{d}_{t} = \max _{i=1,\ldots ,d-1} \max _{s=0,\ldots ,t} X^{d-1}_{s,i}\). Then \(\bar{X}^{d}\), \(d \in \mathbb{N}\), satisfy Assumption 3.9.
The proof is given at the end of Sect. 4.2 below.
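As an illustration of the augmentation in Proposition 3.19 (not part of the proof), the running minimum of all components up to time \(t\) can be appended to a simulated path as follows:

```python
import numpy as np

def augment_running_min(path):
    """Given a path of shape (T+1, d-1) for X^{d-1}, return the
    augmented (T+1, d) path barX^d_t = (X^{d-1}_t, M^d_t), where
    M^d_t = min over all components and all times s <= t."""
    T1, _ = path.shape
    M = np.minimum.accumulate(path.min(axis=1))  # running min over s and i
    return np.hstack([path, M.reshape(T1, 1)])

path = np.array([[2.0, 3.0],
                 [1.0, 4.0],
                 [5.0, 0.5]])
aug = augment_running_min(path)   # last column: [2.0, 1.0, 0.5]
```

Replacing `np.minimum` by `np.maximum` and `min` by `max` gives the running-maximum case.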
4 Proofs
This section contains the remaining proofs of the results in Sect. 3. The section is split into several subsections. In Sect. 4.1, we provide a refined result on deep neural network approximations of the product function \(\mathbb{R}\times \mathbb{R}\to \mathbb{R}\), \((x,y) \mapsto xy\). Section 4.2 then contains two lemmas in which this approximation result is applied to verify that suitable exponential Lévy and discrete diffusion models satisfy Assumption 3.9. Subsequently, Sect. 4.3 contains auxiliary results needed for the proofs of Theorems 3.6 and 3.12. The proofs of these two results are then in Sects. 4.4 and 4.5.
4.1 Deep neural network approximation of the product
Based on Yarotsky [51, Proposition 3], we provide here a refined result regarding deep neural network approximations of the product function \(\mathbb{R}\times \mathbb{R}\to \mathbb{R}\), \((x,y) \mapsto xy\).
Lemma 4.1
There exists \(c>0\) such that for any \(\varepsilon \in (0,1]\) and \(M\geq 1\), there exists a neural network \({\mathfrak{n}}_{\varepsilon ,M} \colon \mathbb{R}\times \mathbb{R}\to \mathbb{R}\) with
\(\mathrm{size}({\mathfrak{n}}_{\varepsilon ,M}) \leq c(\log ( \varepsilon ^{-1})+\log M +1)\) and for all \(x,x',y,y' \in \mathbb{R}\),
Proof
By Grohs and Herrmann [27, Lemma 4.2] or Opschoor et al. [39, Proposition 4.1] (based on Yarotsky [51, Proposition 3]), there exists a constant \(c>0\) such that for any \({\bar{\varepsilon}} \in (0,\frac{1}{2})\), there exists a neural network \(\mathfrak{n}_{{\bar{\varepsilon}}} \colon \mathbb{R}\times \mathbb{R}\to \mathbb{R}\) satisfying \(\sup _{x,y \in [-1,1]} |\mathfrak{n}_{\bar{\varepsilon}}(x,y) - x y| \leq {\bar{\varepsilon}}\), \(\mathrm{size}(\mathfrak{n}_{\bar{\varepsilon}}) \leq c(\log ({ \bar{\varepsilon}}^{-1})+1)\) and
Consider now the capped neural network \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) = \mathfrak{n}_{ \bar{\varepsilon}}(\pi _{1}(x),\pi _{1}(y))\), where we set \(\pi _{1}(z) = \max (-1,\min (z,1))\). Define the cap function by \(\mathrm{cap}(x,y)=(\pi _{1}(x),\pi _{1}(y))\). Then \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) = \mathfrak{n}_{ \bar{\varepsilon}} \circ \mathrm{cap}(x,y)\) and it can be verified that \(\mathrm{cap}\) is again a neural network and for \(x,y \in [-1,1]\), we have \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) = \mathfrak{n}_{ \bar{\varepsilon}}(x,y) \). The fact that the composition of two ReLU neural networks can again be realised by a ReLU neural network with size bounded by twice the sum of the respective sizes (see e.g. Opschoor et al. [39, Proposition 2.2]) hence proves that there exists \(\tilde{c}\geq c\) such that for all \({\bar{\varepsilon}} \in (0,\frac{1}{2})\), we have \(\mathrm{size}(\bar{\mathfrak{n}}_{\bar{\varepsilon}}) \leq \tilde{c}( \log ({\bar{\varepsilon}}^{-1})+1)\). Moreover, \(\sup _{x,y \in [-1,1]} |\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x,y) - x y| \leq {\bar{\varepsilon}}\) and for all \(x,x',y,y' \in \mathbb{R}\),
Now let \(\varepsilon \in (0,1]\) and \(M \geq 1\) be given, choose \(\bar{\varepsilon} = 3^{-1} M^{-2} \varepsilon \) and define the rescaled network \(\mathfrak{n}_{{{\varepsilon}},M}(x,y) = M^{2} \bar{\mathfrak{n}}_{ \bar{\varepsilon}}(\frac{x}{M},\frac{y}{M})\). Then
\(\mathrm{size}(\mathfrak{n}_{{{\varepsilon}},M}) \leq \tilde{c}(\log ({ \bar{\varepsilon}}^{-1})+1)\) and for all \(x,x',y,y' \in \mathbb{R}\),
□
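The two elementary constructions in the proof above, the cap \(\pi_{1}\) realised with ReLUs and the rescaling from \([-1,1]^{2}\) to \([-M,M]^{2}\), can be sketched numerically. Here `n_bar` is a stand-in for the approximate product network \(\bar{\mathfrak{n}}_{\bar{\varepsilon}}\); for the sanity check we substitute the exact product.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pi1(z):
    """Cap pi_1(z) = max(-1, min(z, 1)), written with two ReLUs so
    that the cap is itself a small ReLU network."""
    return relu(z + 1.0) - relu(z - 1.0) - 1.0

def product_network(n_bar, M):
    """Rescaled network n_{eps,M}(x, y) = M^2 * n_bar(x/M, y/M), as in
    the proof of Lemma 4.1 (with the cap applied to the scaled inputs)."""
    return lambda x, y: M**2 * n_bar(pi1(x / M), pi1(y / M))

# sanity check with the exact product in place of n_bar
n = product_network(lambda x, y: x * y, M=10.0)
```

For \(|x|,|y| \leq M\) the cap is inactive and `n(x, y)` reduces to \(M^{2}\,\bar{\mathfrak{n}}_{\bar{\varepsilon}}(x/M, y/M)\), matching the error bound \(M^{2}\bar{\varepsilon} = \varepsilon /3\) for the choice \(\bar{\varepsilon} = 3^{-1}M^{-2}\varepsilon \).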
4.2 Sufficient conditions
In this subsection, we prove Lemmas 4.2 and 4.3, which show that the exponential Lévy and discrete diffusion models considered above satisfy Assumption 3.9. We also provide a proof of Proposition 3.19.
Lemma 4.2
Let \(X^{d}\) follow an exponential Lévy model (cf. (3.17)) for each \(d \in \mathbb{N}\) and assume that the Lévy triplets \((A^{d},\gamma ^{d},\nu ^{d})\) are bounded in the dimension, that is, there exists \(B > 0\) such that for any \(d \in \mathbb{N}\), \(i,j=1,\ldots ,d\),
where \(\bar{p} = 2m\max{(2,p)}\). Then Assumption 3.9 is satisfied with the constant \(\beta >0 \) in Assumption 3.9 chosen arbitrarily and with \(\zeta = \theta = \beta \).
Proof
Firstly, for each \(t \in \{0,\ldots ,T-1\}\) and \(d \in \mathbb{N}\), the exponential Lévy structure (3.17) implies \(X_{t+1,i}^{d} = X_{t,i}^{d} \exp (L_{t+1,i}^{d}-L_{t,i}^{d})\) for \(i=1,\ldots , d\). Therefore \(X_{t+1}^{d} = f^{d}_{t}(X_{t}^{d},Y^{d}_{t})\) with \(f^{d}_{t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) given by \(f^{d}_{t}(x,y) = (x_{1} y_{1},\ldots ,x_{d} y_{d})\) for \(x,y \in \mathbb{R}^{d}\) and with \(Y_{t,i}^{d} = \exp (L_{t+1,i}^{d}-L_{t,i}^{d})\), i.e., (3.1) is satisfied. Since \(L^{d}\) has independent increments, it follows that Assumption 3.1 (ii) is satisfied. Next, we can employ an argument from the proof of Gonon and Schwab [25, Theorem 5.1] (which uses Sato [45, Theorem 25.17] and (4.2)) to obtain for any \(d \in \mathbb{N}\) and \(i=1,\ldots ,d\) that
Combined with Minkowski’s inequality and the stationary increments property of \(L^{d}\), this yields
Furthermore, \(f_{t}^{d}(0,0) = 0\) and hence Assumption 3.1 (iv) is satisfied. Moreover, for \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), let \(M=\varepsilon ^{-\beta}\) and let \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) be the \(d\)-fold parallelisation of \({\mathfrak{n}}_{\varepsilon ,M}\) from Lemma 4.1. Then for \(x,y \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\), we obtain
\(\mathrm{size}(\eta _{\varepsilon ,d,t}) \leq d \mathrm{size}({ \mathfrak{n}}_{\varepsilon ,M}) \leq cd((\beta +1)\log (\varepsilon ^{-1})+1) \leq c_{1} d \varepsilon ^{-1}\) with \(c_{1} = c(\beta +2)\) and for all \(x,x',y,y' \in \mathbb{R}^{d}\),
Finally, for all \(x,y \in \mathbb{R}^{d}\),
and Minkowski’s integral inequality and (4.3) imply
□
Lemma 4.3
Assume \(p \geq 2\), let \(X^{d}\) follow a discrete diffusion model with coefficients \(\mu ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d}\), \(\sigma ^{d} \colon [0,\mathfrak{t}_{T}] \times \mathbb{R}^{d} \to \mathbb{R}^{d \times d}\) and suppose Assumption 3.15 holds. Then Assumption 3.9 is satisfied with the constants \(m \in \mathbb{N}\) and \(\beta >0 \) in Assumption 3.9 chosen arbitrarily and with \(\zeta = \theta = \beta + \tilde{\zeta}\).
Proof
First, (3.1) holds with \(f^{d}_{t}(x,y) = x + \mu ^{d}(\mathfrak{t}_{t},x) (\mathfrak{t}_{t+1}- \mathfrak{t}_{t}) + \sigma ^{d}(\mathfrak{t}_{t},x)y\) and \(Y_{t}^{d}= W^{d}_{\mathfrak{t}_{t+1}}-W^{d}_{\mathfrak{t}_{t}}\). Assumption 3.1 (ii) is satisfied by the independent increment property of Brownian motion. Furthermore, we obtain for each \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\) that \(|f_{t}^{d}(0,0)| = |\mu ^{d}(\mathfrak{t}_{t},0) (\mathfrak{t}_{t+1}- \mathfrak{t}_{t}) | \leq C \mathfrak{t}_{T} d^{q}\), and with \(\bar{p}= m\max{(2,p)}\), we have \(\mathbb{E}[|Y_{t}^{d}|^{2\bar{p}}] \leq (\mathfrak{t}_{T})^{\bar{p}} \mathbb{E}[|Z|^{2\bar{p}}]\), where \(Z\) is a standard normal random vector in \(\mathbb{R}^{d}\). The fact that \(|Z|^{2} \sim \chi ^{2}(d)\) and the upper and lower bounds for the Gamma function (see e.g. Gonon et al. [24, Lemma 2.4]) thus yield
with \(c_{\bar{p}} = \max _{n \in \mathbb{N}} (1+\frac{2\bar{p}}{n})^{\frac{n}{2}} < \infty \).
Next, for \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T-1\}\), let \(M=4 \max (C,1) d^{q+\frac{1}{2}} \varepsilon ^{-\beta}\) and consider \(\eta _{\varepsilon ,d,t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) given by
for \(i=1,\ldots ,d\) with \({\mathfrak{n}}_{\varepsilon ,M}\) from Lemma 4.1. By using the operations of parallelisation and concatenation, we can realise \((x,y)\mapsto {\mathfrak{n}}_{\varepsilon ,M}(\sigma _{\varepsilon ,d,t,i,j}(x),y_{j})\) by a neural network of size \(\mathfrak{s}_{i,j} := 2(\mathrm{size}(\mathfrak{n}_{\varepsilon ,M})+2+ \mathrm{size}(\sigma _{\varepsilon ,d,t,i,j}))\); see e.g. Opschoor et al. [39, Propositions 2.2 and 2.3]. Recall that the identity on ℝ can be realised by a ReLU deep neural network of arbitrary depth \(\ell \) and size \(2 \ell \) (see Petersen and Voigtlaender [40, Remark 2.4], Opschoor et al. [39, Proposition 2.4]). Thus we may insert identity networks in (4.5) to ensure that all summands can be realised by networks of the same depth, which is at most \(m_{i} = \max (\mathrm{size}(\mu _{\varepsilon ,d,t,i}),\mathfrak{s}_{i,1}, \ldots ,\mathfrak{s}_{i,d})\) due to the fact that the depth of a network is bounded by its size. By applying the summing operation for neural networks of equal depth (see e.g. Gonon and Schwab [25, Lemma 3.2]), it follows that \(\eta _{{\varepsilon},d,t,i}\) can be realised by a deep neural network with
Next, we use Assumption 3.15 to estimate
and so it follows for \(x \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\) that \(|\sigma _{\varepsilon ,d,t}(x)|_{F} \leq 2 C d^{q}(1+\sqrt{d}{ \varepsilon ^{-\beta}}) \leq M\). Hence Assumption 3.15 and (4.1) imply for \(x,y \in [-(\varepsilon ^{-\beta}),\varepsilon ^{-\beta}]^{d}\) that
Moreover, Assumption 3.15 and the Lipschitz property of \({\mathfrak{n}}_{\varepsilon ,M}\) yield that for all \(x,x',y,y' \in \mathbb{R}^{d}\),
In addition, for all \(x,y \in \mathbb{R}^{d}\),
so that (4.4) implies the polynomial growth bound (3.13). Finally, the estimate
combined with the Lipschitz, growth and approximation properties that we already established implies the polynomial bound (3.14) with \(\theta = \beta + \tilde{\zeta}\). Altogether, this proves that Assumption 3.9 is satisfied with the claimed choices of \(\zeta \) and \(\theta \). □
Proof of Proposition 3.19
Consider first the case of the running minimum. Any \(z \in \mathbb{R}^{d}\) can be partitioned as \(z = (z_{1:d-1},z_{d})\) into its first \(d-1\) components and its last component. Define the transition map \(\bar{f}^{d}_{t} \colon \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}\) for the augmented process by
and \(\bar{Y}^{d}_{t} = (Y^{d-1}_{t},0)\). Then \(\bar{X}^{d}_{0} =(X^{d-1}_{0},\min _{i=1,\ldots ,d-1}X^{d-1}_{0,i}) \) and so the independence and moment conditions on \(\bar{Y}^{d}\) are satisfied and \(|\bar{f}_{t}^{d}(0,0)| \leq 2|f^{d-1}_{t}(0,0)|\). Thus (i), (ii) and (iv) in Assumption 3.1 are satisfied.
Furthermore, by the identity \(x=x^{+} - (-x)^{+}\) and Grohs et al. [28, Lemma 4.12] the function \(\mathfrak{min}_{k} \colon \mathbb{R}^{k} \to \mathbb{R}\), \(z \mapsto \min _{j=1,\ldots ,k} z_{j}\) can be realised by a deep neural network with size at most \(12k^{3}\). We now set
Then the 1-Lipschitz property of \(\mathfrak{min}_{k}\), which follows from the fact that the pointwise minimum of 1-Lipschitz functions is again 1-Lipschitz, implies the bounds \(\mathrm{Lip}(\bar{\eta}_{\varepsilon ,d,t}) \leq \sqrt{2} \mathrm{Lip}({\eta}_{\varepsilon ,d-1,t})\) and
The bound on \(\mathrm{size}(\bar{\eta}_{\varepsilon ,d,t})\) follows from the bound on \(\mathrm{size}(\eta _{\varepsilon ,d-1,t})\) and bounds for the operations composition, parallelisation and the realisation of the identity (which yields a bound for the size of the neural network realising \(x \mapsto x_{1:d}\)). Finally, it is straightforward to deduce the inequalities \(|\bar{f}_{t}^{d}(x,y)| \leq \sqrt{2} |f_{t}^{d-1}(x_{1:d-1},y_{1:d-1})|\) and \(|\bar{\eta}_{\varepsilon ,d,t}(x,y)| \leq \sqrt{2} |\eta _{ \varepsilon ,d-1,t}(x_{1:d-1},y_{1:d-1})|\) so that all the required bounds follow from the corresponding properties of \(X^{d-1}\).
In the case of the running maximum, one proceeds analogously except that the growth bounds obtained at the end are now a bit different. More specifically, in this case, we may deduce that \(|\bar{f}_{t}^{d}(x,y)| \leq d |f_{t}^{d-1}(x_{1:d-1},y_{1:d-1})|+|x|\) and in addition, \(|\bar{\eta}_{\varepsilon ,d,t}(x,y)| \leq d |\eta _{\varepsilon ,d-1,t}(x_{1:d-1},y_{1:d-1})|+|x|\), which still allows us to deduce the claimed statement. □
4.3 Auxiliary results
This section contains auxiliary results that are needed for the proofs of Theorems 3.6 and 3.12. We start with Lemma 4.4, which establishes growth properties of the payoff function and its neural network approximation.
Lemma 4.4
Suppose Assumption 3.4 holds. Then for all \(\varepsilon \in (0,1]\), \(d \in \mathbb{N}\), \(x \in \mathbb{R}^{d}\) and \(t \in \{0,\ldots ,T\}\), it holds that
Proof
First note that from (3.6), (3.8) and the growth assumption on \(g_{d}\), we obtain for every \(\bar{\varepsilon} \in (0,1]\) that
Letting \(\bar{\varepsilon}\) tend to 0 gives (4.6). Moreover, the same properties of \(g_{d}\) and \(\phi _{{\varepsilon},d,t}\) imply
□
The next result establishes growth properties of the Markov update function and its neural network approximation.
Lemma 4.5
Suppose Assumption 3.1 is satisfied. Then for each \(d \in \mathbb{N}\), \(x,y \in \mathbb{R}^{d}\), \(t \in \{0,\ldots ,T-1\}\) and \(\varepsilon \in (0,1]\), it holds that
Proof
The proof is a straightforward consequence of (3.2), Assumption 3.1 (iv) and (3.4). Indeed, these hypotheses imply for every \(\bar{\varepsilon} \in (0,1]\) that
Letting \(\bar{\varepsilon}\) tend to 0 yields (4.8). In addition, the same hypotheses yield
□
In the next lemma, we establish a bound on the conditional moments of \(X^{d}\). The proof, together with \(\mathbb{E}[|X_{0}^{d}|]<\infty \), also yields \(\mathbb{E}[|X_{t}^{d}|]<\infty \) for all \(t\), so that the conditional expectation in (4.10) is well defined for all \(x \in \mathbb{R}^{d}\); cf. Remark 3.3.
Lemma 4.6
Suppose Assumption 3.1 or 3.9 is satisfied. Then for all \(d \in \mathbb{N}\), \(x \in \mathbb{R}^{d}\) and \(s,t \in \{0,\ldots ,T\}\) with \(s \geq t\), it holds that
with \(\tilde{c}_{1} = 2\max (c,1)^{T+1} T\), \(\tilde{q}_{1} = q(T+1)\) in the case of Assumption 3.1 and with \(\tilde{c}_{1} = T\max (h,1)^{\frac{T}{2m\max (p,2)}}\), \(\tilde{q}_{1} = \frac{\bar{q}T}{2m\max (p,2)}\) in the case of Assumption 3.9.
Proof
Assume without loss of generality that \(c \geq 1\). Consider first the case when Assumption 3.1 holds. Then (4.8) can be used to prove inductively that for all \(s \geq t\),
Indeed, for \(s=t\), this directly follows from the definition. Assume now \(s>t\) and (4.11) for \(s-1,s-2,\ldots ,t\). Then (4.8) and independence yield
as claimed. This shows that (4.11) holds for all \(s\geq t\). From (4.11) and Assumption 3.1 (iv), we obtain
If Assumption 3.9 holds, we first note that independence, Jensen’s inequality and (3.13) yield
We can now apply this estimate instead of (4.8) to get from the first to the second line in (4.12) and arrive at
Hence the conclusion follows. □
The next result ensures that the optimal value (2.1) is finite in our setting.
Lemma 4.7
Suppose Assumption 3.4 holds and Assumption 3.1 or 3.9 is satisfied. Then \(\mathbb{E}[|g_{d}(t,X_{t}^{d})|]< \infty \) for all \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\).
Proof
Let \(d \in \mathbb{N}\) and \(t \in \{0,\ldots ,T\}\). Then Lemmas 4.4 and 4.6 and Assumption 3.1 or 3.9 ensure that
□
The next lemma proves that the value function grows at most linearly. Recall from Remark 3.3 that Lemma 4.7 allows us to recursively define the value function for all \(x \in \mathbb{R}^{d}\) as the right-hand side of (2.2).
Lemma 4.8
Suppose Assumption 3.4 holds and Assumption 3.1 or 3.9 is satisfied. Then for all \(d \in \mathbb{N}\), \(t \in \{0,\ldots ,T\}\) and \(x \in \mathbb{R}^{d}\), it holds that
where \(\hat{c}_{t} = \max (c,1) (3\max (c,1)^{2})^{T-t}\), \(\hat{q}_{t} = q + 2q(T-t)\) in the case of Assumption 3.1 and \(\hat{c}_{t} = c (T+1) \max (h,1)^{\frac{T-t}{2m\max (p,2)}}\), \(\hat{q}_{t} = q + \frac{\bar{q}(T-t)}{2m\max (p,2)}\) in the case of Assumption 3.9.
Proof
Consider first the case of Assumption 3.1. The proof proceeds by backward induction. For \(t=T\), the statement directly follows from (4.6). Assume now the statement holds for \(t+1\). Then (2.2), (4.6), the induction hypothesis and (4.11) yield
Hence (4.14) also holds at \(t\) and the statement follows by induction.
In the case of Assumption 3.9, we aim to provide a tighter estimate and instead inductively prove that \(|V_{d}(t,x)| \leq \hat{a}_{t} + \hat{b}_{t}|x|\), where \(\hat{a}_{t} = \hat{a}_{t+1} + \hat{b}_{t+1} (h d^{\bar{q}})^{ \frac{1}{2m\max (p,2)}}\), \(\hat{a}_{T} = c d^{q}\), \(\hat{b}_{t} = \hat{b}_{t+1} (\max (h,1) d^{\bar{q}})^{ \frac{1}{2m\max (p,2)}}\), \(\hat{b}_{T} = c d^{q}\). Indeed, using (4.13) instead of (4.11), we analogously obtain
from which the statement follows by noting \(\hat{b}_{t} = cd^{q} (\max (h,1) d^{\bar{q}})^{ \frac{T-t}{2m\max (p,2)}}\),
□
The next result formalises the intuitive fact that a neural network in which some input arguments are held at fixed values is still a neural network, with at most as many non-zero parameters as the original network.
Lemma 4.9
Let \(d_{0},d_{1},m \in \mathbb{N}\), \(y \in \mathbb{R}^{d_{1}}\) and let \(\phi \colon \mathbb{R}^{d_{0}+d_{1}} \to \mathbb{R}^{m}\) be a neural network. Then \(\Phi _{y} \colon \mathbb{R}^{d_{0}} \to \mathbb{R}^{m}\), \(x \mapsto \phi ((x,y))\) can again be realised by a neural network \(\phi _{y}\) with \(\mathrm{size}(\phi _{y}) \leq \mathrm{size}(\phi )\).
Proof
Let us denote by \(((A^{1},b^{1}),\ldots ,(A^{L},b^{L}))\) the parameters of \(\phi \) for some \(L \in \mathbb{N}\), \(N_{0}= d_{0}+d_{1}\), \(N_{1},\ldots ,N_{L-1} \in \mathbb{N}\), \(N_{L}=m\) and \(A^{\ell} \in \mathbb{R}^{N_{\ell }\times N_{\ell -1}}\), \(b^{\ell }\in \mathbb{R}^{N_{\ell}}\), \(\ell =1,\ldots ,L\). Denote by \(A^{1,0} \in \mathbb{R}^{N_{1} \times d_{0}}\) and \(A^{1,1} \in \mathbb{R}^{N_{1} \times d_{1}}\) the first \(d_{0}\) and the remaining \(d_{1}\) columns of \(A^{1}\), respectively. Consider the neural network \(\phi _{y}\) defined by the parameter choice \(((A^{1,0},A^{1,1}y+b^{1}),(A^{2},b^{2}),\ldots ,(A^{L},b^{L}))\). Then
for all \(x \in \mathbb{R}^{d_{0}}\) and \(\mathrm{size}(\phi _{y}) \leq \mathrm{size}(\phi )\), as claimed. □
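The bias-absorption step of Lemma 4.9 is easy to check numerically: the columns of \(A^{1}\) acting on the frozen inputs are folded into the first-layer bias, while all deeper layers stay untouched. The following sketch (with illustrative shapes) verifies the first-layer pre-activations agree.

```python
import numpy as np

def fix_inputs(A1, b1, y, d0):
    """Freeze the last d1 inputs of the first layer at y:
    split A1 = [A^{1,0} | A^{1,1}] and absorb A^{1,1} @ y into the
    bias, as in Lemma 4.9; deeper layers are unchanged."""
    A10, A11 = A1[:, :d0], A1[:, d0:]
    return A10, A11 @ y + b1

# check: first-layer pre-activations agree on the input (x, y)
rng = np.random.default_rng(2)
A1, b1 = rng.standard_normal((5, 7)), rng.standard_normal(5)
x, y = rng.standard_normal(4), rng.standard_normal(3)
A10, b1y = fix_inputs(A1, b1, y, d0=4)
assert np.allclose(A1 @ np.concatenate([x, y]) + b1, A10 @ x + b1y)
```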
The next lemma allows us to construct a realisation of a random neural network and at the same time obtain a bound on the neural network weights.
Lemma 4.10
Let \(d, N \in \mathbb{N}\), let \(M_{1},M_{2} >0\), let \(U\) be a nonnegative random variable and let \(Y_{1},\ldots ,Y_{N}\) be i.i.d. \(\mathbb{R}^{d}\)-valued random variables. Suppose \(\mathbb{E}[U] \leq M_{1}\) and \(\mathbb{E}[|Y_{1}|]\leq M_{2}\). Then
Proof
First, by the i.i.d. assumption, it follows that
Next note that Bernoulli’s inequality implies \((\frac{2}{3})^{1/N} \leq 1 - \frac{1}{3N}\) and therefore, by Markov’s inequality,
Thus we obtain \((1-\mathbb{P}[|Y_{1}| > 3 N M_{2}])^{N} \geq \frac{2}{3}\), and inserting this in (4.15) yields
Furthermore, Markov’s inequality implies that
Combining now (4.16) and (4.17) with the elementary fact that for \(A, B \in \mathcal {F}\), we have \(\mathbb{P}[A\cap B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cup B] \geq \mathbb{P}[A] + \mathbb{P}[B] -1\) yields, as claimed, that
□
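For the reader's convenience, the Bernoulli and Markov steps in the proof above read as follows:

```latex
% Bernoulli's inequality (1+u)^N \geq 1+Nu with u=-\tfrac{1}{3N} gives
\Big(1-\frac{1}{3N}\Big)^{N} \;\geq\; 1-N\cdot \frac{1}{3N} \;=\; \frac{2}{3},
\qquad\text{hence}\qquad
\Big(\frac{2}{3}\Big)^{1/N} \;\leq\; 1-\frac{1}{3N},
% and Markov's inequality yields
\mathbb{P}\big[|Y_{1}| > 3NM_{2}\big]
\;\leq\; \frac{\mathbb{E}[|Y_{1}|]}{3NM_{2}}
\;\leq\; \frac{M_{2}}{3NM_{2}} \;=\; \frac{1}{3N}.
```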
4.4 Proof of Theorem 3.6 and Corollary 3.8
With these preparations, we are now ready to prove Theorem 3.6. The proof is divided into several steps which are highlighted in bold in order to facilitate reading. We also refer to Sect. 3.2 for a sketch of the proof.
Proof of Theorem 3.6
1. Preliminaries. Without loss of generality, we may assume that the constants \(c>0\), \(q,\alpha \geq 0 \) in the statement of the theorem and in Assumptions 3.1 and 3.4 coincide; otherwise we replace each of them by the respective maximum and all the assumptions are still satisfied. We may also assume that \(c \geq 1\).
Further, if for each fixed \(t \in \{0,\ldots ,T\}\), there exist constants \(\kappa _{t},\mathfrak{q}_{t},\mathfrak{r}_{t} \in [0,\infty )\) and a neural network \(\psi _{\varepsilon ,d,t}\) such that \(\mathrm{size}(\psi _{\varepsilon ,d,t}) \leq \kappa _{t} d^{ \mathfrak{q}_{t}}\varepsilon ^{-\mathfrak{r}_{t}}\) and (3.9) hold for all \(\varepsilon \in (0,1]\) and \(d \in \mathbb{N}\), then also the statement of the theorem follows by choosing \(\kappa \), \(\mathfrak{q}\), \(\mathfrak{r}\) as the respective maxima over \(t \in \{0,\ldots ,T\}\).
Next, let \(c_{0},\ldots ,c_{T}\) satisfy \(c_{0} = c\), \(c_{t+1} = \max (3c,1)^{2\max (p,2)}(1+c_{t}+c)\) and set \(q_{t} = (2\max (p,2)+1)qt+q\). Then
for all \(t \in \{0,\ldots ,T\}\), and \(c_{t}\) does not depend on \(d\).
2. Stronger statement. We now proceed to prove the following stronger statement, which shows that the constants \(\kappa _{t}\), \(\mathfrak{q}_{t}\), \(\mathfrak{r}_{t}\) can be chosen essentially independently of the probability measure \(\rho ^{d}\) and in addition, \(\rho ^{d}\) may be allowed to depend on \(t\). Specifically, we prove that for any \(t \in \{0,\ldots ,T\}\), there exist constants \(\kappa _{t},\mathfrak{q}_{t},\mathfrak{r}_{t} \in [0,\infty )\) such that for any family of probability measures \(\rho _{t}^{d}\) on \(\mathbb{R}^{d}\), \(d \in \mathbb{N}\), with
and for all \(d \in \mathbb{N}\) and \(\varepsilon \in (0,1]\), there exists a neural network \(\psi _{\varepsilon ,d,t}\) such that
and
Choosing \(\rho _{t}^{d} = \rho ^{d}\) for all \(t\) and noting that (4.18) is satisfied due to \(q \leq q_{t}\), \(c \leq c_{t}\), the statement of Theorem 3.6 then follows.
In order to prove the stronger statement for each fixed \(t\), we now proceed by backward induction.
3. Base case of backward induction. For \(t=T\), we have \(V_{d}(T,x) = g_{d}(T,x)\). Choose \(\psi _{\varepsilon ,d,T} = \phi _{\tilde{\varepsilon},d,T}\) with \(\tilde{\varepsilon} = \varepsilon (c d^{q} (1+ (c_{T}d^{q_{T}} )^{ \frac{p}{2\max (p,2)}}))^{-1}\). Then (3.6), Jensen’s inequality and (4.18) imply
Furthermore, (3.7) implies that the neural network chosen as \(\psi _{\varepsilon ,d,T} = \phi _{\tilde{\varepsilon },d,T}\) satisfies \(\mathrm{size}(\psi _{\varepsilon ,d,T}) \leq c d^{q} \varepsilon ^{- \alpha} (c d^{q} (1+ (c_{T}d^{q_{T}} )^{\frac{p}{2\max (p,2)}}))^{ \alpha}\). Combining this with (3.8) and (4.7), it follows that there exist \(\kappa _{T},\mathfrak{q}_{T},\mathfrak{r}_{T} \in [0,\infty )\) such that for any family of probability measures \(\rho _{T}^{d}\) on \(\mathbb{R}^{d}\), \(d \in \mathbb{N}\), with (4.18) and for all \(d \in \mathbb{N}\) and \(\varepsilon \in (0,1]\), there exists a neural network \(\psi _{\varepsilon ,d,T}\) such that (4.19)–(4.22) hold, that is, the statement in Step 2 follows in the case \(t=T\).
4. Start of the induction step. The remainder of the proof is now dedicated to the induction step. To improve readability, we again divide it into several steps.
For the induction step, we now assume that the stronger statement in Step 2 above holds for time \(t+1\) and aim to prove it for time \(t\). To this end, let \(\rho _{t}^{d}\) be a probability measure satisfying (4.18) and denote by \(\nu ^{d}_{t}\) the distribution of \(Y_{t}^{d}\).
5. Induction hypothesis. Let \(\kappa _{t+1},\mathfrak{q}_{t+1},\mathfrak{r}_{t+1} \in [0,\infty )\) denote the constants with which the stronger statement in Step 2 above holds for time \(t+1\).
Consider the probability measure \(\rho _{t+1}^{d} = (\rho _{t}^{d} \otimes \nu _{t}^{d}) \circ (f_{t}^{d})^{-1}\) given as the image measure of \(\rho _{t}^{d} \otimes \nu _{t}^{d}\) under \(f_{t}^{d}\), where we recall that \(\nu ^{d}_{t}\) is the distribution of \(Y_{t}^{d}\). Then using the change-of-variables formula, (4.8), (4.18) and Assumption 3.1 (iv) and writing \(\bar{p}=2\max (p,2)\), this measure satisfies
Hence by the induction hypothesis, for any \(\varepsilon \in (0,1]\) and \(d \in \mathbb{N}\), there exists a neural network \(\psi _{\varepsilon ,d,t+1}\) such that
and
Now let \(\varepsilon \in (0,1]\) and \(d \in \mathbb{N}\) be given. The remainder of the proof consists in selecting \(\kappa _{t}\), \(\mathfrak{q}_{t}\), \(\mathfrak{r}_{t}\) (only depending on \(c\), \(\alpha \), \(p\), \(q\), \(t\), \(T\), \(\kappa _{t+1}\), \(\mathfrak{q}_{t+1}\), \(\mathfrak{r}_{t+1}\)) and constructing a neural network \(\psi _{\varepsilon ,d,t}\) such that (4.19)–(4.22) are satisfied. This will complete the proof.
In what follows, we fix \(\bar{\varepsilon} \in (0,1)\) and choose
The value of \(\bar{\varepsilon}\) will be chosen later (depending on \(\varepsilon \) and \(d\)).
6. Approximation of the continuation value. Let \(Y^{d,i}_{t}\), \(i \in \mathbb{N}\), be i.i.d. copies of \(Y^{d}_{t}\) and set \(\hat{v}_{\bar{\varepsilon},d,t+1} = \psi _{\bar{\varepsilon},d,t+1} \). Define the (random) function
Note that \(\Gamma _{\bar{\varepsilon},d,t}\) is a random function since the samples \(Y^{d,i}_{t}\) are random.
We now estimate the expected \(L^{2}(\rho ^{d}_{t})\)-error that arises when \(\Gamma _{\bar{\varepsilon},d,t}\) is used to approximate the continuation value. Let \(Z^{\bar{\varepsilon},d,i}(x)= \hat{v}_{\bar{\varepsilon},d,t+1}( \eta _{\bar{\varepsilon},d,t}(x,Y^{d,i}_{t}))\) and recall that Assumption 3.1 (i), (ii) implies that \(Y^{d}_{t}\) is independent of \(X^{d}_{t}\). With this notation, \(\Gamma _{\bar{\varepsilon},d,t}(x) = \frac{1}{N} \sum _{i=1}^{N} Z^{ \bar{\varepsilon},d,i}(x)\) and thus the bias–variance decomposition and the fact that \(Y^{d,i}_{t}\), \(i \in \mathbb{N}\), are i.i.d. show that
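The bias–variance mechanism behind this estimate can be illustrated numerically. The following sketch uses toy stand-ins for \(\hat{v}_{\bar{\varepsilon},d,t+1}\) and \(\eta _{\bar{\varepsilon},d,t}\) (the functions `v`, `eta` and all numerical choices below are hypothetical, not the paper's construction) and checks that the mean-squared error of the Monte Carlo average \(\Gamma \) decays as the number \(N\) of i.i.d. copies grows:

```python
import numpy as np

# Illustrative sketch only: a Monte Carlo estimator
#   Gamma(x) = (1/N) * sum_i v(eta(x, Y_i))
# of a continuation value E[v(eta(x, Y))].  The functions v, eta and
# all numbers are hypothetical stand-ins, not the paper's objects.
rng = np.random.default_rng(0)

def v(z):
    # toy stand-in for the network psi_{eps,d,t+1} (1-Lipschitz)
    return np.maximum(z, 0.0)

def eta(x, y):
    # toy stand-in for the approximate transition map eta_{eps,d,t}
    return x + y

def gamma(x, N):
    # Monte Carlo average over N i.i.d. copies of Y_t^d
    y = rng.standard_normal(N)
    return np.mean(v(eta(x, y)))

x = 0.3
# high-accuracy reference value for E[v(eta(x, Y))]
truth = np.mean(v(eta(x, rng.standard_normal(10**6))))

def mse(N, reps=2000):
    # empirical mean-squared error of Gamma; by the bias-variance
    # decomposition this behaves like bias^2 + Var(v(...))/N
    return np.mean([(gamma(x, N) - truth) ** 2 for _ in range(reps)])

m10, m1000 = mse(10), mse(1000)
```

Increasing \(N\) from 10 to 1000 shrinks the empirical MSE by roughly two orders of magnitude, matching the \(1/N\) variance term in the decomposition.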
The first integral in the last line of (4.29) can be estimated as
6.a) Applying the error estimate from \(t+1\). Now consider the first term on the right-hand side of (4.30) and recall that \(\rho _{t+1}^{d} = (\rho _{t}^{d} \otimes \nu _{t}^{d}) \circ (f_{t}^{d})^{-1}\) is the image measure of \(\rho _{t}^{d} \otimes \nu _{t}^{d}\) under \(f_{t}^{d}\). Then Jensen’s inequality, (3.1), Assumption 3.1 (ii) and (4.24) yield
6.b) Applying the Lipschitz property of the network at \(t+1\). For the second term on the right-hand side of (4.30), note that by the induction hypothesis (4.27), we have \(\mathrm{Lip}(\hat{v}_{\bar{\varepsilon},d,t+1}) \leq \kappa _{t+1} d^{ \mathfrak{q}_{t+1}}\). Hence (3.2), the assumption \(\mathbb{E}[|Y^{d}_{t}|^{p}] \leq c d^{q}\) and (4.18) imply that
6.c) Applying the growth property of the network at \(t+1\). For the last term in (4.29), note that \(|\hat{v}_{\bar{\varepsilon},d,t+1}(x)| \leq \kappa _{t+1} d^{ \mathfrak{q}_{t+1}} \bar{\varepsilon}^{-\mathfrak{r}_{t+1}} (1+|x|)\) by the induction hypothesis. Combining this with (4.9), \(\mathbb{E}[|Y_{t}^{d}|^{2}] \leq c d^{q}\), Hölder’s inequality and (4.18) yields
6.d) Bounding the overall error and constructing a realisation. We can now insert the estimates from (4.31) and (4.32) into (4.30) and subsequently insert the resulting bound and (4.33) into (4.29). We obtain
with constants chosen as \(\tilde{c}_{2} = 2+8\max (c,1)^{2}\kappa _{t+1}^{2}(4+\max (c,1)^{2} + \max (c_{t},1))\) and \(\tilde{q}_{2} = 2(q+\mathfrak{q}_{t+1})+\max (2q,q_{t})\). But (4.34) implies that
Therefore Assumption 3.1 (iv), the fact that \(Y^{d,1}_{t},\ldots ,Y^{d,N}_{t}\) are i.i.d. copies of \(Y^{d}_{t}\) and Lemma 4.10 prove that there exists \(\omega \in \Omega \) such that the function defined via \(\gamma _{\bar{\varepsilon},d,t}(x) = \frac{1}{N}\sum _{i=1}^{N} \hat{v}_{\bar{\varepsilon},d,t+1}(\eta _{\bar{\varepsilon},d,t}(x,Y^{d,i}_{t}( \omega )))\) (i.e., the realisation of \(\Gamma _{\bar{\varepsilon},d,t}\) at \(\omega \)) satisfies
and
We now define
and claim that \(\psi _{\varepsilon ,d,t}=\hat{v}_{\bar{\varepsilon},d,t}\) satisfies all the properties required in (4.19)–(4.22).
7. Growth bound on the constructed network. Let us first verify (4.20). Indeed, the growth bound on \(\phi _{\bar{\varepsilon},d,t}\) in (4.7), the induction hypothesis (4.25), the growth bound (4.9) on \(\eta _{\bar{\varepsilon},d,t}\), the bound (4.36) on \(|Y_{t}^{d,i}(\omega )|\) and the choice of \(N\) in (4.28) imply for all \(x \in \mathbb{R}^{d}\) that
with \(\tilde{c}_{3} = 18 \max (c,1,\kappa _{t+1})\max (c,1)^{2}\), \(\tilde{q}_{3} = \mathfrak{q}_{t+1}+2q\) and \(\tilde{r}_{3} = 3\mathfrak{r}_{t+1}+2\).
8. Bounding the size of the constructed network. Next we verify (4.21). To achieve this, we first observe that for each \(i \in \{1,\ldots ,N\}\), Lemma 4.9 shows that the map \(x \mapsto \eta _{\bar{\varepsilon},d,t}(x,Y^{d,i}_{t}(\omega ))\) can be realised as a neural network with size at most \(\mathrm{size}(\eta _{\bar{\varepsilon},d,t})\). Next, the composition of two ReLU neural networks \(\phi _{1}\), \(\phi _{2}\) can again be realised by a ReLU neural network with size at most \(2 (\mathrm{size}(\phi _{1}) + \mathrm{size}(\phi _{2}))\) (see e.g. Opschoor et al. [39, Proposition 2.2]). Finally, Gonon and Schwab [25, Lemma 3.2] shows that the weighted sum of deep neural networks \(\phi _{1},\ldots ,\phi _{N}\) with the same number of layers, the same input dimension and the same output dimension can be realised by another deep neural network with size at most \(\sum _{i=1}^{N} \mathrm{size}(\phi _{i})\). Therefore \(\gamma _{\bar{\varepsilon},d,t}\) can be realised as a deep neural network with
where the last step follows from the induction hypothesis (4.26) and the bound (3.3) on the size of \(\eta _{\bar{\varepsilon},d,t}\). Next, subtracting a constant corresponds to a change of the “bias” \(b^{L}\) in the last layer, and therefore also \(\phi _{\bar{\varepsilon},d,t} - \delta \) is a neural network satisfying \(\mathrm{size}(\phi _{\bar{\varepsilon},d,t} - \delta ) = \mathrm{size}(\phi _{\bar{\varepsilon},d,t})\). Now define the neural network \(\mathfrak{m}\colon \mathbb{R}^{2} \to \mathbb{R}\) by \(\mathfrak{m}(x,y) = A^{2} \varrho (A^{1}(x,y)^{\top})\) with
Then \(\mathfrak{m}(x,y)= \max (x-y,0)+\max (y,0)-\max (-y,0) = \max (x,y)\). Thus \(\hat{v}_{\bar{\varepsilon},d,t}\) in (4.37) can be realised as a neural network by the composition of \(\mathfrak{m}\) with the parallelisation of \(\phi _{\bar{\varepsilon},d,t} - \delta \) and \(\gamma _{\bar{\varepsilon},d,t}\) (see e.g. Opschoor et al. [39, Proposition 2.3]), and the size of the parallelisation is bounded by \(\mathrm{size}(\phi _{\bar{\varepsilon},d,t}) + \mathrm{size}(\gamma _{ \bar{\varepsilon},d,t})\). This and the bound (3.7) on the size of \(\phi _{\bar{\varepsilon},d,t}\), the choice of \(N\) in (4.28) and the bound (4.39) on the size of \(\gamma _{\bar{\varepsilon},d,t}\) imply that
with \(\tilde{c}_{4} = 2(7+5c + 4\kappa _{t+1})\), \(\tilde{q}_{4} = \max (q,\mathfrak{q}_{t+1})\) and \(\tilde{r}_{4} =2 \mathfrak{r}_{t+1}+2+\max (\alpha ,\mathfrak{r}_{t+1})\).
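The identity \(\mathfrak{m}(x,y)=\max (x,y)\) from the previous step can be checked directly. The weight matrices \(A^{1}\), \(A^{2}\) below are one choice consistent with the displayed formula \(\max (x-y,0)+\max (y,0)-\max (-y,0)\) (the paper's exact matrices appear in the omitted display):

```python
import numpy as np

# One choice of weights consistent with
#   m(x,y) = max(x-y,0) + max(y,0) - max(-y,0)
A1 = np.array([[1.0, -1.0],    # row producing x - y
               [0.0,  1.0],    # row producing  y
               [0.0, -1.0]])   # row producing -y
A2 = np.array([1.0, 1.0, -1.0])

def m(x, y):
    # one hidden ReLU layer, then an affine output layer (zero bias)
    h = np.maximum(A1 @ np.array([x, y]), 0.0)
    return float(A2 @ h)
```

For instance, `m(3.0, 5.0)` and `m(5.0, 3.0)` both return `5.0`, as the identity predicts.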
9. Lipschitz constant of the constructed network. Next we verify (4.22). To do this, we note that the induction hypothesis (4.27) and the Lipschitz property (3.4) of \(\eta _{\bar{\varepsilon},d,t}\) imply for all \(x,y \in \mathbb{R}^{d}\) that
In addition, (3.8) implies \(\mathrm{Lip}(\phi _{\bar{\varepsilon},d,t} - \delta ) = \mathrm{Lip}( \phi _{\bar{\varepsilon},d,t}) \leq c d^{q}\), and the pointwise maximum of two Lipschitz-continuous functions is again Lipschitz-continuous with Lipschitz constant given by the maximum of the two Lipschitz constants. Combining this with (4.41) yields
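The elementary fact about the pointwise maximum invoked here can be written out as follows: if \(f\) and \(g\) are Lipschitz with constants \(L_{f}\) and \(L_{g}\), then for all \(x,y\),

```latex
\bigl|\max(f,g)(x) - \max(f,g)(y)\bigr|
\;\le\; \max\bigl(|f(x)-f(y)|,\; |g(x)-g(y)|\bigr)
\;\le\; \max(L_f, L_g)\, |x-y|.
```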
10. Bounding the overall approximation error. We now work towards verifying the approximation error bound (4.19). To achieve this, let
be the continuation region and
the approximate continuation region. Then (2.2), (4.37), (4.43) and (4.44) imply that
We now estimate (the integral of) each of these four terms separately. For the first term, we directly get from (3.6) that
and so we proceed with analysing the second term.
10.a) Bounding the approximation error on \(C_{t} \cap \hat{C}_{t}^{c}\). From Lemma 4.8, we have the growth bound \(|V_{d}(t+1,x)| \leq \hat{c}_{t+1} d^{\hat{q}_{t+1}} (1+|x|)\) and so, using (4.7), the second term in (4.45) can be estimated as
Combining this with (4.18), (4.10) in Lemma 4.6 and Hölder’s inequality, we obtain with \(C_{\mathrm{aux}} = 4 \max (\hat{c}_{t+1},c) d^{\hat{q}_{t+1}}\) that
10.b) Estimating \(\rho ^{d}_{t}({C_{t} \cap \hat{C}_{t}^{c}})\). Next we estimate \(\rho ^{d}_{t}({C_{t} \cap \hat{C}_{t}^{c}})\). To do this, set
and note that employing (4.43), (4.44), (4.47) and (4.48) to verify the inclusion \(A^{c} \cap B^{c} \cap C_{t} \subseteq \hat{C}_{t}\) yields
Indeed, \(|g_{d}(t,x) - \phi _{\bar{\varepsilon},d,t}(x)|\leq \frac{\delta}{2}\), \(|\mathbb{E}[V_{d}(t+1,X_{t+1}^{d})|X_{t}^{d}=x]- \gamma _{ \bar{\varepsilon},d,t}(x)|\leq \frac{\delta}{2} \) and \(g_{d}(t,x) < \mathbb{E}[V_{d}(t+1,X_{t+1}^{d})|X_{t}^{d}=x]\) lead in combination to
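Written as a chain (assuming, as in (4.44), that \(\hat{C}_{t}\) is the set where \(\phi _{\bar{\varepsilon},d,t} - \delta < \gamma _{\bar{\varepsilon},d,t}\)), the three bounds combine to

```latex
\gamma_{\bar{\varepsilon},d,t}(x)
\;\ge\; \mathbb{E}\bigl[V_{d}(t+1,X_{t+1}^{d}) \,\big|\, X_{t}^{d}=x\bigr] - \tfrac{\delta}{2}
\;>\; g_{d}(t,x) - \tfrac{\delta}{2}
\;\ge\; \phi_{\bar{\varepsilon},d,t}(x) - \delta,
```

so that \(x \in \hat{C}_{t}\).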
Furthermore, (4.47), Markov’s inequality, (4.18) and (3.6) imply that
Similarly, (4.48), Markov’s inequality and (4.35) yield
Putting together (4.46), (4.49)–(4.51) and inserting the choices (4.28), we obtain
with constants chosen as \(\tilde{c}_{5} = 4 \max (\hat{c}_{t+1},c) (1+ 2\tilde{c}_{1} (1+ c_{t}^{ \frac{1}{4}})) (8(1+c_{t})c^{2} + 24 \tilde{c}_{2} )^{\frac{1}{4}}\) and \(\tilde{q}_{5} =\hat{q}_{t+1}+\tilde{q}_{1}+\frac{q_{t}}{4}+ \frac{1}{4}\max (2q+q_{t},\tilde{q}_{2}) \).
10.c) Bounding the approximation error on \(C_{t}^{c} \cap \hat{C}_{t}\). We are now concerned with the third term in (4.45). Observe that
For \(x \in A^{c} \cap \hat{C}_{t}\), we use (4.44) and (4.47) to obtain
Together with (4.43) and (4.48), this therefore implies for \(x \in A^{c} \cap B^{c} \cap C_{t}^{c} \cap \hat{C}_{t}\) that
Combining this with (4.53), we obtain
and consequently the growth bounds (4.6), (4.10) and (4.14) on \(g_{d}\), on the conditional moments and on \(V_{d}\), Hölder’s inequality and the approximation error bound (4.35) for the continuation value yield
Inserting the bound (4.18) on the moments of \(\rho ^{d}_{t}\), the upper bounds (4.50), (4.51) on \(\rho ^{d}_{t}(A)\), \(\rho ^{d}_{t}(B)\) and the choices (4.28) for \(N\) and \(\delta \) thus shows that
where \(\tilde{c}_{6} = 2 \hat{c}_{t+1} (1+ 2\tilde{c}_{1} (1+c_{t}^{ \frac{1}{4}})) ((8(1+c_{t})c^{2} )^{\frac{1}{4}}+(24 \tilde{c}_{2} )^{ \frac{1}{4}}) + 2 (6 \tilde{c}_{2} )^{1/2} + \frac{7}{2}\) and \(\tilde{q}_{6} = \hat{q}_{t+1}+\frac{1}{2}q+\tilde{q}_{1}+ \frac{q_{t}}{2}+\frac{\tilde{q}_{2}}{2}\).
10.d) Combining the individual error estimates. Finally, note that the second and last line of (4.50) yield
Consequently, combining the decomposition (4.45) with the estimates (4.35), (4.52), (4.55) and (4.56), we obtain
with \(\tilde{c}_{7} = (2(1+c_{t}))^{\frac{1}{2}} c + 1 + \tilde{c}_{5} + \tilde{c}_{6} + (6 \tilde{c}_{2} )^{\frac{1}{2}}\) and \(\tilde{q}_{7} = \max (q+\frac{q_{t}}{2},\tilde{q}_{5},\tilde{q}_{6}, \frac{\tilde{q}_{2}}{2})\). Now choose
Inserting (4.58) in (4.57) proves that (4.19) is satisfied. Furthermore, choosing
we obtain from (4.38), (4.40) and (4.42) that (4.20)–(4.22) are satisfied. This completes the induction step, and hence the statement follows. □
Proof of Corollary 3.8
Fix \(d \in \mathbb{N}\) and \(h \in [-R,R]^{d}\) and set \(\rho ^{d} = \frac{1}{2}\nu ^{d}_{0} + \frac{1}{2} \nu ^{d}_{h} \), where \(\nu ^{d}_{x}\) denotes a multivariate normal distribution on \(\mathbb{R}^{d}\) with mean \(x\) and identity covariance. Then we estimate
and so \(|h|\leq R d^{1/2} \) and (4.4) show that there exist \(c >0\) and \(q \geq 0\) only depending on \(p\) and \(R\) such that the bound \(\int _{\mathbb{R}^{d}} |x|^{2\max (p,2)} \rho ^{d}(dx) \leq c d^{q}\) holds. Hence we can apply Theorem 3.6 and obtain for all \(\varepsilon \in (0,1]\) and \(t \in \{0,\ldots ,T\}\) the existence of a neural network \(\psi _{\varepsilon ,d,t}\) such that (3.9) holds. From the proof of Theorem 3.6, we obtain that these networks satisfy the Lipschitz condition (4.22). Therefore, for any \(\varepsilon >0\), we may use Minkowski’s inequality, the bound \(\|\cdot \|_{L^{2}(\nu _{y}^{d})} \leq \sqrt{2}\|\cdot \|_{L^{2}( \rho ^{d})}\) for \(y \in \{0,h\}\), the approximation error bound (3.9) and the Lipschitz property (4.22) to estimate
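The norm comparison \(\|\cdot \|_{L^{2}(\nu _{y}^{d})} \leq \sqrt{2}\|\cdot \|_{L^{2}(\rho ^{d})}\) used in this estimate is immediate from the mixture form \(\rho ^{d} = \frac{1}{2}\nu ^{d}_{0} + \frac{1}{2} \nu ^{d}_{h}\): for any measurable \(f\) and \(y \in \{0,h\}\),

```latex
\int_{\mathbb{R}^{d}} |f|^{2}\, d\nu_{y}^{d}
\;\le\; \int_{\mathbb{R}^{d}} |f|^{2}\, d\nu_{0}^{d}
      + \int_{\mathbb{R}^{d}} |f|^{2}\, d\nu_{h}^{d}
\;=\; 2 \int_{\mathbb{R}^{d}} |f|^{2}\, d\rho^{d}.
```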
This holds for any \(\varepsilon >0 \), and from the statement in Step 2 of the proof of Theorem 3.6, the constants \(\kappa _{t}\), \(\mathfrak{q}_{t}\) do not depend on \(d\), \(\varepsilon \) and \(h\) (but they depend on \(R\)). Letting \(\varepsilon \) tend to 0 therefore yields the claimed statement. □
4.5 Proof of Theorem 3.12
This subsection is devoted to the proof of Theorem 3.12. It is based on the proof of Theorem 3.6 given in Sect. 4.4.
Proof of Theorem 3.12
Proving this result requires only slight modifications to the proof of Theorem 3.6. Without loss of generality, we may assume \(c\geq h\) and \(q \geq \bar{q}\). In Step 1, we only need to choose \(c_{0},\ldots ,c_{T}\) differently. Indeed, let \(c_{0} = c\), \(c_{t+1} = h(1+c_{t})\) and set \(q_{t} = \bar{q}(t+1)\). In Step 2, the stronger statement is modified accordingly: we prove that for any \(t \in \{0,\ldots ,T\}\), there exist constants \(\kappa _{t}, \mathfrak{q}_{t},\mathfrak{r}_{t} \in [0,\infty )\) such that for any family of probability measures \(\rho _{t}^{d}\) on \(\mathbb{R}^{d}\), \(d \in \mathbb{N}\), with
and for all \(d \in \mathbb{N}\) and \(\varepsilon \in (0,1]\), there exists a neural network \(\psi _{\varepsilon ,d,t}\) such that the approximation error estimate (4.19) holds and \(\psi _{\varepsilon ,d,t}\) satisfies the growth and size conditions (4.20) and (4.21) and the modified Lipschitz condition
For \(t=T\), condition (4.63) coincides with (4.22) and so the base case (Step 3) remains the same as in the proof of Theorem 3.6. Due to the assumption (3.13), also Steps 4 and 5 only require slight modifications; indeed (4.23) becomes
the Lipschitz condition (4.27) is replaced by
and we modify the choice of \(N\) in (4.28) to \(N =\lceil \bar{\varepsilon}^{-2 \mathfrak{r}_{t+1}-2-2\theta} \rceil \).
For the beginning of Step 6 and for Step 6.a), we proceed precisely as above and obtain the error estimates (4.29)–(4.31). In Step 6.b), the Lipschitz property (4.64) of the network now yields the additional factor \(\bar{\varepsilon}^{-2\zeta (T-t-1)}\), and the approximation property (3.2) only holds on \([-(\bar{\varepsilon}^{-\beta}),\bar{\varepsilon}^{-\beta}]^{d}\); see (3.11). Hence we estimate
The first term can be bounded as before. For the second, note that Jensen’s inequality and (3.13) imply that \(\mathbb{E}[|f_{t}^{d}(x,Y_{t}^{d})|] \leq c d^{q} (1+|x|)\). Hence we may apply Hölder’s inequality, the growth bound (3.14) on \(\eta _{\bar{\varepsilon},d,t}\), Markov’s inequality and (4.62) to obtain, with \(|x|_{\infty }= \max _{i=1,\ldots ,d}|x_{i}|\), that
For the last term in (4.65), we note that (3.13) and Jensen’s inequality give
Using this, Hölder’s inequality, (3.14) and Markov’s inequality, we obtain
Together, this yields
Similarly, in Step 6.c), the factor \(\bar{\varepsilon}^{-2\mathfrak{r}_{t+1}}\) is replaced by \(\bar{\varepsilon}^{-2\mathfrak{r}_{t+1}-2\theta}\) due to the growth bound (3.14). This and the modified bound (4.66) (as opposed to (4.32)) then also lead to a different estimate in Step 6.d) where we obtain
with slightly different choices of constants given by \(\tilde{c}_{2} = 96 c^{2}\kappa _{t+1}^{2}(4+c^{2} + c_{t})c_{t}\) and \(\tilde{q}_{2} = 4q+2\mathfrak{q}_{t+1}+\max (2q,q_{t})\). Thus the same argument as before yields the existence of \(\omega \in \Omega \) such that \(\gamma _{\bar{\varepsilon},d,t}\), the realisation of \(\Gamma _{\bar{\varepsilon},d,t}\) at \(\omega \), satisfies
and (4.36) holds. In Step 7, the modified growth bound (3.14) and the modified choice \(N =\lceil \bar{\varepsilon}^{-2 \mathfrak{r}_{t+1}-2-2\theta} \rceil \) lead to an additional factor \(\bar{\varepsilon}^{-3\theta}\) in (4.38) and to a modified choice \(\tilde{r}_{3} = 3\mathfrak{r}_{t+1}+2+3\theta \). Similarly, in Step 8, the modified choice of \(N\) leads to an additional factor \(\bar{\varepsilon}^{-2\theta}\) and to a modified choice \(\tilde{r}_{4} = 2 \mathfrak{r}_{t+1}+2+\max (\alpha ,\mathfrak{r}_{t+1})+2 \theta \). Next, for the Lipschitz constant of the constructed network, i.e., Step 9, we need to verify (4.63). To do this, we note that the induction hypothesis (4.64) and the Lipschitz property (3.12) of \(\eta _{\bar{\varepsilon},d,t}\) imply for all \(x,y \in \mathbb{R}^{d}\) that
Thus we obtain
Step 10 requires a different choice of \(\delta \) and otherwise only minor modifications. We choose \(\delta = \bar{\varepsilon}^{\frac{1}{2}(\min (1,\beta m-\theta ) - \zeta (T-1))} \). Then in Step 10.b), the new bound (4.67) leads to a slightly different bound than in (4.51) and (4.52). We obtain
with slightly modified constant \(\tilde{c}_{5} = 4 \hat{c}_{t+1} (1+ 2\tilde{c}_{1} (1+ c_{t}^{ \frac{1}{4}})) (8(1+c_{t})c^{2} + 36 \tilde{c}_{2} )^{\frac{1}{4}}\) and \(\tilde{q}_{5}\) as before. In Step 10.c), (4.67) leads to analogous modifications in (4.54) and (4.55), yielding
with \(\tilde{c}_{6} = 2 \hat{c}_{t+1} (1+ 2\tilde{c}_{1} (1+c_{t}^{ \frac{1}{4}})) ((8(1+c_{t})c^{2} )^{\frac{1}{4}}+(36 \tilde{c}_{2} )^{ \frac{1}{4}}) + 2 (9 \tilde{c}_{2} )^{1/2} + \frac{7}{2}\) and \(\tilde{q}_{6}\) as before. Combining (4.67)–(4.69) and (4.56) thus gives for Step 10.d) the bound
with \(\tilde{c}_{7} = (2(1+c_{t}))^{\frac{1}{2}} c + 1 + \tilde{c}_{5} + \tilde{c}_{6} + (9 \tilde{c}_{2})^{\frac{1}{2}} \) and \(\tilde{q}_{7}\) as before. Choose now
and note that \(\bar{\varepsilon} \in (0,1)\) because \(\tilde{c}_{7}>1\) and \(\frac{\min (1,\beta m-\theta )}{T-1} >\zeta \). By inserting this choice of \(\bar{\varepsilon}\) in the bounds for the growth, size and Lipschitz constants of \(\psi _{\varepsilon ,d,t}\), we may then appropriately choose \(\kappa _{t}\), \(\mathfrak{q}_{t}\), \(\mathfrak{r}_{t}\) (analogously to (4.59)–(4.61)) and complete the proof. □
References
Andersen, L., Broadie, M.: Primal–dual simulation algorithm for pricing multidimensional American options. Manag. Sci. 50, 1222–1234 (2004)
Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993)
Bayer, C., Belomestny, D., Hager, P., Pigato, P., Schoenmakers, J.: Randomized optimal stopping algorithms and their convergence analysis. SIAM J. Financ. Math. 12, 1201–1225 (2021)
Bayer, C., Hager, P.P., Riedel, S., Schoenmakers, J.: Optimal stopping with signatures. Ann. Appl. Probab. 33, 238–273 (2023)
Beck, C., Hutzenthaler, M., Jentzen, A., Kuckuck, B.: An overview on deep learning-based approximation methods for partial differential equations. Discrete Contin. Dyn. Syst., Ser. B 28, 3697–3746 (2023)
Becker, S., Cheridito, P., Jentzen, A.: Deep optimal stopping. J. Mach. Learn. Res. 20, Paper No. 74, 1–25 (2019)
Becker, S., Cheridito, P., Jentzen, A.: Pricing and hedging American-style options with deep learning. J. Risk Financ. Manag. 13, Paper No. 158, 1–12 (2020)
Becker, S., Cheridito, P., Jentzen, A., Welti, T.: Solving high-dimensional optimal stopping problems using deep learning. Eur. J. Appl. Math. 32, 470–514 (2021)
Belomestny, D.: On the rates of convergence of simulation-based optimization algorithms for optimal stopping problems. Ann. Appl. Probab. 21, 215–239 (2011)
Belomestny, D., Bender, C., Schoenmakers, J.: True upper bounds for Bermudan products via non-nested Monte Carlo. Math. Finance 19, 53–71 (2009)
Berner, J., Grohs, P., Jentzen, A.: Analysis of the generalization error: empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM J. Math. Data Sci. 2, 631–657 (2020)
Bouchard, B., Warin, X.: Monte-Carlo valuation of American options: facts and new algorithms to improve existing methods. In: Carmona, R.A., et al. (eds.) Numerical Methods in Finance, pp. 215–255. Springer, Berlin (2012)
Broadie, M., Glasserman, P.: A stochastic mesh method for pricing high-dimensional American options. J. Comput. Finance 7(4), 35–72 (2004)
Buehler, H., Gonon, L., Teichmann, J., Wood, B.: Deep hedging. Quant. Finance 19, 1271–1291 (2019)
Cioica-Licht, P.A., Hutzenthaler, M., Werner, P.T.: Deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear partial differential equations. Preprint (2022). Available online at https://arxiv.org/abs/2205.14398
Clément, E., Lamberton, D., Protter, P.: An analysis of a least squares regression method for American option pricing. Finance Stoch. 6, 449–471 (2002)
Cuchiero, C., Khosrawi, W., Teichmann, J.: A generative adversarial network approach to calibration of local stochastic volatility models. Risks 8, Paper No. 101, 1–31 (2020)
Ech-Chafiq, Z.E.F., Labordère, P.H., Lelong, J.: Pricing Bermudan options using regression trees/random forests. SIAM J. Financ. Math. 14, 1113–1139 (2023)
Elbrächter, D., Grohs, P., Jentzen, A., Schwab, C.: DNN expression rate analysis of high-dimensional PDEs: application to option pricing. Constr. Approx. 55, 3–71 (2022)
Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time, 4th revised edn. de Gruyter, Berlin (2016)
Garcia, D.: Convergence and biases of Monte Carlo estimates of American option prices using a parametric exercise rule. J. Econ. Dyn. Control 27, 1855–1879 (2003)
Germain, M., Pham, H., Warin, X.: Neural networks-based algorithms for stochastic control and PDEs in finance. In: Capponi, A., Lehalle, C.A. (eds.) Machine Learning and Data Sciences for Financial Markets: A Guide to Contemporary Practices, pp. 426–452. Cambridge University Press, Cambridge (2023)
Gonon, L.: Random feature neural networks learn Black–Scholes type PDEs without curse of dimensionality. J. Mach. Learn. Res. 24, Paper No. 189, 1–51 (2023)
Gonon, L., Grohs, P., Jentzen, A., Kofler, D., Šiška, D.: Uniform error estimates for artificial neural network approximations for heat equations. IMA J. Numer. Anal. 42, 1991–2054 (2021)
Gonon, L., Schwab, C.: Deep ReLU network expression rates for option prices in high-dimensional, exponential Lévy models. Finance Stoch. 25, 615–657 (2021)
Gonon, L., Schwab, C.: Deep ReLU neural networks overcome the curse of dimensionality for partial integrodifferential equations. Anal. Appl. (Singap.) 21, 1–47 (2023)
Grohs, P., Herrmann, L.: Deep neural network approximation for high-dimensional parabolic Hamilton–Jacobi–Bellman equations. Preprint (2021). Available online at https://arxiv.org/abs/2103.05744
Grohs, P., Hornung, F., Jentzen, A., von Wurstemberger, P.: A Proof That Artificial Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Black–Scholes Partial Differential Equations. Am. Math. Soc., Providence (2023)
Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in \(W^{s,p}\) norms. Anal. Appl. (Singap.) 18, 803–859 (2020)
Gühring, I., Raslan, M., Kutyniok, G.: Expressivity of deep neural networks. In: Grohs, P., Kutyniok, G. (eds.) Mathematical Aspects of Deep Learning, pp. 149–199. Cambridge University Press, Cambridge (2023)
Haugh, M.B., Kogan, L.: Pricing American options: a duality approach. Oper. Res. 52, 258–270 (2004)
Herrera, C., Krach, F., Ruyssen, P., Teichmann, J.: Optimal stopping via randomized neural networks. Front. Math. Finance 3, 31–77 (2024)
Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A.: A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. Part. Differ. Equ. Appl. 1, Paper No. 10, 1–34 (2020)
Jain, S., Oosterlee, C.W.: The stochastic grid bundling method: efficient pricing of Bermudan options and their Greeks. Appl. Math. Comput. 269, 412–431 (2015)
Jentzen, A., Salimova, D., Welti, T.: A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients. Commun. Math. Sci. 19, 1167–1205 (2021)
Kohler, M., Krzyżak, A., Todorovic, N.: Pricing of high-dimensional American options by neural networks. Math. Finance 20, 383–410 (2010)
Lapeyre, B., Lelong, J.: Neural network regression for Bermudan option pricing. Monte Carlo Methods Appl. 27, 227–247 (2021)
Longstaff, F.A., Schwartz, E.S.: Valuing American options by simulation: a simple least-squares approach. Rev. Financ. Stud. 14, 113–147 (2001)
Opschoor, J.A.A., Petersen, P.C., Schwab, C.: Deep ReLU networks and high-order finite element methods. Anal. Appl. (Singap.) 18, 715–770 (2020)
Petersen, P., Voigtlaender, F.: Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108, 296–330 (2018)
Reisinger, C., Zhang, Y.: Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiff systems. Anal. Appl. (Singap.) 18, 951–999 (2020)
Reppen, A.M., Soner, H.M., Tissot-Daguette, V.: Neural optimal stopping boundary. Preprint (2022). Available online at https://arxiv.org/abs/2205.04595
Rogers, L.C.G.: Monte Carlo valuation of American options. Math. Finance 12, 271–286 (2002)
Ruf, J., Wang, W.: Neural networks for option pricing and hedging: a literature review. J. Comput. Finance 24(1), 1–46 (2020)
Sato, K.I.: Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press, Cambridge (1999)
Shaham, U., Cloninger, A., Coifman, R.R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44, 537–557 (2018)
Sirignano, J., Spiliopoulos, K.: DGM: a deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 1339–1364 (2018)
Takahashi, A., Yamada, T.: Solving Kolmogorov PDEs without the curse of dimensionality via deep learning and asymptotic expansion with Malliavin calculus. Part. Differ. Equ. Appl. 4, Paper No. 27, 1–31 (2023)
Tsitsiklis, J., Van Roy, B.: Regression methods for pricing complex American-style options. IEEE Trans. Neural Netw. Learn. Syst. 12, 694–703 (2001)
Wang, S., Perdikaris, P.: Deep learning of free boundary and Stefan problems. J. Comput. Phys. 428, Paper No. 109914, 1–24 (2021)
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)
Ethics declarations
Competing Interests
The author declares no competing interests.
Gonon, L. Deep neural network expressivity for optimal stopping problems. Finance Stoch 28, 865–910 (2024). https://doi.org/10.1007/s00780-024-00538-0
Keywords
- Deep neural network
- Optimal stopping problem
- Markov process
- Expression rate
- Approximation error bound
- Curse of dimensionality