DNN Expression Rate Analysis of High-dimensional PDEs: Application to Option Pricing

We analyze approximation rates by deep ReLU networks of a class of multivariate solutions of Kolmogorov equations which arise in option pricing. Key technical devices are deep ReLU architectures capable of efficiently approximating tensor products. Combining this with results on the approximation of sufficiently smooth univariate functions yields bounds on the rates of deep ReLU approximation of multivariate functions with tensor structure. We apply this in particular to the model problem given by the price of a European maximum option on a basket of $d$ assets within the Black-Scholes model. We prove that the solution to the $d$-variate option pricing problem can be approximated up to an $\varepsilon$-error by a deep ReLU network with depth $\mathcal{O}\big(\ln(d)\ln(\varepsilon^{-1})+\ln(d)^2\big)$ and $\mathcal{O}\big(d^{2+\frac{1}{n}}\varepsilon^{-\frac{1}{n}}\big)$ non-zero weights, where $n\in \mathbb{N}$ is arbitrary (with the constant implied in $\mathcal{O}(\cdot)$ depending on $n$). The techniques developed in the constructive proof are of independent interest in the analysis of the expressive power of deep neural networks for solution manifolds of PDEs in high dimension.


Motivation
The development of new classification and regression algorithms based on deep neural networks, coined "Deep Learning", has revolutionized the areas of artificial intelligence, machine learning, and data analysis [17]. More recently, these methods have been applied to the numerical solution of partial differential equations (PDEs for short) [41,14,11,29,24,3,10,23,34]. In these works it has been empirically observed that deep learning-based methods work exceptionally well when used for the numerical solution of high-dimensional problems arising in option pricing. The numerical experiments carried out in [3,10,23,2] in particular suggest that deep learning-based methods may not suffer from the curse of dimensionality for these problems, but only a few theoretical results exist which support this claim: in [40], a first theoretical result on rates of expression of infinite-variate generalized polynomial chaos expansions for solution manifolds of certain classes of parametric PDEs has been obtained. Furthermore, recent work [20,4] shows that the algorithms introduced in [2] for the numerical solution of Kolmogorov PDEs are free of the curse of dimensionality in terms of network size and training sample complexity.
Neural networks constitute a parametrized class of functions constructed by successive applications of affine mappings and coordinatewise nonlinearities; see [37] for a mathematical introduction. As in [36], we introduce a neural network via a tuple of matrix-vector pairs for given hyperparameters $L \in \mathbb{N}$, $N_0, N_1, \dots, N_L \in \mathbb{N}$. Given an "activation function" $\varrho \in C(\mathbb{R}, \mathbb{R})$, a neural network $\Phi$ then describes a function $R_\varrho(\Phi) \in C(\mathbb{R}^{N_0}, \mathbb{R}^{N_L})$ that can be evaluated by a layer-wise recursion, applying $\varrho$ componentwise after every affine transformation except the last one. The number of nonzero values in the matrix-vector tuples defining $\Phi$ describes the size of $\Phi$, which will be denoted by $M(\Phi)$, and the depth of the network $\Phi$, i.e. its number of affine transformations, will be denoted by $L(\Phi)$. We refer to Setting 5.1 for a more detailed description. A popular activation function $\varrho$ is the so-called "Rectified Linear Unit" $\mathrm{ReLU}(x) = \max\{x, 0\}$ [17]. An increasing body of research addresses the approximation properties (or "expressive power") of deep neural networks, where by "approximation properties" we mean the study of the optimal tradeoff between the size $M(\Phi)$ and the approximation error $\|u - R_\varrho(\Phi)\|$ of neural networks approximating functions $u$ from a given function class. Classical references include [25,8,1,7] as well as the summary [37] and the references therein. In these works it is shown that deep neural networks provide optimal approximation rates for classical smoothness spaces such as Sobolev spaces or Besov spaces. More recently these results have been extended to Shearlet and Ridgelet spaces [5], Modulation spaces [35], piecewise smooth functions [36], and polynomial chaos expansions [40]. All these results indicate that all classical approximation methods based on sparse expansions can be emulated by neural networks.
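To fix ideas, the following Python sketch (purely illustrative; the helper names are ours, not the paper's) evaluates such a realization for a network stored as a list of matrix-vector pairs and reads off the size $M(\Phi)$ and depth $L(\Phi)$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def realize(phi, x, act=relu):
    """Evaluate the realization of phi = [(A_1, b_1), ..., (A_L, b_L)] at x:
    the activation is applied componentwise after every affine map except
    the last one."""
    for A, b in phi[:-1]:
        x = act(A @ x + b)
    A_last, b_last = phi[-1]
    return A_last @ x + b_last

def size(phi):
    """M(phi): number of non-zero weights and biases."""
    return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b)) for A, b in phi)

def depth(phi):
    """L(phi): number of affine transformations."""
    return len(phi)
```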

Contributions and Main Result
As a first main contribution of this work we show in Proposition 6.4 that low-rank functions of the form (1.2), i.e. $f(x_1, \dots, x_d) = \sum_{s=1}^{R} c_s \prod_{j=1}^{d} h^s_j(x_j)$, with $h^s_j \in C(\mathbb{R}, \mathbb{R})$ sufficiently regular and $(c_s)_{s=1}^{R} \subseteq \mathbb{R}$, can be approximated to a given relative precision by deep ReLU neural networks of size scaling like $R\,d^2$. In particular, we obtain a dependence on the dimension $d$ that is only polynomial and not exponential, i.e. we avoid the curse of dimensionality. In other words, we show that, in addition to classical approximation methods based on sparse expansions, more general low-rank structures can be emulated by neural networks. Since the solutions of several classes of high-dimensional PDEs are precisely of this form (see, e.g., [40]), our approximation results can be directly applied to these problems to establish approximation rates for neural network approximations that do not suffer from the curse of dimensionality. Note that approximation results for functions of the form (1.2) have previously been considered in [39] in the context of statistical bounds for nonparametric regression.
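For orientation, such a rank-$R$ separable function can be evaluated from only $R\,d$ univariate evaluations; the following sketch (illustrative names, not from the paper) makes this structure explicit, and it is this structure that the network constructions below emulate.

```python
import numpy as np

def low_rank_eval(c, h, x):
    """Evaluate a rank-R separable function
        f(x) = sum_{s=1}^{R} c[s] * prod_{j=1}^{d} h[s][j](x[j]),
    using only R*d univariate evaluations.
    c : sequence of R coefficients
    h : R x d nested sequence of univariate callables
    x : length-d input vector
    """
    value = 0.0
    for c_s, h_s in zip(c, h):
        factor = 1.0
        for h_sj, x_j in zip(h_s, x):
            factor *= h_sj(x_j)
        value += c_s * factor
    return value

# example with d = 3, R = 2: f(x) = sin(x1)*sin(x2)*sin(x3) + 0.5*exp(x1+x2+x3)
f_val = low_rank_eval([1.0, 0.5],
                      [[np.sin] * 3, [np.exp] * 3],
                      np.array([0.1, 0.2, 0.3]))
```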
Moreover, we remark that the networks realizing the product in (1.2) itself have a connectivity that scales only logarithmically in the reciprocal accuracy $\varepsilon^{-1}$. While we will, for our concrete example, only obtain a spectral connectivity scaling, i.e. like $\varepsilon^{-\frac{1}{n}}$ for any $n \in \mathbb{N}$ with the implicit constant depending on $n$, this tensor construction may be used to obtain logarithmic scaling (w.r.t. the accuracy) for $d$-variate functions in cases where the univariate $h^s_j$ can be approximated with a logarithmic scaling. As a particular application of the tools developed in the present paper, we provide a mathematical analysis of the rates of expressive power of neural networks for a particular high-dimensional PDE which arises in mathematical finance, namely the pricing of a so-called European maximum option (see, e.g., [43]).
We consider the particular (and not entirely realistic) situation in which the log-returns of the $d$ assets are uncorrelated, i.e. they evolve according to $d$ uncorrelated drifted scalar diffusion processes.
The price of the European maximum option on this basket of $d$ assets can then be obtained as the solution of the multivariate Black-Scholes equation (1.3), which, in the presently considered case of uncorrelated assets, is posed for $x = (x_1, \dots, x_d) \in (0, \infty)^d$. It is well known (see, e.g., [13,22] and the references there) that there exists a unique solution of (1.3)-(1.4). This solution can be expressed as a conditional expectation of the payoff function $\varphi(x)$ in (1.4) over suitable sample paths of a $d$-dimensional diffusion.
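For orientation, a standard form of the Kolmogorov (Black-Scholes) equation for $d$ uncorrelated geometric Brownian motions with a maximum-call terminal payoff reads as follows; the precise drift, discounting, and strike conventions used in (1.3)-(1.4) may differ from this illustrative version.

```latex
\partial_t u(t,x)
  + \frac{1}{2}\sum_{i=1}^{d} \sigma_i^2\, x_i^2\, \frac{\partial^2 u}{\partial x_i^2}(t,x)
  + \sum_{i=1}^{d} \mu_i\, x_i\, \frac{\partial u}{\partial x_i}(t,x) = 0,
\qquad
u(T,x) = \varphi(x) = \max\Bigl\{\max_{1 \le i \le d} x_i - K,\ 0\Bigr\},
\qquad x \in (0,\infty)^d.
```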
One main result of this paper is the following result (stated with completely detailed assumptions below as Theorem 7.3) on expression rates of deep neural networks for the basket option price $u(0, x)$ for $x \in [a, b]^d$ with $0 < a < b < \infty$. To render the dependence on the number $d$ of assets in the basket explicit, we write $u_d$ in the statement of the theorem. Informally speaking, Theorem 7.3 states that the price of a $d$-dimensional European maximum option can, for every $n \in \mathbb{N}$, be expressed on cubes $[a, b]^d$ by deep neural networks to pointwise accuracy $\varepsilon > 0$ with network size bounded as $\mathcal{O}(d^{2+1/n}\varepsilon^{-1/n})$ for arbitrary, fixed $n \in \mathbb{N}$ and with the constant implied in $\mathcal{O}(\cdot)$ independent of $d$ and of $\varepsilon$ (but depending on $n$). In other words, the price of a European maximum option on a basket of $d$ assets can be approximated (or "expressed") by deep ReLU networks with spectral accuracy and without the curse of dimensionality.
The proof of this result is based on a near-explicit expression for the function $u_d(0, x)$ (see Section 2). It uses this expression in conjunction with regularity estimates in Section 3 and a neural network quadrature calculus with corresponding error estimates (which is of independent interest) in Section 4 to show that the function $u_d(0, x)$ possesses an approximate low-rank representation consisting of tensor products of cumulative normal distribution functions (Lemma 4.3), to which the low-rank approximation result mentioned above can be applied.
Related results have been shown in the recent work [20], which proves (by completely different methods) that solutions to general Kolmogorov equations with affine drift and diffusion terms can be approximated by neural networks of a size that scales polynomially in the dimension and the reciprocal of the desired accuracy as measured by the $L^p$ norm with respect to a given probability measure. The approximation estimates developed in the present paper only apply to the European maximum option pricing problem for uncorrelated assets, but hold with respect to the much stronger $L^\infty$ norm and provide spectral accuracy in $\varepsilon$ (as opposed to the low-order polynomial rate obtained in [20]), which is a considerable improvement. In summary, compared to [20], the present paper treats a more restricted problem but achieves stronger approximation results.
In order to give some context to our approximation results, we remark that solutions to Kolmogorov PDEs may, under reasonable assumptions, be approximated by empirical risk minimization over a neural network hypothesis class. The key here is the Feynman-Kac formula, which allows one to write the solution to the PDE as the expectation of an associated stochastic process. This expectation can be approximated by Monte Carlo integration, i.e. one can view it as a neural network training problem where the data is generated by Monte Carlo sampling methods which, under suitable conditions, are capable of avoiding the curse of dimensionality. For more information on this we refer to [4].
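As a minimal, hedged illustration of this data-generation idea (all function names are ours, and this is not the training pipeline of [2] or [4]), one can sample terminal values of a geometric Brownian motion and average the payoff, which by the Feynman-Kac formula approximates the option price at a given spatial point.

```python
import numpy as np

def mc_max_option_value(x0, mu, sigma, T, K, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[ max( max_i X_T^i - K, 0 ) ] for d uncorrelated
    geometric Brownian motions started at x0 (Feynman-Kac viewpoint; discounting
    conventions omitted for brevity)."""
    rng = np.random.default_rng(seed)
    x0, mu, sigma = (np.asarray(a, dtype=float) for a in (x0, mu, sigma))
    Z = rng.standard_normal((n_samples, x0.size))
    X_T = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    payoff = np.maximum(X_T.max(axis=1) - K, 0.0)
    return payoff.mean()

# Pairs (x0, payoff sample) of this kind can serve as training data for a network
# approximating x -> u(0, x) by empirical risk minimization.
```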
While we admit that the European maximum option pricing problem for uncorrelated assets constitutes a rather special problem, the proofs in this paper develop several novel deep neural network approximation results of independent interest that can be applied to more general settings where a low-rank structure is implicit in high-dimensional problems. For mostly numerical results on machine learning for pricing American options we refer to [18]. Lastly we note that after a first preprint of the present paper was submitted, a number of research articles related to this work have appeared [15,16,19,21,26,27,28,30,38].

Outline
The structure of this article is as follows. Section 2 provides a derivation of the semi-explicit formula for the price of European maximum options in a standard Black-Scholes setting. This formula consists of an integral of a tensor product function. In Section 3 we develop some auxiliary regularity results for the cumulative normal distribution that are of independent interest and will be used later on. In Section 4 we show that the integral appearing in the formula of Section 2 can be efficiently approximated by numerical quadrature. Section 5 introduces some basic facts about deep ReLU networks, and Section 6 develops basic results for the approximation of functions which possess a tensor product structure. Finally, in Section 7 we show our main result, namely a spectral approximation rate for the approximation of European maximum options by deep ReLU networks without the curse of dimensionality. In Appendix A we collect some auxiliary proofs.

High-dimensional derivative pricing
In this section, we briefly review the Black-Scholes differential equation (1.3), which arises, among other contexts, as the Kolmogorov equation for multivariate geometric Brownian motion. This linear, parabolic equation is, for one particular type of financial contract (a so-called "European maximum option" on a basket of $d$ stocks whose log-returns are assumed for simplicity to be mutually uncorrelated), endowed with the terminal condition (1.4) and solved for $(t, x) \in [0, T] \times (0, \infty)^d$; the resulting semi-explicit pricing formula is stated in Proposition 2.1. For the proof of this proposition, we require the following well-known result.
We are now in a position to provide a proof of Proposition 2.1.
Proof of Proposition 2.1. The first equality follows directly from the Feynman-Kac formula [22, Corollary 4.17]. We proceed with a proof of the second equality. Throughout this proof let $X_i : \Omega \to \mathbb{R}$, $i \in \{1, 2, \dots, d\}$, be random variables which satisfy for every $i \in \{1, 2, \dots, d\}$ and let $Y : \Omega \to \mathbb{R}$ be the random variable given by Observe that for every $y \in (0, \infty)$ it holds Hence, we obtain that for every $y \in (0, \infty)$ it holds This shows that for every $y \in (0, \infty)$ it holds Combining this with Lemma 2.2 completes the proof of Proposition 2.1.

Regularity of the Cumulative Normal Distribution
Now that we have derived a semi-explicit formula for the solution, we establish regularity properties of the integrand in (2.9). These will be required in order to approximate the multivariate integrals by quadrature (which is subsequently realized by neural networks) in Section 4 and to apply the neural network results from Section 6 to our problem. To this end, we analyze the derivatives of the factors in the tensor product, which essentially are compositions of the cumulative normal distribution with the natural logarithm. As this function appears in numerous closed-form option pricing formulae (see, e.g., [31]), the Gevrey-type regularity estimates obtained in this section are of independent interest (they may, for example, also be used in the analysis of deep network expression rates and of spectral methods for option pricing).
Using the recursive formula from above we can now bound the derivatives of $f$. Note that the supremum of $f^{(n)}$ is actually attained on the interval $[e^{-4n}, 1]$ and scales with $n$ like $e^{cn^2}$ for some $c \in (0, \infty)$. This can be seen directly by calculating the maximum of the $g_{n,k}$ from (3.2). For our purposes, however, it is sufficient to establish that all derivatives of $f$ are bounded on $(0, \infty)$.
In the following corollary we estimate the derivatives of the function $x \mapsto f(\frac{K+c}{x})$ required to approximate this function by neural networks. Corollary 3.3. Let $n \in \mathbb{N}$, $K \in [0, \infty)$, $c, a \in (0, \infty)$, $b \in (a, \infty)$, let $f : (0, \infty) \to \mathbb{R}$ be the function which satisfies for every $t \in (0, \infty)$ that (3.26) Proof of Corollary 3.3. Throughout this proof let $\alpha_{m,j} \in \mathbb{Z}$, $m, j \in \mathbb{Z}$, be the integers which satisfy that for every $m, j \in \mathbb{Z}$ it holds Note that Lemma 3.1 and the chain rule ensure that the functions $f$ and $h$ are infinitely often differentiable. Next we claim that for every $m \in \mathbb{N}$, $x \in [a, b]$ it holds We prove (3.28) by induction on $m \in \mathbb{N}$. To prove the base case $m = 1$ we note that the chain rule ensures that for every This establishes (3.28) in the base case $m = 1$. For the induction step $\mathbb{N} \ni m \mapsto m + 1 \in \mathbb{N}$ observe that the chain rule implies for every $m \in \mathbb{N}$, This completes the proof of Corollary 3.3.
Next we consider the derivatives of the functions $c \mapsto f(\frac{K+c}{x_i})$, $i \in \{1, 2, \dots, d\}$, and of their tensor product, which will be needed in order to approximate the outer integral in (2.9) by composite Gaussian quadrature.
Corollary 3.4. Let $n \in \mathbb{N}$, $K \in [0, \infty)$, $x \in (0, \infty)$, let $f : (0, \infty) \to \mathbb{R}$ be the function which satisfies for every $t \in (0, \infty)$ that (3.34) and let $g : (0, \infty) \to \mathbb{R}$ be the function which satisfies for every $t \in (0, \infty)$ that Then it holds (i) that $f$ and $g$ are infinitely often differentiable and

Proof of Corollary 3.4. Combining Lemma 3.2 with the chain rule implies that for every $t \in (0, \infty)$ it holds This completes the proof of Corollary 3.4.

(3.38) and let $F : (0, \infty) \to \mathbb{R}$ be the function which satisfies for every $c \in (0, \infty)$ that Then it holds (i) that $f$ and $F$ are infinitely often differentiable and

Proof of Lemma 3.5. Note that Lemma 3.1 ensures that $f$ and $F$ are infinitely often differentiable. Moreover, observe that (3.39) and the general Leibniz rule imply for every $c \in (0, \infty)$ that (3.41) Next note that the fact that for every $r \in \mathbb{R}$ it holds that $e^{-\frac{1}{2}r^2} \ge 0$ ensures that Moreover, note that the multinomial theorem ensures that Combining this with (3.41), (3.43), and the assumption that $x \in [a, b]^d$ implies that for every $c \in (0, \infty)$ we have This completes the proof of Lemma 3.5.

Quadrature
To approximate the function $x \mapsto u(0, x)$ from (2.9) by a neural network we need to evaluate, for an arbitrary given $x$, an expression of the form $\int_0^\infty F_x(c)\,dc$ with $F_x$ as defined in Lemma 4.2. We achieve this by proving in Lemma 4.2 that the functions $F_x$ decay sufficiently fast as $c \to \infty$, and then employ numerical integration to show that the definite integral $\int_0^N F_x(c)\,dc$ can be approximated sufficiently well by a weighted sum of values $F_x(c_j)$ at suitable quadrature points $c_j \in (0, N)$. Such a sum can be realized by neural networks. We show in Sections 6 and 7 how the functions $x \mapsto F_x(c_j)$ for $c_j \in (0, N)$ can be realized efficiently due to their tensor product structure. We start by recalling an error bound for composite Gaussian quadrature which is explicit in the step size and quadrature order.
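The following sketch (illustrative only; parameter names are ours) shows the kind of composite Gauss-Legendre rule meant here: the truncated interval $(0, N)$ is split into panels of equal length and a fixed-order Gauss rule is applied on each panel.

```python
import numpy as np

def composite_gauss_legendre(F, N, n_panels, order):
    """Approximate int_0^N F(c) dc by a composite Gauss-Legendre rule with
    `n_panels` subintervals of equal length and `order` nodes per subinterval.
    F is assumed to accept numpy arrays (vectorized)."""
    nodes, weights = np.polynomial.legendre.leggauss(order)  # nodes/weights on [-1, 1]
    h = N / n_panels
    total = 0.0
    for k in range(n_panels):
        a, b = k * h, (k + 1) * h
        c = 0.5 * (a + b) + 0.5 * (b - a) * nodes             # map nodes to [a, b]
        w = 0.5 * (b - a) * weights
        total += np.dot(w, F(c))
    return total

# example: the rule is exact for polynomials of degree < 2*order on each panel
approx = composite_gauss_legendre(lambda c: np.exp(-c), N=10.0, n_panels=8, order=5)
```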
In the following we bound the error due to truncating the domain of integration.
and for every $\varepsilon \in (0, 1]$ let $N_\varepsilon \in \mathbb{R}$ be given by Proof of Lemma 4.2. Throughout this proof let $g : (0, \infty) \to (0, 1)$ be the function given by Furthermore, observe that for every $t \in [e^{2(n+1)}, \infty)$ it holds This, (4.11), and the fact that for every $\varepsilon$ (4.13) Combining this with the binomial theorem and the fact that for every This, the geometric sum formula, and the fact that for every $\varepsilon \in (0, 1]$ it holds that $N_\varepsilon \ge 2bd$ Hence, we obtain for every $\varepsilon$ This completes the proof of Lemma 4.2.
Next we combine the result above with Lemma 4.1 in order to derive the number of terms needed to approximate the integral by a sum to within a prescribed error bound $\varepsilon$.

and for every $d \in \mathbb{N}$, $\varepsilon \in (0, 1]$ let $N_{d,\varepsilon} \in \mathbb{R}$ be given by Proof of Lemma 4.3. Note that Lemma 3.5 ensures the existence of $S_m \in \mathbb{R}$, $m \in \mathbb{N}$, such that for every Next observe that Lemma 4.1 (with $N \leftrightarrow N_{d,\varepsilon}$ in the notation of Lemma 4.1) establishes the existence of (4.23) Moreover, note that Lemma 4.2 (with $N_{d,\varepsilon/2} \leftrightarrow N_{d,\varepsilon}$ in the notation of Lemma 4.2) and (4.23) imply for every The proof of Lemma 4.3 is thus completed.

Basic ReLU DNN Calculus
In order to talk about neural networks we will, up to some minor changes and additions, adopt the notation of P. Petersen and F. Voigtlaender from [36]. This allows us to differentiate between a neural network, defined as a structured set of weights, and its realization, which is a function on $\mathbb{R}^d$. Note that this is almost necessary in order to talk about the complexity of neural networks, since notions like depth, size or architecture do not make sense for general functions on $\mathbb{R}^d$. Even if we know that a given function 'is' a neural network, i.e. can be written as a series of affine transformations and componentwise nonlinearities, there are, in general, multiple non-trivially different ways to do so. Each of the structured sets we consider does, however, define a unique function. This enables us to explicitly and unambiguously construct complex neural networks from simple ones, and subsequently relate the approximation capability of a given network to its complexity. Further note that since the realization of a neural network is unique, we can still speak of a neural network approximating a given function when its realization does so.
Specifically, a neural network will be given by its architecture, i.e. the number of layers $L$ and layer dimensions $N_0, N_1, \dots, N_L$, as well as the weights determining the affine transformations used to compute each layer from the previous one. Note that our notion of neural networks does not attach the architecture and weights to a fixed activation function, but instead considers the realization of such a neural network with respect to a given activation function $\varrho \in C(\mathbb{R}, \mathbb{R})$. This choice is a purely technical one here, as we always consider networks with the ReLU activation function. The quantity $M(\Phi)$ simply denotes the number of non-zero entries of the network $\Phi$, which together with its depth $L(\Phi)$ will be how we measure the 'size' of a given neural network $\Phi$. One could instead consider the number of all weights, i.e. including zeroes, of a neural network. Note, however, that for any non-degenerate neural network $\Phi$ the total number of weights is bounded from above by $M(\Phi)^2 + M(\Phi)$. Here, the terminology "degenerate" refers to a neural network which has neurons that can be removed without changing the realization of the network. This implies that for any neural network there also exists a non-degenerate one of smaller or equal size which has the exact same realization. Since our primary goal is to approximate $d$-variate functions by networks whose size depends only polynomially on the dimension, the above means that qualitatively the same results hold regardless of which notion of 'size' is used.
We start by introducing two basic tools for constructing new neural networks from known ones and, in Lemma 5.3 and Lemma 5.4, consider how the properties of a derived network depend on its parts. Note that techniques like these have already been used in [36] and [39].
The first tool is the 'composition' of neural networks in (5.7), which takes two networks and provides a new network whose realization is the composition of the realizations of the two constituent networks.
The second tool is the 'parallelization' of neural networks in (5.12), which will be useful when considering linear combinations or tensor products of functions which we can already approximate. While the parallelization of same-depth networks (5.10) works for arbitrary activation functions, in the general case we use the fact that any ReLU network can easily be extended (5.11) to an arbitrary depth without changing its realization.
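A rough sketch of these two operations on networks stored as lists of (A, b) pairs is given below (hypothetical helper code; the constructions (5.7) and (5.10)-(5.12) in the paper are more careful about size and about unequal depths, e.g. via the extension (5.11)).

```python
import numpy as np

def compose(phi2, phi1):
    """Network realizing R(phi2) o R(phi1): the last affine map of phi1 and the
    first affine map of phi2 are merged into one affine map (a composition of
    affine maps is affine), so no additional activation is introduced."""
    A1, b1 = phi1[-1]
    A2, b2 = phi2[0]
    merged = (A2 @ A1, A2 @ b1 + b2)
    return phi1[:-1] + [merged] + phi2[1:]

def parallelize(phi1, phi2):
    """Same-depth parallelization: the realization maps a concatenated input
    (x1, x2) to (R(phi1)(x1), R(phi2)(x2)) via block-diagonal weight matrices;
    this works for any activation function."""
    assert len(phi1) == len(phi2), "this naive version requires equal depth"
    layers = []
    for (A1, b1), (A2, b2) in zip(phi1, phi2):
        A = np.block([[A1, np.zeros((A1.shape[0], A2.shape[1]))],
                      [np.zeros((A2.shape[0], A1.shape[1])), A2]])
        layers.append((A, np.concatenate([b1, b2])))
    return layers
```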

Basic Expression Rate Results
Here we begin by establishing an expression rate result for a very simple function, namely $x \mapsto x^2$ on $[0, 1]$. Our approach is based on the observation by M. Telgarsky [42] that neural networks with ReLU activation function can efficiently compute high-frequency sawtooth functions, and the idea of D. Yarotsky in [44] to use this in order to approximate the function $x \mapsto x^2$ by networks computing its linear interpolations. This can then be used to derive networks capable of efficiently approximating $(x, y) \mapsto xy$, which leads to tensor products as well as polynomials and subsequently smooth functions. Note that [44] uses a slightly different notion of neural networks, where connections between non-adjacent layers are permitted. This, however, only requires a technical modification of the proof, which does not significantly change the result. Nonetheless, the respective proofs are provided in the appendix for completeness. Lemma 6.1. Assume Setting 5.1 and let $\varrho : \mathbb{R} \to \mathbb{R}$ be the ReLU activation function given by $\varrho(t) = \max\{0, t\}$. Then there exist neural networks $(\sigma_\varepsilon)_{\varepsilon \in (0,\infty)} \subseteq \mathcal{N}$ such that for every $\varepsilon \in (0, \infty)$ We can now derive the following result on approximate multiplication by neural networks, by observing that $xy = 2B^2\big(|(x + y)/2B|^2 - |x/2B|^2 - |y/2B|^2\big)$ for every $B \in (0, \infty)$, $x, y \in \mathbb{R}$. Lemma 6.2. Assume Setting 5.1, let $B \in (0, \infty)$, and let $\varrho : \mathbb{R} \to \mathbb{R}$ be the ReLU activation function given by $\varrho(t) = \max\{0, t\}$. Then there exist neural networks $(\mu_\varepsilon)_{\varepsilon \in (0,\infty)} \subseteq \mathcal{N}$ which satisfy for every $\varepsilon \in (0, \infty)$ that Next we extend this result to products of any number of factors by hierarchical, pairwise multiplication.
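To make the construction concrete, the sketch below (simplified; the constants differ from those in Lemmas 6.1 and 6.2) assembles the piecewise linear interpolant of $x \mapsto x^2$ from composed hat (sawtooth) functions and then uses the polarization identity quoted above for an approximate product.

```python
import numpy as np

def hat(x):
    """Hat function g on [0, 1]; k-fold compositions of g are sawtooth functions
    with 2**(k-1) teeth and are exactly representable by small ReLU networks."""
    x = np.clip(x, 0.0, 1.0)
    return 2.0 * np.minimum(x, 1.0 - x)

def square_approx(x, m):
    """Yarotsky-type approximation of x**2 on [0, 1]: the piecewise linear
    interpolant at 2**m + 1 equispaced points, with uniform error 2**(-2m - 2)."""
    x = np.asarray(x, dtype=float)
    result, g = x.copy(), x.copy()
    for k in range(1, m + 1):
        g = hat(g)                      # g is now the k-fold composition
        result = result - g / 4.0**k
    return result

def product_approx(x, y, m, B=1.0):
    """Approximate x*y for x, y in [-B, B] via the polarization identity
    xy = 2*B**2*(|(x+y)/(2B)|**2 - |x/(2B)|**2 - |y/(2B)|**2)."""
    sq = lambda t: square_approx(np.abs(t), m)
    return 2.0 * B**2 * (sq((x + y) / (2 * B)) - sq(x / (2 * B)) - sq(y / (2 * B)))
```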
With the above established, it is quite straightforward to obtain the following result for the approximation of tensor products. Note that the exponential term $B^{m-1}$ in (iii) is unavoidable, as it results from multiplying $m$ inaccurate values of magnitude $B$. For our purposes this will not be an issue, since the functions we consider are bounded in absolute value by $B = 1$. It is, moreover, not an issue in cases where the $h_j$ can be approximated by networks whose size scales only logarithmically in the accuracy.
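A sketch of the hierarchical, pairwise multiplication of $m$ factors referred to above, reusing `product_approx` from the previous sketch: arranging the pairwise products in a binary tree keeps the depth proportional to $\log_2(m)$, while factor inaccuracies of magnitude $B$ can amplify at most like $B^{m-1}$ (harmless here since $B = 1$).

```python
def product_of_factors(values, m, B=1.0):
    """Approximate the product of a list of values in [-B, B] (with B <= 1, so
    all intermediate products stay in [-B, B]) by pairwise approximate
    multiplications arranged as a binary tree; uses product_approx from above."""
    level = list(values)
    while len(level) > 1:
        nxt = [product_approx(level[i], level[i + 1], m, B)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])        # odd factor is passed through unchanged
        level = nxt
    return level[0]
```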
Another way to use the multiplication results is to consider the approximation of smooth functions by polynomials. This can be done for functions of arbitrary dimension using the multivariate Taylor expansion (see [44] and [33, Thm. 2.3]). Such a direct approach, however, yields networks whose size depends exponentially on the dimension of the function. As our goal is to show that high-dimensional functions with a tensor product structure can be approximated by networks with only polynomial dependence on the dimension, we only consider univariate smooth functions here. In the appendix we present a detailed and explicit construction of this Taylor approximation by neural networks. In the following results we employ an auxiliary parameter $r$, so that the bounds on the depth and connectivity of the networks may be stated for all $\varepsilon \in (0, \infty)$; note that this parameter does not influence the construction of the networks themselves. For convenience of use we also provide the following more general corollary.

DNN Expression Rates for High-Dimensional Basket prices
Now that we have established a number of general expression rate results, we can apply them to our specific problem. Using the regularity result (3.3) we obtain the following.
We can then employ Proposition 6.4 in order to approximate the required tensor product.

Discussion
While Theorem 7.3 only establishes formally that the solution of one specific high-dimensional PDE may be approximated by neural networks without the curse of dimensionality, the constructive approach also serves to illustrate that neural networks are capable of accomplishing the same for any PDE solution which exhibits a similar low-rank structure. Note here that the tensor product construction in Proposition 6.4 only introduces a logarithmic dependency on the approximation accuracy. That we only end up with a spectral rate in this specific case is due to Proposition 6.4 and Lemma 4.3, i.e. the insufficient regularity of the univariate functions inside the tensor product as well as the number of terms required by the Gaussian quadrature used to approximate the outer integral. In particular, this means that the approach in Section 6 might also be used to produce approximation results with connectivity growing only logarithmically in the inverse of the approximation error, provided one has a suitably well-behaved low-rank structure.
The present result is a promising step towards higher-order numerical solution of high-dimensional PDEs, which are notoriously troublesome to handle with any of the classical approaches based on discretization of the domain, or with randomized (i.e. Monte Carlo based) arguments. Of course, answering the question of approximability can only ensure that there exist networks with a reasonable size-to-accuracy trade-off; for any practical purpose it is also necessary to establish whether and how one can find these networks.
An analysis of the generalization error for linear Kolmogorov equations can be found in [4], which concludes that, under reasonable assumptions, the number of required Monte Carlo samples is free of the curse of dimensionality. Moreover, there are a number of empirical results [2,3,10,23,41] which suggest that the solutions of various high-dimensional PDEs may be learned efficiently using standard stochastic gradient descent based methods. However, a satisfying formal analysis of this training process does not seem to be available at present.
Lastly, we would like to point out that, even though we had a semi-explicit formula available, the ReLU networks we used for the approximation were in no way adapted to exploit this knowledge. ReLU networks have, moreover, been shown to exhibit excellent approximation properties for, e.g., piecewise smooth functions [36], affine and Gabor systems [12], and even fractal structures [9]. So, while a spline dictionary based approach specifically designed for the approximation of this one PDE solution may achieve similar rates, it would most certainly lack the remarkable universality of neural networks.
A.2 Proof of Lemma 6.1

Proof of Lemma 6.1. The proof follows [44]. We provide it in order to make explicit the values of the constants in the bounds on depth and width, and to reveal the dependence on the scaling parameter B.