1 Introduction

1.1 Motivation

The development of new classification and regression algorithms based on deep neural networks—coined “deep learning”—revolutionized the areas of artificial intelligence, machine learning and data analysis [15]. More recently, these methods have been applied to the numerical solution of partial differential equations (PDEs for short) [3, 12, 21, 22, 27, 32, 39, 41, 42]. In these works, it has been empirically observed that deep learning-based methods work exceptionally well when used for the numerical solution of high-dimensional problems arising in option pricing. The numerical experiments carried out in [2, 3, 21, 42] in particular suggest that deep learning-based methods may not suffer from the curse of dimensionality for these problems, but only a few theoretical results exist which support this claim: In [38], a first theoretical result on rates of expression of infinite-variate generalized polynomial chaos expansions for solution manifolds of certain classes of parametric PDEs has been obtained. Furthermore, recent work [4, 18] shows that the algorithms introduced in [2] for the numerical solution of Kolmogorov PDEs are free of the curse of dimensionality in terms of network size and training sample complexity.

Neural networks constitute a parametrized class of functions constructed by successive applications of affine mappings and coordinatewise nonlinearities; see [35] for a mathematical introduction. As in [34], we introduce a neural network via a tuple of matrix-vector pairs

$$\begin{aligned} \Phi= & {} (((A^1_{i,j})_{i,j=1}^{N_1,N_0},(b^1_i)_{i=1}^{N_1}),\dots , ((A^L_{i,j})_{i,j=1}^{N_L,N_{L-1}},(b^L_i)_{i=1}^{N_L}))\\&\in \times _{l=1}^L\left( {{\mathbb {R}}^{N_l\times N_{l-1}}\times {\mathbb {R}}^{N_l}}\right) \end{aligned}$$

for given hyperparameters \(L\in {\mathbb {N}}\), \(N_0,N_1,\dots ,N_L\in {\mathbb {N}}\). Given an “activation function” \(\varrho \in C({\mathbb {R}},{\mathbb {R}})\), a neural network \(\Phi \) then describes a function \(R_\varrho (\Phi )\in C({\mathbb {R}}^{N_0},{\mathbb {R}}^{N_L})\) that can be evaluated by the recursion

$$\begin{aligned} x_l = \varrho ( A^l x_{l-1} + b^l), \quad l=1,\dots , L-1,\qquad \left[ { R_{ \varrho }( \Phi ) }\right] ( x_0 ) = A^L x_{ L - 1 } + b^L. \end{aligned}$$
(1.1)

The number of nonzero entries in the matrix-vector tuples defining \(\Phi \) is called the size of \(\Phi \) and will be denoted by \({\mathcal {M}}(\Phi )\); the depth of the network \(\Phi \), i.e., its number of affine transformations, will be denoted by \({\mathcal {L}}(\Phi )\). We refer to Setting 5.1 for a more detailed description. A popular activation function \(\varrho \) is the so-called rectified linear unit \(\mathrm {ReLU}(x)=\max \{x,0\}\) [15].
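To make these definitions concrete, the following minimal Python sketch (our illustration, not code from the cited works; all names and the example architecture are ours) stores a network \(\Phi \) as a list of matrix-vector pairs, evaluates its realization via the recursion (1.1) and computes the size \({\mathcal {M}}(\Phi )\) and depth \({\mathcal {L}}(\Phi )\).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def realization(phi, x0, rho=relu):
    """Evaluate R_rho(Phi) at x0 via the recursion (1.1).

    phi is a list of (A_l, b_l) pairs, l = 1, ..., L.
    """
    x = np.asarray(x0, dtype=float)
    for A, b in phi[:-1]:          # hidden layers l = 1, ..., L-1 with activation
        x = rho(A @ x + b)
    A_L, b_L = phi[-1]             # final layer is affine only
    return A_L @ x + b_L

def size(phi):
    # M(Phi): number of nonzero entries of all weight matrices and bias vectors
    return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b)) for A, b in phi)

def depth(phi):
    # L(Phi): number of affine transformations
    return len(phi)

# example with hyperparameters L = 2, N_0 = 2, N_1 = 3, N_2 = 1
rng = np.random.default_rng(0)
phi = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
       (rng.standard_normal((1, 3)), rng.standard_normal(1))]
print(realization(phi, [1.0, -2.0]), size(phi), depth(phi))
```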

An increasing body of research addresses the approximation properties (or “expressive power”) of deep neural networks, where by “approximation properties” we mean the study of the optimal trade-off between the size \({\mathcal {M}}(\Phi )\) and the approximation error \(\Vert u-R_\varrho (\Phi )\Vert \) of neural networks approximating functions u from a given function class. Classical references include [1, 7, 8, 23] as well as the summary [35] and the references therein. In these works, it is shown that deep neural networks provide optimal approximation rates for classical smoothness spaces such as Sobolev spaces or Besov spaces. More recently, these results have been extended to Shearlet and Ridgelet spaces [5], modulation spaces [33], piecewise smooth functions [34] and polynomial chaos expansions [38]. All these results indicate that all classical approximation methods based on sparse expansions can be emulated by neural networks.

1.2 Contributions and Main Result

As a first main contribution of this work, we show in Proposition 6.4 that low-rank functions of the form

$$\begin{aligned} (x_1,\dots , x_d)\in {\mathbb {R}}^d \mapsto \sum _{s=1}^Rc_s\prod _{j=1}^dh_j^s(x_j), \end{aligned}$$
(1.2)

with \(h_j^s\in C({\mathbb {R}},{\mathbb {R}})\) sufficiently regular and \((c_s)_{s=1}^R\subseteq {\mathbb {R}}\) can be approximated to a given relative precision by deep ReLU neural networks of size scaling like \(Rd^2\). In particular, we obtain a dependence on the dimension d that is only polynomial and not exponential, i.e., we avoid the curse of dimensionality. In other words, we show that, in addition to classical approximation methods based on sparse expansions, methods based on more general low-rank structures can be emulated by neural networks. Since the solutions of several classes of high-dimensional PDEs are precisely of this form (see, e.g., [38]), our approximation results can be directly applied to these problems to establish approximation rates for neural network approximations that do not suffer from the curse of dimensionality. Note that approximation results for functions of the form (1.2) have previously been considered in [37] in the context of statistical bounds for nonparametric regression.
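For illustration, the following sketch (ours; the univariate factors \(h_j^s\) are arbitrary smooth placeholder functions) evaluates a rank-R function of the form (1.2); Proposition 6.4 asserts that such functions admit ReLU approximations whose size grows like \(Rd^2\) rather than exponentially in d.

```python
import numpy as np

def low_rank(x, c, h):
    """Evaluate sum_{s=1}^R c_s * prod_{j=1}^d h_j^s(x_j), cf. (1.2).

    x: point in R^d, c: R coefficients, h: list of R lists of d callables.
    """
    return sum(c_s * np.prod([h_sj(x_j) for h_sj, x_j in zip(h_s, x)])
               for c_s, h_s in zip(c, h))

# rank R = 2 example in d = 4 variables with smooth placeholder factors
d = 4
c = [1.0, -0.5]
h = [[np.cos] * d,
     [lambda t: 1.0 / (1.0 + t * t)] * d]
print(low_rank(np.linspace(0.0, 1.0, d), c, h))
```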

Moreover, we remark that the networks realizing the product in (1.2) itself have a connectivity scaling which is logarithmic in the accuracy \(\varepsilon ^{-1}\). While we will, for our concrete example, only obtain a spectral connectivity scaling, i.e., like \(\varepsilon ^{-{\frac{1}{n}}}\) for any \(n\in {\mathbb {N}}\) with the implicit constant depending on n, this tensor construction may be used to obtain logarithmic scaling (w.r.t. the accuracy) for d-variate functions in cases where the univariate \(h_j^s\) can be approximated with a logarithmic scaling.

As a particular application of the tools developed in the present paper, we provide a mathematical analysis of the rates of expressive power of neural networks for a particular, high-dimensional PDE which arises in mathematical finance, namely the pricing of a so-called European maximum option (see, e.g., [43]).

We consider the particular (and not quite realistic) situation in which the log-returns of the d underlying assets are uncorrelated, i.e., they evolve according to d uncorrelated drifted scalar diffusion processes.

The price of the European maximum option on this basket of d assets can then be obtained as the solution of the multivariate Black–Scholes equation which reads, for the presently considered case of uncorrelated assets, as

$$\begin{aligned} \textstyle ( \tfrac{ \partial }{ \partial t } u )(t,x) + \mu \sum \limits _{ i = 1 }^d x_i \big ( \tfrac{ \partial }{ \partial x_i } u \big )(t,x) + \tfrac{ \sigma ^2 }{ 2 } \sum \limits _{ i = 1 }^d | x_i |^2 \big ( \tfrac{ \partial ^2 }{ \partial x_i^2 } u \big )(t,x) = 0 \;. \end{aligned}$$
(1.3)

For the European maximum option, (1.3) is completed with the terminal condition

$$\begin{aligned} u(T,x) = \varphi (x) = \max \{ x_1 - K_1, x_2 - K_2, \dots , x_d - K_d , 0 \} \end{aligned}$$
(1.4)

for \( x = ( x_1, \dots , x_d ) \in (0,\infty )^d \). It is well known (see, e.g., [11, 20] and the references therein) that there exists a unique solution of (1.3)–(1.4). This solution can be expressed as a conditional expectation of the function \(\varphi \) in (1.4) over suitable sample paths of a d-dimensional diffusion.

One main result of this paper is the following theorem (stated with fully detailed assumptions as Theorem 7.3) on expression rates of deep neural networks for the basket option price u(0, x) for \(x\in [a,b]^d\) and some \(0<a<b< \infty \). To render the dependence on the number d of assets in the basket explicit, we write \(u_d\) in the statement of the theorem.

Theorem 1.1

Let \(n\in {\mathbb {N}}\), \(\mu \in {\mathbb {R}}\), \(T,\sigma ,a,K_{\mathrm {max}}\in (0,\infty )\), \(b\in (a,\infty )\), \((K_i)_{i\in {\mathbb {N}}}\subseteq [0,K_{\mathrm {max}})\), and let \(u_d:[0,T]\times (0,\infty )^d\rightarrow {\mathbb {R}}\), \(d\in {\mathbb {N}}\), be the functions which satisfy, for every \(d\in {\mathbb {N}}\) and every \((t,x) \in [0,T]\times (0,\infty )^d\), the equation (1.3) with terminal condition (1.4).

Then there exist neural networks \((\Gamma _{d,\varepsilon })_{\varepsilon \in (0,1],d\in {\mathbb {N}}}\) which satisfy

  (i) \(\displaystyle \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {L}}(\Gamma _{d,\varepsilon })}{\max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+1}\right) }}\right] <\infty \),

  (ii) \(\displaystyle \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {M}}(\Gamma _{d,\varepsilon })}{d^{2+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] <\infty \), and

  (iii) for every \(\varepsilon \in (0,1]\), \(d\in {\mathbb {N}}\),

    $$\begin{aligned} \sup _{x\in [a,b]^d}\left| {u_d(0,x)-\left[ {R_{\mathrm {ReLU}}(\Gamma _{d,\varepsilon })}\right] \!(x)}\right| \le \varepsilon . \end{aligned}$$
    (1.5)

Informally speaking, the previous result states that, for arbitrary fixed \(n\in {\mathbb {N}}\), the price of a d-dimensional European maximum option can be expressed on cubes \([a,b]^d\) by deep neural networks to pointwise accuracy \(\varepsilon >0\) with network size bounded as \({\mathcal {O}}(d^{2+1/n} \varepsilon ^{-1/n})\), where the constant implied in \({\mathcal {O}}(\cdot )\) is independent of d and of \(\varepsilon \) (but depends on n). In other words, the price of a European maximum option on a basket of d assets can be approximated (or “expressed”) by deep ReLU networks with spectral accuracy and without curse of dimensionality.

The proof of this result is based on a near-explicit expression for the function \(u_d(0,x)\) (see Sect. 2). It combines this expression with regularity estimates in Sect. 3 and with a neural network quadrature calculus and corresponding error estimates (of independent interest) in Sect. 4 to show that the function \(u_d(0,x)\) possesses an approximate low-rank representation consisting of tensor products of cumulative normal distribution functions (Lemma 4.3), to which the low-rank approximation result mentioned above can be applied.

Related results have been shown in the recent work [18] which proves (by completely different methods) that solutions to general Kolmogorov equations with affine drift and diffusion terms can be approximated by neural networks of a size that scales polynomially in the dimension and the reciprocal of the desired accuracy as measured by the \(L^p\) norm with respect to a given probability measure. The approximation estimates developed in the present paper only apply to the European maximum option pricing problem for uncorrelated assets but hold with respect to the much stronger \(L^\infty \) norm and provide spectral accuracy in \(\varepsilon \) (as opposed to a low-order polynomial rate obtained in [18]), which is a considerable improvement. In summary, compared to [18], the present paper treats a more restricted problem but achieves stronger approximation results.

In order to give some context to our approximation results, we remark that solutions to Kolmogorov PDEs may, under reasonable assumptions, be approximated by empirical risk minimization over a neural network hypothesis class. The key here is the Feynman–Kac formula, which allows one to write the solution to the PDE as the expectation of an associated stochastic process. This expectation can be approximated by Monte Carlo integration, i.e., one can view it as a neural network training problem where the data are generated by Monte Carlo sampling methods which, under suitable conditions, are capable of avoiding the curse of dimensionality. For more information on this, we refer to [4].
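As a minimal sketch of this data-generation step (ours, following the idea described above rather than any specific algorithm from the cited works; all parameter choices are placeholders), one may draw uniformly distributed spatial points and a single sample of the terminal value of the diffusion per point, so that the regression target is an unbiased Monte Carlo sample of the expectation described above.

```python
import numpy as np

def sample_training_data(m, d, a, b, T, mu, sigma, K, seed=0):
    """Draw m pairs (x, y) with E[y | x] = u(0, x) for the PDE (1.3)-(1.4).

    x is uniform on [a, b]^d; y is the payoff (1.4) evaluated at one sample of
    the terminal value exp((mu - sigma^2/2) T + sigma W_T) * x of the diffusion.
    K may be a scalar strike or an array of length d.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(a, b, size=(m, d))
    w = rng.standard_normal((m, d)) * np.sqrt(T)
    s_T = np.exp((mu - 0.5 * sigma**2) * T + sigma * w) * x
    y = np.maximum(np.max(s_T - K, axis=1), 0.0)
    return x, y

x, y = sample_training_data(m=10_000, d=5, a=0.9, b=1.1,
                            T=1.0, mu=0.0, sigma=0.2, K=1.0)
print(x.shape, y.mean())
```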

While we admit that the European maximum option pricing problem for uncorrelated assets constitutes a rather special problem, the proofs in this paper develop several novel deep neural network approximation results of independent interest that can be applied to more general settings where a low-rank structure is implicit in high-dimensional problems. For mostly numerical results on machine learning for pricing American options, we refer to [16]. Lastly, we note that after a first preprint of the present paper was submitted, a number of research articles related to this work have appeared [13, 14, 17, 19, 24, 25, 26, 28, 36].

1.3 Outline

The structure of this article is as follows. Section 2 provides a derivation of the semi-explicit formula for the price of European maximum options in a standard Black–Scholes setting. This formula consists of an integral of a tensor product function. In Sect. 3, we develop some auxiliary regularity results for the cumulative normal distribution which are of independent interest and which will be used later on. In Sect. 4, we show that the integral appearing in the formula of Sect. 2 can be efficiently approximated by numerical quadrature. Section 5 introduces some basic facts related to deep ReLU networks, and Sect. 6 develops basic approximation results for the approximation of functions which possess a tensor product structure. Finally, in Sect. 7 we show our main result, namely a spectral approximation rate for the approximation of European maximum options by deep ReLU networks without curse of dimensionality. In Appendix A, we collect some auxiliary proofs.

2 High-Dimensional Derivative Pricing

In this section, we briefly review the Black–Scholes differential equation (1.3), which arises, among other contexts, as the Kolmogorov equation for multivariate geometric Brownian motion. For one particular type of financial contract (the so-called European maximum option on a basket of d stocks, whose log-returns are assumed for simplicity to be mutually uncorrelated), this linear, parabolic equation is endowed with the terminal condition (1.4) and solved for \((t,x)\in [0,T]\times (0,\infty )^d\).

Proposition 2.1

Let \( d \in {\mathbb {N}}\), \( \mu \in {\mathbb {R}}\), \( \sigma , T, K_1, \dots , K_d, \xi _1, \dots , \xi _d \in (0,\infty ) \), let \( ( \Omega , {\mathcal {F}}, {\mathbb {P}}) \) be a probability space, and let \( W = ( W^{ (1) }, \dots , W^{ (d) } ) :[0,T] \times \Omega \rightarrow {\mathbb {R}}^d \) be a standard Brownian motion and let \(u\in C([0,T]\times (0,\infty )^d)\) satisfy (1.3) and (1.4). Then for \(x = (\xi _1,\dots , \xi _d)\in (0,\infty )^d\) it holds that

$$\begin{aligned} \begin{aligned} u(0,x) =&{\mathbb {E}}\!\left[ \max _{ i \in \{ 1, 2, \dots , d \} } \left( \max \!\left\{ \exp \!\big ( \big [ \mu - \tfrac{ \sigma ^2 }{ 2 } \big ] T + \sigma W_T^{ (i) } \big ) \, \xi _i - K_i , 0 \right\} \right) \right] \\ =&\int _0^{ \infty } 1 - \left[ \textstyle \prod \limits _{ i = 1 }^d \left( \int _{ - \infty }^{ \frac{ 1 }{ \sigma \sqrt{ T } } \left[ \ln \left( \frac{ y + K_i }{ \xi _i } \right) - \left( \mu - [ \nicefrac { \sigma ^2 }{ 2 } ] \right) T \right] } \tfrac{ 1 }{ \sqrt{ 2 \pi } } \, \exp \!\left( - \frac{ r^2 }{ 2 } \right) \mathrm{d}r \right) \right] \mathrm{d}y . \end{aligned}\nonumber \\ \end{aligned}$$
(2.1)

For the proof of this Proposition, we require the following well-known result.

Lemma 2.2

(Complementary distribution function formula) Let \( \mu :{\mathcal {B}}( [0,\infty ) ) \rightarrow [0,\infty ] \) be a sigma-finite measure. Then

$$\begin{aligned} \int _0^{ \infty } x \, \mu ( dx ) = \int _0^{ \infty } \mu ( [x,\infty ) ) \, dx . \end{aligned}$$
(2.2)

We are now in a position to provide a proof of Proposition 2.1.

Proof of Proposition 2.1

The first equality follows directly from the Feynman–Kac formula [20, Corollary 4.17]. We proceed with a proof of the second equality. Throughout this proof, let \(X_i :\Omega \rightarrow {\mathbb {R}}\), \(i \in \{ 1, 2, \dots , d \}\), be random variables which satisfy for every \( i \in \{ 1, 2, \dots , d \} \)

$$\begin{aligned} X_i = \exp \!\big ( \big [ \mu - \tfrac{ \sigma ^2 }{ 2 } \big ] T + \sigma W_T^{ (i) } \big ) \, \xi _i \end{aligned}$$
(2.3)

and let \( Y :\Omega \rightarrow {\mathbb {R}}\) be the random variable given by

$$\begin{aligned} Y = \max \{ X_1 - K_1 , \dots , X_d - K_d , 0 \} . \end{aligned}$$
(2.4)

Observe that for every \( y \in (0,\infty ) \) it holds

$$\begin{aligned} {\mathbb {P}}\!\left( Y \ge y \right)&= 1 - {\mathbb {P}}\!\left( Y< y \right) = 1 - {\mathbb {P}}\!\left( \max _{ i \in \{ 1, 2, \dots , d \} } \left( X_i - K_i \right)< y \right) \nonumber \\&= 1 - {\mathbb {P}}\!\left( \cap _{ i \in \{ 1, 2, \dots , d \} } \left\{ X_i - K_i< y \right\} \right) = 1 - \textstyle \prod \limits _{ i = 1 }^d {\mathbb {P}}\!\left( X_i - K_i< y \right) \nonumber \\&= 1 - \textstyle \prod \limits _{ i = 1 }^d {\mathbb {P}}\!\left( X_i< y + K_i \right) \nonumber \\&= 1 - \textstyle \prod \limits _{ i = 1 }^d {\mathbb {P}}\!\left( \exp \!\big ( \big [ \mu - \tfrac{ \sigma ^2 }{ 2 } \big ] T + \sigma W_T^{ (i) } \big ) \, \xi _i < y + K_i \right) . \end{aligned}$$
(2.5)

Hence, we obtain that for every \( y \in (0,\infty ) \) it holds

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\!\left( Y \ge y \right)&= 1 - \textstyle \prod \limits _{ i = 1 }^d {\mathbb {P}}\!\left( \exp \!\big ( \big [ \mu - \tfrac{ \sigma ^2 }{ 2 } \big ] T + \sigma W_T^{ (i) } \big )< \frac{ y + K_i }{ \xi _i } \right) \\&= 1 - \textstyle \prod \limits _{ i = 1 }^d {\mathbb {P}}\!\left( \sigma W_T^{ (i) }< \ln \!\left( \frac{ y + K_i }{ \xi _i } \right) - \big [ \mu - \tfrac{ \sigma ^2 }{ 2 } \big ] T \right) \\&= 1 - \textstyle \prod \limits _{ i = 1 }^d {\mathbb {P}}\!\left( \frac{ 1 }{ \sqrt{T} } W_T^{ (i) } < \frac{ 1 }{ \sigma \sqrt{ T } } \left[ \ln \!\left( \frac{ y + K_i }{ \xi _i } \right) - \big [ \mu - \tfrac{ \sigma ^2 }{ 2 } \big ] T \right] \right) . \end{aligned} \end{aligned}$$
(2.6)

This shows that for every \( y \in (0,\infty ) \) it holds

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\!\left( Y \ge y \right)&= 1 - \left[ \textstyle \prod \limits _{ i = 1 }^d \left( \int _{ - \infty }^{ \frac{ 1 }{ \sigma \sqrt{ T } } \left[ \ln \left( \frac{ y + K_i }{ \xi _i } \right) - \left( \mu - [ \nicefrac { \sigma ^2 }{ 2 } ] \right) T \right] } \tfrac{ 1 }{ \sqrt{ 2 \pi } } \, \exp \!\left( - \frac{ r^2 }{ 2 } \right) \mathrm{d}r \right) \right] . \end{aligned} \end{aligned}$$
(2.7)

Combining this with Lemma 2.2 completes the proof of Proposition 2.1. \(\square \)

With Lemma 2.2 and Proposition 2.1, we may write

$$\begin{aligned} \begin{aligned} u(0,x) = {\mathbb {E}}\!\left[ \varphi \!\left( \exp \!\left( \left[ \mu - \nicefrac { \sigma ^2 }{ 2 } \right] T + \sigma W^{ (1) }_T \right) x_1 , \ldots , \exp \!\left( \left[ \mu - \nicefrac { \sigma ^2 }{ 2 } \right] T + \sigma W^{ (d) }_T \right) x_d \right) \right] \end{aligned} \end{aligned}$$
(2.8)

(“semi-explicit” formula). Let us consider the case \( \mu = \sigma ^2 / 2 \), \( T = \sigma = 1 \) and \( K_1 = \ldots = K_d = K \in (0,\infty ) \). Then for every \( x = ( x_1, \dots , x_d) \in (0,\infty )^d \)

$$\begin{aligned} \begin{aligned} u(0,x)&= {\mathbb {E}}\!\left[ \varphi \!\left( e^{ W^{ (1) }_T } x_1 , \ldots , e^{ W^{ (d) }_T } x_d \right) \right] = {\mathbb {E}}\!\left[ \varphi \!\left( e^{ W^{ (1) }_1 } x_1 , \ldots , e^{ W^{ (d) }_1 } x_d \right) \right] \\&= {\mathbb {E}}\!\left[ \max \!\left\{ e^{ W^{ (1) }_1 } x_1 - K , \ldots , e^{ W^{ (d) }_1 } x_d - K , 0 \right\} \right] \\&= \int _0^{ \infty } 1 - \left[ \prod _{ i = 1 }^d \int _{ - \infty }^{ \ln ( \frac{ K + c }{ x_i } ) } \tfrac{ 1 }{ \sqrt{ 2 \pi } } \, \exp \!\left( - \tfrac{ r^2 }{ 2 } \right) \mathrm{d}r \right] \mathrm{d}c . \end{aligned} \end{aligned}$$
(2.9)
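As a quick numerical sanity check of (2.9) (our sketch; the sample size, the truncation point of the outer integral and the parameter choices are ad hoc), one can compare a Monte Carlo estimate of the expectation in the special case \(\mu =\sigma ^2/2\), \(T=\sigma =1\), \(K_1=\dots =K_d=K\) with a numerical evaluation of the one-dimensional integral on the right-hand side.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

d, K = 5, 1.0
x = np.linspace(0.9, 1.1, d)          # spot prices x_1, ..., x_d
rng = np.random.default_rng(0)

# Monte Carlo estimate of E[ max{ e^{W_1^(i)} x_i - K, 0 } ] as in (2.9)
w = rng.standard_normal((10**6, d))
mc = np.maximum(np.max(np.exp(w) * x - K, axis=1), 0.0).mean()

# one-dimensional integral from (2.9); the integrand is negligible beyond the cutoff
def integrand(c):
    return 1.0 - np.prod(norm.cdf(np.log((K + c) / x)))

integral, _ = quad(integrand, 0.0, 200.0, limit=500)
print(mc, integral)   # the two values agree up to the Monte Carlo error
```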

3 Regularity of the Cumulative Normal Distribution

Now that we have derived a semi-explicit formula for the solution, we establish regularity properties of the integrand in (2.9). These will be required in order to approximate the integrals by quadratures (which are subsequently realized by neural networks) in Sect. 4 and to apply the neural network results from Sect. 6 to our problem. To this end, we analyze the derivatives of the factors in the tensor product, which are essentially compositions of the cumulative normal distribution with the natural logarithm. As this function appears in numerous closed-form option pricing formulae (see, e.g., [29]), the Gevrey-type regularity estimates obtained in this section are of independent interest. (They may, for example, also be used in the analysis of deep network expression rates and of spectral methods for option pricing.)

Lemma 3.1

Let \(f:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in (0,\infty )\) that

$$\begin{aligned} f(t)=\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (t)}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r, \end{aligned}$$
(3.1)

let \(g_{n,k}:(0,\infty )\rightarrow {\mathbb {R}}\), \(n,k\in {\mathbb {N}}_0\), be the functions which satisfy for every \(n,k\in {\mathbb {N}}_0\), \(t\in (0,\infty )\) that

$$\begin{aligned} g_{n,k}(t)=t^{-n}e^{-\frac{1}{2}[\ln (t)]^2}[\ln (t)]^k, \end{aligned}$$
(3.2)

and let \((\gamma _{n,k})_{n,k\in {\mathbb {Z}}}\subseteq {\mathbb {Z}}\) be the integers which satisfy for every \(n,k\in {\mathbb {Z}}\) that

$$\begin{aligned} \gamma _{n,k} = {\left\{ \begin{array}{ll}1 &{} :n=1,k=0\\ -\gamma _{n-1,k-1}-(n-1)\gamma _{n-1,k}+(k+1)\gamma _{n-1,k+1} &{} :n>1, 0\le k<n\\ 0 &{} :\mathrm {else} \end{array}\right. }. \nonumber \\ \end{aligned}$$
(3.3)

Then it holds for every \(n\in {\mathbb {N}}\) that

  (i) we have that f is n times continuously differentiable and

  (ii) we have for every \(t\in (0,\infty )\) that

    $$\begin{aligned} f^{(n)}(t)=\tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\gamma _{n,k}\,g_{n,k}(t)}\right] . \end{aligned}$$
    (3.4)

Proof of Lemma 3.1

We prove (i) and (ii) by induction on \(n\in {\mathbb {N}}\). For the base case \(n=1\) note that (3.1), (3.2), (3.3), the fact that the function \({\mathbb {R}}\ni r\mapsto e^{-\frac{1}{2}r^2}\in (0,\infty )\) is continuous, the fundamental theorem of calculus and the chain rule yield

  (A) that f is differentiable and

  (B) that for every \(t\in (0,\infty )\) it holds

    $$\begin{aligned} f'(t)=\tfrac{1}{\sqrt{2\pi }}\, e^{-\frac{1}{2}[\ln (t)]^2}t^{ - 1 } = \tfrac{1}{\sqrt{2\pi }}\, g_{1,0}(t) = \tfrac{1}{\sqrt{2\pi }}\, \gamma _{ 1, 0 } \, g_{1,0}(t) . \end{aligned}$$
    (3.5)

This establishes (i) and (ii) in the base case \(n=1\). For the induction step \({{\mathbb {N}}\ni n\rightarrow n+1\in \{2,3,4,\dots \}}\), note that for every \(t\in (0,\infty )\) we have

$$\begin{aligned} \tfrac{\mathrm {d}}{\mathrm {d}t}\left[ {e^{-\frac{1}{2}[\ln (t)]^2}}\right] = -t^{-1}e^{-\frac{1}{2}[\ln (t)]^2}\ln (t). \end{aligned}$$
(3.6)

Combining this and (3.2) with the product rule establishes for every \(n\in {\mathbb {N}}\), \(k \in \{ 0, 1, \dots , n - 1 \} \), \( t \in (0,\infty ) \) that

$$\begin{aligned} \begin{aligned} (g_{n,k})'(t)&=\tfrac{\mathrm {d}}{\mathrm {d}t}\left[ {t^{-n}e^{-\frac{1}{2}[\ln (t)]^2}[\ln (t)]^k}\right] \\&=-nt^{-(n+1)}e^{-\frac{1}{2}[\ln (t)]^2}[\ln (t)]^k-t^{-(n+1)}e^{-\frac{1}{2}[\ln (t)]^2}[ \ln (t) ]^{k+1}\\&\quad +t^{-(n+1)}e^{-\frac{1}{2}[\ln (t)]^2}k[\ln (t)]^{ \max \{ k - 1 , 0 \} } \\&=-g_{n+1,k+1}(t)-ng_{n+1,k}(t)+kg_{n+1,\max \{ k - 1 , 0 \} }(t). \end{aligned} \end{aligned}$$
(3.7)

Hence, we obtain that for every \(n\in {\mathbb {N}}\), \(t\in (0,\infty )\) it holds

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{n-1}\gamma _{n,k}(g_{n,k})'(t)\\&\quad =\sum _{k=0}^{n-1}\left[ {\gamma _{n,k}\left( {-g_{n+1,k+1}(t)-ng_{n+1,k}(t)+kg_{n+1,\max \{ k-1, 0 \} }(t)}\right) }\right] \\&\quad =\sum _{k=0}^{n-1}-\gamma _{n,k}\,g_{n+1,k+1}(t)+\sum _{k=0}^{n-1}-n\gamma _{n,k}\,g_{n+1,k}(t)+\sum _{k=1}^{n-1}k\gamma _{n,k}\,g_{n+1,\max \{ k-1, 0 \} }(t)\\&\quad =\sum _{k=1}^{n}-\gamma _{n,k-1}\,g_{n+1,k}(t)+\sum _{k=0}^{n-1}-n\gamma _{n,k}\,g_{n+1,k}(t)+\sum _{k=0}^{n-2}(k+1)\gamma _{n,k+1}\,g_{n+1,k}(t). \end{aligned} \end{aligned}$$
(3.8)

The fact that for every \(n\in {\mathbb {N}}\) it holds that \(\gamma _{n,-1}=\gamma _{n,n}=\gamma _{n,n+1}=0\) and (3.3) therefore ensure that for every \(n\in {\mathbb {N}}\), \(t\in (0,\infty )\) we have

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{n-1}\gamma _{n,k}(g_{n,k})'(t)&=\sum _{k=0}^{n}\left[ {\left( {-\gamma _{n,k-1}-n\gamma _{n,k}+(k+1)\gamma _{n,k+1}}\right) g_{n+1,k}(t)}\right] \\&=\sum _{k=0}^{n}\gamma _{n+1,k}\,g_{n+1,k}(t). \end{aligned} \end{aligned}$$
(3.9)

Induction thus establishes (i) and (ii). The proof of Lemma 3.1 is thus completed. \(\square \)
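The recursion (3.3) and the representation (3.4) can be verified symbolically for small n; the following sketch (ours) compares the closed form with derivatives of \(f(t)=\tfrac{1}{2}(1+\mathrm {erf}(\ln (t)/\sqrt{2}))\) computed by a computer algebra system.

```python
import sympy as sp

t = sp.symbols('t', positive=True)
f = sp.Rational(1, 2) * (1 + sp.erf(sp.log(t) / sp.sqrt(2)))   # f(t) from (3.1)

def gamma(n, k):
    # the recursion (3.3)
    if n == 1 and k == 0:
        return 1
    if n > 1 and 0 <= k < n:
        return -gamma(n-1, k-1) - (n-1)*gamma(n-1, k) + (k+1)*gamma(n-1, k+1)
    return 0

def g(n, k):
    # g_{n,k}(t) from (3.2)
    return t**(-n) * sp.exp(-sp.log(t)**2 / 2) * sp.log(t)**k

for n in range(1, 5):
    lhs = sp.diff(f, t, n)                                          # f^{(n)}(t)
    rhs = sum(gamma(n, k) * g(n, k) for k in range(n)) / sp.sqrt(2 * sp.pi)
    err = max(abs((lhs - rhs).subs(t, tv).evalf()) for tv in (0.3, 1.0, 2.5))
    print(n, err)   # errors at machine precision confirm the identity (3.4)
```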

Using the recursive formula from above, we can now bound the derivatives of f. Note that the supremum of \(f^{(n)}\) is actually attained on the interval \([e^{-4n},1]\) and scales with n like \(e^{cn^2}\) for some \(c\in (0,\infty )\). This can be seen directly by calculating the maxima of the \(g_{n,k}\) from (3.2). For our purposes, however, it is sufficient to establish that all derivatives of f are bounded on \((0,\infty )\).

Lemma 3.2

Let \(f:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in (0,\infty )\) that

$$\begin{aligned} f(t)=\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (t)}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r. \end{aligned}$$
(3.10)

Then it holds for every \(n\in {\mathbb {N}}\) that

$$\begin{aligned} \sup _{t\in (0,\infty )}\left| {f^{(n)}(t)}\right| \le \max \!\left\{ (n-1)!\,2^{n-2}\, ,\sup _{t\in [e^{-4n},1]}\left| {f^{(n)}(t)}\right| \right\} <\infty . \end{aligned}$$
(3.11)

Proof of Lemma 3.2

Throughout this proof, let \(g_{n,k}:(0,\infty )\rightarrow {\mathbb {R}}\), \(n,k\in {\mathbb {N}}_0\), be the functions introduced in (3.2) and let \((\gamma _{n,k})_{n,k\in {\mathbb {Z}}}\subseteq {\mathbb {Z}}\) be the integers introduced in (3.3). Then Lemma 3.1 shows for every \(n\in {\mathbb {N}}\) that

  (a) we have that f is n times continuously differentiable and

  (b) we have for every \(t\in (0,\infty )\) that

    $$\begin{aligned} f^{(n)}(t)=\tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\gamma _{n,k}\,g_{n,k}(t)}\right] . \end{aligned}$$
    (3.12)

In addition, observe that for every \(m\in {\mathbb {N}}\), \(t\in (0,e^{-2m}]\) it holds that \({\tfrac{1}{2}\ln (t)\le -m}\). This ensures that for every \(m\in {\mathbb {N}}\), \(t\in (0,e^{-2m}]\subseteq (0,1]\) we have

$$\begin{aligned} \begin{aligned} e^{-\frac{1}{2}[\ln (t)]^2}&=e^{\left[ {\ln (t)(-\frac{1}{2}\ln (t))}\right] }=\left[ {e^{\ln (t)}}\right] ^{-\frac{1}{2}\ln (t)} =t^{-\frac{1}{2}\ln (t)}=\left( {\tfrac{1}{t}}\right) ^{\frac{1}{2}\ln (t)} \le \left( {\tfrac{1}{t}}\right) ^{-m}=t^m. \end{aligned} \end{aligned}$$
(3.13)

Moreover, note that the fundamental theorem of calculus implies for every \(t\in (0,1]\) that

$$\begin{aligned} \begin{aligned} \left| {\ln (t)}\right|&=\left| {\ln (t)-\ln (1)}\right| =\left| {\ln (1)-\ln (t)}\right| =\left| {\int _t^1 \frac{1}{s}\,\mathrm {d}s}\right| \le \left| {\frac{1}{t}(1-t)}\right| \le t^{-1}. \end{aligned} \end{aligned}$$
(3.14)

Combining (3.2), (3.12), (3.13) and (3.14) therefore establishes that for every \(n\in {\mathbb {N}}\), \({t\in (0,e^{-4n})}{\subseteq (0,1]}\) it holds

$$\begin{aligned} \begin{aligned} \left| {f^{(n)}(t)}\right|&=\tfrac{1}{\sqrt{2\pi }}\left| {\sum _{k=0}^{n-1}\gamma _{n,k}\,g_{n,k}(t)}\right| =\tfrac{1}{\sqrt{2\pi }}\left| {\sum _{k=0}^{n-1}\gamma _{n,k}t^{-n}e^{-\frac{1}{2}[\ln (t)]^2}[\ln (t)]^k}\right| \\&\le \tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| t^{n-k}}\right] \le \tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| }\right] . \end{aligned} \end{aligned}$$
(3.15)

In addition, observe that the fundamental theorem of calculus ensures that for every \(t\in [1,\infty )\) we have

$$\begin{aligned} \left| {\ln (t)}\right| =\left| {\ln (t)-\ln (1)}\right| =\left| {\int _1^t \frac{1}{s}\,\mathrm {d}s}\right| \le \left| {t-1}\right| \le t. \end{aligned}$$
(3.16)

This, (3.2), (3.12) and the fact that for every \(t\in (0,\infty )\) it holds \(|e^{-\frac{1}{2}[\ln (t)]^2}|\le 1\) imply that for every \(n\in {\mathbb {N}}\), \(t\in (1,\infty )\) we have

$$\begin{aligned} \begin{aligned} \left| {f^{(n)}(t)}\right|&=\tfrac{1}{\sqrt{2\pi }}\left| {\sum _{k=0}^{n-1}\gamma _{n,k}\,g_{n,k}(t)}\right| =\tfrac{1}{\sqrt{2\pi }}\left| {\sum _{k=0}^{n-1}\gamma _{n,k}t^{-n}e^{-\frac{1}{2}[\ln (t)]^2}[\ln (t)]^k}\right| \\&\le \tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| t^{-n}\left| {\ln (t)}\right| ^k}\right] \le \tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| t^{-n}t^k}\right] \\&=\tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| t^{-n+k}}\right] \le \tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| }\right] . \end{aligned} \end{aligned}$$
(3.17)

Moreover, observe that (a) assures that for every \(n\in {\mathbb {N}}\) it holds that the function \(f^{(n)}\) is continuous. This and the compactness of the interval \([e^{-4n},1]\) ensure that for every \(n\in {\mathbb {N}}\) we have

$$\begin{aligned} \sup _{t\in [e^{-4n},1]}\left| {f^{(n)}(t)}\right| <\infty . \end{aligned}$$
(3.18)

Combining this with (3.15) and (3.17) establishes that for every \(n\in {\mathbb {N}}\) we have

$$\begin{aligned} \sup _{t\in (0,\infty )}\left| {f^{(n)}(t)}\right| \le \max \!\left\{ \tfrac{1}{\sqrt{2\pi }}\left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| }\right] , \sup _{t\in [e^{-4n},1]}\left| {f^{(n)}(t)}\right| \right\} <\infty . \end{aligned}$$
(3.19)

Furthermore, note that (3.3) implies that for every \(n\in \{2,3,4,\dots \}\) it holds

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right|&=\sum _{k=0}^{n-1}\left| {-\gamma _{n-1,k-1}-(n-1)\gamma _{n-1,k}+(k+1)\gamma _{n-1,k+1}}\right| \\&\le \left[ {\sum _{k=0}^{n-1}\left| {\gamma _{n-1,k-1}}\right| }\right] +\left[ {\sum _{k=0}^{n-1}(n-1)\left| {\gamma _{n-1,k}}\right| }\right] +\left[ {\sum _{k=0}^{n-1}(k+1)\left| {\gamma _{n-1,k+1}}\right| }\right] \\&=\left[ {\sum _{k=-1}^{n-2}\left| {\gamma _{n-1,k}}\right| }\right] +\left[ {\sum _{k=0}^{n-1}(n-1)\left| {\gamma _{n-1,k}}\right| }\right] +\left[ {\sum _{k=1}^{n}k\left| {\gamma _{n-1,k}}\right| }\right] . \end{aligned} \end{aligned}$$
(3.20)

Combining this with the fact that for every \(n\in \{2,3,4,\dots \}\), \(k\in {\mathbb {Z}}\backslash \{0,1,\dots ,n-2\}\) we have \(\gamma _{n-1,k}=0\) implies that for every \(n\in \{2,3,4,\dots \}\) it holds

$$\begin{aligned} \sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right|&=\sum _{k=0}^{n-2}\left[ {(1+(n-1)+k)\left| {\gamma _{n-1,k}}\right| }\right] \le (2n-2)\left[ {\sum _{k=0}^{n-2}\left| {\gamma _{n-1,k}}\right| }\right] \nonumber \\&=2(n-1)\left[ {\sum _{k=0}^{n-2}\left| {\gamma _{n-1,k}}\right| }\right] . \end{aligned}$$
(3.21)

The fact that \(\gamma _{1,0}=1\) hence implies that for every \(n\in {\mathbb {N}}\) we have

$$\begin{aligned} \sum _{k=0}^{n-1}\left| {\gamma _{n,k}}\right| \le (n-1)!\,2^{n-1}\left[ {\sum _{k=0}^0\left| {\gamma _{1,k}}\right| }\right] =(n-1)!\,2^{n-1}. \end{aligned}$$
(3.22)

Combining this and (3.19) ensures that for every \(n\in {\mathbb {N}}\) it holds

$$\begin{aligned} \sup _{t\in (0,\infty )}\left| {f^{(n)}(t)}\right| \le \max \!\left\{ \tfrac{1}{\sqrt{2\pi }}(n-1)!\,2^{n-1}\, ,\sup _{t\in [e^{-4n},1]}\left| {f^{(n)}(t)}\right| \right\} <\infty . \end{aligned}$$
(3.23)

The proof of Lemma 3.2 is thus completed. \(\square \)

In the following corollary, we estimate the derivatives of the function \(x\mapsto f(\tfrac{K+c}{x})\); these estimates are required in order to approximate this function by neural networks.

Corollary 3.3

Let \(n\in {\mathbb {N}}\), \(K\in [0,\infty )\), \(c,a\in (0,\infty )\), \(b\in (a,\infty )\), let \(f:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in (0,\infty )\) that

$$\begin{aligned} f(t)=\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (t)}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r, \end{aligned}$$
(3.24)

and let \(h:[a,b]\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(x\in [a,b]\) that

$$\begin{aligned} h(x)=f\left( \tfrac{K+c}{x}\right) . \end{aligned}$$
(3.25)

Then it holds

  (i) that f and h are infinitely often differentiable and

  (ii) that

    $$\begin{aligned}&\max _{k\in \{0,1,\dots ,{n}\}}\sup _{x\in [a,b]}\left| {h^{(k)}\!(x)}\right| \nonumber \\&\quad \le n2^{n-1} n! \left[ {\max _{k\in \{0,1,\dots ,{n}\}} \sup _{t\in [\frac{K+c}{b},\frac{K+c}{a}]}\left| {f^{(k)}\!(t)}\right| }\right] \max \{a^{-2n},1\}\max \{(K+c)^n,1\}.\nonumber \\ \end{aligned}$$
    (3.26)

Proof of Corollary 3.3

Throughout this proof, let \(\alpha _{m,j}\in {\mathbb {Z}}\), \(m,j\in {\mathbb {Z}}\), be the integers which satisfy that for every \(m,j\in {\mathbb {Z}}\) it holds

$$\begin{aligned} \alpha _{m,j}={\left\{ \begin{array}{ll}-1 &{} :m=j=1\\ -(m-1+j)\alpha _{m-1,j}-\alpha _{m-1,j-1} &{} :m>1,\,\, 1\le j\le m \\ 0 &{} :\mathrm {else} \end{array}\right. }. \end{aligned}$$
(3.27)

Note that Lemma 3.1 and the chain rule ensure that the functions f and h are infinitely often differentiable. Next we claim that for every \(m\in {\mathbb {N}}\), \(x\in [a,b]\) it holds

$$\begin{aligned} h^{(m)}\!(x)=\tfrac{\mathrm {d}^m}{\mathrm {d}x^m}\!\left( {f\left( \tfrac{K+c}{x}\right) }\right) =\sum _{j=1}^m \alpha _{m,j}(K+c)^j x^{-(m+j)}(f^{(j)}\!\big (\tfrac{K+c}{x})\big ). \end{aligned}$$
(3.28)

We prove (3.28) by induction on \(m\in {\mathbb {N}}\). To prove the base case \(m=1\) we note that the chain rule ensures that for every \(x\in [a,b]\) we have

$$\begin{aligned} \tfrac{\mathrm {d}}{\mathrm {d}x}\!\left( {f\left( \tfrac{K+c}{x}\right) }\right) =-(K+c)x^{-2}\!\left( {f'\! \left( \tfrac{K+c}{x}\right) }\right) =\alpha _{1,1}(K+c)x^{-2}\!\left( {f'\! \left( \tfrac{K+c}{x}\right) }\right) . \end{aligned}$$
(3.29)

This establishes (3.28) in the base case \(m=1\). For the induction step \({\mathbb {N}}\ni m \rightarrow m+1\in {\mathbb {N}}\) observe that the chain rule implies for every \(m\in {\mathbb {N}}\), \(x\in [a,b]\) that

$$\begin{aligned} \begin{aligned}&\tfrac{\mathrm {d}}{\mathrm {d}x}\!\left[ {\sum _{j=1}^m \alpha _{m,j}(K+c)^j x^{-(m+j)}\!\left( {f^{(j)}\! \left( \tfrac{K+c}{x}\right) }\right) }\right] \\&\quad =-\left[ {\sum _{j=1}^m \alpha _{m,j}(K+c)^{j+1}x^{-(m+j+2)}\!\left( {f^{(j+1)}\! \left( \tfrac{K+c}{x}\right) }\right) }\right] \\&\qquad -\left[ {\sum _{j=1}^m \alpha _{m,j}(K+c)^j(m+j)x^{-(m+j+1)}\!\left( {f^{(j)}\! \left( \tfrac{K+c}{x}\right) }\right) }\right] \\&\quad =-\left[ {\sum _{j=2}^{m+1} \alpha _{m,j-1}(K+c)^{j}x^{-(m+j+1)}\!\left( {f^{(j)} \!\left( \tfrac{K+c}{x}\right) }\right) }\right] \\&\qquad -\left[ {\sum _{j=1}^m \alpha _{m,j}(K+c)^j (m+j)x^{-(m+j+1)}\!\left( {f^{(j)}\! \left( \tfrac{K+c}{x}\right) }\right) }\right] \\&\quad =\sum _{j=1}^{m+1}(-(m+j)\alpha _{m,j}-\alpha _{m,j-1})(K+c)^jx^{-(m+1+j)} \!\left( {f^{(j)}\!\left( \tfrac{K+c}{x}\right) }\right) . \end{aligned}\end{aligned}$$
(3.30)

Induction thus establishes (3.28). Next note that (3.27) ensures that for every \(m\in \{2,3,\dots \}\) it holds

$$\begin{aligned} \begin{aligned}&\max _{j\in \{1,2,\dots ,{m}\}}\left| {\alpha _{m,j}}\right| \\&\quad =\max _{j\in \{1,2,\dots ,{m}\}}\left| {-(m-1+j)\alpha _{m-1,j}-\alpha _{m-1,j-1}}\right| \\&\quad \le \left[ {\max _{j\in \{1,2,\dots ,{m-1}\}}\left| {(m-1+j)\alpha _{m-1,j}}\right| }\right] +\left[ {\max _{j\in \{1,2,\dots ,{m-1}\}}\left| {\alpha _{m-1,j}}\right| }\right] \\&\quad \le (2m-1)\left[ {\max _{j\in \{1,2,\dots ,{m-1}\}}\left| {\alpha _{m-1,j}}\right| }\right] \le 2m\left[ {\max _{j\in \{1,2,\dots ,{m-1}\}}\left| {\alpha _{m-1,j}}\right| }\right] . \end{aligned}\end{aligned}$$
(3.31)

Induction hence proves that for every \(m\in {\mathbb {N}}\) we have \(\max _{j\in \{1,2,\dots ,{m}\}}\left| {\alpha _{m,j}}\right| \le 2^{m-1}m!\). Combining this with (3.28) implies that for every \(m\in \{1,2,\dots ,{n}\}\), \(x\in [a,b]\) we have

$$\begin{aligned}&\left| {h^{(m)}\!(x)}\right| \nonumber \\&\quad =\left| {\sum _{j=1}^m \alpha _{m,j}(K+c)^j x^{-(m+j)}\!\big (f^{(j)} \!\left( \tfrac{K+c}{x}\right) \big )}\right| \nonumber \\&\quad \le 2^{m-1}m! \left[ {\max _{j\in \{1,2,\dots ,{m}\}}\sup _{t\in \left[ \frac{K+c}{b},\frac{K+c}{a}\right] }\left| {f^{(j)}\!(t)}\right| }\right] \max \{x^{-2m},1\}\left[ {\sum _{j=1}^m (K+c)^j}\right] \nonumber \\&\quad \le m2^{m-1} m! \left[ {\max _{j\in \{1,2,\dots ,{m}\}}\sup _{t\in \left[ \frac{K+c}{b},\frac{K+c}{a}\right] }\left| {f^{(j)}\!(t)}\right| }\right] \max \{x^{-2m},1\}\max \{(K+c)^m,1\}.\nonumber \\ \end{aligned}$$
(3.32)

Combining this with the fact that \(\sup _{x\in [a,b]}\left| {h(x)}\right| =\sup _{t\in [\frac{K+c}{b},\frac{K+c}{a}]}\left| {f(t)}\right| \) establishes that it holds

$$\begin{aligned}&\max _{k\in \{0,1,\dots ,{n}\}}\sup _{x\in [a,b]}\left| {h^{(k)}\!(x)}\right| \nonumber \\&\quad \le n2^{n-1} n! \left[ {\max _{k\in \{0,1,\dots ,{n}\}}\sup _{t\in \left[ \frac{K+c}{b},\frac{K+c}{a}\right] }\left| {f^{(k)}\!(t)}\right| }\right] \max \{a^{-2n},1\}\max \{(K+c)^n,1\}.\nonumber \\ \end{aligned}$$
(3.33)

This completes the proof of Corollary 3.3. \(\square \)

Next we consider the derivatives of the functions \(c\mapsto f(\tfrac{K+c}{x_i})\), \(i\in \{1,2,\dots ,{d}\}\), and their tensor product, which will be needed in order to approximate the outer integral in (2.9) by composite Gaussian quadrature.

Corollary 3.4

Let \(n\in {\mathbb {N}}\), \(K\in [0,\infty )\), \(x\in (0,\infty )\), let \(f:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in (0,\infty )\) that

$$\begin{aligned} f(t)=\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (t)}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r, \end{aligned}$$
(3.34)

and let \(g:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in (0,\infty )\) that

$$\begin{aligned} g(t)=f\!\left( {\tfrac{K+t}{x}}\right) . \end{aligned}$$
(3.35)

Then it holds

  (i) that f and g are infinitely often differentiable and

  (ii) that

    $$\begin{aligned} \sup _{t\in (0,\infty )}\left| {g^{(n)}(t)}\right| \le \left[ \sup _{t\in (0,\infty )}\left| {f^{(n)}(t)}\right| \right] \left| {x}\right| ^{-n}<\infty . \end{aligned}$$
    (3.36)

Proof of Corollary 3.4

Combining Lemma 3.2 with the chain rule implies that for every \(t\in (0,\infty )\) it holds

$$\begin{aligned} \left| {g^{(n)}(t)}\right| =\left| {\tfrac{\mathrm {d}^n}{\mathrm {d}t^n}\big (f(\tfrac{K+t}{x})\big )}\right| =\left| {f^{(n)}\!\left( {\tfrac{K+t}{x}}\right) \tfrac{1}{x^n}}\right| \le \left[ \sup _{t\in (0,\infty )}\left| {f^{(n)}(t)}\right| \right] \left| {x}\right| ^{-n}<\infty . \end{aligned}$$
(3.37)

This completes the proof of Corollary 3.4. \(\square \)

Lemma 3.5

Let \(d,n\in {\mathbb {N}}\), \(a\in (0,\infty )\), \(b\in (a,\infty )\), \(K=(K_1,\dots ,K_d)\in [0,\infty )^d\), \(x=(x_1,\dots ,x_d)\in [a,b]^d\), let \(f:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in (0,\infty )\) that

$$\begin{aligned} f(t)=\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (t)}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r, \end{aligned}$$
(3.38)

and let \(F:(0,\infty )\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(c\in (0,\infty )\) that

$$\begin{aligned} F(c)=1-\left[ {\textstyle \prod \limits _{i=1}^d f\!\left( {\tfrac{K_i+c}{x_i}}\right) }\right] . \end{aligned}$$
(3.39)

Then it holds

  (i) that f and F are infinitely often differentiable and

  (ii) that

    $$\begin{aligned} \sup _{c\,\in (0,\infty )}\left| {F^{(n)}(c)}\right| \le \left[ \max _{k\in \{0,1,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| \right] ^n d^n a^{-n}<\infty . \end{aligned}$$
    (3.40)

Proof of Lemma 3.5

Note that Lemma 3.1 ensures that f and F are infinitely often differentiable. Moreover, observe that (3.39) and the general Leibniz rule imply for every \(c\in (0,\infty )\) that

$$\begin{aligned} \begin{aligned} F^{(n)}(c)&=-\tfrac{\mathrm {d}^n}{\mathrm {d}c^n}\left[ \textstyle \prod \limits _{i=1}^d f\!\left( {\tfrac{K_i+c}{x_i}}\right) \right] \\&=-\sum _{\begin{array}{c} l_1,l_2,\dots ,l_d\in {\mathbb {N}}_0,\\ \sum _{i=1}^d l_i=n \end{array}}\left[ {\left( {\begin{array}{c}n\\ l_1,l_2,\dots ,l_d\end{array}}\right) \textstyle \prod \limits _{i=1}^d\left( {\tfrac{\mathrm {d}^{l_i}}{\mathrm {d}c^{l_i}}\left[ {f\!\left( {\tfrac{K_i+c}{x_i}}\right) }\right] }\right) }\right] . \end{aligned} \end{aligned}$$
(3.41)

Next note that the fact that for every \(r\in {\mathbb {R}}\) it holds that \(e^{-\frac{1}{2}r^2}\ge 0\) ensures that

$$\begin{aligned} \sup _{t\in (0,\infty )}\left| {f(t)}\right| = \sup _{t\in (0,\infty )}\left| {\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (t)}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r}\right| =\left| {\tfrac{1}{\sqrt{2\pi }}\int ^{\infty }_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r}\right| =1. \end{aligned}$$
(3.42)

Corollary 3.4 hence establishes that for every \(c\in [0,\infty )\), \(l_1,\dots ,l_d\in {\mathbb {N}}_0\) with \(\sum _{i=1}^d l_i=n\) it holds

$$\begin{aligned} \begin{aligned}&\left| {\textstyle \prod \limits _{i=1}^d\left( {\tfrac{\mathrm {d}^{l_i}}{\mathrm {d}c^{l_i}}\left[ {f\!\left( {\tfrac{K_i+c}{x_i}}\right) }\right] }\right) }\right| \\&\quad \le \textstyle \prod \limits _{i=1}^d\displaystyle \left( {\left[ \sup _{t\in (0,\infty )}\left| {f^{(l_i)}(t)}\right| \right] \left| {x_i}\right| ^{-l_i}}\right) \\&\quad =\left[ {\textstyle \prod \limits _{i=1}^d\left| {x_i}\right| ^{-l_i}}\right] \left[ {\textstyle \prod \limits _{i=1}^d\displaystyle \left( {\sup _{t\in (0,\infty )}\left| {f^{(l_i)}(t)}\right| }\right) }\right] \\&\quad \le \left[ {\textstyle \prod \limits _{i=1}^d\left| {x_i}\right| ^{-l_i}}\right] \left[ {\textstyle \prod \limits _{\begin{array}{c} i\in \{1,2,\dots ,{d}\},\\ l_i>0 \end{array}}\displaystyle \left( {\max _{k\in \{1,2,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| }\right) }\right] \\&\quad \le \left[ {\textstyle \prod \limits _{i=1}^d\left| {x_i}\right| ^{-l_i}}\right] \left[ {\textstyle \prod \limits _{\begin{array}{c} i\in \{1,2,\dots ,{d}\},\\ l_i>0 \end{array}}\displaystyle \max \left\{ 1,\max _{k\in \{1,2,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| \right\} }\right] \\&\quad \le \left[ {\textstyle \prod \limits _{i=1}^d\left| {x_i}\right| ^{-l_i}}\right] \left[ {\max \left\{ 1,\max _{k\in \{1,2,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| \right\} }\right] ^{(l_1+\ldots +l_d)}\\&\quad =\left[ {\textstyle \prod \limits _{i=1}^d\left| {x_i}\right| ^{-l_i}}\right] \left[ {\max _{k\in \{0,1,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| }\right] ^n. \end{aligned} \end{aligned}$$
(3.43)

Moreover, note that the multinomial theorem ensures that

$$\begin{aligned} \begin{aligned} d^n&=\left[ {\sum _{i=1}^d 1}\right] ^n =\sum _{\begin{array}{c} l_1,l_2,\dots ,l_d\in {\mathbb {N}}_0,\\ \sum _{i=1}^d l_i=n \end{array}}\left[ {\left( {\begin{array}{c}n\\ l_1,l_2,\dots ,l_d\end{array}}\right) \textstyle \prod \limits _{i=1}^d 1^{l_i}}\right] \\&=\sum _{\begin{array}{c} l_1,l_2,\dots ,l_d\in {\mathbb {N}}_0,\\ \sum _{i=1}^d l_i=n \end{array}}\left[ {\left( {\begin{array}{c}n\\ l_1,l_2,\dots ,l_d\end{array}}\right) }\right] . \end{aligned} \end{aligned}$$
(3.44)

Combining this with (3.41), (3.43) and the assumption that \(x\in [a,b]^d\) implies that for every \(c\in (0,\infty )\) we have

$$\begin{aligned}&\left| {F^{(n)}(c)}\right| \nonumber \\&\quad \le \left| {\sum _{\begin{array}{c} l_1,l_2,\dots ,l_d\in {\mathbb {N}}_0,\\ \sum _{i=1}^d l_i=n \end{array}}\left[ {\left( {\begin{array}{c}n\\ l_1,l_2,\dots ,l_d\end{array}}\right) \left[ {\textstyle \prod \limits _{i=1}^d\displaystyle \left| {x_i}\right| ^{-l_i}}\right] \left[ \max _{k\in \{0,1,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| \right] ^n}\right] }\right| \nonumber \\&\quad \le a^{-n}\left[ \max _{k\in \{0,1,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| \right] ^n \left| {\sum _{\begin{array}{c} l_1,l_2,\dots ,l_d\in {\mathbb {N}}_0,\\ \sum _{i=1}^d l_i=n \end{array}}\left( {\begin{array}{c}n\\ l_1,l_2,\dots ,l_d\end{array}}\right) }\right| \\&\quad = a^{-n}\left[ \max _{k\in \{0,1,\dots ,{n}\}}\sup _{t\in (0,\infty )}\left| {f^{(k)}(t)}\right| \right] ^n d^n. \end{aligned}$$
(3.45)

This completes the proof of Lemma 3.5. \(\square \)

4 Quadrature

To approximate the function \(x\mapsto u(0,x)\) from (2.9) by a neural network, we need to evaluate, for arbitrary given x, an expression of the form \(\int _0^{\infty } F_x(c)\,\mathrm {d}c\) with \(F_x\) as defined in Lemma 4.2. We achieve this by proving in Lemma 4.2 that the functions \(F_x\) decay sufficiently fast as \(c\rightarrow \infty \) and by then employing numerical integration to show that the definite integral \(\int _0^N F_x(c)\,\mathrm {d}c\) can be approximated sufficiently well by a weighted sum of the values \(F_x(c_j)\) at suitable quadrature points \(c_j\in (0,N)\). Such a weighted sum can in turn be realized by neural networks. We show in Sects. 6 and 7 how the functions \(x\mapsto F_x(c_j)\), \(c_j\in (0,N)\), can be realized efficiently due to their tensor product structure. We start by recalling an error bound for composite Gaussian quadrature which is explicit in the step size and quadrature order.

Lemma 4.1

Let \(n,M\in {\mathbb {N}}\), \(N\in (0,\infty )\). Then there exist real numbers \((c_j)_{j=1}^{nM}\subseteq (0,N)\) and \({(w_j)_{j=1}^{nM}\subseteq (0,\infty )}\) such that for every \(h\in C^{2n}([0,N],{\mathbb {R}})\) it holds

$$\begin{aligned} \left| {\int _0^N h(t)\,\mathrm {d}t - \sum _{j=1}^{nM}w_j h(c_j)}\right| \le \tfrac{1}{(2n)!}N^{2n+1}M^{-2n}\left[ {\sup _{\xi \in [0,N]}\left| {h^{(2n)}(\xi )}\right| }\right] . \end{aligned}$$
(4.1)

Proof of Lemma 4.1

Throughout this proof, let \(h\in C^{2n}([0,N],{\mathbb {R}})\) and \(\alpha _k\in [0,N]\), \(k\in \{0,1,\dots ,M\}\), such that for every \(k\in \{0,1,\dots ,M\}\) it holds \(\alpha _k=\tfrac{kN}{M}\). Observe that [30, Theorems 4.17, 6.11 and 6.12] ensure that for every \(k\in \{0,1,\dots ,M-1\}\) there exist \((\gamma ^k_i)_{i=1}^{n}\subseteq (\alpha _k,\alpha _{k+1})\), \((\omega ^k_i)_{i=1}^{n}\subseteq (0,\infty )\) and \(\xi ^k\in [\alpha _k,\alpha _{k+1}]\) such that

$$\begin{aligned} \int _{\alpha _k}^{\alpha _{k+1}}h(t)\,\mathrm {d}t-\sum _{i=1}^n\omega ^k_i h(\gamma ^k_i) =\frac{h^{(2n)}(\xi ^k)}{(2n)!}\int _{\alpha _k}^{\alpha _{k+1}}\left[ {\textstyle \prod \limits _{i=1}^n (t-\gamma ^k_i)^2}\right] \mathrm {d}t. \end{aligned}$$
(4.2)

Next note that for every \(k\in \{0,1,\dots ,M-1\}\) it holds

$$\begin{aligned} \begin{aligned} \int _{\alpha _k}^{\alpha _{k+1}}\left[ {\textstyle \prod \limits _{i=1}^n (t-\gamma ^k_i)^2}\right] \mathrm {d}t&\le \int _{\alpha _{k}}^{\alpha _{k+1}}\left[ {\textstyle \prod \limits _{i=1}^n(\alpha _k-\alpha _{k+1})^2}\right] \mathrm {d}t =\left[ {\tfrac{N}{M}}\right] ^{2n+1}. \end{aligned}\end{aligned}$$
(4.3)

Combining this with (4.2) yields that for every \(k\in \{0,1,\dots ,M-1\}\) we have

$$\begin{aligned} \begin{aligned}&\left| {\int _{\alpha _k}^{\alpha _{k+1}}h(t)\, \mathrm {d}t-\sum _{i=1}^n\omega ^k_i h(\gamma ^k_i)}\right| \\&\quad \le \frac{\left| {h^{(2n)}(\xi ^k)}\right| }{(2n)!}\left[ {\tfrac{N}{M}}\right] ^{2n+1} \le \tfrac{1}{(2n)!}\left[ {\tfrac{N}{M}}\right] ^{2n+1}\left[ {\sup _{\xi \in [0,N]}\left| {h^{(2n)}(\xi )}\right| }\right] . \end{aligned}\end{aligned}$$
(4.4)

Hence, we obtain

$$\begin{aligned} \begin{aligned} \left| { \int _0^N h(t)\, \mathrm {d}t\!-\!\!\sum _{k=0}^{M-1}\!\sum _{i=1}^n\omega ^k_i h(\gamma ^k_i)}\right|&=\left| {\sum _{k=0}^{M-1}\left[ {\int _{\alpha _k}^{\alpha _{k+1}}h(t)\,\mathrm {d}t-\sum _{i=1}^n\omega ^k_i h(\gamma ^k_i)}\right] }\right| \\&\le \sum _{k=0}^{M-1}\left( {\tfrac{1}{(2n)!}\left( {\tfrac{N}{M}}\right) ^{2n+1}\left[ {\sup _{\xi \in [0,N]}\left| {h^{(2n)}(\xi )}\right| }\right] }\right) \\&=\tfrac{1}{(2n)!}N^{2n+1}M^{-2n}\left[ {\sup _{\xi \in [0,N]}\left| {h^{(2n)}(\xi )}\right| }\right] . \end{aligned}\end{aligned}$$
(4.5)

Let \((c_j)_{j=1}^{nM}\subseteq (0,N)\), \((w_j)_{j=1}^{nM}\subseteq (0,\infty )\) such that for every \(i\in \{1,2,\dots ,{n}\}\), \(k\in \{0,1,\dots ,M-1\}\) it holds

$$\begin{aligned} c_{kn+i}=\gamma _i^k\quad \mathrm {and}\quad w_{kn+i}=\omega _i^k. \end{aligned}$$
(4.6)

Next observe that

$$\begin{aligned} \left| {\int _0^N h(t)\,\mathrm {d}t - \sum _{j=1}^{nM}w_j h(c_j)}\right| =\left| { \int _0^N h(t)\,\mathrm {d}t\!-\!\!\sum _{k=0}^{M-1}\!\sum _{i=1}^n\omega ^k_i h(\gamma ^k_i)}\right| . \end{aligned}$$
(4.7)

This completes the proof of Lemma 4.1. \(\square \)
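For concreteness, the following sketch (ours; the test integrand and all parameters are arbitrary) implements the composite rule used in the proof of Lemma 4.1: an n-point Gauss–Legendre rule on each of M equal subintervals of [0, N], with nodes and weights mapped from the reference interval \([-1,1]\).

```python
import numpy as np

def composite_gauss(n, M, N):
    """Nodes c_j in (0, N) and weights w_j of the rule in Lemma 4.1."""
    xi, wi = np.polynomial.legendre.leggauss(n)   # n-point rule on [-1, 1]
    h = N / M
    nodes, weights = [], []
    for k in range(M):                            # subinterval [k h, (k+1) h]
        nodes.append(0.5 * h * (xi + 1.0) + k * h)
        weights.append(0.5 * h * wi)
    return np.concatenate(nodes), np.concatenate(weights)

# example: the error decays like M^{-2n} (here n = 3), cf. the bound (4.1)
f = lambda t: np.exp(-t) * np.sin(3.0 * t)        # smooth test integrand
exact = (3.0 - np.exp(-2.0) * (np.sin(6.0) + 3.0 * np.cos(6.0))) / 10.0
for M in (1, 2, 4, 8):
    c, w = composite_gauss(n=3, M=M, N=2.0)
    print(M, abs(np.dot(w, f(c)) - exact))
```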

In the following, we bound the error due to truncating the domain of integration.

Lemma 4.2

Let \(d,n\in {\mathbb {N}}\), \(a\in (0,\infty )\), \(b\in (a,\infty )\), \(K=(K_1,K_2,\dots ,K_d)\in [0,\infty )^d\), let \(F_x:(0,\infty )\rightarrow {\mathbb {R}}\), \(x\in [a,b]^d\), be the functions which satisfy for every \(x=(x_1,x_2,\dots ,x_d)\in [a,b]^d\), \(c\in (0,\infty )\) that

$$\begin{aligned} F_x(c)=1-\prod _{i=1}^d\left[ {\tfrac{1}{\sqrt{2\pi }}\int _{-\infty }^{\ln (\frac{K_i+c}{x_i})}e^{-\frac{1}{2}r^2}\mathrm {d}r}\right] , \end{aligned}$$
(4.8)

and for every \(\varepsilon \in (0,1]\) let \(N_{\varepsilon }\in {\mathbb {R}}\) be given by \(N_{\varepsilon }=2e^{2(n+1)}(b+1)^{1+\frac{1}{n}}d^{\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}\). Then it holds for every \(\varepsilon \in (0,1]\) that

$$\begin{aligned} \sup _{x\in [a,b]^d}\left| {\int _{N_{\varepsilon }}^\infty F_x(c)\,\mathrm {d}c}\right| \le \varepsilon . \end{aligned}$$
(4.9)

Proof of Lemma 4.2

Throughout this proof, let \(g:(0,\infty )\rightarrow (0,1)\) be the function given by

$$\begin{aligned} g(t)=1-\tfrac{1}{\sqrt{2\pi }}\int _{-\infty }^{\ln (t)}e^{-\frac{1}{2}r^2}\mathrm {d}r. \end{aligned}$$
(4.10)

Note that [6, Eq.(5)] ensures that for every \(y\in [0,\infty )\) we have \(\tfrac{2}{\sqrt{\pi }}\int _y^{\infty }e^{-r^2}\mathrm {d}r \le e^{-y^2}\). This implies for every \(t\in [1,\infty )\) that

$$\begin{aligned} \begin{aligned} 0<g(t)&=1-\tfrac{1}{\sqrt{2\pi }}\int _{-\infty }^{\ln (t)}e^{-\frac{1}{2}r^2}\mathrm {d}r =\tfrac{1}{\sqrt{2\pi }}\int _{\ln (t)}^\infty e^{-\frac{1}{2}r^2}\mathrm {d}r\\&=\tfrac{1}{\sqrt{\pi }}\int ^{\infty }_{\frac{\ln (t)}{\sqrt{2}}} e^{-r^2}\mathrm {d}r\le \tfrac{1}{2}e^{-\frac{1}{2}[\ln (t)]^2}. \end{aligned}\end{aligned}$$
(4.11)

Furthermore, observe that for every \(t\in [e^{2(n+1)},\infty )\) it holds

$$\begin{aligned} \begin{aligned} e^{-\frac{1}{2}[\ln (t)]^2}&=e^{\left[ {\ln (t)(-\frac{1}{2}\ln (t))}\right] } =\left[ {e^{\ln (t)}}\right] ^{-\frac{1}{2}\ln (t)}=t^{-\frac{1}{2}\ln (t)}\le t^{-(n+1)}. \end{aligned} \end{aligned}$$
(4.12)

This, (4.11) and the fact that for every \(\varepsilon \in (0,1]\), \(c\in [N_{\varepsilon },\infty )\), \(x\in [a,b]^d\), \(i\in \{1,2,\dots ,{d}\}\) we have \(\tfrac{K_i+c}{x_i}\ge \tfrac{c}{b}\ge e^{2(n+1)}\ge 1\) imply that for every \(\varepsilon \in (0,1]\), \(c\in [N_{\varepsilon },\infty )\), \(x\in [a,b]^d\) it holds

$$\begin{aligned} \begin{aligned} \left| {F_x(c)}\right|&=\left| {1-\prod _{i=1}^d\left[ {\tfrac{1}{\sqrt{2\pi }}\int _{-\infty }^{\ln \left( \frac{K_i+c}{x_i}\right) }e^{-\frac{1}{2}r^2}\mathrm {d}r}\right] }\right| =\left| {1-\prod _{i=1}^d\left[ {1-g\left( \tfrac{K_i+c}{x_i}\right) }\right] }\right| \\&\le \left| {1-\prod _{i=1}^d\left[ {1-\tfrac{1}{2}\left[ {\tfrac{K_i+c}{x_i}}\right] ^{-(n+1)}}\right] }\right| \le \left| {1-\prod _{i=1}^d\left[ {1-\tfrac{1}{2}\left[ {\tfrac{c}{b}}\right] ^{-(n+1)}}\right] }\right| . \end{aligned}\end{aligned}$$
(4.13)

Combining this with the binomial theorem and the fact that for every \(i\in \{1,2,\dots ,{d}\}\) we have \(\left( {\begin{array}{c}d\\ i\end{array}}\right) \le \tfrac{d^i}{i!}\le \tfrac{d^i}{\exp (i\ln (i)-i+1)}\le \tfrac{(de)^i}{i^i}\) establishes that for every \(\varepsilon \in (0,1]\), \(c\in [N_{\varepsilon },\infty )\), \(x\in [a,b]^d\) it holds

$$\begin{aligned} \begin{aligned} \left| {F_x(c)}\right|&\le \left| {1-\left( {1-\tfrac{1}{2}\left[ {\tfrac{c}{b}}\right] ^{-(n+1)}}\right) ^d}\right| =\left| {1-\sum _{i=0}^d\left[ {\left( {\begin{array}{c}d\\ i\end{array}}\right) \left[ {-\tfrac{1}{2}\left[ {\tfrac{c}{b}}\right] ^{-(n+1)}}\right] ^i}\right] }\right| \\&\le \sum _{i=1}^d\left[ {\left( {\begin{array}{c}d\\ i\end{array}}\right) \left[ {\tfrac{1}{2}}\right] ^i\left[ {\tfrac{b}{c}}\right] ^{(n+1)i}}\right] \le \sum _{i=1}^d\left[ {\tfrac{de}{2i}}\right] ^i\left[ {\tfrac{b}{c}}\right] ^{(n+1)i}\\&=\sum _{i=1}^d\left[ {\tfrac{e}{2i}}\right] ^i\left[ {d\left[ {\tfrac{b}{c}}\right] ^{n+1}}\right] ^i\le 2d\left[ {\tfrac{b}{c}}\right] ^{n+1}\left[ {\sum _{i=1}^d\left[ {d\left[ {\tfrac{b}{c}}\right] ^{n+1}}\right] ^{i-1}}\right] \\&=2d\left[ {\tfrac{b}{c}}\right] ^{n+1}\left[ {\sum _{i=0}^{d-1}\left[ {d\left[ {\tfrac{b}{c}}\right] ^{n+1}}\right] ^{i}}\right] \le 2d\left[ {\tfrac{b}{c}}\right] ^{n+1}\left[ {\sum _{i=0}^{\infty }\left[ {d\left[ {\tfrac{b}{c}}\right] ^{n+1}}\right] ^{i}}\right] . \end{aligned}\end{aligned}$$
(4.14)

This, the geometric sum formula and the fact that for every \(\varepsilon \in (0,1]\) it holds that \(N_\varepsilon \ge 2bd^{\frac{1}{n}}\) imply that for every \(\varepsilon \in (0,1]\), \(c\in [N_{\varepsilon },\infty )\), \(x\in [a,b]^d\) we have

$$\begin{aligned} \left| {F_x(c)}\right|&\le 2d\left[ {\tfrac{b}{c}}\right] ^{n+1} \left[ {\frac{1}{1-d\left[ {\tfrac{b}{c}}\right] ^{n+1}}}\right] \le 4d\left[ {\tfrac{b}{c}}\right] ^{n+1}. \end{aligned}$$
(4.15)

Hence, we obtain for every \(\varepsilon \in (0,1]\), \(x\in [a,b]^d\) that

$$\begin{aligned} \begin{aligned} \left| {\int _{N_{\varepsilon }}^\infty F_x(c)\,\mathrm {d}c}\right|&\le 4db^{n+1} \left| {\int _{N_{\varepsilon }}^\infty c^{-(n+1)}\mathrm {d}c}\right| =4db^{n+1}\tfrac{1}{n}(N_{\varepsilon })^{-n}\\&=\tfrac{4}{n}db^{n+1}\left[ {2e^{2(n+1)}(b+1)^{1+\frac{1}{n}}d^{\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}\right] ^{-n}\\&=\tfrac{4}{n}db^{n+1}2^{-n}e^{-(2n^2+2n)}(b+1)^{-(n+1)}d^{-1}\varepsilon \\&=\tfrac{4}{n}2^{-n}e^{-(2n^2+2n)}\left[ {\tfrac{b}{b+1}}\right] ^{n+1}\varepsilon \le \varepsilon . \end{aligned}\end{aligned}$$
(4.16)

This completes the proof of Lemma 4.2. \(\square \)
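The truncation point \(N_{\varepsilon }\) from Lemma 4.2 can be checked numerically; the sketch below (ours, with arbitrary parameter choices) evaluates the tail integral for the worst case \(x=(b,\dots ,b)\) and confirms (4.9). Since the constants in the proof are worst-case, the computed tail is in fact far smaller than \(\varepsilon \).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

n, d, b, eps = 2, 10, 1.2, 1e-2
K = np.full(d, 1.0)
x = np.full(d, b)                       # the tail is largest for x = (b, ..., b)

# truncation point N_eps from the statement of Lemma 4.2
N_eps = 2.0 * np.exp(2 * (n + 1)) * (b + 1) ** (1 + 1 / n) * d ** (1 / n) * eps ** (-1 / n)

def F(c):                               # integrand F_x(c) from (4.8)
    return 1.0 - np.prod(norm.cdf(np.log((K + c) / x)))

tail, _ = quad(F, N_eps, np.inf)
print(N_eps, tail, tail <= eps)         # the tail is (far) below eps, cf. (4.9)
```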

Next we combine the above result with Lemma 4.1 in order to derive the number of quadrature points needed to approximate the integral by a finite sum to within a prescribed error bound \(\varepsilon \).

Lemma 4.3

Let \(n\in {\mathbb {N}}\), \(a\in (0,\infty )\), \(b\in (a,\infty )\), \((K_i)_{i\in {\mathbb {N}}}\subseteq [0,\infty )\), let \(F^d_x:(0,\infty )\rightarrow {\mathbb {R}}\), \(x\in [a,b]^d\), \(d\in {\mathbb {N}}\), be the functions which satisfy for every \(d\in {\mathbb {N}}\), \(x=(x_1,x_2,\dots ,x_d)\in [a,b]^d\), \(c\in (0, \infty )\) that

$$\begin{aligned} F^d_x(c)=1-\prod _{i=1}^d\left[ {\tfrac{1}{\sqrt{2\pi }}\int _{-\infty }^{\ln (\frac{K_i+c}{x_i})}e^{-\frac{1}{2}r^2}\mathrm {d}r}\right] , \end{aligned}$$
(4.17)

and for every \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\) let \(N_{d,\varepsilon }\in {\mathbb {R}}\) be given by

$$\begin{aligned} N_{d,\varepsilon }=2e^{2(n+1)}(b+1)^{1+\frac{1}{n}}d^{\frac{1}{n}}\left[ {\tfrac{\varepsilon }{2}}\right] ^{-\frac{1}{n}}. \end{aligned}$$
(4.18)

Then there exist \(Q_{d,\varepsilon }\in {\mathbb {N}}\), \(c^d_{\varepsilon ,j}\in (0,N_{d,\varepsilon })\), \(w^d_{\varepsilon ,j}\in [0,\infty )\), \(j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}\), \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\), such

  (i) that

    $$\begin{aligned} \sup _{\varepsilon \in (0,1], d\in {\mathbb {N}}}\left[ {\frac{Q_{d,\varepsilon }}{d^{1+\frac{2}{n}}\varepsilon ^{-\frac{2}{n}}}}\right] <\infty \end{aligned}$$
    (4.19)

    and

  (ii) that for every \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\) it holds \(\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}=N_{d,\varepsilon }\) and

    $$\begin{aligned} \sup _{x\in [a,b]^d}\left| {\int _0^\infty F^d_x(c)\,\mathrm {d}c-\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}F^d_x(c^d_{\varepsilon ,j})}\right| \le \varepsilon . \end{aligned}$$
    (4.20)

Proof of Lemma 4.3

Note that Lemma 3.5 ensures the existence of \(S_m\in {\mathbb {R}}\), \(m\in {\mathbb {N}}\), such that for every \(d,m\in {\mathbb {N}}\), \(x\in [a,b]^d\) it holds

$$\begin{aligned} \sup _{c\,\in (0,\infty )}\left| {(F^d_x)^{(m)}(c)}\right| \le S_m d^m. \end{aligned}$$
(4.21)

Let \(Q_{d,\varepsilon }\in {\mathbb {N}}\), \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\), be given by

$$\begin{aligned} Q_{d,\varepsilon }=n\left\lceil \left[ {\tfrac{1}{(2n)!}(N_{d,\varepsilon })^{2n+1}S_{2n}d^{2n}\tfrac{2}{\varepsilon }}\right] ^{\frac{1}{2n}}\right\rceil . \end{aligned}$$
(4.22)

Next observe that Lemma 4.1 (with \(N\leftrightarrow N_{d,\varepsilon }\) in the notation of Lemma 4.1) establishes the existence of \(c^d_{\varepsilon ,j}\in (0,N_{d,\varepsilon })\), \(w^d_{\varepsilon ,j}\in [0,\infty )\), \(j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}\), \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\), such that for every \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,\infty )\), \(x\in [a,b]^d\) we have \(\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}=N_{d,\varepsilon }\) and

$$\begin{aligned} \begin{aligned} \left| {\int _0^{N_{d,\varepsilon }}F^d_x(c)\mathrm {d}c-\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}F^d_x(c^d_{\varepsilon ,j})}\right|&\le \tfrac{1}{(2n)!}(N_{d,\varepsilon })^{2n+1}\left[ {\tfrac{Q_{d,\varepsilon }}{n}}\right] ^{-2n}S_{2n}d^{2n}\\&\le \tfrac{1}{(2n)!}(N_{d,\varepsilon })^{2n+1}\left[ {\tfrac{1}{(2n)!}(N_{d,\varepsilon })^{2n+1}S_{2n}d^{2n}\tfrac{2}{\varepsilon }}\right] ^{-1}S_{2n}d^{2n}\\&=\tfrac{\varepsilon }{2}. \end{aligned}\end{aligned}$$
(4.23)

Moreover, note that Lemma 4.2 (with \(N_{d,\frac{\varepsilon }{2}}\leftrightarrow N_{d,\varepsilon }\) in the notation of Lemma 4.2) and (4.23) imply for every \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\), \(x\in [a,b]^d\) that

$$\begin{aligned} \begin{aligned}&\left| {\int _0^\infty F^d_x(c)\,\mathrm {d}c-\sum _{j=1}^{Q_{d,\varepsilon }} w^d_{\varepsilon ,j}F^d_x(c^d_{\varepsilon ,j})}\right| \\&\quad \le \left| {\int _0^{N_{d,\varepsilon }}F^d_x(c)\,\mathrm {d}c-\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}F^d_x(c^d_{\varepsilon ,j})}\right| + \left| {\int ^{\infty }_{N_{d,\varepsilon }}F^d_x(c)\,\mathrm {d}c}\right| \\&\quad \le \tfrac{\varepsilon }{2}+\tfrac{\varepsilon }{2}=\varepsilon . \end{aligned}\end{aligned}$$
(4.24)

Furthermore, we have for every \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\) that

$$\begin{aligned} \begin{aligned} Q_{d,\varepsilon }&\le n\left( {1+\left[ {\tfrac{1}{(2n)!}(N_{d,\varepsilon })^{2n+1}S_{2n}d^{2n}\tfrac{2}{\varepsilon }}\right] ^{\frac{1}{2n}}}\right) \\&=n+n\left[ {\tfrac{2S_{2n}}{(2n)!}}\right] ^{\frac{1}{2n}}d\varepsilon ^{-\frac{1}{2n}}(N_{d,\varepsilon })^{1+\frac{1}{2n}}\\&\le n+n\left[ {\tfrac{2S_{2n}}{(2n)!}}\right] ^{\frac{1}{2n}}d\varepsilon ^{-\frac{1}{2n}}\left[ {4e^{2(n+1)}(b+1)^{1+\frac{1}{n}}d^{\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}\right] ^{1+\frac{1}{2n}}\\&=n+4n\left[ {\tfrac{8S_{2n}}{(2n)!}}\right] ^{\frac{1}{2n}}e^{2n+3+\frac{1}{n}}\left[ {b+1}\right] ^{1+\frac{3}{2n}+\frac{1}{2n^2}}d^{1+\frac{1}{n}+\frac{1}{2n^2}}\varepsilon ^{-\frac{3}{2n}-\frac{1}{2n^2}}\\&\le nd^{1+\frac{2}{n}}\varepsilon ^{-\frac{2}{n}} + 4n\left[ {\tfrac{8S_{2n}}{(2n)!}}\right] ^{\frac{1}{2n}}e^{2n+3+\frac{1}{n}}\left[ {b+1}\right] ^{1+\frac{3}{2n}+\frac{1}{2n^2}}d^{1+\frac{2}{n}}\varepsilon ^{-\frac{2}{n}}. \end{aligned} \end{aligned}$$
(4.25)

This implies

$$\begin{aligned} \sup _{\varepsilon \in (0,1], d\in {\mathbb {N}}}\left[ {\frac{Q_{d,\varepsilon }}{d^{1+\frac{2}{n}}\varepsilon ^{-\frac{2}{n}}}}\right] \le n + 4n\left[ {\tfrac{8S_{2n}}{(2n)!}}\right] ^{\frac{1}{2n}}e^{2n+3+\frac{1}{n}}\left[ {b+1}\right] ^{1+\frac{3}{2n}+\frac{1}{2n^2}}<\infty . \end{aligned}$$
(4.26)

The proof of Lemma 4.3 is thus completed. \(\square \)
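
For illustration, the following Python sketch mimics the quantities appearing in Lemma 4.3 on toy data: it truncates the integral at a point \(N\) (standing in for \(N_{d,\varepsilon }\)) and applies a Gauss–Legendre rule on \((0,N)\), whose transformed weights sum to \(N\) as in item (ii). The rule is only a stand-in for the quadrature provided by Lemma 4.1, and the values of d, \(K_i\), x and \(N\) below are arbitrary; the constants \(S_m\) and \(Q_{d,\varepsilon }\) from the proof are not computed.

```python
import numpy as np
from math import erf, log, sqrt

def F(c, x, K):
    """F^d_x(c) = 1 - prod_i Phi(ln((K_i + c)/x_i)) with Phi the standard normal CDF, cf. (4.17)."""
    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
    return 1.0 - np.prod([Phi(log((Ki + c) / xi)) for Ki, xi in zip(K, x)])

def quadrature(x, K, N, Q):
    """Q-point Gauss-Legendre rule on (0, N); the transformed weights sum to N."""
    t, w = np.polynomial.legendre.leggauss(Q)     # nodes and weights on [-1, 1]
    c, wN = 0.5 * N * (t + 1.0), 0.5 * N * w      # mapped to (0, N); sum(wN) == N
    return sum(wj * F(cj, x, K) for wj, cj in zip(wN, c))

# hypothetical toy data: d = 3, strikes K_i and spot values x_i in [a, b]
x, K, N = [1.0, 1.2, 0.9], [1.0, 1.1, 0.8], 50.0
print(abs(quadrature(x, K, N, Q=8) - quadrature(x, K, N, Q=200)))  # error decays rapidly in Q
```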

5 Basic ReLU DNN Calculus

In order to talk about neural networks we will, up to some minor changes and additions, adopt the notation of P. Petersen and F. Voigtlaender from [34]. This allows us to differentiate between a neural network, defined as a structured set of weights, and its realization, which is a function on \({\mathbb {R}}^d\). Note that this is almost necessary in order to talk about the complexity of neural networks, since notions like depth, size or architecture do not make sense for general functions on \({\mathbb {R}}^d\). Even if we know that a given function “is” a neural network, i.e., can be written as a series of affine transformations and componentwise nonlinearities, there are, in general, multiple non-trivially different ways to do so.

Each of these structured sets we consider does, however, define a unique function. This enables us to explicitly and unambiguously construct complex neural networks from simple ones, and subsequently relate the approximation capability of a given network to its complexity. Further note that since the realization of a neural network is unique, we can still speak of a neural network approximating a given function when its realization does so.

Specifically, a neural network will be given by its architecture, i.e., the number of layers L and the layer dimensions \(N_0,N_1,\dots ,N_L\), as well as the weights determining the affine transformations used to compute each layer from the previous one. Note that our notion of neural networks does not attach the architecture and weights to a fixed activation function, but instead considers the realization of such a neural network with respect to a given activation function. This choice is a purely technical one here, as we always consider networks with ReLU activation function.

Setting 5.1

(Neural networks) For every \(L\in {\mathbb {N}}\), \(N_0,N_1,\dots ,N_L\in {\mathbb {N}}\) let \({\mathcal {N}}_L^{N_0,N_1,\dots ,N_L}\) be the set given by

$$\begin{aligned} {\mathcal {N}}_L^{N_0,N_1,\dots ,N_L}=\times _{l=1}^L\left( {{\mathbb {R}}^{N_l\times N_{l-1}}\times {\mathbb {R}}^{N_l}}\right) , \end{aligned}$$
(5.1)

let \({\mathfrak {N}}\) be the set given by

$$\begin{aligned} {\mathfrak {N}}=\bigcup _{\begin{array}{c} L \in {\mathbb {N}},\\ N_0, N_1, ..., N_L \in {\mathbb {N}} \end{array} } {\mathcal {N}}^{ N_0, N_1,\dots , N_L }_L, \end{aligned}$$
(5.2)

let \({\mathcal {L}},{\mathcal {M}},{\mathcal {M}}_l,\dim _{\mathrm {in}},\dim _{\mathrm {out}}:{\mathfrak {N}}\rightarrow {\mathbb {N}}\), \(l\in \{1,2,\dots ,{L}\}\), be the functions which satisfy for every \(L\in {\mathbb {N}}\) and every \({N_0,N_1,\dots ,N_L\in {\mathbb {N}}}\), \(\Phi =(((A^1_{i,j})_{i,j=1}^{N_1,N_0},(b^1_i)_{i=1}^{N_1}),\dots ,((A^L_{i,j})_{i,j=1}^{N_L,N_{L-1}},(b^L_i)_{i=1}^{N_L}))\in {\mathcal {N}}^{ N_0, N_1,\dots , N_L }_L\), \(l\in \{1,2,\dots ,{L}\}\) that \({\mathcal {L}}(\Phi )=L\), \(\dim _{\mathrm {in}}(\Phi )=N_0\), \(\dim _{\mathrm {out}}(\Phi )=N_L\),

$$\begin{aligned} {\mathcal {M}}_l(\Phi )=\sum _{i=1}^{N_l}\left[ {\mathbb {1}_{{\mathbb {R}}\backslash \{0\}}(b^l_i) +\sum _{j=1}^{N_{l-1}}\mathbb {1}_{{\mathbb {R}}\backslash \{0\}}(A^l_{i,j})}\right] , \end{aligned}$$
(5.3)

and

$$\begin{aligned} {\mathcal {M}}(\Phi )=\sum _{l=1}^L{\mathcal {M}}_l(\Phi ). \end{aligned}$$
(5.4)

For every \(\varrho \in C({\mathbb {R}},{\mathbb {R}})\) let \(\varrho ^*:\cup _{d\in {\mathbb {N}}}{\mathbb {R}}^d\rightarrow \cup _{d\in {\mathbb {N}}}{\mathbb {R}}^d\) be the function which satisfies for every \(d\in {\mathbb {N}}\), \(x=(x_1,x_2,\dots ,x_d)\in {\mathbb {R}}^d\) that \(\varrho ^*(x)=(\varrho (x_1),\varrho (x_2),\dots ,\varrho (x_d))\), and for every \(\varrho \in {\mathcal {C}}({\mathbb {R}},{\mathbb {R}})\) denote by \(R_{\varrho }:{\mathfrak {N}}\rightarrow \cup _{a,b\in {\mathbb {N}}}\,C({\mathbb {R}}^a,{\mathbb {R}}^b)\) the function which satisfies for every \(L\in {\mathbb {N}}\), \(N_0,N_1,\dots ,N_L\in {\mathbb {N}}\), \(x_0\in {\mathbb {R}}^{N_0}\), and \(\Phi =((A_1,b_1),(A_2,b_2),\dots ,(A_L,b_L))\in {\mathcal {N}}_L^{N_0,N_1,\dots ,N_L}\), with \(x_1\in {\mathbb {R}}^{N_1},\dots ,x_{L-1}\in {\mathbb {R}}^{N_{L-1}}\) given by

$$\begin{aligned} x_l = \varrho ^*( A_{l} x_{l-1} + b_{l})\;,\qquad l = 1,...,L-1\;, \end{aligned}$$
(5.5)

that

$$\begin{aligned} \left[ { R_{ \varrho }( \Phi ) }\right] ( x_0 ) = A_L x_{ L - 1 } + b_L\;. \end{aligned}$$
(5.6)

The quantity \({\mathcal {M}}(\Phi )\) simply denotes the number of nonzero entries of the network \(\Phi \), which, together with its depth \({\mathcal {L}}(\Phi )\), will be how we measure the “size” of a given neural network \(\Phi \). One could instead count all weights of a neural network, i.e., including zeros. Note, however, that for any non-degenerate neural network \(\Phi \) the total number of weights is bounded from above by \({\mathcal {M}}(\Phi )^2+{\mathcal {M}}(\Phi )\). Here, the terminology “degenerate” refers to a neural network which has neurons that can be removed without changing its realization. This implies that for any neural network there also exists a non-degenerate one of smaller or equal size which has the exact same realization. Since our primary goal is to approximate d-variate functions by networks whose size depends only polynomially on the dimension, the above means that qualitatively the same results hold regardless of which notion of “size” is used.
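
The following minimal Python sketch (purely illustrative; the example weights are arbitrary) shows how a network in the sense of Setting 5.1 can be stored as a list of matrix–vector pairs and how its realization (5.5)–(5.6), its depth \({\mathcal {L}}(\Phi )\) and its size \({\mathcal {M}}(\Phi )\) are evaluated.

```python
import numpy as np

def realize(phi, x, rho=lambda z: np.maximum(z, 0.0)):
    """R_rho(Phi)(x) via (5.5)-(5.6): rho acts coordinatewise after every affine map except the last."""
    *hidden, (A_L, b_L) = phi
    for A, b in hidden:
        x = rho(A @ x + b)
    return A_L @ x + b_L

def depth(phi):
    """L(Phi): the number of affine transformations."""
    return len(phi)

def size(phi):
    """M(Phi): the number of nonzero entries of all weight matrices and bias vectors."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in phi)

# hypothetical example with architecture N_0 = 2, N_1 = 3, N_2 = 1
phi = [(np.array([[1.0, -1.0], [0.0, 2.0], [1.0, 0.0]]), np.zeros(3)),
       (np.array([[1.0, 1.0, -1.0]]), np.array([0.5]))]
print(realize(phi, np.array([0.3, -0.7])), depth(phi), size(phi))
```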

We start by introducing two basic tools for constructing new neural networks from known ones and, in Lemma 5.3 and Lemma 5.4, consider how the properties of a derived network depend on its parts. Note that techniques like these have already been used in [34] and [37].

The first tool will be the “composition” of neural networks in (5.7), which takes two networks and provides a new network whose realization is the composition of the realizations of the two constituent networks.

The second tool will be the “parallelization” of neural networks in (5.12), which will be useful when considering linear combinations or tensor products of functions which we can already approximate. While the parallelization of networks of equal depth (5.10) works with arbitrary activation functions, for the general case we use the fact that any ReLU network can easily be extended (5.11) to an arbitrary depth without changing its realization.

Setting 5.2

Assume Setting 5.1, for every \(L_1,L_2\in {\mathbb {N}}\), \(\Phi ^i=\left( {(A_1^i,b_1^i), (A_2^i,b_2^i),\dots ,(A^i_{L_i},b^i_{L_i})}\right) \in {\mathfrak {N}}\), \(i\in \{1,2\}\), with \(\dim _{\mathrm {in}}(\Phi ^1)=\dim _{\mathrm {out}}(\Phi ^2)\) let \(\Phi ^1\odot \Phi ^2\in {\mathfrak {N}}\) be the neural network given by

$$\begin{aligned} \Phi ^1\odot \Phi ^2=&\left( (A_1^2,b_1^2),\dots ,(A^2_{L_2-1},b^2_{L_2-1}),\left( {\begin{pmatrix}A^2_{L_2}\\ -A^2_{L_2}\end{pmatrix}, \begin{pmatrix}b^2_{L_2}\\ -b^2_{L_2}\end{pmatrix}}\right) , \left( {\begin{pmatrix}A^1_1&-A^1_1\end{pmatrix},b^1_1}\right) ,\right. \nonumber \\&\left. (A^1_2,b^1_2),\dots ,(A^1_{L_1},b^1_{L_1})\right) , \end{aligned}$$
(5.7)

for every \(d\in {\mathbb {N}}\), \(L\in {\mathbb {N}}\cap [2,\infty )\) let \(\Phi ^{\mathrm {Id}}_{d,L}\in {\mathfrak {N}}\) be the neural network given by

$$\begin{aligned} \Phi ^{\mathrm {Id}}_{d,L}=\left( {\left( {\begin{pmatrix}\mathrm {Id}_{{\mathbb {R}}^d}\\ -\mathrm {Id}_{{\mathbb {R}}^d}\end{pmatrix},0}\right) ,\underbrace{(\mathrm {Id}_{{\mathbb {R}}^{2d}},0),\dots ,(\mathrm {Id}_{{\mathbb {R}}^{2d}},0)}_{\text {L-2 times}},\left( {\begin{pmatrix}\mathrm {Id}_{{\mathbb {R}}^d}&-\mathrm {Id}_{{\mathbb {R}}^d}\end{pmatrix},0}\right) }\right) , \end{aligned}$$
(5.8)

for every \(d\in {\mathbb {N}}\) let \(\Phi ^{\mathrm {Id}}_{d,1}\in {\mathfrak {N}}\) be the neural network given by

$$\begin{aligned} \Phi ^{\mathrm {Id}}_{d,1}=((\mathrm {Id}_{{\mathbb {R}}^d},0)), \end{aligned}$$
(5.9)

for every \(n,L\in {\mathbb {N}}\), \(\Phi ^j=((A^j_1,b^j_1),(A^j_2,b^j_2),\dots ,(A^j_L,b^j_L))\in {\mathfrak {N}}\), \(j\in \{1,2,\dots ,{n}\}\), let \({\mathcal {P}}_s(\Phi ^1,\Phi ^2,\dots ,\Phi ^n)\in {\mathfrak {N}}\) be the neural network which satisfies

$$\begin{aligned} {\mathcal {P}}_s(\Phi ^1,\Phi ^2,\dots ,\Phi ^n)=\left( {\left( {\begin{pmatrix}A^1_1&{}&{}&{}\\ {} &{} A^2_1 &{}&{}\\ &{}&{}\ddots &{}\\ &{}&{}&{}A^n_1\end{pmatrix},\begin{pmatrix}b^1_1\\ b^2_1\\ \vdots \\ b^n_1\end{pmatrix}}\right) ,\dots ,\left( {\begin{pmatrix}A^1_L&{}&{}&{}\\ {} &{} A^2_L &{}&{}\\ &{}&{}\ddots &{}\\ &{}&{}&{}A^n_L\end{pmatrix},\begin{pmatrix}b^1_L\\ b^2_L\\ \vdots \\ b^n_L\end{pmatrix}}\right) }\right) , \end{aligned}$$
(5.10)

for every \(L,d\in {\mathbb {N}}\), \(\Phi \in {\mathfrak {N}}\) with \({\mathcal {L}}(\Phi )\le L\), \(\dim _{\mathrm {out}}(\Phi )=d\), let \({\mathcal {E}}_L(\Phi )\in {\mathfrak {N}}\) be the neural network given by

$$\begin{aligned} {\mathcal {E}}_L(\Phi )={\left\{ \begin{array}{ll}\Phi ^{\mathrm {Id}}_{d,L-{\mathcal {L}}(\Phi )}\odot \Phi &{} :{\mathcal {L}}(\Phi )<L \\ \Phi &{} :{\mathcal {L}}(\Phi )=L\end{array}\right. }, \end{aligned}$$
(5.11)

and for every \(n,L\in {\mathbb {N}}\), \(\Phi ^j\in {\mathfrak {N}}\), \(j\in \{1,2,\dots ,{n}\}\) with \(\max _{j\in \{1,2,\dots ,{n}\}}{\mathcal {L}}(\Phi ^j)=L\), let \({\mathcal {P}}(\Phi ^1,\Phi ^2,\dots ,\Phi ^n)\in {\mathfrak {N}}\) denote the neural network given by

$$\begin{aligned} {\mathcal {P}}(\Phi ^1,\Phi ^2,\dots ,\Phi ^n)=&{\mathcal {P}}_s({\mathcal {E}}_L(\Phi ^1),{\mathcal {E}}_L(\Phi ^2),\dots ,{\mathcal {E}}_L(\Phi ^n)). \end{aligned}$$
(5.12)
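
Continuing the sketch after Setting 5.1, the operations of Setting 5.2 admit a direct implementation on the list-of-pairs representation; the function names below (compose, identity_net, extend, parallelize) are ours and serve only to illustrate (5.7)–(5.12).

```python
import numpy as np

def block_diag(*mats):
    """Small helper: block-diagonal matrix built from the given 2-D arrays."""
    rows, cols = sum(M.shape[0] for M in mats), sum(M.shape[1] for M in mats)
    out, r, c = np.zeros((rows, cols)), 0, 0
    for M in mats:
        out[r:r + M.shape[0], c:c + M.shape[1]] = M
        r, c = r + M.shape[0], c + M.shape[1]
    return out

def compose(phi1, phi2):
    """phi1 composed with phi2 as in (5.7); requires dim_in(phi1) == dim_out(phi2)."""
    (A2, b2), (A1, b1) = phi2[-1], phi1[0]
    return (phi2[:-1]
            + [(np.vstack([A2, -A2]), np.concatenate([b2, -b2])),
               (np.hstack([A1, -A1]), b1)]
            + phi1[1:])

def identity_net(d, L):
    """Phi^Id_{d,L} as in (5.8)/(5.9); its ReLU realization is x -> x."""
    if L == 1:
        return [(np.eye(d), np.zeros(d))]
    first = (np.vstack([np.eye(d), -np.eye(d)]), np.zeros(2 * d))
    last = (np.hstack([np.eye(d), -np.eye(d)]), np.zeros(d))
    return [first] + [(np.eye(2 * d), np.zeros(2 * d))] * (L - 2) + [last]

def extend(phi, L):
    """E_L(phi) as in (5.11): pad to depth L without changing the ReLU realization."""
    return phi if len(phi) == L else compose(identity_net(phi[-1][0].shape[0], L - len(phi)), phi)

def parallelize(*phis):
    """P(phi^1,...,phi^n) as in (5.10)/(5.12): extend to a common depth, then stack block-diagonally."""
    L = max(len(p) for p in phis)
    phis = [extend(p, L) for p in phis]
    return [(block_diag(*[p[l][0] for p in phis]),
             np.concatenate([p[l][1] for p in phis])) for l in range(L)]
```

In particular, realize(compose(phi1, phi2), x) agrees with realize(phi1, realize(phi2, x)), and realize(parallelize(phi1, phi2), ·) returns the two realizations evaluated on the respective blocks of the input, mirroring Lemma 5.3(i) and Lemma 5.4(i) below.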

Lemma 5.3

Assume Setting 5.2, let \(\Phi ^1,\Phi ^2\in {\mathfrak {N}}\) and let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in {\mathbb {R}}\) that \(\varrho (t)=\max \{0,t\}\). Then

  1. (i)

    for every \(x\in {\mathbb {R}}^{\dim _{\mathrm {in}}(\Phi ^2)}\) it holds

    $$\begin{aligned}{}[R_{\varrho }(\Phi ^1\odot \Phi ^2)](x)=([R_{\varrho }(\Phi ^1)]\circ [R_{\varrho }(\Phi ^2)])(x)=[R_{\varrho }(\Phi ^1)]([R_{\varrho }(\Phi ^2)](x)), \end{aligned}$$
    (5.13)
  2. (ii)

    \({\mathcal {L}}(\Phi ^1\odot \Phi ^2)={\mathcal {L}}(\Phi ^1)+{\mathcal {L}}(\Phi ^2)\),

  3. (iii)

    \({\mathcal {M}}(\Phi ^1\odot \Phi ^2)\le {\mathcal {M}}(\Phi ^1)+{\mathcal {M}}(\Phi ^2)+{\mathcal {M}}_1(\Phi ^1)+{\mathcal {M}}_{{\mathcal {L}}(\Phi ^2)}(\Phi ^2) \le 2({\mathcal {M}}(\Phi ^1)+{\mathcal {M}}(\Phi ^2))\),

  4. (iv)

    \({\mathcal {M}}_1(\Phi ^1\odot \Phi ^2)={\mathcal {M}}_1(\Phi ^2)\),

  5. (v)

    \({\mathcal {M}}_{{\mathcal {L}}(\Phi ^1\odot \Phi ^2)}(\Phi ^1\odot \Phi ^2)={\mathcal {M}}_{{\mathcal {L}}(\Phi ^1)}(\Phi ^1)\),

  6. (vi)

    \(\dim _{\mathrm {in}}(\Phi ^1\odot \Phi ^2)=\dim _{\mathrm {in}}(\Phi ^2)\),

  7. (vii)

    \(\dim _{\mathrm {out}}(\Phi ^1\odot \Phi ^2)=\dim _{\mathrm {out}}(\Phi ^1)\),

  8. (viii)

    for every \(d,L\in {\mathbb {N}}\), \(x\in {\mathbb {R}}^d\) it holds that \([R_{\varrho }(\Phi ^{\mathrm {Id}}_{d,L})](x)=x\) and

  9. (ix)

    for every \(L\in {\mathbb {N}}\), \(\Phi \in {\mathfrak {N}}\) with \({\mathcal {L}}(\Phi )\le L\), \(x\in {\mathbb {R}}^{\dim _{\mathrm {in}}(\Phi )}\) it holds that \([R_{\varrho }({\mathcal {E}}_L(\Phi ))](x)=[R_{\varrho }(\Phi )](x)\).

Proof of Lemma 5.3

For every \(i\in \{1,2\}\) let \(L_i\in {\mathbb {N}}\), \(N^i_0,N^i_1,\dots ,N^i_{L_i}\in {\mathbb {N}}\), \((A^i_l,b^i_l)\in {\mathbb {R}}^{N^i_l\times N^i_{l-1}}\times {\mathbb {R}}^{N^i_l}\), \(l\in \{1,2,\dots ,{L_i}\}\), be such that \(\Phi ^i=((A^i_1,b^i_1),\dots ,(A^i_{L_i},b^i_{L_i}))\). Furthermore, let \(N_0,N_1,\dots ,N_{L_1+L_2}\in {\mathbb {N}}\), \((A_l,b_l)\in {\mathbb {R}}^{N_l\times N_{l-1}}\times {\mathbb {R}}^{N_l}\), \(l\in \{1,2,\dots ,{L_1+L_2}\}\), be the matrix-vector tuples which satisfy \(\Phi ^1\odot \Phi ^2=((A_1,b_1),\dots ,(A_{L_1+L_2},b_{L_1+L_2}))\) and let \(r_l:{\mathbb {R}}^{N_0}\rightarrow {\mathbb {R}}^{N_l}\), \(l\in \{1,2,\dots ,{L_1+L_2}\}\), be the functions which satisfy for every \(x\in {\mathbb {R}}^{N_0}\) that

$$\begin{aligned} r_l(x)=\left\{ \begin{array}{lll} \varrho ^*(A_1x+b_1) &{} :l=1\\ \varrho ^*(A_l r_{l-1}(x)+b_l) &{} :1<l<L_1+L_2\\ A_l r_{l-1}(x)+b_l &{} :l=L_1+L_2 \end{array}\right. . \end{aligned}$$
(5.14)

Observe that for every \(l\in \{1,2,\dots ,{L_2-1}\}\) holds \((A_l,b_l)=(A^2_l,b^2_l)\). This implies that for every \(x\in {\mathbb {R}}^{N_0}\) holds

$$\begin{aligned} A^2_{L_2}r_{L_2-1}(x)+b^2_{L_2}=[R_{\varrho }(\Phi _2)](x). \end{aligned}$$
(5.15)

Combining this with (5.7) implies for every \(x\in {\mathbb {R}}^{N_0}\) that

$$\begin{aligned} \begin{aligned} r_{L_2}(x)&=\varrho ^*(A_{L_2}r_{L_2-1}(x)+b_{L_2})=\varrho ^*\left( \begin{pmatrix}A^2_{L_2}\\ -A^2_{L_2}\end{pmatrix}r_{L_2-1}(x)+\begin{pmatrix}b^2_{L_2}\\ -b^2_{L_2}\end{pmatrix}\right) \\&=\varrho ^*\left( \begin{pmatrix}A^2_{L_2}r_{L_2-1}(x) +b^2_{L_2}\\ -A^2_{L_2}r_{L_2-1}(x)-b^2_{L_2}\end{pmatrix}\right) =\begin{pmatrix}\varrho ^*([R_{\varrho }(\Phi ^2)](x))\\ \varrho ^*(-[R_{\varrho }(\Phi ^2)](x))\end{pmatrix}. \end{aligned}\end{aligned}$$
(5.16)

In addition, for every \(d\in {\mathbb {N}}\), \(y=(y_1,y_2,\dots ,y_d)\in {\mathbb {R}}^d\) holds

$$\begin{aligned} \varrho ^*(y)-\varrho ^*(-y)=(\varrho (y_1)-\varrho (-y_1),\varrho (y_2)-\varrho (-y_2),\dots ,\varrho (y_d)-\varrho (-y_d))=y. \end{aligned}$$
(5.17)

This, (5.7) and (5.16) ensure that for every \(x\in {\mathbb {R}}^{N_0}\) holds

$$\begin{aligned} \begin{aligned} r_{L_2+1}(x)&=A_{L_2+1} \begin{pmatrix}\varrho ^*([R_{\varrho }(\Phi ^2)](x)) \\ \varrho ^*(-[R_{\varrho }(\Phi ^2)](x)) \end{pmatrix}+b_{L_2+1}\\&=A^1_1\varrho ^*([R_{\varrho }(\Phi ^2)](x))-A^1_1 \varrho ^*(-[R_{\varrho }(\Phi ^2)](x))+b_{L_2+1}\\&=A^1_1[R_{\varrho }(\Phi ^2)](x)+b^1_1. \end{aligned}\end{aligned}$$
(5.18)

Combining this with (5.14) establishes (i). Moreover, (ii)-(vii) follow directly from (5.7). Furthermore, (5.8), (5.9) and (5.17) imply (viii). Finally, (ix) follows from (5.11) and (viii). This completes the proof of Lemma 5.3. \(\square \)
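
As a quick sanity check of items (i) and (ii) — purely illustrative, reusing realize and compose from the sketches above with randomly chosen weights:

```python
import numpy as np

rng = np.random.default_rng(1)
phi2 = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
        (rng.standard_normal((2, 3)), rng.standard_normal(2))]   # maps R^2 -> R^2, depth 2
phi1 = [(rng.standard_normal((1, 2)), rng.standard_normal(1))]   # maps R^2 -> R, depth 1
x = rng.standard_normal(2)

# Lemma 5.3(i): the realization of the composed network is the composition of the realizations
assert np.allclose(realize(compose(phi1, phi2), x), realize(phi1, realize(phi2, x)))
# Lemma 5.3(ii): the depths add up
assert len(compose(phi1, phi2)) == len(phi1) + len(phi2)
```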

Lemma 5.4

Assume Setting 5.2, let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the function which satisfies for every \(t\in {\mathbb {R}}\) that \(\varrho (t)=\max \{0,t\}\), let \(n\in {\mathbb {N}}\), let \(\varphi ^j\in {\mathfrak {N}}\), \(j\in \{1,2,\dots ,{n}\}\), let \(d_j\in {\mathbb {N}}\), \(j\in \{1,2,\dots ,{n}\}\), be given by \(d_j=\dim _{\mathrm {in}}(\varphi ^j)\), let \(D\in {\mathbb {N}}\) be given by \(D=\sum _{j=1}^n d_j\) and let \(\Phi \in {\mathfrak {N}}\) be given by \(\Phi ={\mathcal {P}}(\varphi ^1,\varphi ^2,\dots ,\varphi ^n)\). Then

  1. (i)

    for every \(x\in {\mathbb {R}}^D\) it holds

    $$\begin{aligned} {}[R_{\varrho }(\Phi )](x)=&\left( [R_{\varrho }(\varphi ^1)](x_1,\dots ,x_{d_1}), [R_{\varrho }(\varphi ^2)](x_{d_1+1},\dots ,x_{d_1+d_2}),\dots ,\right. \nonumber \\&\left. [R_{\varrho }(\varphi ^n)](x_{D-d_n+1},\dots ,x_{D})\right) , \end{aligned}$$
    (5.19)
  2. (ii)

    \({\mathcal {L}}(\Phi )=\max _{j\in \{1,2,\dots ,{n}\}}{\mathcal {L}}(\varphi ^j)\),

  3. (iii)

    \({\mathcal {M}}(\Phi )\le 2\left( {\sum _{j=1}^n{\mathcal {M}}(\varphi ^j)}\right) +4\left( {\sum _{j=1}^n \dim _{\mathrm {out}}(\varphi ^j)}\right) \max _{j\in \{1,2,\dots ,{n}\}}{\mathcal {L}}(\varphi ^j)\),

  4. (iv)

    \({\mathcal {M}}(\Phi )=\sum _{j=1}^n{\mathcal {M}}(\varphi ^j)\) provided for every \(j,j'\in \{1,2,\dots ,{n}\}\) holds \({\mathcal {L}}(\varphi ^j)={\mathcal {L}}(\varphi ^{j'})\),

  5. (v)

    \({\mathcal {M}}_{{\mathcal {L}}(\Phi )}(\Phi )\le \sum _{j=1}^n \max \{2\dim _{\mathrm {out}}(\varphi ^j),{\mathcal {M}}_{{\mathcal {L}}(\varphi ^j)}(\varphi ^j)\}\),

  6. (vi)

    \({\mathcal {M}}_1(\Phi )=\sum _{j=1}^n{\mathcal {M}}_1(\varphi ^j)\),

  7. (vii)

    \(\dim _{\mathrm {in}}(\Phi )=\sum _{j=1}^n\dim _{\mathrm {in}}(\varphi ^j)\) and

  8. (viii)

    \(\dim _{\mathrm {out}}(\Phi )=\sum _{j=1}^n\dim _{\mathrm {out}}(\varphi ^j)\).

Proof of Lemma 5.4

Observe that Lemma 5.3 implies that for every \(j\in \{1,2,\dots ,{n}\}\) holds

$$\begin{aligned} R_{\varrho }({\mathcal {E}}_{{\mathcal {L}}(\Phi )}(\varphi ^j))=R_{\varrho }(\varphi ^j). \end{aligned}$$
(5.20)

Combining this with (5.10) and (5.12) establishes (i). Furthermore, note that (ii), (vi), (vii) and (viii) follow directly from (5.10) and (5.12). Moreover, (5.10) demonstrates that for every \(m\in {\mathbb {N}}\), \(\psi _i\in {\mathfrak {N}}\), \(i\in \{1,2,\dots ,{m}\}\), with \(\forall i,i'\in \{1,2,\dots ,{m}\}:{\mathcal {L}}(\psi ^i)={\mathcal {L}}(\psi ^{i'})\) holds

$$\begin{aligned} {\mathcal {M}}({\mathcal {P}}_s(\psi ^1,\psi ^2,\dots ,\psi ^m))=\sum _{i=1}^m{\mathcal {M}}(\psi ^i). \end{aligned}$$
(5.21)

This establishes (iv). Next, observe that Lemma 5.3, (5.11) and the fact that for every \(d,L\in {\mathbb {N}}\) it holds that \({\mathcal {M}}(\Phi ^{\mathrm {Id}}_{d,L})\le 2dL\) imply that for every \(j\in \{1,2,\dots ,{n}\}\) we have

$$\begin{aligned} \begin{aligned} {\mathcal {M}}({\mathcal {E}}_{{\mathcal {L}}(\Phi )}(\varphi ^j))&\le 2{\mathcal {M}}(\Phi ^{\mathrm {Id}}_{\dim _{\mathrm {out}}(\varphi ^j),{\mathcal {L}}(\Phi )-{\mathcal {L}}(\varphi ^j)})+2{\mathcal {M}}(\varphi ^j)\\&\le 4\dim _{\mathrm {out}}(\varphi ^j){\mathcal {L}}(\Phi )+2{\mathcal {M}}(\varphi ^j). \end{aligned} \end{aligned}$$
(5.22)

Combining this with (5.21) establishes (iii). In addition, note that (5.8), (5.9) and (5.11) ensure for every \(j\in \{1,2,\dots ,{n}\}\) that

$$\begin{aligned} {\mathcal {M}}_{{\mathcal {L}}(\Phi )}({\mathcal {E}}_{{\mathcal {L}}(\Phi )}(\varphi ^j)) \le \max \{2\dim _{\mathrm {out}}(\varphi ^j),{\mathcal {M}}_{{\mathcal {L}}(\varphi ^j)}(\varphi ^j)\}. \end{aligned}$$
(5.23)

Combining this with (5.10) establishes (v). The proof of Lemma 5.4 is thus completed. \(\square \)

6 Basic Expression Rate Results

Here, we begin by establishing an expression rate result for a very simple function, namely \(x\mapsto x^2\) on [0, 1]. Our approach is based on the observation by M. Telgarsky [40] that neural networks with ReLU activation function can efficiently compute high-frequency sawtooth functions and the idea of D. Yarotsky in [44] to use this in order to approximate the function \(x\mapsto x^2\) by networks computing its linear interpolations. This can then be used to derive networks capable of efficiently approximating \((x,y)\mapsto xy\), which leads to tensor products as well as polynomials and subsequently smooth functions. Note that [44] uses a slightly different notion of neural networks, where connections between non-adjacent layers are permitted. This does, however, only require a technical modification of the proof, which does not significantly change the result. Nonetheless, the respective proofs are provided in the appendix for completeness.

Lemma 6.1

Assume Setting 5.1 and let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\). Then there exist neural networks \((\sigma _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) such that for every \(\varepsilon \in (0,\infty )\)

  1. (i)

    \({\mathcal {L}}(\sigma _{\varepsilon })\le {\left\{ \begin{array}{ll}\tfrac{1}{2}\left| {\log _2(\varepsilon )}\right| +1 &{} :\varepsilon <1\\ 1 &{} :\varepsilon \ge 1\end{array}\right. }\),

  2. (ii)

    \({\mathcal {M}}(\sigma _{\varepsilon })\le {\left\{ \begin{array}{ll}15(\tfrac{1}{2}\left| {\log _2(\varepsilon )}\right| +1) &{} :\varepsilon < 1\\ 0 &{} :\varepsilon \ge 1\end{array}\right. }\),

  3. (iii)

    \(\sup _{t\in [0,1]}\left| {t^2-\left[ {R_{\varrho }(\sigma _{\varepsilon })}\right] \!(t)}\right| \le \varepsilon \),

  4. (iv)

    \([R_{\varrho }(\sigma _{\varepsilon })]\!(0) = 0\).
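
The construction behind Lemma 6.1 (carried out in the appendix) can be illustrated numerically: composing the sawtooth generator of [40] with itself and forming a weighted sum, as in [44], yields the piecewise linear interpolant of \(x^2\), whose uniform error decays like \(4^{-(m+1)}\) in the number m of compositions; this is consistent with the depth bound in item (i). The sketch below is illustrative only and does not reproduce the exact networks \(\sigma _{\varepsilon }\).

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def hat(t):
    """Sawtooth generator as a small ReLU expression: 2*relu(t) - 4*relu(t - 1/2) + 2*relu(t - 1)."""
    return 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def square_approx(t, m):
    """t minus a weighted sum of m iterated sawtooth functions equals the piecewise linear
    interpolant of t^2 on a grid of width 2**(-m); note square_approx(0, m) == 0, cf. item (iv)."""
    out, g = t, t
    for s in range(1, m + 1):
        g = hat(g)               # s-fold composition: a sawtooth with 2**(s-1) teeth
        out = out - g / 4**s
    return out

t = np.linspace(0.0, 1.0, 1001)
for m in [2, 4, 6]:
    print(m, np.max(np.abs(t**2 - square_approx(t, m))))   # observed error is about 4**(-m-1)
```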

We can now derive the following result on approximate multiplication by neural networks, by observing that \(xy=2B^2\left( {\left| {\tfrac{x+y}{2B}}\right| ^2-\left| {\tfrac{x}{2B}}\right| ^2-\left| {\tfrac{y}{2B}}\right| ^2}\right) \) for every \(B\in (0,\infty )\), \(x,y\in {\mathbb {R}}\).
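
Indeed, expanding the squares gives

$$\begin{aligned} 2B^2\left( {\left| {\tfrac{x+y}{2B}}\right| ^2-\left| {\tfrac{x}{2B}}\right| ^2-\left| {\tfrac{y}{2B}}\right| ^2}\right) =2B^2\,\frac{(x+y)^2-x^2-y^2}{4B^2}=\frac{2xy}{2}=xy, \end{aligned}$$

so an approximate multiplication on \([-B,B]^2\) can be assembled from three approximate squaring networks applied to arguments in [0, 1].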

Lemma 6.2

Assume Setting 5.1, let \(B\in (0,\infty )\) and let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\). Then there exist neural networks \((\mu _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) which satisfy for every \(\varepsilon \in (0,\infty )\) that

  1. (i)

    \({\mathcal {L}}(\mu _{\varepsilon })\le {\left\{ \begin{array}{ll}\tfrac{1}{2}\log _2(\tfrac{1}{\varepsilon })+\log _2(B)+6 &{} :\varepsilon < B^2\\ 1 &{} :\varepsilon \ge B^2\end{array}\right. }\),

  2. (ii)

    \({\mathcal {M}}(\mu _{\varepsilon })\le {\left\{ \begin{array}{ll}45\log _2(\tfrac{1}{\varepsilon })+90\log _2(B)+259 &{} :\varepsilon < B^2\\ 0 &{} :\varepsilon \ge B^2\end{array}\right. }\),

  3. (iii)

    \(\sup _{(x,y)\in [-B,B]^2}\left| {xy-\left[ {R_{\varrho }(\mu _{\varepsilon })}\right] \!(x,y)}\right| \le \varepsilon \),

  4. (iv)

    \({\mathcal {M}}_1(\mu _{\varepsilon })=8,\ {\mathcal {M}}_{{\mathcal {L}}(\mu _{\varepsilon })}(\mu _{\varepsilon })=3\) and

  5. (v)

    for every \(x\in {\mathbb {R}}\) it holds that \(R_\varrho [\mu _{\varepsilon }](0,x) = R_\varrho [\mu _{\varepsilon }](x,0)=0\).
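
A numerical illustration of Lemma 6.2 along these lines, reusing the squaring sketch after Lemma 6.1; the code is ours and is not the construction of the lemma itself, but it realizes the same identity:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
hat = lambda t: 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def square_approx(t, m):
    """ReLU approximation of t^2 on [0, 1] (see the sketch after Lemma 6.1)."""
    out, g = t, t
    for s in range(1, m + 1):
        g = hat(g)
        out = out - g / 4**s
    return out

def mult_approx(x, y, m, B=1.0):
    """Approximate x*y on [-B, B]^2 via xy = 2B^2(|(x+y)/(2B)|^2 - |x/(2B)|^2 - |y/(2B)|^2);
    each square is replaced by square_approx, and |t| = relu(t) + relu(-t) is itself a ReLU expression."""
    sq = lambda t: square_approx(relu(t) + relu(-t), m)
    return 2 * B**2 * (sq((x + y) / (2 * B)) - sq(x / (2 * B)) - sq(y / (2 * B)))

B, m = 2.0, 8
x = np.linspace(-B, B, 201)
X, Y = np.meshgrid(x, x)
print(np.max(np.abs(X * Y - mult_approx(X, Y, m, B))))           # uniform error of order B**2 * 4**(-m)
print(mult_approx(0.0, 1.7, m, B), mult_approx(1.7, 0.0, m, B))  # both 0, as in item (v)
```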

Next we extend this result to products of any number of factors by hierarchical, pairwise multiplication.

Theorem 6.3

Assume Setting 5.1, let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\), let \(m\in {\mathbb {N}}\cap [2,\infty )\) and let \(B\in [1,\infty )\). Then there exists a constant \(C\in {\mathbb {R}}\) (which is independent of m, B) and neural networks \({(\Pi _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}}\) which satisfy

  1. (i)

    \({\mathcal {L}}(\Pi _{\varepsilon })\le C\ln (m)\left( {\left| {\ln (\varepsilon )}\right| +m\ln (B)+\ln (m)}\right) \),

  2. (ii)

    \({\mathcal {M}}(\Pi _{\varepsilon })\le C m\left( {\left| {\ln (\varepsilon )}\right| +m\ln (B)+\ln (m)}\right) \),

  3. (iii)

    \(\displaystyle \sup _{x\in [-B,B]^m}\left| {\left[ {\prod _{j=1}^m x_j}\right] -\left[ {R_{\varrho }(\Pi _{\varepsilon })}\right] \!(x)}\right| \le \varepsilon \) and

  4. (iv)

    \(R_\varrho \left[ \Pi _{\varepsilon }\right] (x_1,x_2,\dots ,x_m)=0\), if there exists \(i\in \{1,2,\dots ,m\}\) with \(x_i=0\).

Proof of Theorem 6.3

Throughout this proof, assume Setting 5.2, let \(l=\lceil \log _2 m\rceil \), let \(\theta \in {\mathcal {N}}^{1,1}_1\) be the neural network given by \(\theta =(0,0)\) and let \((A,b)\in {\mathbb {R}}^{2^l\times m}\times {\mathbb {R}}^{2^l}\) be the matrix-vector tuple given by

$$\begin{aligned} A_{i,j}={\left\{ \begin{array}{ll}1 &{} :i=j, j\le m \\ 0 &{} :\mathrm {else} \end{array}\right. }\quad \mathrm {and}\quad b_i={\left\{ \begin{array}{ll} 0 &{} :i\le m\\ 1 &{} :i>m\end{array}\right. }. \end{aligned}$$
(6.1)

Let further \(\omega \in {\mathcal {N}}^{m,2^l}_1\) be the neural network given by \(\omega =((A,b))\). Note that Lemma 6.2 (with \(B^m\) as B in the notation of Lemma 6.2) ensures that there exist neural networks \((\mu _{\eta })_{\eta \in (0,\infty )}\subseteq {\mathfrak {N}}\) such that for every \(\eta \in (0,\left[ {B^m}\right] ^2)\) it holds

  1. (A)

    \({\mathcal {L}}(\mu _{\eta })\le \tfrac{1}{2}\log _2(\tfrac{1}{\eta })+\log _2(B^m)+6\),

  2. (B)

    \({\mathcal {M}}(\mu _{\eta })\le 45\log _2(\tfrac{1}{\eta })+90\log _2(B^m)+259\),

  3. (C)

    \(\displaystyle \sup _{x,y\in [-B^m,B^m]}\left| {xy-\left[ {R_{\varrho }(\mu _{\eta })}\right] \!(x,y)}\right| \le \eta \),

  4. (D)

    \({\mathcal {M}}_1(\mu _{\eta })=8,\ {\mathcal {M}}_{{\mathcal {L}}(\mu _{\eta })}(\mu _{\eta })=3\) and

  5. (E)

    for every \(x\in {\mathbb {R}}\) it holds that \(R_\varrho [\mu _{\eta }](0,x) = R_\varrho [\mu _{\eta }](x,0)=0\).

Let \((\nu _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) be the neural networks which satisfy for every \(\varepsilon \in (0,\infty )\)

$$\begin{aligned} \nu _{\varepsilon }=\mu _{m^{-2}B^{-2m}\varepsilon }. \end{aligned}$$
(6.2)

Observe that (A) implies that for every \(\varepsilon \in (0,B^m)\subseteq (0,m^2 B^{4m})\) it holds

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\nu _{\varepsilon })&\le \tfrac{1}{2}\log _2\left( \tfrac{1}{m^{-2}B^{-2m}\varepsilon }\right) +\log _2(B^m)+6\\&=\tfrac{1}{2}\left( \log _2\left( \tfrac{1}{\varepsilon }\right) +2\log _2(m)+2m\log _2(B)\right) +m\log _2(B)+6\\&=\tfrac{1}{2}\log _2\left( \tfrac{1}{\varepsilon }\right) +2m\log _2(B)+\log _2(m)+6. \end{aligned} \end{aligned}$$
(6.3)

In addition, note that (B) implies that for every \(\varepsilon \in (0,B^m)\subseteq (0,m^2 B^{4m})\)

$$\begin{aligned} \begin{aligned} {\mathcal {M}}(\nu _{\varepsilon })&\le 45\log _2\left( \tfrac{1}{m^{-2}B^{-2m}\varepsilon }\right) +90\log _2(B^m)+259\\&=45\log _2\left( \tfrac{1}{\varepsilon }\right) +180m\log _2(B)+90\log _2(m)+259. \end{aligned}\end{aligned}$$
(6.4)

Furthermore, (C) implies that for every \(\varepsilon \in (0,B^m)\subseteq (0,m^2 B^{4m})\) holds

$$\begin{aligned} \sup _{x,y\in [-B^m,B^m]}\left| {xy-\left[ {R_{\varrho }(\nu _{\varepsilon })}\right] \!(x,y)}\right| \le m^{-2}B^{-2m}\varepsilon . \end{aligned}$$
(6.5)

Let \(\pi _{k,\varepsilon }\in {\mathfrak {N}}\), \(\varepsilon \in (0,\infty )\), \(k\in {\mathbb {N}}\), be the neural networks which satisfy for every \(\varepsilon \in (0,\infty )\), \(k\in {\mathbb {N}}\)

$$\begin{aligned} \pi _{k,\varepsilon }={\left\{ \begin{array}{ll}\nu _{\varepsilon }&{} :k=1\\ \nu _{\varepsilon }\odot {\mathcal {P}}(\pi _{k-1,\varepsilon },\pi _{k-1,\varepsilon }) &{} :k>1 \end{array}\right. } \end{aligned}$$
(6.6)

and let \((\Pi _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) be neural networks given by

$$\begin{aligned} \Pi _{\varepsilon }={\left\{ \begin{array}{ll}\pi _{l,\varepsilon }\odot \omega &{} :\varepsilon <B^m \\ \theta &{} :\varepsilon \ge B^m \end{array}\right. }. \end{aligned}$$
(6.7)

Note that for every \(\varepsilon \in [B^m,\infty )\) it holds

$$\begin{aligned} \begin{aligned} \sup _{x\in [-B,B]^m}\left| {\left[ {\textstyle \prod \limits _{j=1}^m x_j}\right] -\left[ {R_{\varrho }(\Pi _{\varepsilon })}\right] \!(x)}\right|&=\sup _{x\in [-B,B]^m}\left| {\left[ {\textstyle \prod \limits _{j=1}^m x_j}\right] -\left[ {R_{\varrho }(\theta )}\right] \!(x)}\right| \\&=\sup _{x\in [-B,B]^m}\left| {\left[ {\textstyle \prod \limits _{j=1}^m x_j}\right] -0}\right| =B^m\le \varepsilon . \end{aligned}\end{aligned}$$
(6.8)

We claim that for every \(k\in \{1,2,\dots ,{l}\}\), \(\varepsilon \in (0,B^m)\) it holds

  1. (a)

    that

    $$\begin{aligned} \sup _{x\in [-B,B]^{(2^k)}}\left| {\left[ {\textstyle \prod \limits _{j=1}^{2^k}x_j}\right] -[R_{\varrho }(\pi _{k,\varepsilon })](x)}\right| \le 4^{k-1} m^{-2} B^{(2^k-2m)}\varepsilon , \end{aligned}$$
    (6.9)
  2. (b)

    that \({\mathcal {L}}(\pi _{k,\varepsilon })\le k{\mathcal {L}}(\nu _{\varepsilon })\) and

  3. (c)

    that \({\mathcal {M}}(\pi _{k,\varepsilon })\le (2^k-1){\mathcal {M}}(\nu _{\varepsilon })+(2^{k-1}-1)20\).

We prove (a), (b) and (c) by induction on \(k\in \{1,2,\dots ,{l}\}\). Observe that (6.5) and the fact that \(B\in [1,\infty )\) establish (a) for \(k=1\). Moreover, note that (6.6) establishes (b) and (c) in the base case \(k=1\).

For the induction step \(\{1,2,\dots ,{l-1}\}\ni k\rightarrow k+1\in \{2,3,\dots ,l\}\) note that Lemma 5.3, Lemma 5.4, (6.5) and (6.6) imply that for every \(k\in \{1,2,\dots ,{l-1}\}\), \(\varepsilon \in (0,B^m)\)

$$\begin{aligned} \begin{aligned}&\sup _{x\in [-B,B]^{(2^{k+1})}}\left| {\left[ {\prod _{j=1}^{2^{k+1}}x_j}\right] -[R_{\varrho }(\pi _{k+1,\varepsilon })](x)}\right| \\&\quad =\sup _{x,x'\in [-B,B]^{(2^k)}}\left| {\left[ {\prod _{j=1}^{2^k}x_j}\right] \!\!\left[ {\prod _{j=1}^{2^k}x'_j}\right] -[R_{\varrho }(\pi _{k+1,\varepsilon })]\left( {(x,x')}\right) }\right| \\&\quad =\sup _{x,x'\in [-B,B]^{(2^k)}}\left| {\left[ {\prod _{j=1}^{2^k}x_j}\right] \!\!\left[ {\prod _{j=1}^{2^k}x'_j}\right] -[R_{\varrho }(\nu _{\varepsilon })]\left( {[R_{\varrho }(\pi _{k,\varepsilon })](x),[R_{\varrho }(\pi _{k,\varepsilon })](x')}\right) }\right| \\&\quad \le \sup _{x,x'\in [-B,B]^{(2^k)}}\left| {\left[ {\prod _{j=1}^{2^k}x_j}\right] \!\!\left[ {\prod _{j=1}^{2^k}x'_j}\right] -\left( {[R_{\varrho }(\pi _{k,\varepsilon })](x)}\right) \left( {[R_{\varrho }(\pi _{k,\varepsilon })](x')}\right) }\right| \\&\qquad \,+\!\!\!\!\!\!\sup _{x,x'\in [-B,B]^{(2^k)}}\left| \left( {[R_{\varrho }(\pi _{k,\varepsilon })](x)}\right) \left( {[R_{\varrho }(\pi _{k,\varepsilon })](x')}\right) \right. \\&\qquad \left. -[R_{\varrho }(\nu _{\varepsilon })]\left( {[R_{\varrho }(\pi _{k,\varepsilon })](x),[R_{\varrho }(\pi _{k,\varepsilon })](x')}\right) \right| \\&\quad \le \sup _{x,x'\in [-B,B]^{(2^k)}}\left| {\left[ {\prod _{j=1}^{2^k}x_j}\right] \!\!\left[ {\prod _{j=1}^{2^k}x'_j}\right] -\left( {[R_{\varrho }(\pi _{k,\varepsilon })](x)}\right) \left( {[R_{\varrho }(\pi _{k,\varepsilon })](x')}\right) }\right| \\&\qquad + m^{-2}B^{-2m}\varepsilon . \end{aligned} \end{aligned}$$
(6.10)

Next, for every \(c,\delta \in (0,\infty )\), \(y,z\in [-c,c]\), \({\tilde{y}},{\tilde{z}}\in {\mathbb {R}}\) with \(\left| {y-{\tilde{y}}}\right| , \left| {z-{\tilde{z}}}\right| \le \delta \) it holds

$$\begin{aligned} \left| {yz-{\tilde{y}}{\tilde{z}}}\right| \le (\left| {y}\right| +\left| {z}\right| )\delta + \delta ^2\le 2c\delta +\delta ^2. \end{aligned}$$
(6.11)

Moreover, for every \(k\in \{1,2,\dots ,{l}\}\)

$$\begin{aligned} 4^{k-1}\le 4^{l-1}=4^{\lceil \log _2 m\rceil -1}\le 4^{\log _2 m}=m^2. \end{aligned}$$
(6.12)

The fact that \(B\in [1,\infty )\) therefore ensures that for every \(k\in \{1,2,\dots ,{l-1}\}\), \(\varepsilon \in (0,B^m)\)

$$\begin{aligned} \begin{aligned} \left[ {4^{k-1} m^{-2} B^{(2^k-2m)}\varepsilon }\right] ^2 =&\left[ {4^{k-1} m^{-2} B^{(2^{k+1}-2m)}\varepsilon }\right] \left[ {4^{k-1} m^{-2} B^{-2m}\varepsilon }\right] \\ \le&\left[ {4^{k-1} m^{-2} B^{(2^{k+1}-2m)}\varepsilon }\right] . \end{aligned}\end{aligned}$$
(6.13)

This and (6.11) imply that for every \(k\in \{1,2,\dots ,{l-1}\}\), \(\varepsilon \in (0,B^m)\), \(x,x'\in [-B,B]^{(2^k)}\)

$$\begin{aligned} \begin{aligned}&\left| {\left[ {\textstyle \prod \limits _{j=1}^{2^k}x_j}\right] \!\!\left[ {\textstyle \prod \limits _{j=1}^{2^k}x'_j}\right] -\left( {[R_{\varrho }(\pi _{k,\varepsilon })](x)}\right) \left( {[R_{\varrho }(\pi _{k,\varepsilon })](x')}\right) }\right| \\&\quad \le 2B^{(2^k)}4^{k-1} m^{-2} B^{(2^k-2m)}\varepsilon +\left[ {4^{k-1} m^{-2} B^{(2^k-2m)}\varepsilon }\right] ^2\\&\quad \le 3\left[ {4^{k-1} m^{-2} B^{(2^{k+1}-2m)}\varepsilon }\right] . \end{aligned}\end{aligned}$$
(6.14)

Combining this, (6.10) and the fact that \(B\in [1,\infty )\) demonstrates that for every \(k\in \{1,2,\dots ,{l-1}\}\), \(\varepsilon \in (0,B^m)\)

$$\begin{aligned} \begin{aligned}&\sup _{x\in [-B,B]^{(2^{k+1})}}\left| {\left[ {\textstyle \prod \limits _{j=1}^{2^{k+1}}x_j}\right] -[R_{\varrho }(\pi _{k+1,\varepsilon })](x)}\right| \\&\quad \le 3\left[ {4^{k-1} m^{-2} B^{(2^{k+1}-2m)}\varepsilon }\right] +m^{-2}B^{-2m}\varepsilon \\&\quad \le 4^k m^{-2} B^{(2^{k+1}-2m)}\varepsilon . \end{aligned}\end{aligned}$$
(6.15)

This establishes the claim (a). Moreover, Lemma 5.3 and Lemma 5.4 imply that for every \(k\in \{1,2,\dots ,{l-1}\}\), \(\varepsilon \in (0,B^m)\) with \({\mathcal {L}}(\pi _{k,\varepsilon })\le k{\mathcal {L}}(\nu _{\varepsilon })\) it holds that

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\pi _{k+1,\varepsilon })&={\mathcal {L}}(\nu _{\varepsilon })+\max \{{\mathcal {L}}(\pi _{k,\varepsilon }),{\mathcal {L}}(\pi _{k,\varepsilon })\}\\&\le {\mathcal {L}}(\nu _{\varepsilon }) + k{\mathcal {L}}(\nu _{\varepsilon })=(k+1){\mathcal {L}}(\nu _{\varepsilon }). \end{aligned}\end{aligned}$$
(6.16)

This establishes the claim (b). Furthermore, Lemma 5.3, Lemma 5.4, (B) and (D) imply that for every \(k\in \{1,2,\dots ,{l-1}\}\), \(\varepsilon \in (0,B^m)\) with \({\mathcal {M}}(\pi _{k,\varepsilon })\le (2^k-1){\mathcal {M}}(\nu _{\varepsilon })+(2^{k-1}-1)20\) it holds that

$$\begin{aligned} \begin{aligned} {\mathcal {M}}(\pi _{k+1,\varepsilon })&\le {\mathcal {M}}(\nu _{\varepsilon })+({\mathcal {M}}(\pi _{k,\varepsilon })+{\mathcal {M}}(\pi _{k,\varepsilon }))+{\mathcal {M}}_1(\nu _{\varepsilon })+{\mathcal {M}}_{{\mathcal {L}}({\mathcal {P}}(\pi _{k,\varepsilon },\pi _{k,\varepsilon }))}({\mathcal {P}}(\pi _{k,\varepsilon },\pi _{k,\varepsilon })) \\&\le {\mathcal {M}}(\nu _{\varepsilon })+2{\mathcal {M}}(\pi _{k,\varepsilon })+14+2{\mathcal {M}}_{{\mathcal {L}}(\nu _{\varepsilon })}(\nu _{\varepsilon })\le {\mathcal {M}}(\nu _{\varepsilon })+2{\mathcal {M}}(\pi _{k,\varepsilon })+20\\&\le {\mathcal {M}}(\nu _{\varepsilon })+2((2^k-1){\mathcal {M}}(\nu _{\varepsilon })+(2^{k-1}-1)20)+20\\&=(2^{k+1}-1){\mathcal {M}}(\nu _{\varepsilon })+(2^k-1)20. \end{aligned}\end{aligned}$$
(6.17)

This establishes the claim (c).

Combining (a) with Lemma 5.3 and (6.7) implies for every \(\varepsilon \in (0,B^m)\) the bound

$$\begin{aligned} \begin{aligned} \sup _{x\in [-B,B]^m}\left| {\left[ {\prod _{j=1}^m x_j}\right] -\left[ {R_{\varrho }(\Pi _{\varepsilon })}\right] \!(x)}\right|&\le \!\!\!\!\!\sup _{x\in [-B,B]^{(2^l)}}\!\left| {\left[ {\prod _{j=1}^{2^l} x_j}\right] -\left[ {R_{\varrho }(\pi _{l,\varepsilon })}\right] \!(x)}\right| \\&\le 4^{l-1} m^{-2} B^{(2^l-2m)}\varepsilon \\&\le 4^{\lceil \log _2(m)\rceil -1} m^{-2} B^{(2^{\lceil \log _2(m)\rceil }-2m)}\varepsilon \\&\le 4^{\log _2(m)} m^{-2} B^{(2^{\log _2(m)+1}-2m)}\varepsilon \\&\le \left[ {2^{\log _2(m)}}\right] ^2 m^{-2}B^{(2m-2m)}\varepsilon \le \varepsilon . \end{aligned}\end{aligned}$$
(6.18)

This and (6.8) establish that the neural networks \((\Pi _{\varepsilon })_{\varepsilon \in (0,\infty )}\) satisfy (iii). Combining (b) with Lemma 5.3, (6.3) and (6.7) ensures that for every \(\varepsilon \in (0,B^m)\)

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\Pi _{\varepsilon })&={\mathcal {L}}(\pi _{l,\varepsilon })+{\mathcal {L}}(\omega )\le l{\mathcal {L}}(\nu _{\varepsilon })+1\le (\log _2(m)+1){\mathcal {L}}(\nu _{\varepsilon })+1\\&\le \log _2(m)\log _2\left( \tfrac{1}{\varepsilon }\right) +4\log _2(m)m\log _2(B)+2[\log _2(m)]^2+12\log _2(m)+1 \end{aligned}\end{aligned}$$
(6.19)

and that for every \(\varepsilon \in [B^m,\infty )\) it holds that \({\mathcal {L}}(\Pi _{\varepsilon })={\mathcal {L}}(\theta )=1\). This establishes that the neural networks \((\Pi _{\varepsilon })_{\varepsilon \in (0,\infty )}\) satisfy (i). Furthermore, note that (c), Lemma 5.3, (6.4) and (6.7) demonstrate that for every \(\varepsilon \in (0,B^m)\)

$$\begin{aligned} \begin{aligned} {\mathcal {M}}(\Pi _{\varepsilon })&\le 2({\mathcal {M}}(\pi _{l,\varepsilon }) + {\mathcal {M}}(\omega ))\le 2\left[ {(2^l-1){\mathcal {M}}(\nu _{\varepsilon })+(2^{l-1}-1)20}\right] +4m\\&\le 2^{l+1}{\mathcal {M}}(\nu _{\varepsilon })+(2^l)20+4m\le 4m{\mathcal {M}}(\nu _{\varepsilon })+44m\\&\le 180m\log _2\left( \tfrac{1}{\varepsilon }\right) +720m^2\log _2(B)+360m\log _2(m)+1080m \end{aligned}\end{aligned}$$
(6.20)

and that for every \(\varepsilon \in [B^m,\infty )\) it holds that \({\mathcal {M}}(\Pi _{\varepsilon })={\mathcal {M}}(\theta )=0\). This establishes that the neural networks \((\Pi _{\varepsilon })_{\varepsilon \in (0,\infty )}\) satisfy (ii). Note that (iv) follows from (E) by construction. The proof of Theorem 6.3 is thus completed. \(\square \)
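
Stripped of the network formalism, the construction in (6.1), (6.6) and (6.7) evaluates the product by a balanced binary tree: the m inputs are padded with ones up to the next power of two (the role of \(\omega \)) and then multiplied pairwise \(l=\lceil \log _2 m\rceil \) times. A plain Python sketch, with exact multiplication standing in for the networks \(\nu _{\varepsilon }\):

```python
from math import ceil, log2

def product_pairwise(xs, mult=lambda a, b: a * b):
    """Balanced binary-tree product: pad with ones to the next power of two, then
    multiply pairwise ceil(log2(m)) times; 'mult' may be replaced by an approximate
    pairwise multiplication (e.g., the mult_approx sketch after Lemma 6.2)."""
    m = len(xs)
    l = ceil(log2(m)) if m > 1 else 0
    vals = list(xs) + [1.0] * (2**l - m)
    for _ in range(l):
        vals = [mult(vals[2 * i], vals[2 * i + 1]) for i in range(len(vals) // 2)]
    return vals[0]

print(product_pairwise([1.5, -0.5, 2.0]))   # -1.5; the tree has depth ceil(log2(3)) = 2
```

The \(\ln (m)\) factor in (i) and the m factor in (ii) correspond, respectively, to the depth and the number of nodes of this tree.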

With the above established, it is quite straightforward to obtain the following result on the approximation of tensor products. Note that the exponential term \(B^{m-1}\) in (iii) is unavoidable, as it results from multiplying m inaccurate values of magnitude up to B. For our purposes this will not be an issue, since the functions we consider are bounded in absolute value by \(B=1\). It is also not an issue in cases where the \(h_j\) can be approximated by networks whose size scales only logarithmically with \(\varepsilon \).

Proposition 6.4

Assume Setting 5.2, let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\), let \(B\in [1,\infty )\), \(m\in {\mathbb {N}}\), for every \(j\in \{1,2,\dots ,{m}\}\) let \(d_j\in {\mathbb {N}}\), \(\Omega _j\subseteq {\mathbb {R}}^{d_j}\), and \(h_j:\Omega _j\rightarrow [-B,B]\), let \((\Phi ^j_{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\), \(j\in \{1,2,\dots ,{m}\}\), be neural networks which satisfy for every \(\varepsilon \in (0,\infty )\), \(j\in \{1,2,\dots ,{m}\}\)

$$\begin{aligned} \sup _{x\in \Omega _j}\left| {h_j(x)-\left[ {R_{\varrho }(\Phi ^j_{\varepsilon })}\right] (x)}\right| \le \varepsilon , \end{aligned}$$
(6.21)

let \(\Phi ^{{\mathcal {P}}}_{\varepsilon }\in {\mathfrak {N}}\), \(\varepsilon \in (0,\infty )\) be given by \(\Phi ^{{\mathcal {P}}}_{\varepsilon }={\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon },\dots ,\Phi ^m_{\varepsilon })\), and let \(L_{\varepsilon }\in {\mathbb {N}}\), \(\varepsilon \in (0,\infty )\) be given by \(L_{\varepsilon }=\max _{j\in \{1,2,\dots ,{m}\}}{\mathcal {L}}(\Phi ^j_{\varepsilon })\).

Then there exists a constant \(C\in {\mathbb {R}}\) (which is independent of \(m,B,\varepsilon \)) and neural networks \((\Psi _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) which satisfy

  1. (i)

    \({\mathcal {L}}(\Psi _{\varepsilon })\le C\ln (m)\left( {\left| {\ln (\varepsilon )}\right| +m\ln (B)+\ln (m)}\right) +L_{\varepsilon }\),

  2. (ii)

    \({\mathcal {M}}(\Psi _{\varepsilon }) \le C m\left( {\left| {\ln (\varepsilon )}\right| +m\ln (B)+\ln (m)}\right) +{\mathcal {M}}(\Phi ^{{\mathcal {P}}}_{\varepsilon })+{\mathcal {M}}_{L_{\varepsilon }}(\Phi ^{{\mathcal {P}}}_{\varepsilon })\) and

  3. (iii)

    \(\displaystyle \sup _{t=(t_1,t_2,\dots ,t_m)\in \times _{j=1}^m \Omega _j}\left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] -\left[ {R_{\varrho }(\Psi _{\varepsilon })}\right] \!(t)}\right| \le 3mB^{m-1}\varepsilon .\)

Proof of Proposition 6.4

In the case of \(m=1\), the neural networks \((\Phi ^1_{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) satisfy (i), (ii) and (iii) by assumption. Throughout the remainder of this proof, assume \(m\ge 2\), and let \(\theta \in {\mathcal {N}}^{1,1}_1\) denote the trivial neural network \(\theta =(0,0)\). Observe that Theorem 6.3 (with \(\varepsilon \leftrightarrow \eta \), \(C'\leftrightarrow C\) in the notation of Theorem 6.3) ensures that there exist \(C'\in {\mathbb {R}}\) and neural networks \((\Pi _{\eta })_{\eta \in (0,\infty )}\subseteq {\mathfrak {N}}\) which satisfy for every \(\eta \in (0,\infty )\) that

  1. (a)

    \({\mathcal {L}}(\Pi _{\eta })\le C'\ln (m)\left( {\left| {\ln (\eta )}\right| +m\ln (B)+\ln (m)}\right) \),

  2. (b)

    \({\mathcal {M}}(\Pi _{\eta })\le C' m\left( {\left| {\ln (\eta )}\right| +m\ln (B)+\ln (m)}\right) \) and

  3. (c)

    \(\displaystyle \sup _{x\in [-B,B]^m}\left| {\left[ {\prod _{j=1}^m x_j}\right] -\left[ {R_{\varrho }(\Pi _{\eta })}\right] \!(x)}\right| \le \eta \).

Let \((\Psi _{\varepsilon })_{\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) be the neural networks which satisfy for every \(\varepsilon \in (0,\infty )\) that

$$\begin{aligned} \Psi _{\varepsilon }={\left\{ \begin{array}{ll}\Pi _{\varepsilon }\odot {\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon }, \dots , \Phi ^m_{\varepsilon }) &{} :\varepsilon <\tfrac{B}{2m}\\ \theta &{} :\varepsilon \ge \tfrac{B}{2m}\end{array}\right. }. \end{aligned}$$
(6.22)

Note that for every \(\varepsilon \in (0,\tfrac{B}{2m})\)

$$\begin{aligned} \begin{aligned} \max _{\overset{x\in [-B,B]^m,x'\in {\mathbb {R}}^m}{\left\Vert x'-x\right\Vert _{\infty }\le \varepsilon }}\left| {\prod _{j=1}^m x'_j-\prod _{j=1}^m x_j}\right|&=(B+\varepsilon )^m-B^m=\sum _{k=1}^m \left( {\begin{array}{c}m\\ k\end{array}}\right) B^{m-k}\varepsilon ^k\le \varepsilon \sum _{k=1}^m\frac{m^k}{k!}B^{m-k}\varepsilon ^{k-1}\\&\le \varepsilon \sum _{k=1}^m\frac{m^k}{k!}B^{m-k}\left( {\frac{B}{2m}}\right) ^{k-1}=mB^{m-1}\varepsilon \sum _{k=1}^m \frac{1}{2^{k-1}k!}\\&\le 2mB^{m-1}\varepsilon . \end{aligned}\end{aligned}$$
(6.23)

Combining this with Lemma 5.3, Lemma 5.4, (6.21) and (c) implies that for every \(\varepsilon \in (0,\tfrac{B}{2m})\), \(t=(t_1,t_2,\dots ,t_m)\in \times _{j=1}^m \Omega _j\) it holds

$$\begin{aligned} \begin{aligned} \left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] -\left[ {R_{\varrho }(\Psi _{\varepsilon })}\right] \!(t)}\right|&=\left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] -\left[ {R_{\varrho }(\Pi _{\varepsilon }\odot {\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon },\dots ,\Phi ^m_{\varepsilon }))}\right] \!(t)}\right| \\&\le \left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] -\left[ {\textstyle \prod \limits _{j=1}^m \left[ {R_{\varrho }(\Phi ^j_{\varepsilon })}\right] (t_j)}\right] }\right| \\&\quad +\left| {\left[ {\textstyle \prod \limits _{j=1}^m \left[ {R_{\varrho }(\Phi ^j_{\varepsilon })}\right] (t_j)}\right] -\left[ {R_{\varrho }(\Pi _{\varepsilon })}\right] \left( {[R_{\varrho }(\Phi ^1_{\varepsilon })](t_1),\dots ,[R_{\varrho }(\Phi ^m_{\varepsilon })](t_m)}\right) }\right| \\&\le 2mB^{m-1}\varepsilon +\varepsilon \le 3mB^{m-1}\varepsilon . \end{aligned}\end{aligned}$$
(6.24)

Moreover, for every \(\varepsilon \in [\tfrac{B}{2m},\infty )\), \(t=(t_1,t_2,\dots ,t_m)\in \times _{j=1}^m \Omega _j\) it holds that

$$\begin{aligned} \begin{aligned} \left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] -\left[ {R_{\varrho }(\Psi _{\varepsilon })}\right] \!(t)}\right|&=\left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] -\left[ {R_{\varrho }(\theta )}\right] \!(t)}\right| \\&=\left| {\left[ {\textstyle \prod \limits _{j=1}^m h_j(t_j)}\right] }\right| \le B^m\le 2mB^{m-1}\varepsilon . \end{aligned}\end{aligned}$$
(6.25)

This and (6.24) establish that the neural networks \((\Psi _{\varepsilon })_{\varepsilon \in (0,\infty )}\) satisfy (iii). Next observe that Lemma 5.3, Lemma 5.4 and (a) demonstrate that for every \(\varepsilon \in (0,\tfrac{B}{2m})\)

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\Psi _{\varepsilon })&={\mathcal {L}}(\Pi _{\varepsilon }\odot {\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon }, \dots , \Phi ^m_{\varepsilon }))={\mathcal {L}}(\Pi _{\varepsilon })+\max _{j\in \{1,2,\dots ,{m}\}}{\mathcal {L}}(\Phi ^j_{\varepsilon })\\&\le C'\ln (m)\left( {\left| {\ln (\varepsilon )}\right| +m\ln (B)+\ln (m)}\right) +L_{\varepsilon }. \end{aligned}\end{aligned}$$
(6.26)

This and the fact that for every \(\varepsilon \in [\tfrac{B}{2m},\infty )\) it holds that \({\mathcal {L}}(\Psi _{\varepsilon })={\mathcal {L}}(\theta )=1\) establish that the neural networks \((\Psi _{\varepsilon })_{\varepsilon \in (0,\infty )}\) satisfy (i). Furthermore, note that Lemma 5.3, Lemma 5.4 and (b) ensure that for every \(\varepsilon \in (0,\tfrac{B}{2m})\)

$$\begin{aligned} \begin{aligned} {\mathcal {M}}(\Psi _{\varepsilon })&={\mathcal {M}}(\Pi _{\varepsilon }\odot {\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon }, \dots , \Phi ^m_{\varepsilon }))\\&\le 2{\mathcal {M}}(\Pi _{\varepsilon })+{\mathcal {M}}({\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon }, \dots , \Phi ^m_{\varepsilon }))\\&\quad +{\mathcal {M}}_{{\mathcal {L}}({\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon }, \dots , \Phi ^m_{\varepsilon }))}({\mathcal {P}}(\Phi ^1_{\varepsilon },\Phi ^2_{\varepsilon }, \dots , \Phi ^m_{\varepsilon }))\\&\le 2C' m\left( {\left| {\ln (\varepsilon )}\right| +m\ln (B)+\ln (m)}\right) +{\mathcal {M}}(\Phi ^{{\mathcal {P}}}_{\varepsilon })+{\mathcal {M}}_{L_{\varepsilon }}(\Phi ^{{\mathcal {P}}}_{\varepsilon }). \end{aligned}\end{aligned}$$
(6.27)

This and the fact that for every \(\varepsilon \in [\tfrac{B}{2m},\infty )\) it holds that \({\mathcal {M}}(\Psi _{\varepsilon })={\mathcal {M}}(\theta )=0\) imply that the neural networks \((\Psi _{\varepsilon })_{\varepsilon \in (0,\infty )}\) satisfy (ii). The proof of Proposition 6.4 is completed. \(\square \)

Another way to use the multiplication results is to consider the approximation of smooth functions by polynomials. This can be done for functions of arbitrary dimension using the multivariate Taylor expansion (see [44] and [31, Thm. 2.3]). Such a direct approach, however, yields networks whose size depends exponentially on the dimension of the function. As our goal is to show that high-dimensional functions with a tensor product structure can be approximated by networks with only polynomial dependence on the dimension, we only consider univariate smooth functions here. In the appendix, we present a detailed and explicit construction of this Taylor approximation by neural networks. In the following results, we employ an auxiliary parameter r, so that the bounds on the depth and connectivity of the networks may be stated for all \(\varepsilon \in (0,\infty )\). Note that this parameter does not influence the construction of the networks themselves.

Theorem 6.5

Assume Setting 5.1, let \(n\in {\mathbb {N}}\), \(r\in (0,\infty )\), let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\) and let \(B^n_1\subseteq C^n([0,1],{\mathbb {R}})\) be the set given by

$$\begin{aligned} B^n_1=\left\{ f\in C^n([0,1],{\mathbb {R}}):\max _{k\in \{0,1,\dots ,n\}}\left[ {\sup _{t\in [0,1]}\left| {f^{(k)}(t)}\right| }\right] \le 1\right\} . \end{aligned}$$
(6.28)

Then there exist neural networks \((\Phi _{f,\varepsilon })_{f\in B^n_1,\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) which satisfy

  1. (i)

    \(\displaystyle \sup _{f\in B^n_1,\varepsilon \in (0,\infty )}\left[ {\frac{{\mathcal {L}}(\Phi _{f,\varepsilon })}{\max \{r,\left| {\ln (\varepsilon )}\right| \}}}\right] <\infty \),

  2. (ii)

    \(\displaystyle \sup _{f\in B^n_1,\varepsilon \in (0,\infty )}\left[ {\frac{{\mathcal {M}}(\Phi _{f,\varepsilon })}{\varepsilon ^{-\frac{1}{n}}\max \{r,|\ln (\varepsilon )|\}}}\right] <\infty \) and

  3. (iii)

    for every \(f\in B^n_1\), \(\varepsilon \in (0,\infty )\) that

    $$\begin{aligned} \sup _{t\in [0,1]}\left| {f(t)-\left[ {R_{\varrho }(\Phi _{f,\varepsilon })}\right] \!(t)}\right| \le \varepsilon . \end{aligned}$$
    (6.29)
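
The factor \(\varepsilon ^{-\frac{1}{n}}\) in (ii) reflects the number of local Taylor expansions needed for accuracy \(\varepsilon \): for \(f\in B^n_1\), a Taylor polynomial of degree \(n-1\) around the midpoint of an interval of length \(\delta \) has uniform error at most \(\tfrac{1}{n!}(\tfrac{\delta }{2})^n\), so on the order of \(\varepsilon ^{-\frac{1}{n}}\) pieces suffice. The following sketch is a purely numerical illustration of this count and is not the network construction given in the appendix.

```python
import numpy as np
from math import factorial

def piecewise_taylor(f_derivs, n, pieces):
    """Degree-(n-1) Taylor expansion of f around the midpoint of each of 'pieces' equal
    subintervals of [0, 1]; for f in B^n_1 the uniform error is at most (1/n!)*(1/(2*pieces))**n."""
    edges = np.linspace(0.0, 1.0, pieces + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    def approx(t):
        i = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, pieces - 1)
        dt = t - mids[i]
        return sum(f_derivs[k](mids[i]) * dt**k / factorial(k) for k in range(n))
    return approx

# example: f = sin restricted to [0, 1] lies in B^n_1 (all derivatives are bounded by 1)
n, f_derivs = 3, [np.sin, np.cos, lambda t: -np.sin(t)]
t = np.linspace(0.0, 1.0, 2001)
for pieces in [4, 8, 16]:
    print(pieces, np.max(np.abs(np.sin(t) - piecewise_taylor(f_derivs, n, pieces)(t))))
    # the error decays like pieces**(-n), so accuracy eps needs on the order of eps**(-1/n) pieces
```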

For convenience of use, we also provide the following more general corollary.

Corollary 6.6

Assume Setting 5.1, let \(n\in {\mathbb {N}}\), \(r\in (0,\infty )\) and let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\). Let further the set \({\mathcal {C}}^n\) be given by \({\mathcal {C}}^n=\cup _{[a,b]\subseteq {\mathbb {R}}_+}C^n([a,b],{\mathbb {R}})\), and let \(\left\Vert \cdot \right\Vert _{n,\infty }:{\mathcal {C}}^n\rightarrow [0,\infty )\) satisfy for every \([a,b]\subseteq {\mathbb {R}}_+\), \(f\in C^n([a,b],{\mathbb {R}})\)

$$\begin{aligned} \left\Vert f\right\Vert _{n,\infty }=\max _{k\in \{0,1,\dots ,n\}}\left[ {\sup _{t\in [a,b]}\left| {f^{(k)}(t)}\right| }\right] . \end{aligned}$$
(6.30)

Then there exist neural networks \(\left( {\Phi _{f,\varepsilon }}\right) _{f\in {\mathcal {C}}^n,\varepsilon \in (0,\infty )}\subseteq {\mathfrak {N}}\) which satisfy

  1. (i)

    \(\displaystyle \sup _{f\in {\mathcal {C}}^n, \varepsilon \in (0,\infty )}\left[ {\frac{{\mathcal {L}}(\Phi _{f,\varepsilon })}{\max \{r,|\ln (\frac{\varepsilon }{\max \{1,b-a\}\left\Vert f\right\Vert _{n,\infty }})|\}}}\right] <\infty \),

  2. (ii)

    \(\displaystyle \sup _{f\in {\mathcal {C}}^n, \varepsilon \in (0,\infty )}\left[ {\frac{{\mathcal {M}}(\Phi _{f,\varepsilon })}{\max \{1,b-a\}\left\Vert f\right\Vert _{n,\infty }^{\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}\max \{r,|\ln (\frac{\varepsilon }{\max \{1,b-a\}\left\Vert f\right\Vert _{n,\infty }})|\}}}\right] <\infty \) and

  3. (iii)

    for every \([a,b]\subseteq {\mathbb {R}}_+\), \(f\in C^n([a,b],{\mathbb {R}})\), \(\varepsilon \in (0,\infty )\) that

    $$\begin{aligned} \sup _{t\in [a,b]}\left| {f(t)-\left[ {R_{\varrho }(\Phi _{f,\varepsilon })}\right] \!(t)}\right| \le \varepsilon . \end{aligned}$$
    (6.31)

7 DNN Expression Rates for High-Dimensional Basket Prices

Now that we have established a number of general expression rate results, we can apply them to our specific problem. Using the regularity result (3.3), we obtain the following.

Corollary 7.1

Assume Setting 5.1, let \(n\in {\mathbb {N}}\), \(r\in (0,\infty )\), \(a\in (0,\infty )\), \(b\in (a,\infty )\), let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\), let \(f:(0,\infty )\rightarrow {\mathbb {R}}\) be as defined in (3.1) and let \(h_{c,K}:[a,b]\rightarrow {\mathbb {R}}\), \(c\in (0,\infty )\), \(K\in [0,\infty )\), denote the functions which satisfy for every \(c\in (0,\infty )\), \(K\in [0,\infty )\), \(x\in [a,b]\) that

$$\begin{aligned} h_{c,K}(x)=f\left( \tfrac{K+c}{x}\right) . \end{aligned}$$
(7.1)

Then there exist neural networks \(\left( {\Phi _{\varepsilon ,c,K}}\right) _{\varepsilon ,c\,\in (0,\infty ),K\in [0,\infty )}\subseteq {\mathfrak {N}}\) which satisfy

  1. (i)

    \(\displaystyle \sup _{\varepsilon ,c\in (0,\infty ),K\in [0,\infty )}\left[ {\frac{{\mathcal {L}}(\Phi _{\varepsilon ,c,K})}{\max \{r,|\ln (\varepsilon )|\}+\max \{0,\ln (K+c)\}}}\right] <\infty \),

  2. (ii)

    \(\displaystyle \sup _{\varepsilon ,c\,\in (0,\infty ),K\in [0,\infty )}\left[ {\frac{{\mathcal {M}}(\Phi _{\varepsilon ,c,K})}{(K+c+1)^{\frac{1}{n}}\varepsilon ^{-\frac{1}{n^2}}}}\right] <\infty \) and

  3. (iii)

    for every \(\varepsilon ,c\in (0,\infty )\), \(K\in [0,\infty )\) that

    $$\begin{aligned} \sup _{x\in [a,b]}\left| {h_{c,K}(x)-\left[ {R_{\varrho }(\Phi _{\varepsilon ,c,K})}\right] \!(x)}\right| \le \varepsilon . \end{aligned}$$
    (7.2)

Proof of Corollary 7.1

We observe that Corollary 3.3 ensures the existence of a constant \(C\in {\mathbb {R}}\) such that for every \(c\in (0,\infty )\), \(K\in [0,\infty )\) it holds that

$$\begin{aligned} \max _{k\le n}\sup _{x\in [a,b]}\left| {h_{c,K}^{(k)}(x)}\right| \le C\max \{(K+c)^n,1\}. \end{aligned}$$
(7.3)

Moreover, observe for every \(\varepsilon ,c\in (0,\infty )\), \(K\in [0,\infty )\) it holds

$$\begin{aligned} \begin{aligned}&\max \left\{ r,|\ln \left( \tfrac{\varepsilon }{\max \{1,b-a\}C\max \{(K+c)^n,1\}}\right) |\right\} \\&\quad \le \max \{r,\left| {\ln (\varepsilon )}\right| \}+|\ln (\max \{1,b-a\})|+\left| {\ln (C\max \{(K+c)^n,1\})}\right| \\&\quad \le \max \{r,\left| {\ln (\varepsilon )}\right| \}+\ln (\max \{1,b-a\})+\left| {\ln (C)}\right| +\left| {\ln (\max \{(K+c)^n,1\})}\right| \\&\quad \le \max \{r,\left| {\ln (\varepsilon )}\right| \}+\ln (\max \{1,b-a\})+\left| {\ln (C)}\right| +n\max \{\ln (K+c),0\}\\&\quad \le n(1+\max \{1,\tfrac{1}{r}\}(|\ln (C)|+\ln (\max \{1,b-a\})))(\max \{r,\left| {\ln (\varepsilon )}\right| \}\\&\qquad +\max \{\ln (K+c),0\}). \end{aligned}\end{aligned}$$
(7.4)

Furthermore, note for every \(\varepsilon ,c\in (0,\infty )\), \(K\in [0,\infty )\) it holds

$$\begin{aligned} \begin{aligned}&\left[ {\frac{\varepsilon }{\max \{1,b-a\}C\max \{(K+c)^n,1\}}}\right] ^{-\frac{1}{2n^2}}\\&\quad =[\max \{1,b-a\}]^{-\frac{1}{2n^2}}\varepsilon ^{-\frac{1}{2n^2}}C^{\frac{1}{2n^2}}\max \{(K+c)^{\frac{1}{2n}},1\}\\&\quad \le [\max \{1,b-a\}]^{-\frac{1}{2n^2}} C^{\frac{1}{2n^2}}(K+c+1)^{\frac{1}{2n}}\varepsilon ^{-\frac{1}{2n^2}}. \end{aligned}\end{aligned}$$
(7.5)

Combining (7.3), (7.4) and (7.5) with Lemma A.1 and Corollary 6.6 (with \(n\leftrightarrow 2n^2\) in the notation of Corollary 6.6) completes the proof of Corollary 7.1. \(\square \)

We can then employ Proposition 6.4 in order to approximate the required tensor product.

Corollary 7.2

Assume Setting 5.1, let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\), let \(n\in {\mathbb {N}}\), \(a\in (0,\infty )\), \(b\in (a,\infty )\), \(K_{\mathrm {max}}\in (0,\infty )\), \((K_i)_{i\in {\mathbb {N}}}\subseteq [0,K_{\mathrm {max}})\), and let \(h_{c,K}:[a,b]\rightarrow {\mathbb {R}}\), \(c\in (0,\infty )\), \(K\in [0,K_{\mathrm {max}})\), be the functions which satisfy for every \(c\in (0,\infty )\), \(K\in [0,K_{\mathrm {max}})\), \(x\in [a,b]\) that

$$\begin{aligned} h_{c,K}(x)=\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (\frac{K+c}{x})}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r. \end{aligned}$$
(7.6)

For every \(c\in (0, \infty )\), \(d\in {\mathbb {N}}\) let the function \(F^d_c:[a,b]^d\rightarrow {\mathbb {R}}\) be given by

$$\begin{aligned} F^d_c(x)=1-\left[ {\textstyle \prod \limits _{i=1}^d h_{c,K_i}(x_i)}\right] . \end{aligned}$$
(7.7)

Then there exist neural networks \((\Psi ^d_{\varepsilon ,c})_{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\subseteq {\mathfrak {N}}\) which satisfy

  1. (i)

    \(\displaystyle \sup _{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {L}}(\Psi ^d_{\varepsilon ,c})}{\max \{1,\ln (d)\}(\left| {\ln (\varepsilon )}\right| +\ln (d)+1)+\ln (c+1)}}\right] <\infty \),

  2. (ii)

    \(\displaystyle \sup _{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {M}}(\Psi ^d_{\varepsilon ,c})}{(c+1)^{\frac{1}{n}}d^{1+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] <\infty \) and

  3. (iii)

    for every \(\varepsilon ,c\,\in (0,\infty )\), \(d\in {\mathbb {N}}\) that

    $$\begin{aligned} \sup _{x\in [a,b]^d}\left| {F^d_c(x)-\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,c})}\right] \!(x)}\right| \le \varepsilon . \end{aligned}$$
    (7.8)
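
Before turning to the proof, the accuracy budget behind the construction can be checked numerically: each factor \(h_{c,K_i}\) is approximated to accuracy \(\tfrac{\varepsilon }{3d}\), and the resulting product is then off by at most \(\varepsilon \), cf. (7.12)–(7.13). The following sketch (with hypothetical toy data and a random perturbation standing in for the per-factor network error) only illustrates this budget; it is not the network construction.

```python
import numpy as np
from math import erf, log, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))    # standard normal CDF

def F_exact(x, K, c):
    """F^d_c(x) = 1 - prod_i Phi(ln((K_i + c)/x_i)), cf. (7.6)-(7.7)."""
    return 1.0 - np.prod([Phi(log((Ki + c) / xi)) for Ki, xi in zip(K, x)])

def F_from_factors(x, K, c, factor):
    """One minus the product of per-factor approximations, mimicking (7.12)."""
    return 1.0 - np.prod([factor(xi, Ki) for Ki, xi in zip(K, x)])

d, eps, c = 10, 1e-2, 1.0
rng = np.random.default_rng(0)
x, K = rng.uniform(0.9, 1.1, size=d), rng.uniform(0.8, 1.2, size=d)
eta = eps / (3 * d)                                  # per-factor accuracy budget
perturbed = lambda xi, Ki: Phi(log((Ki + c) / xi)) + rng.uniform(-eta, eta)
print(abs(F_exact(x, K, c) - F_from_factors(x, K, c, perturbed)), "<=", eps)
```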

Proof of Corollary 7.2

Throughout this proof, assume Setting 5.2. Corollary 7.1 ensures there exist constants \(b_L,b_M\in (0,\infty )\) and neural networks \(\left( {\Phi ^i_{\eta ,c}}\right) _{\eta ,c\,\in (0,\infty )}\subseteq {\mathfrak {N}}\), \(i\in {\mathbb {N}}\) such that for every \(i\in {\mathbb {N}}\) it holds

  1. (a)

    \(\displaystyle \sup _{\eta ,c\in (0,\infty )}\left[ {\frac{{\mathcal {L}}(\Phi ^i_{\eta ,c})}{\max \{1,|\ln (\eta )|\}+\max \{0,\ln (K_{\mathrm {max}}+c)\}}}\right] <b_L\),

  2. (b)

    \(\displaystyle \sup _{\eta ,c\,\in (0,\infty )}\left[ {\frac{{\mathcal {M}}(\Phi ^i_{\eta ,c})}{(K_{\mathrm {max}}+c+1)^{\frac{1}{n}}\eta ^{-\frac{1}{n^2}}}}\right] <b_M\) and

  3. (c)

    for every \(\eta ,c\in (0,\infty )\) that

    $$\begin{aligned} \sup _{x\in [a,b]}\left| {h_{c,K_i}(x)-\left[ {R_{\varrho }(\Phi ^i_{\eta ,c})}\right] \!(x)}\right| \le \eta . \end{aligned}$$
    (7.9)

Furthermore, for every \(c\in (0,\infty )\), \(i\in {\mathbb {N}}\), \(x\in [a,b]\) it holds

$$\begin{aligned} \begin{aligned} \left| {h_{c,K_i}(x)}\right| =\left| {\tfrac{1}{\sqrt{2\pi }}\int ^{\ln (\frac{K_i+c}{x})}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r}\right| \le \tfrac{1}{\sqrt{2\pi }}\left| {\int ^{\infty }_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r}\right| =1. \end{aligned}\end{aligned}$$
(7.10)

Combining this with (a), (c), Proposition 6.4 and Lemma 5.4 implies that there exist \(C\in {\mathbb {R}}\) and neural networks \((\psi ^d_{\eta ,c})_{\eta \in (0,\infty )}\subseteq {\mathfrak {N}}\), \(c\in (0,\infty )\), \(d\in {\mathbb {N}}\), such that for every \(c\in (0,\infty )\), \(d\in {\mathbb {N}}\) it holds

  1. (A)

    \(\displaystyle {\mathcal {L}}(\psi ^d_{\eta ,c})\le C\ln (d)\left( {\left| {\ln (\eta )}\right| +\ln (d)}\right) +\max _{i\in \{1,2,\dots ,{d}\}}{\mathcal {L}}(\Phi ^i_{\eta ,c})\),

  2. (B)

    \(\displaystyle {\mathcal {M}}(\psi ^d_{\eta ,c})\le C d\left( {\left| {\ln (\eta )}\right| +\ln (d)}\right) +4\sum _{i=1}^d {\mathcal {M}}(\Phi ^i_{\eta ,c})+8d\max _{i\in \{1,2,\dots ,{d}\}}{\mathcal {L}}(\Phi ^i_{\eta ,c})\) and

  3. (C)

    for every \(\eta \in (0,\infty )\) that

    $$\begin{aligned} \sup _{x\in [a,b]^d}\left| {\left[ {\textstyle \prod \limits _{i=1}^d h_{c,K_i}(x_i)}\right] -\left[ {R_{\varrho }(\psi ^d_{\eta ,c})}\right] \!(x)}\right| \le 3d\eta . \end{aligned}$$
    (7.11)

Let \(\lambda \in {\mathcal {N}}_1^{1,1}\) be the neural network given by \(\lambda =\left( {(-1,1)}\right) \), whose realization satisfies \(\left[ {R_{\varrho }(\lambda )}\right] \!(t)=1-t\) for every \(t\in {\mathbb {R}}\), let \(\theta \in {\mathcal {N}}^{1,1}_1\) be the neural network given by \(\theta =\left( {(0,0)}\right) \), whose realization is identically zero, and let \((\Psi ^d_{\varepsilon ,c})_{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\subseteq {\mathfrak {N}}\) be the neural networks given by

$$\begin{aligned} \Psi ^d_{\varepsilon ,c}={\left\{ \begin{array}{ll}\lambda \odot \psi ^d_{\nicefrac {\varepsilon }{(3d)},c} &{} :\varepsilon \le 2\\ \theta &{} :\varepsilon >2\end{array}\right. }. \end{aligned}$$
(7.12)

Observe that this and (C) imply that for every \(\varepsilon \in (0,2]\), \(c\,\in (0,\infty )\), \(d\in {\mathbb {N}}\), \(x\in [a,b]^d\) it holds

$$\begin{aligned} \begin{aligned} \left| {F^d_c(x)-\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,c})}\right] \!(x)}\right|&=\left| {\left( {1-\left[ {\textstyle \prod \limits _{i=1}^d h_{c,K_i}(x_i)}\right] }\right) -\left( {1-\left[ {R_{\varrho }(\psi ^d_{\nicefrac {\varepsilon }{(3d)},c})}\right] \!(x)}\right) }\right| \\&\le 3d\tfrac{\varepsilon }{3d}=\varepsilon . \end{aligned}\end{aligned}$$
(7.13)

Moreover, (7.12) and (7.10) ensure for every \(\varepsilon \in (2,\infty )\), \(c\,\in (0,\infty )\), \(d\in {\mathbb {N}}\), \(x\in [a,b]^d\) it holds

$$\begin{aligned} \begin{aligned} \left| {F^d_c(x)-\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,c})}\right] \!(x)}\right|&=\left| {1-\left[ {\textstyle \prod \limits _{i=1}^d h_{c,K_i}(x_i)}\right] }\right| \le 1\le \varepsilon . \end{aligned}\end{aligned}$$
(7.14)

This and (7.13) establish the neural networks \((\Psi ^d_{\varepsilon ,c})_{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\) satisfy (iii). Next observe that for every \(c\,\in (0,\infty )\) it holds

$$\begin{aligned} \begin{aligned} \max \{0,\ln (K_{\mathrm {max}}+c)\}&\le \max \{0,\ln (\max \{1,K_{\mathrm {max}}\}+\max \{1,K_{\mathrm {max}}\}c)\}\\&=\ln (\max \{1,K_{\mathrm {max}}\}(1+c))=\ln (\max \{1,K_{\mathrm {max}}\})+\ln (1+c)\\&\le \ln (c+1)+|\ln (K_{\mathrm {max}})|. \end{aligned}\end{aligned}$$
(7.15)

Hence, we obtain that for every \(\varepsilon ,c\,\in (0,\infty )\), \(d\in {\mathbb {N}}\) it holds

$$\begin{aligned} \begin{aligned}&\max \left\{ 1,|\ln \left( \tfrac{\varepsilon }{3d}\right) |\right\} +\max \{0,\ln (K_{\mathrm {max}}+c)\}\\&\quad \le |\ln (\varepsilon )|+\ln (d)+\ln (3)+\ln (c+1)+|\ln (K_{\mathrm {max}})|\\&\quad \le (\ln (3)+|\ln (K_{\mathrm {max}})|)\left[ {\max \{1,\ln (d)\}(|\ln (\varepsilon )|+\ln (d)+1)+\ln (c+1)}\right] . \end{aligned}\end{aligned}$$
(7.16)

In addition, for every \(\varepsilon ,c\,\in (0,\infty )\), \(d\in {\mathbb {N}}\) it holds

$$\begin{aligned} C\ln (d)\left( {\left| {\ln \left( \tfrac{\varepsilon }{3d}\right) }\right| +\ln (d)}\right) \le 4C\left[ {\max \{1,\ln (d)\}(|\ln (\varepsilon )|+\ln (d)+1)+\ln (c+1)}\right] . \end{aligned}$$
(7.17)

Combining this with Lemma 5.3, (a), (A) and (7.16) yields

$$\begin{aligned} \begin{aligned}&\sup _{\begin{array}{c} \varepsilon \in (0,2],c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{{\mathcal {L}}(\Psi ^d_{\varepsilon ,c})}{\max \{1,\ln (d)\}(\left| {\ln (\varepsilon )}\right| +\ln (d)+1)+\ln (c+1)}}\right] \\&\quad \le \sup _{\begin{array}{c} \varepsilon \in (0,2],c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{1+C\ln (d)\left( {\left| {\ln (\frac{\varepsilon }{3d})}\right| +\ln (d)}\right) +\max _{i\in \{1,2,\dots ,{d}\}}{\mathcal {L}}(\Phi ^i_{\nicefrac {\varepsilon }{(3d)},c})}{\max \{1,\ln (d)\}(\left| {\ln (\varepsilon )}\right| +\ln (d)+1)+\ln (c+1)}}\right] \\&\quad \le 2+4C+(\ln (3)+|\ln (K_{\mathrm {max}})|)b_L<\infty . \end{aligned}\end{aligned}$$
(7.18)

Moreover, (7.12) shows

$$\begin{aligned} \begin{aligned}&\sup _{\begin{array}{c} \varepsilon \in (2,\infty ),c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{{\mathcal {L}}(\Psi ^d_{\varepsilon ,c})}{\max \{1,\ln (d)\}(\left| {\ln (\varepsilon )}\right| +\ln (d)+1)+\ln (c+1)}}\right] \\&\quad =\sup _{\begin{array}{c} \varepsilon \in (2,\infty ),c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{1}{\max \{1,\ln (d)\}(\left| {\ln (\varepsilon )}\right| +\ln (d)+1)+\ln (c+1)}}\right] <\infty . \end{aligned}\end{aligned}$$
(7.19)

This and (7.18) establish that \((\Psi ^d_{\varepsilon ,c})_{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\) satisfy (i). Next, observe that Lemma A.1 implies the following bounds (a short numerical spot-check is given after the list):

  • for every \(\varepsilon \in (0,2]\) it holds

    $$\begin{aligned} |\ln (\varepsilon )|\le \left[ {\sup _{\delta \in [\exp (-2n^2),2]}\left| {\ln (\delta )}\right| }\right] \varepsilon ^{-\frac{1}{n}}=2n^2\varepsilon ^{-\frac{1}{n}}, \end{aligned}$$
    (7.20)
  • for every \(d\in {\mathbb {N}}\) it holds

    $$\begin{aligned} \ln (d)\le \left[ {\sup _{t\in [1,\exp (2n^2)]}\ln (t)}\right] d^{\frac{1}{n}}=2n^2d^{\frac{1}{n}}, \end{aligned}$$
    (7.21)
  • and for every \(c\in (0,\infty )\) it holds

    $$\begin{aligned} \ln (c+1)\le \left[ {\sup _{t\in (0,\exp (2n^2)-1]}\ln (t+1)}\right] (c+1)^{\frac{1}{n}}=2n^2(c+1)^{\frac{1}{n}}. \end{aligned}$$
    (7.22)
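
The three displayed bounds are elementary; as a sanity check, the following short Python snippet (the grids and the tested values of \(n\) are ours) verifies them numerically.

```python
import numpy as np

# Spot-check of (7.20)-(7.22); the grids and the values of n are illustrative.
for n in (1, 2, 3):
    eps = np.linspace(1e-6, 2.0, 2000)
    d = np.arange(1, 10_000)
    c = np.linspace(1e-6, 1e6, 2000)
    ok = (
        np.all(np.abs(np.log(eps)) <= 2 * n**2 * eps ** (-1.0 / n))
        and np.all(np.log(d) <= 2 * n**2 * d ** (1.0 / n))
        and np.all(np.log(c + 1.0) <= 2 * n**2 * (c + 1.0) ** (1.0 / n))
    )
    print(n, bool(ok))   # prints True for each n
```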

For every \(m\in {\mathbb {N}}\), \(x_i\in [1,\infty )\), \(i\in \{1,2,\dots ,{m}\}\), it holds

$$\begin{aligned} \sum _{i=1}^m x_i\le \textstyle \prod \limits _{i=1}^m(x_i+1)\le 2^m\textstyle \prod \limits _{i=1}^m x_i. \end{aligned}$$
(7.23)

Combining this with (7.20), (7.21) and (7.22) shows for every \(\varepsilon \in (0,2]\), \(d\in {\mathbb {N}}\), \(c\in (0,\infty )\) it holds

$$\begin{aligned} \begin{aligned} 2C d\left( |\ln \left( \tfrac{\varepsilon }{3d}\right) |+\ln (d)\right)&\le 2C d (|\ln (\varepsilon )|+2\ln (d)+\ln (3)+\ln (c+1))\\&\le 4n^2C d (2\varepsilon ^{-\frac{1}{n}}+2d^{\frac{1}{n}}+\ln (3)+(c+1)^{\frac{1}{n}})\\&\le 1024n^2 C (c+1)^{\frac{1}{n}} d^{1+\frac{1}{n}} \varepsilon ^{-\frac{1}{n}}. \end{aligned}\end{aligned}$$
(7.24)

Furthermore, note (7.15), (7.20), (7.21), (7.22) and (7.23) ensure for every \(\varepsilon \in (0,2]\), \(d\in {\mathbb {N}}\), \(c\in (0,\infty )\) it holds

$$\begin{aligned} \begin{aligned}&16d\left( \max \{1,|\ln \left( \tfrac{\varepsilon }{3d}\right) |\}+\max \{0,\ln (K_{\mathrm {max}}+c)\}\right) \\&\quad \le 16d(|\ln (\varepsilon )|+\ln (d)+\ln (3)+\ln (c+1)+|\ln (K_{\mathrm {max}})|)\\&\quad \le 32n^2d(2\varepsilon ^{-\frac{1}{n}}+d^{\frac{1}{n}}+(c+1)^{\frac{1}{n}}+\ln (3)+|\ln (K_{\mathrm {max}})|)\\&\quad \le 2048n^2(\ln (3)+|\ln (K_{\mathrm {max}})|)(c+1)^{\frac{1}{n}} d^{1+\frac{1}{n}} \varepsilon ^{-\frac{1}{n}}. \end{aligned}\end{aligned}$$
(7.25)

In addition, observe that for every \(\varepsilon \in (0,2]\), \(d\in {\mathbb {N}}\), \(c\in (0,\infty )\) it holds

$$\begin{aligned} 4d(K_{\mathrm {max}}+c+1)^{\frac{1}{n}}\left( \tfrac{\varepsilon }{3d}\right) ^{-\frac{1}{n^2}}\le 96\max \{1,K_{\mathrm {max}}\}(c+1)^{\frac{1}{n}}d^{1+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}. \end{aligned}$$
(7.26)

Combining this with Lemma 5.3, (a), (b), (B), (7.24) and (7.25) yields

$$\begin{aligned} \begin{aligned}&\sup _{\begin{array}{c} \varepsilon \in (0,2],c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{{\mathcal {M}}(\Psi ^d_{\varepsilon ,c})}{(c+1)^{\frac{1}{n}}d^{1+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] \\&\quad \le \sup _{\begin{array}{c} \varepsilon \in (0,2],c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{\displaystyle 4 +2Cd\left( |\ln \left( \tfrac{\varepsilon }{3d}\right) |+\ln (d)\right) +8\sum _{i=1}^d{\mathcal {M}}\left( \Phi ^i_{\nicefrac {\varepsilon }{(3d)},c}\right) +16d\max _{i\in \{1,2,\dots ,{d}\}} {\mathcal {L}}\left( \Phi ^i_{\nicefrac {\varepsilon }{(3d)},c}\right) }{(c+1)^{\frac{1}{n}}d^{1+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] \\&\quad \le 8+1024n^2C+96\max \{1,K_{\mathrm {max}}\}b_M+2048n^2(\ln (3)+|\ln (K_{\mathrm {max}})|)b_L<\infty . \end{aligned}\end{aligned}$$
(7.27)

Furthermore, note that (7.12) ensures

$$\begin{aligned} \begin{aligned} \sup _{\begin{array}{c} \varepsilon \in (2,\infty ),c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{{\mathcal {M}}(\Psi ^d_{\varepsilon ,c})}{(c+1)^{\frac{1}{n}}d^{1+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] =\sup _{\begin{array}{c} \varepsilon \in (2,\infty ),c\,\in (0,\infty ),\\ d\in {\mathbb {N}} \end{array}}\left[ {\frac{{\mathcal {M}}(\theta )}{(c+1)^{\frac{1}{n}}d^{1+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] =0. \end{aligned}\end{aligned}$$
(7.28)

This and (7.27) establish that the neural networks \((\Psi ^d_{\varepsilon ,c})_{\varepsilon ,c\,\in (0,\infty ),d\in {\mathbb {N}}}\) satisfy (ii). Thus the proof of Corollary 7.2 is completed. \(\square \)

Finally, we add the quadrature estimates from Sect. 4 to achieve approximation by networks whose size depends only polynomially on the dimension of the problem.

Theorem 7.3

Assume Setting 5.1, let \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the ReLU activation function given by \(\varrho (t)=\max \{0,t\}\), let \(n\in {\mathbb {N}}\), \(a\in (0,\infty )\), \(b\in (a,\infty )\), \((K_i)_{i\in {\mathbb {N}}}\subseteq [0,K_{\mathrm {max}})\) and let \(F_d:(0,\infty )\times [a,b]^d\rightarrow {\mathbb {R}}\), \(d\in {\mathbb {N}}\), be the functions which satisfy for every \(d\in {\mathbb {N}}\), \(c\in (0, \infty )\), \(x\in [a,b]^d\)

$$\begin{aligned} F_d(c,x)=1-\prod _{i=1}^d \left[ {\tfrac{1}{\sqrt{2\pi }}\displaystyle \int ^{\ln (\frac{K_i+c}{x_i})}_{-\infty } e^{-\frac{1}{2}r^2}\mathrm {d}r}\right] . \end{aligned}$$
(7.29)

Then there exist neural networks \((\Gamma _{d,\varepsilon })_{\varepsilon \in (0,1],d\in {\mathbb {N}}}\subseteq {\mathfrak {N}}\) which satisfy

  1. (i)

    \(\displaystyle \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {L}}(\Gamma _{d,\varepsilon })}{\max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+1}\right) }}\right] <\infty \),

  2. (ii)

    \(\displaystyle \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {M}}(\Gamma _{d,\varepsilon })}{d^{2+\frac{1}{n}}\varepsilon ^{-\frac{1}{n}}}}\right] <\infty \) and

  3. (iii)

    for every \(\varepsilon \in (0,1]\), \(d\in {\mathbb {N}}\) that

    $$\begin{aligned} \sup _{x\in [a,b]^d}\left| {\int _0^{\infty }F_d(c,x)\mathrm {d}c-\left[ {R_{\varrho }(\Gamma _{d,\varepsilon })}\right] \!(x)}\right| \le \varepsilon . \end{aligned}$$
    (7.30)

Proof of Theorem 7.3

Throughout this proof, assume Setting 5.2, let \(S_{b,n}\in {\mathbb {R}}\) be given by

$$\begin{aligned} S_{b,n}=2e^{2(4n+1)}(b+1)^{1+\frac{1}{4n}} \end{aligned}$$
(7.31)

and let \(N_{d,\varepsilon }\in {\mathbb {R}}\), \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\), be given by

$$\begin{aligned} N_{d,\varepsilon }=S_{b,n}d^{\frac{1}{4n}}\left[ {\tfrac{\varepsilon }{4}}\right] ^{-\frac{1}{4n}}. \end{aligned}$$
(7.32)

Note that Lemma 4.3 (with \(4n\leftrightarrow n\), \(F_x^d(c)\leftrightarrow F_d(c,x)\), \(N_{d,\frac{\varepsilon }{2}}\leftrightarrow N_{d,\varepsilon }\), \(Q_{d,\frac{\varepsilon }{2}}\leftrightarrow Q_{d,\varepsilon }\) in the notation of Lemma 4.3) ensures that there exist \(Q_{d,\varepsilon }\in {\mathbb {N}}\), \(c^d_{\varepsilon ,j}\in (0,N_{d,\varepsilon })\), \(w^d_{\varepsilon ,j}\in [0,\infty )\), \(j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}\), \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\) with

$$\begin{aligned} \sup _{\varepsilon \in (0,1], d\in {\mathbb {N}}}\left[ {\frac{Q_{d,\varepsilon }}{d^{1+\frac{1}{2n}}\varepsilon ^{-\frac{1}{2n}}}}\right] <\infty \end{aligned}$$
(7.33)

and for every \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\) it holds

$$\begin{aligned} \sup _{x\in [a,b]^d}\left| {\int _0^\infty F_d(c,x)\mathrm {d}c-\sum _{j=1}^{Q_{d,\varepsilon }} w^d_{\varepsilon ,j}F_d(c^d_{\varepsilon ,j},x)}\right| \le \tfrac{\varepsilon }{2} \end{aligned}$$
(7.34)

and

$$\begin{aligned} \sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}=N_{d,\varepsilon }. \end{aligned}$$
(7.35)
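
To make the quadrature step concrete, the following Python sketch builds a rule with the properties just stated, namely nodes \(c^d_{\varepsilon ,j}\in (0,N_{d,\varepsilon })\), nonnegative weights, and weights summing to \(N_{d,\varepsilon }\) as in (7.35), and compares \(\sum _j w^d_{\varepsilon ,j}F_d(c^d_{\varepsilon ,j},x)\) with a reference value of \(\int _0^\infty F_d(c,x)\,\mathrm {d}c\). The use of a Gauss-Legendre rule on the truncated interval \([0,N]\), together with all concrete parameter values, is our illustrative choice and not a reproduction of the construction of Lemma 4.3.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(1)

# F_d(c, x) = 1 - prod_i Phi(ln((K_i + c)/x_i)) for a scalar c, cf. (7.29).
def F(c, x, K):
    return 1.0 - float(np.prod(norm.cdf(np.log((K + c) / x))))

# Illustrative parameters (ours): dimension d, strikes K_i, a point x in
# [a, b]^d, a truncation level N standing in for N_{d,eps} from (7.32),
# and a number of nodes Q standing in for Q_{d,eps}.
a, b, d = 0.5, 2.0, 5
K = rng.uniform(0.0, 3.0, size=d)
x = rng.uniform(a, b, size=d)
N, Q = 200.0, 80

# Gauss-Legendre rule transported from [-1, 1] to [0, N]: nodes c_j in (0, N)
# and nonnegative weights w_j with sum_j w_j = N, matching (7.35).
t, w = leggauss(Q)
c_nodes = 0.5 * N * (t + 1.0)
w_nodes = 0.5 * N * w
print("weights sum to N:", bool(np.isclose(w_nodes.sum(), N)))

quad_val = sum(wj * F(cj, x, K) for wj, cj in zip(w_nodes, c_nodes))
ref_val, _ = quad(lambda cc: F(cc, x, K), 0.0, np.inf)  # adaptive reference
print(f"quadrature {quad_val:.6f}  vs  integral {ref_val:.6f}")
```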

Furthermore, Corollary 7.2 (with \(4n\leftrightarrow n\), \(F_{c^d_{\varepsilon ,j}}^d(x)\leftrightarrow F_d(c^d_{\varepsilon ,j},x)\) in the notation of Corollary 7.2) ensures that there exist neural networks \((\Psi ^d_{\varepsilon ,j})_{\varepsilon \in (0,\infty ),d\in {\mathbb {N}},j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}\subseteq {\mathfrak {N}}\) which satisfy

  1. (a)

    \(\displaystyle \sup _{\varepsilon \in (0,\infty ),d\in {\mathbb {N}}}\left[ {\frac{\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {L}}(\Psi ^d_{\varepsilon ,j})}{\max \{1,\ln (d)\}\left( {|\ln (\frac{\varepsilon }{2N_{d,\varepsilon }})|+\ln (d)+1}\right) +\ln (N_{d,\varepsilon }+1)}}\right] <\infty \),

  2. (b)

    \(\displaystyle \sup _{\varepsilon \in (0,\infty ),d\in {\mathbb {N}}}\left[ {\frac{\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {M}}(\Psi ^d_{\varepsilon ,j})}{(N_{d,\varepsilon }+1)^{\frac{1}{4n}}d^{1+\frac{1}{4n}}\left[ {\frac{\varepsilon }{2N_{d,\varepsilon }}}\right] ^{-\frac{1}{4n}}}}\right] <\infty \) and

  3. (c)

    for every \(\varepsilon \in (0,\infty )\), \(d\in {\mathbb {N}}\), \(j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}\) that

    $$\begin{aligned} \sup _{x\in [a,b]^d}\left| {F_d(c^d_{\varepsilon ,j},x)-\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,j})}\right] \!(x)}\right| \le \tfrac{\varepsilon }{2N_{d,\varepsilon }}. \end{aligned}$$
    (7.36)

Let \(\mathrm {Id}_d\in {\mathbb {R}}^{d\times d}\), \(d\in {\mathbb {N}}\), be the identity matrices given by \(\mathrm {Id}_d=\mathrm {diag}(1,1,\dots ,1)\), let \(\nabla _{d,q}\in {\mathcal {N}}_1^{d,dq}\), \(d,q\in {\mathbb {N}}\), be the neural networks given by

$$\begin{aligned} \nabla _{d,q}=\left( {(\begin{pmatrix}\mathrm {Id}_d \\ \vdots \\ \mathrm {Id}_d\end{pmatrix},0)}\right) , \end{aligned}$$
(7.37)

let \(\Sigma _{d,\varepsilon }\in {\mathcal {N}}_1^{d,1}\), \(d\in {\mathbb {N}}\), \(\varepsilon \in (0,1]\), be the neural networks given by

$$\begin{aligned} \Sigma _{d,\varepsilon }=\left( {(\begin{pmatrix}w^d_{\varepsilon ,1}&w^d_{\varepsilon ,2}&\dots&w^d_{\varepsilon ,Q_{d,\varepsilon }}\end{pmatrix},0)}\right) , \end{aligned}$$
(7.38)

and let \((\Gamma _{d,\varepsilon })_{\varepsilon \in (0,1],d\in {\mathbb {N}}}\in {\mathfrak {N}}\) be the neural networks given by

$$\begin{aligned} \Gamma _{d,\varepsilon }=\Sigma _{d,\varepsilon }\odot {\mathcal {P}}(\Psi ^d_{\varepsilon ,1}, \Psi ^d_{\varepsilon ,2}, \dots , \Psi ^d_{\varepsilon ,Q_{d,\varepsilon }}) \odot \nabla _{d,Q_{d,\varepsilon }}. \end{aligned}$$
(7.39)
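
The assembly (7.37)-(7.39) amounts to duplicating the input, applying the subnetworks in parallel, and taking a weighted sum of their outputs. The short Python sketch below emulates this on the level of realizations (the realizations of the \(\Psi ^d_{\varepsilon ,j}\) are replaced by arbitrary stand-in functions, and \(\odot \) and \({\mathcal {P}}\) are mimicked directly with matrices rather than implemented via the network calculus of Setting 5.1; all sizes and weights are ours) and confirms that the composed map agrees with \(\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,j})}\right] \!(x)\), which is exactly how \(R_{\varrho }(\Gamma _{d,\varepsilon })\) enters the estimate (7.40) below.

```python
import numpy as np

rng = np.random.default_rng(2)

d, Q = 4, 6                                    # illustrative sizes (ours)
w = rng.uniform(0.0, 1.0, size=Q)              # stands in for w^d_{eps,j}

# Stand-ins for the realizations R(Psi^d_{eps,j}) : R^d -> R (not ReLU nets).
psis = [lambda x, s=j: float(np.tanh(s + x.sum())) for j in range(Q)]

# nabla_{d,Q}: x in R^d |-> (x, x, ..., x) in R^{dQ}, cf. (7.37).
fan_out = np.vstack([np.eye(d)] * Q)           # weight matrix, zero bias

# Sigma_{d,eps}: y in R^Q |-> sum_j w_j * y_j, cf. (7.38).
weight_row = w.reshape(1, Q)                   # weight matrix, zero bias

def realize_Gamma(x):
    """Emulates R(Sigma o P(Psi_1, ..., Psi_Q) o nabla)(x): duplicate the
    input, apply each subnetwork to its own copy, take the weighted sum."""
    copies = (fan_out @ x).reshape(Q, d)       # Q identical copies of x
    y = np.array([psis[j](copies[j]) for j in range(Q)])
    return (weight_row @ y).item()

x = rng.uniform(0.5, 2.0, size=d)              # a point in [a, b]^d
direct = float(np.dot(w, [psis[j](x) for j in range(Q)]))
print(np.isclose(realize_Gamma(x), direct))    # True: the two maps coincide
```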

Combining Lemma 5.3, Lemma 5.4, (7.34), (7.35) and (c) implies that for every \(\varepsilon \in (0,1]\), \(d\in {\mathbb {N}}\), \(x\in [a,b]^d\) it holds

$$\begin{aligned} \begin{aligned}&\left| {\int _0^{\infty }F_d(c,x)\mathrm {d}c-\left[ {R_{\varrho }(\Gamma _{d,\varepsilon })}\right] \!(x)}\right| \\&\quad \le \left| {\int _0^\infty F_d(c,x)\mathrm {d}c-\sum _{j=1}^{Q_{d,\varepsilon }} w^d_{\varepsilon ,j}F_d(c^d_{\varepsilon ,j},x)}\right| +\left| {\sum _{j=1}^{Q_{d,\varepsilon }} w^d_{\varepsilon ,j}F_d(c^d_{\varepsilon ,j},x)-\left[ {R_{\varrho }(\Gamma _{d,\varepsilon })}\right] \!(x)}\right| \\&\quad \le \tfrac{\varepsilon }{2}+\left| {\sum _{j=1}^{Q_{d,\varepsilon }} w^d_{\varepsilon ,j}F_d(c^d_{\varepsilon ,j},x)-\sum _{j=1}^{Q_{d,\varepsilon }}w^d_{\varepsilon ,j}\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,j})}\right] \!(x)}\right| \\&\quad \le \tfrac{\varepsilon }{2}+\sum _{j=1}^{Q_{d,\varepsilon }} w^d_{\varepsilon ,j}\left| {F_d(c^d_{\varepsilon ,j},x)-\left[ {R_{\varrho }(\Psi ^d_{\varepsilon ,j})}\right] \!(x)}\right| \le \tfrac{\varepsilon }{2}+N_{d,\varepsilon }\tfrac{\varepsilon }{2N_{d,\varepsilon }}=\varepsilon . \end{aligned}\end{aligned}$$
(7.40)

This establishes that the neural networks \((\Gamma _{d,\varepsilon })_{\varepsilon \in (0,1],d\in {\mathbb {N}}}\) satisfy (iii). Next, observe that for every \(\varepsilon \in (0,1]\), \(d\in {\mathbb {N}}\) it holds

$$\begin{aligned} \begin{aligned}&\max \{1,\ln (d)\}\left( {|\ln (\frac{\varepsilon }{2N_{d,\varepsilon }})|+\ln (d)+1}\right) +\ln (N_{d,\varepsilon }+1)\\&\quad \le \max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+3\ln (N_{d,\varepsilon })+\ln (2)+1}\right) \\&\quad \le \max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+3\left( {\ln (S_{b,n})+\frac{1}{4n}\ln (d)+\frac{1}{4n}|\ln (\varepsilon )|+\frac{1}{4n}\ln (4)}\right) +2}\right) \\&\quad \le \max \{1,\ln (d)\}\left( {4|\ln (\varepsilon )|+4\ln (d)+3\ln (S_{b,n})+8}\right) \\&\quad \le (3\ln (S_{b,n})+8)\max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+1}\right) . \end{aligned}\end{aligned}$$
(7.41)

Combining this with Lemma 5.3, Lemma 5.4 and (a) implies

$$\begin{aligned} \begin{aligned}&\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {L}}(\Gamma _{d,\varepsilon })}{\max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+1}\right) }}\right] \\&\quad \le \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {L}}(\Sigma _{d,\varepsilon })+\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {L}}(\Psi ^d_{\varepsilon ,j})+{\mathcal {L}}(\nabla _{d,Q_{d,\varepsilon }})}{\max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+1}\right) }}\right] \\&\quad \le 2+\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {L}}(\Psi ^d_{\varepsilon ,j})}{\max \{1,\ln (d)\}\left( {|\ln (\varepsilon )|+\ln (d)+1}\right) }}\right] \\&\quad \le 2+(3\ln (S_{b,n})+8)\!\!\!\!\!\!\sup _{\varepsilon \in (0,\infty ),d\in {\mathbb {N}}}\left[ {\frac{\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {L}}(\Psi ^d_{\varepsilon ,j})}{\max \{1,\ln (d)\}\left( {|\ln (\frac{\varepsilon }{2N_{d,\varepsilon }})|+\ln (d)+1}\right) +\ln (N_{d,\varepsilon }+1)}}\right] \\&\quad <\infty . \end{aligned}\end{aligned}$$
(7.42)

This establishes that \((\Gamma _{d,\varepsilon })_{\varepsilon \in (0,1],d\in {\mathbb {N}}}\) satisfy (i). In addition, for every \(\varepsilon \in (0,1]\), \(d\in {\mathbb {N}}\) it holds

$$\begin{aligned} \begin{aligned} (N_{d,\varepsilon }+1)^{\frac{1}{4n}}d^{1+\frac{1}{4n}}\left[ {\frac{\varepsilon }{2N_{d,\varepsilon }}}\right] ^{-\frac{1}{4n}}&\le 4N_{d,\varepsilon }^{\frac{1}{2n}}d^{1+\frac{1}{4n}}\varepsilon ^{-\frac{1}{4n}}\\&\le 4\left[ {S_{b,n}d^{\frac{1}{4n}}\left[ {\tfrac{\varepsilon }{4}}\right] ^{-\frac{1}{4n}}}\right] ^{\frac{1}{2n}}d^{1+\frac{1}{4n}}\varepsilon ^{-\frac{1}{4n}}\\&\le 16S_{b,n}d^{1+\frac{1}{4n}+\frac{1}{4n^2}}\varepsilon ^{-(\frac{1}{4n}+\frac{1}{8n^2})}\\&\le 16S_{b,n}d^{1+\frac{1}{2n}}\varepsilon ^{-\frac{1}{2n}}. \end{aligned}\end{aligned}$$
(7.43)

Combining this with Lemma 5.3, Lemma 5.4, (7.33), (b) and the fact that for every \(\psi \in {\mathfrak {N}}\) which satisfies \({\min _{l\in \{1,2,\dots ,{{\mathcal {L}}(\psi )}\}}{\mathcal {M}}_l(\psi )>0}\) it holds \({\mathcal {L}}(\psi )\le {\mathcal {M}}(\psi )\) ensures

$$\begin{aligned}&\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{{\mathcal {M}}(\Gamma _{d,\varepsilon })}{d^{(2+\frac{1}{n})}\varepsilon ^{-\frac{1}{n}}}}\right] \nonumber \\&\quad \le \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{\displaystyle 2{\mathcal {M}}(\Sigma _{d,\varepsilon })+4\left( {2\sum _{j=1}^{Q_{d,\varepsilon }}{\mathcal {M}}(\Psi ^d_{\varepsilon ,j})+4Q_{d,\varepsilon }\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {L}}(\Psi ^d_{\varepsilon ,j})}\right) +4{\mathcal {M}}(\nabla _{d,Q_{d,\varepsilon }})}{d^{(2+\frac{1}{n})}\varepsilon ^{-\frac{1}{n}}}}\right] \nonumber \\&\quad \le \sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{24Q_{d,\varepsilon }\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {M}}(\Psi ^d_{\varepsilon ,j})}{d^{(2+\frac{1}{n})}\varepsilon ^{-\frac{1}{n}}}}\right] +\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{2Q_{d,\varepsilon }+4dQ_{d,\varepsilon }}{d^{(2+\frac{1}{n})}\varepsilon ^{-\frac{1}{n}}}}\right] \nonumber \\&\quad \le 24\left( {\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{Q_{d,\varepsilon }}{d^{(1+\frac{1}{2n})}\varepsilon ^{-\frac{1}{2n}}}}\right] }\right) \left( {\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {M}}(\Psi ^d_{\varepsilon ,j})}{d^{(1+\frac{1}{2n})}\varepsilon ^{-\frac{1}{2n}}}}\right] }\right) \nonumber \\&\qquad +4\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{Q_{d,\varepsilon }}{d^{(1+\frac{1}{n})}\varepsilon ^{-\frac{1}{n}}}}\right] \nonumber \\&\quad \le 24\left( {\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{Q_{d,\varepsilon }}{d^{(1+\frac{1}{2n})}\varepsilon ^{-\frac{1}{2n}}}}\right] }\right) \left( {1+16S_{b,n}\sup _{\varepsilon \in (0,1],d\in {\mathbb {N}}}\left[ {\frac{\max _{j\in \{1,2,\dots ,{Q_{d,\varepsilon }}\}}{\mathcal {M}}(\Psi ^d_{\varepsilon ,j})}{(N_{d,\varepsilon }+1)^{\frac{1}{4n}}d^{1+\frac{1}{4n}}\left[ {\frac{\varepsilon }{2N_{d,\varepsilon }}}\right] ^{-\frac{1}{4n}}}}\right] }\right) \nonumber \\&\quad <\infty . \end{aligned}$$
(7.44)

This establishes the neural networks \((\Gamma _{d,\varepsilon })_{\varepsilon \in (0,1],d\in {\mathbb {N}}}\) satisfy (ii). The proof of Theorem 7.3 is thus completed. \(\square \)

8 Discussion

While Theorem 7.3 only establishes formally that the solution of one specific high-dimensional PDE may be approximated by neural networks without curse of dimensionality, the constructive approach also serves to illustrate that neural networks are capable of accomplishing the same for any PDE solution which exhibits a similar low-rank structure. Note here that the tensor product construction in Proposition 6.4 only introduces a logarithmic dependency on the approximation accuracy. That we end up with a spectral rate in this specific case is due to Proposition 6.4 and Lemma 4.3, i.e., the insufficient regularity of the univariate functions inside the tensor product, as well as the number of terms required by the Gaussian quadrature used to approximate the outer integral. In particular, this means that the approach in Sect. 6 might also be used to produce approximation results with connectivity growing only logarithmically in the inverse of the approximation error, given that one has a suitably well-behaved low-rank structure.

The present result is a promising step toward the higher-order numerical solution of high-dimensional PDEs, which are notoriously troublesome to handle with any of the classical approaches based on discretization of the domain, or with randomized (a.k.a. Monte Carlo-based) arguments. Of course, answering the question of approximability can only ensure that there exist networks with a reasonable size-to-accuracy trade-off, whereas for any practical purpose it is also necessary to establish whether and how one can find these networks.

An analysis of the generalization error for linear Kolmogorov equations can be found in [4], which concludes that, under reasonable assumptions, the number of required Monte Carlo samples is free of the curse of dimensionality. Moreover, there are a number of empirical results [2, 3, 21, 39, 42] which suggest that the solutions of various high-dimensional PDEs may be learned efficiently using standard stochastic gradient descent-based methods. However, a satisfying formal analysis of this training process does not seem to be available at present.

Lastly, we would like to point out that, even though a semi-explicit formula was available to us, the ReLU networks we used for the approximation were in no way adapted to this knowledge; general ReLU networks have, moreover, been shown to exhibit excellent approximation properties for, e.g., piecewise smooth functions [34], affine and Gabor systems [10] and even fractal structures [9]. So, while a spline dictionary-based approach specifically designed for the approximation of this one PDE solution may achieve similar rates, it would most certainly lack the remarkable universality of neural networks.