Exponential ReLU DNN expression of holomorphic maps in high dimension

For a parameter dimension d ∈ N, we consider the approximation of many-parametric maps u : [−1,1]^d → R by deep ReLU neural networks. The input dimension d may possibly be large, and we assume quantitative control of the domain of holomorphy of u: i.e., u admits a holomorphic extension to a Bernstein polyellipse E_{ρ_1} × ⋯ × E_{ρ_d} ⊂ C^d of semiaxis sums ρ_j > 1 containing [−1,1]^d. We establish the exponential rate O(exp(−bN^{1/(d+1)})) of expressive power in terms of the total NN size N and of the input dimension d of the ReLU NN in W^{1,∞}([−1,1]^d). The constant b > 0 depends on (ρ_j)_{j=1}^d, which characterizes the coordinate-wise sizes of the Bernstein ellipses for u. We prove exponential convergence in stronger norms for the approximation by DNNs with more regular, so-called "rectified power unit" (RePU) activations.


Introduction
In recent years, so-called deep artificial neural networks ('DNNs' for short) have seen a dramatic development in applications from data science and machine learning.
Accordingly, after early results in the '90s on genericity and universality of DNNs (see [22] for a survey and references), in recent years the refined mathematical analysis of their approximation properties has received increasing attention. A particular class of many-parametric maps whose DNN approximation needs to be considered in many applications are real-analytic and holomorphic maps. Accordingly, the question of DNN expression rate bounds for such maps has received some attention in the approximation theory literature [16,17,7].
It is well-known that multivariate holomorphic maps admit exponential expression rates by multivariate polynomials. In particular, countably-parametric maps u : [−1,1]^∞ → R can be represented under certain conditions by so-called generalized polynomial chaos expansions which, in turn, can be N-term truncated with controlled approximation rate bounds in terms of N. The polynomials which appear in such expansions can, in turn, be represented by DNNs, either exactly for certain activation functions, or approximately, for example for the so-called rectified linear unit ("ReLU") activation, with exponentially small representation error [13,26].
The purpose of the present paper is to establish corresponding DNN expression rate bounds in the Lipschitz norm for high-dimensional, analytic maps u : [−1,1]^d → R. We focus on ReLU DNNs, but also comment in passing on versions of our results for other DNN activation functions. Next, we briefly discuss the relation of previous results to the present work and outline the structure of this paper.

Recent mathematical results on expressive power of DNNs
The survey [22] presented succinct proofs of genericity of shallow NNs in various function classes, as shown originally e.g. in [11,10,15] and reviewed the state of mathematical theory of DNNs up to that point. Moreover, exponential expression rate bounds for analytic functions by neural networks had already been achieved in the '90s. We mention in particular [17] where smooth, nonpolynomial activation functions were considered.
More closely related to the present work are the references [7,16]. In [16], approximation rates for deep NN approximations of multivariate analytic functions were investigated, and exponential rate bounds in terms of the total size of the NN were obtained for sigmoidal activation functions. In [7], the approximation by deep ReLU NNs of certain functions u : [−1,1]^d → R which admit holomorphic extensions to C^d was considered. In particular, it was assumed that u admits a Taylor expansion about the origin of C^d which converges absolutely and uniformly on [−1,1]^d. It is well known that not every u which is real-analytic in [−1,1]^d admits such an expansion. In the present paper, we prove sharper expression rate bounds, for both the ReLU activation σ_1 and the RePU activations σ_r, for functions which are merely assumed to be real-analytic in [−1,1]^d, in L^∞([−1,1]^d) and in stronger norms, thereby generalizing both [7] and [16].

Contributions of the present paper
We prove exponential expression rate bounds of DNNs for d-variate, real-valued functions which depend analytically on their d inputs. Specifically, for holomorphic mappings u : [−1,1]^d → R, we prove expression error bounds in L^∞([−1,1]^d) and in W^{k,∞}([−1,1]^d) for k ∈ N (the precise range of k depending on properties of the NN activation σ). We consider both the ReLU activation σ_1 : R → R_+ : x ↦ x^+ and the RePU activations σ_r : R → R_+ : x ↦ (x^+)^r for some integer r ≥ 2.
Here, x^+ = max{x, 0}. The expression error bounds with σ_1 as activation are in W^{1,∞}([−1,1]^d) and of the general type O(exp(−bN^{1/(d+1)})) in terms of the NN size N, with a constant b > 0 depending on the domain of analyticity but independent of N (however, with the constant implied in the Landau symbol O(·) depending exponentially on d, in general). With activation σ_r for r ≥ 2, the bounds are in W^{k,∞}([−1,1]^d) for arbitrary fixed k ∈ N and of the type O(exp(−bN^{1/d})) in terms of the NN size N. For all r ∈ N, the parameters of the σ_r-neural networks approximating u (so-called "weights" and "biases") are continuous functions of u in appropriate norms.

Outline
The structure of the paper is as follows. In Section 2, we present the definition of the DNN architectures and fix notation and terminology. We also review in Section 2.2 a "ReLU DNN calculus" from recent work [21,8], which will facilitate the ensuing DNN expression rate analysis. A first set of key results are ReLU DNN expression rates in W^{1,∞}([−1,1]^d) for multivariate Legendre polynomials, which are proved in Section 2.3. These novel expression rate bounds are explicit in the W^{1,∞}-accuracy and in the polynomial degree. They are of independent interest and remarkable in that the ReLU DNNs which, as we prove, emulate the polynomials at exponential rates realize continuous, piecewise affine functions. They are based on [13,26]. The proofs, being constructive, shed a rather precise light on the architecture, in particular the depth and width of the ReLU DNNs, that is sufficient for polynomial emulation. In Section 2.4, we briefly comment on corresponding results for RePU activations; as a rule, the same exponential rates are achieved with slightly smaller NNs and in norms which are stronger than W^{1,∞}. Section 3 then contains the main results of this note: exponential ReLU DNN expression rate bounds for d-variate, holomorphic maps. They are based on a) polynomial approximation of these maps and on b) ReLU DNN re-approximation of the approximating polynomials. These are presented in Sections 3.1 and 3.2. We again comment, in Section 3.3, on the corresponding results for RePU activations.

Notation
We adopt standard notation consistent with our previous works [29,30]: N = {1, 2, …} and N_0 := N ∪ {0}. We write R_+ := {x ∈ R : x ≥ 0}. The symbol C will stand for a generic positive constant, independent of any asymptotic quantities in an estimate, which may change its value even within the same equation.
The ordering "≤" on N_0^d is understood componentwise: for µ, ν ∈ N_0^d we write µ ≤ ν if µ_j ≤ ν_j for all j = 1, …, d. We write |Λ| to denote the finite cardinality of a set Λ.
Deep neural network approximations

DNN architecture
We consider deep neural networks (DNNs for short) of feed forward type. Such a NN f can mathematically be described as a repeated composition of linear transformations with a nonlinear activation function.
More precisely: for an activation function σ : R → R, a fixed number of hidden layers L ∈ N and numbers N_ℓ ∈ N of computation nodes in the layers ℓ ∈ {0, …, L+1}, a map f : R^{N_0} → R^{N_{L+1}} is realized by a feedforward neural network if, for certain weights w_{i,j}^ℓ ∈ R and biases b_j^ℓ ∈ R, it holds for all x = (x_i)_{i=1}^{N_0} ∈ R^{N_0} that

z_j^0 = x_j,   j = 1, …, N_0,   (2.1)

z_j^ℓ = σ( ∑_{i=1}^{N_{ℓ−1}} w_{i,j}^ℓ z_i^{ℓ−1} + b_j^ℓ ),   j = 1, …, N_ℓ,  ℓ = 1, …, L,   (2.2)

and finally

f_j(x) = z_j^{L+1} = ∑_{i=1}^{N_L} w_{i,j}^{L+1} z_i^L + b_j^{L+1},   j = 1, …, N_{L+1}.   (2.3)

In this case n = N_0 is the dimension of the input, and m = N_{L+1} is the dimension of the output. Furthermore, z_j^ℓ denotes the output of unit j in layer ℓ. The weight w_{i,j}^ℓ has the interpretation of connecting the ith unit in layer ℓ − 1 with the jth unit in layer ℓ.
Except when explicitly stated, we will not distinguish between the network (which is defined through σ, the w_{i,j}^ℓ and the b_j^ℓ) and the function f : R^{N_0} → R^{N_{L+1}} it realizes. We note in passing that this relation is typically not one-to-one, i.e., different NNs may realize the same function as their output. Let us also emphasize that we allow the weights w_{i,j}^ℓ and biases b_j^ℓ, for ℓ ∈ {1, …, L+1}, i ∈ {1, …, N_{ℓ−1}} and j ∈ {1, …, N_ℓ}, to take any value in R, i.e., we do not consider quantization as e.g. in [1,21].
As is customary in the theory of NNs, the number of hidden layers L of a NN is referred to as its depth, and the total number of nonzero weights and biases as the size of the NN. Hence, for a DNN f as in (2.1)-(2.3), we define

depth(f) := L,   size(f) := |{(i,j,ℓ) : w_{i,j}^ℓ ≠ 0}| + |{(j,ℓ) : b_j^ℓ ≠ 0}|.

In addition, size_in(f) := |{(i,j) : w_{i,j}^1 ≠ 0}| + |{j : b_j^1 ≠ 0}| and size_out(f) := |{(i,j) : w_{i,j}^{L+1} ≠ 0}| + |{j : b_j^{L+1} ≠ 0}|, which are the number of nonzero weights and biases in the input layer of f and in the output layer, respectively.
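To fix ideas, the following is a minimal numpy sketch of the realization (2.1)-(2.3) and of the size count as the number of nonzero weights and biases; the function names realize and size are illustrative and not taken from the paper.

```python
import numpy as np

def relu(x):
    # sigma_1(x) = max{0, x}, applied componentwise
    return np.maximum(x, 0.0)

def realize(weights, biases, x, activation=relu):
    """Evaluate the function realized by a feedforward NN as in (2.1)-(2.3).

    weights: list of matrices W[ell] of shape (N_{ell-1}, N_ell), ell = 1..L+1
    biases:  list of vectors  b[ell] of shape (N_ell,)
    The activation acts in the hidden layers only; the output layer is affine.
    """
    z = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers 1..L
        z = activation(z @ W + b)
    W, b = weights[-1], biases[-1]                # affine output layer L+1
    return z @ W + b

def size(weights, biases):
    # total number of nonzero weights and biases
    return sum(np.count_nonzero(W) for W in weights) + \
           sum(np.count_nonzero(b) for b in biases)

# toy example: depth L = 1, input dimension 2, output dimension 1
Ws = [np.array([[1.0, -1.0], [1.0, 1.0]]), np.array([[0.5], [0.5]])]
bs = [np.zeros(2), np.zeros(1)]
print(realize(Ws, bs, np.array([0.3, -0.7])), size(Ws, bs))
```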
The proofs of our main results Theorem 3.7 and Theorem 3.9 are constructive, in the sense that we will explicitly construct NNs with the desired properties. We construct these NNs by assembling smaller networks, using the operations of concatenation and parallelization, as well as so-called "identity-networks" which realize the identity mapping. Below, we recall the definitions. For these operations, we also provide bounds on the number of nonzero weights in the input layer and the output layer of the corresponding network, which can be derived from the definitions in [21].

DNN calculus
Throughout, as activation function σ we consider either the ReLU activation function σ_1(x) := max{0, x} or, as suggested in [16,14,12], for r ∈ N, r ≥ 2, the RePU activation function σ_r(x) := (max{0, x})^r. If a NN uses σ_r as activation function, we refer to it as a σ_r-NN. ReLU NNs are referred to as σ_1-NNs. We assume throughout that all activations in a DNN are of equal type.
Remark 2.1 (Historical note on rectified power units). "Rectified power unit" (RePU) activation functions are particular cases of so-called sigmoidal functions of order k ∈ N, k ≥ 2, i.e., functions σ satisfying lim_{x→∞} σ(x)/x^k = 1 and lim_{x→−∞} σ(x)/x^k = 0. The use of NNs with such activation functions for function approximation dates back to the early 1990's, cf. e.g. [16,14]. In fact, the proofs in [16, Section 3] proceed in three steps. First, a given function f was approximated by a polynomial, then this polynomial was expressed as a linear combination of powers of a RePU, and finally it was shown that for r ≥ 2 and arbitrary A > 0 the RePU σ_r can be approximated on [−A, A] with arbitrarily small L^∞([−A,A])-precision ε by a NN with a sigmoidal activation function of order k ≥ r, which has depth 1 and fixed network size ([16, Lemma 3.6]).
The exact realization of polynomials by σ_r-networks for r ≥ 2 was observed in the proof of [16, Theorem 3.3], based on ideas in the proof of [3, Theorem 3.1]. The same result was recently rediscovered in [12, Theorem 6], whose authors were apparently not aware of [3,16].
We now indicate several fundamental operations on NNs which will be used in the following. These operations have been frequently used in recent works [21,19,8].

Parallelization
We now recall the parallelization of two networks f and g, which in parallel emulates f and g. We first describe the parallelization of networks with the same inputs, as in [21]; the parallelization of networks with different inputs is similar and is introduced directly afterwards.
Let f and g be two NNs with the same depth L ∈ N_0 and the same input dimension n ∈ N. Denote by m_f the output dimension of f and by m_g the output dimension of g. Then there exists a neural network (f, g) : R^n → R^{m_f + m_g}, called the parallelization of f and g, which in parallel emulates f and g, i.e.,

(f, g)(x) = (f(x), g(x))   for all x ∈ R^n.
It holds that depth((f, g)) = L, size((f, g)) = size(f) + size(g), size_in((f, g)) = size_in(f) + size_in(g) and size_out((f, g)) = size_out(f) + size_out(g). We next recall the parallelization of networks with inputs of possibly different dimension, as in [8]. To this end, we let f and g be two NNs with the same depth L ∈ N_0 whose input dimensions n_f and n_g may be different, and whose output dimensions we denote by m_f and m_g, respectively.
Then there exists a neural network (f, g)_d : R^{n_f + n_g} → R^{m_f + m_g}, called the full parallelization of networks with distinct inputs of f and g, which in parallel emulates f and g, i.e.,

(f, g)_d(x, x̃) = (f(x), g(x̃))   for all x ∈ R^{n_f}, x̃ ∈ R^{n_g}.
Parallelizations of networks with possibly different inputs can be used consecutively to emulate multiple networks in parallel.
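At the level of weight matrices, the full parallelization can be sketched as follows: the per-layer matrices of f and g are placed block-diagonally and the biases are concatenated, so that only zero entries are added and the sizes simply add up. This is a minimal illustration (function names are ours, not from [8,21]); the result can be evaluated with the realize helper from the previous sketch.

```python
import numpy as np

def block_diag(A, B):
    # place A and B on the diagonal of a larger, zero-padded matrix
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def parallelize_distinct_inputs(Wf, bf, Wg, bg):
    """Full parallelization (f, g)_d of two networks of equal depth.

    Wf, bf and Wg, bg are the per-layer weight matrices / bias vectors of f and g.
    The parallel network acts on the concatenated input (x, x~) and returns
    (f(x), g(x~)); since only zeros are added, size((f, g)_d) = size(f) + size(g).
    """
    Wp = [block_diag(A, B) for A, B in zip(Wf, Wg)]
    bp = [np.concatenate([a, b]) for a, b in zip(bf, bg)]
    return Wp, bp
```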

Identity networks
We now recall identity networks ([21, Lemma 2.3]), which emulate the identity map.
For all n ∈ N and L ∈ N_0 there exists a σ_1-identity network Id_{R^n} of depth L which emulates the identity map Id_{R^n} : R^n → R^n : x ↦ x. Its size is at most 2n(L + 1): each coordinate x_i can be propagated through the hidden layers by the two ReLU units σ_1(x_i) and σ_1(−x_i), using x_i = σ_1(x_i) − σ_1(−x_i). Analogously, for r ≥ 2 there exist σ_r-identity networks. To construct them, we use the concatenation f • g of two NNs f and g as introduced in [21, Definition 2.2]. As we shall make use of it subsequently in Propositions 2.3 and 2.4, we recall its definition here for the convenience of the reader.

Definition 2.2 ([21, Definition 2.2]). Let f, g be such that the output dimension of g equals the input dimension of f, which we denote by k. Denote the weights and biases of f by {u_{i,j}^ℓ}_{i,j,ℓ} and {a_j^ℓ}_{j,ℓ} and those of g by {v_{i,j}^ℓ}_{i,j,ℓ} and {c_j^ℓ}_{j,ℓ}. Then the NN f • g emulates the composition x ↦ f(g(x)) and satisfies depth(f • g) = depth(f) + depth(g). Its weights and biases are given by those of g in the first depth(g) layers, by those of f in the last depth(f) layers, and, in layer depth(g) + 1, by the affine map obtained by composing the output layer of g with the first layer of f.

Proposition 2.3. For all r ≥ 2, n ∈ N and L ∈ N_0 there exists a σ_r-NN Id_{R^n} of depth L which emulates the identity function Id_{R^n} : R^n → R^n : x ↦ x and whose size is bounded by n(L + 1)(4r^2 + 2r).

Proof. We proceed in two steps: first we discuss L = 1, then L > 1.
Step 1. It was shown in [12] that for every r ≥ 2 there exist coefficients a_0, a_1, …, a_{2r} ∈ R, b_1, …, b_{2r} ∈ R and w_1, …, w_{2r} ∈ {−1, 1} such that x = a_0 + ∑_{i=1}^{2r} a_i σ_r(w_i x + b_i) for all x ∈ R. This shows the existence of a network Id_{R^1} : R → R of depth 1 realizing the identity on R. The network employs 2r weights and 2r biases in the first layer, and 2r weights and one bias (namely a_0) in the output layer. Its size is thus 6r + 1.
Step 2. For L > 1, we consider the L-fold concatenation Id_{R^1} • ⋯ • Id_{R^1} of the identity network Id_{R^1} from Step 1. The resulting network has depth L, input dimension 1 and output dimension 1. The number of weights and the number of biases in the first layer both equal 2r, the number of weights in the output layer equals 2r, and the number of biases 1. In each of the L − 1 other hidden layers, the number of weights is 4r^2 and the number of biases 2r. In total, the network has size at most 4r + (L − 1)(4r^2 + 2r) + 2r + 1 ≤ L(4r^2 + 2r), where we used that r ≥ 2.
Identity networks with input size n ∈ N are obtained as the parallelization with distinct inputs of n identity networks with input size 1.
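For r = 2, one admissible choice of the 2r = 4 hidden units in Step 1 can be written down explicitly and checked numerically; the sketch below uses the identity (x+1)^2 − (x−1)^2 = 4x together with t^2 = σ_2(t) + σ_2(−t), and is not claimed to be the particular choice made in [12].

```python
import numpy as np

def sigma_r(x, r):
    # RePU activation sigma_r(x) = (max{0, x})^r
    return np.maximum(x, 0.0) ** r

def identity_repu2(x):
    """Depth-1 sigma_2 network realizing the identity on R.

    Combines t^2 = sigma_2(t) + sigma_2(-t) with (x+1)^2 - (x-1)^2 = 4x,
    i.e. four hidden units with inner weights +-1 and biases +-1.
    """
    return 0.25 * (sigma_r(x + 1, 2) + sigma_r(-x - 1, 2)
                   - sigma_r(x - 1, 2) - sigma_r(-x + 1, 2))

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(identity_repu2(x), x)   # exact up to rounding
```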

Sparse concatenation
The sparse concatenation of two σ 1 -NNs f and g was introduced in [21].
Let f and g be σ_1-NNs such that the number of nodes in the output layer of g equals the number of nodes in the input layer of f, which we denote by k. Denote by n the number of nodes in the input layer of g, and by m the number of nodes in the output layer of f. Then, with "•" as in Definition 2.2, the sparse concatenation of the NNs f and g is defined as the network

f • g := f • Id_{R^k} • g,   (2.6)

where Id_{R^k} is the σ_1-identity network of depth 1. The network f • g realizes the function x ↦ f(g(x)); i.e., by abuse of notation, the symbol "•" has two meanings here, depending on whether we interpret f • g as a function or as a network. This will not be the cause of confusion, however. It holds that depth(f • g) = depth(f) + 1 + depth(g) and

size(f • g) ≤ size(f) + size_in(f) + size_out(g) + size(g) ≤ 2 size(f) + 2 size(g).   (2.8)

For a proof, we refer to [21].
A similar result holds for σ_r-NNs. In this case we define the sparse concatenation f • g as in (2.6), but with Id_{R^k} now denoting the σ_r-identity network of depth 1 from Proposition 2.3.

Proposition 2.4. For r ≥ 2 let f, g be two σ_r-NNs such that the output dimension of g, which we denote by k ∈ N, equals the input dimension of f, and suppose that size_in(f), size_out(g) ≥ k. Denote by f • g the σ_r-network obtained by the σ_r-sparse concatenation. Then depth(f • g) = depth(f) + 1 + depth(g), and a bound (2.9) of the form size(f • g) ≤ size(f) + size(g) + C_r (size_in(f) + size_out(g)) holds, with C_r depending only on r, together with corresponding bounds on size_in(f • g) and size_out(f • g).

Proof. It follows directly from Definition 2.2 and Proposition 2.3 that depth(f • g) = depth(f) + 1 + depth(g). To bound the size of the network, note that the weights in layers ℓ = 1, …, depth(g) equal those in the first depth(g) layers of g. Those in layers ℓ = depth(g) + 2, …, depth(g) + 2 + depth(f) equal those in the last depth(f) layers of f. Layer ℓ = depth(g) + 1 has 2r size_out(g) weights and 2rk biases, whereas layer ℓ = depth(g) + 2 has 2r size_in(f) weights and k biases. This shows Equation (2.9) and the bound on size_in(f • g) and size_out(f • g).
Identity networks are often used in combination with parallelizations. In order to parallelize two networks f and g with depth(f ) < depth(g), the network f can be concatenated with an identity network, resulting in a network whose depth equals depth(g) and which emulates the same function as f .

Basic results
In [13] it was shown that deep networks employing both ReLU and BiS ("binary step") units are capable of approximating the product of two numbers with a network whose size and depth increase merely logarithmically in the accuracy. In other words, certain neural networks achieve uniform exponential convergence of the operation of multiplication (of two numbers in a bounded interval) w.r.t. the network size. Independently, a similar result for ReLU networks was obtained in [26]. Here, we shall use the latter result in the following slightly more general form shown in [25]. Contrary to [26], it provides a bound of the error in the W^{1,∞}-norm: for every δ ∈ (0,1) and M ≥ 1 there exists a σ_1-NN ×̃_{δ,M} : [−M, M]^2 → R which approximates the product of its two inputs with W^{1,∞}-error at most δ, and whose size and depth are bounded by C(1 + log(M/δ)) for a constant C independent of δ and M (Proposition 2.5). It is immediate that Proposition 2.5 implies the existence of networks approximating the multiplication of n different numbers. We now show such a result, generalizing [25, Proposition 3.3] in that we consider the error again in the W^{1,∞}-norm (instead of the L^∞-norm).

Proposition 2.6. For any δ ∈ (0,1), n ∈ N and M ≥ 1 there exists a σ_1-NN ∏̃_{δ,M} : [−M, M]^n → R which approximates the product (x_1, …, x_n) ↦ ∏_{j=1}^n x_j in the sense that the error bounds (2.11) hold, both for the function values and for the weak gradient, where ∂/∂x_i denotes a weak derivative.
There exists a constant C, independent of δ ∈ (0,1), n ∈ N and M ≥ 1, such that the size and the depth of ∏̃_{δ,M} satisfy the bounds in (2.12).

Proof. We proceed analogously to the proof of [25, Proposition 3.3], and construct ∏̃_{δ,1} as a binary tree of ×̃-networks from Proposition 2.5 with appropriately chosen parameters for the accuracy and the maximum input size. We define ñ := min{2^k : k ∈ N, 2^k ≥ n}, and consider the product of ñ numbers x_1, …, x_ñ ∈ [−M, M]. In case n < ñ, we define x_{n+1}, …, x_ñ := 1, which can be implemented by a bias in the first layer. Because ñ < 2n, the bounds on network size and depth in terms of ñ also hold in terms of n, possibly with a larger constant.
It suffices to show the result for M = 1, since for M > 1 a suitably rescaled version of ∏̃_{δ,1}, applied to x/M and multiplied by M^n, achieves the desired bounds, as is easily verified. Therefore, w.l.o.g. M = 1 throughout the rest of this proof. Equation (2.11a) follows by the argument given in the proof of [25, Proposition 3.3]; we recall it here for completeness. By abuse of notation, for every even k ∈ N let a (k-dependent) mapping R = R^1 : R^k → R^{k/2} be defined via

R(x_1, …, x_k) := ( ×̃_{δ/ñ^2;2}(x_1, x_2), …, ×̃_{δ/ñ^2;2}(x_{k−1}, x_k) ).

That is, for each product network ×̃_{δ/ñ^2;2} as in Proposition 2.5 we choose maximum input size "M = 2" and accuracy "δ/ñ^2". Hence the ℓ-fold composition R^ℓ can be interpreted as a mapping from R^{2^ℓ} to R. We now define ∏̃_{δ,1} := R^{log_2(ñ)} : [−1,1]^ñ → R and next show the error bounds in (2.11) (recall that by definition x_{n+1} = ⋯ = x_ñ = 1 in case ñ > n).
The number of binary tree layers (each denoted by R) is bounded by O(log_2(ñ)). With the bound on the network depth from Proposition 2.5, for M = 1 the second part of (2.12) follows.

To estimate the network size, we cannot use the estimate size(f • g) ≤ 2 size(f) + 2 size(g) from Equation (2.8), because the number of concatenations, log_2(ñ) − 1, depends on n; hence the factors 2 would give an extra n-dependent factor in the estimate on the network size. Instead, from Equation (2.8) we use size(f • g) ≤ size(f) + size_in(f) + size_out(g) + size(g), together with the bounds from Proposition 2.5. Summing the sizes of all product networks over the binary tree layers (2^{log_2(ñ)−ℓ} being the number of product networks in binary tree layer ℓ) yields the first part of (2.12), which finishes the proof of (2.12) for M = 1.
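The binary-tree structure used in this proof can be illustrated by the following sketch, in which exact pairwise multiplication stands in for the approximate ×̃ networks of Proposition 2.5; tree_product is an illustrative name.

```python
import numpy as np
from math import ceil, log2

def tree_product(xs):
    """Multiply n numbers by a balanced binary tree of pairwise products.

    Mirrors the construction in the proof of Proposition 2.6: pad with ones
    up to the next power of two, then apply about log2(n) layers of pairwise
    multiplications (one layer per mapping R).
    """
    xs = list(xs)
    n_tilde = 2 ** ceil(log2(max(len(xs), 1)))
    xs = xs + [1.0] * (n_tilde - len(xs))        # x_{n+1} = ... = x_n~ := 1
    while len(xs) > 1:                           # one binary tree layer per pass
        xs = [xs[i] * xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

vals = [0.9, -0.5, 0.7, 1.2, 0.3]
assert np.isclose(tree_product(vals), np.prod(vals))
```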
The previous two propositions can be used to deduce bounds on the approximation of univariate polynomials on compact intervals w.r.t. the W^{1,∞}-norm. One such result was already proven in [19, Proposition 4].

ReLU DNN approximation of univariate Legendre polynomials
For future reference we note that, by (2.17) and Equation (2.19) below, corresponding bounds hold for all j ∈ N_0, δ ∈ (0,1) and k ∈ {0, 1}. To provide bounds on the size of the networks approximating the tensor product Legendre polynomials, for finite subsets Λ ⊂ N_0^d we will make use of the quantity m(Λ) := max_{ν∈Λ} ‖ν‖_1. With this notation, the network f_{Λ,δ} constructed below emulates the approximate tensor product Legendre polynomials {L̃_{ν,δ}}_{ν∈Λ} in parallel, with depth and size bounds that hold for a constant C > 0 that is independent of d, Λ and δ.

Proof. Let δ ∈ (0,1) and a finite subset Λ ⊂ N_0^d be given. The proof is divided into three steps. In the first step, we define ReLU NN approximations {L̃_{ν,δ}}_{ν∈Λ} of the tensor product Legendre polynomials and fix the parameters used in the NN approximation. In the second step, we estimate the error of the approximation, and the L^∞([−1,1]^d)-norm of the L̃_{ν,δ}, ν ∈ Λ. In the third step, we describe the network f_{Λ,δ} and estimate its depth and size.

It follows that for all
To determine the error of the gradient, without loss of generality we only consider the derivative with respect to y_1. In the case 1 ∉ supp ν, we trivially have ∂/∂y_1 (L_ν(y) − L̃_{ν,δ}(y)) = 0 for all y ∈ [−1,1]^d. Thus let ν_1 ≠ 0 in the following. Then, with δ′ as fixed in Step 1, the corresponding bound on the gradient error follows.

Step 3. We now describe the network f_{Λ,δ}, which in parallel emulates {L̃_{ν,δ}}_{ν∈Λ}. The network is constructed as the concatenation of two subnetworks, i.e. f_{Λ,δ} = f^{(1)}_{Λ,δ} • f^{(2)}_{Λ,δ}.
The subnetwork f^{(2)}_{Λ,δ} evaluates, in parallel, approximate univariate Legendre polynomials in the input variables (y_j)_{j∈supp ν}. With T := {(j, ν_j) ∈ N^2 : ν ∈ Λ, j ∈ supp ν} it is defined as a parallelization of the univariate networks associated with the elements of T and of identity networks; the depth of the identity networks is chosen such that all components of the parallelization have equal depth. The subnetwork f^{(1)}_{Λ,δ} takes the output of f^{(2)}_{Λ,δ} as input and computes, for each ν ∈ Λ, the product of the corresponding approximate univariate Legendre polynomials by means of the product networks of Proposition 2.6, again with the depth of the identity networks chosen such that all components of the parallelization have equal depth.
The network depth of f_{Λ,δ} is determined by the depths of the subnetworks f^{(1)}_{Λ,δ} and f^{(2)}_{Λ,δ}.
We can choose the depths of the identity networks in the definition of f^{(2)}_{Λ,δ} such that (denoting here and in the remainder of this proof by C > 0 constants independent of d, Λ and δ ∈ (0,1)) the stated bound on depth(f^{(2)}_{Λ,δ}) holds, where we used that 2m(Λ) + 2 ≤ 4m(Λ) when Λ ≠ {0}.
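For concreteness, the following sketch evaluates exact tensor-product Legendre polynomials L_ν(y) = ∏_j P_{ν_j}(y_j) for ν in a small index set Λ, i.e. the functions that the network f_{Λ,δ} emulates approximately; whether an L^2-normalization factor is included per factor is a convention, and it is omitted here as an assumption.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_1d(n, y):
    # standard (unnormalized) Legendre polynomial P_n evaluated at y
    c = np.zeros(n + 1)
    c[n] = 1.0
    return legendre.legval(y, c)

def tensor_legendre(nu, y):
    """Evaluate L_nu(y) = prod_j P_{nu_j}(y_j) for a multiindex nu and y in [-1,1]^d.

    An L^2([-1,1])-normalization sqrt((2*n+1)/2) per factor could be included,
    depending on the convention used; it is omitted here.
    """
    return np.prod([legendre_1d(n, yj) for n, yj in zip(nu, y)])

Lambda = [(0, 0), (1, 0), (0, 1), (2, 1)]          # example index set in N_0^2
y = np.array([0.3, -0.5])
print({nu: tensor_legendre(nu, y) for nu in Lambda})
```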

RePU DNN emulation of polynomials
The approximation of polynomials by neural networks can be significantly simplified if, instead of the ReLU activation σ_1, we consider as activation function the so-called rectified power unit ("RePU" for short): recall that for r ∈ N, r ≥ 2, the RePU activation is defined by σ_r(x) = (max{0, x})^r, x ∈ R. In contrast to σ_1-NNs, as shown in [12], for every r ∈ N, r ≥ 2, there exist RePU networks of depth 1 realizing the multiplication of two real numbers without error. This yields the following result, proven in [12, Theorem 9] for r = 2; with [12, Theorem 5] this extends to all r ≥ 2.
Proposition 2.11. Fix d ∈ N and r ∈ N, r ≥ 2. Then there exists a constant C > 0 (depending on d) such that for any finite downward closed Λ ⊆ N_0^d and any p ∈ P_Λ there is a σ_r-network p̃ : R^d → R which realizes p exactly and such that size(p̃) ≤ C|Λ| and depth(p̃) ≤ C log_2(|Λ|).

Remark 2.12. Let ψ : R → R be an arbitrary C^2 function that is not linear, i.e., it does not hold that ψ″(x) = 0 for all x ∈ R. In [23] it is shown that ψ-networks can approximate the multiplication of two numbers a, b in a fixed bounded interval up to arbitrary accuracy with a fixed number of units. We also refer to [30, Section 3.3], where we explain this observation from [23] in more detail. From this, analogous to [12, Theorem 9], one can obtain a version of Proposition 2.11 for arbitrary C^2 activation functions. To state it, we fix d ∈ N. Then there exists C > 0 (depending on d) such that for every δ > 0, every downward closed Λ ⊆ N_0^d and every p ∈ P_Λ, there exists a ψ-neural network q̃ : R^d → R which satisfies ‖p − q̃‖_{L^∞([−1,1]^d)} ≤ δ, size(q̃) ≤ C|Λ| and depth(q̃) ≤ C log_2(|Λ|).
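As an illustration of the exact RePU multiplication underlying Proposition 2.11, the following sketch realizes the product of two reals with a single hidden σ_2-layer, using 4xy = (x+y)^2 − (x−y)^2 and t^2 = σ_2(t) + σ_2(−t); this particular choice of weights is one admissible realization, not necessarily the one used in [12].

```python
import numpy as np

def sigma2(t):
    # RePU activation of order r = 2
    return np.maximum(t, 0.0) ** 2

def mul_repu2(x, y):
    """Exact multiplication x*y by a depth-1 sigma_2 network.

    Combines t^2 = sigma2(t) + sigma2(-t) with 4xy = (x+y)^2 - (x-y)^2,
    i.e. four hidden units with inner weights +-1 and an affine output layer.
    """
    return 0.25 * (sigma2(x + y) + sigma2(-x - y)
                   - sigma2(x - y) - sigma2(-x + y))

a, b = np.random.randn(2)
assert np.isclose(mul_repu2(a, b), a * b)   # exact up to rounding
```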

Exponential expression rate bounds
We now proceed to the statement and proof of the main result of the present note, namely the exponential rate bounds for the DNN expression of d-variate holomorphic maps. First, in Section 3.1 we recall (classical) polynomial approximation results for analytic functions. Subsequently, these are used to deduce DNN approximation results for ReLU and RePU networks. We recall that for every u that is real-analytic on [−1,1]^d there always exists ρ > 1 such that u allows a holomorphic extension to ×_{j=1}^d E_ρ. For the proof of the theorem we shall use the following result mentioned in [27].

Polynomial approximation
The lemma is proved by computing (as an upper bound of the left-hand side in (3.2)) the volume of the set {(x_j)_{j=1}^d ∈ R_+^d : ∑_{j=1}^d (x_j − 1)/a_j ≤ 1}, which equals the right-hand side in (3.2). The significance of this result is that it provides an upper bound on the cardinality of multiindex sets of the type {ν ∈ N_0^d : ρ^{−ν} ≥ ε}, where ρ^{−ν} := ∏_{j=1}^d ρ_j^{−ν_j}. To see this, note that due to log(ρ^{−ν}) = −∑_{j=1}^d ν_j log(ρ_j), for any ε ∈ (0,1) we have

{ν ∈ N_0^d : ρ^{−ν} ≥ ε} = {ν ∈ N_0^d : ∑_{j=1}^d ν_j log(ρ_j) ≤ log(1/ε)}.

Applying Lemma 3.2 with a_j = log(1/ε)/log(ρ_j) we thus obtain a bound on the cardinality of this set (also see [2, Lemma 4.2]).
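To illustrate the sets just described, the following sketch enumerates {ν ∈ N_0^d : ρ^{−ν} ≥ ε} and compares its cardinality with the simplex-volume bound (1/d!)(∏_j a_j)(1 + ∑_j 1/a_j)^d with a_j = log(1/ε)/log(ρ_j); since (3.2) itself is not reproduced above, this particular form of the bound is an assumption used only for illustration.

```python
from itertools import product
from math import factorial, log, prod

def index_set(rho, eps):
    """Enumerate {nu in N_0^d : prod_j rho_j**(-nu_j) >= eps} for rho_j > 1."""
    caps = [int(log(1.0 / eps) / log(r)) for r in rho]   # per-coordinate maxima
    return [nu for nu in product(*(range(c + 1) for c in caps))
            if sum(n * log(r) for n, r in zip(nu, rho)) <= log(1.0 / eps)]

rho, eps = (2.0, 3.0, 1.5), 1e-3
Lam = index_set(rho, eps)

# volume-type bound (assumed form), with a_j = log(1/eps)/log(rho_j)
a = [log(1.0 / eps) / log(r) for r in rho]
bound = prod(a) * (1.0 + sum(1.0 / aj for aj in a)) ** len(a) / factorial(len(a))
print(len(Lam), bound)   # cardinality and an upper bound for it
```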
We are now in position to prove the following theorem, variations of which can be considered as classical.
We note that by Stirling's inequality, with ρ_min := min_{j=1,…,d} ρ_j and ρ_max := max_{j=1,…,d} ρ_j, the constants in the preceding bound can be controlled explicitly. We shall also use the following classical result for Taylor expansions of holomorphic functions.

ReLU DNN approximation
We now come to the main result, concerning the approximation of holomorphic functions on bounded intervals by ReLU networks. The following theorem improves upon [7, Theorem 2.6] in two ways: first, we merely assume u to be analytic from [−1,1]^d → R (and not analytic from (B_1^C)^d → C, cp. [7, Theorem 2.6] and Remark 3.1), and second, we consider the error in the W^{1,∞}([−1,1]^d)-norm rather than in L^∞([−1,1]^d).

Theorem 3.7. Let d ∈ N and ρ = (ρ_j)_{j=1}^d with ρ_j > 1 for all j, and assume that u : [−1,1]^d → R admits a holomorphic extension to the Bernstein polyellipse E_{ρ_1} × ⋯ × E_{ρ_d} ⊂ C^d. Then there exist constants β′ = β′(ρ, d) > 0 and C = C(u, ρ, d) > 0, and for every N ∈ N there exists a σ_1-NN ũ_N : [−1,1]^d → R satisfying the size and depth bounds (3.18) and the error bound

‖u − ũ_N‖_{W^{1,∞}([−1,1]^d)} ≤ C exp(−β′ N^{1/(d+1)}).   (3.19)

Proof. Throughout this proof, let β > 0 be fixed such that (3.8) holds. We proceed in five steps: in Steps 1-2 we treat the case d = 1; subsequently, in Steps 3-5 we treat the case d ≥ 2.
Step 1. We start with d = 1. Throughout this step fix m ∈ N. We now construct a NN û_m approximating u with accuracy δ := ρ^{−m} (up to some constant), where by assumption u : E_ρ → C is holomorphic. First, we assume that C_u := sup_{y∈E_ρ} |u(y)| ≤ 1. Let −1 = x_0 < ⋯ < x_n = 1 be a finite sequence of equidistant points, with n ∈ N so large that 1/n < κ/ρ for a suitable κ ∈ (0,1) depending on ρ.
Then (x_j)_{j=0}^n induces a partition of [−1,1] into n intervals of length 2/n. For every 0 ≤ j ≤ n and y ∈ I_j, we approximate u by its truncated Taylor expansion u_{m,j} of degree m around x_j; because C_u ≤ 1, Lemma 3.6 (with γ := 1/n and γ/κ = 1/(κn) < 1/ρ) yields a bound on the truncation error u − u_{m,j} on I_j. Moreover, Lemma 3.6 implies bounds, valid for all 0 ≤ j ≤ n (here we use κ ∈ (0,1)), on the derivative of the truncation error, for C depending on κ, which depends on ρ, and on the Taylor coefficients of u_{m,j}, for C depending on ρ, but independent of u. Therefore, and due to δ = ρ^{−m}, Proposition 2.7 gives σ_1-NNs ũ_{m,j,δ} emulating u_{m,j} to accuracy δ, with the norm bound (3.21) and the size and depth bounds (3.22), for a constant C depending on ρ, but independent of m and u. Next, denote by (ϕ_j)_{j=0}^n the continuous, piecewise affine functions on the partition induced by (x_j)_{j=0}^n on [−1,1] such that ϕ_j(x_i) = δ_{i,j} for all i, j. As is well known, each ϕ_j can be expressed without error by a ReLU network of depth 1 and size at most 3 (see for example [25]). For C > 0 as in (3.21) and M := C + 1 we define a network û_m approximating u by

û_m := ∑_{j=0}^n ×̃_{δ,M}(ϕ_j, ũ_{m,j,δ}).   (3.23)

We observe that for all 0 ≤ j ≤ n the sup-norms of ũ_{m,j,δ} and of its derivative on [−1,1] are bounded, where here and in the following all derivatives are interpreted as weak derivatives. These estimates will be used repeatedly in the following. We now provide an upper bound for ‖u − û_m‖_{W^{1,∞}([−1,1])}. For every 0 ≤ j ≤ n, the error on supp ϕ_j can be split by the triangle inequality into the Taylor truncation error, the error of the NN emulation of u_{m,j}, and the error of the approximate multiplication ×̃_{δ,M}.
Using (3.21) and δ = ρ^{−m}, as well as ‖ϕ′_j‖_{L^∞([−1,1])} ≤ n and supp ϕ_j = I_j = [−1,1] ∩ [x_j − 1/n, x_j + 1/n], these error terms can be bounded by Cδ for a constant C depending on n, u and on ρ. Hence

‖u − û_m‖_{W^{1,∞}([−1,1])} ≤ Cδ.   (3.24)

The constant C depends on the number of intervals n in the partition induced by (x_j)_{j=0}^n, but we emphasize that n is a fixed constant in this computation and does not increase as δ → 0.
We now describe the network û_m. The σ_1-NN û_m realizing (3.23) is assembled, by means of parallelizations and concatenations, from the product networks ×̃_{δ,M}, the hat-function networks ϕ_j, the networks ũ_{m,j,δ}, identity networks and a summation network Sum_{n+1}. Here, Sum_{n+1} is a network with input dimension n + 1, output dimension 1, depth 0 and size n + 1 which implements the sum of its inputs. The depth of the identity networks is chosen such that all components of the parallelization have equal depth, which is 1 + max_{j=0}^n depth(ũ_{m,j,δ}). Next, we bound the size and depth of the network û_m. By (3.22), the identity networks contribute at most (2n + 2) · 2 · (1 + max_{j=0}^n depth(ũ_{m,j,δ})) ≤ C(4n + 4)(1 + m log(m)) to the network size, for C independent of u, but depending on ρ. Using the bound size(×̃_{δ,M}) ≤ C(1 + log(M/δ)) ≤ Cm (by Proposition 2.5), (3.22), and size(ϕ_j) = 3 for all 0 ≤ j ≤ n, we obtain a corresponding bound on size(û_m), for a constant C that does not depend on m ∈ N and u. Now, for u with C_u = sup_{y∈E_ρ} |u(y)| > 1, we approximate u/C_u as above and multiply all weights and biases in the output layer of the resulting network by C_u. This does not affect the network's depth and size, and it follows that (3.24) holds with C replaced by C_u C.
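As an aside on the construction in Step 1, the following sketch shows the standard realization of hat functions ϕ_j by depth-1 ReLU networks (a second difference of three ReLUs) and the partition-of-unity assembly behind (3.23), with exact multiplication standing in for ×̃_{δ,M} and first-order Taylor polynomials standing in for u_{m,j}; all names and the toy target are illustrative.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def hat(y, left, center, right):
    """Piecewise affine hat with hat(center) = 1, supported on [left, right].

    Realized by a depth-1 ReLU network: a second difference of three ReLUs
    (equidistant nodes assumed, h = center - left = right - center).
    """
    h = center - left
    return (relu(y - left) - 2.0 * relu(y - center) + relu(y - right)) / h

nodes = np.linspace(-1.0, 1.0, 5)
y = np.linspace(-1.0, 1.0, 201)
h = nodes[1] - nodes[0]
phis = [hat(y, x - h, x, x + h) for x in nodes]
assert np.allclose(sum(phis), 1.0)                 # the hats sum to one on [-1, 1]

u = np.cos                                         # toy target
u_local = [lambda t, x=x: u(x) - np.sin(x) * (t - x) for x in nodes]  # local Taylor
u_hat = sum(phi * um(y) for phi, um in zip(phis, u_local))
print(np.max(np.abs(u_hat - u(y))))                # small uniform error
```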
Hence (3.29) and the definition of N imply the asserted bound on size(ũ_N). Similarly, one obtains the bound on the depth of ũ_N by using (3.28). This shows (3.18) for C_2 as in (3.29) (independent of d, β and u). Similarly as in Step 2 (by increasing C > 0 in (3.19) if necessary), we conclude that (3.19) holds for all N ∈ N.
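For orientation, inverting an error bound of the form (3.19), ε = C exp(−β′ N^{1/(d+1)}), shows that reaching accuracy ε requires a network size of about N ≈ (log(C/ε)/β′)^{d+1}; the following sketch evaluates this for illustrative (assumed) values of C, β′ and d.

```python
from math import log

def required_size(eps, C=1.0, beta=0.5, d=3):
    """Smallest N with C*exp(-beta*N**(1/(d+1))) <= eps, from inverting (3.19).

    C, beta and d are illustrative values, not constants from the paper.
    """
    return (log(C / eps) / beta) ** (d + 1)

for eps in (1e-2, 1e-4, 1e-8):
    print(eps, required_size(eps))
```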
Remark 3.8. Note that in Step 1 of the proof, the network û_m depends on u only through the NNs {ũ_{m,j,δ}}_{j=0}^n. The weights and biases of those networks depend continuously on u with respect to the L^∞(E_ρ)-norm (because the only u-dependent weights and biases are the Taylor coefficients of u, which are bounded in terms of C_u).

RePU DNN approximation
For RePU approximations, with activation σ_r for integer r ≥ 2, we may combine Proposition 2.11 (which is [12, Theorem 9]) and Theorem 3.5 to infer the following result. Note that the decay of the resulting upper bound (3.33) on the error in terms of the network size N, which is of the type O(exp(−bN^{1/d})), is slightly faster than the one we obtained for ReLU approximations in (3.19).

Here ‖·‖ denotes the Euclidean norm on R^n resp. on R^d; as usual, for n = 1 we write |f|. The corresponding error estimate holds for a constant C which now additionally depends on |T^{−1}|_{W^{1,∞}([−1,1]^d, R^d)}. The approximation of the integral (4.1) can thus be reduced to the problem of approximating the integral of the surrogate g̃_N, which can be efficiently represented by a σ_1-NN.
More generally, if π : [−1,1]^d → (0, ∞) is, for example, a continuous density function (not necessarily a product of its marginals), then there exists a bijective transport T : [−1,1]^d → [−1,1]^d such that, analogous to (4.3), it holds that ∫_{[−1,1]^d} u(y) π(y) dy = ∫_{[−1,1]^d} u(T^{−1}(x)) dx (contrary to the situation above, this transformation T is not diagonal in general). One explicit representation of such a transport is provided by the Knothe-Rosenblatt transport, see, e.g., [24, Section 2.3]. It has the property that T inherits the smoothness of π, cp. [24, Remark 2.19]. In case T^{−1} can be realized without error by a σ_1 (or σ_r) network, we find again an estimate of the type (3.19). If T^{−1} does not allow an explicit representation by a NN, however, we may still approximate T^{−1} by a NN S̃_N to obtain a NN g̃_N := ũ_N • S̃_N approximating g = u ∘ T^{−1}. This will introduce an additional error in (4.4) due to the approximation of T^{−1}, which is addressed in [20].
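As a minimal numerical illustration of this change of variables in d = 1, the sketch below approximates E_{Y~π}[u(Y)] = ∫ u(y)π(y) dy by sampling X uniformly on [−1,1] and evaluating u ∘ T^{−1}, where T^{−1} is the one-dimensional Knothe-Rosenblatt (inverse-CDF) transport; the specific density, the toy integrand and the use of exact u in place of a NN surrogate g̃_N are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def u(y):
    return np.cos(np.pi * y)            # toy integrand (stand-in for a NN surrogate)

def pi_dens(y):
    return (1.0 + y) / 2.0              # assumed density on [-1, 1], integrates to 1

def T_inv(x):
    """Inverse-CDF transport in d = 1: T_inv(X) ~ pi_dens for X ~ U[-1, 1]."""
    p = (x + 1.0) / 2.0                 # CDF of the uniform law on [-1, 1]
    return 2.0 * np.sqrt(p) - 1.0       # inverse CDF of pi_dens

# Monte Carlo estimate of E_{Y ~ pi}[u(Y)] via the transported uniform samples
X = rng.uniform(-1.0, 1.0, 200_000)
mc = np.mean(u(T_inv(X)))

# reference value by a simple Riemann sum of u * pi over [-1, 1]
ygrid = np.linspace(-1.0, 1.0, 20_001)
ref = np.mean(u(ygrid) * pi_dens(ygrid)) * 2.0
print(mc, ref)                          # the two values agree up to MC error
```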