Exponential ReLU DNN Expression of Holomorphic Maps in High Dimension

For a parameter dimension d ∈ ℕ, we consider the approximation of many-parametric maps u : [−1,1]^d → ℝ by deep ReLU neural networks. The input dimension d may possibly be large, and we assume quantitative control of the domain of holomorphy of u: i.e., u admits a holomorphic extension to a Bernstein polyellipse E_{ρ_1} × ⋯ × E_{ρ_d} ⊂ ℂ^d of semiaxis sums ρ_i > 1 containing [−1,1]^d. We establish the exponential rate O(exp(−bN^{1/(d+1)})) of expressive power in terms of the total NN size N and of the input dimension d of the ReLU NN in W^{1,∞}([−1,1]^d). The constant b > 0 depends on (ρ_j)_{j=1}^d, which characterizes the coordinate-wise sizes of the Bernstein ellipses for u. We also prove exponential convergence in stronger norms for the approximation by DNNs with more regular, so-called "rectified power unit" activations. Finally, we extend the DNN expression rate bounds also to two classes of non-holomorphic functions, in particular to d-variate, Gevrey-regular functions, and, by composition, to certain multivariate probability distribution functions with Lipschitz marginals.


Introduction
In recent years, so-called deep artificial neural networks ("DNNs" for short) have seen dramatic development in applications from data science and machine learning.
Accordingly, after early results in the 1990s on genericity and universality of DNNs (see [27] for a survey and references), in recent years the refined mathematical analysis of their approximation properties, viz. "expressive power," has received increasing attention. A particular class of many-parametric maps whose DNN approximation needs to be considered in many applications are real-analytic and holomorphic maps. Accordingly, the question of DNN expression rate bounds for such maps has received some attention in the approximation theory literature [21,22,36].
It is well known that multi-variate, holomorphic maps admit exponential expression rates by multivariate polynomials. In particular, countably parametric maps u : [− 1, 1] ∞ → R can be represented under certain conditions by so-called generalized polynomial chaos expansions with quantified sparsity in coefficient sequences. This, in turn, implies N -term truncations with controlled approximation rate bounds in terms of N , with approximation rates which do not depend on the dimension of the active parameters in the truncated approximation [6,7]. The polynomials which appear in such expansions can, in turn, be represented by DNNs, either exactly for certain activation functions, or approximately for example for the so-called rectified linear unit ("ReLU") activation with exponentially small representation error [18,37].
The purpose of the present paper is to establish corresponding DNN expression rate bounds in Lipschitz-norm (i.e., W 1,∞ -norm) for high-dimensional, analytic maps u : [− 1, 1] d → R. We focus on ReLU DNNs, but comment in passing also on versions of our results for other DNN activation functions. Next, we briefly discuss the relation of previous results to the present work and also outline the structure of this paper.

Recent Mathematical Results on Expressive Power of DNNs
The survey [27] presented succinct proofs of genericity of shallow NNs in various function classes, as shown originally, e.g., in [15,16,20] and reviewed the state of mathematical theory of DNNs up to that point. Moreover, exponential expression rate bounds for analytic functions by neural networks had already been achieved in the 1990s. We mention in particular [22] where smooth, nonpolynomial activation functions were considered.
More closely related to the present work are the references [21,36]. In [21], approximation rates for deep NN approximations of multivariate functions which are analytic have been investigated. Exponential rate bounds in terms of the total size of the NN have been obtained, for sigmoidal activation functions. In [37], it was observed that the multiplication of two real numbers, and consequently polynomials, can efficiently be approximated by deep ReLU NNs. This was used in [36] to prove bounds on the DNN approximation of certain functions u : [− 1, 1] d → R which admit holomorphic extensions to some open subset of C d by deep ReLU NNs. In particular, it was assumed that u admits a Taylor expansion about the origin of C d which converges absolutely and uniformly on [− 1, 1] d . It is well known that not every u which is real-analytic in [− 1, 1] d admits such an expansion. In the present paper, we prove sharper expression rate bounds for both, the ReLU activation σ 1 and RePU activations σ r , for functions which merely are assumed to be real-analytic in [− 1, 1] d , in L ∞ ([− 1, 1] d ) and in stronger norms, thereby generalizing both [21,36].

Contributions of the Present Paper
We prove exponential expression rate bounds of DNNs for d-variate, real-valued functions which depend analytically on their d inputs. Specifically, for holomorphic mappings u : [− 1, 1] d → R, we prove expression error bounds in L ∞ ([− 1, 1] d ) and in W k,∞ ([− 1, 1] d ), for k ∈ N (the precise range of k depending on properties of the NN activation σ ). We consider both, ReLU activation σ 1 : R → R + : x → x + and RePU activations σ r : R → R + : x → (x + ) r for some integer r ≥ 2.
Here, x + = max{x, 0}. The expression error bounds in our main result, Theorem 3.6, with ReLU activation σ 1 are in W 1,∞ ([− 1, 1] d ) and of the general type O(exp(−bN 1/(d+1) )) in terms of the NN size N , with a constant b > 0 depending on the domain of analyticity, but independent of N (however, with the constant implied in the Landau symbol O(·) depending exponentially on d, in general). With activation σ r for r ≥ 2, Theorem 3.10 has corresponding expression error bounds in W k,∞ ([− 1, 1] d ) for arbitrary fixed k ∈ N and of the type O(exp(−bN 1/d )) in terms of the NN size N . For all r ∈ N, the parameters of the σ r -neural networks approximating u (so-called "weights" and "biases") are continuous functions of u in appropriate norms. All of our proofs are constructive, i.e., they demonstrate how to build sparsely connected DNNs achieving the claimed convergence rates. We comment in Remarks 3.7 and 3.11 how these statements imply results for (the simpler architecture of) fully connected neural networks.
The main results, Theorems 3.6 and 3.10, are expression rate bounds for holomorphic functions. Similar bounds for Gevrey-regular functions are given in Sect. 4.3.4. In Sect. 4.3.5, we conclude the same bounds also for certain classes of nonholomorphic, merely Lipschitz-continuous functions, by leveraging the compositional nature of DNN approximation and Theorems 3.6 and 3.10.

Outline
The structure of the paper is as follows. In Sect. 2, we present the definition of the DNN architectures and fix notation and terminology. We also review in Sect. 2.2 a "ReLU DNN calculus," from recent work [10,26], which will facilitate the ensuing DNN expression rate analysis. A first set of key results are ReLU DNN expression rates in W^{1,∞}([−1,1]^d) for multivariate Legendre polynomials, which are proved in Sect. 2.3. These novel expression rate bounds are explicit in the W^{1,∞}-accuracy and in the polynomial degree. They are of independent interest and remarkable in that the ReLU DNNs which emulate the polynomials at exponential rates, as we prove, realize continuous, piecewise affine functions. They are based on [18,37]. The proofs, being constructive, shed a rather precise light on the architecture, in particular depth and width of the ReLU DNNs, that is sufficient for polynomial emulation. In Sect. 2.4, we briefly comment on corresponding results for RePU activations; as a rule, the same exponential rates are achieved for slightly smaller NNs and in norms which are stronger than W^{1,∞}. Section 3 then contains the main results of this note: exponential ReLU DNN expression rate bounds for d-variate, holomorphic maps. They are based on a) polynomial approximation of these maps and on b) ReLU DNN reapproximation of the approximating polynomials. These are presented in Sects. 3.1 and 3.2, respectively; Sect. 3.3 addresses the corresponding RePU results. Section 4 summarizes the main results, relates them to the literature, and discusses extensions and applications.

Notation
We adopt standard notation consistent with our previous works [40,41]: N = {1, 2, . . .} and N 0 := N ∪ {0}. We write R + := {x ∈ R : x ≥ 0}. The symbol C will stand for a generic, positive constant independent of any asymptotic quantities in an estimate, which may change its value even within the same equation.
In statements about polynomial expansions we require multi-indices ν = (ν_1, …, ν_d) ∈ ℕ_0^d. The total order of a multi-index ν is denoted by |ν|_1 := ∑_{j=1}^d ν_j. The notation supp ν stands for the support of the multi-index, i.e., supp ν = { j ∈ {1, …, d} : ν_j ≠ 0}. The size of the support of ν ∈ ℕ_0^d is |supp ν|; it will, subsequently, indicate the number of active coordinates in the multivariate monomial term y^ν := ∏_{j=1}^d y_j^{ν_j}. Here, the ordering "≤" on ℕ_0^d is defined componentwise, i.e., μ ≤ ν if and only if μ_j ≤ ν_j for all j = 1, …, d. We write |Λ| to denote the finite cardinality of a set Λ.
We write B^ℂ_ε := {z ∈ ℂ : |z| < ε}. Elements of ℂ^d will be denoted by boldface characters such as y = (y_j)_{j=1}^d. For multi-indices ν ∈ ℕ_0^d, the notations y^ν = ∏_{j=1}^d y_j^{ν_j} and ν! = ∏_{j=1}^d ν_j! will be employed (with the conventions 0! := 1 and 0^0 := 1). For n ∈ ℕ_0 we let P_n := span{y^j : 0 ≤ j ≤ n} be the space of polynomials of degree at most n, and for a finite index set Λ ⊂ ℕ_0^d we denote P_Λ := span{y^ν : ν ∈ Λ}.

DNN Architecture
We consider deep neural networks (DNNs for short) of feed forward type. Such a NN f can mathematically be described as a repeated composition of affine transformations with a nonlinear activation function.
More precisely: For an activation function σ : ℝ → ℝ, a fixed number of hidden layers L ∈ ℕ_0 and numbers N_ℓ ∈ ℕ of computation nodes in layer ℓ ∈ {1, …, L + 1}, we say that f : ℝ^{N_0} → ℝ^{N_{L+1}} is realized by a feedforward neural network, if for certain weights w^ℓ_{i,j} ∈ ℝ and biases b^ℓ_j ∈ ℝ it holds for all x = (x_i)_{i=1}^{N_0} ∈ ℝ^{N_0} that

z^1_j = σ( ∑_{i=1}^{N_0} w^1_{i,j} x_i + b^1_j ),  j = 1, …, N_1,   (2.1a)

z^ℓ_j = σ( ∑_{i=1}^{N_{ℓ−1}} w^ℓ_{i,j} z^{ℓ−1}_i + b^ℓ_j ),  ℓ = 2, …, L,  j = 1, …, N_ℓ,   (2.1b)

and finally

f_j(x) = ∑_{i=1}^{N_L} w^{L+1}_{i,j} z^L_i + b^{L+1}_j,  j = 1, …, N_{L+1}.   (2.1c)

In this case, N_0 is the dimension of the input, and N_{L+1} is the dimension of the output. Furthermore, z^ℓ_j denotes the output of unit j in layer ℓ. The weight w^ℓ_{i,j} has the interpretation of connecting the ith unit in layer ℓ − 1 with the jth unit in layer ℓ. If L = 0, then (2.1c) holds with z^0_i := x_i for i = 1, …, N_0. Except when explicitly stated, we will not distinguish between the network (which is defined through σ, the w^ℓ_{i,j} and b^ℓ_j) and the function f : ℝ^{N_0} → ℝ^{N_{L+1}} it realizes. We note in passing that this relation is typically not one to one, i.e., different NNs may realize the same function as their output. Let us also emphasize that we allow the weights w^ℓ_{i,j} and biases b^ℓ_j for ℓ ∈ {1, …, L + 1}, i ∈ {1, …, N_{ℓ−1}} and j ∈ {1, …, N_ℓ} to take any value in ℝ, i.e., we do not consider quantization as, e.g., in [1,26].
As is customary in the theory of NNs, the number of hidden layers L of a NN is referred to as its depth and the total number of nonzero weights and biases as the size of the NN. Hence, for a DNN f as in (2.1), we define size(f) := |{(i, j, ℓ) : w^ℓ_{i,j} ≠ 0}| + |{(j, ℓ) : b^ℓ_j ≠ 0}|. In addition, we define size_in(f) and size_out(f), which are the number of nonzero weights and biases in the input layer of f and in the output layer, respectively.
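To make the notation in (2.1) concrete, the following minimal NumPy sketch evaluates the function realized by given weight matrices and bias vectors and counts the nonzero parameters. It is an illustration only; the names realize and size are ours and not part of the paper.

```python
import numpy as np

def relu(x):                      # sigma_1(x) = max{x, 0}, applied componentwise
    return np.maximum(x, 0.0)

def realize(x, weights, biases, sigma=relu):
    """Evaluate the function realized by a network as in (2.1).

    weights[l] is the (N_{l-1} x N_l) matrix (w^l_{i,j}), biases[l] the vector
    (b^l_j), for l = 1, ..., L+1; sigma is applied in the L hidden layers only."""
    z = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers 1, ..., L
        z = sigma(z @ W + b)
    W, b = weights[-1], biases[-1]                # affine output layer, cf. (2.1c)
    return z @ W + b

def size(weights, biases):
    """Number of nonzero weights and biases, i.e., size(f)."""
    return sum(int(np.count_nonzero(W)) for W in weights) + \
           sum(int(np.count_nonzero(b)) for b in biases)
```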
The proofs of our main results Theorems 3.6 and 3.10 are constructive, in the sense that we will explicitly construct NNs with the desired properties. We construct these NNs by assembling smaller networks, using the operations of concatenation and parallelization, as well as so-called "identity-networks" which realize the identity mapping. Below, we recall the definitions. For these operations, we also provide bounds on the number of nonzero weights in the input layer and the output layer of the corresponding network, which can be derived from the definitions in [26].

DNN Calculus
Throughout, as activation function σ we consider either the ReLU activation function

σ_1 : ℝ → ℝ_+ : x ↦ max{0, x},

or, as suggested in [17,19,21], for r ∈ ℕ, r ≥ 2, the RePU activation function

σ_r : ℝ → ℝ_+ : x ↦ (max{0, x})^r.

If a NN uses σ_r as activation function, we refer to it as σ_r-NN. ReLU NNs are referred to as σ_1-NNs. We assume throughout that all activations in a DNN are of equal type.

Remark 2.1 (Historical note on rectified power units) "Rectified power unit" (RePU) activation functions are particular cases of so-called sigmoidal functions of order k ∈ ℕ, k ≥ 2, i.e., continuous functions σ : ℝ → ℝ satisfying lim_{x→−∞} σ(x)/x^k = 0, lim_{x→∞} σ(x)/x^k = 1 and |σ(x)| ≤ C(1 + |x|)^k for all x ∈ ℝ. The use of NNs with such activation functions for function approximation dates back to the early 1990s, cf. e.g., [19,21]. Proofs in [21, Sect. 3] proceed in three steps. First, a given function f was approximated by a polynomial, then this polynomial was expressed as a linear combination of powers of a RePU, and finally, it was shown that for r ≥ 2 and arbitrary A > 0 the RePU σ_r can be approximated on [−A, A] with arbitrarily small L^∞([−A, A])-error ε by a NN with a continuous, sigmoidal activation function of order k = r, which has depth 1 and fixed network size independent of A and ε [21, Lemma 3.6]. As remarked directly below [21, Lemma 3.6], this result remains true for the L^∞(ℝ)-norm (instead of L^∞([−A, A])) if, additionally, σ is uniformly continuous on ℝ. As also remarked below [21, Lemma 3.6], a similar statement holds for the approximation of the ReLU σ_1 by a NN with sigmoidal activation function of order k = 1.
For any r ∈ N, in the proof of [21,Lemma 3.6] it was observed that for continuous, sigmoidal σ of order k = r , the σ -NN that approximates σ r is uniformly continuous on [−A, A]. From this, it follows that σ r -NNs can be approximated up to arbitrarily small L ∞ ([−A, A])-error by σ -NNs with NN size independent of A and ε. Again, uniform continuity of σ on R implies the same result w.r.t. the L ∞ (R)-norm.
The exact realization of polynomials by σ r -networks for r ≥ 2 was observed in the proof of [21,Theorem 3.3], based on ideas in the proof of [5,Theorem 3.1]. The same result was recently rediscovered in [17,Theorem 3.1], whose authors were apparently not aware of [5,21].
We now indicate several fundamental operations on NNs which will be used in the following. These operations have been frequently used in recent works [10,25,26].

Parallelization
We now recall the parallelization of two networks f and g, which in parallel emulates f and g. We first describe the parallelization of networks with the same inputs as in [26,Definition 2.7], the parallelization of networks with different inputs is similar and introduced directly afterward.
Let f and g be two NNs with the same depth L ∈ ℕ_0 and the same input dimension n ∈ ℕ. Denote by m_f the output dimension of f and by m_g the output dimension of g. Then there exists a neural network (f, g), called parallelization of f and g, which in parallel emulates f and g, i.e.,

(f, g)(x) = (f(x), g(x)) ∈ ℝ^{m_f + m_g}  for all x ∈ ℝ^n.

It holds that depth((f, g)) = L and that size((f, g)) = size(f) + size(g), size_in((f, g)) = size_in(f) + size_in(g) and size_out((f, g)) = size_out(f) + size_out(g).
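The following sketch illustrates, under the row-vector convention of the earlier snippet, how the weights of the parallelization (f, g) of two networks with shared inputs can be assembled: the first weight matrices are stacked side by side, all later layers become block diagonal. Since the blocks are only rearranged and padded with zeros, the additivity of size is immediate. The helper name parallelize is ours.

```python
import numpy as np

def parallelize(weights_f, biases_f, weights_g, biases_g):
    """Weights and biases of the parallelization (f, g) with shared inputs."""
    assert len(weights_f) == len(weights_g)            # equal depth L
    Ws = [np.hstack([weights_f[0], weights_g[0]])]     # shared inputs: stack columns
    bs = [np.concatenate([biases_f[0], biases_g[0]])]
    for Wf, Wg, bf, bg in zip(weights_f[1:], weights_g[1:],
                              biases_f[1:], biases_g[1:]):
        W = np.block([[Wf, np.zeros((Wf.shape[0], Wg.shape[1]))],
                      [np.zeros((Wg.shape[0], Wf.shape[1])), Wg]])
        Ws.append(W)                                   # block-diagonal later layers
        bs.append(np.concatenate([bf, bg]))
    return Ws, bs
```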
We next recall the parallelization of networks with inputs of possibly different dimension as in [10,Setting 5.2]. To this end, we let f and g be two NNs with the same depth L ∈ N 0 whose input dimensions n f and n g may be different, and whose output dimensions we will denote by m f and m g , respectively.
Then there exists a neural network (f, g)_d, called full parallelization of networks with distinct inputs of f and g, which in parallel emulates f and g, i.e.,

(f, g)_d(x, y) = (f(x), g(y)) ∈ ℝ^{m_f + m_g}  for all x ∈ ℝ^{n_f}, y ∈ ℝ^{n_g}.

Again it holds that depth((f, g)_d) = L, size((f, g)_d) = size(f) + size(g), size_in((f, g)_d) = size_in(f) + size_in(g) and size_out((f, g)_d) = size_out(f) + size_out(g). Parallelizations of networks with possibly different inputs can be used consecutively to emulate multiple networks in parallel.

Identity Networks
We now recall identity networks [26,Lemma 2.3], which emulate the identity map.
For all n ∈ ℕ and L ∈ ℕ_0 there exists a σ_1-identity network Id_{ℝ^n} of depth L which emulates the identity map Id_{ℝ^n} : ℝ^n → ℝ^n : x ↦ x, based on the elementary identity x = σ_1(x) − σ_1(−x). It holds that

depth(Id_{ℝ^n}) = L,  size(Id_{ℝ^n}) ≤ 2n(L + 1),  size_in(Id_{ℝ^n}) ≤ 2n,  size_out(Id_{ℝ^n}) ≤ 2n.   (2.2)

Analogously, for r ≥ 2 there exist σ_r-identity networks. To construct them, we use the concatenation f ◦ g of two NNs f and g as introduced in [26].

Definition 2.2 Let f, g be such that the output dimension of g equals the input dimension of f, which we denote by k. Denote the weights and biases of f by {u^ℓ_{i,j}}_{i,j,ℓ} and {a^ℓ_j}_{j,ℓ}, and those of g by {v^ℓ_{i,j}}_{i,j,ℓ} and {c^ℓ_j}_{j,ℓ}. Then, we denote by f ◦ g the NN whose first depth(g) layers carry the weights and biases of g, whose last depth(f) layers carry the weights and biases of the last depth(f) layers of f, and whose layer depth(g) + 1 realizes the composition of the output transformation of g with the first transformation of f, i.e., it has weights and biases

w^{depth(g)+1}_{i,j} = ∑_{m=1}^{k} v^{depth(g)+1}_{i,m} u^1_{m,j},  b^{depth(g)+1}_j = ∑_{m=1}^{k} c^{depth(g)+1}_m u^1_{m,j} + a^1_j.

It is easy to check that the network f ◦ g emulates the composition x ↦ f(g(x)) and satisfies depth(f ◦ g) = depth(f) + depth(g). The concatenation ◦ of Definition 2.2 will only be used in the proofs of Propositions 2.3 and 2.4. Throughout the remainder of this work, we use sparse concatenations f • g, introduced in Sect. 2.2.3, whose network size can be estimated by C(size(f) + size(g)) for an absolute constant C. The reason for introducing • in addition to ◦ is that the size of f ◦ g cannot be bounded by C(size(f) + size(g)) for an absolute constant C. This can be seen by considering the number of nonzero weights in layer ℓ = depth(g) + 1, e.g., for k = 1 and arbitrary layer sizes N_{depth(g)} of g and N_1 of f.
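As an illustration of the σ_1-identity networks just discussed, the following functional sketch propagates a vector through L ≥ 1 hidden layers using x = σ_1(x) − σ_1(−x); names and conventions are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_relu(x, L):
    """Depth-L (L >= 1) ReLU identity network on R^n: propagate (sigma(x), sigma(-x))
    through L hidden layers and recombine as x = sigma(x) - sigma(-x) at the output."""
    z = np.concatenate([relu(x), relu(-x)])   # first hidden layer, 2n units
    for _ in range(L - 1):                    # remaining hidden layers
        z = relu(z)                           # entries are nonnegative, so ReLU acts
                                              # as identity (diagonal weight matrices)
    n = len(x)
    return z[:n] - z[n:]                      # affine output layer

x = np.array([0.3, -1.2, 2.0])
assert np.allclose(identity_relu(x, L=4), x)
```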

Proposition 2.3
For all r ≥ 2, n ∈ ℕ and L ∈ ℕ_0 there exists a σ_r-NN Id_{ℝ^n} of depth L which emulates the identity function Id_{ℝ^n} : ℝ^n → ℝ^n : x ↦ x. It holds that size(Id_{ℝ^n}) ≤ n max{1, L(4r² + 2r)}.

Proof First we consider n = 1 and proceed in two steps: We discuss L = 0, 1 in Step 1 and L > 1 in Step 2.
Step 1. For L = 0, let Id_{ℝ^n} be the network with weights w^1_{i,j} = δ_{i,j}, b^1_j = 0, i, j = 1, …, n. We next consider L = 1. It was shown in [17, Theorem 2.5] that the identity map on ℝ can be expressed exactly as a linear combination of 2r shifted rectified powers, i.e., there exist a_0 ∈ ℝ and weights and biases (w_i, b_i, c_i)_{i=1}^{2r} such that x = a_0 + ∑_{i=1}^{2r} c_i σ_r(w_i x + b_i) for all x ∈ ℝ. This shows the existence of a network Id_{ℝ^1} : ℝ → ℝ of depth 1 realizing the identity on ℝ. The network employs 2r weights and 2r biases in the first layer, and 2r weights and one bias (namely a_0) in the output layer. Its size is thus 6r + 1.
Step 2. For L > 1, we consider the L-fold concatenation Id_{ℝ^1} ◦ ⋯ ◦ Id_{ℝ^1} of the identity network Id_{ℝ^1} from Step 1. The resulting network has depth L, input dimension 1 and output dimension 1. The number of weights and the number of biases in the first layer both equal 2r, the number of weights in the output layer equals 2r, and the number of biases 1. In each of the L − 1 other hidden layers, the number of weights is 4r² and the number of biases 2r. In total, the network has size at most 4r + (L − 1)(4r² + 2r) + 2r + 1 ≤ L(4r² + 2r), where we used that r ≥ 2. Identity networks with input size n ∈ ℕ are obtained as the full parallelization with distinct inputs of n identity networks with input size 1.

Sparse Concatenation
The sparse concatenation of two σ 1 -NNs f and g was introduced in [26, Definition 2.5].
Let f and g be σ_1-NNs, such that the number of nodes in the output layer of g equals the number of nodes in the input layer of f, which we denote by k. Denote by n the number of nodes in the input layer of g, and by m the number of nodes in the output layer of f. Then, with "◦" as in Definition 2.2, the sparse concatenation of the NNs f and g is defined as the network

f • g := f ◦ Id_{ℝ^k} ◦ g,   (2.3)

where Id_{ℝ^k} is the σ_1-identity network of depth 1. The network f • g realizes the function

ℝ^n ∋ x ↦ (f • g)(x) = f(g(x)) ∈ ℝ^m,

i.e., by abuse of notation, the symbol "•" has two meanings here, depending on whether we interpret f • g as a function or as a network. This will not be the cause of confusion however. It holds depth(f • g) = depth(f) + 1 + depth(g), and

size(f • g) ≤ size(f) + size_in(f) + size_out(g) + size(g) ≤ 2 size(f) + 2 size(g).   (2.4)

For a proof, we refer to [26, Remark 2.6].
A similar result holds for σ r -NNs. In this case we define the sparse concatenation f • g as in (2.3), but with Id R k now denoting the σ r -identity network of depth 1 from Proposition 2.3.
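The following sketch assembles the weights of a σ_1 sparse concatenation under the conventions of the earlier snippets: the affine output map of g is split into positive and negative parts (the depth-1 identity network), and the first layer of f is pre-multiplied by the recombination matrix. This is an illustration in our own notation, not the construction of [26] verbatim.

```python
import numpy as np

def sparse_concat(weights_f, biases_f, weights_g, biases_g):
    """Weights of f (sparse concatenation) g = f o Id_{R^k} o g, cf. (2.3)."""
    k = weights_g[-1].shape[1]                 # output dim of g = input dim of f
    E = np.vstack([np.eye(k), -np.eye(k)])     # recombination: (y_+, y_-) -> y_+ - y_-
    W_into_id = np.hstack([weights_g[-1], -weights_g[-1]])   # g-output -> (y_+, y_-)
    b_into_id = np.concatenate([biases_g[-1], -biases_g[-1]])
    W_out_of_id = E @ weights_f[0]             # at most 2*nnz(first layer of f) nonzeros
    Ws = weights_g[:-1] + [W_into_id, W_out_of_id] + weights_f[1:]
    bs = biases_g[:-1] + [b_into_id, biases_f[0]] + biases_f[1:]
    return Ws, bs                              # depth increases by exactly one layer
```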

Proposition 2.4
For r ≥ 2 let f, g be two σ_r-NNs such that the output dimension of g, which we denote by k ∈ ℕ, equals the input dimension of f, and suppose that size_in(f), size_out(g) ≥ k. Denote by f • g the σ_r-network obtained by the σ_r-sparse concatenation. Then depth(f • g) = depth(f) + 1 + depth(g) and

size(f • g) ≤ size(f) + size(g) + 2r size_out(g) + 2r size_in(f) + (2r + 1)k,   (2.5)

together with analogous bounds on size_in(f • g) and size_out(f • g).

Proof It follows directly from Definition 2.2 and Proposition 2.3 that depth(f • g) = depth(f) + 1 + depth(g). To bound the size of the network, note that the weights in layers ℓ = 1, …, depth(g) equal those in the first depth(g) layers of g. Those in layers ℓ = depth(g) + 3, …, depth(g) + 2 + depth(f) equal those in the last depth(f) layers of f. Layer ℓ = depth(g) + 1 has at most 2r size_out(g) weights and 2rk biases, whereas layer ℓ = depth(g) + 2 has at most 2r size_in(f) weights and k biases. This shows Eq. (2.5) and the bound on size_in(f • g) and size_out(f • g).
Identity networks are often used in combination with parallelizations. In order to parallelize two networks f and g with depth( f ) < depth(g), the network f can be concatenated with an identity network, resulting in a network whose depth equals depth(g) and which emulates the same function as f .

Basic Results
In [18], it was shown that deep networks employing both ReLU and BiS ("binary step") units are capable of approximating the product of two numbers with a network whose size and depth increase merely logarithmically in the accuracy. In other words, certain neural networks achieve uniform exponential convergence of the operation of multiplication (of two numbers in a bounded interval) w.r.t. the network size. Independently, a similar result for ReLU networks was obtained in [37]. Here, we shall use the latter result in the following slightly more general form shown in [32]. Contrary to [37], it provides a bound of the error in the W^{1,∞}-norm rather than only in the L^∞-norm.
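For readers unfamiliar with the ReLU multiplication networks of [37], the following sketch reproduces the underlying idea in plain NumPy: squaring is approximated by sawtooth corrections, and products are obtained from the polarization identity 4ab = (a+b)² − (a−b)². It is a functional sketch of the idea, not the exact network of [37] or [32]; all names are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sawtooth(x):
    """Hat function g(x) = 2x on [0,1/2], 2(1-x) on [1/2,1]; three ReLU terms."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def square_approx(x, m):
    """Approximate x^2 on [0,1] with error <= 2^{-2m-2} via m sawtooth compositions."""
    s, g = x, x
    for k in range(1, m + 1):
        g = sawtooth(g)                 # k-fold composition g o ... o g
        s = s - g / 4.0**k
    return s

def mult_approx(a, b, M=1.0, m=20):
    """Approximate a*b for |a|, |b| <= M via 4ab = (a+b)^2 - (a-b)^2."""
    u, v = abs(a + b) / (2 * M), abs(a - b) / (2 * M)   # rescaled to [0,1]
    return M**2 * (square_approx(u, m) - square_approx(v, m))

print(mult_approx(0.7, -0.9), 0.7 * -0.9)   # agree up to roughly 4^{-m}
```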

Proposition 2.5 [32, Proposition 3.1] For any δ ∈ (0, 1) and M ≥ 1 there exists a σ_1-NN ×̃_{δ,M} : [−M, M]² → ℝ such that

sup_{|a|,|b| ≤ M} | ab − ×̃_{δ,M}(a, b) | ≤ δ  and  ess sup_{|a|,|b| ≤ M} max{ |∂_a( ab − ×̃_{δ,M}(a, b) )|, |∂_b( ab − ×̃_{δ,M}(a, b) )| } ≤ δ.

Moreover, there exists a constant C > 0, independent of δ and M, such that depth(×̃_{δ,M}) ≤ C(1 + log(M²/δ)), size(×̃_{δ,M}) ≤ C(1 + log(M²/δ)) and size_in(×̃_{δ,M}), size_out(×̃_{δ,M}) ≤ C.
It is immediate that Proposition 2.5 implies the existence of networks approximating the multiplication of n different numbers. We now show such a result, generalizing [32, Proposition 3.3] in that we consider the error again in the W^{1,∞}-norm.

Proposition 2.6
For any δ ∈ (0, 1), n ∈ ℕ and M ≥ 1, there exists a σ_1-NN ∏̃_{δ,M} : [−M, M]^n → ℝ such that

sup_{x ∈ [−M,M]^n} | ∏_{j=1}^n x_j − ∏̃_{δ,M}(x) | ≤ M^n δ,   (2.6a)

ess sup_{x ∈ [−M,M]^n} max_{k=1,…,n} | ∂_{x_k}( ∏_{j=1}^n x_j − ∏̃_{δ,M}(x) ) | ≤ M^{n−1} δ,   (2.6b)

and, for a constant C > 0 independent of n, δ and M,

depth(∏̃_{δ,M}) ≤ C(1 + log(n) log(n/δ))  and  size(∏̃_{δ,M}) ≤ C(1 + n log(n/δ)).   (2.7)

Proof We proceed analogously to the proof of [32, Proposition 3.3], and construct ∏̃_{δ,1} as a binary tree of ×̃_{·,·}-networks from Proposition 2.5 with appropriately chosen parameters for the accuracy and the maximum input size.
We define ñ := min{2^k : k ∈ ℕ, 2^k ≥ n} and consider the product of the ñ numbers x_1, …, x_ñ. In case n < ñ, we define x_{n+1} := ⋯ := x_ñ := 1, which can be implemented by a bias in the first layer. Because ñ < 2n, the bounds on network size and depth in terms of ñ also hold in terms of n, possibly with a larger constant.
It suffices to show the result for M = 1, since for M > 1, the network defined by [−M, M]^n ∋ x ↦ M^n ∏̃_{δ,1}(x/M) achieves the desired bounds, as is easily verified. Therefore, w.l.o.g. M = 1 throughout the rest of this proof. Equation (2.6a) follows by the argument given in the proof of [32, Proposition 3.3]; we recall it here for completeness. By abuse of notation, for every even k ∈ ℕ let a (k-dependent) mapping R : ℝ^k → ℝ^{k/2} be defined via

R(x_1, …, x_k) := ( ×̃_{δ/ñ²,2}(x_1, x_2), …, ×̃_{δ/ñ²,2}(x_{k−1}, x_k) ).

That is, for each product network ×̃_{δ/ñ²,2} as in Proposition 2.5 we choose maximum input size "M = 2" and accuracy "δ/ñ²." Hence, for k = 2, R can be interpreted as a mapping from ℝ² → ℝ. We now define ∏̃_{δ,1} : [−1, 1]^ñ → ℝ as the log₂(ñ)-fold sparse concatenation of these binary tree layers R, and next show the error bounds in (2.6) (recall that by definition x_{n+1} = ⋯ = x_ñ = 1 in case ñ > n).
The number of binary tree layers (each denoted by R) is bounded by O(log 2 (ñ)). With the bound on the network depth from Proposition 2.5, for M = 1 the second part of (2.7) follows.
To estimate the network size, we cannot use the estimate size(f • g) ≤ 2 size(f) + 2 size(g) from Eq. (2.4), because the number of concatenations log₂(ñ) − 1 depends on n, hence the factors 2 would give an extra n-dependent factor in the estimate on the network size. Instead, from Eq. (2.4) we use size(f • g) ≤ size(f) + size_in(f) + size_out(g) + size(g) and the bounds from Proposition 2.5. We find

size(∏̃_{δ,1}) ≤ C ∑_{ℓ=1}^{log₂(ñ)} 2^{log₂(ñ)−ℓ} (1 + log(ñ²/δ)) ≤ C ñ (1 + log(ñ/δ)) ≤ C (1 + n log(n/δ))

(2^{log₂(ñ)−ℓ} being the number of product networks in binary tree layer ℓ), which finishes the proof of (2.7) for M = 1.
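The binary-tree reduction used in this proof can be sketched as follows; with exact pairwise multiplication the tree is exact, while in the proof each node is replaced by a network from Proposition 2.5 with accuracy δ/ñ². The helper names are ours.

```python
import math

def product_tree(xs, pairwise_mult):
    """Binary-tree evaluation of x_1 * ... * x_n: pad with 1's to the next power of two
    and apply the pairwise (approximate) multiplication level by level; the number of
    levels is log2(n_tilde)."""
    n_tilde = 1 << max(1, math.ceil(math.log2(len(xs))))
    xs = list(xs) + [1.0] * (n_tilde - len(xs))      # x_{n+1} = ... = x_{n_tilde} = 1
    while len(xs) > 1:                               # one application of the map R
        xs = [pairwise_mult(xs[2 * i], xs[2 * i + 1]) for i in range(len(xs) // 2)]
    return xs[0]

print(product_tree([0.5, -1.2, 0.8, 2.0, 1.5], lambda a, b: a * b))
```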
The previous two propositions can be used to deduce bounds on the approximation of univariate polynomials on compact intervals w.r.t. the norm W^{1,∞}. One such result was already proven in [25, Proposition 4.2], which we present in Proposition 2.9 in a slightly modified form, allowing for the simultaneous approximation of multiple polynomials reusing the same approximate monomial basis. This yields a smaller network and thus gives a slight improvement over using the parallelization of networks obtained by applying [25, Proposition 4.2] to each polynomial separately. To prove the result, we first recall the following lemma.

Lemma 2.7 [25, Lemma 4.5] For all ℓ ∈ ℕ and δ ∈ (0, 1) there exists a σ_1-NN Ψ̃_{ℓ,δ} with input dimension one and output dimension 2^{ℓ−1} + 1 such that

max_{j = 2^{ℓ−1}, …, 2^ℓ} ‖ x^j − (Ψ̃_{ℓ,δ}(x))_{j − 2^{ℓ−1} + 1} ‖_{W^{1,∞}([−1,1])} ≤ δ,   (2.10)

and such that

depth(Ψ̃_{ℓ,δ}) ≤ C ℓ (ℓ + log(1/δ))  and  size(Ψ̃_{ℓ,δ}) ≤ C (2^ℓ + ℓ log(1/δ)),   (2.11)

where C is independent of ℓ and δ.

Corollary 2.8 Let n ∈ ℕ and δ ∈ (0, 1). There exists a NN Ψ^n_δ with input dimension one and output dimension n + 1 such that (Ψ^n_δ(x))_1 = 1 and (Ψ^n_δ(x))_2 = x for all x ∈ ℝ, and

max_{ℓ ∈ {3, …, n+1}} ‖ x^{ℓ−1} − (Ψ^n_δ(x))_ℓ ‖_{W^{1,∞}([−1,1])} ≤ Cδ,   (2.12)

as well as depth(Ψ^n_δ) ≤ C log(n)(log(n) + log(1/δ)) and

size(Ψ^n_δ) ≤ C n ( log(n) + log(1/δ) ),   (2.13)

where C is independent of n and δ.
Proof Define k := ⌈log₂(n)⌉ and for ℓ ∈ {1, …, k} let φ_ℓ : ℝ → ℝ^{2^{ℓ−1}+2} be the parallelization of an identity network of suitable depth (propagating the input x) with the network Ψ̃_{ℓ,δ} from Lemma 2.7 (approximating the monomials x^j for j = 2^{ℓ−1}, …, 2^ℓ). Adding one layer to eliminate the double entries and (in case 2^k > n) the approximations x^j with j > n, and adding the first entry which always equals 1 = x^0, we obtain a network Ψ^n_δ : ℝ → ℝ^{n+1} satisfying (2.12). The depth bound is an immediate consequence of (2.11) and k ≤ C log(n). To bound the size, first note that by (2.2) and (2.11) it holds size(φ_ℓ) ≤ C(2^ℓ + k³ + k log(1/δ)) for a constant C > 0 independent of n and δ. Thus,

size(Ψ^n_δ) ≤ C ∑_{ℓ=1}^{k} ( 2^ℓ + k³ + k log(1/δ) ) ≤ C n ( log(n) + log(1/δ) ),

where we used k ≤ C log(n) and n ≥ 1. This shows (2.13).

Proposition 2.9
There exists a constant C > 0 such that the following holds: For every δ > 0, n ∈ ℕ_0, N ∈ ℕ and N polynomials p_i = ∑_{j=0}^n c_{ij} y^j ∈ P_n, i = 1, …, N, there exists a σ_1-NN p̃_δ : [−1, 1] → ℝ^N such that

max_{i=1,…,N} ‖ p_i − (p̃_δ)_i ‖_{W^{1,∞}([−1,1])} ≤ C δ max_{i=1,…,N} ∑_{j=0}^n |c_{ij}|,

with depth(p̃_δ) ≤ C log(n+2)(log(n+2) + log(1/δ)) and size(p̃_δ) ≤ C( N(n+1) + n log(n+2) + n log(1/δ) ).

Proof We apply a linear transformation to the network in Corollary 2.8. Specifically, let Φ : ℝ^{n+1} → ℝ^N be the network of depth 0 expressing the linear function with ith component (z_1, …, z_{n+1}) ↦ ∑_{j=0}^{n} c_{ij} z_{j+1}, and finally set p̃_δ := Φ • Ψ^n_δ with Ψ^n_δ from Corollary 2.8. The claimed bounds follow from Corollary 2.8 and (2.4).

Remark 2.10
If y_0 ∈ ℝ and p_i(y) = ∑_{j=0}^n c_{ij}(y − y_0)^j, i = 1, …, N, then Proposition 2.9 can still be applied for the approximation of p_i(y) for y ∈ [y_0 − 1, y_0 + 1], since the substitution z = y − y_0 corresponds to a shift, which can be realized exactly in the first layer of a NN, cp. (2.1).
Analogous to [25, Equation (4.13)], the coefficients of the (L²([−1,1], dt/2)-normalized) Legendre polynomials with respect to the monomial basis grow at most exponentially in the polynomial degree. Inserting this into Proposition 2.9 with N = n and an accordingly reduced accuracy yields the following result.

Proposition 2.11 [cf. 25, Equation (4.13)] For every n ∈ ℕ and for every δ ∈ (0, 1) there exists a σ_1-NN L̃_{n,δ} with input dimension one and with output dimension n whose components approximate the univariate Legendre polynomials of degree 0, …, n − 1 in the sense that

max_{j=1,…,n} ‖ L_{j−1} − (L̃_{n,δ})_j ‖_{W^{1,∞}([−1,1])} ≤ δ,

and such that, for a positive constant C independent of n and δ, there holds depth(L̃_{n,δ}) ≤ C(1 + log n)(n + log₂(1/δ)) and

size(L̃_{n,δ}) ≤ C n ( n + log₂(1/δ) ).   (2.14)

Remark 2.12
Alternatively, the σ 1 -NN approximation of Legendre polynomials of degree n could be based on the three term recursion formula for Legendre polynomials or the Horner scheme for polynomials in general, by concatenating n product networks from Proposition 2.5 (and affine transformations). Because, depending on the scaling of the Legendre polynomials, either the accuracy δ of the product networks or the maximum input size M needs to grow exponentially with n, both the network depth and the network size of the resulting NN approximation of univariate Legendre polynomials would be bounded by Cn(n + log(1/δ)). That network size is of the same order as in Proposition 2.11, but the network depth has a worse dependence on the polynomial degree n. For more details, see [23,Proposition 2.5], where this construction is used to approximate truncated Chebyšev expansions based on the three term recursion for Chebyšev polynomials, which is very similar to that for Legendre polynomials.
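For reference, the three-term recursion mentioned in Remark 2.12 reads (k+1)P_{k+1}(x) = (2k+1)xP_k(x) − kP_{k−1}(x). The following sketch evaluates the Legendre polynomials this way; the normalization L_k = √(2k+1)·P_k (unit norm in L²([−1,1], dt/2)) is our reading of the paper's convention.

```python
import numpy as np

def legendre_values(x, n):
    """Values L_0(x), ..., L_n(x) of the L^2([-1,1], dt/2)-normalized Legendre
    polynomials, computed via the three-term recursion for P_k."""
    P = np.zeros(n + 1)
    P[0] = 1.0
    if n >= 1:
        P[1] = x
    for k in range(1, n):
        P[k + 1] = ((2 * k + 1) * x * P[k] - k * P[k - 1]) / (k + 1)
    return np.sqrt(2 * np.arange(n + 1) + 1) * P

print(legendre_values(0.3, 4))
```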

ReLU DNN Approximation of Tensor Product Legendre Polynomials
Let d ∈ ℕ. Denote the uniform probability measure on [−1, 1]^d by μ_d, i.e., μ_d := ⊗_{j=1}^d (dy_j/2). For ν ∈ ℕ_0^d, let L_ν(y) := ∏_{j=1}^d L_{ν_j}(y_j) denote the tensorized Legendre polynomials, where the univariate Legendre polynomials L_k are normalized in L²([−1, 1], dt/2). We shall require the following bound on the norm of the tensorized Legendre polynomials, which itself is a consequence of the Markoff inequality and our normalization of the Legendre polynomials: for any k ∈ ℕ_0,

‖ L_k ‖_{L^∞([−1,1])} = (1 + 2k)^{1/2},   (2.15)

‖ L_k' ‖_{L^∞([−1,1])} ≤ (1 + 2k)^{1/2} k².   (2.16)

To provide bounds on the size of the networks approximating the tensor product Legendre polynomials, for finite subsets Λ ⊂ ℕ_0^d we will make use of the quantity

m(Λ) := max_{ν ∈ Λ} |ν|_1.   (2.17)

Proposition 2.13 For every d ∈ ℕ, every finite Λ ⊂ ℕ_0^d and every δ ∈ (0, 1) there exists a σ_1-NN f_{Λ,δ} with input dimension d and output dimension |Λ| whose output components (L̃_{ν,δ})_{ν∈Λ} satisfy max_{ν∈Λ} ‖ L_ν − L̃_{ν,δ} ‖_{W^{1,∞}([−1,1]^d)} ≤ δ, and for a constant C > 0 that is independent of d, Λ and δ it holds that depth(f_{Λ,δ}) and size(f_{Λ,δ}) admit bounds, explicit in d, |Λ|, m(Λ) and log₂(1/δ), which are derived in Step 3 of the proof below.

Proof Let δ ∈ (0, 1) and a finite subset Λ ⊂ ℕ_0^d be given. The proof is divided into three steps. In the first step, we define ReLU NN approximations of tensor product Legendre polynomials {L̃_{ν,δ}}_{ν∈Λ} and fix the parameters used in the NN approximation. In the second step, we estimate the error of the approximation, and the L^∞([−1,1]^d)-norm of the L̃_{ν,δ}, ν ∈ Λ. In the third step, we describe the network f_{Λ,δ} and estimate its depth and size.
Step 1. For all ν ∈ N d 0 , we define n ν := | supp ν| and M ν := 2|ν| 1 + 2. We can now definẽ L ν,δ y j j∈supp ν :=˜ so that, as required by Proposition 2.6, the absolute values of the arguments of Using Proposition 2.13, (2.15), (2.16) and M ν = 2|ν| 1 + 2 ≤ 2m(Λ) + 2, the last term can be bounded by To determine the error of the gradient, without loss of generality we only consider the derivative with respect to y 1 . In the case 1 / ∈ supp ν, we trivially have ∂ ∂ y 1 (L ν ( y) − L ν,δ ( y)) = 0 for all y ∈ [− 1, 1] d . Thus, let ν 1 = 0 in the following. Then, with Step 3. We now describe the network f Λ,δ , which in parallel emulates {L ν,δ } ν∈Λ . The network is constructed as the concatenation of two subnetworks, i.e., The subnetwork f (2) Λ,δ evaluates, in parallel, approximate univariate Legendre polynomials in the input variables (y j ) j≤d . It is defined as where the pair of round brackets denotes a parallelization. The subnetwork f (1) Λ,δ takes the output of f (2) Λ,δ as input and computes Λ,δ f (2) Λ,δ (y j ) j≤d where in the last two lines the outer pair of round brackets denotes a parallelization. The depth of the identity networks is such that all components of the parallelization have equal depth.
We have the following expression for the network depth: Definition here and in the remainder of this proof by C > 0 constants independent of d, Λ and δ ∈ (0, 1), where we used that 2m(Λ) , we can choose the identity networks in the definition of f (1) Λ,δ such that depth f (1) where we used that n ν ≤ d. Finally, we find the following bound on the network depth: For the network size, we find that With Proposition 2.11, we estimate the size of f (2) Λ,δ as size f The depth of each of the identity networks in the definition of f (1) Λ,δ is bounded by depth( f (1) Λ,δ ) ≤ C(1 + d log d) 1 + log 2 m(Λ) + log 2 (1/δ) .
It follows that size f (1) Λ Hence, we arrive at

RePU DNN Emulation of Polynomials
The approximation of polynomials by neural networks can be significantly simplified if instead of the ReLU activation σ_1 we consider as activation function the so-called rectified power unit ("RePU" for short): recall that for r ∈ ℕ, r ≥ 2, the RePU activation is defined by σ_r(x) = max{0, x}^r, x ∈ ℝ. In contrast to σ_1-NNs, as shown in [17], for every r ∈ ℕ, r ≥ 2 there exist RePU networks of depth 1 realizing the multiplication of two real numbers without error. This yields the following result, proven in [17]. To render the presentation self-contained, an alternative proof is provided in Appendix A, based on ideas in [25]. Unlike in [17], it is shown that the constant C is independent of d. This is relevant in particular when considering RePU emulations of truncated polynomial chaos expansions of countably parametric maps u : [−1, 1]^ℕ → ℝ, shortly discussed in Sect. 4.3.3. Polynomial approximations of such maps depend on a finite number d(ε) ∈ ℕ of parameters only, but with d(ε) → ∞ as ε ↓ 0.
Proposition 2.14 Fix d ∈ N and r ∈ N, r ≥ 2. Then there exists a constant C > 0 independent of d but depending on r such that for any finite downward closed Λ ⊆ N d 0 and any p ∈ P Λ there is a σ r -networkp : R d → R which realizes p exactly and such that size(p) ≤ C|Λ| and depth(p) ≤ C log 2 (|Λ|).
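For r = 2, the exact multiplication underlying Proposition 2.14 can be written down explicitly: 4xy = (x+y)² − (x−y)² together with t² = σ_2(t) + σ_2(−t) gives an exact product with a single hidden σ_2-layer of four units. The sketch below illustrates this; it is our illustration, not the construction of [17] verbatim.

```python
def repu2(t):
    return max(t, 0.0) ** 2          # sigma_2

def mult_exact(x, y):
    """Exact product via one hidden sigma_2-layer:
    4xy = (x+y)^2 - (x-y)^2 and t^2 = sigma_2(t) + sigma_2(-t)."""
    return (repu2(x + y) + repu2(-x - y) - repu2(x - y) - repu2(-x + y)) / 4.0

assert abs(mult_exact(1.7, -2.3) - 1.7 * (-2.3)) < 1e-12
```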

Remark 2.15
Let ψ : ℝ → ℝ be an arbitrary C² function that is not linear, i.e., ψ'' does not vanish identically on ℝ. In [29, Theorem 3.4] it is shown that ψ-networks can approximate the multiplication of two numbers a, b in a fixed bounded interval up to arbitrary accuracy with a fixed number of units. We also refer to [32, Sect. 3.3], where we explain this observation from [29] in more detail. From this, one can obtain a version of Proposition 2.14 for arbitrary C² activation functions. To state it, we fix d ∈ ℕ. Then there exists C > 0 (independent of d) such that for every δ > 0, for every downward closed Λ ⊆ ℕ_0^d and every p ∈ P_Λ, there exists a ψ-neural network q : ℝ^d → ℝ with ‖p − q‖_{L^∞([−1,1]^d)} ≤ δ, size(q) ≤ C|Λ| and depth(q) ≤ C log₂(|Λ|). As discussed in Remark 2.1, the same also holds, e.g., for NNs with continuous, sigmoidal activation σ of order k ≥ 2.
Recently, there has been some interest in the approximation of ReLU NNs by rational functions and NNs with rational activation functions and vice versa, e.g., in [3,34]. In the latter, σ = p/q is used as activation for polynomials p, q of prescribed degree, but within each computational node trainable coefficients of p and q. For all prescribed deg( p) ≥ 2 and deg(q) ∈ N 0 , each node in such a network can emulate the multiplication of two numbers exactly ([3, Proposition 10] and its proof), hence Proposition 2.14 also holds for such NNs (the proof in Appendix A applies, using that also the identity map can be emulated by networks with such activations).
As a result, Theorem 3.10 also holds for all activation functions discussed in this remark.

Exponential Expression Rate Bounds
We now proceed to the statement and proof of the main result of the present note, namely the exponential rate bounds for the DNN expression of d-variate holomorphic maps. First, in Sect. 3.1 we recall (classical) polynomial approximation results for analytic functions, similar to those in [35]. Subsequently, these are used to deduce DNN approximation results for ReLU and RePU networks.

Lemma 3.2 Let
The lemma is proved by computing (as an upper bound of the left-hand side in (3.2)) the volume of a suitable superset of the union of unit cubes ν + [0,1]^d over the admissible multi-indices ν, which equals the right-hand side in (3.2). The significance of this result is that it provides an upper bound for the cardinality of multi-index sets of the type

Λ_ε := { ν ∈ ℕ_0^d : ∏_{j=1}^d ρ_j^{−ν_j} ≥ ε }.   (3.3)

To see this, note that due to log(ρ^{−ν}) = −∑_{j=1}^d ν_j log(ρ_j), for any ε ∈ (0, 1) we have

ρ^{−ν} ≥ ε  ⟺  ∑_{j=1}^d ν_j log(ρ_j) ≤ log(1/ε)  ⟺  ∑_{j=1}^d ν_j log(ρ_j)/log(1/ε) ≤ 1.

Applying Lemma 3.2 with a_j = log(1/ε)/log(ρ_j), we thus obtain a bound on |Λ_ε| (see also [2, Lemma 4.4]).
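The sets Λ_ε can be enumerated directly from the equivalent condition ∑_j ν_j log(ρ_j) ≤ log(1/ε); the sketch below does this by brute force over a bounding box. Function and variable names are ours, and the approach is feasible only for moderate d.

```python
import itertools, math

def index_set(rho, eps):
    """Downward closed set Lambda_eps = {nu : prod_j rho_j^{-nu_j} >= eps},
    i.e., sum_j nu_j * log(rho_j) <= log(1/eps)."""
    T = math.log(1.0 / eps)
    bounds = [int(T / math.log(r)) for r in rho]          # nu_j <= log(1/eps)/log(rho_j)
    ranges = [range(b + 1) for b in bounds]
    return [nu for nu in itertools.product(*ranges)
            if sum(n * math.log(r) for n, r in zip(nu, rho)) <= T]

Lam = index_set(rho=[2.0, 3.0, 1.5], eps=1e-3)
print(len(Lam))    # for small eps, roughly of the order (log(1/eps))^d, cf. Lemma 3.2
```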

Remark 3.4 Note that
This implies the existence of a constant C (depending on ρ but independent of d) such that for all ε ∈ (0, 1) with ρ min := min j=1,...,d ρ j and ρ max := max j=1,...,d ρ j [cp. (2.17)] We are now in position to prove the following theorem, variations of which can be considered as classical.
Then, for all k ∈ N 0 and for any β > 0 such that there exists C > 0 (depending on d, ρ, k, β and u) such that with Proof Due to the holomorphy of u on E ρ , for a constant C > 0 depending on d and ρ, l ν ∈ R satisfies the bound Using [41, Lemma 3.10] (which is a variation of [7, Lemma 7.1]) and thus, (3.10) also converges in W k,∞ ([− 1, 1] d ).
For later use, we note that the right-hand side of (3.7) can be estimated by Stirling's inequality, with ρ min = min d j=1 ρ j and ρ max = max d j=1 ρ j : (3.14)

ReLU DNN Approximation
We now come to the main result, concerning the approximation of holomorphic functions on bounded intervals by ReLU networks.
Theorem 3.6 Let d ∈ ℕ and let u : [−1, 1]^d → ℝ admit a holomorphic extension to the Bernstein polyellipse E_ρ = E_{ρ_1} × ⋯ × E_{ρ_d} ⊂ ℂ^d as in (3.1), for semiaxis sums ρ_j > 1, j = 1, …, d. Then, there exist constants β′ = β′(ρ, d) > 0 and C = C(u, ρ, d) > 0, and for every N ∈ ℕ there exists a σ_1-NN ũ_N : [−1, 1]^d → ℝ with

size(ũ_N) ≤ N  and  depth(ũ_N) ≤ C N^{1/(d+1)} log₂(N),   (3.15)

and the error bound

‖ u − ũ_N ‖_{W^{1,∞}([−1,1]^d)} ≤ C exp( −β′ N^{1/(d+1)} ).   (3.16)

Proof Throughout this proof, let β > 0 be fixed such that (3.7) holds. We proceed in three steps: In Step 1, we introduce a NN approximation of u, whose error, network depth and size we estimate in Step 2. Based on these estimates, we show Equations (3.15)–(3.16) in Step 3.
Let Affine u be a NN of depth 0, with input dimension |Λ ε |, output dimension 1 and size at most |Λ ε | which implements the affine transformation R |Λ ε | → R : (z ν ) ν∈Λ → ν∈Λ ε l ν z ν . Furthermore, let f Λ ε ,δ be the network from Proposition 2.13, emulating approximations to all multivariate Legendre polynomials (L ν ) ν∈Λ ε . We define a NN where (with β > 0 as in (3.7)) the accuracy δ > 0 of the σ 1 -NN approximations of the tensor product Legendre polynomials is chosen as Step 2. For the NNû ε we obtain the error estimate With Theorem 3.5 this yields the existence of a constant C > 0 (depending on d, ρ, β and u) such that We now bound the depth and the size ofû ε . Using Proposition 2.13 and (3.6), we obtain for C > 0 depending on ρ. To bound the NN size, Proposition 2.13 and (3.6) give for a constant C 2 > 0 which depends on ρ, but is independent of d, β, u and of ε ∈ (0, 1).

Remark 3.7 (Fully connected networks)
In the proof of Theorem 3.6 we explicitly constructed a sparsely connected DNN to approximate u. In practice, it might be tedious to implement this type of architecture. Instead one can set up a fully connected network containing our sparse architecture. We shortly discuss the implications of Theorem 3.6 in this case. The width w ∈ ℕ of a neural network φ (i.e., the maximum number of nodes in one of its layers) is trivially bounded by size(φ). For a fully connected network of width w, the weight matrix connecting two layers may have w² nonzero weights. Denote now by û_N a fully connected σ_1-NN of width w = N and depth depth(û_N) ≤ CN^{1/(d+1)} log₂(N) (with C as in (3.15)) realizing the function ũ_N from Theorem 3.6. The existence of û_N is an immediate consequence of the depth and size bounds given in Theorem 3.6. Then by (3.15), denoting its total number of weights, also counting vanishing weights, by #weights(û_N), we have #weights(û_N) ≤ C(N² + N) N^{1/(d+1)} log₂(N) ≤ C N^{(2d+3)/(d+1)} log₂(N), and by (3.16) this yields the error bound

‖ u − û_N ‖_{W^{1,∞}([−1,1]^d)} ≤ C exp( −β̃ ( #weights(û_N)/log₂(#weights(û_N)) )^{1/(2d+3)} )

for fully connected networks, where β̃ > 0 is some constant independent of N. Hence, the exponent in the error estimate has (up to logarithmic terms) decreased from 1/(d+1) for the sparsely connected network in Theorem 3.6 to 1/(2d+3) for the fully connected network.

Remark 3.8 Note that in
Step 2 of the proof, the network û_ε depends on u only via the Legendre coefficients {l_ν}_{ν∈Λ_ε}, appearing only as weights in the output layer. In particular, the weights and biases of û_ε continuously depend on u with respect to the L²([−1,1]^d, μ_d)-norm, because the Legendre coefficients do so. Finally, the logarithmic factor in the depth bound (3.15) could be avoided at the expense of replacing N^{1/(d+1)} by N^{1/(d+2)} in the right-hand side of (3.16).

RePU DNN Approximation
For RePU approximations, with activation σ r (x) for integer r ≥ 2, we may combine Proposition 2.14 (which is almost identical to [17,Theorem 4.1]) and Theorem 3.5 to infer the following result. Note that the decay of the provided upper bound of the error in (3.23) in terms of the network size N is slightly faster than the one we obtained for ReLU approximations in (3.16).
Theorem 3.10 Let d ∈ ℕ and r ∈ ℕ, r ≥ 2, and let u : [−1, 1]^d → ℝ admit a holomorphic extension to the Bernstein polyellipse E_ρ as in Theorem 3.6. Then, there exists C > 0 and a constant C_1 > 0 which only depends on r such that, with β as in (3.7), for every N ∈ ℕ there exists a σ_r-NN ũ_N : [−1, 1]^d → ℝ with

size(ũ_N) ≤ C N  and  depth(ũ_N) ≤ C_1 (1 + log₂ N),   (3.22)

and, with β̃ := β/(d+1),

‖ u − ũ_N ‖_{W^{k,∞}([−1,1]^d)} ≤ C exp( −β̃ N^{1/d} ).   (3.23)

Here, we can consider the W^{k,∞}([−1,1]^d)-norm for arbitrary fixed k ∈ ℕ_0, with C depending also on k. Also, we note with (3.14) that β = log(ρ_min)/(2e) is attainable for all d ∈ ℕ.

Remark 3.11 (Fully connected networks)
A similar statement as in Remark 3.7 also holds for σ_r-NNs with r ≥ 2. By the same arguments, we obtain an error bound of the type

‖ u − û_N ‖_{W^{k,∞}([−1,1]^d)} ≤ C exp( −β̃ ( #weights(û_N)/log₂(#weights(û_N)) )^{1/(2d)} )

for a fully connected σ_r-NN û_N, whose total number of weights, also counting vanishing weights, we denote by #weights(û_N). Here k ∈ ℕ is arbitrary but fixed, and β̃ > 0 is a constant independent of N. For comparison, the construction in [21] leads to networks whose size scales like N log(N); this is slightly worse than Theorem 3.10. Also note that in [21] the number of neurons is used as measure for the NN size, which may be smaller but not larger than the number of nonzero weights if all neurons have at least one nonzero weight.

Conclusion
We review in Sect. 4.1 the main results obtained in the previous sections. In Sect. 4.2, we relate these results to results which appeared in the literature. In Sect. 4.3, we discuss several novel implications of the main results, which could be of interest in various applications. We point out that although the present analysis is developed in detail for DNNs with ReLU activation, as explained in Remarks 2.1 and 2.15, all DNN expression error bounds proved up to this point, and also in the ensuing remarks remain valid (possibly even with slightly better estimates for the DNN sizes) for smoother activation functions, such as sigmoidal, tanh, or softmax activations.

Main Results
We have established, for analytic maps u : [−1, 1]^d → ℝ, exponential expression rate bounds in W^{k,∞}([−1,1]^d) in terms of the DNN size, for the ReLU activation (for k = 0, 1) and for the RePU activations σ_r, r ≥ 2 (for k ∈ ℕ_0). The present analysis improves earlier results in that the NN sizes are slightly reduced and we obtain exponential convergence of ReLU and RePU DNNs for general d-variate analytic functions, without assuming the Taylor expansion of u around 0 ∈ ℝ^d to converge on [−1, 1]^d. We also point out that by a simple scaling argument our main results in Theorems 3.6 and 3.10 imply corresponding expression rate results for analytic functions defined on an arbitrary Cartesian product of finite intervals ∏_{j=1}^d [a_j, b_j] ⊂ ℝ^d.

Related Results
We already commented on [36], where ReLU NN expression rates for multivariate, holomorphic functions u were obtained. The assumptions in [36] are stronger than those imposed here: in particular, the Taylor expansion of u about the origin is required to converge absolutely and uniformly on [−1, 1]^d; see Sect. 1.1. The ReLU emulation of univariate polynomials used in the present paper builds on [25]. In [33], alternative constructions of so-called RePU NNs are proposed which are based on NN emulation of univariate Chebyšev polynomials. It is argued in [33] (and verified in numerical experiments) that the numerical size of NN weights scales more favorably than the weights in the presently proposed emulations. "Chebyšev" versions of the present proofs could also be developed, resulting, however, in the same scalings of NN sizes and depths as are obtained here.

Solution Manifolds of PDEs
One possible application of our results concerns the approximation of (quantities of interest of) solution manifolds of parametric PDEs depending on a d-dimensional parameter y ∈ [− 1, 1] d . Such a situation arises in particular in Uncertainty Quantification (UQ). There, a mathematical model is described by a PDE depending on the parameters y, which in turn can for instance determine boundary conditions, forcing terms or diffusion coefficients. It is known for a wide range of linear and nonlinear PDE models (see, e.g., [6]), that parametric PDE solutions depend analytically on the parameters. In addition, for these models usually one has precise knowledge on the domain of holomorphic extension of the objective function u, i.e., knowledge of the constants (ρ j ) d j=1 in Theorem 3.5. These constants determine the sets of multi-indices Λ ε in (3.3). As our proofs are constructive and based on the sets Λ ε , such information can be leveraged to a priori guide the identification of suitable network architectures.

ReLU DNN Expression of Data-to-QoI Maps for Bayesian PDE Inversion
The exponential σ_1-DNN expression rate bound, Theorem 3.6, implies exponential expressivity of data-to-quantity-of-interest maps in Bayesian PDE inversion, as is shown in [14]. Here, the assumption of centered, additive Gaussian observation noise in the data model underlying the Bayesian inverse theory implies holomorphy of the data-to-prediction map in the Bayesian theory, as we show in [14]. This, combined with the present results in Theorems 3.6 and 3.10, implies exponential expressivity of σ_r DNNs for this map, for all r ≥ 1.

Infinite-Dimensional (d = ∞) Case
The expression rate analysis becomes more involved if the objective function u depends on an infinite-dimensional parameter (i.e., a parameter sequence) y ∈ [−1, 1]^ℕ. Such functions occur in UQ, for instance, if the uncertainty is described by a Karhunen–Loève expansion. Under certain circumstances, u can be expressed by a so-called generalized polynomial chaos (gpc) expansion. Reapproximating truncated gpc expansions by NNs leads to expression rate results for the approximation of infinite-dimensional functions, as we showed in [32]. One drawback of [32] is, however, that the proofs crucially relied on the assumption that u is holomorphic on certain polydiscs containing [−1, 1]^ℕ. This criterion is not always met in practice [6]. To overcome this restriction, we will generalize the expression rate results of [32] in a forthcoming paper, by basing the analysis on the present results for the approximation of d-variate functions which are merely assumed to be analytic in some (possibly small) neighborhood of [−1, 1]^d.

Gevrey Functions
DNN approximations of tensor product Legendre polynomials constructed in Sect. 2 can be used more generally than for the approximation of holomorphic functions by truncated Legendre expansions. We consider as an example, for d ∈ ℕ, the approximation of non-holomorphic, Gevrey-regular functions (see, e.g., [28] and the references there for definitions and properties of such functions). Gevrey-regular functions appear as natural solution classes for certain PDEs (e.g., [13] and [4, Chapter 8]). Here, for some δ ≥ 1 we consider maps u : [−1, 1]^d → ℝ that satisfy, for constants C, A > 0 depending on u, the bound

∀ν ∈ ℕ_0^d :  ‖ ∂^ν u ‖_{L^∞([−1,1]^d)} ≤ C A^{|ν|_1} (ν!)^δ.

These maps are analytic when δ = 1, but possibly non-analytic when δ > 1.
In the proof, which is provided in Appendix B, we furthermore show that there exist constants C′, β′ > 0 such that for every p ∈ ℕ and every u ∈ G^δ([−1, 1]^d, C, A) the NN approximation error decays exponentially in the polynomial degree p, and hence subexponentially in N, with an exponent depending on δ and d. Here, N = dim(⊗_{j=1}^d P_p([−1, 1])) = (p + 1)^d denotes the dimension of the space of all d-variate polynomials of degree at most p in each variable.

ReLU Expression of Non-smooth Maps by Composition
The results were based on the quantified holomorphy of the map u : [−1, 1]^d → ℝ. While this could be perceived as a strong requirement (and, consequently, limitation) of the present results, by composition the present deep ReLU NN emulation rate bounds cover considerably more general situations. The key observation is that deep ReLU NNs are closed under concatenation (or under composition of realizations), as we explained in Sect. 2.2.3.
Let us give a specific example from high-dimensional integration, where the task is to evaluate the integral

I[u] := ∫_{[−1,1]^d} u(y) π(y) dμ_d(y).   (4.3)

Here, u : [−1, 1]^d → ℝ is a function which is holomorphic in a polyellipse E_ρ as in (3.1), and π denotes an a priori given probability density on the coordinates y_1, …, y_d w.r.t. the measure μ_d (i.e., π : [−1, 1]^d → [0, ∞) with ∫_{[−1,1]^d} π dμ_d = 1). Assuming that the coordinates are independent, the density π factors, i.e., π = ∏_{j=1}^d π_j with certain marginal probability densities π_j which we assume to be absolutely continuous w.r.t. the Lebesgue measure, i.e., ∫_{−1}^1 π_j(ξ) dξ = 2. In the case that the marginals π_j > 0 are simple functions, for example on finite partitions T_j of [−1, 1] (as, e.g., if π_j is a histogram for the law of y_j estimated from empirical data), the change of coordinates in (4.3) is given coordinatewise by

T_j(y_j) := −1 + ∫_{−1}^{y_j} π_j(ξ) dξ,

and transforms (4.3) into an integral of g = u ∘ T^{−1} w.r.t. μ_d, where g = u ∘ T^{−1} is not continuously differentiable. Here we have used that dT^{−1}(T(y)) = (dT(y))^{−1} and det(dT(y)) = π(y), i.e., det dT^{−1}(x) = π(T^{−1}(x))^{−1}. Now, the function g̃_N := ũ_N ∘ T^{−1} with the σ_1-NN ũ_N constructed in Theorem 3.6 is a σ_1-NN which still affords the error bound (3.16): denote, for n ∈ ℕ_0 and a Lipschitz continuous function v : [−1, 1]^d → ℝ^n, by ‖v‖_{W^{1,∞}([−1,1]^d;ℝ^n)} the maximum of the essential suprema of ‖v(y)‖ and of the norm of the Jacobian dv(y), where ‖·‖ is the Euclidean norm on ℝ^n resp. on ℝ^d. As usual, for n = 1 we write ‖v‖_{W^{1,∞}([−1,1]^d)}. With this notation,

‖ g − g̃_N ‖_{W^{1,∞}([−1,1]^d)} ≤ C exp( −β′ N^{1/(d+1)} )   (4.5)

for a constant C which now additionally depends on the marginal densities π_j. The approximation of the integral (4.3) can thus be reduced to the problem of approximating the integral of the surrogate g̃_N, which can be efficiently represented by a σ_1-NN. In the case that u is merely assumed Gevrey-regular as in Sect. 4.3.4, a similar calculation leads to a bound of the type (4.5), but with a correspondingly reduced exponent of N. More generally, if π : [−1, 1]^d → (0, ∞) is, for example, a continuous density function (not necessarily a product of its marginals), there exists a bijective transport T : [−1, 1]^d → [−1, 1]^d such that ∫_{[−1,1]^d} u(y) π(y) dμ_d(y) = ∫_{[−1,1]^d} u(T^{−1}(x)) dμ_d(x) (contrary to the situation above, this transformation T is not diagonal in general). One explicit representation of such a transport is provided by the Knothe-Rosenblatt transport, see, e.g., [30, Sect. 2.3]. It has the property that T inherits the smoothness of π, cp. [30, Remark 2.19]. In case T^{−1} can be realized without error by a σ_1 (or σ_r) network, we find again an estimate of the type (3.16). If T^{−1} does not allow an explicit representation by a NN, however, we may still approximate T^{−1} by a NN S̃_N to obtain a NN g̃_N := ũ_N ∘ S̃_N approximating g = u ∘ T^{−1}. This will introduce an additional error in (4.5) due to the approximation of T^{−1}.
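As an illustration of the composition argument for histogram marginals, the following sketch builds the piecewise linear inverse transforms T_j^{-1} from bin edges and bin probabilities and composes them with a given surrogate network. All names (inverse_marginal_transform, surrogate, u_N) are ours, and the construction assumes strictly positive bin probabilities.

```python
import numpy as np

def inverse_marginal_transform(bin_edges, bin_probs):
    """Return T_j^{-1} : [-1,1] -> [-1,1] for a histogram marginal.

    T_j(y) = -1 + 2*CDF_j(y) is piecewise linear, hence its inverse is piecewise
    linear as well and exactly realizable by a shallow ReLU network."""
    cdf = np.concatenate([[0.0], np.cumsum(bin_probs)])   # CDF values at the bin edges
    def T_inv(x):
        return np.interp((x + 1.0) / 2.0, cdf, bin_edges)
    return T_inv

def surrogate(u_N, T_invs):
    """g_N := u_N o T^{-1}, applied coordinatewise (independent marginals)."""
    return lambda x: u_N(np.array([Ti(xi) for Ti, xi in zip(T_invs, x)]))
```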
Funding Open Access funding provided by ETH Zurich.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Proof of Proposition 2.14
Proof The proof consists of two steps. In Step 1, we define subnetworks, similar to those in [25, Lemma 4.5], to emulate all monomials x^ν for ν ∈ Λ of order 2^{k−1} ≤ |ν|_1 ≤ 2^k. In Step 2, we use them to construct p̃.
Step 1. Throughout this proof, we denote the NN input by x ∈ R d .
In this first step of the proof, we define subnetworks to emulate x^ν for ν ∈ Λ with 2^{k−1} ≤ |ν|_1 ≤ 2^k, k ∈ ℕ. We will use that there exists a σ_r-NN ×_r of depth 1, with input dimension 2 and output dimension 1, which exactly emulates the product operator ℝ² → ℝ : (x, y) ↦ xy.
We note that the size of× r depends on r .
Next, for all k ∈ ℕ such that {ν ∈ Λ : 2^{k−1} ≤ |ν|_1 ≤ 2^k} ≠ ∅, we define the σ_r-NN Ψ_k as a combination of product networks ×_r and identity networks, where the identity networks have depth 1.
We obtain bounds on the depth and size of p̃ as claimed in Proposition 2.14. In case |Λ| = 1, the constant polynomial p can be emulated exactly by a σ_r-NN p̃ of depth 0 and size 1.

B Proof of Proposition 4.1
Proof As in the holomorphic case, to approximate functions u ∈ G δ ([− 1, 1] d , C, A), we first build a tensor product polynomial approximation by H 2 -projection to the space Q p of polynomials in d variables with coordinatewise degree at most p ∈ N. Evidently, dim(Q p ) = ( p + 1) d .