On Sharpness of Error Bounds for Multivariate Neural Network Approximation

Sharpness of error bounds for best non-linear multivariate approximation by sums of logistic activation functions and piecewise polynomials is investigated. The error bounds are given in terms of moduli of smoothness. They describe approximation properties of single hidden layer feedforward neural networks with multiple input nodes. Sharpness with respect to Lipschitz classes is established by constructing counterexamples with a non-linear, quantitative extension of the uniform boundedness principle.


Introduction
A feedforward neural network with an activation function σ, m input nodes, one output node, and one hidden layer of n neurons implements a multivariate real-valued function g : ℝ^m → ℝ of type
\[ g(x) \in M_n := M_{n,\sigma} := \Big\{ \sum_{k=1}^{n} a_k\, \sigma(w_k \cdot x + c_k) : a_k, c_k \in \mathbb{R},\ w_k \in \mathbb{R}^m \Big\}. \tag{1.1} \]
For vectors w_k = (w_{k,1}, …, w_{k,m}) and x = (x_1, …, x_m), \( w_k \cdot x := \sum_{j=1}^{m} w_{k,j} x_j \) is the standard inner product of w_k, x ∈ ℝ^m. The summands σ(w_k · x + c_k) are ridge functions: they are constant on the hyperplanes w_k · x = c, c ∈ ℝ. We discuss error bounds for best approximation by functions of M_n in terms of moduli of smoothness for an arbitrary number m of input nodes in Section 2.
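A network of type (1.1) is straightforward to evaluate. The following is a minimal sketch in Python with the logistic activation; all weights, biases, and coefficients below are made-up illustration values, not taken from the paper:

```python
import math

def logistic(t: float) -> float:
    # logistic activation sigma(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + math.exp(-t))

def network(x, weights, biases, coefficients):
    # g(x) = sum_k a_k * sigma(w_k . x + c_k), an element of M_n as in (1.1)
    return sum(
        a * logistic(sum(wj * xj for wj, xj in zip(w, x)) + c)
        for w, c, a in zip(weights, biases, coefficients)
    )

# n = 2 neurons, m = 2 input nodes; parameters are arbitrary example values
value = network((0.5, -1.0),
                weights=[(1.0, 2.0), (-0.5, 0.25)],
                biases=[0.0, 1.0],
                coefficients=[3.0, -1.0])
```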
Many papers deal with the univariate case m = 1. In [6], Chen proved an estimate against a first order modulus for general sigmoid activation functions. An overview of other estimates against first order moduli is given in the doctoral thesis [7], cf. [8]. Under additional assumptions on activation functions, estimates against higher order moduli are possible. For example, one can easily extend the first order estimate of Ritter for approximation with "nearly exponential" activation functions in [24] to higher moduli, see [12]. Similar results can be obtained for activation functions that are arbitrarily often differentiable on some open interval without being a polynomial on that interval, see [23, Theorem 6.8, p. 176] in combination with [12].
With respect to the general multivariate case, Barron applied Fourier methods in [4] to establish a convergence rate for a certain class of smooth functions in the L²-norm. Approximation errors for multi-dimensional bell-shaped activation functions were estimated by first order moduli of smoothness or related Lipschitz classes by Anastassiou (e.g. [2]) and Costarelli and Spigler (see e.g. [9], including a literature overview). However, the neural network spaces discussed there differ from (1.1): they do not consist of linear combinations of ridge functions. A special network with four layers is introduced in [19] to obtain a Jackson estimate in terms of a first order modulus of smoothness.
Maiorov and Ratsaby establish an upper bound for functions in Sobolev spaces based on pseudo-dimension in [20, Theorem 2]. Pseudo-dimension is an upper bound of the Vapnik–Chervonenkis dimension (VC dimension) that will also be used in this paper to obtain lower bounds.
With respect to the standard situation (1.1), we apply results of Pinkus [23] and of Maiorov and Meir [22] in Section 2 to obtain error bounds for a large class of activation functions, either based on K-functional techniques or on known estimates for best approximation with multivariate polynomials. Both L^p- and sup-norms are considered.
In Section 3, we prove for the logistic activation function that counterexamples f_α exist for all α > 0 such that sup-norm as well as L^p-norm bounds are in O(n^{−α}) but the error of best approximation is not in O(n^{−β}) for β > α. This result is a multivariate extension of univariate counterexamples (m = 1, one single input node, sup-norm) in [12]. A similar result is shown for piecewise polynomial activation functions with respect to an L²-norm bound.
In fact, the non-linear variant of a quantitative uniform boundedness principle in [12] can be applied to construct univariate and multivariate counterexamples. This principle is based on theorems of Dickmeis, Nessel and van Wickeren, cf. [11], that can be used to analyze error bounds of linear approximation processes. Its application, both in a linear and in the given non-linear context, requires the construction of a resonance sequence. To this end, a known result [5] on the VC dimension of networks with logistic activation is used. Theorem 3.2 in Section 3 is formulated as a general means to derive the discussed counterexamples from VC dimension estimates. Also, [22] already provides sequences of counterexamples that can be condensed to a single counterexample with the uniform boundedness principle.
There are some published attempts to show sharpness of error bounds for neural network approximation in terms of moduli of smoothness based on inverse theorems. Inverse and equivalence theorems estimate the values of moduli of smoothness by approximation rates. For example, they determine membership in certain Lipschitz classes from known approximation errors. However, the letter [13] proves that the inverse theorem for neural network approximation in [26], as well as the inverse theorems in some related papers, is wrong. Smoothness is one feature that favors high approximation rates. But in this non-linear situation, other features (e.g. the "nearly exponential" property or similarity to certain derivatives of the activation function, cf. [18]) also contribute to convergence rates. Such features cannot be sufficiently measured by moduli of smoothness, cf. the sequence of counterexamples in [13]. This is the motivation to work with counterexamples instead of inverse or equivalence theorems in Section 3.

Notation and Direct Estimates
For a multi-index α = (α₁, …, α_m) ∈ ℕ₀^m with non-negative integer components, let |α| := \( \sum_{j=1}^{m} \alpha_j \) be its order. We write
\[ D^\alpha f := \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_m^{\alpha_m}} \]
for partial derivatives and \( x^\alpha := x_1^{\alpha_1} \cdots x_m^{\alpha_m} \) for monomials. With P_k we denote the set of m-variate polynomials with degree at most k, i.e., each polynomial in P_k is a linear combination of homogeneous polynomials of degree j ∈ {0, …, k}. To this end, let
\[ H_j := \operatorname{span}\{ x^\alpha : \alpha \in \mathbb{N}_0^m,\ |\alpha| = j \} \]
be the space of homogeneous polynomials of degree j.
The set of all univariate polynomials with degree at most k is denoted by Π_k, i.e., Π_k = P_k for m = 1. Let
\[ \dim H_k \le (k+1)^{m-1}. \]
To obtain this upper estimate, we choose exponents independently from {0, …, k} for the first m − 1 variables. If the sum of these exponents does not exceed k, then the last exponent is k minus the sum of the other exponents. Otherwise, we have counted a polynomial with degree greater than k. Thus, the estimate is only a coarse upper bound. Multivariate polynomials can be represented by univariate polynomials, cf. [23, p. 164]: For a given degree k ∈ ℕ there exist s ≤ (k + 1)^{m−1} vectors w₁, …, w_s ∈ ℝ^m such that
\[ P_k = \Big\{ \sum_{i=1}^{s} p_i(w_i \cdot x) : p_i \in \Pi_k \Big\}. \]
Lemma 2.1. Let σ : ℝ → ℝ be arbitrarily often differentiable on an open interval around the origin with σ^{(i)}(0) ≠ 0 for all i ∈ ℕ₀. Then for any polynomial π ∈ P_k of degree at most k, any compact set I ⊂ ℝ^m, and each ε > 0 there exists a sufficiently often differentiable function g ∈ M_{s(k+1)} such that simultaneously for all α ∈ ℕ₀^m, |α| ≤ k,
\[ \|D^\alpha(\pi - g)\|_{C(I)} < \varepsilon. \]
The requirement that derivatives at zero must not vanish can be replaced by the requirement that σ is not a polynomial on the open interval, see [16].
With n summands of the activation function, polynomials of degree k and their derivatives can be simultaneously approximated arbitrarily well for all values of k that fulfill s(k + 1) ≤ n; due to s ≤ (k + 1)^{m−1}, this holds whenever (k + 1)^m ≤ n. Especially, polynomials of degree at most ⌊n^{1/m}⌋ − 1 can be approximated arbitrarily well.
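The counting above can be checked numerically. A small sketch (function names are ours): the exact dimension of H_j is the binomial coefficient C(m + j − 1, j), dim P_k = C(m + k, k), and the largest simultaneously approximable degree follows from the sufficient condition (k + 1)^m ≤ n:

```python
from math import comb

def dim_homogeneous(m: int, j: int) -> int:
    # number of monomials x^alpha with |alpha| = j in m variables
    return comb(m + j - 1, j)

def dim_Pk(m: int, k: int) -> int:
    # dim P_k = sum of the homogeneous dimensions for j = 0..k
    return comb(m + k, k)

def max_degree(n: int, m: int) -> int:
    # largest k with (k + 1)^m <= n, so that s*(k + 1) <= n neurons suffice
    k = 0
    while (k + 2) ** m <= n:
        k += 1
    return k
```

For example, with m = 2 input nodes and n = 4^m = 16 neurons, degree 3 is reachable, matching the condition n ≥ 4^m used below.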
Let Ω ⊂ ℝ^m be an open set. By X^p(Ω) := L^p(Ω) with norm ‖·‖_{L^p(Ω)} we denote the usual Lebesgue spaces for 1 ≤ p < ∞; X^∞(Ω) is the space of uniformly continuous, bounded functions on Ω equipped with the sup-norm ‖f‖_{L^∞(Ω)} := sup_{x∈Ω} |f(x)|. The rth radial modulus of smoothness of a function f ∈ X^p(Ω) is defined via rth forward differences in direction h ∈ ℝ^m,
\[ \Delta_h^r f(x) := \sum_{j=0}^{r} (-1)^{r-j} \binom{r}{j} f(x + jh), \]
as
\[ \omega_r(f, \delta)_{p,\Omega} := \sup\big\{ \|\Delta_h^r f\|_{L^p(\Omega_{rh})} : |h| \le \delta \big\}, \qquad \Omega_{rh} := \{ x \in \Omega : x + t\, rh \in \Omega \text{ for all } t \in [0, 1] \}. \]
The error of best approximation by functions of a set S is
\[ E(S, f)_{p,\Omega} := \inf\{ \|f - g\|_{L^p(\Omega)} : g \in S \}. \]
Thus, E(S, f)_{p,Ω} is the distance between f and S.
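A discrete, univariate stand-in for the modulus of smoothness can illustrate the definition; this is a sketch on a sample grid, not the exact radial modulus used in the paper:

```python
from math import comb

def forward_difference(f, x, h, r):
    # Delta_h^r f(x) = sum_{j=0}^{r} (-1)^(r-j) * C(r, j) * f(x + j*h)
    return sum((-1) ** (r - j) * comb(r, j) * f(x + j * h) for j in range(r + 1))

def modulus(f, delta, r, a=0.0, b=1.0, steps=100):
    # discrete stand-in for omega_r(f, delta) on [a, b]: sup of |Delta_h^r f(x)|
    # over step widths 0 < h <= delta and points x with x and x + r*h in [a, b]
    best = 0.0
    for i in range(1, steps + 1):
        h = delta * i / steps
        for j in range(steps + 1):
            x = a + (b - a) * j / steps
            if x + r * h <= b:
                best = max(best, abs(forward_difference(f, x, h, r)))
    return best
```

As expected, the second modulus of an affine function vanishes, while the first modulus of f(x) = x² on [0, 1] is close to its exact value 2δ − δ².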
As an application of a multivariate equivalence theorem between K-functional and moduli of smoothness, an estimate for best polynomial approximation is proved on Lipschitz graph domains (LG-domains) in [17, Corollary 4, p. 139]. For the definition of not necessarily bounded LG-domains, see [1, p. 66]. For bounded domains, the LG property is equivalent to a Lipschitz boundary. In particular, the later discussed bounded m-dimensional open intervals like (0, 1)^m and the unit ball {x ∈ ℝ^m : |x| < 1} are examples of LG-domains.
Let Ω be a bounded LG-domain in ℝ^m and 1 ≤ p ≤ ∞. Then, for f ∈ X^p(Ω) and k ∈ ℕ,
\[ E(P_k, f)_{p,\Omega} \le C_r\, \omega_r\Big(f, \frac{1}{k}\Big)_{p,\Omega} \]
with a constant C_r that is independent of f and k, see [17].
Theorem 2.1 (Arbitrarily Often Differentiable Functions). Let σ : ℝ → ℝ be arbitrarily often differentiable on some open interval in ℝ, and let σ be no polynomial on that interval. Further, let f ∈ X^p(Ω) for an LG-domain Ω ⊂ ℝ^m, 1 ≤ p ≤ ∞, and r ∈ ℕ. For n ≥ 4^m there exists a constant C that is independent of f and n such that
\[ E(M_n, f)_{p,\Omega} \le C\, \omega_r\Big(f, \frac{1}{\lfloor \sqrt[m]{n} \rfloor}\Big)_{p,\Omega}. \]
By using an error bound for best polynomial approximation, we are not able to exploit possible advantages of non-linear approximation. However, we will see in the next section that non-linear neural network approximation does not really perform better than polynomial approximation in the worst case.
Most activation functions that are not piecewise polynomials fulfill the requirements of Theorem 2.1. For example, it provides an error bound for approximation with the sigmoid activation function based on the inverse tangent,
\[ \sigma(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(x), \]
the logistic function
\[ \sigma(x) = \frac{1}{1 + e^{-x}}, \]
and the "Exponential Linear Unit" (ELU) activation function
\[ \sigma(x) = \begin{cases} x, & x \ge 0,\\ a(e^x - 1), & x < 0, \end{cases} \qquad a > 0. \]
A direct bound for simultaneous approximation of a function and its partial derivatives in the sup-norm can be obtained similarly, based on a corresponding estimate for simultaneous approximation by polynomials using a Jackson estimate from [3]: Let f : ℝ^m → ℝ be a function with compact support such that all partial derivatives up to order k ∈ ℕ₀ are continuous, and let Ω ⊂ ℝ^m be a compact set that contains the support of f. Then there exists a constant C ∈ ℝ (independent of n and f) such that for each n ∈ ℕ a polynomial π ∈ P_n can be found with
\[ \|D^\alpha(f - \pi)\|_{C(\Omega)} \le \frac{C}{n^{k-|\alpha|}} \max_{|\beta| = k} \omega_1\Big(D^\beta f, \frac{1}{n}\Big)_{\infty,\Omega} \]
for all α ∈ ℕ₀^m with |α| ≤ min{k, n}. Similar to the proof of Theorem 2.1, we combine this cited result with Lemma 2.1 to obtain (cf. [27])
Theorem 2.2 (Synchronous Sup-Norm Approximation). Let σ : ℝ → ℝ be arbitrarily often differentiable without being a polynomial. For each function f : ℝ^m → ℝ with compact support and continuous partial derivatives up to order k ∈ ℕ₀ and each compact set Ω ⊂ ℝ^m containing the support of f, the following estimate holds true: For each n ∈ ℕ, n ≥ 4^m, there exist a function g ∈ M_n and a constant C ∈ ℝ (independent of n and f) such that the partial derivatives D^α g simultaneously approximate D^α f with the rate of the cited polynomial estimate for all α ∈ ℕ₀^m with |α| ≤ min{k, ⌊n^{1/m}⌋ − 1}.
The requirements of Theorems 2.1 and 2.2 are not fulfilled by activation functions of type
\[ \sigma(x) = (\max\{0, x\})^k = \begin{cases} x^k, & x \ge 0,\\ 0, & x < 0, \end{cases} \tag{2.6} \]
for k ∈ ℕ. The often used ReLU function is obtained for k = 1. Corollary 6.11 in [23, p. 178] is an L²-norm Jackson estimate for this class of functions. To work with this estimate, we need to introduce Sobolev spaces. Let W^r_p(Ω), 1 ≤ p < ∞, be the L^p-Sobolev space of r-times partially differentiable functions (in the weak sense) on Ω ⊂ ℝ^m with semi-norm
\[ |f|_{W_p^r(\Omega)} := \Big( \sum_{|\alpha| = r} \|D^\alpha f\|_{L^p(\Omega)}^p \Big)^{1/p} \]
and norm
\[ \|f\|_{W_p^r(\Omega)} := \Big( \|f\|_{L^p(\Omega)}^p + \sum_{1 \le |\alpha| \le r} \|D^\alpha f\|_{L^p(\Omega)}^p \Big)^{1/p}. \]
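The named activation functions are easy to state in code. A short sketch (the ELU parametrization with factor a is the common one and an assumption here, as is the name relu_power for the piecewise polynomial type):

```python
import math

def atan_sigmoid(x: float) -> float:
    # sigmoid built from the inverse tangent: 1/2 + arctan(x)/pi
    return 0.5 + math.atan(x) / math.pi

def logistic(x: float) -> float:
    # the logistic function 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def elu(x: float, a: float = 1.0) -> float:
    # Exponential Linear Unit: x for x >= 0, a*(exp(x) - 1) otherwise
    return x if x >= 0 else a * (math.exp(x) - 1.0)

def relu_power(x: float, k: int = 1) -> float:
    # piecewise polynomial activation of type (2.6): max(0, x)^k; ReLU for k = 1
    return max(0.0, x) ** k
```

The first three are arbitrarily often differentiable and not polynomials, so Theorems 2.1 and 2.2 apply; relu_power is piecewise polynomial and is covered by the L²-estimate below instead.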
According to the Jackson estimate for activation functions (2.6) in [23, p. 178], let m ≥ 2 and Ω ⊂ ℝ^m be the m-dimensional unit ball. Then there exists a constant C > 0 such that for all f ∈ W^r_2(Ω) with ‖f‖_{W^r_2(Ω)} ≤ 1 and r ∈ ℕ with r < k + 1 + (m − 1)/2 (k being the exponent in (2.6)),
\[ E(M_n, f)_{2,\Omega} \le C\, n^{-\frac{r}{m}}. \]
Thus, for all f ∈ W^r_2(Ω) without the restriction ‖f‖_{W^r_2(Ω)} ≤ 1, there holds true
\[ E(M_n, f)_{2,\Omega} \le C\, n^{-\frac{r}{m}}\, \|f\|_{W_2^r(\Omega)}. \tag{2.8} \]
This estimate can be extended to moduli of smoothness using K-functional techniques.
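The step from the normalized case ‖f‖_{W^r_2(Ω)} ≤ 1 to the estimate for general f is the usual homogeneity argument: M_n is closed under multiplication by scalars, so for f ≠ 0 one can apply the normalized estimate to the scaled function, which has Sobolev norm 1:

```latex
E(M_n, f)_{2,\Omega}
  = \|f\|_{W_2^r(\Omega)}\; E\Bigl(M_n, \tfrac{f}{\|f\|_{W_2^r(\Omega)}}\Bigr)_{2,\Omega}
  \le C\, n^{-r/m}\, \|f\|_{W_2^r(\Omega)}.
```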
To this end, we need some definitions that will also be needed in the next section for discussing sharpness. A functional T on a normed space X, i.e., T maps X into ℝ, is non-negative-valued, sub-linear, and bounded iff for all f, g ∈ X, c ∈ ℝ:
\[ Tf \ge 0, \qquad T(f + g) \le Tf + Tg, \qquad T(cf) = |c|\, Tf, \qquad Tf \le C\, \|f\|_X \]
with a constant C > 0 that is independent of f and g. The set X∼ consists of all non-negative-valued, sub-linear, bounded functionals T on X.

Via the Peetre K-functional
\[ K(\delta, f; X, U) := \inf\{ \|f - g\|_X + \delta\, |g|_U : g \in U \} \]
for a subspace U ⊂ X with semi-norm |·|_U, estimates against a semi-norm can be converted into K-functional estimates: Let (E_n) be a sequence of functionals on X with E_n(f) ≤ E_n(f − g) + E_n(g) and E_n(h) ≤ C₁‖h‖_X for all f, g, h ∈ X, and let E_n(g) ≤ C₀ ϕ(n) |g|_U for all g ∈ U. Then
\[ E_n(f) \le C\, K(\varphi(n), f; X, U) \tag{2.15} \]
for n ≥ 2 with a constant C that is independent of f and n.
Proof. Let g ∈ U. Then
\[ E_n(f) \le E_n(f - g) + E_n(g) \le C_1 \|f - g\|_X + C_0\, \varphi(n)\, |g|_U \le \max\{C_0, C_1\} \big( \|f - g\|_X + \varphi(n)\, |g|_U \big), \]
thus, taking the infimum over all g ∈ U yields (2.15).
We apply the lemma to (2.8) with X = L²(Ω), U = W^r_2(Ω), ϕ(n) = n^{−r/m}. The error functional E(M_n, f)_{2,Ω} fulfills all prerequisites. In connection with the equivalence between K-functionals and moduli of smoothness [17, p. 120], we get
Theorem 2.3 (Piecewise Polynomial Functions). Let m ≥ 2, Ω ⊂ ℝ^m be the m-dimensional unit ball, and σ a piecewise polynomial activation function of type (2.6). Constants C₁, C₂ ∈ ℝ exist such that for each f ∈ L²(Ω), n ≥ 2, and r ∈ ℕ with r < k + 1 + (m − 1)/2:
\[ E(M_n, f)_{2,\Omega} \le C_1\, \omega_r\big(f, n^{-1/m}\big)_{2,\Omega} + C_2\, n^{-\frac{r}{m}}\, \|f\|_{L^2(\Omega)}. \tag{2.16} \]
The saturation order of the modulus is n^{−r/m}, so the term n^{−r/m}‖f‖_{L²(Ω)} is only technical. The estimate also holds for ReLU (k = 1) with only one input node (m = 1) for r = 2, see [12]. It can be extended to the cut activation function because cut can be written as a difference of ReLU and a translated ReLU.

Sharpness due to Counterexamples
A coarse lower estimate can be obtained for all integrable activation functions in the L²-norm based on an estimate for ridge functions in [21]. However, the general setting leads to an exponent r/(m − 1) instead of r/m. The space of all measurable, real-valued functions that are integrable on every compact subset of ℝ is denoted by L(ℝ). Let Ω be the m-dimensional unit ball.
Then there exists a sequence (f_n)_{n=1}^∞, f_n ∈ W^r_2(Ω), with ‖f_n‖_{W^r_2(Ω)} ≤ C₀, and a constant c > 0 such that (cf. Theorem 2.1)
\[ E(M_n, f_n)_{2,\Omega} \ge c\, n^{-\frac{r}{m-1}}. \tag{3.1} \]
Proof. This is a direct corollary of Theorem 1 in [21]: For A ⊂ ℝ^m with cardinality |A|, let R(A) be the linear space that is spanned by all functions h(w · x), h ∈ L(ℝ), w ∈ A. Thus, in contrast to one fixed activation function, different, nearly arbitrary functions h are allowed to be used with different vectors w in linear combinations. Let
\[ R_n := \bigcup_{A \subset \mathbb{R}^m : |A| \le n} R(A) \]
be the space of functions that can be represented as \( \sum_{k=1}^{n} a_k h_k(w_k \cdot x) \), a_k ∈ ℝ, h_k ∈ L(ℝ), w_k ∈ ℝ^m. For all activation functions σ ∈ L(ℝ) one has h_k(x) := σ(x + c_k) ∈ L(ℝ) for c_k ∈ ℝ, i.e., M_n ⊂ R_n. According to [21], for m ≥ 2 there exist constants 0 < c ≤ C, independent of n, such that
\[ c\, n^{-\frac{r}{m-1}} \le \sup\big\{ E(R_n, f)_{2,\Omega} : f \in W_2^r(\Omega),\ \|f\|_{W_2^r(\Omega)} \le 1 \big\} \le C\, n^{-\frac{r}{m-1}}. \tag{3.2} \]
This result was proved for Ω being the unit ball. But similar to Theorem 3.2 below, a grid is used that can also be adjusted to Ω = (0, 1)^m. We now apply a resonance principle from [12] that is a straightforward extension of a general theorem by Dickmeis, Nessel and van Wickeren, see [11]. With this principle, we condense sequences (f_n)_{n=1}^∞ like the one in (3.1) to single counterexamples. To measure convergence rates, abstract moduli of smoothness ω are often used, see [25, p. 96ff]. To this end, let ω be a continuous, increasing function on [0, ∞) with ω(0) = 0 and ω(δ) > 0 for δ > 0 such that for all 0 < δ₁, δ₂
\[ \omega(\delta_1 + \delta_2) \le \omega(\delta_1) + \omega(\delta_2). \]
Typically, Lipschitz classes are defined via ω(δ) := δ^α, 0 < α ≤ 1.
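The subadditivity condition on abstract moduli can be tested numerically on a grid of sample values; δ^α satisfies it for 0 < α ≤ 1 but fails for α > 1. A small sketch:

```python
def is_subadditive(omega, samples):
    # check omega(d1 + d2) <= omega(d1) + omega(d2) on all pairs of samples
    return all(omega(d1 + d2) <= omega(d1) + omega(d2) + 1e-12
               for d1 in samples for d2 in samples)

samples = [i / 10 for i in range(1, 21)]
holder_ok = is_subadditive(lambda d: d ** 0.5, samples)    # alpha = 0.5
holder_bad = is_subadditive(lambda d: d ** 1.5, samples)   # alpha = 1.5 violates it
```

For α = 1.5 the violation is already visible at δ₁ = δ₂ = 1, where 2^{1.5} ≈ 2.83 exceeds 1 + 1 = 2.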
In Theorem 3.1, the sequence of error functionals has to fulfill conditions (2.9)-(2.12). Also, a family of sub-linear, bounded functionals S_δ ∈ X∼, δ > 0, is given. These functionals will represent moduli of smoothness.
When dealing with the sup-norm, one can generally apply the resonance theorem in connection with known VC dimensions of indicator functions. The general definition of VC dimension based on sets is as follows.
Let X be a finite set and A ⊂ P(X) a family of subsets of X. Set S ⊂ X is said to be shattered by A iff each subset B ⊂ S can be represented as B = S ∩ A for a family member A ∈ A. Thus, the set {S ∩ A : A ∈ A} has 2 |S| elements, |S| denoting the number of elements of S.
VC-dim(A) := sup{k ∈ N : ∃S ⊂ X with cardinality |S| = k such that S is shattered by A} is called the VC dimension of A.
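The set-based definition can be checked by brute force on small examples. A sketch (the family of intervals restricted to a finite point set is a toy illustration, not related to neural networks):

```python
from itertools import combinations

def shatters(family, S):
    # S is shattered iff every subset of S arises as S ∩ A for some member A
    intersections = {frozenset(S) & frozenset(A) for A in family}
    return len(intersections) == 2 ** len(S)

def vc_dim(family, X):
    # brute force: largest cardinality of a subset of X shattered by the family
    dim = 0
    for k in range(1, len(X) + 1):
        if any(shatters(family, S) for S in combinations(X, k)):
            dim = k
    return dim

# toy example: intervals [a, b] restricted to the finite point set X
X = (0, 1, 2, 3)
intervals = [tuple(p for p in X if a <= p <= b) for a in X for b in X]
```

Intervals shatter every two-point set but no three-point set (the middle point cannot be excluded), so their VC dimension on X is 2.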
For our purpose, we discuss a (non-linear) set V of functions g : X → ℝ on a set X ⊂ ℝ^m. Using the Heaviside function H : ℝ → {0, 1},
\[ H(x) := \begin{cases} 1, & x \ge 0,\\ 0, & x < 0, \end{cases} \]
the VC dimension of V is the largest number k (or infinity) for which points x₁, …, x_k ∈ X exist such that for each sign sequence s₁, …, s_k ∈ {−1, 1} a function g ∈ V can be found that fulfills (cf. [5])
\[ H(g(x_i)) = H(s_i), \quad 1 \le i \le k. \]
For a sequence of function sets V_n on [0, 1]^m and a function τ : ℕ → ℕ, let X_n be the grid of the points in [0, 1]^m whose coordinates are integer multiples of 1/τ(n), and let
\[ V_{n,\tau(n)} := \{ \tilde g : X_n \to \mathbb{R} : \text{there exists } g \in V_n \text{ with } \tilde g(x) = g(x) \text{ for all } x \in X_n \} \]
be the set of functions that are generated by restricting functions of V_n to this grid. As in Theorem 3.1, convergence rates are expressed via a function ϕ(x) that fulfills the requirements of Theorem 3.1 including condition (3.4). Let the VC dimension of V_{n,τ(n)} and the function values of τ and ϕ be coupled via the inequalities
\[ \text{VC-dim}(V_{4n,\tau(4n)}) < |X_{4n}| = (\tau(4n) + 1)^m \tag{3.10} \]
and
\[ \tau(4n) \le \frac{C}{\varphi(n)} \tag{3.11} \]
for all n ≥ n₀ ∈ ℕ with a constant C > 0 that is independent of n.
Proof. Condition (3.10) implies for 4n ≥ n₀ that a sequence of signs s_z ∈ {−1, 1} for the points z ∈ X_{4n} exists such that no function in V_{4n} can reproduce the signs of the sequence in each point of X_{4n}, i.e., for each g ∈ V_{4n} there exists a point z₀ ∈ X_{4n} such that H(g(z₀)) ≠ H(s_{z₀}).
Based on this sign sequence, we construct an arbitrarily often partially differentiable resonance function h_n such that its function values equal the signs on the grid X_{4n}. To this end, we use an arbitrarily often differentiable function h with h(0) = 1 and support contained in [−1, 1], and define
\[ h_n(x) := \sum_{z \in X_{4n}} s_z \prod_{k=1}^{m} h\big( 2\tau(4n)(x_k - z_k) \big). \]
The scaling factors 2τ(4n) are chosen such that the supports of the summands only intersect at their borders. Therefore, ‖h_n‖_{C([0,1]^m)} ≤ 1 and h_n(z) = s_z for all z ∈ X_{4n}. All partial derivatives of order up to r are in O([ϕ(n)]^{−r}) because of (3.11). In addition to h_n, we choose the parameters in Theorem 3.1 accordingly, with E_n(f) as in (3.9). We do not directly use ϕ(x) with Theorem 3.1. Instead, [ϕ(x)]^r fulfills the requirements of the function that is also called ϕ(x) in Theorem 3.1.
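A univariate sketch of this resonance construction, with an assumed smooth bump function h (support [−1, 1], h(0) = 1) and an assumed grid of mesh 1/τ on [0, 1]; the alternating signs are illustration values:

```python
import math

def bump(t: float) -> float:
    # C-infinity bump with support [-1, 1] and bump(0) = 1
    return math.exp(1.0 - 1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0

def resonance(x: float, signs: dict, tau: int) -> float:
    # h_n(x) = sum over grid points z of s_z * bump(2*tau*(x - z));
    # grid spacing 1/tau makes neighbouring supports meet only at their borders
    return sum(s * bump(2 * tau * (x - z)) for z, s in signs.items())

tau = 4
grid = [i / tau for i in range(tau + 1)]   # assumed grid {0, 1/tau, ..., 1}
signs = {z: (1 if i % 2 == 0 else -1) for i, z in enumerate(grid)}
```

At each grid point exactly one summand is non-zero, so the function reproduces the prescribed signs and stays bounded by 1 in absolute value.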
Requirements (3.5) and (3.6) can be easily shown due to the sup-norms of hn and its partial derivatives, cf. (2.7).
All preliminaries of Theorem 3.1 are fulfilled such that counterexamples exist as stated. For univariate approximation, i.e., m = 1, the theorem is proved in [12]. This proof can be generalized as follows.
Proof. Let D ∈ ℕ. In [5], an upper bound for the VC dimension of the function spaces
\[ \Delta_n := \Big\{ g : \{-D, -D+1, \dots, D\}^m \to \mathbb{R} : g(x) = a_0 + \sum_{k=1}^{n} a_k \sigma(w_k \cdot x + c_k),\ a_0, a_k, c_k \in \mathbb{R},\ w_k \in \mathbb{R}^m \Big\} \]
with the logistic activation function σ is derived. Functions are defined on a discrete set with (2D + 1)^m points. Please note that the constant function a₀ is not consistent with the definition of M_n; it provides an additional degree of freedom. We apply Theorem 2 in [5]: There exists n* ∈ ℕ such that for all n ≥ n* the VC dimension of ∆_n is bounded from above by
\[ 2 \cdot (nm + 2n + 1) \cdot \log_2\big( 24e(nm + 2n + 1)D \big), \]
i.e., there exist an n₀ ≥ max{2, n*}, n₀ ∈ ℕ, and a constant C_m > 0, dependent on m, such that for all n ≥ n₀
\[ \text{VC-dim}(\Delta_n) \le C_m\, n \big[ \log_2(n) + \log_2(D) \big]. \]
Thus, Theorem 3.2 can be applied to obtain the counterexample.
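The quoted bound from [5, Theorem 2] can be evaluated directly; nm + 2n + 1 counts the network parameters including the extra constant a₀. A sketch:

```python
import math

def vc_upper_bound(n: int, m: int, D: int) -> float:
    # 2 * (n*m + 2n + 1) * log2(24 * e * (n*m + 2n + 1) * D), cf. [5, Theorem 2]
    p = n * m + 2 * n + 1    # number of free parameters a_0, a_k, c_k, w_k
    return 2.0 * p * math.log2(24 * math.e * p * D)
```

The bound grows like n(log₂ n + log₂ D), which is the form used to verify condition (3.10) against the grid size.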
The theorem can also be proved based on the sequence (fn) ∞ n=1 from [22] with properties (3.1) and (3.2). We use this sequence to obtain the sharpness in L p norms for approximation with piecewise polynomial activation functions as well as with the logistic function.
Theorem 1 in [20] provides a general means to obtain such bounded sequences in Sobolev spaces for which approximation by functions in Mn is lower bounded with respect to pseudo-dimension.
We condense the sequence (f_n)_{n=1}^∞ to a single counterexample with the next theorem. With Theorem 3.4, similar L^p estimates for the logistic function and L² estimates for piecewise polynomial activation functions (2.6) hold true, see the direct L²-norm estimate (2.16). With one input node (m = 1), a lower estimate for piecewise polynomial activation functions without the log-factor can be proved easily, see [12]. Thus, the bound in Theorem 3.4 might be improvable.
Future work can deal with sharpness of error bound (2.5) for synchronous approximation in the multivariate case. By extending quantitative uniform boundedness principles with multiple error functionals (cf. [10], [14], [15]) to non-linear approximation (cf. proof of Theorem 3.1 in [12]), one might be able to show simultaneous sharpness in different (semi-) norms.