On Sharpness of Error Bounds for Univariate Approximation by Single Hidden Layer Feedforward Neural Networks

A new non-linear variant of a quantitative extension of the uniform boundedness principle is used to show the sharpness of error bounds for univariate approximation by sums of sigmoid and ReLU functions. Single hidden layer feedforward neural networks with one input node perform such operations. Errors of best approximation can be expressed using moduli of smoothness of the function to be approximated (i.e., to be learned). In this context, the quantitative extension of the uniform boundedness principle allows the construction of counterexamples that show the approximation rates to be best possible: approximation errors do not belong to the little-o class of the given bounds. By choosing piecewise linear activation functions, the discussed problem becomes free knot spline approximation. The results of the present paper also hold for non-polynomial (and not piecewise defined) activation functions like the inverse tangent. Based on the Vapnik-Chervonenkis dimension, first results are shown for the logistic function.


Introduction
A feedforward neural network with an activation function σ, one input node, one output node, and one hidden layer of n neurons as shown in Fig. 1 implements a univariate real function g of the type

g(x) = \sum_{k=1}^{n} a_k σ(b_k x + c_k)

with weights a_k, b_k, c_k ∈ R. The given paper does not deal with multivariate approximation, but some results can be extended to multiple input nodes, see Sect. 5. An activation function σ is called sigmoid if lim_{x→−∞} σ(x) = 0 and lim_{x→∞} σ(x) = 1. Sometimes also monotonicity, boundedness, continuity, or even differentiability may be prescribed. Alternative definitions are based on convexity and concavity. In case of differentiability, sigmoid functions have a bell-shaped first derivative.

Throughout this paper, approximation properties of the following sigmoid functions are discussed: the Heaviside function σ_h(x) := 0 for x < 0 and σ_h(x) := 1 for x ≥ 0, the cut function σ_c(x) := min{1, max{0, x}}, the logistic function σ_l(x) := 1/(1 + e^{−x}), and the sigmoid function σ_a(x) := 1/2 + arctan(x)/π based on the inverse tangent. The (non-sigmoid) ReLU function σ_r(x) := max{0, x} is often used as an activation function for deep neural networks due to its computational simplicity. The Exponential Linear Unit (ELU) activation function

σ_e(x) := α(e^x − 1) for x < 0, σ_e(x) := x for x ≥ 0,

with parameter α ≠ 0 is a smoother variant of ReLU for α = 1.

Qualitative approximation properties of neural networks have been studied extensively. For example, it is possible to choose an infinitely often differentiable, almost monotone, sigmoid activation function σ such that for each continuous function f, each compact interval, and each bound ε > 0, weights a_0, a_1, b_1, c_1 ∈ R exist such that f can be approximated uniformly by a_0 + a_1 σ(b_1 x + c_1) on the interval within bound ε, see [27] and the literature cited there. In this sense, a neural network with only one hidden neuron is capable of approximating every continuous function. However, activation functions are typically chosen fixed when applying neural networks to solve application problems; they do not depend on the unknown function to be approximated. In the late 1980s it was already known that, by increasing the number of neurons, all continuous functions can be approximated arbitrarily well in the sup-norm on a compact set with each non-constant, bounded, monotonically increasing and continuous activation function (universal approximation or density property, see the proof of Funahashi in [22]). For each continuous sigmoid activation function (that does not have to be monotone), the universal approximation property was proved by Cybenko in [14] on the unit cube. The result was extended to bounded sigmoid activation functions by Jones [31] without requiring continuity or monotonicity. For monotone sigmoid (and not necessarily continuous) activation functions, Hornik, Stinchcombe and White [29] extended the universal approximation property to the approximation of measurable functions. Hornik [28] proved density in L_p-spaces for arbitrary non-constant, bounded and continuous activation functions. A rather general theorem is proved in [35]: Leshno et al. showed for any continuous activation function σ that the universal approximation property is equivalent to the fact that σ is not an algebraic polynomial.
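To make the setting concrete, the following Python sketch (our own illustration, not part of the paper; all identifiers such as `network` are hypothetical) implements the hidden layer sum g(x) = Σ_{k=1}^n a_k σ(b_k x + c_k) together with the activation functions defined above.

```python
import numpy as np

# Activation functions as defined above; identifiers are ours.
def sigma_h(x):                      # Heaviside function
    return np.where(np.asarray(x) >= 0.0, 1.0, 0.0)

def sigma_c(x):                      # cut function
    return np.clip(x, 0.0, 1.0)

def sigma_l(x):                      # logistic function
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigma_a(x):                      # sigmoid built from inverse tangent
    return 0.5 + np.arctan(x) / np.pi

def sigma_r(x):                      # ReLU
    return np.maximum(x, 0.0)

def sigma_e(x, alpha=1.0):           # ELU with parameter alpha != 0
    x = np.asarray(x, dtype=float)
    return np.where(x < 0.0, alpha * (np.exp(x) - 1.0), x)

def network(x, a, b, c, sigma):
    """g(x) = sum_k a_k * sigma(b_k * x + c_k): one hidden layer, one input/output node."""
    x = np.asarray(x, dtype=float)
    return sum(ak * sigma(bk * x + ck) for ak, bk, ck in zip(a, b, c))
```

For example, `network(x, a, b, c, sigma_l)` evaluates a logistic network with n = len(a) hidden neurons.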
To approximate or interpolate a given but unknown function f, constants a_k, b_k, and c_k are typically obtained by learning based on sampled function values of f. The underlying optimization algorithm (like gradient descent with back propagation) might get stuck in a local but not global minimum. Thus, it might not find optimal constants to approximate f as well as possible. This paper does not focus on learning but on general approximation properties of the function spaces

Φ_{n,σ} := { g : [0, 1] → R : g(x) = \sum_{k=1}^{n} a_k σ(b_k x + c_k), a_k, b_k, c_k ∈ R }.

Thus, we discuss functions on the interval [0, 1]; without loss of generality, it is used instead of an arbitrary compact interval [a, b]. In some papers, an additional constant function a_0 is allowed as a summand in the definition of Φ_{n,σ}. Please note that a_k σ(0 · x + b_k) already is a constant and that the definitions do not differ significantly. For f ∈ X_p[0, 1], where X_p[0, 1] := L_p[0, 1] for 1 ≤ p < ∞ and X_∞[0, 1] := C[0, 1], the error of best approximation E(Φ_{n,σ}, f)_p, 1 ≤ p ≤ ∞, is defined via

E(Φ_{n,σ}, f)_p := inf{ ‖f − g‖_{X_p[0,1]} : g ∈ Φ_{n,σ} }.

We use the abbreviation E(Φ_{n,σ}, f) := E(Φ_{n,σ}, f)_∞ for p = ∞. A trained network cannot approximate a function better than the error of best approximation. Therefore, it is an important measure of what can and what cannot be done with such a network.
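The infimum in the definition of E(Φ_{n,σ}, f)_p is generally not available in closed form, but it can be estimated from above numerically. The sketch below is our own assumption-laden illustration, not the paper's method; `approx_error` and its parameters are hypothetical. It uses derivative-free optimization with random restarts; as discussed above, such optimization may get stuck in local minima, so the returned value only bounds the error of best approximation from above.

```python
import numpy as np
from scipy.optimize import minimize

def approx_error(f, sigma, n, p=np.inf, grid=np.linspace(0.0, 1.0, 513),
                 restarts=20, seed=0):
    """Numerical upper estimate of E(Phi_{n,sigma}, f)_p on a sample grid."""
    rng = np.random.default_rng(seed)
    y = f(grid)

    def loss(theta):
        a, b, c = theta.reshape(3, n)
        g = (a[:, None] * sigma(b[:, None] * grid[None, :] + c[:, None])).sum(axis=0)
        r = np.abs(g - y)
        return r.max() if p == np.inf else np.mean(r ** p) ** (1.0 / p)

    best = np.inf
    for _ in range(restarts):  # random restarts mitigate (but don't cure) local minima
        theta0 = rng.normal(scale=2.0, size=3 * n)
        best = min(best, minimize(loss, theta0, method="Nelder-Mead",
                                  options={"maxiter": 20000, "fatol": 1e-10}).fun)
    return best
```

For instance, `approx_error(lambda x: np.abs(x - 0.5), sigma_l, 4)` gives an upper estimate for a four-neuron logistic network.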
The error of best approximation depends on the smoothness of f that is measured in terms of moduli of smoothness (or moduli of continuity). In contrast to using derivatives, first and higher differences of f obviously always exist. By applying a norm to such differences, moduli of smoothness measure a "degree of continuity" of f .
For a natural number r ∈ N, the rth difference of f at point x with step width h > 0 is

Δ_h^r f(x) := \sum_{j=0}^{r} (−1)^{r−j} \binom{r}{j} f(x + jh).

The rth uniform modulus of smoothness is the smallest upper bound of the absolute values of rth differences:

ω_r(f, δ) := sup{ |Δ_h^r f(x)| : 0 < h ≤ δ, x, x + rh ∈ [0, 1] }.

Moduli of smoothness ω_r(f, δ)_p with respect to L_p-norms are defined analogously by applying the L_p norm to the differences. For r-times continuously differentiable functions f, there holds (cf. [16, p. 46])

ω_r(f, δ) ≤ δ^r ‖f^{(r)}‖_{B[0,1]}.   (1)

Barron applied Fourier methods in [4], cf. [36], to establish rates of convergence in an L_2-norm, i.e., he estimated the error E(Φ_{n,σ}, f)_2 with respect to n for n → ∞. Makovoz [40] analyzed rates for uniform convergence. With respect to moduli of smoothness, Debao [15] proved a direct estimate that is presented here in a version of the textbook [9, p. 172ff]. This estimate is independent of the choice of a bounded, sigmoid function σ. Doctoral thesis [10], cf. [11], provides an overview of such direct estimates in Section 1.3.
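For intuition, the rth difference and the uniform modulus can be evaluated numerically. The following sketch is our own illustration (the helper name `modulus` and the sampling densities are arbitrary choices); it approximates the supremum from below on a grid.

```python
import numpy as np
from math import comb

def modulus(f, r, delta, m=2000):
    """sup over 0 < h <= delta and x with x, x + r*h in [0,1] of |Delta_h^r f(x)|."""
    sup = 0.0
    for h in np.linspace(delta / 50.0, delta, 50):
        x = np.linspace(0.0, 1.0 - r * h, m)          # admissible points only
        diff = sum((-1) ** (r - j) * comb(r, j) * f(x + j * h) for j in range(r + 1))
        sup = max(sup, np.max(np.abs(diff)))
    return sup

# For f(x) = x^2 and r = 2, Delta_h^2 f = 2*h^2, so modulus(f, 2, delta) is about
# 2*delta^2, matching the bound (1): omega_r(f, delta) <= delta^r * ||f^(r)||.
print(modulus(lambda x: x ** 2, 2, 0.1))
```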
According to Debao,

E(Φ_{n,σ}, f) ≤ C ω_1(f, 1/n)   (2)

holds for each f ∈ C[0, 1]. This is the prototype estimate for which sharpness is discussed in this paper. In fact, the result of Debao for E(Φ_{n,σ}, f) allows to additionally restrict weights such that b_k ∈ N and c_k ∈ Z. The estimate has to hold true even for σ being a discontinuous Heaviside function. That is the reason why one can only expect an estimate in terms of a first order modulus of smoothness: If the order of approximation of a continuous function f by such piecewise constant functions is o(1/n), then f itself is a constant, see [16, p. 366]. In fact, the idea behind Debao's proof is that sigmoid functions can asymptotically be seen as Heaviside functions. One gets arbitrary step functions to approximate f by superposition of Heaviside functions. For quasi-interpolation operators based on the logistic activation function σ_l, Chen and Zhao proved similar estimates in [8] (cf. [2,3] for hyperbolic tangent). However, they only reach a convergence order of O(1/n^α) for α < 1. With respect to the error of best approximation, they prove E(Φ_{n,σ_l}, f) ≤ C ω_1(f, 1/n) by estimating with a polynomial of best approximation. Due to the different technique, constants are larger than in error bound (2).

If one takes additional properties of σ into account, higher convergence rates are possible. The continuous sigmoid cut function σ_c and the ReLU function σ_r lead to spaces Φ_{n,σ_c} and Φ_{n,σ_r} of continuous, piecewise linear functions. They consist of free knot spline functions of polynomial degree at most one with at most 2n or n knots, cf. [16, Section 12.8]. Spaces Φ_{n,σ_c} and Φ_{n,σ_r} include all continuous spline functions g on [0, 1] with polynomial degree at most one that have at most n − 1 simple knots. We show g ∈ Φ_{n,σ_c} for such a spline g with equidistant knots x_k = k/(n−1), 0 ≤ k ≤ n − 1, to obtain an error bound for n ≥ 2:

E(Φ_{n,σ_c}, f) ≤ C ω_2(f, 1/(n−1)) ≤ C̃ ω_2(f, 1/n).   (3)

One can also represent g by ReLU functions, i.e., g ∈ Φ_{n,σ_r}:

g(x) = g(0) σ_r(0 · x + 1) + \sum_{k=1}^{n−1} β_k σ_r(x − x_{k−1}),

with coefficients β_k (1 ≤ k ≤ n − 1) given by the slope of g on [x_0, x_1] for k = 1 and by the change of slope of g at the knot x_{k−1} for k > 1.

Section 2 deals with even higher order direct estimates. Similarly to (3), not only sup-norm bound (2) but also an L_p-bound, 1 ≤ p < ∞, for approximation with the Heaviside function σ_h can be obtained from the corresponding larger bound of fixed simple knot spline approximation. Each L_p[0, 1]-function that is constant between knots x_k = k/n, 0 ≤ k ≤ n, can be written as a linear combination of n translated Heaviside functions. Thus, [16, p. 225, Theorem 7.3 for δ = 1/n] yields for n ∈ N

E(Φ_{n,σ_h}, f)_p ≤ C ω_1(f, 1/n)_p.   (4)

Lower error bounds are much harder to obtain than upper bounds, cf. [42] for some results with regard to multilayer feedforward perceptron networks. Often, lower bounds are given using a (non-linear) Kolmogorov n-width W_n (cf. [41,45]) for a suitable function space X (of functions with certain smoothness) and norm ‖·‖:

W_n(X) := inf_{b_1,c_1,…,b_n,c_n ∈ R} sup_{f ∈ X} inf_{a_1,…,a_n ∈ R} ‖f − \sum_{k=1}^{n} a_k σ(b_k · + c_k)‖.

Thus, parameters b_k and c_k cannot be chosen individually for each function f ∈ X. Higher rates of convergence might occur if that becomes possible. There are three somewhat different types of sharpness results that might be able to show that the left-hand sides of Eqs. (2), (3), (4) or (9) and (16) in Sect. 2 do not converge to zero faster than the right-hand sides.
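The ReLU representation of an equidistant linear spline described above can be verified numerically. In the following sketch (our illustration; `relu_spline` is a hypothetical helper), the first coefficient encodes the initial slope and the remaining ones the slope changes at the knots, while the constant g(0) is realized as a ReLU term with zero inner weight:

```python
import numpy as np

def relu_spline(f, n, x):
    """Piecewise linear interpolant of f with knots k/(n-1) as a sum of n ReLU terms."""
    knots = np.linspace(0.0, 1.0, n)                # x_0, ..., x_{n-1}
    v = f(knots)
    slopes = np.diff(v) / np.diff(knots)            # slope on each subinterval
    beta = np.diff(slopes, prepend=0.0)             # initial slope, then slope changes
    g = v[0] * np.maximum(0.0 * x + 1.0, 0.0)       # constant term written via ReLU
    for bk, xk in zip(beta, knots[:-1]):
        g = g + bk * np.maximum(x - xk, 0.0)
    return g

x = np.linspace(0.0, 1.0, 1001)
f = lambda t: np.sin(3.0 * t)
print(np.max(np.abs(relu_spline(f, 20, x) - f(x))))
```

Consistent with (3), the printed uniform error behaves like ω_2(f, 1/n) for smooth f: doubling n roughly quarters it.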
The most far-reaching results would provide lower estimates of errors of best approximation in which the lower bound is a modulus of smoothness. In connection with direct upper bounds in terms of the same moduli, this would establish theorems similar to the equivalence between moduli of smoothness and K-functionals (cf. [16, theorem of Johnen, p. 177], [30]) in which the error of best approximation replaces the K-functional. Let σ be r-times continuously differentiable like σ_a or σ_l. Then a candidate estimate for f ∈ C[0, 1] would be

E(Φ_{n,σ}, f) ≥ C ω_r(f, 1/n).   (5)

A standard estimate based on (1) bounds ω_r(g, δ) ≤ δ^r ‖g^{(r)}‖_{B[0,1]} for g ∈ Φ_{n,σ}, but it is unlikely that one can somehow bound ‖g^{(r)}‖_{B[0,1]} because differentiation generates unbounded factors b_k^r, cf. [24]. In fact, we prove with Lemma 1 in Sect. 2 that (5) is not valid even if constant C is allowed to depend on f. A second class of sharpness results consists of inverse and equivalence theorems. Inverse theorems provide upper bounds for moduli of smoothness in terms of weighted sums of approximation errors. For pseudo-interpolation operators based on piecewise linear activation functions and B-splines (but not for errors of best approximation), [37] deals with an inverse estimate based on Bernstein polynomials.
An idea that does not work is to adapt the inverse theorem for best trigonometric approximation [16, p. 208]. Without considering effects related to interval endpoints in algebraic approximation, one gets a (wrong) candidate inequality

ω_r(f, 1/n) ≤ (C_r/n^r) \sum_{k=1}^{n} k^{r−1} E(Φ_{k,σ}, f).   (6)

By choosing f ≡ σ for a non-polynomial, r-times continuously differentiable activation function σ, the modulus on the left side of the estimate behaves like n^{−r}. But the errors of best approximation on the right side are zero. At least this can be cured by the additional expression (C_r/n^r) ‖f‖_{B[0,1]}. Typically, the proof of an inverse theorem is based on a Bernstein-type inequality that is difficult to formulate for the function spaces discussed here. The Bernstein inequality provides a bound for derivatives: if p_n is a trigonometric polynomial of degree at most n, then ‖p_n′‖_{B[0,2π]} ≤ n ‖p_n‖_{B[0,2π]}, cf. [16, p. 97]. The problem here is that differentiating aσ(bx + c) leads to a factor b that cannot be bounded easily. Indeed, we show for a large class of activation functions that (6) cannot hold, see (14). As noticed in [24], the inverse estimates of type (6) proposed in [51] and [53] are wrong.

Similar to inverse theorems, equivalence theorems (like (7) below) describe equivalent behavior of expressions of moduli of smoothness and expressions of approximation errors. Both inverse and equivalence theorems allow to determine smoothness properties, typically membership in Lipschitz classes or Besov spaces, from convergence rates. Such a property is proved in [13] for max-product neural network operators activated by sigmoidal functions. The relationship between the order of convergence of best approximation and Besov spaces is well understood for approximation with free knot spline functions and rational functions, see [16, Section 12.8], cf. [34]. The Heaviside activation function leads to free knot splines of polynomial degree 0, i.e., less than r = 1; cut and ReLU function correspond with polynomial degree less than r = 2. For σ being one of these functions, and for 0 < α < r, f ∈ L_p[0, 1], 1 ≤ p < ∞ (p = ∞ is excluded), k := 1 if α < 1 and k := 2 otherwise, and q := 1/(α + 1/p), there holds the equivalence (see [17]): f belongs to the Besov space B_q^α(L_q[0, 1]) if and only if

\sum_{n=1}^{∞} (1/n) [n^α E(Φ_{n,σ}, f)_p]^k < ∞.   (7)

However, such equivalence theorems might not be suited to obtain little-o results: Assume that E(Φ_{n,σ}, f)_p = 1/(n^β (ln(n+1))^{1/q}) = o(1/n^β); then the series in (7) converges for exactly the same smoothness parameters 0 < α < β as if E(Φ_{n,σ}, f)_p = 1/n^β, which is not o(1/n^β).

The third type of sharpness results is based on counterexamples. The present paper follows this approach to deal with little-o effects. Without further restrictions, counterexamples show that convergence orders cannot be faster than stated in terms of moduli of smoothness in (2), (3), (4) and the estimates in the following Sect. 2 for some activation functions. To obtain such counterexamples, a general theorem is introduced in Sect. 3. It is applied to neural network approximation in Sect. 4.
Unlike the counterexamples in this paper, counterexamples that do not focus on moduli of smoothness were recently introduced in Almira et al. [1] for continuous piecewise polynomial activation functions σ with finitely many pieces (cf. Corollary 1 below) as well as for rational activation functions (that we also briefly discuss in Sect. 4): Given an arbitrary sequence of positive real numbers (ε n ) ∞ n=1 with lim n→∞ ε n = 0, a continuous counterexample f is constructed such that E(Φ n,σ , f) ≥ ε n for all n ∈ N.

Higher Order Estimates
In this section, two upper bounds in terms of higher order moduli of smoothness are derived from known results. Proofs are given for the sake of completeness. If, e.g., σ is arbitrarily often differentiable on some open interval such that σ is not a polynomial on that interval, then it is known that E(Φ_{n,σ}, p_{n−1}) = 0 for all polynomials p_{n−1} of degree at most n − 1, i.e., p_{n−1} ∈ Π_n := {d_{n−1} x^{n−1} + ⋯ + d_1 x + d_0 : d_0, …, d_{n−1} ∈ R}, see [42, p. 157] and (15) below. Thus, upper bounds for polynomial approximation can be used as upper bounds for neural network approximation in connection with certain activation functions. Due to a corollary of the classical theorem of Jackson, the error of best approximation of f ∈ X_p[0, 1], 1 ≤ p ≤ ∞, by algebraic polynomials is bounded by the rth modulus of smoothness. For n ≥ r, we use Theorem 6.3 in [16, p. 220] that is stated for the interval [−1, 1]. However, by applying an affine transformation of [0, 1] to [−1, 1], we see that there exists a constant C independent of f and n such that

E(Π_n, f)_p := inf{ ‖f − p‖_{X_p[0,1]} : p ∈ Π_n } ≤ C ω_r(f, 1/n)_p.   (8)

Ritter proved an estimate in terms of a first order modulus of smoothness for approximation with nearly exponential activation functions in [43]. Due to (8), Ritter's proof can be extended in a straightforward manner to higher order moduli. The special case of estimating by a second order modulus is discussed in [51].
According to [43], a function σ : R → R is called "nearly exponential" iff for each ε > 0 there exist real numbers a, b, c, and d such that for all x ≤ 0

|a σ(bx + c) + d − e^x| < ε.

The logistic function fulfills this condition with a = 1/σ_l(c), b = 1, d = 0, and c < ln(ε), such that for x ≤ 0 there is 0 < e^x ≤ 1 and

|σ_l(x + c)/σ_l(c) − e^x| = e^x · e^c (1 − e^x)/(1 + e^{x+c}) ≤ e^c < ε.

Theorem 1 (cf. [43]). Let σ be a nearly exponential activation function and r ∈ N. Then, independently of n ≥ max{r, 2} (or n > r) and f ∈ C[0, 1], a constant C_r exists such that

E(Φ_{n,σ}, f) ≤ C_r ω_r(f, 1/n).   (9)

Proof. Let ε > 0. Due to (8), there exists a polynomial p_n ∈ Π_{n+1} of degree at most n such that

‖f − p_n‖_{B[0,1]} ≤ C ω_r(f, 1/n).   (10)

The Jackson estimate can be used to extend the proof given for r = 1 in [43]: Functions

h_α(x) := α(e^{x/α} − 1)

converge to x pointwise for α → ∞ due to the theorem of L'Hospital. Since the largest deviation of h_α(x) from x on [0, 1] is obtained at the endpoints 0 or 1, convergence of h_α(x) to x is uniform on [0, 1] for α → ∞. Thus lim_{α→∞} p_n(h_α(x)) = p_n(x) uniformly on [0, 1], and for the given ε we can choose α large enough to get

‖p_n − p_n ∘ h_α‖_{B[0,1]} ≤ ε.   (11)

Therefore, function f is approximated by an exponential sum of type

p_n(h_α(x)) = \sum_{k=0}^{n} β_k e^{kx/α}

within the bound C ω_r(f, 1/n) + ε. It remains to approximate the exponential sum by utilizing that σ is nearly exponential: one can choose a_k, b_k, c_k, d_k ∈ R such that for x ∈ [0, 1]

\sum_{k=0}^{n} |β_k| · |e^{kx/α} − (a_k σ(b_k x + c_k) + d_k)| ≤ ε.   (12)
By combining (10), (11) and (12), we get

E(Φ_{n+1,σ}, f) ≤ C ω_r(f, 1/n) + 2ε.

Since ε can be chosen arbitrarily, we obtain for n ≥ 2

E(Φ_{n,σ}, f) ≤ C ω_r(f, 1/(n−1)) ≤ 2^r C ω_r(f, 1/n) =: C_r ω_r(f, 1/n).   (13)

□

By choosing a = 1/α, b = 1, c = 0, and d = 1, the ELU activation function σ_e(x) obviously fulfills the condition to be nearly exponential: (1/α) σ_e(x) + 1 = e^x for x < 0. But its definition for x ≥ 0 plays no role.
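Both ingredients of the proof can be checked numerically: the "nearly exponential" bound for the logistic function stated before Theorem 1 and the uniform convergence h_α(x) = α(e^{x/α} − 1) → x. The following sketch is our own illustration under these reconstructed formulas:

```python
import numpy as np

sigma_l = lambda x: 1.0 / (1.0 + np.exp(-x))

# Nearly exponential property: a = 1/sigma_l(c), b = 1, d = 0, c = ln(eps)
x = np.linspace(-50.0, 0.0, 100001)
for eps in (1e-1, 1e-3, 1e-5):
    c = np.log(eps)
    err = np.max(np.abs(sigma_l(x + c) / sigma_l(c) - np.exp(x)))
    print(f"eps={eps:.0e}: sup error {err:.2e}")     # observed: below eps

# Uniform convergence of h_alpha(x) = alpha*(e^{x/alpha} - 1) to x on [0, 1]
t = np.linspace(0.0, 1.0, 1001)
for alpha in (10.0, 100.0, 1000.0):
    print(alpha, np.max(np.abs(alpha * (np.exp(t / alpha) - 1.0) - t)))  # -> 0
```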
Given a nearly exponential activation function, a lower bound (5) or an inverse estimate (6) with a constant C_r independent of f is not valid, see [24] for r = 2. Such inverse estimates were proposed in [51] and [53]. For x ≤ 0, one can uniformly approximate e^x arbitrarily well by assigning values to a, b, c and d in aσ(bx + c) + d; for the logistic function, even d = 0 can be chosen. Thus, for the function f(x) := e^{x−1}, the errors of best approximation on the right-hand side of (6) vanish, and (6) would imply

ω_r(f, 1/n) = 0,   (14)

which is obviously wrong for n → ∞. The same problem occurs with (5).
The "nearly exponential" property only fits with certain activation functions but it does not require continuity. For example, let h(x) be the Dirichlet function that is one for rational and zero for irrational numbers x. Activation function exp(x)(1+ exp(x)h(x)) is nowhere continuous but nearly exponential: For ε > 0 let c = ln(ε) and a = exp(−c), b = 1, then for x ≤ 0 But a bound can also be obtained from arbitrarily often differentiability. Let σ be arbitrarily often differentiable on some open interval such that σ is no polynomial on that interval. Then one can easily obtain an estimate in terms of the rth modulus from the Jackson estimate (8) by considering that polynomials of degree at most n − 1 can be approximated arbitrarily well by functions in Φ n,σ , see [42, Corollary 3.6, p. 157], cf. [33, Theorem 3.1]. The idea is to approximate monomials by differential quotients of σ. This is possible since This theorem can be applied to σ l but also to σ a and σ e . The preliminaries are also fulfilled for σ(x) := sin(x), a function that is obviously not nearly exponential.
Proof. Let ε > 0. As in the previous proof, there exists a polynomial p_n ∈ Π_{n+1} of degree at most n such that (10) holds. Due to [42, p. 157], there exists a function g ∈ Φ_{n+1,σ} with ‖p_n − g‖_{X_p[0,1]} ≤ ε. Since ε can be chosen arbitrarily, we get (16) via Eq. (13). □
Polynomials in the closure of approximation spaces can be utilized to show that a direct lower bound in terms of a (uniform) modulus of smoothness is not possible.
Lemma 1 (Impossible inverse estimate). Let activation function ϕ be given as in the preceding Theorem 2, and let r ∈ N. For each positive, monotonically decreasing sequence (α_n)_{n=1}^∞, α_n > 0, and each 0 < β < 1, a counterexample f_β ∈ C[0, 1] exists such that (for n → ∞) the errors of best approximation decay geometrically, E(Φ_{n,ϕ}, f_β) = O(β^n), whereas

lim sup_{n→∞} ω_r(f_β, 1/n)/α_n > 0.

In particular, even if the constant C = C_f > 0 may depend on f (but not on n), estimate (5), as proposed in a similar context in [51] for r = 2, does not apply: choose, e.g., α_n := 1/n.
Proof. We choose the parameters of [21, Theorem 2.1] as follows: Test elements are the monomials h_n(x) := x^n. We further set ϕ_n = τ_n = ψ_n = 1/n^r such that, with (1), the conditions on the smoothness of the test elements are satisfied. We compute an rth difference of x^n at the interval endpoint 1 to obtain the resonance condition, such that (17) and (19) are fulfilled. □

A Uniform Boundedness Principle with Rates
In this paper, sharpness results are proved with a quantitative extension of the classical uniform boundedness principle of Functional Analysis. Dickmeis, Nessel and van Wickern developed several versions of such theorems. We already used one of them in the proof of Lemma 1. An overview of applications in Numerical Analysis can be found in [23, Section 6]. The given paper is based on [20, p. 108]. This and most other versions require error functionals to be sub-additive. Let X be a normed space with norm ‖·‖_X. A functional T on X, i.e., T maps X into R, is said to be (non-negative-valued) sub-linear and bounded iff for all f, g ∈ X and c ∈ R

T(f + g) ≤ T(f) + T(g),  T(cf) = |c| T(f),  and  sup{ T(f) : ‖f‖_X ≤ 1 } < ∞.

The set of non-negative-valued sub-linear bounded functionals T on X is denoted by X^∼. Typically, errors of best approximation are (non-negative-valued) sub-linear bounded functionals: Let U ⊂ X be a linear subspace. The best approximation of f ∈ X by elements u ∈ U ≠ ∅ is defined as

E(U, f) := inf{ ‖f − u‖_X : u ∈ U }.

Unfortunately, function sets Φ_{n,σ} are not linear spaces, cf. [42, p. 151]. In general, from f, g ∈ Φ_{n,σ} one can only conclude f + g ∈ Φ_{2n,σ}, whereas cf ∈ Φ_{n,σ} for c ∈ R. Functionals of best approximation fulfill E(Φ_{n,σ}, cf) = |c| E(Φ_{n,σ}, f) and E(Φ_{n,σ}, f) ≤ ‖f‖_X, but there is no sub-additivity. However, it is easy to prove a similar condition: For each ε > 0 there exist elements u_{f,ε}, u_{g,ε} ∈ Φ_{n,σ} that fulfill ‖f − u_{f,ε}‖_X ≤ E(Φ_{n,σ}, f) + ε and ‖g − u_{g,ε}‖_X ≤ E(Φ_{n,σ}, g) + ε. Obviously, u_{f,ε} + u_{g,ε} ∈ Φ_{2n,σ}, such that also

E(Φ_{2n,σ}, f + g) ≤ ‖f + g − u_{f,ε} − u_{g,ε}‖_X ≤ E(Φ_{n,σ}, f) + E(Φ_{n,σ}, g) + 2ε.   (20)

In what follows, a quantitative extension of the uniform boundedness principle based on these conditions is presented. The conditions replace sub-additivity. Another extension of the uniform boundedness principle to non-sublinear functionals is proved in [19]. But this version of the theorem is stated for a family of error functionals with two parameters that has to fulfill a condition of quasi lower semi-continuity. Functionals S_δ measuring smoothness also do not need to be sub-additive there but have to fulfill a condition S_δ(f + g) ≤ B(S_δ(f) + S_δ(g)) for a constant B ≥ 1. This theorem does not consider replacement (20) for sub-additivity.
The aim is to discuss a sequence of remainders (that will be errors of best approximation) (E_n)_{n=1}^∞, E_n : X → [0, ∞). These functionals do not have to be sub-linear but instead have to fulfill, for all f, f_1, …, f_m ∈ X and constants c ∈ R,

E_n(cf) = |c| E_n(f),   (24)
E_{m·n}(f_1 + ⋯ + f_m) ≤ E_n(f_1) + ⋯ + E_n(f_m),   (25)
E_n(f) ≤ D_n ‖f‖_X,   (26)
E_{n+1}(f) ≤ E_n(f).   (27)

In the boundedness condition (26), D_n is a constant only depending on E_n but not on f.
Let μ : (0, ∞) → (0, ∞) be a positive function, and let ϕ : [1, ∞) → (0, ∞) be a strictly decreasing function with lim_{x→∞} ϕ(x) = 0. An additional requirement is that for each 0 < λ < 1 a point X_0 = X_0(λ) ≥ λ^{−1} and a constant C_λ > 0 exist such that

ϕ(λx) ≤ C_λ ϕ(x) for all x ≥ X_0.   (28)

Theorem 3. Let (E_n)_{n=1}^∞ fulfill conditions (24)–(27) on a normed space X, and let (S_δ)_{δ>0} ⊂ X^∼ be a family of sub-linear bounded functionals measuring smoothness. If there exist test elements h_n ∈ X such that for all n ∈ N with n ≥ n_0 ∈ N and δ > 0

‖h_n‖_X ≤ C_1,   (29)
S_δ(h_n) ≤ C_2 min{1, μ(δ)/ϕ(n)},   (30)
E_{4n}(h_n) ≥ c_3 > 0,   (31)

then for each abstract modulus of smoothness ω satisfying

lim_{δ→0+} δ/ω(δ) = 0,   (32)

there exists a counterexample f_ω ∈ X such that (δ → 0+, n → ∞)

S_δ(f_ω) = O(ω(μ(δ))) and E_n(f_ω) ≠ o(ω(ϕ(n))), i.e., lim sup_{n→∞} E_n(f_ω)/ω(ϕ(n)) > 0.

For example, (28) is fulfilled for the standard choice ϕ(x) = 1/x^α. The prerequisites of the theorem differ from the theorems of Dickmeis, Nessel, and van Wickern in conditions (24)–(27) that replace E_n ∈ X^∼. It also requires the additional constraint (28). For convenience, resonance condition (31) replaces E_n(h_n) ≥ c_3. Without much effort, (31) can be weakened to hold for infinitely many n only. The proof is based on a gliding hump and follows the ideas of [20, Section 2.2] (cf. [18]) for sub-linear functionals and the literature cited there. For the sake of completeness, the whole proof is presented although changes were required only for estimates that are affected by missing sub-additivity.

Sharpness
Free knot spline approximation with Heaviside, cut and ReLU functions provides first examples for the application of Theorem 3. Let S_n^r be the space of functions f for which n + 1 intervals ]x_k, x_{k+1}[, 0 = x_0 < x_1 < ⋯ < x_{n+1} = 1, exist such that f equals (potentially different) polynomials p of degree less than r on each of these intervals, i.e., p ∈ Π_r. No additional smoothness conditions are required at the knots.

Corollary 1 (Free knot spline approximation). Let r, r̃ ∈ N and 1 ≤ p ≤ ∞. For each abstract modulus of smoothness ω satisfying (32), there exists a counterexample f_ω ∈ X_p[0, 1] such that (δ → 0+, n → ∞)

ω_r(f_ω, δ)_p = O(ω(δ^r)) and lim sup_{n→∞} E(S_n^{r̃}, f_ω)_p / ω(n^{−r}) > 0.

Note that r and r̃ can be chosen independently. This corresponds with the Marchaud inequality for moduli of smoothness.
The following lemma helps in the proof of this and the next corollary. It is used to show the resonance condition of Theorem 3.
Lemma 2 (Resonance). Let N ∈ N, and let g : [0, 1] → R be a function for which points 0 = x_0 < x_1 < ⋯ < x_N = 1 exist such that on each interval I_k := ]x_k, x_{k+1}[, 0 ≤ k < N, either g(x) ≥ 0 for all x ∈ I_k or g(x) ≤ 0 for all x ∈ I_k holds. Then g can change its sign only at points x_k. Let h(x) := sin(2N · 2π · x). Then there exists a constant c > 0 that is independent of g and N such that for 1 ≤ p ≤ ∞

‖h − g‖_{X_p[0,1]} ≥ c.

The prerequisites on g are fulfilled if g is continuous with at most N zeroes.

Proof. We discuss the 2N intervals [k/(2N), (k+1)/(2N)], 0 ≤ k < 2N. Since g changes its sign at most N − 1 times inside ]0, 1[, at least N + 1 of these intervals do not contain a sign change of g in their interior. On such an interval [a, a + 1/(2N)], there holds h(x) = sin(2N · 2π · (x − a)), i.e., h runs through a full period. If g ≥ 0 on the interval, then |h − g| ≥ |h| on the half period on which h is negative, and vice versa. Thus, for 1 ≤ p < ∞,

‖h − g‖_{X_p[0,1]}^p ≥ (N + 1) · (1/(4πN)) ∫_0^π sin^p(u) du ≥ (1/(4π)) ∫_0^π sin^p(u) du =: c^p > 0.

For p = ∞, one directly obtains ‖h − g‖_{B[0,1]} ≥ 1 because h attains both values 1 and −1 on each interval on which g does not change its sign. □
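The sup-norm case of Lemma 2 can be probed numerically: a continuous piecewise linear function with N + 1 equidistant knots and random knot values changes its sign at most N times, so its uniform distance to h(x) = sin(2N · 2πx) cannot fall below 1 (up to grid resolution). This randomized check is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16
x = np.linspace(0.0, 1.0, 100001)
h = np.sin(2 * N * 2 * np.pi * x)

for trial in range(5):
    # continuous piecewise linear g with N+1 equidistant knots: at most N sign changes
    knots = np.linspace(0.0, 1.0, N + 1)
    g = np.interp(x, knots, rng.uniform(-2.0, 2.0, N + 1))
    print(np.max(np.abs(h - g)) >= 1.0 - 1e-3)   # True, up to grid resolution
```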
Proof of Corollary 1. Theorem 3 can be applied with the following parameters: X := X_p[0, 1], μ(δ) := δ^r, ϕ(x) := x^{−r}, S_δ(f) := ω_r(f, δ)_p, E_n(f) := E(S_n^{r̃}, f)_p, and test elements h_n(x) := sin(2N(n) · 2π · x) with N(n) := (4n + 1) · r̃. Whereas S_δ is a sub-linear, bounded functional, errors of best approximation E_n fulfill conditions (24), (25), (26), and (27). The test elements obviously satisfy condition (29): ‖h_n‖_{X_p[0,1]} ≤ 1 =: C_1. One obtains (30) because of (1):

ω_r(h_n, δ)_p ≤ C min{1, (N(n) δ)^r} ≤ C_2 min{1, μ(δ)/ϕ(n)}.

Let g ∈ S_{4n}^{r̃}; then g is composed of at most 4n + 1 polynomials on 4n + 1 intervals. On each of these intervals, g ≡ 0 or g has at most r̃ − 1 zeroes. Thus g can change its sign at the 4n interval borders and at zeroes of the polynomials, and g fulfills the prerequisites of Lemma 2 with N = (4n + 1) · r̃ > 4n + (4n + 1) · (r̃ − 1). Due to the lemma, ‖h_n − g‖_{X_p[0,1]} ≥ c > 0 independent of n and g. Since this holds true for all g, (31) is shown for c_3 = c. All preliminaries of Theorem 3 are fulfilled such that counterexamples exist as stated. □
Corollary 1 can be applied with respect to all activation functions σ belonging to the class of splines with fixed polynomial degree less than r̃ and a finite number of k knots because Φ_{n,σ} ⊂ S_{nk}^{r̃}. Since Φ_{n,σ_h} ⊂ S_n^1, Corollary 1 directly shows sharpness of (2) and (4) for the Heaviside activation function if one chooses r = r̃ = 1. Sharpness of (3) for cut and ReLU function follows for r = r̃ = 2 because Φ_{n,σ_c} ⊂ S_{2n}^2 and Φ_{n,σ_r} ⊂ S_n^2. However, the case ω(δ) = δ of maximum non-saturated convergence order is excluded by condition (32). We discuss this case for r = r̃. Then a simple counterexample is f_ω(x) := x^r: For each choice of coefficients d_0, …, d_{r−1} ∈ R, we can apply the fundamental theorem of algebra to find complex zeroes a_0, …, a_{r−1} ∈ C such that

x^r − \sum_{k=0}^{r−1} d_k x^k = \prod_{k=0}^{r−1} (x − a_k).

There exists an interval I := ]j(r + 1)^{−1}, (j + 1)(r + 1)^{−1}[ ⊂ [0, 1] such that the real parts of the complex numbers a_k are not in I for all 0 ≤ k < r. Let I_0 := ](j + 1/4)(r + 1)^{−1}, (j + 3/4)(r + 1)^{−1}[ ⊂ I. Then for all x ∈ I_0

\prod_{k=0}^{r−1} |x − a_k| ≥ \prod_{k=0}^{r−1} |x − Re(a_k)| ≥ (1/(4(r + 1)))^r.

This lower bound is independent of the coefficients d_k, such that the error of best sup-norm approximation of x^r by polynomials of degree less than r on I_0 is at least (4(r + 1))^{−r}. By an affine transformation, the corresponding error on an interval of length 1/(2n) is at least ((2n)^{−1}(4(r + 1))^{−1})^r. Each function g ∈ S_n^r is a polynomial of degree less than r on at least n of the 2n intervals ]j(2n)^{−1}, (j + 1)(2n)^{−1}[, j ∈ J ⊂ {0, 1, …, 2n − 1}. For j ∈ J:

‖x^r − g‖_{B[0,1]} ≥ sup{ |x^r − g(x)| : x ∈ ]j(2n)^{−1}, (j + 1)(2n)^{−1}[ } ≥ (1/(2n))^r (1/(4(r + 1)))^r.

Thus, E(S_n^r, x^r) ≠ o(1/n^r). In case of L_p-spaces, sharpness is demonstrated similarly by combining the lower estimates of all n subintervals. Although our counterexample is arbitrarily often differentiable, the convergence order is limited to n^{−r}. The reason is the definition of the activation function by piecewise polynomials. There is no such limitation for activation functions that are arbitrarily often differentiable on an interval without being a polynomial, see Theorem 2. Thus, neural networks based on smooth non-polynomial activation functions might approximate better if smooth functions have to be learned. Theorem 3 in [15] states for the Heaviside function that for each n ∈ N a function f_n ∈ C[0, 1] exists such that the error of best uniform approximation exactly equals ω_1(f_n, 1/(2(n + 1))). This is used to show optimality of the constant. Functions f_n might be different for different n; one does not get the condensed sharpness result of Corollary 1.
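For r = 2, the saturation discussed above can be observed numerically: the best uniform approximation of x² by a linear function on an interval of length ℓ has the equioscillation error ℓ²/8, so n free-knot pieces cannot do better than c/n². The following small sketch is our own illustration (`best_pw_linear_error` is a hypothetical helper; equal-length pieces suffice here because the second derivative of x² is constant):

```python
import numpy as np

def best_pw_linear_error(n, m=4001):
    """Best uniform error of piecewise linear approximation of x^2 with n pieces."""
    x = np.linspace(0.0, 1.0, m)
    f = x ** 2
    err = 0.0
    for k in range(n):
        mask = (x >= k / n) & (x <= (k + 1) / n)
        xs, fs = x[mask], f[mask]
        # secant minus parabola peaks at the midpoint; shifting the secant by
        # half of the peak yields the best uniform linear approximation
        sec = fs[0] + (fs[-1] - fs[0]) * (xs - xs[0]) / (xs[-1] - xs[0])
        err = max(err, np.max(sec - fs) / 2.0)
    return err

for n in (4, 16, 64):
    e = best_pw_linear_error(n)
    print(n, e, e * n ** 2)      # e * n^2 stays near 1/8: no o(1/n^2) decay
```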
Another relevant example of a spline of fixed polynomial degree with a finite number of knots is the square non-linearity (SQNL) activation function σ(x) := sign(x) for |x| > 2 and σ(x) := x − sign(x) · x²/4 for |x| ≤ 2. Because σ, restricted to each of the four sub-intervals of its piecewise definition, is a polynomial of degree at most two, we can choose r̃ = 3.
The proof of Corollary 1 is based on Lemma 2. This argument can also be used to deal with rational activation functions σ(x) = q_1(x)/q_2(x), where q_1, q_2 ≢ 0 are polynomials of degree at most ρ. Then non-zero functions g ∈ Φ_{n,σ} have at most ρn zeroes and ρn poles, such that sign changes can only occur at N := 2ρn points, i.e., g does not change its sign on each of N + 1 intervals. Thus, Corollary 1 can be extended to neural network approximation with rational activation functions in a straightforward manner.
Whereas the direct estimate (3) for cut and ReLU functions is based on linear best approximation, the counterexamples hold for non-linear best approximation. Thus, error bounds in terms of moduli of smoothness may not be able to express the advantages of non-linear free knot spline approximation in contrast to fixed knot spline approximation (cf. [45]). For an error measured in an L_p-norm with an order like n^{−α}, smoothness is only required in L_q, q := 1/(α + 1/p), see (7) and [16, p. 368].
Corollary 2 (Inverse tangent). Let σ = σ_a be the sigmoid function based on the inverse tangent function, r ∈ N, and 1 ≤ p ≤ ∞. For each abstract modulus of smoothness ω satisfying (32), there exists a counterexample f_ω ∈ X_p[0, 1] such that (δ → 0+, n → ∞)

ω_r(f_ω, δ)_p = O(ω(δ^r)) and lim sup_{n→∞} E(Φ_{n,σ_a}, f_ω)_p / ω(n^{−r}) > 0.

The corollary shows sharpness of the error bound of Theorem 2 applied to the arbitrarily often differentiable function σ_a.
Proof. Similarly to the proof of Corollary 1, we apply Theorem 3 with parameters μ(δ) := δ^r, ϕ(x) := x^{−r}, S_δ(f) := ω_r(f, δ)_p, E_n(f) := E(Φ_{n,σ_a}, f)_p, and test elements h_n(x) := sin(2N(n) · 2π · x) with N = N(n) := 8n, such that condition (29) is obvious and (30) can be shown by estimating the modulus in terms of the rth derivative of h_n with (1). Let g ∈ Φ_{4n,σ_a}, i.e.,

g(x) = \sum_{k=1}^{4n} a_k σ_a(b_k x + c_k), such that g′(x) = \sum_{k=1}^{4n} (a_k b_k/π) · 1/(1 + (b_k x + c_k)²) = s(x)/q(x),

where s(x) is a polynomial of degree 2(4n − 1), and q(x) is a polynomial of degree 8n. If g is not constant, then g′ has at most 8n − 2 zeroes, and g has at most 8n − 1 zeroes due to the mean value theorem (Rolle's theorem). In both cases, the requirements of Lemma 2 are fulfilled with N(n) = 8n > 8n − 1, such that ‖h_n − g‖_{X_p[0,1]} ≥ c > 0 independent of n and g. Since g can be chosen arbitrarily, (31) is shown with E_{4n}(h_n) ≥ c > 0. □
Whereas lower estimates for sums of n inverse tangent functions are easily obtained by considering O(n) zeroes of their derivatives, sums of n logistic functions (or hyperbolic tangent functions) might have an exponential number of zeroes. To illustrate the problem in the context of Theorem 3, let

g(x) = \sum_{k=1}^{4n} a_k σ_l(b_k x + c_k) ∈ Φ_{4n,σ_l}.   (50)

Using a common denominator, the numerator is a sum of type \sum_{k=1}^{m} α_k κ_k^x for some κ_k > 0 and m < 2^{4n}. According to [50], such a function has at most m − 1 < 16^n − 1 zeroes, or it equals the zero function. Therefore, an interval [k · 16^{−n}, (k + 1) · 16^{−n}] exists on which g does not change its sign. By using a resonance sequence h_n(x) := sin(16^n · 2πx), one gets E(Φ_{4n,σ_l}, h_n) ≥ 1. But the factor 16^n is by far too large: One has to choose ϕ(x) := 1/16^x and μ(δ) := δ to obtain a "counterexample" f_ω with

ω_1(f_ω, δ) = O(ω(δ)) and lim sup_{n→∞} E(Φ_{n,σ_l}, f_ω)/ω(16^{−n}) > 0.   (51)

The gap between the rates is obvious. The same difficulties do not only occur for the logistic function but also for other activation functions based on exp(x) like the softmax function σ_m(x) := log(exp(x) + 1). Similar to (15),

σ_l(x) = σ_m′(x) = lim_{h→0} (σ_m(x + h) − σ_m(x))/h.

Thus, sums of n logistic functions can be approximated uniformly and arbitrarily well by sums of differential quotients that can be written with 2n softmax functions. A lower bound for approximation with σ_m would also imply a similar bound for σ_l, and upper bounds for approximation with σ_l imply upper bounds for σ_m. With respect to the logistic function, a better estimate than (51) is possible. It can be condensed from a sequence of counterexamples that is derived in [39]. However, we show that the Vapnik-Chervonenkis dimension (VC dimension) of related function spaces can also be used to prove sharpness. This is a rather general approach since many VC dimension estimates are known.
Let X be a set and A a family of subsets of X. Throughout this paper, X can be assumed to be finite. One says that A shatters a set S ⊂ X if and only if each subset B ⊂ S can be written as B = S ∩ A with a set A ∈ A. The VC dimension VC-dim(A) is the largest cardinality of a set that is shattered by A. This general definition can be adapted to (non-linear) function spaces V that consist of functions g : X → R on a (finite) set X ⊂ R. By applying the Heaviside function σ_h, let

A := { {x ∈ X : σ_h(g(x)) = 1} : g ∈ V }.

Then the VC dimension of function space V is defined as VC-dim(V) := VC-dim(A). This is the largest number m ∈ N for which m points x_1, …, x_m ∈ X exist such that for each sign sequence s_1, …, s_m ∈ {−1, 1} a function g ∈ V can be found that fulfills

g(x_j) ≥ 0 if s_j = 1 and g(x_j) < 0 if s_j = −1, 1 ≤ j ≤ m,   (52)

cf. [6]. The VC dimension is an indicator for the number of degrees of freedom in the construction of V. Condition (52) is equivalent to σ_h(g(x_j)) = (1 + s_j)/2, 1 ≤ j ≤ m.

Corollary 3 (Sharpness via VC dimension). For a function τ : N → N, let grids G_n := {k/τ(n) : k ∈ {0, 1, …, τ(n)}} and spaces of restrictions V_{n,τ(n)} := {h : G_n → R : h(x) = g(x) for a function g ∈ V_n} be given, where the sets V_n of functions on [0, 1] are such that E_n(f) := E(V_n, f) fulfills conditions (24)–(27). Let function ϕ(x) be defined as in Theorem 3 such that (28) holds true. If, for a constant C > 0, function τ fulfills

VC-dim(V_{4n,τ(4n)}) < τ(4n)   (53)  and  τ(4n) ≤ C/ϕ(n)   (54)

for all n ≥ n_0 ∈ N, then for r ∈ N and each abstract modulus of smoothness ω satisfying (32), a counterexample f_ω ∈ C[0, 1] exists such that (δ → 0+, n → ∞)

ω_r(f_ω, δ) = O(ω(δ^r)) and lim sup_{n→∞} E_n(f_ω)/ω(ϕ(n)^r) > 0.

Proof. Let n ≥ n_0/4. Due to (53), a sign sequence s_0, …, s_{τ(4n)} ∈ {−1, 1} exists such that for each function g ∈ V_{4n} there is a point x_j = j/τ(4n) ∈ G_{4n} at which the sign pattern is not realized in the sense of (52). We utilize this sign sequence to construct resonance elements. It is well known that the auxiliary function

b(x) := exp(1 − 1/(1 − x²)) for |x| < 1, b(x) := 0 otherwise,

is arbitrarily often differentiable with b(0) = 1 and support [−1, 1]. We define

h_n(x) := \sum_{j=0}^{τ(4n)} s_j · b(2τ(4n) · (x − j/τ(4n))).

The interiors of the supports of the summands are non-overlapping, i.e., ‖h_n‖_{B[0,1]} ≤ 1, and because of (54), ‖h_n^{(r)}‖_{B[0,1]} ≤ (2τ(4n))^r ‖b^{(r)}‖_{B[−1,1]} ≤ C′ ϕ(n)^{−r}. Then conditions (29) and (30) are fulfilled due to the norms of h_n and its derivatives, cf. (1). Due to the initial argument of the proof, for each g ∈ V_{4n} at least one point x_j exists at which the sign of g differs from s_j = h_n(x_j), such that ‖h_n − g‖_{B[0,1]} ≥ |h_n(x_j) − g(x_j)| ≥ 1. This shows the resonance condition (31) with c_3 = 1. □
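Definition (52) can be made tangible with a brute-force check on a toy class: thresholded affine functions g(x) = ax + b realize all sign patterns on two points but cannot produce the alternating pattern on three increasing points, so their VC dimension on a grid is 2. The following sketch is our own illustration (parameter ranges are arbitrary, and scanning a finite set of slopes and offsets only approximates the class):

```python
import numpy as np
from itertools import product

def shatters(points, realizable_sign_patterns):
    return all(s in realizable_sign_patterns
               for s in product((-1, 1), repeat=len(points)))

def vc_dim_affine(grid):
    dim = 0
    for m in range(1, len(grid) + 1):
        pts = grid[:m]   # on a monotone grid, any m points behave alike for affine g
        patterns = set()
        for a in np.linspace(-50, 50, 201):
            for b in np.linspace(-50, 50, 201):
                g = a * pts + b
                patterns.add(tuple(1 if v >= 0 else -1 for v in g))  # condition (52)
        if shatters(pts, patterns):
            dim = m
        else:
            break
    return dim

print(vc_dim_affine(np.linspace(0.0, 1.0, 9)))   # prints 2
```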

Corollary 4 (Logistic function). Let σ = σ l be the logistic function and r ∈ N.
For each abstract modulus of smoothness ω satisfying (32), a counterexample f_ω ∈ C[0, 1] exists such that (δ → 0+, n → ∞)

ω_r(f_ω, δ) = O(ω(δ^r)) and lim sup_{n→∞} E(Φ_{n,σ_l}, f_ω)/ω(1/(n log₂(n + 1))^r) > 0.   (55)

The corollary extends the theorem of Maiorov and Meir for worst case approximation with sigmoid functions in the case p = ∞ to Lipschitz classes and one condensed counterexample (instead of a sequence), see [39, p. 99]. The sharpness estimate also holds in L_p[0, 1], 1 ≤ p < ∞. For all these spaces, one can apply Theorem 3 directly with the sequence of counterexamples constructed in [39, Lemma 7, p. 99]. Even more generally, Theorem 1 in [38] utilizes the pseudo-dimension, a generalization of the VC dimension, to provide bounded sequences of counterexamples in Sobolev spaces.
Thus, all prerequisites of Corollary 3 are shown. The corollary improves (51): There exists a counterexample f_ω ∈ C[0, 1] such that (see (9), (16))

E(Φ_{n,σ_l}, f_ω) ≤ C_r ω_r(f_ω, 1/n) = O(ω(1/n^r)), whereas E(Φ_{n,σ_l}, f_ω) ≠ o(ω(1/(n log₂(n + 1))^r)).

The proof is based on the O(n log₂(n)) estimate of the VC dimension in [6]. This requires functions to be defined on a finite grid. Without this prerequisite, the VC dimension is in Ω(n²), see [48, p. 235]. The referenced book also deals with the case that all weights are restricted to floating point numbers with a fixed number of bits. Then the VC dimension becomes bounded by O(n) without the need for the log-factor. However, the direct upper error bounds (9) and (16) are proved for real-valued weights only.
The preceding corollary is a prototype for proving sharpness based on known VC dimensions. Also at the price of a log-factor, the VC dimension estimate for radial basis functions in [6] or [46] can be used similarly in connection with Corollary 3 to construct counterexamples. The sharpness results for Heaviside, cut, ReLU and inverse tangent activation functions shown above for p = ∞ can also be obtained with Corollary 3 by proving that VC dimensions of corresponding function spaces Φ n,σ are in O(n) (whereas the result of [5] only provides an O(n log(n)) bound). This in turn can be shown by estimating the maximum number of zeroes like in the proof of the next corollary and in the same manner as in the proofs of Corollaries 1 and 2.
The problem of different rates in upper and lower bounds arises because different scaling coefficients b_k are allowed. In the case of uniform scaling, i.e., all coefficients b_k in (50) equal one value B = B(n) that may only depend on n, the situation changes. Since the quasi-interpolation operators behind (2), see [15], are defined using such uniform scaling, see [9, p. 172], the error bound (2) still holds for the sub-spaces

Φ_{n,σ_l}^{B(n)} := { \sum_{k=1}^{n} a_k σ_l(B(n) x + c_k) : a_k, c_k ∈ R } ⊂ Φ_{n,σ_l}.

Corollary 5 (Uniform scaling). This bound is sharp without a log-factor: For r ∈ N and each abstract modulus of smoothness ω satisfying (32), a counterexample f_ω ∈ C[0, 1] exists such that ω_r(f_ω, δ) = O(ω(δ^r)) and E(Φ_{n,σ_l}^{B(n)}, f_ω) ≠ o(ω(1/n^r)).
To prove the corollary, we apply the following lemma.

Lemma 3. Let the grid G_n := {k/τ(n) : k ∈ {0, 1, …, τ(n)}} and the space of restrictions V_{n,τ(n)} := {h : G_n → R : h(x) = g(x) for a function g ∈ V_n} be given as in Corollary 3. If VC-dim(V_{n,τ(n)}) ≥ τ(n), then there exists a function g ∈ V_n, g ≢ 0, with a set of at least τ(n)/2 zero points in [0, 1] such that g has non-zero function values between each two consecutive points of this set.
Proof of Corollary 5. We apply Corollary 3 with V_n := Φ_{n,σ_l}^{B(n)}, τ(n) := 2n, and ϕ(n) := 1/n, such that τ(4n) = 8n = 8/ϕ(n) shows (54). Assume VC-dim(V_{n,τ(n)}) ≥ τ(n); then, due to Lemma 3, there exists a non-zero function g(x) = \sum_{k=1}^{n} a_k σ_l(Bx + c_k) with at least τ(n)/2 = n zero points such that g does not vanish between consecutive zero points. Using a common denominator q(x), the numerator is a sum of type s(x) = \sum_{k=0}^{n−1} α_k (e^{−kB})^x, which has at most n − 1 zeroes or is the zero function, see [50]. Because of this contradiction to n zeroes, (53) is fulfilled.
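The zero bound from [50] that drives this proof is easy to probe numerically: sums of n exponentials with uniform scaling, s(x) = Σ_{k=0}^{n−1} α_k e^{−kBx}, admit at most n − 1 sign changes (a Descartes-type rule for exponential sums with distinct exponents). The following randomized check is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 6, 1.5
x = np.linspace(0.0, 1.0, 20001)
for trial in range(200):
    alpha = rng.normal(size=n)
    s = sum(a * np.exp(-k * B * x) for k, a in enumerate(alpha))
    sign_changes = int(np.sum(np.diff(np.sign(s)) != 0))
    assert sign_changes <= n - 1      # never violated: at most n-1 zeroes
print("at most", n - 1, "sign changes observed in all trials")
```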
By applying Lemma 2 for N (n) = n − 1 in connection with Theorem 3, one can also show Corollary 5 for L p -spaces, 1 ≤ p < ∞.
Linear VC dimension bounds were proved in [47] for radial basis function networks with uniform width (scaling) or uniform centers. Such bounds can be used with Corollary 3 to prove results that are similar to Corollary 5. Also, such a corollary can be shown for the ELU function σ_e. However, without the restriction b_k = B(n), piecewise superposition of exponential functions leads to O(n²) zeroes of sums of ELU functions. In combination with the direct estimates of Theorems 1 and 2, i.e., E(Φ_{n,σ_e}, f_ω) ≤ C_r ω_r(f_ω, 1/n), we directly obtain the following (improvable) result in a straightforward manner.

Corollary 6 (Coarse estimate for ELU activation). Let σ = σ_e be the ELU function and r ∈ N, n ≥ max{2, r} (see Theorem 1). For each abstract modulus of smoothness ω satisfying (32), there exists a counterexample f_ω ∈ C[0, 1] that fulfills ω_r(f_ω, δ) = O(ω(δ^r)) and

lim sup_{n→∞} E(Φ_{n,σ_e}, f_ω)/ω(1/n^{2r}) > 0.   (56)

Proof. To prove the existence of a function f_ω with ω_r(f_ω, δ) = O(ω(δ^r)) and (56), we apply Corollary 3 with V_n = Φ_{n,σ_e} and E_n(f) = E(Φ_{n,σ_e}, f), such that conditions (24)–(27) are fulfilled. For each function g ∈ V_n, the interval [0, 1] can be divided into at most n + 1 subintervals such that on the lth interval g equals a function g_l of type

g_l(x) = γ_l + δ_l x + \sum_{k=1}^{n} α_{l,k} exp(β_{l,k} x).
The derivative g_l′(x) = δ_l exp(0 · x) + \sum_{k=1}^{n} α_{l,k} β_{l,k} exp(β_{l,k} x) has at most n zeroes or equals the zero function according to [50]. Thus, due to the mean value theorem (or Rolle's theorem), g_l has at most n + 1 zeroes or is the zero function. By concatenating the functions g_l to g, one observes that g has at most (n + 1)² different zeroes such that g does not vanish between consecutive zero points. Let τ(n) := 8n² and ϕ(n) := 1/n², such that (54) holds true: τ(4n) = 128n² = 128/ϕ(n). If VC-dim(V_{n,τ(n)}) ≥ τ(n), then due to Lemma 3 and because n ≥ 2, there exists a function in Φ_{n,σ_e} with at least τ(n)/2 = (2n)² > (n + 1)² zeroes such that between consecutive zeroes, the function is not the zero function. This contradicts the previously determined number of zeroes, and (53) is fulfilled. □
Sums of n softsign functions x ↦ x/(1 + |x|) can be expressed piecewise by n + 1 rational functions that each have at most n zeroes. Thus, one also has to deal with O(n²) zeroes.
In terms of the (non-linear) Kolmogorov n-width, let X := Lip_r(α, C[0, 1]) be the Lipschitz class of functions f ∈ C[0, 1] with ω_r(f, δ) = O(δ^α). Then, for example, the condensed counterexamples f_α for piecewise linear or inverse tangent activation functions and p = ∞ imply

lim sup_{n→∞} n^α W_n(Lip_r(α, C[0, 1])) > 0.

The restriction to the univariate case of a single input node was chosen because of compatibility with most cited error bounds. However, the error of multivariate approximation with certain activation functions can be bounded by the error of best multivariate polynomial approximation, see the proof of Theorem 6.8 in [42, p. 176]. Thus, one can obtain estimates in terms of multivariate radial moduli of smoothness similar to Theorem 2 via [30, Corollary 4, p. 139]. Also, Theorem 3 can be applied in a multivariate context in connection with VC dimension bounds. First results are shown in report [25].
Without additional restrictions, a lower estimate for approximation with logistic function σ l could only be obtained with a log-factor in (55). Thus, either direct bounds (2) and (9) or sharpness result (55) can be improved slightly.