Function approximation by deep neural networks with parameters $\{0,\pm \frac{1}{2}, \pm 1, 2\}$

In this paper it is shown that $C_\beta$-smooth functions can be approximated by deep neural networks with ReLU activation function and with parameters in $\{0,\pm \frac{1}{2}, \pm 1, 2\}$. The $l_0$ and $l_1$ parameter norms of the considered networks are thus equivalent. The depth, width and number of active parameters of the constructed networks have, up to a logarithmic factor, the same dependence on the approximation error as networks with parameters in $[-1,1]$. In particular, this means that nonparametric regression estimation with the constructed networks attains the same convergence rate as with sparse networks with parameters in $[-1,1]$.


Introduction
The problem of function approximation with neural networks has been of great interest in mathematical research for the last several decades. Various results have been obtained that describe the approximation rates in terms of the structures of the networks and the properties of the approximated functions. One of the most remarkable results in this direction is the universal approximation theorem, which shows that even shallow (but sufficiently wide) networks can approximate continuous functions arbitrarily well (see [9] for an overview and possible proofs of the theorem). Also, in [6] it was shown that integrable functions can be approximated by networks of fixed width. Those networks, however, may need to be very deep to attain small approximation errors. Yet, from a pragmatic point of view, and, in particular, in statistical applications, allowing a very large number of network parameters may be impractical. The reason is that in this case controlling the complexity of approximant networks at an optimal rate becomes problematic. Complexities of classes of neural networks are usually described in terms of their covering numbers and entropies. Those two concepts also play an important role in various branches of statistics, such as regression analysis, density estimation and empirical processes (see, e.g., [3], [1], [8]). In particular, in regression estimation the following dichotomy usually comes up while selecting the class of functions from which the estimator will be chosen: on the one hand, the selected class of approximants should be "big" enough to approximate various non-trivial functions, and on the other hand it should have "small" enough entropy to attain good learning rates. Thus, the general problem is to obtain powerful classes of functions with well-controllable entropies. As to the expressive power of classes of neural networks, it has recently been shown ([10], [13]) that with properly chosen architecture the classes of sparse deep neural networks with
ReLU activation function can approximate smooth functions well. In particular, it is shown in [13] that $C_\beta$-smooth functions on $[0,1]^d$ can be $\varepsilon$-approximated by deep ReLU networks with $O(\varepsilon^{-d/\beta}\log_2(1/\varepsilon))$ active (nonzero) parameters. A similar result for sparse ReLU networks with parameters in $[-1,1]$ has been obtained in [10]. The number of active parameters $s$ in those networks is much smaller than the total number of network parameters, and the network depth $L$ depends logarithmically on the approximation error. Boundedness of the parameters of the networks constructed in [10] implies that the $\varepsilon$-entropy of the approximating networks is of order $O(sL^2\log_2(1/\varepsilon))$. The main advantages of this entropy bound are its logarithmic dependence on $1/\varepsilon$, which allows the covering radius $\varepsilon$ to be taken very small in applications, and its linear dependence on the sparsity $s$ and quadratic dependence on the depth $L$, both of which, as described above, can also be taken small. Using this entropy bound, it is then shown in [10] that if the regression function is a composition of Hölder smooth functions, then sparse neural networks with depth $L \sim \log_2 n$ and number of active parameters $s \sim n^{\frac{t}{2\beta+t}}\log_2 n$, where $\beta > 0$ and $t \geq 1$ depend on the structure and the smoothness of the regression function, attain the minimax optimal prediction error rate $n^{-\frac{2\beta}{2\beta+t}}$ (up to a logarithmic factor), where $n$ is the sample size. It would therefore be desirable to obtain a similar entropy bound for spaces of networks in which the above $l_0$ (sparsity) regularization is replaced by the more practically implementable $l_1$ regularization.
Networks with the $l_1$ norm of all parameters bounded by 1 are considered in [12]. As in those networks there are at most $1/\varepsilon^2$ parameters outside of the interval $(-\varepsilon^2, \varepsilon^2)$, an entropy bound of order $O((2/L)^{2L-1}/\varepsilon^2)$ has been obtained by setting the remaining parameters of the covering networks to 0. This bound, however, depends polynomially on $1/\varepsilon$, and it leads to a convergence rate of order $1/\sqrt{n}$ for regression estimation with $n$ given samples. As discussed in [12], the rate $1/\sqrt{n}$ is seemingly the best possible for $l_1$-regularized estimators. Alternative approaches to sparsifying neural networks using derivatives, iterative prunings and clipped $l_1$ penalties are given in [2], [4], [5] and [7].
To combine the advantages of both $l_0$ and $l_1$ regularizations, as well as to make the networks easier to encode, we consider networks with parameters in $\{0, \pm\frac{1}{2}, \pm 1, 2\}$. The $l_0$ and $l_1$ parameter norms of those networks can differ at most by a factor of 2, which, in particular, makes it possible to employ all the features induced by the sparsity of networks (including their entropy bounds) while imposing $l_1$ constraints on their parameters. Moreover, discretization of parameters allows us to calculate the exact number of networks (the 0-entropy) required to attain a given approximation rate. Importantly, the depth, the width and the number of active parameters of the approximant networks are equivalent to those of the networks constructed in [10]. Hence, for the considered networks the $l_0$ parameter regularization can be replaced by the $l_1$ parameter regularization, leading, up to a logarithmic factor, to the same statistical guarantees as in [10]. In our construction the parameters $\pm 1$ are used to add/subtract the nodes, change, if necessary, their signs and transfer them to the next layers. The parameters $\pm\frac{1}{2}$ and 2 are used to attain values of the form $k/2^j \in [-1,1]$, $j \in \mathbb{N}$, $k \in \mathbb{Z}$, which can get sufficiently close to any number from $[-1,1]$. Note that this could also be done using only the parameters $\pm\frac{1}{2}$ and 1; the latter, however, would require a larger depth and a bigger number of active nodes.
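To illustrate the dyadic-approximation idea behind the parameters $\pm\frac{1}{2}$ and 2, the following Python sketch (not part of the paper's construction; all function names are illustrative) checks that any $t \in [-1,1]$ lies within $2^{-j-1}$ of a dyadic rational $k/2^j$, and that such a value can be accumulated using only additions and multiplications by $\frac{1}{2}$:

```python
# A small numeric sketch (not the paper's network construction): any t in
# [-1, 1] is within 2**-(j+1) of a dyadic rational k / 2**j, and such a
# value can be accumulated using only additions and the factor 1/2.
def dyadic_approx(t, j):
    """Nearest dyadic rational k / 2**j to t, with k an integer."""
    k = round(t * 2 ** j)
    return k / 2 ** j

def horner_dyadic(bits, sign=1):
    """Evaluate sign * sum_i bits[i] / 2**(i+1) using only additions
    and multiplications by 1/2 (Horner's scheme); bits are the binary
    digits of the dyadic value, most significant first."""
    acc = 0.0
    for b in reversed(bits):
        acc = 0.5 * (acc + b)  # only the parameter 1/2 is used here
    return sign * acc

t, j = 0.7219, 10
approx = dyadic_approx(t, j)
assert abs(t - approx) <= 2 ** -(j + 1) + 1e-15
# reproduce approx from its binary digits with parameters in {0, 1/2, 1}
k = round(abs(approx) * 2 ** j)
bits = [(k >> (j - 1 - i)) & 1 for i in range(j)]
assert abs(horner_dyadic(bits, sign=1 if approx >= 0 else -1) - approx) < 1e-12
```

This is only a numeric check of the attainable values; the paper's networks realize such values with the parameters $\pm\frac{1}{2}$ and 2 inside ReLU layers.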
For $x, y \in \mathbb{R}$ we denote $x \vee y := \max\{x, y\}$ and $(x)_+ := \max\{0, x\}$. Also, to make them compatible for multiplication with preceding matrices, the vectors from $\mathbb{R}^d$ are, depending on the context, considered as matrices from $\mathbb{R}^{d \times 1}$ rather than $\mathbb{R}^{1 \times d}$.

Main result
Consider the set $\mathcal{F}(L, p)$ of neural networks with $L$ hidden layers and with ReLU activation function $\sigma(x) = 0 \vee x = (x)_+$ defined by
$$f(x) = W_L \sigma_{v_L} W_{L-1} \sigma_{v_{L-1}} \cdots W_1 \sigma_{v_1} W_0 x,$$
where $W_i \in \mathbb{R}^{p_{i+1} \times p_i}$ are weight matrices, $i = 0, \ldots, L$, $v_i \in \mathbb{R}^{p_i}$ are shift vectors, $i = 1, \ldots, L$, and $p = (p_0, p_1, \ldots, p_{L+1})$ is the width vector with $p_0 = d$. For a given shift vector $v = (v_1, \ldots, v_p)$ and a given input vector $y = (y_1, \ldots, y_p)$ the action of the shifted activation function $\sigma_v$ on $y$ is defined as
$$\sigma_v(y) = (\sigma(y_1 - v_1), \ldots, \sigma(y_p - v_p))^\top.$$
It is assumed that the network parameters (the entries of the matrices $W_i$ and of the shift vectors $v_i$) are all in $[-1, 1]$. For $s \in \mathbb{N}$ let $\mathcal{F}(L, p, s)$ be the subset of $\mathcal{F}(L, p)$ consisting of networks with at most $s$ nonzero parameters. In [10], Theorem 5, the following approximation of $\beta$-Hölder functions belonging to the ball $\mathcal{C}_d^\beta(K)$ by networks from $\mathcal{F}(L, p, s)$ is given (Theorem 1): for any function $f \in \mathcal{C}_d^\beta(K)$ and any integers $m \geq 1$ and $N \geq (\beta + 1)^d \vee (K + 1)e^d$, there exists a network $\tilde{f} \in \mathcal{F}(L, p, s)$ whose depth, maximal width and number of nonzero parameters are specified explicitly in [10], Theorem 5. The proof of the theorem is based on local sparse neural network approximation of Taylor polynomials of the function $f$.
Our goal is to attain an identical approximation rate for networks with parameters in $\{0, \pm\frac{1}{2}, \pm 1, 2\}$. In our construction we omit the shift vectors (by adding a coordinate 1 to the input vector $x$) and consider networks of the form
$$f(x) = W_L \sigma W_{L-1} \sigma \cdots W_1 \sigma W_0 x \qquad (1)$$
with weight matrices $W_i \in \mathbb{R}^{p_{i+1} \times p_i}$, $i = 0, \ldots, L$, and with width vector $p = (p_0, p_1, \ldots, p_{L+1})$, $p_0 = d$. In this case the ReLU activation function $\sigma$ acts coordinate-wise on the input vectors. Let $\tilde{\mathcal{F}}(L, p)$ be the set of networks of the form (1) with parameters in $\{0, \pm\frac{1}{2}, \pm 1, 2\}$. For $s \in \mathbb{N}$ let $\tilde{\mathcal{F}}(L, p, s)$ be the subset of $\tilde{\mathcal{F}}(L, p)$ of networks with at most $s$ nonzero parameters. We then have the following result (Theorem 2, stated below), with depth and number of nonzero parameters bounded in terms of $\Delta \leq 2\log_2(N^{\beta+d} K e^d)$ and $R \leq (2\beta)^d N$, where $L$, $p$ and $s$ are the same as in Theorem 1.
Let us now compare the above two theorems. First, the approximation errors in those theorems differ by a constant factor depending only on the input dimension $d$ (note that the values of $\beta$, $d$ and $K$ are assumed to be fixed). The depths and the numbers of nonzero parameters of the networks presented in Theorems 1 and 2 differ at most by $\log_2 N$ multiplied by a constant depending on $\beta$, $d$ and $K$, and the maximal widths of those networks differ at most by a constant factor $C(\beta, d, K)$. Thus, the architecture and the number of active parameters of the network given in Theorem 2 have, up to a logarithmic factor, the same dependence on the approximation error as the network given in Theorem 1.
Application to nonparametric regression. Consider a nonparametric regression model
$$Y_i = f_0(X_i) + \epsilon_i,$$
where $f_0$ is the unknown regression function that needs to be recovered from $n$ observed i.i.d. pairs $(X_i, Y_i)$, $i = 1, \ldots, n$. The standard normal noise variables $\epsilon_i$ are assumed to be independent of $X_i$. For a set of functions $\mathcal{F}$ from $[0, 1]^d$ to $[-F, F]$ and for an estimator $\hat{f} \in \mathcal{F}$ of $f_0$ define
$$\Delta_n(\hat{f}, f_0, \mathcal{F}) := \mathbb{E}_{f_0}\Big[\frac{1}{n}\sum_{i=1}^n (Y_i - \hat{f}(X_i))^2 - \inf_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2\Big].$$
The subscript $f_0$ indicates that the expectation is taken over the training data generated by our regression model, and $\Delta_n(\hat{f}, f_0, \mathcal{F})$ measures how close the estimator $\hat{f}$ is to the empirical risk minimizer. Let also
$$R(\hat{f}, f_0) := \mathbb{E}_{f_0}\big[(\hat{f}(X) - f_0(X))^2\big]$$
be the prediction error of the estimator $\hat{f}$, where $X \overset{\mathcal{D}}{=} X_1$ is independent of the training sample. The following oracle-type inequality (Lemma 2.1) is obtained in [10], Lemma 4: where $\mathcal{N}(\delta, \mathcal{F}, \|\cdot\|_\infty)$ is the covering number of $\mathcal{F}$ of radius $\delta$ taken with respect to the $\|\cdot\|_\infty$ distance of functions on $[0,1]^d$; here $c = c(\beta, d, F)$ is some constant. In order to apply Lemma 2.1 it remains to estimate the covering number $\mathcal{N}(\delta, \tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n), \|\cdot\|_\infty)$. Note, however, that since the parameters of networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$ belong to the discrete set $\{0, \pm\frac{1}{2}, \pm 1, 2\}$, we can calculate the exact number of networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$, or, in other words, we can upper bound the covering number of radius $\delta = 0$. Indeed, as there are at most $(\tilde{L}_n + 1)|\tilde{p}_n|_\infty^2$ parameters in the networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$, for a given $s$ there are at most $\binom{(\tilde{L}_n + 1)|\tilde{p}_n|_\infty^2}{s}$ ways to choose $s$ nonzero parameters. As each nonzero parameter can take one of the 5 values $\{\pm\frac{1}{2}, \pm 1, 2\}$, the total number of networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$ is bounded by
$$\sum_{s \leq \tilde{s}_n} \binom{(\tilde{L}_n + 1)|\tilde{p}_n|_\infty^2}{s} 5^s.$$
Together with (2) and Lemma 2.1, for the empirical risk minimizer $\hat{f}_n \in \arg\min_{f \in \tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)} \frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2$ we obtain the convergence rate (3), which coincides, up to a logarithmic factor, with the minimax estimation rate $n^{-\frac{2\beta}{2\beta+d}}$ of the prediction error for $\beta$-smooth functions.
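The counting argument above can be checked numerically. The following hedged sketch (illustrative names, not the paper's exact display) computes the bound $\sum_{k \leq s}\binom{T}{k}5^k$ for $T$ parameter slots and sparsity $s$, and verifies that its logarithm grows linearly in $s$ and only logarithmically in $T$:

```python
# Hedged sketch of the counting bound: with T parameter slots, at most s of
# them nonzero, and each nonzero parameter taking one of the 5 values
# {±1/2, ±1, 2}, the number of distinct networks is at most
#   sum_{k <= s} C(T, k) * 5**k.
from math import comb, log2

def count_bound(T, s):
    return sum(comb(T, k) * 5 ** k for k in range(s + 1))

# log2 of the bound is roughly s * log2(5 * T): logarithmic in the slot
# count T, linear in the sparsity s.
T, s = 10_000, 50
exact = log2(count_bound(T, s))
crude = s * log2(5 * T)  # envelope: each term C(T, k) * 5**k <= (5 * T)**k
assert exact <= crude + log2(s + 1)
```

This is what makes the 0-entropy of the discrete-parameter networks comparable to the $\varepsilon$-entropy of the sparse networks of [10].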
Remark 2.1. As the parameters of networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$ belong to $\{0, \pm\frac{1}{2}, \pm 1, 2\}$, then, instead of defining the sparsity constraint $\tilde{s}_n$ to be the maximal number of nonzero parameters, we could define $\tilde{s}_n$ to be an upper bound on the $l_1$ norm of all parameters of networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$. As the $l_0$ and $l_1$ parameter norms of networks from $\tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n)$ can differ at most by a factor of 2, this change of notation would lead to the same convergence rate as in (3).
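A minimal numeric check of the factor-of-2 relation used in this remark, assuming only that every nonzero parameter lies in $\{\pm\frac{1}{2}, \pm 1, 2\}$:

```python
# Every nonzero parameter has absolute value in [1/2, 2], hence for any
# parameter vector over {0, ±1/2, ±1, 2}:
#   l0 / 2 <= l1 <= 2 * l0.
# Exhaustive check over short parameter vectors.
import itertools

values = [0.0, 0.5, -0.5, 1.0, -1.0, 2.0]
for params in itertools.product(values, repeat=5):
    l0 = sum(p != 0 for p in params)
    l1 = sum(abs(p) for p in params)
    assert l0 / 2 <= l1 <= 2 * l0
```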

Proofs
One of the ways to approximate functions by neural networks is based on the neural network approximation of local Taylor polynomials of those functions (see, e.g., [10], [13]). In this procedure, approximation of the product $xy$ given the input $(x, y)$ becomes crucial. The latter is usually done by representing the product $xy$ as a linear combination of functions that can be approximated by network-implementable functions. For example, the approximation algorithm presented in [10] is based on the approximation of the function $g(x) = x(1-x)$, which then leads to an approximation of the product $xy$. The key observation is that the function $g(x)$ can be approximated by combinations of triangle waves, and the latter can easily be implemented by neural networks with ReLU activation function.
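A generic sketch of the triangle-wave idea (a Yarotsky-style variant, not the exact construction of [10] or [11]): composing the tent map produces sawtooth waves whose weighted sum approximates $x^2$ on $[0,1]$ with error at most $4^{-(m+1)}$, and products then follow by polarization, $xy = \frac{1}{2}((x+y)^2 - x^2 - y^2)$.

```python
# Yarotsky-style sketch (illustrative, not the paper's network): composing
# the tent map T(x) = 2x for x < 1/2, 2(1 - x) otherwise, gives sawtooth
# waves g_k = T∘...∘T (k times), and on [0, 1]
#   x**2 ≈ x - sum_{k=1}^m g_k(x) / 4**k,   error <= 4**-(m+1).
def tent(x):
    return 2 * x if x < 0.5 else 2 * (1 - x)

def sq_approx(x, m):
    s, g = x, x
    for k in range(1, m + 1):
        g = tent(g)       # k-fold composition of the tent map
        s -= g / 4 ** k   # subtract the scaled sawtooth wave
    return s

m = 8
err = max(abs(sq_approx(i / 1000, m) - (i / 1000) ** 2) for i in range(1001))
assert err <= 4 ** -(m + 1) + 1e-15
```

Both the tent map and the partial sums are exactly representable by small ReLU networks, which is what the triangle-wave observation above exploits.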
In the proof of Theorem 1, the neural network approximation of the function $(x, y) \mapsto xy$ is followed by approximation of the product $(x_1, \ldots, x_r) \mapsto \prod_{j=1}^r x_j$, which then leads to approximation of monomials of degree up to $\beta$. The result then follows by local approximation of Taylor polynomials of $f$. Below we show that all those approximations can also be performed using only the parameters $\{0, \pm\frac{1}{2}, \pm 1, 2\}$.
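The pairing scheme for $\prod_{j=1}^r x_j$ can be sketched as follows (illustrative code, with exact products in place of the network approximant $\mathrm{Mult}_m$):

```python
# Sketch of the pairing scheme: neighbouring entries are multiplied
# pairwise, and the procedure is repeated ceil(log2(r)) times until a
# single entry remains.  Exact multiplication stands in for Mult_m.
from math import prod, ceil, log2

def product_tree(xs):
    rounds = 0
    while len(xs) > 1:
        if len(xs) % 2:          # odd length: pad with a neutral entry 1
            xs = xs + [1.0]
        xs = [xs[i] * xs[i + 1] for i in range(0, len(xs), 2)]
        rounds += 1
    return xs[0], rounds

xs = [0.9, 0.5, 0.25, 0.8, 0.6]
value, rounds = product_tree(xs)
assert abs(value - prod(xs)) < 1e-12
assert rounds == ceil(log2(len(xs)))   # q = ceil(log2 r) rounds
```

In the actual construction each pairwise product is replaced by $\mathrm{Mult}_m$, so the depth grows by a factor of $q = \lceil \log_2 r \rceil$ while the error accumulates over the $q$ rounds.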
Proof. Consider the functions where $T^+(x) := (x/2)_+$ and $T_k^-(x) := (x - 2^{1-2k})_+$. In [11], Lemma A.1, it is shown that for the functions and for any positive integer $m$, where $g(x) = x(1-x)$. Taking into account (4) and (8), we need to construct a network that computes Let us first construct a network $N_m$ with depth $2m$, width 4 and weights in $\{0, \pm\frac{1}{2}, \pm 1\}$ that computes For this goal, we modify the network presented in [11], Fig. 2, to ensure that the parameters are all in $\{0, \pm\frac{1}{2}, \pm 1\}$. More explicitly, denote where each of the mutually succeeding matrices $A$ and $B$ appears in the above representation $m$ times. Using parameters $\{0, \pm\frac{1}{2}, \pm 1\}$, for a given input $(1, x, y)$ the first two layers of the network $\mathrm{Mult}_m$ compute the vector (note that, as in our construction we omit shift vectors, throughout the whole construction we will keep the first coordinate equal to 1). We then apply the network $N_m$ to the first and last four coordinates of the above vector that follow the first coordinate 1. We thus obtain a network with $2m + 2$ hidden layers and of width 9 that computes Finally, the last two layers of $\mathrm{Mult}_m$ compute $(1, u, v) \mapsto (1 - (1 - (u - v))_+)_+$ applied to the vector obtained in (10) (note that this computation only requires the parameters 0 and $\pm 1$). We thus get a network $\mathrm{Mult}_m$ computing (9), and the inequality (5) follows by combining (4) and (8).

Proof. In order to approximate the product $\prod_{i=1}^r x_i$ we first pair the neighbouring entries to get the triples $(1, x_k, x_{k+1})$ and apply the previous lemma to each of those triples to obtain the values $\mathrm{Mult}_m(1, x_k, x_{k+1})$. We repeat this procedure $q := \lceil \log_2 r \rceil$ times, until there is only one entry left. As pairing the entries requires only the parameters 0 and 1, it follows from the previous lemma that the parameters of the constructed network are in $\{0, \pm\frac{1}{2}, \pm 1\}$. Using Lemma 3.1 and applying the inequality

For $\gamma > 0$ let $C_{d,\gamma}$ denote the number of $d$-dimensional monomials $x^\alpha$ of degree $|\alpha| < \gamma$. Note that $C_{d,\gamma}$
$< (\gamma + 1)^d$. From Lemma 3.2 it follows that, using weights $\{0, \pm\frac{1}{2}, \pm 1\}$, we can simultaneously approximate monomials up to degree $\gamma$ (see also [11], Lemma A.4):

Lemma 3.3. There exists a network $\mathrm{Mon}^d_{m,\gamma} \in \tilde{\mathcal{F}}(L, p)$ with $L \leq (2m + 5)\lceil \log_2(\gamma \vee 1) \rceil + 1$,

We now present the final stage of the approximation, that is, the local approximation of Taylor polynomials of $f$.

Proof of Theorem 2. For a given $N$ let $\tilde{N} \geq N$ be the smallest integer with $\tilde{N} = (2^\nu + 1)^d$ for some $\nu \in \mathbb{N}$. Note that $\tilde{N}/2^d \leq N \leq \tilde{N}$. We are going to apply Theorem 1 with $N$ in the condition of that theorem replaced by $\tilde{N}$. For $a \in [0, 1]^d$ let be the partial sum of the Taylor series of $f$ around $a$. Choose $M$ to be the largest integer such that $(M + 1)^d \leq \tilde{N}$, that is, $M = 2^\nu$, and consider the set of $(M+1)^d$ grid points $D(M)$, where As in our construction we only use parameters $\{0, \pm\frac{1}{2}, \pm 1, 2\}$, we need to modify the coefficients given in (11) to make them implementable by those parameters. Denote $B := \lfloor 2Ke^d \rfloor$ and let $b \in \mathbb{N}$ be the smallest integer with $2^b \geq BM^\beta(\beta+1)^d$. As $|c_{a,\gamma}| < B$ ([11], eq. 34), for each $c_{a,\gamma}$ there is an integer $k$ As the number of monomials of degree up to $\beta$ is bounded by $(\beta + 1)^d$, then Also, as Thus, defining we get that In the proof of Theorem 1, the neural network approximation of the function $(x_1, \ldots, x_r) \mapsto \prod_{j=1}^r x_j$ is first constructed, followed by approximation of monomials of degree up to $\beta$. The result then follows by approximating the function $P^\beta f(x)$ and applying (12). In the latter approximation the set of parameters not belonging to $\{0, \pm\frac{1}{2}, \pm 1\}$ consists of:
• the shift coordinates $j/M$, $j = 1, \ldots, M-1$ (the grid points);
• at most $(\beta(M + 1))^d$ weight matrix entries of the form $c_{x_\ell,\gamma}/B$, where $c_{x_\ell,\gamma}$ are the coefficients of the polynomial $P^\beta_{x_\ell} f(x)$, $x_\ell \in D(M)$, $|\gamma| < \beta$;
• a shift coordinate $1/(2M^d)$ (used to scale the output entries).
Note that the above list gives at most $D := M + (\beta(M + 1))^d$ different parameters. Taking into account (13), we can use $\tilde{P}^\beta f$ instead of $P^\beta f$ to approximate $f$. Thus, we can replace the entries $c_{x_\ell,\gamma}/B$ by the entries $\tilde{c}_{x_\ell,\gamma}/B = \frac{k}{2^b}$, where $k$ is some integer from $[-2^b, 2^b]$. Also, as $M = 2^\nu$, then, denoting $\Delta = \max\{\nu d + 1, b\}$, we need to obtain $D$ parameters from the set $S = \{\frac{k}{2^\Delta},\ k \in \mathbb{Z} \cap (0, 2^\Delta]\}$. As any natural number can be represented as a sum of powers of 2, for any $y_1, \ldots, y_D \in \mathbb{Z} \cap (0, 2^\Delta]$ we can compute

Theorem 2. For any function $f \in \mathcal{C}_d^\beta(K)$ and any integers $m \geq 1$ and $N \geq (\beta + 1)^d \vee (K + 1)e^d$, there exists a network $\tilde{f} \in \tilde{\mathcal{F}}(\tilde{L}, \tilde{p}, \tilde{s})$ with depth

Taking in Theorem 2 $r = d$, $m = \lceil \log_2 n \rceil$ and $N = n^{\frac{d}{2\beta+d}}$, we get the existence of a network $\tilde{f}_n \in \tilde{\mathcal{F}}(\tilde{L}_n, \tilde{p}_n, \tilde{s}_n)$ with $\tilde{L}_n \leq c \log_2 n$, $|\tilde{p}_n|_\infty \leq c n^{\frac{d}{2\beta+d}}$ and $\tilde{s}_n \leq c n^{\frac{d}{2\beta+d}} \log_2 n$ such that