Function Approximation by Deep Neural Networks with Parameters $\{0,\pm \frac{1}{2}, \pm 1, 2\}$

In this paper, it is shown that $C_\beta$-smooth functions can be approximated by deep neural networks with ReLU activation function and with parameters in $\{0,\pm \frac{1}{2}, \pm 1, 2\}$. The $l_0$ and $l_1$ parameter norms of the considered networks are thus equivalent. The depth, the width and the number of active parameters of the constructed networks have, up to a logarithmic factor, the same dependence on the approximation error as networks with parameters in $[-1,1]$. In particular, this implies that nonparametric regression estimation with the constructed networks attains, up to logarithmic factors, the same minimax convergence rates as with sparse networks with parameters in $[-1,1]$.


Introduction
The problem of function approximation with neural networks has been of great interest in mathematical research for the last several decades. Various results have been obtained that describe the approximation rates in terms of the structures of the networks and the properties of the approximated functions. One of the most remarkable results in this direction is the universal approximation theorem, which shows that even shallow (but sufficiently wide) networks can approximate continuous functions arbitrarily well (see [9] for an overview and possible proofs of the theorem). Also, in [6] it was shown that integrable functions can be approximated by networks of fixed width. Those networks, however, may need to be very deep to attain small approximation errors.
Yet, from a pragmatic point of view, and, in particular, in statistical applications, allowing a very large number of network parameters may be impractical. The reason is that in this case controlling the complexity of the approximant networks at an optimal rate becomes problematic. The complexities of classes of neural networks are usually described in terms of their covering numbers and entropies. Those two concepts also play an important role in various branches of statistics, such as regression analysis, density estimation and the theory of empirical processes (see, e.g., [1,3,8]). In particular, in regression estimation the following dichotomy usually comes up when selecting the class of functions from which the estimator will be chosen: on the one hand, the selected class of approximants should be "big" enough to approximate various non-trivial functions, and on the other hand, it should have "small" enough entropy to attain good learning rates. Thus, the general problem is to obtain powerful classes of functions with well-controllable entropies.
As for the expressive power of classes of neural networks, it has recently been shown ([10,13]) that, with properly chosen architecture, classes of sparse deep neural networks with ReLU activation function can approximate smooth functions well. In particular, it is shown in [13] that $C_\beta$-smooth functions on $[0,1]^d$ can be $\varepsilon$-approximated by deep ReLU networks with $O(\varepsilon^{-d/\beta}\log_2(1/\varepsilon))$ active (nonzero) parameters. A similar result for sparse ReLU networks with parameters in $[-1,1]$ has been obtained in [10]. The number of active parameters $s$ in those networks is much smaller than the total number of network parameters, and the network depth $L$ depends logarithmically on the approximation error. Boundedness of the parameters of the networks constructed in [10] implies that the $\varepsilon$-entropy of the approximating networks is of order $O(sL^2\log_2(1/\varepsilon))$. The main advantages of this entropy bound are its logarithmic dependence on $1/\varepsilon$, which allows the covering radius $\varepsilon$ to be taken very small in applications, and its linear dependence on the sparsity $s$ and quadratic dependence on the depth $L$, both of which, as described above, can also be taken to be small. Using this entropy bound, it is then shown in [10] that if the regression function is a composition of Hölder smooth functions, then sparse neural networks with depth $L \asymp \log_2 n$ and number of active parameters $s \sim n^{\frac{t}{2\beta+t}}\log_2 n$, where $\beta>0$ and $t\ge 1$ depend on the structure and the smoothness of the regression function, attain the minimax optimal prediction error rate $n^{-\frac{2\beta}{2\beta+t}}$ (up to a logarithmic factor), where $n$ is the sample size. It would therefore be desirable to obtain a similar entropy bound for spaces of networks in which the above $l_0$ (sparsity) regularization is replaced by the more practically implementable $l_1$ regularization.
Networks with the $l_1$ norm of all parameters bounded by 1 are considered in [12]. As in those networks there are at most $1/\varepsilon^2$ parameters outside of the interval $(-\varepsilon^2,\varepsilon^2)$, an entropy bound of order $O((2/L)^{2L-1}/\varepsilon^2)$ has been obtained by setting the remaining parameters to 0 in the covering networks. This bound, however, depends polynomially on $1/\varepsilon$, and it leads to a convergence rate of order $1/\sqrt{n}$ for regression estimation with $n$ given samples. As discussed in [12], the rate $1/\sqrt{n}$ is seemingly the best possible for $l_1$-regularized estimators. Alternative approaches to sparsifying neural networks using derivatives, iterative pruning and clipped $l_1$ penalties are given in [2,4,5] and [7].
To combine the advantages of both $l_0$ and $l_1$ regularizations, as well as to make the networks easier to encode, we consider networks with parameters in $\{0,\pm\frac{1}{2},\pm 1,2\}$. The $l_0$ and $l_1$ parameter norms of those networks can differ at most by a factor of 2, which, in particular, makes it possible to employ all the features induced by the sparsity of networks (including their entropy bounds) while imposing $l_1$ constraints on their parameters. Moreover, the discretization of parameters allows us to calculate the exact number of networks (the 0-entropy) required to attain a given approximation rate. The latter, in turn, reduces the problem of selecting the estimator of an unknown regression function to a simple and straightforward minimization over a finite set of candidates (see the sketch below). Importantly, the depth, the width and the number of active parameters of the approximant networks are equivalent to those of the networks constructed in [10]. Hence, for the considered networks the $l_0$ parameter regularization can be replaced by the $l_1$ parameter regularization, leading, up to a logarithmic factor, to the same statistical guarantees as in [10]. In our construction, the parameters $\pm 1$ are used to add/subtract the nodes, change, if necessary, their signs and transfer them to the next layers. The parameters $\pm\frac{1}{2}$ and 2 are used to attain values of the form $k/2^j\in[-1,1]$, $j\in\mathbb{N}$, $k\in\mathbb{Z}$, which can get sufficiently close to any number from $[-1,1]$. Note that this could also be done using only the parameters $\pm\frac{1}{2}$ and 1; the latter, however, would require a larger depth and a bigger number of active nodes.
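As an illustration of this reduction to a finite set of candidates, the following Python sketch fits toy regression data by empirical risk minimization over a small, hand-picked finite family of functions. The data-generating function, noise level and candidate family are arbitrary choices made for illustration only and are not taken from the construction in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a regression model Y_i = f0(X_i) + eps_i (all choices arbitrary).
f0 = lambda x: np.sin(2 * np.pi * x) / 2
X = rng.uniform(0.0, 1.0, 200)
Y = f0(X) + 0.1 * rng.standard_normal(200)

# A small finite family of candidate functions stands in for the finite class of
# networks with parameters restricted to {0, +-1/2, +-1, 2}.
slopes = (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)
candidates = [lambda x, a=a: a * (x - 0.5) for a in slopes]

# Estimator selection reduces to minimizing the empirical risk over the finite set.
risks = [np.mean((Y - f(X)) ** 2) for f in candidates]
best = int(np.argmin(risks))
print("selected slope:", slopes[best], "empirical risk:", round(risks[best], 4))
```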
Notation. The notation $|v|_\infty$ is used for the $l_\infty$ norm of a vector $v\in\mathbb{R}^d$, and $\|f\|_{L^\infty[0,1]^d}$ denotes the sup norm of a function $f$ defined on $[0,1]^d$, $d\in\mathbb{N}$. For $x,y\in\mathbb{R}$, we denote $x\vee y:=\max\{x,y\}$ and $(x)_+:=\max\{0,x\}$. Also, to make them multiplicable with preceding matrices, the vectors from $\mathbb{R}^d$ are, depending on the context, considered as matrices from $\mathbb{R}^{d\times 1}$ rather than $\mathbb{R}^{1\times d}$.

Main Result
Consider the set $F(L,p)$ of neural networks with $L$ hidden layers, width vector $p$ and ReLU activation function, where for a shift vector $v=(v_1,\dots,v_p)$ and a given input vector $y=(y_1,\dots,y_p)$ the action of the shifted activation function $\sigma_v$ on $y$ is defined as $\sigma_v(y)=((y_1-v_1)_+,\dots,(y_p-v_p)_+)$. It is assumed that the network parameters, that is, the entries of the weight matrices $W_i$ and the coordinates of the shift vectors $v_i$, are all in $[-1,1]$. For $s\in\mathbb{N}$, let $F(L,p,s)$ be the subset of $F(L,p)$ consisting of networks with at most $s$ nonzero parameters. In [10], Theorem 5, the following approximation of $\beta$-Hölder continuous functions belonging to the ball $C^\beta_d([0,1]^d,K)$ by networks from $F(L,p,s)$ is given (Theorem 2.1), specifying the depth, the maximal width and the number of nonzero parameters of the approximating network. The proof of the theorem is based on local sparse neural network approximation of Taylor polynomials of the function $f$.
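For a computational view of this class, the following minimal Python sketch implements the forward pass of such a network, assuming the standard composition $f(x)=W_L\sigma_{v_L}W_{L-1}\cdots\sigma_{v_1}W_0x$ used in [10]; the toy weights below are arbitrary values from $\{0,\pm\frac{1}{2},\pm1,2\}$ and are not part of the construction.

```python
import numpy as np

def shifted_relu(y, v):
    """sigma_v(y) = ((y_1 - v_1)_+, ..., (y_p - v_p)_+)."""
    return np.maximum(y - v, 0.0)

def network(x, weights, shifts):
    """Forward pass of f(x) = W_L sigma_{v_L} W_{L-1} ... sigma_{v_1} W_0 x.

    weights: list of L + 1 matrices W_0, ..., W_L
    shifts:  list of L shift vectors v_1, ..., v_L
    """
    y = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        y = W @ shifted_relu(y, v)
    return y

# A toy network with one hidden layer and parameters from {0, +-1/2, +-1, 2}.
W0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])   # R^2 -> R^3
v1 = np.array([0.0, 0.5, 0.0])
W1 = np.array([[2.0, -1.0, 0.5]])                       # R^3 -> R
print(network(np.array([0.3, 0.7]), [W0, W1], [v1]))    # [0.4]
```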
Our goal is to attain an identical approximation rate for networks with parameters in $\{0,\pm\frac{1}{2},\pm1,2\}$. In our construction, we will omit the shift vectors (by adding a coordinate 1 to the input vector $x$) and will consider networks of the form
$$f(x)=W_L\,\sigma(W_{L-1}\,\sigma(\cdots\sigma(W_0(1,x)))),\qquad (1)$$
where $\sigma$ denotes the coordinatewise ReLU activation $(\cdot)_+$. The corresponding approximation result, Theorem 2.2, specifies the depth, the maximal width and the number of nonzero parameters of a network of the form (1) with parameters in $\{0,\pm\frac{1}{2},\pm1,2\}$ attaining the desired approximation error. Let us now compare the above two theorems. First, the approximation errors in those theorems differ by a constant factor depending only on the input dimension $d$. (Note that the values of $\beta$, $d$ and $K$ are assumed to be fixed.) The depths and the numbers of nonzero parameters of the networks presented in Theorems 2.1 and 2.2 differ at most by $\log_2 N$ multiplied by a constant depending on $\beta$, $d$ and $K$, and the maximal widths of those networks differ at most by a constant factor $C(\beta,d,K)$. Thus, the architecture and the number of active parameters of the network given in Theorem 2.2 have, up to a logarithmic factor, the same dependence on the approximation error as those of the network given in Theorem 2.1.
Application to nonparametric regression
Consider a nonparametric regression model $Y_i=f_0(X_i)+\epsilon_i$, $i=1,\dots,n$, with covariates $X_i\in[0,1]^d$ and an unknown regression function $f_0$. For an estimator $\hat{f}$ chosen from a class of functions $F$, let
$$\Delta_n(\hat{f},f_0,F):=E_{f_0}\Big[\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{f}(X_i))^2-\inf_{f\in F}\frac{1}{n}\sum_{i=1}^n(Y_i-f(X_i))^2\Big].$$
The subscript $f_0$ indicates that the expectation is taken over the training data generated by our regression model, and $\Delta_n(\hat{f},f_0,F)$ measures how close the estimator $\hat{f}$ is to the empirical risk minimizer. Let also
$$R(\hat{f},f_0):=E_{f_0}\big[(\hat{f}(X)-f_0(X))^2\big]$$
be the prediction error of the estimator $\hat{f}\in F$, where $X\stackrel{\mathcal{D}}{=}X_1$ is independent of the sample $(X_i,Y_i)$. Prediction errors are assessed by oracle inequalities, which, in turn, are usually given in terms of the covering numbers and entropies, defined below, of the function class $F$ from which the estimator is chosen.
For $\delta>0$, let $\mathcal{N}(\delta,F,\|\cdot\|_\infty)$ denote the minimal number of $\|\cdot\|_\infty$-balls of radius $\delta$ needed to cover $F$. The number $\log_2\mathcal{N}(\delta,F,\|\cdot\|_\infty)$ is then called the $\delta$-entropy of the set $F$.
The following oracle-type inequality, stated as Lemma 2.1, is obtained in [10], Lemma 4; in it, $c=c(\beta,d,F)$ is some constant. In order to apply Lemma 2.1, it remains to estimate the covering number $\mathcal{N}(\delta,F(L_n,p_n,s_n),\|\cdot\|_\infty)$. Note, however, that since the parameters of networks from $F(L_n,p_n,s_n)$ belong to the discrete set $\{0,\pm\frac{1}{2},\pm1,2\}$, we can calculate the exact number of networks from $F(L_n,p_n,s_n)$, or, in other words, we can upper-bound the covering number of radius $\delta=0$. Indeed, as there are at most $(L_n+1)|p_n|_\infty^2$ parameters in the networks from $F(L_n,p_n,s_n)$, for a given $s$ there are at most $\binom{(L_n+1)|p_n|_\infty^2}{s}$ ways to choose $s$ nonzero parameters. As the nonzero parameters can take one of the 5 values $\{\pm\frac{1}{2},\pm1,2\}$, the total number of networks from $F(L_n,p_n,s_n)$ is bounded by $\sum_{s\le s_n}\binom{(L_n+1)|p_n|_\infty^2}{s}5^s$. Together with (2) and Lemma 2.1, for the empirical risk minimizer we get the existence of a constant $C=C(\beta,d,F)$ such that the prediction error bound (3) holds, and this bound coincides, up to a logarithmic factor, with the minimax estimation rate $n^{-\frac{2\beta}{2\beta+d}}$ of the prediction error for $\beta$-smooth functions.
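As a small illustrative computation of this 0-entropy bound (with arbitrary toy values for $L_n$, $|p_n|_\infty$ and $s_n$ that are not taken from the paper), the following Python sketch evaluates the logarithm of the exact count and compares it with the cruder bound $\big(5(L_n+1)|p_n|_\infty^2\big)^{s_n+1}$.

```python
from math import comb, log2

def log2_num_networks(L, p_max, s_max):
    """log2 of the number of networks with at most s_max nonzero parameters,
    each taking one of the 5 values {+-1/2, +-1, 2}, placed in at most
    (L + 1) * p_max**2 parameter slots."""
    slots = (L + 1) * p_max ** 2
    total = sum(comb(slots, s) * 5 ** s for s in range(s_max + 1))
    return log2(total)

# Toy sizes: depth 10, maximal width 50, sparsity 200 (illustration only).
L, p_max, s_max = 10, 50, 200
exact = log2_num_networks(L, p_max, s_max)
crude = (s_max + 1) * log2(5 * (L + 1) * p_max ** 2)
print(f"0-entropy bound: {exact:.1f} bits, crude bound: {crude:.1f} bits")
```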

Remark 2.1
Note that if $s$ and $s_p$ denote, respectively, the $l_0$ and $l_p$ parameter norms of a network with parameters in $\{0,\pm\frac{1}{2},\pm1,2\}$, then $s^{1/p}/2\le s_p\le 2s^{1/p}$, $p>0$. Therefore, in the above application the same convergence rate as in (3) can be attained by replacing $F(L_n,p_n,s_n)$ with the subclass of $F(L_n,p_n)$ consisting of networks with $l_p$ parameter norms bounded by $s_n^{1/p}$. In particular, taking $p=1$ we get that both in Theorem 2.2 and in the application above the same approximation and prediction rates can be obtained by replacing the sparsity constraint with the $l_1$ network parameter regularization.
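A quick numerical sanity check of this norm equivalence (a sketch with randomly drawn parameter values; the parameter count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
values = np.array([0.0, 0.5, -0.5, 1.0, -1.0, 2.0])

for p in (0.5, 1.0, 2.0):
    w = rng.choice(values, size=1000)          # parameters of a hypothetical network
    s = np.count_nonzero(w)                    # l_0 "norm" (sparsity)
    s_p = np.sum(np.abs(w) ** p) ** (1.0 / p)  # l_p parameter norm
    assert s ** (1 / p) / 2 <= s_p <= 2 * s ** (1 / p)
    print(p, s, round(float(s_p), 2))
```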

Proofs
One of the ways to approximate functions by neural networks is based on the neural network approximation of local Taylor polynomials of those functions (see, e.g., [10,13]). In this procedure, approximation of the product $xy$ given the input $(x,y)$ becomes crucial. The latter is usually done by representing the product $xy$ as a linear combination of functions that can be approximated by network-implementable functions. For example, the approximation algorithm presented in [10] is based on the approximation of the function $g(x)=x(1-x)$, which then leads to an approximation of the product $xy$ (see (4)). The key observation is that the function $g(x)$ can be approximated by combinations of triangle waves, and the latter can be easily implemented by neural networks with ReLU activation function. In the proof of Theorem 2.1, the neural network approximation of the function $(x,y)\mapsto xy$ is followed by approximation of the product $(x_1,\dots,x_r)\mapsto\prod_{j=1}^r x_j$, which then leads to approximation of monomials of degree up to $\beta$. The result then follows by local approximation of Taylor polynomials of $f$. Below, we show that all those approximations can also be performed using only the parameters $\{0,\pm\frac{1}{2},\pm1,2\}$. As formalized in (1), in our constructions we add a coordinate 1 to the input vector in order to omit the shift vectors. To check the equivalence of those two approaches, suppose that a given hidden layer of a network is determined by a weight matrix $W\in\mathbb{R}^{d_1\times d_2}$ and a shift vector $v\in\mathbb{R}^{d_1}$, and let $\overline{W}\in\mathbb{R}^{d_1\times(d_2+1)}$ be the matrix obtained by appending the matrix $W$ to the column $-v$: $\overline{W}=(-v,W)$. Then, for a given input vector $y\in\mathbb{R}^{d_2}$ we have that $\sigma(\overline{W}(1,y))=\sigma_v(Wy)$, from which the desired equivalence follows.
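The following short Python check (with arbitrary random dimensions and weights, not part of the proof) verifies this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 4
W = rng.standard_normal((d1, d2))
v = rng.standard_normal(d1)
y = rng.standard_normal(d2)

relu = lambda z: np.maximum(z, 0.0)

# Absorb the shift vector v into the weight matrix by prepending the column -v
# and feeding the constant coordinate 1 together with the input.
W_bar = np.hstack([-v[:, None], W])                  # shape (d1, d2 + 1)
lhs = relu(W_bar @ np.concatenate(([1.0], y)))       # sigma(W_bar (1, y))
rhs = relu(W @ y - v)                                # sigma_v(W y)
assert np.allclose(lhs, rhs)
print(lhs)
```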
We start our constructions with Lemma 3.1, which shows that networks with parameters $\{0,\pm\frac{1}{2},\pm1\}$, depth $2m+4$ and width 9 can approximate the product $xy$ exponentially fast in $m$ (inequality (5)). Here $T^+(x):=(x/2)_+$ and $T^-_k(x):=(x-2^{1-2k})_+$. In [11], Lemma A.1, it is shown that the functions $R_k\colon[0,1]\to[0,2^{-2k}]$ defined there satisfy, for any positive integer $m$, the approximation bound (8), where $g(x)=x(1-x)$. Taking into account (4) and (8), we need to construct a network that computes (9). Let us first construct a network $N_m$ with depth $2m$, width 4 and weights in $\{0,\pm\frac{1}{2},\pm1\}$ that computes $\sum_{k=1}^m R_k$. For this goal, we modify the network presented in [11], Fig. 2, to ensure that the parameters are all in $\{0,\pm\frac{1}{2},\pm1\}$. More explicitly, the network $N_m$ is written as an alternating composition of two weight matrices $A$ and $B$ with entries in $\{0,\pm\frac{1}{2},\pm1\}$, where each of the mutually succeeding matrices $A$ and $B$ appears in the representation $m$ times. Using the parameters $\{0,\pm\frac{1}{2},\pm1\}$, for a given input $(1,x,y)$ the first two layers of the network $\mathrm{Mult}_m$ compute a vector with first coordinate 1. (Note that, as in our construction we omit shift vectors, throughout the whole construction we will keep the first coordinate equal to 1.) We then apply the network $N_m$ in parallel to the first four and the last four coordinates of the above vector that follow the first coordinate 1. We thus obtain a network with $2m+2$ hidden layers and of width 9 that computes the vector given in (10). Finally, the last two layers of $\mathrm{Mult}_m$ compute the map $(1,u,v)\mapsto(1-(1-(u-v))_+)_+$ applied to the vector obtained in (10). (Note that this computation only requires the parameters 0 and $\pm1$.) We thus get a network $\mathrm{Mult}_m$ computing (9), and the inequality (5) follows by combining (4) and (8).
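The precise definition of the functions $R_k$ and the bound (8) are given in [11], Lemma A.1, and are not reproduced above. The following Python sketch assumes the standard tent-map composition $R_k=(T^+-T^-_k)\circ\cdots\circ(T^+-T^-_1)$ built from the functions $T^+$ and $T^-_k$ defined above, and checks numerically that the truncated sum approaches $g(x)=x(1-x)$ at an exponential rate in $m$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def T_plus(u):
    return relu(u / 2)

def T_minus(u, k):
    return relu(u - 2.0 ** (1 - 2 * k))

def R(x, k):
    """Assumed composition R_k = (T^+ - T^-_k) o ... o (T^+ - T^-_1),
    mapping [0, 1] into [0, 2^{-2k}]."""
    u = x
    for j in range(1, k + 1):
        u = T_plus(u) - T_minus(u, j)
    return u

def g_approx(x, m):
    """Truncated triangle-wave expansion approximating g(x) = x(1 - x)."""
    return sum(R(x, k) for k in range(1, m + 1))

x = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6):
    err = np.max(np.abs(x * (1 - x) - g_approx(x, m)))
    print(m, err, 2.0 ** (-2 * m))   # observed error vs. the 2^{-2m} scale
```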
Using only the parameters 0 and 1 and having width at most $\gamma\,C_{d,\gamma}$, the first hidden layer of the network $\mathrm{Mon}^d_{m,\gamma}\in F(L,p)$ computes the vector given in (11), where the first coordinate 1 of the computed vector is the value of the monomial $x^{\alpha_1}$ corresponding to the zero exponent vector $\alpha_1$, and $\alpha_2,\dots,\alpha_{C_{d,\gamma}}$ are all the exponent vectors with $0<|\alpha_i|<\gamma$, $i=2,\dots,C_{d,\gamma}$. By Lemma 3.2, for each $i=2,\dots,C_{d,\gamma}$ there is a network $\mathrm{Mult}^{\alpha_i}_m\in F((2m+5)\log_2\gamma,9\gamma)$ approximating the monomial $x^{\alpha_i}$. The network $\mathrm{Mon}^d_{m,\gamma}$ is obtained by applying in parallel the networks $\mathrm{Mult}^{\alpha_i}_m$ to the components of the output vector (11) obtained in the first step, while leaving its first coordinate unchanged.
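The factor $\log_2\gamma$ in the depth of $\mathrm{Mult}^{\alpha_i}_m$ presumably reflects pairwise multiplication of the at most $\gamma$ factors of $x^{\alpha_i}$ in roughly $\log_2\gamma$ rounds, as in Lemma A.3 of [11]. The Python sketch below shows only this pairing scheme, with exact products standing in for the $\mathrm{Mult}_m$ subnetworks and padding by 1 playing the role of the unchanged coordinates.

```python
import numpy as np

def product_tree(xs):
    """Compute prod(xs) by pairwise multiplications in ceil(log2(len(xs))) rounds."""
    xs = list(xs)
    rounds = 0
    while len(xs) > 1:
        if len(xs) % 2:              # pad to even length with the neutral factor 1
            xs.append(1.0)
        xs = [xs[i] * xs[i + 1] for i in range(0, len(xs), 2)]
        rounds += 1
    return xs[0], rounds

x = [0.9, 0.8, 0.7, 0.6, 0.5]
val, rounds = product_tree(x)
print(val, rounds)                    # rounds == ceil(log2(5)) == 3
assert np.isclose(val, np.prod(x))
```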
Note that the depths and the widths of the networks presented in Lemmas 3.1, 3.2 and 3.3, which, respectively, approximate the product $xy$, the product $\prod_{i=1}^r x_i$ and the monomials of degree up to $\gamma$, are at most twice as large as the depths and widths of the corresponding networks constructed in Lemmas A.2, A.3 and A.4 in [11]. Thus, by enlarging the network sizes at most by a constant factor of 2 and using the parameters $\{0,\pm\frac{1}{2},\pm1,2\}$, we can achieve the same rates of approximation of monomials as with the networks with parameters in $[-1,1]$.
We now present the final stage of the approximation, that is, the local approximation of Taylor polynomials of f .

Proof of Theorem 2.2
For a given $N$, let $\widetilde{N}\ge N$ be the smallest integer with $\widetilde{N}=(2^\nu+1)^d$ for some $\nu\in\mathbb{N}$, and set $M:=2^\nu$, so that $\widetilde{N}=(M+1)^d$. Note that $\widetilde{N}/2^d\le N\le\widetilde{N}$. We are going to apply Theorem 2.1 with $N$ in the condition of that theorem replaced by $\widetilde{N}$. For $a\in[0,1]^d$, let $c_{a,\gamma}$ denote the Taylor coefficients of $f$ at $a$ entering the definition (12) of the approximating polynomial $P^\beta f$. As in our construction we only use parameters from $\{0,\pm\frac{1}{2},\pm1,2\}$, we need to modify the coefficients given in (12) to make them implementable by those parameters. Denote $B:=2Ke^d$ and let $b\in\mathbb{N}$ be the smallest integer with $2^b\ge BM^\beta(\beta+1)^d$. As $|c_{a,\gamma}|<B$ ([11], eq. 34), for each $c_{a,\gamma}$ there is an integer $k\in[-2^b,2^b]$ with $|c_{a,\gamma}/B-k/2^b|\le 2^{-b}$. Denote then $\tilde{c}_{a,\gamma}:=\frac{k}{2^b}B$ and define $\widetilde{P}^\beta f$ by replacing the coefficients $c_{a,\gamma}$ in (12) with $\tilde{c}_{a,\gamma}$. As the number of monomials of degree up to $\beta$ is bounded by $(\beta+1)^d$, the choice of $b$ implies that $P^\beta f$ and $\widetilde{P}^\beta f$ differ by at most $(\beta+1)^d B\,2^{-b}\le M^{-\beta}$ in sup norm; together with the approximation bound (13) for $P^\beta f$, this yields the corresponding bound (14) for $\widetilde{P}^\beta f$. In the proof of Theorem 2.1, the neural network approximation of the function $(x_1,\dots,x_r)\mapsto\prod_{j=1}^r x_j$ is first constructed, followed by approximation of monomials of degree up to $\beta$. The result then follows by approximating the function $P^\beta f(x)$ and applying (13). In the latter approximation, the set of parameters not belonging to $\{0,\pm\frac{1}{2},\pm1\}$ consists of at most $D:=M+(\beta(M+1))^d$ different values. Taking into account (14), we can use $\widetilde{P}^\beta f$ instead of $P^\beta f$ to approximate $f$. Thus, we can replace the entries $c_{x,\gamma}/B$ by the entries $\tilde{c}_{x,\gamma}/B=k/2^b$, where $k$ is some integer from $[-2^b,2^b]$. Also, since $M=2^\nu$, denoting $\ell:=\max\{\nu d+1,\,b\}$ we need to obtain $D$ parameters from the set $S=\{k/2^\ell,\ k\in\mathbb{Z}\cap(0,2^\ell]\}$. As any natural number can be represented as a sum of powers of 2, for any $y_1,\dots,y_D\in\mathbb{Z}\cap(0,2^\ell]$ we can compute $(1,x_1,\dots,x_d)\mapsto(1,x_1,\dots,x_d,y_1,\dots,y_D)$ with parameters from $\{0,1,2\}$ using at most $\ell$ hidden layers. The number of active parameters required for this computation is bounded by $\ell(1+d+D+\ell)$. Hence, for any $z_1,\dots,z_D\in S$, we can compute $(1,x_1,\dots,x_d)\mapsto(1,x_1,\dots,x_d,z_1,\dots,z_D)$ with $2\ell$ hidden layers and $2\ell(1+d+D+\ell)$ active parameters. Applying Theorem 2.1, we get the existence of a network $\tilde{f}\in F(L,p,s)$ with the desired architecture and sparsity.
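To make the last step concrete, here is a small Python sketch (an illustration only, not the paper's exact construction) of how a parameter $z=y/2^\ell\in S$ can be assembled with weights from $\{0,\frac{1}{2},1,2\}$: repeated doubling of the constant coordinate 1 produces the powers of 2, the binary digits of $y$ select which powers are summed, and $\ell$ subsequent halvings rescale the result.

```python
def build_dyadic(y, ell):
    """Produce z = y / 2**ell for an integer y in (0, 2**ell], using only the
    operations available with weights {1/2, 1, 2}: doubling, adding, halving."""
    power, acc = 1, 0
    for i in range(ell + 1):        # layers carrying the running power 2**i
        if (y >> i) & 1:            # binary digit of y selects this power
            acc += power            # weight-1 edge into the accumulator
        power *= 2                  # weight-2 edge to the next layer
    for _ in range(ell):            # ell further layers of weight 1/2
        acc *= 0.5
    return acc

ell = 5
for y in (1, 7, 32):
    assert build_dyadic(y, ell) == y / 2 ** ell
print([build_dyadic(y, 5) for y in (1, 7, 32)])   # [0.03125, 0.21875, 1.0]
```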