# Approximation with One-Bit Polynomials in Bernstein Form

## Abstract

We prove various theorems on approximation using polynomials with integer coefficients in the Bernstein basis of any given order. In the extreme, we draw the coefficients from $$\{ \pm 1\}$$ only. A basic case of our results states that for any Lipschitz function $$f:[0,1] \rightarrow [-1,1]$$ and for any positive integer n, there are signs $$\sigma _0,\dots ,\sigma _n \in \{\pm 1\}$$ such that

\begin{aligned} \left| f(x) - \sum _{k=0}^n \sigma _k \, \left( {\begin{array}{c}n\\ k\end{array}}\right) x^k (1-x)^{n-k} \right| \le \frac{C (1+|f|_{\textrm{Lip}})}{1+\sqrt{nx(1-x)}} ~ \text{ for } \text{ all } x \in [0,1]. \end{aligned}

More generally, we show that higher accuracy is achievable for smoother functions: For any integer $$s\ge 1$$, if f has a Lipschitz $$(s{-}1)$$st derivative, then approximation accuracy of order $$O(n^{-s/2})$$ is achievable with coefficients in $$\{\pm 1\}$$ provided $$\Vert f \Vert _\infty < 1$$, and of order $$O(n^{-s})$$ with unrestricted integer coefficients, both uniformly on closed subintervals of (0, 1) as above. Hence these polynomial approximations are not constrained by the saturation of classical Bernstein polynomials. Our approximations are constructive and can be implemented using feedforward neural networks whose weights are chosen from $$\{\pm 1\}$$ only.

## References

1. Ashbrock, J., Powell, A.M.: Stochastic Markov gradient descent and training low-bit neural networks. Sampl. Theory Signal Process. Data Anal. 19(2), 1–23 (2021)

2. Benedetto, J.J., Powell, A.M., Yılmaz, Ö.: Sigma-delta ($$\Sigma \Delta$$) quantization and finite frames. IEEE Trans. Inf. Theory 52(5), 1990–2005 (2006)

3. Bolcskei, H., Grohs, P., Kutyniok, G., Petersen, P.: Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1), 8–45 (2019)

4. Bustamante, J.: Bernstein Operators and Their Properties. Springer, Berlin (2017)

5. Candy, J.C., Temes, G.C. (eds).: Oversampling Delta-Sigma Data Converters: Theory, Design and Simulation. Wiley-IEEE (1991)

6. Chlodovsky, I.: Une remarque sur la représentation des fonctions continues par des polynomes à coefficients entiers. Math. Sb. 32(3), 472–475 (1925)

7. Chou, E., Güntürk, C.S., Krahmer, F., Saab, R., Yılmaz, Ö.: Noise-shaping quantization methods for frame-based and compressive sampling systems. Sampl. Theory 57–184 (2015)

8. Courbariaux, M., Bengio, Y., David, J.-P.: Binaryconnect: training deep neural networks with binary weights during propagations. Adv. Neural Inf. Process. Syst. 28 (2015)

9. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

10. Daubechies, I., DeVore, R.: Approximating a bandlimited function using very coarsely quantized data: a family of stable sigma-delta modulators of arbitrary order. Ann. Math. (2) 158(2), 679–710 (2003)

11. Daubechies, I., DeVore, R., Foucart, S., Hanin, B., Petrova, G.: Nonlinear approximation and (deep) ReLU networks. Constr. Approx. 1–46 (2021)

12. DeVore, R., Hanin, B., Petrova, G.: Neural network approximation. Acta Numer. 30, 327–444 (2021)

13. DeVore, R.A., Lorentz, G.G.: Constructive Approximation, vol. 303. Springer, Berlin (1993)

14. Felbecker, G.: Linearkombinationen von iterierten Bernsteinoperatoren. Manuscr. Math. 29(2), 229–248 (1979)

15. Ferguson, L.B.O.: Approximation by Polynomials with Integral Coefficients, volume 17. American Mathematical Society (1980)

16. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins University Press, 4th edition (2013)

17. Gray, R.M.: Quantization noise spectra. IEEE Trans. Inf. Theory 36(6), 1220–1244 (1990)

18. Güntürk, C.S.: One-bit sigma-delta quantization with exponential accuracy. Commun. Pure Appl. Math. 56(11), 1608–1630 (2003)

19. Güntürk, C.S.: Mathematics of analog-to-digital conversion. Commun. Pure Appl. Math. 65(12), 1671–1696 (2012)

20. Güntürk, C.S., Li, W.: Approximation of functions with one-bit neural networks. arXiv preprint arXiv:2112.09181 (2021)

21. Guo, Y.: A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752 (2018)

22. Inose, H., Yasuda, Y.: A unity bit coding method by negative feedback. Proc. IEEE 51(11), 1524–1535 (1963)

23. Inose, H., Yasuda, Y., Murakami, J.: A telemetering system by code manipulation-$$\Delta \Sigma$$ modulation. IRE Trans. Space Electron. Telemetry 204–209 (1962)

24. Kakeya, S.: On approximate polynomials. Tohoku Math. J. First Ser. 6, 182–186 (1914)

25. Kantorovich, L.V.: Some remarks on the approximation of functions by means of polynomials with integral coefficients. Izv. Akad. Nauk. SSSR 7, 1163–1168 (1931)

26. Kolmogorov, A.N., Tihomirov, V.M.: $$\varepsilon$$-entropy and $$\varepsilon$$-capacity of sets in function spaces. Uspehi Mat. Nauk., 14(2 (86)):3–86, 1959. Also in Amer. Math. Soc. Transl., Ser. 2 17 (1961), 277–364

27. Lorentz, G.G.: Bernstein Polynomials. Chelsea, 2nd edition (1986)

28. Lorentz, G.G., Golitschek, M.V., Makovoz, Y.: Constructive Approximation: Advanced Problems, volume 304. Springer (1996)

29. Lu, J., Shen, Z., Yang, H., Zhang, S.: Deep network approximation for smooth functions. SIAM J. Math. Anal. 53(5), 5465–5506 (2021)

30. Lybrand, E., Saab, R.: A greedy algorithm for quantizing neural networks. J. Mach. Learn. Res. 22(156), 1–38 (2021)

31. Micchelli, C.: The saturation class and iterates of the Bernstein polynomials. J. Approx. Theory 8(1), 1–18 (1973)

32. Norsworthy, S.R., Schreier, R., Temes, G.C. (eds.): Delta-Sigma-Converters: Theory, Design and Simulation. Wiley, Hoboken (1996)

33. Pál, J.: Zwei kleine bemerkungen. Tohoku Math. J. First Ser. 6, 42–43 (1914)

34. Poon, H., Domingos, P.: Sum-product networks: a new deep architecture. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 689–690. IEEE (2011)

35. Qian, W., Riedel, M.D., Rosenberg, I.: Uniform approximation and Bernstein polynomials with coefficients in the unit interval. Eur. J. Comb. 32(3), 448–463 (2011)

36. Schreier, R., Temes, G.C.: Understanding Delta-Sigma Data Converters. Wiley-IEEE Press, Hoboken (2004)

37. Shaham, U., Cloninger, A., Coifman, R.R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44(3), 537–557 (2018)

38. Trigub, R.M.: Approximation of functions by polynomials with integer coefficients. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya 26(2), 261–280 (1962)

39. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)

## Author information

### Corresponding author

Correspondence to C. Sinan Güntürk.

Communicated by Edward B. Saff.

Dedicated to Professor Ron DeVore on the occasion of his 80th birthday.

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix A: Frame-theoretic redundancy of the Bernstein basis

We have shown in this paper that a large class of functions (including all continuous functions) can be approximated arbitrarily well using polynomials with very coarsely quantized coefficients in a Bernstein basis. While our methodology can be viewed as a variation on the theme of [2] which addresses quantization of finite frame expansions through $$\Sigma \Delta$$ quantization, we emphasize that we are working with linearly independent systems. As such, the setting of this paper is novel for this type of quantization. The reason why the method works is that the “effective span” of the Bernstein basis is actually roughly of dimension $$\sim \sqrt{n}$$. Both approximation and noise-shaping quantization take place relative to this subspace of $$\mathcal {P}_n$$.

In this appendix, we will make the statements in the preceding paragraph more precise by first providing a complete description of the singular values of the linear operator $$S_n:\mathbb {R}^{n+1}\rightarrow \mathcal {P}_n$$ defined by

\begin{aligned} S_n u :=\sum _{k=0}^n u_k p_{n,k}, \end{aligned}

where $$\mathbb {R}^{n+1}$$ is equipped with the standard Euclidean inner product and $$\mathcal {P}_n$$ with the inner product inherited from $$L^2([0,1])$$, both denoted by $$\langle \cdot , \cdot \rangle$$. Evidently, this discussion is concerned primarily with $$L^2$$ approximation rather than uniform approximation. However, the description of the singular values sheds additional light on the Bernstein basis and may be of independent interest to the frame community.

The adjoint $$S_n^*:\mathcal {P}_n \rightarrow \mathbb {R}^{n+1}$$ is given by

\begin{aligned} (S_n^* f)_k = \langle p_{n,k},f\rangle :=\int _0^1 p_{n,k}(x)f(x)\ dx, ~~k=0,\dots ,n. \end{aligned}

In the language of frame theory, $$S_n$$ and $$S_n^*$$ are the synthesis and the analysis operators for the system $$(p_{n,k})_{k=0}^n$$, respectively. Then the frame operator $$S_n^*S_n^{}:\mathbb {R}^{n+1}\rightarrow \mathbb {R}^{n+1}$$ is given by

\begin{aligned} (S_n^*S_n^{} u)_k = \sum _{\ell =0}^n \langle p_{n,k}, p_{n,\ell }\rangle u_\ell = (\Gamma _n u)_k \end{aligned}

where $$\Gamma _n$$ is the Gram matrix of the system with entries

\begin{aligned} (\Gamma _n)_{k,\ell } =\langle p_{n,k}, p_{n,\ell }\rangle , ~~k,\ell =0,\dots ,n. \end{aligned}

Note that $$(n+1) \Gamma _n$$ is doubly stochastic due to the relations

\begin{aligned} \sum _{l=0}^n p_{n,l}(x) = 1 \text{ and } \int _0^1 p_{n,k}(x)\,dx = \frac{1}{n+1}. \end{aligned}
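The doubly stochastic property is easy to confirm in exact arithmetic. The sketch below evaluates the Gram entries through the beta integral, which gives $$\langle p_{n,k}, p_{n,\ell }\rangle = \binom{n}{k}\binom{n}{\ell } / \big ((2n+1)\binom{2n}{k+\ell }\big )$$, and then sums rows and columns. The helper name `gram` and the choice $$n=6$$ are ours, used only for illustration.

```python
from fractions import Fraction
from math import comb


def gram(n):
    """Gram matrix of (p_{n,k})_{k=0}^n in L^2([0,1]), as exact rationals.

    Uses int_0^1 x^a (1-x)^b dx = a! b! / (a+b+1)! with a = k+l, b = 2n-k-l.
    """
    return [
        [Fraction(comb(n, k) * comb(n, l), (2 * n + 1) * comb(2 * n, k + l))
         for l in range(n + 1)]
        for k in range(n + 1)
    ]


n = 6
G = gram(n)
row_sums = [sum(row) for row in G]
col_sums = [sum(G[k][l] for k in range(n + 1)) for l in range(n + 1)]

# Every row and column of Gamma_n sums to 1/(n+1), so (n+1)*Gamma_n is doubly stochastic.
print(all(r == Fraction(1, n + 1) for r in row_sums + col_sums))
```

Rational arithmetic is used here because the Gram matrix becomes severely ill-conditioned as n grows, which makes floating-point checks less convincing.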

To ease the notation in the discussion below, we assume $$n \ge 0$$ is fixed and suppress the index n from our notation when expressing certain quantities, keeping in mind that they still depend on n. This dependence will naturally become explicit in all formulas we will provide. Henceforth we will refer to the operators $$S:=S_n$$ and $$\Gamma :=\Gamma _n$$.

Since the Bernstein system is linearly independent, $$\Gamma$$ is positive-definite, and in particular, invertible. Let its eigenvalues in decreasing order be $$(\lambda _k)_0^n$$, i.e.

\begin{aligned} \lambda _{0} \ge \lambda _{1} \ge \cdots \ge \lambda _{n} >0. \end{aligned}

As we will see, the eigenvectors of $$\Gamma$$ are precisely the discrete Legendre polynomials on $$\{0,\dots ,n\}$$, which we denote by $$\varphi _0,\dots , \varphi _n$$. (Here, we identify $$\mathbb {R}^{n+1}$$ with $$\mathbb {R}^{\{0,\dots ,n\}}=L^2(\{0,\dots ,n\})$$ equipped with the counting measure. We pair $$f \in \mathbb {R}^{\{0,\dots ,n\}}$$ with $$(f(0),\dots ,f(n)) \in \mathbb {R}^{n+1}$$.) Thus each $$\varphi _k(\ell )$$ is a polynomial in $$\ell \in \{0,\dots ,n\}$$, of degree equal to k, and we have the orthonormality relations

\begin{aligned} \langle \varphi _k, \varphi _j \rangle = \delta _{k,j}. \end{aligned}

We will not need an explicit formula for the $$\varphi _k$$. In order to uniquely define the Legendre polynomials, it is necessary to remove their sign ambiguity according to some convention, but we shall not need this either.

For any $$k \ge 0$$, let $$(t)_k$$ denote the falling factorial defined by $$(t)_k:=t(t-1)\cdots (t-k+1)$$ and $$(t)_0:=1$$, where we restrict t to non-negative integers.

### Theorem 9

Let $$n\ge 0$$ be arbitrary and let $$\Gamma$$ be the Gram matrix of the Bernstein system $$(p_{n,k})_{k=0}^n$$. Then for all $$0 \le k \le n$$, we have $$\Gamma \varphi _k = \lambda _k \varphi _k$$ where $$\varphi _k$$ is the degree k discrete Legendre polynomial on $$\{0,\dots ,n\}$$ and

\begin{aligned} \lambda _k = \frac{(n)_k}{(n+k+1)_{k+1}}. \end{aligned}
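For small n the statement can be checked exactly. The following sketch is not part of the proof. It builds unnormalized discrete Legendre polynomials by Gram-Schmidt on the monomials $$\ell \mapsto \ell ^k$$ (normalization does not affect the eigenvector property) and compares $$\Gamma \varphi _k$$ with $$\lambda _k \varphi _k$$ in rational arithmetic. All function names are hypothetical helpers of ours.

```python
from fractions import Fraction
from math import comb


def gram(n):
    """Exact Gram matrix of the Bernstein system of order n in L^2([0,1])."""
    return [
        [Fraction(comb(n, k) * comb(n, l), (2 * n + 1) * comb(2 * n, k + l))
         for l in range(n + 1)]
        for k in range(n + 1)
    ]


def falling(t, k):
    """Falling factorial (t)_k = t (t-1) ... (t-k+1), with (t)_0 = 1."""
    out = 1
    for i in range(k):
        out *= t - i
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def discrete_legendre(n):
    """Orthogonal (unnormalized) polynomials of degrees 0..n on {0,...,n}."""
    basis = []
    for k in range(n + 1):
        v = [Fraction(l) ** k for l in range(n + 1)]
        for b in basis:
            c = dot(v, b) / dot(b, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        basis.append(v)
    return basis


def check_theorem9(n):
    """True if Gamma phi_k == lambda_k phi_k holds exactly for all k = 0..n."""
    G = gram(n)
    for k, phi in enumerate(discrete_legendre(n)):
        lam = Fraction(falling(n, k), falling(n + k + 1, k + 1))
        if [dot(row, phi) for row in G] != [lam * p for p in phi]:
            return False
    return True


print([check_theorem9(n) for n in range(8)])
```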

### Proof

Let $$\pi _k(\ell ):=\ell ^k$$ (as before) and $$\xi _k(\ell ):=(\ell )_k$$, $$\ell \in \{0,\dots ,n\}$$, $$k \ge 0$$. Since each $$\xi _k$$ is a monic polynomial of degree k, it is immediate that $$\text{ span }\{\xi _0,\dots ,\xi _m\} = \text{ span }\{\pi _0,\dots ,\pi _m\} =: P_m$$. (Note that for $$m>n$$ we have $$P_m = P_n = \mathbb {R}^{\{0,\dots ,n\}}$$, though we will not be concerned with the case $$m > n$$.)

Let $$0 \le m \le n$$ be arbitrary. We claim that $$\Gamma (P_m) \subset P_m$$. For this, it suffices to show that $$\Gamma \xi _m$$ is a degree m polynomial. We have

\begin{aligned} (\Gamma \xi _m)(k) =\sum _{\ell =0}^n \Gamma _{k,\ell } \,\xi _m(\ell ) = \Big \langle p_{n,k}, \sum _{\ell =0}^n (\ell )_m\, p_{n,\ell }\Big \rangle , ~~~ k=0,\dots ,n. \end{aligned}
(41)

We can evaluate the polynomial sum in this inner-product via

\begin{aligned} \sum _{\ell =0}^n (\ell )_m\, p_{n,\ell }(x) = \sum _{\ell =m}^n m! \left( {\begin{array}{c}\ell \\ m\end{array}}\right) \, p_{n,\ell }(x) = m! \left( {\begin{array}{c}n\\ m\end{array}}\right) x^m = (n)_m\,x^m, \end{aligned}
(42)

where the first equality uses the fact that $$(\ell )_m=0$$ for $$\ell <m$$, and the second equality is an identity generalizing the partition of unity property of Bernstein polynomials (see, e.g. [4, p.85]). Hence, (41) can now be evaluated to give

\begin{aligned} (\Gamma \xi _m)(k)= & {} (n)_m \left( {\begin{array}{c}n\\ k\end{array}}\right) \int _0^1 x^{k+m} (1-x)^{n-k} dx \nonumber \\= & {} (n)_m \left( {\begin{array}{c}n\\ k\end{array}}\right) \textrm{B}(k+m+1,n-k+1) \nonumber \\= & {} \frac{(n)_m\,(k+m)_m}{(n+m+1)_{m+1}},~~~ k=0,\dots ,n, \end{aligned}
(43)

where $$\textrm{B}(\cdot ,\cdot )$$ stands for the beta function. The term $$(k+m)_m$$ is a polynomial in k of degree m, hence the claim follows.
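Both (42) and (43) can be confirmed exactly for small parameters. The sketch below evaluates the left-hand side of (42) at a rational point and compares the definition (41) with the closed form (43). The helper names, the test point $$x=2/7$$, and the choice $$n=6$$ are illustrative assumptions of ours.

```python
from fractions import Fraction
from math import comb


def falling(t, k):
    """Falling factorial (t)_k, with (t)_0 = 1; vanishes for integers 0 <= t < k."""
    out = 1
    for i in range(k):
        out *= t - i
    return out


def bernstein(n, k, x):
    return comb(n, k) * x**k * (1 - x) ** (n - k)


def lhs_42(n, m, x):
    """sum_l (l)_m p_{n,l}(x), the left-hand side of (42)."""
    return sum(falling(l, m) * bernstein(n, l, x) for l in range(n + 1))


def gamma_xi(n, m, k):
    """(Gamma xi_m)(k) computed from the definition (41) with exact Gram entries."""
    return sum(
        Fraction(comb(n, k) * comb(n, l), (2 * n + 1) * comb(2 * n, k + l)) * falling(l, m)
        for l in range(n + 1)
    )


def rhs_43(n, m, k):
    """Closed form (n)_m (k+m)_m / (n+m+1)_{m+1} from (43)."""
    return Fraction(falling(n, m) * falling(k + m, m), falling(n + m + 1, m + 1))


n, x = 6, Fraction(2, 7)
ok_42 = all(lhs_42(n, m, x) == falling(n, m) * x**m for m in range(n + 1))
ok_43 = all(gamma_xi(n, m, k) == rhs_43(n, m, k)
            for m in range(n + 1) for k in range(n + 1))
print(ok_42, ok_43)
```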

As a particular case, we have $$\Gamma \varphi _m \in P_m$$, so there are $$\alpha _0^{(m)},\dots ,\alpha _m^{(m)}$$ such that

\begin{aligned} \Gamma \varphi _m = \sum _{k=0}^m \alpha _k^{(m)} \varphi _k. \end{aligned}

In other words, $$\Gamma$$ is represented by an upper-triangular matrix in the discrete Legendre basis $$(\varphi _0,\dots ,\varphi _n)$$. This fact, coupled with the orthogonality of the discrete Legendre basis and the symmetry of $$\Gamma$$, actually implies that $$\Gamma$$ is diagonal in the discrete Legendre basis. Let us spell out how this general principle works:

For the case $$m=0$$, we readily have $$\Gamma \varphi _0 = \alpha _0^{(0)} \varphi _0$$. For $$1 \le m \le n$$, the claim that $$\varphi _m$$ is an eigenvector of $$\Gamma$$ is equivalent to the claim $$\alpha _0^{(m)}=\cdots =\alpha _{m-1}^{(m)}=0$$. Employing orthogonality and self-adjointness, we have

\begin{aligned} \alpha _k^{(m)} = \frac{\langle \Gamma \varphi _m, \varphi _k \rangle }{\langle \varphi _k, \varphi _k\rangle } = \frac{\langle \varphi _m, \Gamma \varphi _k \rangle }{\langle \varphi _k, \varphi _k\rangle }, ~~~k=0,\dots ,m. \end{aligned}
(44)

Having established that $$\varphi _0,\dots ,\varphi _{m-1}$$ are all eigenvectors of $$\Gamma$$, (44) coupled with orthogonality of the discrete Legendre basis implies that $$\alpha _k^{(m)} = 0$$ for all $$k=0,\dots ,m-1$$, implying that $$\varphi _m$$ is an eigenvector of $$\Gamma$$.

We proceed to obtain a formula for the eigenvalues of $$\Gamma$$. For $$m=0$$, the fact that $$(n+1)\Gamma$$ is doubly stochastic implies that $$\Gamma \varphi _0 = \lambda _0 \varphi _0$$ with $$\lambda _0 = 1/(n+1)$$. Let $$1\le m \le n$$ and $$\lambda _m$$ be the eigenvalue corresponding to the eigenvector $$\varphi _m$$. Since $$\varphi _m$$, $$\xi _m$$, $$\pi _m$$ are all degree m polynomials and the latter two are monic, there is a scalar $$\beta _m \not =0$$ such that

\begin{aligned} \varphi _m = \beta _m \xi _m + \gamma _{m-1} = \beta _m \pi _m + \tilde{\gamma }_{m-1} \end{aligned}

where $$\gamma _{m-1}, {\tilde{\gamma }}_{m-1} \in P_{m-1}$$. Hence

\begin{aligned} \lambda _m \varphi _m - \beta _m \Gamma \xi _m = \Gamma (\varphi _m - \beta _m \xi _m) = \Gamma \gamma _{m-1} \in P_{m-1} \end{aligned}

so that

\begin{aligned} \lambda _m \pi _m - \Gamma \xi _m = \frac{1}{\beta _m}\Big (\lambda _m (\beta _m \pi _m - \varphi _m) + (\lambda _m\varphi _m - \beta _m \Gamma \xi _m) \Big ) \in P_{m-1}. \end{aligned}

Meanwhile, (43) implies

\begin{aligned} \Gamma \xi _m - \frac{(n)_m}{(n+m+1)_{m+1}} \pi _m \in P_{m-1}. \end{aligned}

Adding the last two vectors, we get

\begin{aligned} \Big (\lambda _m - \frac{(n)_m}{(n+m+1)_{m+1}}\Big ) \pi _m \in P_{m-1} \end{aligned}

which implies that $$\lambda _m = \frac{(n)_m}{(n+m+1)_{m+1}}$$. $$\square$$

Remark. For any n, let $$\sigma _0,\dots ,\sigma _n$$ be the singular values of S defined by $$\sigma _m := \sqrt{\lambda _m}$$. Define $$\psi _m := S \varphi _m / \sigma _m$$. Then, as is well known, $$\psi _0,\dots ,\psi _n$$ is an orthonormal basis for the range of S, i.e. $$\mathcal {P}_n$$, equipped with the $$L^2$$ inner product on [0, 1]. The relation (42) shows that $$S\xi _m \in \mathcal {P}_m$$, implying $$S(P_m) \subset \mathcal {P}_m$$. In particular, $$\psi _m \in \mathcal {P}_m$$ for each $$m=0,\dots ,n$$. Then the orthonormality of the $$\psi _m$$ implies that they are simply the continuous Legendre polynomials on [0, 1], again up to a sign convention. Note that even though the $$\varphi _m$$ depend on n, the $$\psi _m$$ do not.

Quantifying redundancy of the Bernstein system. In frame theory, the redundancy of a finite frame is often defined as the ratio of the number of frame vectors to the dimension of their span. Clearly this definition is insufficient for the purpose of differentiating between all the ways in which n vectors can span a d-dimensional space, let alone for handling infinite-dimensional spaces. A particular deficiency of working with the algebraic span is that it treats all bases the same, whether orthonormal or ill-conditioned. Unfortunately a more suitable quantitative definition of redundancy has been elusive. In the case of unit norm tight frames, the ratio n/d is equal to the frame constant, i.e. the unique eigenvalue of the frame operator $$SS^*$$ (equivalently, the unique non-zero eigenvalue of the Gram matrix), and for near-tight frames, the frame bounds given by the smallest and largest eigenvalues of the frame operator may continue to serve as a rough substitute of redundancy. However, when the eigenvalues are widely dispersed without a well-defined gap, the frame bounds seem to lose their significance.

As a basis of $$\mathcal {P}_n$$, the Bernstein system would not be considered redundant by the classical definition. However, it is ill-conditioned; in fact, we have

\begin{aligned} \frac{\sigma _0}{\sigma _n} = \sqrt{\frac{\lambda _0}{\lambda _n}} = \sqrt{\frac{(2n+1)!}{(n+1)!\, n!}} = \sqrt{\left( {\begin{array}{c}2n+1\\ n+1\end{array}}\right) } \end{aligned}

which is exponentially large in n. Hence the relevant question is: What is the “effective dimension” of the span of $$(p_{n,k})_{k=0}^n$$? One possible answer to this question comes from numerical analysis via the notion of “numerical rank” (see e.g. [16, Ch. 5.4.1]). This quantity is tied to a given tolerance threshold $$\epsilon$$ measuring which singular values (and therefore which subspaces) should count as significant. We inspect the singular values against the top singular value $$\sigma _0$$ and define

\begin{aligned} d_n(\epsilon ) := \max \{0\le m \le n :\sigma _{m} \ge \epsilon \sigma _0\} \end{aligned}

where $$\epsilon \in (0,1)$$ is a fixed small parameter. With this definition, if we truncate the singular value decomposition of S to define

\begin{aligned} {\tilde{S}}_\epsilon u := \sum _{\sigma _m \ge \epsilon \sigma _0} \sigma _m \langle u, \varphi _m\rangle \psi _m, \end{aligned}

then a straightforward consequence is that

\begin{aligned} \frac{\Vert S - {\tilde{S}}_\epsilon \Vert _\textrm{op}}{\Vert S\Vert _\textrm{op}} \le \epsilon . \end{aligned}

With this, we can argue that the image of the unit ball under S is well approximated by an ellipsoid within $$\textrm{span}(\psi _0,\dots ,\psi _{d_n(\epsilon )})= \mathcal {P}_{d_n(\epsilon )}$$. Hence we define the $$\epsilon$$-redundancy of $$(p_{n,k})_{k=0}^n$$ to be $$n/d_n(\epsilon )$$. Note that this definition can easily be applied to systems other than the Bernstein system.

The next task is to obtain asymptotics for $$d_n(\epsilon )$$. As can be seen from the formula derived in Theorem 9, the singular values of S do not possess any obvious gap. It turns out that the exponentially large condition number of S is highly misleading because the singular values of S do not decay like an exponential. If $$\sigma _m$$ actually decayed like $$\exp (-cm)$$, we would find that $$d_n(\epsilon ) \sim c^{-1} \log (1/\epsilon )$$, which is independent of n. Instead, we will show below that

\begin{aligned} d_n(\epsilon ) = C_\epsilon \sqrt{n} (1 + o(1)) \text{ as } n \rightarrow \infty , \end{aligned}

where $$C_\epsilon := \sqrt{2 \log (1/ \epsilon )}$$. This result is consistent with the fact that the error of best approximation using $$(p_{n,k})_{k=0}^n$$ (with bounded coefficients) behaves as if n is replaced by $$\sqrt{n}$$. In addition, note that the $$\epsilon$$-redundancy of $$(p_{n,k})_{k=0}^n$$ given by $$n/d_n(\epsilon )$$ also behaves like $$\sqrt{n}$$. This is also consistent with our result that the error bound of the r-th order $$\Sigma \Delta$$ quantization method behaves as $$n^{-r/2}$$.
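The $$\sqrt{n}$$ growth of the numerical rank can be observed directly from the eigenvalue formula of Theorem 9. The sketch below is a numerical illustration only. It uses the equivalence $$\sigma _m \ge \epsilon \sigma _0 \iff \log (\lambda _0/\lambda _m) \le 2\log (1/\epsilon )$$ and accumulates $$\log (\lambda _0/\lambda _m) = \sum _{k=0}^m \big (\log (1+\tfrac{k}{n+1}) - \log (1-\tfrac{k}{n+1})\big )$$ in floating point. The function name `numerical_rank` and the sampled values of n and $$\epsilon$$ are our own choices.

```python
import math


def numerical_rank(n, eps):
    """d_n(eps): the largest m in {0,...,n} with sigma_m >= eps * sigma_0."""
    threshold = 2.0 * math.log(1.0 / eps)  # log(1/eps^2)
    log_ratio = 0.0                        # log(lambda_0 / lambda_m), starting at m = 0
    d = 0
    for m in range(1, n + 1):
        t = m / (n + 1)
        log_ratio += math.log1p(t) - math.log1p(-t)
        if log_ratio > threshold:
            break                          # the eigenvalues decrease, so we can stop
        d = m
    return d


eps = 0.01
for n in (10**2, 10**3, 10**4, 10**5):
    d = numerical_rank(n, eps)
    predicted = math.sqrt(2 * n * math.log(1 / eps))
    print(n, d, round(d / predicted, 4))
```

The printed ratio compares $$d_n(\epsilon )$$ with $$\sqrt{2n\log (1/\epsilon )}$$ and should drift toward 1 as n increases.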

### Theorem 10

For all $$\epsilon \in (0,1)$$,

\begin{aligned} \lim _{n \rightarrow \infty } \frac{d_n(\epsilon )}{ \sqrt{2n \log (1/\epsilon )}} = 1. \end{aligned}
(45)

### Proof

The phrase “for all sufficiently large n” will occur several times below and will be abbreviated by “f.a.s.l. n.” Suppose $$\epsilon \in (0,1)$$ is fixed. Define

\begin{aligned} m_n:= \left\lfloor \sqrt{2(n{+}1) \log (1/\epsilon )} \right\rfloor -1,~~ \text{ and } ~~ m'_n:= \left\lceil \sqrt{2(n{+}1) \log (1/\epsilon )} \right\rceil . \end{aligned}
(46)

It is clear that

\begin{aligned} \lim _{n \rightarrow \infty } \frac{m_n}{ \sqrt{2n \log (1/\epsilon )}} = \lim _{n \rightarrow \infty } \frac{m'_n}{ \sqrt{2n \log (1/\epsilon )}} = 1. \end{aligned}
(47)

As a trivial consequence, $$m_n$$ and $$m'_n$$ are in $$\{0,\dots ,n\}$$ f.a.s.l. n, so we can meaningfully refer to $$\sigma _{m_n}$$ and $$\sigma _{m'_n}$$. The proof will be based on showing that

\begin{aligned} \sigma _{m_n}> \epsilon \sigma _0 > \sigma _{m'_n} ~~ \text{ f.a.s.l. } n \end{aligned}
(48)

which implies $$m_n \le d_n(\epsilon ) < m'_n$$ f.a.s.l. n, and therefore (45) via (47).

It will be more convenient to work with the increasing sequence $$\alpha _m:= \log ( \lambda _0/\lambda _m)$$, $$m=0,\dots ,n$$. By Theorem 9, we have

\begin{aligned} \alpha _m= & {} \log \Big ( \frac{(n{+}m{+}1)_{m+1}}{(n{+}1)_{m+1}}\Big ) = \sum _{k=0}^m \left( \log \Big (1{+}\frac{k}{n{+}1}\Big )-\log \Big ( 1{-}\frac{k}{n{+}1}\Big ) \right) \\= & {} \sum _{k=0}^m g\Big (\frac{k}{n{+}1} \Big ), \end{aligned}

where $$g(t):=\log (1+t) - \log (1-t)$$, $$|t| < 1$$. We have $$g(0)=0$$, $$g'(0)=2$$, $$g''(0)=0$$, and $$|g'''(t)| \le 18$$ for $$|t| \le 1/2$$, so that $$|g(t) - 2t| \le 3|t|^3$$ for $$|t| \le 1/2$$ via Taylor’s theorem. Hence, for all $$m \le \frac{n+1}{2}$$, we have

\begin{aligned} \left| \alpha _m - \frac{m(m{+}1)}{n{+}1} \right|= & {} \left| \sum _{k=0}^m g\Big (\frac{k}{n{+}1} \Big ) - \sum _{k=0}^m \frac{2k}{n{+}1} \right| \nonumber \\{} & {} \le 3 \sum _{k=0}^m \Big (\frac{k}{n{+}1} \Big )^3 < \frac{(m{+}1)^4}{(n{+}1)^3}. \end{aligned}
(49)
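As a numerical sanity check of (49), one can compare $$\alpha _m$$ with $$m(m+1)/(n+1)$$ for every admissible m. The sketch below does this in floating point. It illustrates the estimate but is not a substitute for the Taylor argument, and the function names are ours.

```python
import math


def alpha(n, m):
    """alpha_m = sum_{k=0}^m g(k/(n+1)) with g(t) = log(1+t) - log(1-t)."""
    return sum(
        math.log1p(k / (n + 1)) - math.log1p(-k / (n + 1)) for k in range(m + 1)
    )


def estimate_49_holds(n):
    """Check |alpha_m - m(m+1)/(n+1)| < (m+1)^4/(n+1)^3 for all m <= (n+1)/2."""
    for m in range((n + 1) // 2 + 1):
        gap = abs(alpha(n, m) - m * (m + 1) / (n + 1))
        if not gap < (m + 1) ** 4 / (n + 1) ** 3:
            return False
    return True


print([estimate_49_holds(n) for n in (1, 10, 100, 300)])
```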

We will utilize (49) to estimate $$\alpha _{m_n}$$ and $$\alpha _{m'_n}$$. (Note that (47) also implies that $$0\le m_n < m'_n \le \frac{n{+}1}{2}$$ f.a.s.l. n.) The definition (46) readily implies

\begin{aligned} \frac{(m_n{+}1)^2}{n{+}1} \le \log (1/\epsilon ^2) \le \frac{(m'_n)^2}{n{+}1}, \end{aligned}
(50)

and rearranging these inequalities, we have

\begin{aligned}{} & {} \frac{m_n(m_n{+}1)}{n{+}1} \le \log (1/\epsilon ^2) - \frac{m_n{+}1}{n{+}1} ~~ \text{ and } ~~ \frac{m'_n(m'_n{+}1)}{n{+}1}\nonumber \\{} & {} \quad \ge \log (1/\epsilon ^2) + \frac{m'_n}{n{+}1}. \end{aligned}
(51)

The estimate (49) for $$m=m_n$$ with (50) and (51) yields

\begin{aligned}{} & {} \alpha _{m_n} \le \frac{m_n(m_n{+}1)}{n{+}1} + \frac{(m_n{+}1)^4}{(n{+}1)^3} \nonumber \\{} & {} \quad \le \log (1/\epsilon ^2) - \frac{m_n{+}1}{n{+}1} + \frac{\log ^2(1/\epsilon ^2)}{n{+}1} \nonumber \\{} & {} \quad < \log (1/\epsilon ^2) ~~ \text{ f.a.s.l. } n \end{aligned}
(52)

since $$m_n \rightarrow \infty$$. Similarly, we also have

\begin{aligned}{} & {} \alpha _{m'_n} \ge \frac{m'_n(m'_n{+}1)}{n{+}1} - \frac{(m'_n{+}1)^4}{(n{+}1)^3} \nonumber \\{} & {} \quad \ge \log (1/\epsilon ^2) + \frac{m'_n}{n{+}1} - \frac{2 \log ^2(1/\epsilon ^2)}{n{+}1} ~~ \text{ f.a.s.l. } n, \nonumber \\{} & {} \quad > \log (1/\epsilon ^2) ~~ \text{ f.a.s.l. } n, \end{aligned}
(53)

where in the second-to-last step we used the observation that $$(m'_n+1)^4 \le 2(m_n+1)^4$$ f.a.s.l. n. Hence we have shown

\begin{aligned} \alpha _{m_n}< \log (1/\epsilon ^2) < \alpha _{m'_n} ~~ \text{ f.a.s.l. } n, \end{aligned}

which is equivalent to (48). $$\square$$

### Appendix B: One-bit neural networks

Universal approximation properties of feedforward neural networks are well-known (see [9] as well as the more recent treatments including [3, 11, 29, 37, 39] and the extensive survey [12]). One of the motivations of this work has been to investigate the approximation potential of feedforward neural networks with the additional constraint that their parameters are coarsely quantized, at the extreme using only two values, hence the term “one-bit”. Quantized networks have been studied from a variety of perspectives (see e.g. [1, 8, 21, 30]). As far as we know, a universal approximation theorem using one-bit, or even fixed-precision multi-bit networks was missing. In this appendix we will show how this can be done using our results on polynomial approximation in the Bernstein system. In particular, employing the quadratic activation unit $$s(x) := \frac{1}{2}x^2$$, we will show that it is possible to implement the polynomial approximations of this paper using standard feedforward neural networks whose weights are chosen from $$\{\pm 1\}$$ only. Due to the scope and focus of this paper, we will limit our discussion to univariate functions. A much more comprehensive study of these networks covering multivariate functions, ReLU networks, and information theoretic considerations can be found in our separate manuscript [20].

A feedforward neural network can be described in a variety of general formulations (see e.g. [12]). The architecture of the network is determined by a directed acyclic graph. The vertices of this graph (called nodes of the network) consist of input, output, and hidden vertices. Vertices are assigned variables (independent or dependent depending upon whether the vertex is an input or not) and edges are assigned weights. Every node which is not an input node computes a linear combination of the variables assigned to the nodes that are connected to it (noting the directedness of the graph) using the weights associated to the connecting edges, followed by (except for the output nodes) an application of a given activation function $$\rho$$, possibly subject to a shift of its argument (called the bias). An important class of networks are layered, meaning that the nodes of the network can be partitioned into subsets (layers) such that the nodes in each layer have incoming edges only from one other (the previous) layer and outgoing edges into another (the next) layer.

Having introduced the basic terminology of feedforward neural networks, let us now turn the polynomial approximation method of this paper into a neural network approximation with $$\pm 1$$ weights. There are two critical ingredients in our construction of these networks. The first one is, of course, the fact that we are able to approximate functions f (where $$\Vert f\Vert _\infty \le 1$$) by $$\pm 1$$-linear combinations of the elements of the Bernstein system $$(p_{n,k})_{k=0}^n$$. The second critical ingredient is that the basis polynomials $$p_{n,k}$$ have a rich hierarchical structure that enables them to be computed recursively, and with few simple arithmetic operations. Indeed, the $$p_{n,k}$$ satisfy the elementary recurrence relation

\begin{aligned} p_{n,k}(x) = \left\{ \begin{array}{ll} (1-x)p_{n-1,0}(x), &{} k=0,\\ x p_{n-1,k-1}(x)+ (1-x)p_{n-1,k}(x), &{} 0< k < n, \\ x p_{n-1,n-1}(x), &{} k=n, \end{array} \right. \end{aligned}
(54)

which follows readily from the simple combinatorial identity $$\left( {\begin{array}{c}n\\ k\end{array}}\right) = \left( {\begin{array}{c}n-1\\ k-1\end{array}}\right) + \left( {\begin{array}{c}n-1\\ k\end{array}}\right)$$. (The endpoint cases of $$k=0$$ and $$k=n$$ can actually be eliminated if we use the earlier convention $$p_{n,j}(x) = 0$$ if $$j < 0$$ or $$j>n$$.) Since this combinatorial identity is the basis of the Pascal triangle, we will refer to the resulting tree structure of the Bernstein basis polynomials as the Pascal-Bernstein tree, which is depicted in the top triangular portion of the graph in Fig. . The bottom portion of this graph shows how these basis polynomials are combined with weights $$\sigma _k \in \{\pm 1\}$$, $$k=0,\dots ,n$$. The red edges of the tree correspond to multiplication by $$1-x$$ and the blue edges correspond to multiplication by x. Note that this schematic diagram is not a proper feedforward neural network yet because the weights depend on the input x. However, it is readily implementable by a sum-product network [34] where the nodes either multiply or compute a weighted sum of their inputs. In our case, the weights consist of $${\pm 1}$$ only. The input to the network could be x and $${\bar{x}}:=1-x$$.

Regardless of the nature of this network (or proto-network), the Pascal-Bernstein tree portion of the algorithm is universal in the sense that it is the same for every function f to be approximated. All of the information about how f is approximated is encoded in the weights $$(\sigma _k)_0^n$$ connecting the bottom layer to the final output.
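The Pascal-Bernstein tree translates into a short loop. The following sketch evaluates all $$p_{n,k}(x)$$ through the recurrence (54), using the zero-padding convention to avoid the endpoint cases, and then forms the signed sum. The function names are hypothetical, and the signs passed in are arbitrary placeholders rather than signs produced by the quantization method of the paper.

```python
from fractions import Fraction
from math import comb


def bernstein_row(n, x):
    """Return [p_{n,0}(x), ..., p_{n,n}(x)] by descending the tree one layer at a time."""
    row = [1]  # p_{0,0} = 1
    for m in range(1, n + 1):
        padded = [0] + row + [0]  # p_{m-1,j} = 0 for j < 0 or j > m-1
        # p_{m,k} = x * p_{m-1,k-1} + (1-x) * p_{m-1,k}
        row = [x * padded[k] + (1 - x) * padded[k + 1] for k in range(m + 1)]
    return row


def one_bit_polynomial(signs, x):
    """Evaluate sum_k sigma_k p_{n,k}(x) for signs sigma_k in {+1, -1}."""
    n = len(signs) - 1
    return sum(s * p for s, p in zip(signs, bernstein_row(n, x)))


x = Fraction(3, 10)
n = 8
row = bernstein_row(n, x)
closed_form = [comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)]
print(row == closed_form, sum(row) == 1)
```

Only the final signed sum depends on f. The tree itself is shared by every target function, which is the universality noted above.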

In order to turn this schematic diagram into a proper feedforward neural network, it suffices to convert its multiplications (by $$1-x$$ and x) into a standard neural network operation. We show below that this can be done using $$\pm 1$$ weights in the simplest way if the activation unit is given by the quadratic function $$s(u):=\frac{1}{2}u^2$$. Indeed, with this unit, the multiplication of any two quantities a and b can be implemented simply as

\begin{aligned} ab = s(a+b) - s(a) - s(b). \end{aligned}

Then, for any $$m=1,\dots ,n$$, the quantities $$p_{m{-}1,j}(x)$$, $$j=0,\dots ,m-1$$, associated with the vertices in the mth layer of this tree satisfy the relations

\begin{aligned} p_{m,k}(x) = \left\{ \begin{array}{ll} s(1-x+p_{m-1,0}(x)) - s(1-x) - s(p_{m-1,0}(x)), &{} k=0,\\ \begin{array}{l}\Big (s(x+p_{m-1,k-1}(x)) - s(x) - s(p_{m-1,k-1}(x)) \\ +\, s(1-x+p_{m-1,k}(x)) - s(1-x) - s(p_{m-1,k}(x))\Big ) \end{array}&{} 0< k < m, \\ s(x+p_{m-1,m-1}(x)) - s(x) - s(p_{m-1,m-1}(x)), &{} k=m. \end{array} \right. \nonumber \\ \end{aligned}
(55)

This representation shows that the $$p_{m,k}(x)$$ will not actually correspond to the physical outputs of the nodes of our neural network, but rather to certain intermediate $$\pm 1$$ linear combinations of the outputs of up to 6 of its nodes. If $$m < n$$, we add x or $$1-x$$ to these before passing them onto subsequent nodes. If $$m=n$$, we take a final $$\pm 1$$ linear combination using the coefficients $$\sigma _k$$.

Let us describe the nodes of our network more precisely. For each $$m=0,\dots ,n-1$$, there will be a layer whose nodes output the functions $$X_{m,k}$$, $$Y_{m,k}$$, $$Z_{m,k}$$, $$k=0,\dots ,m$$, where we would like

\begin{aligned} X_{m,k}(x) = s(p_{m,k}(x)), ~~~ Y_{m,k}(x) = s(1-x+p_{m,k}(x)), ~~~ Z_{m,k}(x) = s(x+p_{m,k}(x)). \end{aligned}

Therefore, we set $$U:= s(1-x)$$ and $$V:=s(x)$$ and define these functions for $$m=1,\dots ,n-1$$ via the recurrence

\begin{aligned} X_{m,k}:= & {} \left\{ \begin{array}{ll} s(Y_{m-1,0}-U-X_{m-1,0}), &{} k=0,\\ s(Z_{m-1,k-1}-V-X_{m-1,k-1}&{}\\ \quad +Y_{m-1,k}-U-X_{m-1,k}),&{} 0< k< m, \\ s(Z_{m-1,m-1}-V-X_{m-1,m-1}), &{} k=m, \end{array} \right. \\ Y_{m,k}:= & {} \left\{ \begin{array}{ll} s(1-x+Y_{m-1,0}-U-X_{m-1,0}), &{} k=0,\\ s(1-x+Z_{m-1,k-1}-V-X_{m-1,k-1}&{}\\ \quad +Y_{m-1,k}-U-X_{m-1,k}),&{} 0< k< m, \\ s(1-x+Z_{m-1,m-1}-V-X_{m-1,m-1}), &{} k=m, \end{array} \right. \\ Z_{m,k}:= & {} \left\{ \begin{array}{ll} s(x+Y_{m-1,0}-U-X_{m-1,0}), &{} k=0,\\ s(x+Z_{m-1,k-1}-V-X_{m-1,k-1}&{}\\ \quad +Y_{m-1,k}-U-X_{m-1,k}),&{} 0< k < m, \\ s(x+Z_{m-1,m-1}-V-X_{m-1,m-1}), &{} k=m. \end{array} \right. \end{aligned}

We also set $$X_{0,0} := s(1)$$, $$Y_{0,0} := s(1-x+1)$$, $$Z_{0,0}:=s(x+1)$$.

The final output of the network is the function $$\displaystyle f_{\textrm{NN}}(x)$$ given by $$\sum _k \sigma _k p_{n,k}(x)$$. The $$p_{n,k}(x)$$ are still available indirectly through (55) for $$m=n$$, so we define

\begin{aligned} f_\textrm{NN}:= & {} \sigma _0 Y_{n-1,0} - \sigma _0 U - \sigma _0 X_{n-1,0} \nonumber \\{} & {} + \,\sum _{k=1}^{n-1} (\sigma _k Z_{n-1,k-1} - \sigma _k V - \sigma _k X_{n-1,k-1} + \sigma _k Y_{n-1,k} - \sigma _k U - \sigma _k X_{n-1,k}) \nonumber \\{} & {} + \,\sigma _{n} Z_{n-1,n-1} - \sigma _{n} V - \sigma _{n} X_{n-1,n-1}. \end{aligned}
(56)
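The construction can be simulated end to end. The sketch below is a minimal illustration under our own naming. It computes the node outputs $$X_{m,k}$$, $$Y_{m,k}$$, $$Z_{m,k}$$ layer by layer using only the activation $$s$$ and $$\pm 1$$ combinations, applies (55) with $$m=n$$ for the output, and compares with the direct evaluation of $$\sum _k \sigma _k p_{n,k}(x)$$. Node duplication and repeaters are not modeled, and the signs are arbitrary placeholders.

```python
from fractions import Fraction
from math import comb


def s(u):
    """Quadratic activation s(u) = u^2 / 2."""
    return u * u / 2


def next_row(X, Y, Z, U, V):
    """Form p_{m,k}, k = 0..m, as +/-1 combinations of layer m-1 outputs, as in (55)."""
    m = len(X)
    row = []
    for k in range(m + 1):
        p = 0
        if k >= 1:
            p += Z[k - 1] - V - X[k - 1]  # equals x * p_{m-1,k-1}
        if k <= m - 1:
            p += Y[k] - U - X[k]          # equals (1 - x) * p_{m-1,k}
        row.append(p)
    return row


def f_nn(signs, x):
    """Network output for signs sigma_0..sigma_n (assumes n >= 1)."""
    n = len(signs) - 1
    xbar = 1 - x                 # second network input
    one = x + xbar               # the constant 1 as a sum of the two inputs
    U, V = s(xbar), s(x)
    X, Y, Z = [s(one)], [s(xbar + one)], [s(x + one)]  # layer m = 0, p_{0,0} = 1
    for _ in range(1, n):        # layers m = 1, ..., n-1
        row = next_row(X, Y, Z, U, V)
        X = [s(p) for p in row]
        Y = [s(xbar + p) for p in row]
        Z = [s(x + p) for p in row]
    # final +/-1 combination using (55) with m = n
    return sum(sig * p for sig, p in zip(signs, next_row(X, Y, Z, U, V)))


signs = [1, -1, -1, 1, 1]
x = Fraction(1, 3)
n = len(signs) - 1
direct = sum(sg * comb(n, k) * x**k * (1 - x) ** (n - k) for k, sg in enumerate(signs))
print(f_nn(signs, x) == direct)
```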

We note that this expression contains two copies of the $$X_{n-1,j}$$, $$j=0,\dots ,n-1$$, each carrying a $$\pm 1$$ weight. If these weights were to be combined, then we would get a new weight in the set $$\{-2,0,2\}$$. This inconsistency can easily be removed by duplicating the nodes that produce $$X_{n-1,j}$$. Of course, the same comment applies to U and V which would need to be copied n times. A copy of a node is easily created using the identity

\begin{aligned} a = s(a+1) - s(a) - s(1). \end{aligned}

The appearance of U and V in each step of the recurrence means that the network is not completely layered, containing some “skip” connections. However, the copying mechanism above can also be used as a repeater to remove these skip connections; see Fig. . All of these additional operations can also be avoided by means of special networks (e.g. [11]) where it is permissible to create certain channels to push forward input values or other intermediate computations, or by allowing for more than one type of activation function to be used at the nodes (e.g. the identity function).

Finally, we note that our network takes as input both x and 1 (or alternatively, x and $$1-x$$). This choice has allowed us to avoid the use of any bias values associated to the activation function.

## Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Güntürk, C.S., Li, W. Approximation with One-Bit Polynomials in Bernstein Form. Constr Approx 57, 601–630 (2023). https://doi.org/10.1007/s00365-022-09608-y

• DOI: https://doi.org/10.1007/s00365-022-09608-y