Abstract
We study the expressivity of deep neural networks. Measuring a network’s complexity by its number of connections or by its number of neurons, we consider the class of functions for which the error of best approximation with networks of a given complexity decays at a certain rate when increasing the complexity budget. Using results from classical approximation theory, we show that this class can be endowed with a (quasi)-norm that makes it a linear function space, called approximation space. We establish that allowing the networks to have certain types of “skip connections” does not change the resulting approximation spaces. We also discuss the role of the network’s nonlinearity (also known as activation function) on the resulting spaces, as well as the role of depth. For the popular ReLU nonlinearity and its powers, we relate the newly constructed spaces to classical Besov spaces. The established embeddings highlight that some functions of very low Besov smoothness can nevertheless be well approximated by neural networks, if these networks are sufficiently deep.
Notes
See, e.g., [4, Section 3] for reminders on quasi-norms and quasi-Banach spaces.
A function \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is a squashing function if it is non-decreasing with \(\lim _{x \rightarrow -\infty } \sigma (x) = 0\) and \(\lim _{x \rightarrow \infty } \sigma (x) = 1\); see [36, Definition 2.3].
Note that \(4 = 1 \!\mod 3\) and hence \(4^n - 1 = 0 \!\mod 3\), so that \(w \in {\mathbb {N}}\).
Notice the restriction to \(W,N \ge 1\); in fact, the result of Lemma 4.11 as stated cannot hold for \(W=0\) or \(N=0\).
Here, the term “domain” is to be understood as an open connected set.
With the convention \(\lfloor \infty /2\rfloor = \infty -1 = \infty \).
For instance, [26, Proposition 4.35] shows that each function in \(C_0({\mathbb {R}}^d)\) is a uniform limit of continuous, compactly supported functions, [27, Proposition (2.6)] shows that such functions are uniformly continuous, while [63, Theorem 12.8] shows that the uniform continuity is preserved by the uniform limit.
This implicitly uses that \(\varrho _i\) is not affine-linear, so that \( \varrho _i \in \overline{{\mathtt {NN}}^{\varrho _r,1,1}_{2 \cdot 4^{r-i},2,2^{r-i}} {\setminus } {\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,1,\infty }} \).
References
Adams, R.A., Fournier, J.J.F.: Sobolev Spaces. Pure and Applied Mathematics (Amsterdam), vol. 140, 2nd edn. Elsevier/Academic Press, Amsterdam (2003)
Adler, J., Öktem, O.: Solving ill-posed inverse problems using iterative deep neural networks. Inverse Probl. 33, 124007 (2017)
Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis: A Hitchhiker’s Guide, third edn. Springer, Berlin (2006)
Almira, J.M., Luther, U.: Generalized approximation spaces and applications. Math. Nachr. 263(264), 3–35 (2004)
Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930–945 (1993)
Barron, A.R.: Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14(1), 115–133 (1994)
Bartlett, P.L., Harvey, N., Liaw, C., Mehrabian, A.: Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv (2017)
Beck, C., Becker, S., Grohs, P., Jaafari, N., Jentzen, A.: Solving stochastic differential equations and Kolmogorov equations by means of deep learning (2018). arXiv preprint arXiv:1806.00421
Bölcskei, H., Grohs, P., Kutyniok, G., Petersen, P.: Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1, 8–45 (2019)
Bubba, T.A., Kutyniok, G., Lassas, M., März, M., Samek, W., Siltanen, S., Srinivasan, V.: Learning the invisible: a hybrid deep learning-shearlet framework for limited angle computed tomography. Inverse Probl. 35(6), 064002 (2019)
Bui, H.-Q., Laugesen, R.S.: Affine systems that span Lebesgue spaces. J. Fourier Anal. Appl. 11(5), 533–556 (2005)
Candès, E.J.: Ridgelets: Theory and Applications. Ph.D. thesis, Stanford University (1998)
Chui, C.K., Li, X., Mhaskar, H.N.: Neural networks for localized approximation. Math. Comput. 63(208), 607–623 (1994)
Cohen, N., Sharir, O., Shashua, A.: On the expressive power of deep learning: a tensor analysis. In: Conference on Learning Theory, pp. 698–728 (2016)
Cohen, N., Shashua, A.: Convolutional rectifier networks as generalized tensor decompositions. In: International Conference on Machine Learning, pp. 955–963 (2016)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
DeVore, R.A.: Nonlinear approximation. In: Acta Numerica, pp. 51–150. Cambridge Univ. Press, Cambridge (1998)
DeVore, R.A., Oskolkov, K.I., Petrushev, P.P.: Approximation by feed-forward neural networks. Ann. Numer. Math. 4, 261–287 (1996)
DeVore, R.A., Popov, V.A.: Interpolation of Besov spaces. Trans. Am. Math. Soc. 305(1), 397–414 (1988)
DeVore, R.A., Sharpley, R.C.: Besov spaces on domains in \({\mathbb {R}}^d\). Trans. Am. Math. Soc. 335(2), 843–864 (1993)
DeVore, R.A., Lorentz, G.G.: Constructive approximation. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 303. Springer, Berlin (1993)
Elad, M.: Deep, Deep Trouble. Deep Learning’s Impact on Image Processing, Mathematics, and Humanity. SIAM News (2017)
Eldan, R., Shamir, O.: The power of depth for feedforward neural networks. In: Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, pp. 907–940. (2016)
Ellacott, S.W.: Aspects of the numerical analysis of neural networks. Acta Numer. 3, 145–202 (1994)
Elstrodt, J.: Maß- und Integrationstheorie. Springer Spektrum, 8th edn. Springer, Berlin, Heidelberg (2018)
Folland, G.B.: Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics, 2nd edn. Wiley, Amsterdam (1999)
Folland, G.B.: A Course in Abstract Harmonic Analysis. Studies in Advanced Mathematics, 2nd edn. CRC Press, Boca Raton (1995)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Springer, Berlin (2012)
Funahashi, K.-I.: On the approximate realization of continuous mappings by neural networks. Neural Netw. 2(3), 183–192 (1989)
Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: Proceedings of the 27th Annual International Conference on Machine Learning, pp. 399–406 (2010)
Håstad, J.T.: Computational limitations for small-depth circuits. ACM Doctoral Dissertation Award (1986) (1987)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pp. 1026–1034. IEEE Computer Society, Washington, DC, USA (2015)
Hoffman, K., Kunze, R.: Linear Algebra, 2nd edn. Prentice-Hall Inc, Englewood Cliffs (1971)
Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
Johnen, H., Scherer, K.: On the equivalence of the \(K\)-functional and moduli of continuity and some applications. In: Constructive Theory of Functions of Several Variables (Proc. Conf., Math. Res. Inst., Oberwolfach, 1976), pp. 119–140. Lecture Notes in Math., Vol. 571. Springer, Berlin (1977)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, NIPS’12, pp. 1097–1105. Curran Associates Inc, USA (2012)
Laugesen, R.S.: Affine synthesis onto \(L^p\) when \(0<p\le 1\). J. Fourier Anal. Appl. 14(2), 235–266 (2008)
Lax, P.D., Terrell, M.S.: Calculus with Applications. Undergraduate Texts in Mathematics, 2nd edn. Springer, New York (2014)
Le Magoarou, L., Gribonval, R.: Flexible multi-layer sparse approximations of matrices and applications. IEEE J. Sel. Top. Signal Process. 10(4), 688–700 (2016). https://doi.org/10.1109/JSTSP.2016.2543461
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Leshno, M., Lin, V.Ya., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6), 861–867 (1993)
Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the ICML, vol. 30, pp. 3 (2013)
Maiorov, V., Pinkus, A.: Lower bounds for approximation by MLP neural networks. Neurocomputing 25(1), 81–91 (1999)
Mallat, S.: Understanding deep convolutional networks. Philos. Trans. R. Soc. A 374(2065), 20150203–16 (2016)
Mardt, A., Pasquali, L., Wu, H., Noé, F.: Vampnets: deep learning of molecular kinetics. Nat. Commun. 9, 5 (2018)
McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5(4), 115–133 (1943)
Mhaskar, H.N.: Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1(1), 61–80 (1993)
Mhaskar, H.N.: Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. 8(1), 164–177 (1996)
Mhaskar, H.N., Poggio, T.: Deep vs. shallow networks: an approximation theory perspective. Anal. Appl. 14(06), 829–848 (2016)
Mhaskar, H.N., Micchelli, C.A.: Degree of approximation by neural and translation networks with a single hidden layer. Adv. Appl. Math. 16(2), 151–183 (1995)
Nguyen-Thien, T., Tran-Cong, T.: Approximation of functions and their derivatives: a neural network implementation with applications. Appl. Math. Model. 23(9), 687–704 (1999)
Orhan, A.E., Pitkow, X.: Skip Connections Eliminate Singularities (2017). arXiv preprint arXiv:1701.09175
Petersen, P., Voigtlaender, F.: Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108, 296–330 (2018)
Petrushev, P.P.: Direct and converse theorems for spline and rational approximation and Besov spaces. In: Function Spaces and Applications (Lund, 1986), volume 1302 of Lecture Notes in Math., pp. 363–377. Springer, Berlin (1988)
Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. Springer, Cham (2015)
Rudin, W.: Functional Analysis. International Series in Pure and Applied Mathematics, 2nd edn. McGraw-Hill Inc, New York (1991)
Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function (2017). arXiv preprint arXiv:1708.06633
Schütt, K.T., Arbabzadah, F., Chmiela, S., Müller, K.R., Tkatchenko, A.: Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 13890 (2017)
Shaham, U., Cloninger, A., Coifman, R.R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44(3), 537–557 (2018)
Somashekhar, A.N., Peters, J.F.: Topology with Applications. World Scientific, Singapore (2013)
Telgarsky, M.: Benefits of depth in neural networks (2016). arXiv preprint arXiv:1602.04485
Unser, M.A.: Splines: a perfect fit for signal and image processing. IEEE Signal Process. Mag. 16(6), 22–38 (1999)
Voigtlaender, F.: Embedding Theorems for Decomposition Spaces with Applications to Wavelet Coorbit Spaces. PhD thesis, RWTH Aachen University (2015). http://publications.rwth-aachen.de/record/564979
Wu, Z., Shen, C., Hengel, A.V.D.: Wider or deeper: revisiting the resnet model for visual recognition (2016). arXiv preprint arXiv:1611.10080
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017). https://doi.org/10.1016/j.neunet.2017.07.002
Yarotsky, D.: Optimal approximation of continuous functions by very deep ReLU networks (2018). arXiv preprint arXiv:1802.03620
Additional information
Communicated by Wolfgang Dahmen, Ronald A. Devore, and Philipp Grohs.
This work was conducted while R.G. was with Univ Rennes, Inria, CNRS, IRISA.
G.K. acknowledges partial support by the Bundesministerium für Bildung und Forschung (BMBF) through the Berliner Zentrum for Machine Learning (BZML), Project AP4, RTG DAEDALUS (RTG 2433), Projects P1 and P3, RTG BIOQIC (RTG 2260), Projects P4 and P9, and by the Berlin Mathematics Research Center MATH+, Projects EF1-1 and EF1-4. G.K. and F.V. acknowledge support by the European Commission-Project DEDALE (Contract No. 665044) within the H2020 Framework.
Appendices
Appendix A. Proofs for Section 2
For a matrix \(A \in {\mathbb {R}}^{n \times d}\), we write \(A^T \in {\mathbb {R}}^{d \times n}\) for the transpose of A. For \(i \in \{1,\dots ,n\}\), we write \(A_{i,-} \in {\mathbb {R}}^{1 \times d}\) for the i-th row of A, while \(A_{{(i)}} \in {\mathbb {R}}^{(n-1) \times d}\) denotes the matrix obtained by deleting the i-th row of A. We use the same notation \(b_{(i)}\) for vectors \(b\in {\mathbb {R}}^{n}\cong {\mathbb {R}}^{n \times 1}\). Finally, for \(j \in \{1,\dots ,d\}\), \(A_{[j]} \in {\mathbb {R}}^{n \times (d-1)}\) denotes the matrix obtained by removing the j-th column of A.
1.1 Proof of Lemma 2.6
Write \(N_0 (\Phi ) := {d_{\mathrm {in}}}(\Phi ) + {d_{\mathrm {out}}}(\Phi ) + N(\Phi )\) for the total number of neurons of the network \(\Phi \), including the “non-hidden” neurons.
The proof is by contradiction. Assume that there is a network \(\Phi \) for which the claim fails. Among all such networks, consider one with minimal value of \(N_0(\Phi )\), i.e., such that the claim holds for all networks \(\Psi \) with \(N_0(\Psi ) < N_0(\Phi )\). Let us write \(\Phi = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )\) with \(T_\ell \, x = A^{(\ell )} x + b^{(\ell )}\), for certain \(A^{(\ell )} \in {\mathbb {R}}^{N_\ell \times N_{\ell -1}}\) and \(b^{(\ell )} \in {\mathbb {R}}^{N_\ell }\).
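Let us first consider the case that
\( A^{(\ell )}_{i,-} \ne 0 \quad \text {for all } \ell \in \{1,\dots ,L\} \text { and all } i \in \{1,\dots ,N_\ell \}. \qquad \text {(A.1)} \)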
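By (A.1), we get \(\Vert A^{(\ell )}\Vert _{\ell ^0} \ge N_\ell \ge \Vert b^{(\ell )}\Vert _{\ell ^0}\), so that
\( W_0 (\Phi ) = \sum _{\ell =1}^{L} \big ( \Vert A^{(\ell )}\Vert _{\ell ^0} + \Vert b^{(\ell )}\Vert _{\ell ^0} \big ) \le 2 \sum _{\ell =1}^{L} \Vert A^{(\ell )}\Vert _{\ell ^0} = 2 \, W(\Phi ) \le {d_{\mathrm {out}}}(\Phi ) + 2 \, W(\Phi ). \)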
Hence, with \({\widetilde{\Phi }} = \Phi \), \(\Phi \) satisfies the claim of the lemma, in contradiction to our assumption.
Thus, there is some \(\ell _0 \in \{1,\dots ,L\}\) and some \(i \in \{1,\dots ,N_{\ell _0}\}\) satisfying \(A^{(\ell _0)}_{i, -} = 0\). In other words, there is a neuron that is not connected to the previous layers. Intuitively, one can “remove it” without changing \({\mathtt {R}}(\Phi )\). This is what we now show formally.
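Let us write \(\alpha _\ell = \bigotimes _{j=1}^{N_\ell } \varrho _j^{(\ell )}\) for certain \(\varrho _j^{(\ell )} \in \{\mathrm {id}_{{\mathbb {R}}}, \varrho \}\), and set \(\theta _\ell := \alpha _\ell \circ T_\ell \), so that \({\mathtt {R}}(\Phi ) = \theta _L \circ \cdots \circ \theta _1\). By our choice of \(\ell _0\) and i, note
\( \big ( \theta _{\ell _0} (x) \big )_i = \varrho _i^{(\ell _0)} \big ( (T_{\ell _0} \, x)_i \big ) = \varrho _i^{(\ell _0)} \big ( b_i^{(\ell _0)} \big ) =: c \in {\mathbb {R}} \qquad \text {(A.2)} \)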
for arbitrary \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\). After these initial observations, we now distinguish four cases:
Case 1 (Neuron on the output layer of size \({d_{\mathrm {out}}}(\Phi ) = 1\)): We have \(\ell _0 = L\) and \(N_L = 1\), so that necessarily \(i = 1\). In view of Eq. (A.2), we then have \({\mathtt {R}}(\Phi ) \equiv c\). Thus, if we choose the affine-linear map \(S_1 : {\mathbb {R}}^{N_0}\rightarrow {\mathbb {R}}^1, x\mapsto c\), and set \(\gamma _1 := \mathrm {id}_{\mathbb {R}}\), then the strict \(\varrho \)-network \({\widetilde{\Phi }} := \big ( (S_1, \gamma _1) \big )\) satisfies \({\mathtt {R}}(\, {\widetilde{\Phi }} \,) \equiv c \equiv {\mathtt {R}}(\Phi )\), and \(L(\, {\widetilde{\Phi }} \,) = 1 \le L(\Phi )\), as well as \(W_0(\, {\widetilde{\Phi }} \,) =1={d_{\mathrm {out}}}(\Phi ) \le {d_{\mathrm {out}}}(\Phi ) +2 W(\Phi )\) and \(N(\, {\widetilde{\Phi }} \,) = 0 \le N(\Phi )\). Thus, \(\Phi \) satisfies the claim of the lemma, contradicting our assumption.
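Case 2 (Neuron on the output layer of size \({d_{\mathrm {out}}}(\Phi )>1\)): We have \(\ell _0 = L\) and \(N_L > 1\). Define \(B^{(\ell )} := A^{(\ell )}\), \(c^{(\ell )} := b^{(\ell )}\), and \(\beta _\ell := \alpha _\ell \) for \(\ell \in \{1,\dots ,L-1\}\).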
We then set \(B^{(L)} := A^{(L)}_{(i)} \in {\mathbb {R}}^{(N_L - 1) \times N_{L-1}}\) and \(c^{(L)} := b^{(L)}_{(i)} \in {\mathbb {R}}^{N_{L} - 1}\), as well as \(\beta _L := \mathrm {id}_{{\mathbb {R}}^{N_L - 1}}\).
Setting \(S_\ell \, x := B^{(\ell )} x+c^{(\ell )}\) for \(x \in {\mathbb {R}}^{N_{\ell - 1}}\), the network \(\Phi _0 := \big ( (S_1, \beta _1), \dots , (S_L,\beta _L) \big )\) then satisfies \({\mathtt {R}}(\Phi _0) (x) = \big ( {\mathtt {R}}(\Phi ) (x) \big )_{(i)}\) for all \(x \in {\mathbb {R}}^{N_0}\), and \(N_0 (\Phi _0) = N_0 (\Phi ) - 1 < N_0 (\Phi )\). Furthermore, if \(\Phi \) is strict, then so is \(\Phi _0\).
By the “minimality” assumption on \(\Phi \), there is thus a network \({\widetilde{\Phi }}_0\) (which is strict if \(\Phi \) is strict) with \({\mathtt {R}}(\, {\widetilde{\Phi }}_0 \,) = {\mathtt {R}}(\Phi _0)\) and such that \(L' := L(\, {\widetilde{\Phi }}_0 \,) \le L(\Phi _0) = L(\Phi )\), as well as \(N (\, {\widetilde{\Phi }}_0 \,) \le N (\Phi _0) = N(\Phi )\), and
\( W (\, {\widetilde{\Phi }}_0 \,) \le W_0 (\, {\widetilde{\Phi }}_0 \,) \le {d_{\mathrm {out}}}(\Phi _0) + 2 \, W(\Phi _0) = {d_{\mathrm {out}}}(\Phi ) - 1 + 2 \, W(\Phi ). \)
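Let us write \({\widetilde{\Phi }}_0 = \big ( (U_1, \gamma _1), \dots , (U_{L'}, \gamma _{L'}) \big )\), with affine-linear maps \(U_\ell : {\mathbb {R}}^{M_{\ell - 1}} \rightarrow {\mathbb {R}}^{M_\ell }\), so that \(U_\ell \, x = C^{(\ell )} x + d^{(\ell )}\) for \(\ell \in \{1,\dots ,L'\}\) and \(x \in {\mathbb {R}}^{M_{\ell - 1}}\). Note that \(M_{L'} = N_L - 1\), and define \({\widetilde{C}}^{(L')} \in {\mathbb {R}}^{N_L \times M_{L' - 1}}\) as the matrix obtained from \(C^{(L')}\) by inserting a zero row at position i, that is, \({\widetilde{C}}^{(L')}_{i,-} := 0\) and \({\widetilde{C}}^{(L')}_{(i)} := C^{(L')}\), together with
\( {\widetilde{d}}^{(L')} := \big ( d^{(L')}_1, \dots , d^{(L')}_{i-1}, c, d^{(L')}_{i}, \dots , d^{(L')}_{N_L - 1} \big )^{\mathrm{T}} \in {\mathbb {R}}^{N_L} , \)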
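as well as \({\widetilde{\gamma }}_{L'} := \mathrm {id}_{{\mathbb {R}}^{N_L}}\), and \({\widetilde{U}}_{L'} : {\mathbb {R}}^{M_{L' - 1}} \rightarrow {\mathbb {R}}^{N_L}, x \mapsto {\widetilde{C}}^{(L')} x + {\widetilde{d}}^{(L')}\), and finally,
\( {\widetilde{\Phi }} := \big ( (U_1, \gamma _1), \dots , (U_{L' - 1}, \gamma _{L' - 1}), ({\widetilde{U}}_{L'}, {\widetilde{\gamma }}_{L'}) \big ). \)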
By virtue of Eq. (A.2), we then have \({\mathtt {R}}(\, {\widetilde{\Phi }} \,) = {\mathtt {R}}(\Phi )\), and if \(\Phi \) is strict, then so is \(\Phi _0\) and thus also \({\widetilde{\Phi }}_0\) and \({\widetilde{\Phi }}\). Furthermore, we have \(L (\, {\widetilde{\Phi }} \,) = L' \le L(\Phi )\), and \(N(\, {\widetilde{\Phi }} \,) = N({\widetilde{\Phi }}_0) \le N(\Phi )\), as well as \(W (\, {\widetilde{\Phi }} \,) \le W_0 (\, {\widetilde{\Phi }} \,) \le 1 + W_0 (\, {\widetilde{\Phi }} \,_0) \le {d_{\mathrm {out}}}(\Phi ) + 2 W(\Phi )\). Thus, \(\Phi \) satisfies the claim of the lemma, contradicting our assumption.
Case 3 (Hidden neuron on layer \(\ell _0\) with \(N_{\ell _0} = 1\)): We have \(1 \le \ell _0 < L\) and \(N_{\ell _0} = 1\). In this case, Eq. (A.2) implies \(\theta _{\ell _0} \equiv c\), whence \({\mathtt {R}}(\Phi ) = \theta _L \circ \cdots \circ \theta _1 \equiv {\widetilde{c}}\) for some \({\widetilde{c}} \in {\mathbb {R}}^{N_L}\).
Thus, if we choose the affine map \(S_1 : {\mathbb {R}}^{N_0} \rightarrow {\mathbb {R}}^{N_L}, x \mapsto {\widetilde{c}}\), then the strict \(\varrho \)-network \({\widetilde{\Phi }} = \big ( (S_1, \gamma _1) \big )\) satisfies \({\mathtt {R}}({\widetilde{\Phi }}) \equiv {\widetilde{c}} \equiv {\mathtt {R}}(\Phi )\) and \(L({\widetilde{\Phi }}) = 1 \le L(\Phi )\), as well as \(W_0 ({\widetilde{\Phi }}) \le d_{\mathrm {out}} (\Phi ) \le d_{\mathrm {out}} (\Phi ) + 2 \, W(\Phi )\) and \(N({\widetilde{\Phi }}) = 0 \le N(\Phi )\). Thus, \(\Phi \) satisfies the claim of the lemma, in contradiction to our choice of \(\Phi \).
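Case 4 (Hidden neuron on layer \(\ell _0\) with \(N_{\ell _0} > 1\)): In this case, we have \(1 \le \ell _0 < L\) and \(N_{\ell _0} > 1\). Define \(S_\ell := T_\ell \) and \(\beta _\ell := \alpha _\ell \) for \(\ell \in \{1,\dots ,L\} {\setminus } \{\ell _0, \ell _0 + 1\}\), and let us choose \({S_{\ell _0} : {\mathbb {R}}^{N_{\ell _0 - 1}} \rightarrow {\mathbb {R}}^{N_{\ell _0} - 1}, x \mapsto B^{(\ell _0)} x + c^{(\ell _0)}}\), where
\( B^{(\ell _0)} := A^{(\ell _0)}_{(i)}, \quad c^{(\ell _0)} := b^{(\ell _0)}_{(i)}, \quad \text {and} \quad \beta _{\ell _0} := \varrho _1^{(\ell _0)} \otimes \cdots \otimes \varrho _{i-1}^{(\ell _0)} \otimes \varrho _{i+1}^{(\ell _0)} \otimes \cdots \otimes \varrho _{N_{\ell _0}}^{(\ell _0)}. \)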
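Finally, for \(x \in {\mathbb {R}}^{N_{\ell _0} - 1}\), let \( \iota _c (x) := \left( x_1,\dots , x_{i-1}, c, x_i,\dots , x_{N_{\ell _0} - 1} \right) ^{\mathrm{T}} \in {\mathbb {R}}^{N_{\ell _0}} , \) and set \(\beta _{\ell _{0} + 1} := \alpha _{\ell _0 + 1}\), as well as
\( S_{\ell _0 + 1} : {\mathbb {R}}^{N_{\ell _0} - 1} \rightarrow {\mathbb {R}}^{N_{\ell _0 + 1}}, \quad x \mapsto T_{\ell _0 + 1} \big ( \iota _c (x) \big ) = A^{(\ell _0 + 1)}_{[i]} \, x + c \, A^{(\ell _0 + 1)} e_i + b^{(\ell _0 + 1)} , \)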
where \(e_i\) is the i-th element of the standard basis of \({\mathbb {R}}^{N_{\ell _0}}\).
Setting \(\vartheta _{\ell } := \beta _\ell \circ S_\ell \) and recalling that \(\theta _\ell = \alpha _\ell \circ T_\ell \) for \(\ell \in \{1,\dots ,L\}\), we then have \(\vartheta _{\ell _0} (x) = (\theta _{\ell _0} (x) )_{(i)}\) for all \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\). By virtue of Eq. (A.2), this implies \(\theta _{\ell _0} (x) = \iota _c ( \vartheta _{\ell _0} (x) )\), so that
\( S_{\ell _0 + 1} \big ( \vartheta _{\ell _0} (x) \big ) = T_{\ell _0 + 1} \big ( \iota _c ( \vartheta _{\ell _0} (x) ) \big ) = T_{\ell _0 + 1} \big ( \theta _{\ell _0} (x) \big ). \)
Recalling that \(\beta _{\ell _0 + 1} = \alpha _{\ell _0 + 1}\), we thus see \(\vartheta _{\ell _0 + 1} \circ \vartheta _{\ell _0} = \theta _{\ell _0 + 1} \circ \theta _{\ell _0}\), which then easily shows \({\mathtt {R}}(\Phi _0) = {\mathtt {R}}(\Phi )\) for \(\Phi _0 := \big ( (S_1, \beta _1),\dots , (S_L, \beta _L) \big )\). Note that if \(\Phi \) is strict, then so is \(\Phi _0\). Furthermore, we have \(N_{0}(\Phi _0) = N_{0}(\Phi ) - 1 < N_{0}(\Phi )\) so that by “minimality” of \(\Phi \), there is a network \({\widetilde{\Phi }}_0\) (which is strict if \(\Phi \) is strict) satisfying \({\mathtt {R}}(\, {\widetilde{\Phi }}_0\,)={\mathtt {R}}(\Phi _0)={\mathtt {R}}(\Phi )\) and furthermore \(L(\, {\widetilde{\Phi }}_0 \,) \le L(\Phi _0) = L(\Phi )\), as well as \(N(\, {\widetilde{\Phi }}_0\, ) \le N(\Phi _0) \le N(\Phi )\), and finally \( W (\, {\widetilde{\Phi }}_0 \,) \le W_0 (\, {\widetilde{\Phi }}_0 \,) \le {d_{\mathrm {out}}}(\Phi _0) + 2 W(\Phi _0) \le {d_{\mathrm {out}}}(\Phi ) + 2 W(\Phi ). \) Thus, the claim holds for \(\Phi \), contradicting our assumption. \(\square \)
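The removal step of Case 4 can be checked numerically. The following is a minimal sketch, assuming a hypothetical list-of-triples representation `(A, b, acts)` for each layer and the ReLU as \(\varrho \); the names `realize` and `remove_dead_neuron` are illustrative and do not come from the paper. It deletes a hidden neuron whose incoming row vanishes and absorbs its constant output c from Eq. (A.2) into the bias of the next layer.

```python
def relu(t):
    return max(t, 0.0)

# Generalized networks may use either the identity or rho (here: ReLU) per neuron.
ACT = {"relu": relu, "id": lambda t: t}

def realize(layers, x):
    """Evaluate R(Phi)(x); each layer is a triple (A, b, acts) of plain lists."""
    for A, b, acts in layers:
        z = [sum(a * xj for a, xj in zip(row, x)) + bi for row, bi in zip(A, b)]
        x = [ACT[name](zi) for name, zi in zip(acts, z)]
    return x

def remove_dead_neuron(layers, l0, i):
    """Remove neuron i of hidden layer l0 (0-based) whose incoming row is zero.

    Hypothetical helper for the situation of Case 4 (layer l0 has > 1 neurons).
    """
    A, b, acts = layers[l0]
    assert len(A) > 1 and all(a == 0 for a in A[i])
    c = ACT[acts[i]](b[i])  # constant output of the neuron, cf. Eq. (A.2)
    A_next, b_next, acts_next = layers[l0 + 1]
    new_cur = (A[:i] + A[i + 1:], b[:i] + b[i + 1:], acts[:i] + acts[i + 1:])
    new_next = (
        [row[:i] + row[i + 1:] for row in A_next],              # drop column i
        [bj + c * row[i] for bj, row in zip(b_next, A_next)],   # add c * A e_i
        acts_next,
    )
    return layers[:l0] + [new_cur, new_next] + layers[l0 + 2:]

# Toy network: the second hidden neuron has a zero incoming row.
phi = [
    ([[1.0, -1.0], [0.0, 0.0]], [0.5, 2.0], ["relu", "relu"]),
    ([[1.0, 3.0]], [-1.0], ["id"]),
]
psi = remove_dead_neuron(phi, 0, 1)
```

In this toy example the realization is unchanged while the hidden layer shrinks from two neurons to one, mirroring \(N_0(\Phi _0) = N_0(\Phi ) - 1\).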
1.2 Proof of Lemma 2.14
We begin by showing \({\mathtt {NN}}_{W, L,W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,W,W}^{\varrho ,d,k}\). Let \(f \in {\mathtt {NN}}_{W, L,W}^{\varrho ,d,k}\). By definition, there is \(\Phi \in {\mathcal {NN}}_{W, L,W}^{\varrho ,d,k}\) such that \(f = {\mathtt {R}}(\Phi )\). Note that \(W(\Phi ) \le W\), and let us distinguish two cases: If \(L(\Phi ) \le W(\Phi )\) then \(L(\Phi ) \le W\), whence in fact \(\Phi \in {\mathcal {NN}}_{W, W, W}^{\varrho ,d,k}\) and \(f \in {\mathtt {NN}}_{W, W, W}^{\varrho ,d,k}\) as claimed. Otherwise, \(W(\Phi ) < L(\Phi )\) and by Corollary 2.10 we have \(f = {\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}^{k}\). Therefore, Lemma 2.13 shows that \(f \in {\mathtt {NN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,W,W}^{\varrho ,d,k}\), where the inclusion holds by definition of these sets.
The inclusion \({\mathtt {NN}}_{W,L,W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,L,\infty }^{\varrho ,d,k}\) is trivial. Similarly, if \(L \ge W\), then trivially \({\mathtt {NN}}_{W, W, W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,L,W}^{\varrho ,d,k}\).
Thus, it remains to show \({\mathtt {NN}}_{W,L,\infty }^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,L,W}^{\varrho ,d,k}\). To prove this, we will show that for each network \( \Phi = \big ( (T_{1}, \alpha _{1}), \dots , (T_{K}, \alpha _{K}) \big ) \in {\mathcal {NN}}_{W,L,\infty }^{\varrho ,d,k} \) (so that necessarily \(K \le L\)) with \(N (\Phi ) > W\), one can find a neural network \(\Phi ' \in {\mathcal {NN}}_{W,L, \infty }^{\varrho ,d,k}\) with \({\mathtt {R}}(\Phi ') = {\mathtt {R}}(\Phi )\), and such that \(N(\Phi ') < N(\Phi )\). If \(\Phi \) is strict, then we show that \(\Phi '\) can also be chosen to be strict. The desired inclusion can then be obtained by repeating this “compression” step until one reaches the point where \(N(\Phi ') \le W\).
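For each \(\ell \in \{1,\dots ,K\}\), let \(b^{(\ell )} \in {\mathbb {R}}^{N_\ell }\) and \(A^{(\ell )} \in {\mathbb {R}}^{N_{\ell } \times N_{\ell -1}}\) be such that \(T_\ell = A^{(\ell )} \bullet + b^{(\ell )}\). Since \(\Phi \in {\mathcal {NN}}_{W,L,\infty }^{\varrho ,d,k}\), we have \(W(\Phi )\le W\). In combination with \(N(\Phi ) > W\), this implies
\( \sum _{\ell =1}^{K-1} N_\ell = N(\Phi ) > W \ge W(\Phi ) = \sum _{\ell =1}^{K} \Vert A^{(\ell )}\Vert _{\ell ^0} \ge \sum _{\ell =1}^{K-1} \Vert A^{(\ell )}\Vert _{\ell ^0} . \)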
Therefore, \(K > 1\), and there must be some \(\ell _0 \in \{1,\dots ,K-1\}\) and \(i \in \{1,\dots ,N_{\ell _0}\}\) with \(A^{(\ell _0)}_{i, -} = 0\). We now distinguish two cases:
Case 1 (Single neuron on layer \(\ell _{0}\)): We have \(N_{\ell _0} = 1\). In this case, \(A^{(\ell _0)} = 0\) and hence \(T_{\ell _0} \equiv b^{(\ell _0)}\). Therefore, \({\mathtt {R}}(\Phi )\) is constant; say \({\mathtt {R}}(\Phi ) \equiv c \in {\mathbb {R}}^k\). Choose \(S_{1} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k, x \mapsto c\), and \(\beta _{1} := \mathrm {id}_{{\mathbb {R}}^k}\). Then, \({\mathtt {R}}(\Phi ) \equiv c \equiv {\mathtt {R}}(\Phi ')\) for the strict \(\varrho \)-network \( \Phi ' := \big ( (S_{1},\beta _{1}) \big ) \in {\mathcal {NN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathcal {NN}}_{W,L,\infty }^{\varrho ,d,k} \), which indeed satisfies \(N(\Phi ') = 0 \le W < N(\Phi )\).
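Case 2 (Multiple neurons on layer \(\ell _{0}\)): We have \(N_{\ell _0} > 1\). Recall that \(\ell _0 \in \{1,\dots ,K-1\}\), so that \(\ell _0 + 1 \in \{1,\dots ,K\}\). Now, define \(S_{\ell } := T_{\ell }\) and \(\beta _{\ell } := \alpha _{\ell }\) for \(\ell \in \{1,\dots ,K\} {\setminus } \{\ell _0, \ell _0 + 1\}\). Further, define
\( S_{\ell _0} : {\mathbb {R}}^{N_{\ell _0 - 1}} \rightarrow {\mathbb {R}}^{N_{\ell _0} - 1}, \quad x \mapsto \big ( (T_{\ell _0} \, x)_1, \dots , (T_{\ell _0} \, x)_{i-1}, (T_{\ell _0} \, x)_{i+1}, \dots , (T_{\ell _0} \, x)_{N_{\ell _0}} \big )^{\mathrm{T}} . \)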
Using the notation \(A_{(i)}, b_{(i)}\) from the beginning of Appendix A, this means \(S_{\ell _0} \, x=A^{(\ell _0)}_{(i)} x + b^{(\ell _0)}_{(i)} = (T_{\ell _0} \, x)_{(i)}\).
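Finally, writing \(\alpha _{\ell } = \varrho _1^{(\ell )} \otimes \cdots \otimes \varrho _{N_\ell }^{(\ell )}\) for \(\ell \in \{1,\dots ,K\}\), define \(\beta _{\ell _0 + 1} := \alpha _{\ell _0 + 1}\), as well as
\( \beta _{\ell _0} := \varrho _1^{(\ell _0)} \otimes \cdots \otimes \varrho _{i-1}^{(\ell _0)} \otimes \varrho _{i+1}^{(\ell _0)} \otimes \cdots \otimes \varrho _{N_{\ell _0}}^{(\ell _0)} \)
and
\( S_{\ell _0 + 1} : {\mathbb {R}}^{N_{\ell _0} - 1} \rightarrow {\mathbb {R}}^{N_{\ell _0 + 1}}, \quad y \mapsto A^{(\ell _0 + 1)}_{[i]} \, y + b^{(\ell _0 + 1)} + \varrho _i^{(\ell _0)} \big ( b_i^{(\ell _0)} \big ) \, A^{(\ell _0 + 1)} e_i , \)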
where \(e_i \in {\mathbb {R}}^{N_{\ell _0}}\) denotes the i-th element of the standard basis, and where \(A_{[i]}\) is the matrix obtained from a given matrix A by removing its i-th column.
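Now, for arbitrary \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\), let \(y := S_{\ell _0} \, x \in {\mathbb {R}}^{N_{\ell _0} - 1}\) and \(z := T_{\ell _0} \, x \in {\mathbb {R}}^{N_{\ell _0}}\). Because of \(A^{(\ell _0)}_{i, -} = 0\), we then have \(z_i = b_i^{(\ell _0)}\). Further, by definition of \(S_{\ell _0}\), we have \(y_{j}=(T_{\ell _0} \, x)_j = z_j\) for \(j<i\), and \(y_j =(T_{\ell _0} \, x)_{j+1}=z_{j+1}\) for \(j\ge i\). All in all, this shows
\( S_{\ell _0 + 1} \big ( \beta _{\ell _0} (y) \big ) = A^{(\ell _0 + 1)}_{[i]} \, \beta _{\ell _0} (y) + b^{(\ell _0 + 1)} + \varrho _i^{(\ell _0)} (z_i) \, A^{(\ell _0 + 1)} e_i = A^{(\ell _0 + 1)} \, \alpha _{\ell _0} (z) + b^{(\ell _0 + 1)} = T_{\ell _0 + 1} \big ( \alpha _{\ell _0} (z) \big ). \)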
Recall that this holds for all \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\). From this, it is not hard to see \({\mathtt {R}}(\Phi ) = {\mathtt {R}}(\Phi ')\) for the network \( \Phi ' := \big ( (S_{1}, \beta _{1}), \dots , (S_{K}, \beta _{K}) \big ) \in {\mathcal {NN}}_{\infty , K,\infty }^{\varrho ,d,k} \subset {\mathcal {NN}}_{\infty , L,\infty }^{\varrho ,d,k} \). Note that \(\Phi '\) is a strict network if \(\Phi \) is strict. Finally, directly from the definition of \(\Phi '\), we see \(W(\Phi ') \le W(\Phi ) \le W\), so that \(\Phi ' \in {\mathcal {NN}}_{W, L,\infty }^{\varrho ,d,k}\). Also, \(N(\Phi ') = N(\Phi ) - 1 < N(\Phi )\), as desired. \(\square \)
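The counting argument that starts this compression step is a pigeonhole principle: if there are more hidden neurons than nonzero weights, some hidden neuron has no incoming connection. A minimal sketch, assuming the weight matrices \(A^{(\ell )}\) are given as nested Python lists (the function names are hypothetical and biases are ignored, since they do not enter \(W(\Phi )\)):

```python
def count_nonzero(A):
    """Number of nonzero entries of a matrix, i.e. its l^0 'norm'."""
    return sum(1 for row in A for a in row if a != 0)

def num_weights(mats):
    """W(Phi): total number of nonzero weights over all layers."""
    return sum(count_nonzero(A) for A in mats)

def num_hidden_neurons(mats):
    """N(Phi): number of neurons on all layers except the output layer."""
    return sum(len(A) for A in mats[:-1])

def find_dead_hidden_neuron(mats):
    """Return (layer, index), both 0-based, of a hidden neuron with zero incoming row."""
    for l0, A in enumerate(mats[:-1]):
        for i, row in enumerate(A):
            if all(a == 0 for a in row):
                return l0, i
    return None

# Toy example: 4 hidden neurons but only 2 nonzero weights.
mats = [
    [[1.0], [0.0], [0.0], [0.0]],   # A^(1): one input, four hidden neurons
    [[1.0, 0.0, 0.0, 0.0]],         # A^(2): output layer
]
dead = find_dead_hidden_neuron(mats)
```

Whenever `num_hidden_neurons(mats) > num_weights(mats)`, the search cannot return `None`, because a layer in which every row is nonzero contributes at least as many weights as it has neurons.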
1.3 Proof of Lemma 2.16
Write \(\Phi = \big ( (T_1,\alpha _1),\dots ,(T_{L},\alpha _{L}) \big )\) with \(L = L(\Phi )\). If \(L_0 = 0\), we can simply choose \(\Psi = \Phi \). Thus, let us assume \(L_0 > 0\), and distinguish two cases:
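Case 1: If \(k \le d\), so that \(c = k\), set
\( \Psi := \big ( (T_1,\alpha _1), \dots , (T_L,\alpha _L), \underbrace{(T, \mathrm {id}_{{\mathbb {R}}^{k}}), \dots , (T, \mathrm {id}_{{\mathbb {R}}^{k}})}_{L_0 \text { terms}} \big ) , \)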
and note that the affine map \(T := \mathrm {id}_{{\mathbb {R}}^{k}}\) satisfies \(\Vert T\Vert _{\ell ^{0}} = k=c\), and hence \(W(\Psi ) = W(\Phi ) + c \, L_0\). Furthermore, \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi )\), \(L(\Psi ) = L(\Phi ) + L_0\), and \(N(\Psi ) = N(\Phi ) + c L_0\). Here, we used crucially that the definition of generalized neural networks allows us to use the identity as the activation function for some neurons.
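Case 2: If \(d < k\), so that \(c = d\), the proof proceeds as in the previous case, but with
\( \Psi := \big ( \underbrace{(T, \mathrm {id}_{{\mathbb {R}}^{d}}), \dots , (T, \mathrm {id}_{{\mathbb {R}}^{d}})}_{L_0 \text { terms}}, (T_1,\alpha _1), \dots , (T_L,\alpha _L) \big ) \quad \text {and} \quad T := \mathrm {id}_{{\mathbb {R}}^{d}} . \)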
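The two cases of this proof amount to padding the network with identity layers on the cheaper side. The following is a minimal sketch under the assumption that a layer is stored as a triple `(A, b, acts)` of plain lists and that \(\varrho \) is the ReLU; `pad_depth` and the other names are hypothetical. Identity layers are appended after the output when \(k \le d\) and prepended before the input when \(d < k\).

```python
def relu(t):
    return max(t, 0.0)

ACT = {"relu": relu, "id": lambda t: t}

def realize(layers, x):
    """Evaluate R(Phi)(x) for layers given as (A, b, acts) triples."""
    for A, b, acts in layers:
        z = [sum(a * xj for a, xj in zip(row, x)) + bi for row, bi in zip(A, b)]
        x = [ACT[name](zi) for name, zi in zip(acts, z)]
    return x

def num_weights(layers):
    """W(Phi): number of nonzero matrix entries."""
    return sum(1 for A, _, _ in layers for row in A for a in row if a != 0)

def num_hidden_neurons(layers):
    """N(Phi): neurons on all layers except the last one."""
    return sum(len(b) for _, b, _ in layers[:-1])

def identity_layer(n):
    """Layer computing x -> x on R^n with identity activations."""
    A = [[1.0 if r == s else 0.0 for s in range(n)] for r in range(n)]
    return (A, [0.0] * n, ["id"] * n)

def pad_depth(layers, L0):
    """Add L0 identity layers of width c = min(d, k) without changing R(Phi)."""
    d = len(layers[0][0][0])   # input dimension
    k = len(layers[-1][1])     # output dimension
    if k <= d:                 # Case 1: pad behind the output layer
        return layers + [identity_layer(k) for _ in range(L0)]
    return [identity_layer(d) for _ in range(L0)] + layers  # Case 2: pad in front

phi = [
    ([[1.0, -1.0], [2.0, 0.0]], [0.0, 1.0], ["relu", "relu"]),
    ([[1.0, 1.0]], [0.0], ["id"]),
]                                           # d = 2, k = 1, so c = 1
psi = pad_depth(phi, 3)

wide = [([[1.0], [-1.0]], [0.0, 0.0], ["id", "id"])]   # d = 1, k = 2, so c = 1
wide_padded = pad_depth(wide, 2)
```

The sketch relies on the same point as the proof: generalized networks may use the identity as activation function on hidden neurons, so the padded layers act as the identity.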
1.4 Proof of Lemma 2.17
For the proof of the first part, denoting \(\Phi = \big ( (T_{1}, \alpha _{1}), \dots , (T_L, \alpha _{L}) \big )\), we set \( \Psi := \big ( (T_{1}, \alpha _{1}), \dots , (c \cdot T_L, \alpha _{L}) \big ) \). By Definition 2.1, we have \(\alpha _{L} = \mathrm {id}_{{\mathbb {R}}^{k}}\); hence, one easily sees \({\mathtt {R}}(\Psi ) = c \cdot {\mathtt {R}}(\Phi )\). If \(\Phi \) is strict, then so is \(\Psi \). By construction, \(\Phi \) and \(\Psi \) have the same number of layers and neurons, and \(W(\Psi ) \le W(\Phi )\) with equality if \(c \ne 0\).
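The first part can be illustrated by rescaling only the last affine map. This is a minimal sketch with a hypothetical list-based layer format `(A, b, acts)` and the ReLU as \(\varrho \); the function name `scale_output` is illustrative.

```python
def relu(t):
    return max(t, 0.0)

ACT = {"relu": relu, "id": lambda t: t}

def realize(layers, x):
    """Evaluate R(Phi)(x) for layers given as (A, b, acts) triples."""
    for A, b, acts in layers:
        z = [sum(a * xj for a, xj in zip(row, x)) + bi for row, bi in zip(A, b)]
        x = [ACT[name](zi) for name, zi in zip(acts, z)]
    return x

def num_weights(layers):
    """W(Phi): number of nonzero matrix entries."""
    return sum(1 for A, _, _ in layers for row in A for a in row if a != 0)

def scale_output(layers, c):
    """Replace the last affine map T_L by c * T_L; the output activation is the identity."""
    A, b, acts = layers[-1]
    assert all(name == "id" for name in acts)
    scaled = ([[c * a for a in row] for row in A], [c * bi for bi in b], acts)
    return layers[:-1] + [scaled]

phi = [
    ([[1.0, -1.0], [2.0, 0.0]], [0.0, 1.0], ["relu", "relu"]),
    ([[1.0, 1.0]], [0.5], ["id"]),
]
psi = scale_output(phi, -2.0)
zero = scale_output(phi, 0.0)
```

For \(c = 0\) the last weight matrix vanishes, which is why only \(W(\Psi ) \le W(\Phi )\) holds in general.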
For the second and third part, we proceed by induction, using two auxiliary claims.
Lemma A.1
Let \(\Psi _1 \in {\mathcal {NN}}^{\varrho ,d,k_{1}}\) and \(\Psi _2 \in {\mathcal {NN}}^{\varrho ,d,k_{2}}\). There is a network \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k_{1}+k_{2}}\) with \({L(\Psi ) = \max \{ L(\Psi _1), L(\Psi _2) \}}\) such that \({\mathtt {R}}(\Psi ) = g\), where \( g : {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{k_{1}+k_{2}}, x \mapsto \big ( {\mathtt {R}}(\Psi _{1})(x),{\mathtt {R}}(\Psi _{2})(x) \big ) \). Furthermore, setting \(c := \min \big \{ d,\max \{ k_{1},k_{2} \} \big \}\), \(\Psi \) can be chosen to satisfy
\( W(\Psi ) \le W(\Psi _1) + W(\Psi _2) + c \, |L(\Psi _1) - L(\Psi _2)| \quad \text {and} \quad N(\Psi ) \le N(\Psi _1) + N(\Psi _2) + c \, |L(\Psi _1) - L(\Psi _2)| . \)
\(\blacktriangleleft \)
Lemma A.2
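Let \(\Psi _{1}, \Psi _{2} \in {\mathcal {NN}}^{\varrho ,d,k}\). There is \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\) with \(L(\Psi ) = \max \{ L(\Psi _1), L(\Psi _2) \}\) such that \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Psi _{1}) + {\mathtt {R}}(\Psi _{2})\) and, with \(c = \min \{d,k\}\),
\( W(\Psi ) \le W(\Psi _1) + W(\Psi _2) + c \, |L(\Psi _1) - L(\Psi _2)| \quad \text {and} \quad N(\Psi ) \le N(\Psi _1) + N(\Psi _2) + c \, |L(\Psi _1) - L(\Psi _2)| . \)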
\(\blacktriangleleft \)
Proof of Lemmas A.1 and A.2
Set \(L := \max \{ L(\Psi _1), L(\Psi _2) \}\) and \(L_i := L(\Psi _i)\) for \(i \in \{1,2\}\). By Lemma 2.16 applied to \(\Psi _i\) and \(L_0 = L - L_i \in {\mathbb {N}}_0\), we get for each \(i \in \{1,2\}\) a network \({\Psi _i ' \in {\mathcal {NN}}^{\varrho ,d,k_{i}}}\) with \({\mathtt {R}}(\Psi _i ') = {\mathtt {R}}(\Psi _i)\) and such that \(L(\Psi _i ') = L\), as well as \(W(\Psi _i ') \le W(\Psi _i) + c (L - L_i)\) and furthermore \(N(\Psi _i') \le N(\Psi _i) + c (L - L_i)\). By choice of L, we have \((L - L_1) + (L - L_2) = |L_1 - L_2|\), whence \(W(\Psi _1 ') + W(\Psi _2 ') \le W(\Psi _1) + W(\Psi _2) + c \, |L_1 - L_2|\), and \(N(\Psi _1 ') + N(\Psi _2 ') \le N(\Psi _1) + N(\Psi _2) + c \, |L_1 - L_2|\).
First, we deal with the pathological case \(L = 1\). In this case, each \(\Psi '_i\) is of the form \(\Psi '_i = \big ( ( T_i, \mathrm {id}_{{\mathbb {R}}^k}) \big )\), with \(T_i : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) an affine-linear map. For proving Lemma A.1, we set \(\Psi := \big ( (T,\mathrm {id}_{{\mathbb {R}}^{k_1+k_2}}) \big )\) with the affine-linear map \(T:{\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{k_1+k_2},\ x \mapsto \big ( T_1(x),T_2(x) \big )\), so that \({\mathtt {R}}(\Psi ) = g\). For proving Lemma A.2, we set \(\Psi := \big ( (T, \mathrm {id}_{{\mathbb {R}}^k}) \big )\) with \(T = T_1+T_2\), so that \( {\mathtt {R}}(\Psi ) = T_1 + T_2 = {\mathtt {R}}(\Psi '_1) + {\mathtt {R}}(\Psi '_2) = {\mathtt {R}}(\Psi _1) + {\mathtt {R}}(\Psi _2) \). Finally, we see for both cases that \(N(\Psi ) = 0 = N(\Psi '_1) + N(\Psi '_2)\) and
\( W(\Psi ) = \Vert T\Vert _{\ell ^0} \le \Vert T_1\Vert _{\ell ^0} + \Vert T_2\Vert _{\ell ^0} = W(\Psi '_1) + W(\Psi '_2) . \)
This establishes the result for the case \(L=1\).
For \(L > 1\), write \(\Psi _1 ' = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )\) and \(\Psi _2 ' = \big ( (S_1, \beta _1), \dots , (S_L, \beta _L) \big )\) with affine-linear maps \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) and \(S_\ell : {\mathbb {R}}^{M_{\ell -1}} \rightarrow {\mathbb {R}}^{M_\ell }\) for \(\ell \in \{1,\dots ,L\}\). Let us define \(\theta _\ell := \alpha _\ell \otimes \beta _\ell \) for \(\ell \in \{1,\dots ,L\}\), except for \(\ell = L\) when proving Lemma A.2, in which case we set \(\theta _L := \mathrm {id}_{{\mathbb {R}}^k}\). Next, set
\( R_1 : {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{N_1 + M_1}, \ x \mapsto (T_1 \, x, S_1 \, x) \quad \text {and} \quad R_\ell : {\mathbb {R}}^{N_{\ell - 1} + M_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell + M_\ell }, \ (x,y) \mapsto (T_\ell \, x, S_\ell \, y) \)
for \(2 \le \ell \le L\)—except if \(\ell = L\) when proving Lemma A.2. In this latter case, we instead define \(R_L\) as \( {R_L : {\mathbb {R}}^{N_{L-1} + M_{L-1}} \rightarrow {\mathbb {R}}^{k}, (x,y) \mapsto T_L \, x + S_L \, y} \). Finally, set \(\Psi := \big ( (R_1, \theta _1), \dots , (R_L, \theta _L) \big )\).
When proving Lemma A.1, it is straightforward to verify that \(\Psi \) satisfies
Similarly, when proving Lemma A.2, one can easily check that \( {\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Psi '_1) + {\mathtt {R}}(\Psi '_2) = {\mathtt {R}}(\Psi _1) + {\mathtt {R}}(\Psi _2) \).
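Further, for arbitrary \(\ell \in \{1,\dots ,L\}\), we have \(\Vert R_\ell \Vert _{\ell ^0} \le \Vert T_\ell \Vert _{\ell ^0}+ \Vert S_\ell \Vert _{\ell ^0}\) so that
$$\begin{aligned} W(\Psi ) = \sum _{\ell =1}^{L} \Vert R_\ell \Vert _{\ell ^0} \le \sum _{\ell =1}^{L} \Vert T_\ell \Vert _{\ell ^0} + \sum _{\ell =1}^{L} \Vert S_\ell \Vert _{\ell ^0} = W(\Psi _1 ') + W(\Psi _2 ') . \end{aligned}$$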
Finally, \(N(\Psi ) = \sum _{\ell =1}^{L-1} (N_{\ell }+M_{\ell }) = N(\Psi '_1)+N(\Psi '_2)\). Given the estimates for \(W(\Psi _1') + W(\Psi _2')\) and \(N(\Psi _1') + N(\Psi _2')\) stated at the beginning of the proof, this yields the claim. \(\square \)
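The construction for \(L > 1\) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the text: layers are encoded as triples `(A, b, act)`, the nonlinearity is the ReLU, only matrix entries (not biases) are counted in \(W\), and the two example networks `net1` and `net2` are made up. The first layer stacks the two matrices because both parts read the same input, while later layers use the block-diagonal matrix of \(T_\ell \otimes S_\ell \).

```python
# Layer = (A, b, act): matrix rows A, bias b, and act[i] = True for a ReLU
# neuron, False for an identity neuron (an encoding chosen for this sketch).

def realize(net, x):
    """Evaluate the network at the input vector x."""
    for A, b, act in net:
        z = [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]
        x = [max(v, 0.0) if relu else v for v, relu in zip(z, act)]
    return x

def num_weights(net):
    """W: number of nonzero matrix entries (biases are not counted here)."""
    return sum(1 for A, _, _ in net for row in A for a in row if a != 0)

def num_neurons(net):
    """N: number of hidden neurons, i.e. outputs of all but the last layer."""
    return sum(len(b) for _, b, _ in net[:-1])

def block_diag(A, B):
    """Matrix of the map (x, y) -> (A x, B y)."""
    n_a, n_b = len(A[0]), len(B[0])
    return [row + [0.0] * n_b for row in A] + [[0.0] * n_a + row for row in B]

def cartesian_product(net1, net2):
    """Equal-depth networks (depth >= 2) -> network realizing (f1, f2)."""
    assert len(net1) == len(net2) >= 2
    layers = []
    for idx, ((A, b, act_a), (B, c, act_b)) in enumerate(zip(net1, net2)):
        # Layer 1 feeds the shared input to both parts (stack the matrices);
        # later layers act on each part separately (block-diagonal matrix).
        R = A + B if idx == 0 else block_diag(A, B)
        layers.append((R, b + c, act_a + act_b))
    return layers

def summation(net1, net2):
    """Equal depth and equal output dimension -> network realizing f1 + f2."""
    layers = cartesian_product(net1, net2)[:-1]
    (A, b, act), (B, c, _) = net1[-1], net2[-1]
    # Last layer: (x, y) -> A x + B y, i.e. the matrices are placed side by side.
    R_last = [row_a + row_b for row_a, row_b in zip(A, B)]
    layers.append((R_last, [bi + ci for bi, ci in zip(b, c)], act))
    return layers

# Two made-up networks R^2 -> R of depth 2.
net1 = [([[1.0, -1.0], [0.0, 2.0]], [0.0, 1.0], [True, True]),
        ([[1.0, 1.0]], [0.5], [False])]
net2 = [([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0], [True, True, True]),
        ([[1.0, -1.0, 2.0]], [0.0], [False])]

prod = cartesian_product(net1, net2)
total = summation(net1, net2)
x = [1.0, 2.0]
print(realize(prod, x), realize(total, x))
print(num_weights(prod), num_weights(net1) + num_weights(net2))
print(num_neurons(prod), num_neurons(net1) + num_neurons(net2))
```

In this example the weight and neuron counts of the combined networks equal the sums of the individual counts, in line with the estimates \(W(\Psi ) \le W(\Psi _1') + W(\Psi _2')\) and \(N(\Psi ) = N(\Psi _1') + N(\Psi _2')\).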
Let us now return to the proof of Parts 2 and 3 of Lemma 2.17. Set \(f_{i} := {\mathtt {R}}(\Phi _{i})\) and \(L_i := L(\Phi _i)\). We first show that we can without loss of generality assume \(L_1 \le \dots \le L_n\). To see this, note that there is a permutation \(\sigma \in S_n\) such that if we set \(\Gamma _j := \Phi _{\sigma (j)}\), then \(L(\Gamma _1) \le \dots \le L(\Gamma _n)\). Furthermore, \(\sum _{j=1}^n {\mathtt {R}}(\Gamma _j) = \sum _{j=1}^n {\mathtt {R}}(\Phi _j)\). Finally, there is a permutation matrix \(P \in \mathrm {GL}({\mathbb {R}}^K)\) such that
Since the permutation matrix P has exactly one nonzero entry per row and column, we have \(\Vert P\Vert _{\ell ^{0,\infty }} = 1\) in the notation of Eq. (2.4). Therefore, the first part of Lemma 2.18 (which will be proven independently) shows that \(g \in {\mathtt {NN}}^{\varrho ,d,K}_{W,L,N}\), provided that \(\big ( {\mathtt {R}}(\Gamma _1), \dots , {\mathtt {R}}(\Gamma _n) \big ) \in {\mathtt {NN}}^{\varrho ,d,K}_{W,L,N}\). These considerations show that we can assume \(L(\Phi _1) \le \dots \le L(\Phi _n)\) without loss of generality.
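The reordering step can be made concrete with a small Python sketch. The depths, output dimensions, and output values below are made up for illustration, and the column-sparsity reading of \(\Vert \cdot \Vert _{\ell ^{0,\infty }}\) (maximal number of nonzero entries per column) is an assumption about Eq. (2.4), which is not reproduced in this section.

```python
def depth_order(depths):
    """0-based indices sigma(1), ..., sigma(n) sorting the networks by depth."""
    return sorted(range(len(depths)), key=lambda i: depths[i])

def output_permutation(order, out_dims):
    """Matrix P with P (outputs in sorted order) = (outputs in original order)."""
    K = sum(out_dims)
    # Start index of each network's output block in the original concatenation.
    orig_start, pos = [], 0
    for k in out_dims:
        orig_start.append(pos)
        pos += k
    P = [[0] * K for _ in range(K)]
    col = 0
    for i in order:                      # i = sigma(j): network placed at position j
        for t in range(out_dims[i]):
            P[orig_start[i] + t][col] = 1
            col += 1
    return P

def max_col_nnz(M):
    """Largest number of nonzero entries in any column of M."""
    return max(sum(1 for row in M if row[j] != 0) for j in range(len(M[0])))

depths, out_dims = [3, 1, 2], [1, 2, 1]
outputs = [[10.0], [20.0, 21.0], [30.0]]   # made-up values of f_1, f_2, f_3 at one input
order = depth_order(depths)
P = output_permutation(order, out_dims)
sorted_concat = [v for i in order for v in outputs[i]]
restored = [sum(p * v for p, v in zip(row, sorted_concat)) for row in P]
print(order, restored, max_col_nnz(P))
```

Applying `P` to the outputs of the depth-sorted networks restores the original order, and every column of `P` contains exactly one nonzero entry.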
We now prove the following claim by induction on \(j \in \{1,\dots ,n\}\): There is \(\Theta _{j} \in {\mathcal {NN}}^{\varrho ,d,K_{j}}\) satisfying \(W(\Theta _{j}) \le \sum _{i=1}^{j} W(\Phi _{i}) + c \, (L_j - L_1)\), and \(N(\Theta _{j}) \le \sum _{i=1}^{j} N(\Phi _{i}) + c \, (L_j - L_1)\), as well as \({L(\Theta _{j}) = L_j}\), and such that \({\mathtt {R}}(\Theta _j) = g_{j} := \sum _{i=1}^{j} f_{i}\) and \(K_{j} := k\) for the summation, respectively such that \({{\mathtt {R}}(\Theta _{j}) = g_{j} := (f_1, \dots , f_j)}\) and \(K_{j} := \sum _{i=1}^{j} k_{i}\) for the Cartesian product. Here, c is as in the corresponding claim of Lemma 2.17.
Specializing to \(j=n\) then yields the conclusion of Lemma 2.17.
We now proceed to the induction. The claim trivially holds for \(j=1\)—just take \(\Theta _1 = \Phi _1\). Assuming that the claim holds for some \(j \in \{1,\dots ,n-1\}\), we define \(\Psi _{1} := \Theta _{j}\) and \(\Psi _{2} := \Phi _{j+1}\). Note that \(L(\Psi _1) = L(\Theta _j) = L_j \le L_{j+1} = L(\Psi _2)\). For the summation, by Lemma A.2 there is a network \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\) with \(L(\Psi ) = L_{j+1}\) and \( {\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Psi _{1}) + {\mathtt {R}}(\Psi _{2}) = {\mathtt {R}}(\Theta _{j}) + {\mathtt {R}}(\Phi _{j+1}) = g_{j}+ f_{j+1} = g_{j+1} \), and such that
and likewise \(N(\Psi ) \le N(\Theta _j) + N(\Phi _{j+1}) + c' \cdot (L_{j+1} - L_j)\), where \(c' = \min \{d,k\} = c\). For the Cartesian product, Lemma A.1 yields a network \(\Psi \in {\mathcal {NN}}^{\varrho ,d,K_{j}+k_{j+1}} = {\mathcal {NN}}^{\varrho ,d,K_{j+1}}\) satisfying
and such that, setting \(c' := \min \big \{ d, \max \{ K_{j}, k_{j+1} \} \big \} \le \min \{d,K-1\} = c\), we have
and \(N(\Psi ) \le N(\Theta _j) + N(\Phi _{j+1}) + c' \cdot (L_{j+1} - L_j)\).
With \(\Theta _{j+1} := \Psi \), we get \({\mathtt {R}}(\Theta _{j+1}) = g_{j+1}\), \(L(\Theta _{j+1}) = L_{j+1}\) and, by the induction hypothesis,
Similarly, \( N(\Theta _{j+1}) \le \sum _{i=1}^{j+1} N(\Phi _{i}) + c \cdot (L_{j+1} - L_1) \). This completes the induction and the proof. \(\square \)
1.5 Proof of Lemma 2.18
We prove each part of the lemma individually.
Part (2): Let \( \Phi _1 = \big ( (T_1, \alpha _1), \dots , (T_{L_1}, \alpha _{L_1}) \big ) \in {\mathcal {NN}}^{\varrho , d, d_1} \) and \( \Phi _2 = \big ( (S_1, \beta _1), \dots , (S_{L_2}, \beta _{L_2}) \big ) \in {\mathcal {NN}}^{\varrho , d_1, d_2} \). Define
We emphasize that \(\Psi \) is indeed a generalized \(\varrho \)-network, since all \(T_\ell \) and all \(S_\ell \) are affine-linear (with “fitting” dimensions), and since all \(\alpha _\ell \) and all \(\beta _\ell \) are \(\otimes \)-products of \(\varrho \) and \(\mathrm {id}_{{\mathbb {R}}}\), with \(\beta _{L_2} = \mathrm {id}_{{\mathbb {R}}^{d_2}}\). Furthermore, we clearly have \(L(\Psi ) = L_1 + L_2 = L(\Phi _1) + L(\Phi _2)\), and
Clearly, \(N(\Psi ) = N(\Phi _1) + d_1 + N(\Phi _2)\). Finally, the property \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1)\) is a direct consequence of the definition of the realization of neural networks.
Part (1): Let \(\Phi = \big ( (T_{1},\alpha _{1}), \dots , (T_L,\alpha _{L}) \big ) \in {\mathcal {NN}}^{\varrho ,d,k}\). We give the proof for \(Q \circ {\mathtt {R}}(\Phi )\), since the proof for \({\mathtt {R}}(\Phi ) \circ P\) is similar but simpler; the general statement in the lemma then follows from the identity \( Q \circ {\mathtt {R}}(\Phi ) \circ P = (Q \circ {\mathtt {R}}(\Phi )) \circ P = {\mathtt {R}}(\Psi _1) \circ P \).
We first treat the special case \(\Vert Q\Vert _{\ell ^{0,\infty }}=0\) which implies \(\Vert Q\Vert _{\ell ^{0}} = 0\), and hence, \(Q \circ {\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}^{k_1}\). Choose \(N_0,\dots ,N_L\) such that \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) for \(\ell \in \{1,\dots ,L\}\), and define \(S_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }, x \mapsto 0\) for \(\ell \in \{1,\dots ,L-1\}\) and \(S_L : {\mathbb {R}}^{N_{L-1}} \rightarrow {\mathbb {R}}^{k_1}, x \mapsto c\). It is then not hard to see that the network \(\Psi := \big ( (S_1,\alpha _1),\dots ,(S_L,\alpha _L) \big )\) satisfies \(L(\Psi ) = L(\Phi )\) and \(N(\Psi ) = N(\Phi )\), as well as \(W(\Psi ) = 0\) and \({\mathtt {R}}(\Psi ) \equiv c = Q \circ {\mathtt {R}}(\Phi )\).
We now consider the case \(\Vert Q\Vert _{\ell ^{0,\infty }} \ge 1\). Define \(U_{\ell } := T_{\ell }\) for \(\ell \in \{1,\dots ,L-1\}\) and \({U_{L} := Q \circ T_{L}}\). By Definition 2.1, we have \(\alpha _{L} = \mathrm {id}_{{\mathbb {R}}^{k}}\), whence \({ \Psi := \big ( (U_{1},\alpha _{1}), \dots , (U_{L-1}, \alpha _{L-1}), (U_L, \mathrm {id}_{{\mathbb {R}}^{k_{1}}}) \big ) \in {\mathcal {NN}}_{\infty ,L,N(\Phi )}^{\varrho ,d,k_1} }\) satisfies \({\mathtt {R}}(\Psi ) = Q \circ {\mathtt {R}}(\Phi )\). To control \(W(\Psi )\), we use the following lemma, whose proof is given after the proof of Part (3).
Lemma A.3
Let \(p,q,r \in {\mathbb {N}}\) be arbitrary.
(1) For arbitrary affine-linear maps \(T : {\mathbb {R}}^p \rightarrow {\mathbb {R}}^q\) and \(S : {\mathbb {R}}^q \rightarrow {\mathbb {R}}^r\), we have
$$\begin{aligned} \Vert S \circ T \Vert _{\ell ^0} \le \Vert S \Vert _{\ell ^{0,\infty }} \cdot \Vert T \Vert _{\ell ^0} \quad \text {and} \quad \Vert S \circ T \Vert _{\ell ^0} \le \Vert S \Vert _{\ell ^0} \cdot \Vert T \Vert _{\ell ^{0,\infty }_{*}} . \end{aligned}$$
(2) For affine-linear maps \(T_1, \dots , T_n\), we have \(\Vert T_1 \otimes \cdots \otimes T_n\Vert _{\ell ^0} \le \sum _{i=1}^n \Vert T_i\Vert _{\ell ^0}\), as well as
$$\begin{aligned} \Vert T_1 \otimes \cdots \otimes T_n \Vert _{\ell ^{0,\infty }}\le & {} \max _{i \in \{1,\dots ,n\}} \Vert T_i \Vert _{\ell ^{0,\infty }} \quad \text {and} \\ \Vert T_1 \otimes \cdots \otimes T_n \Vert _{\ell ^{0,\infty }_{*}}\le & {} \max _{i \in \{1,\dots ,n\}} \Vert T_i \Vert _{\ell ^{0,\infty }_{*}} . \blacktriangleleft \end{aligned}$$
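Let us continue with the proof from above. Since \(\Vert Q \Vert _{\ell ^{0,\infty }} \ge 1\), we have \( \Vert U_{\ell }\Vert _{\ell ^{0}} = \Vert T_{\ell }\Vert _{\ell ^{0}} \le \Vert Q \Vert _{\ell ^{0,\infty }} \cdot \Vert T_{\ell }\Vert _{\ell ^{0}} \) for \(\ell \in \{1,\dots ,L-1\}\). By Lemma A.3, we also have \(\Vert U_{L}\Vert _{\ell ^{0}} \le \Vert Q \Vert _{\ell ^{0,\infty }} \cdot \Vert T_{L}\Vert _{\ell ^{0}}\), and hence,
$$\begin{aligned} W(\Psi ) = \sum _{\ell =1}^{L} \Vert U_\ell \Vert _{\ell ^0} \le \Vert Q \Vert _{\ell ^{0,\infty }} \cdot \sum _{\ell =1}^{L} \Vert T_\ell \Vert _{\ell ^0} = \Vert Q \Vert _{\ell ^{0,\infty }} \cdot W(\Phi ) . \end{aligned}$$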
Finally, if \(\Phi \) is strict, then \(\Psi \) is strict as well; thus, the claim also holds with \({\mathtt {SNN}}\) instead of \({\mathtt {NN}}\).
Part (3): Let \( \Phi _1 = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big ) \in {\mathcal {NN}}^{\varrho , d, d_1} \) and \( \Phi _2 = \big ( (S_1, \beta _1), \dots , (S_K, \beta _K) \big ) \in {\mathcal {NN}}^{\varrho , d_1, d_2} \).
We distinguish two cases: First, if \(L = 1\), then \({\mathtt {R}}(\Phi _1) = T_1\). Since \(T_1 : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{d_1}\), this implies \(\Vert T_1\Vert _{\ell ^{0,\infty }_*} \le d\). Thus, Part (1) shows that
where \(N := \max \{ N(\Phi _1), d\}\).
Let us now assume that \(L > 1\). In this case, define
It is not hard to see that \(N(\Psi ) \le N(\Phi _1) + N(\Phi _2)\) and—because of \(\alpha _L = \mathrm {id}_{{\mathbb {R}}^{d_1}}\)—that
Note \(T_\ell : {\mathbb {R}}^{M_{\ell - 1}} \rightarrow {\mathbb {R}}^{M_\ell }\) for certain \(M_0,\dots ,M_L \in {\mathbb {N}}\). Since \(L > 1\), we have \(M_{L - 1} \le N(\Phi _1) \le N\). Furthermore, since \(T_L : {\mathbb {R}}^{M_{L-1}} \rightarrow {\mathbb {R}}^{M_L}\), we get \(\Vert T_L\Vert _{\ell ^{0,\infty }_*} \le M_{L-1} \le N\) directly from the definition. Thus, Lemma A.3 shows \( \Vert S_1 \circ T_L\Vert _{\ell ^0} \le \Vert S_1\Vert _{\ell ^0} \cdot \Vert T_L\Vert _{\ell ^{0,\infty }_*} \le N \cdot \Vert S_1\Vert _{\ell ^0} \). Therefore, and since \(N \ge 1\), we see that
Finally, note that if \(\Phi _1,\Phi _2\) are strict networks, then so is \(\Psi \). \(\square \)
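The layer-merging idea used for \(L > 1\) can be sketched in Python as follows. This is only an illustration under stated assumptions: the layer encoding `(A, b, act)`, the ReLU nonlinearity, and the two example networks `phi1` and `phi2` are hypothetical, and the merged network is taken to be \(\Phi _1\) without its last layer, followed by the single layer \(S_1 \circ T_L\) with activation \(\beta _1\), followed by the remaining layers of \(\Phi _2\).

```python
def realize(net, x):
    """Evaluate a network given as a list of layers (A, b, act)."""
    for A, b, act in net:
        z = [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]
        x = [max(v, 0.0) if relu else v for v, relu in zip(z, act)]
    return x

def num_neurons(net):
    """N: number of hidden neurons."""
    return sum(len(b) for _, b, _ in net[:-1])

def nnz(M):
    return sum(1 for row in M for v in row if v != 0)

def max_row_nnz(M):
    return max(sum(1 for v in row if v != 0) for row in M)

def compose_affine(B, b, A, a):
    """(S o T)(x) = B (A x + a) + b, returned as matrix B A and bias B a + b."""
    M = [[sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
         for i in range(len(B))]
    c = [sum(B[i][k] * a[k] for k in range(len(a))) + b[i] for i in range(len(B))]
    return M, c

def compose_merged(net1, net2):
    """Network realizing R(net2) o R(net1), merging T_L and S_1 into one layer."""
    A, a, act_out = net1[-1]
    assert not any(act_out)              # the output layer uses the identity
    B, b, act = net2[0]
    M, c = compose_affine(B, b, A, a)
    return net1[:-1] + [(M, c, act)] + net2[1:]

# Hypothetical example networks: phi1 maps R^2 -> R, phi2 realizes y -> |y|.
phi1 = [([[1.0, -1.0], [0.0, 2.0]], [0.0, 1.0], [True, True]),
        ([[1.0, -1.0]], [0.5], [False])]
phi2 = [([[1.0], [-1.0]], [0.0, 0.0], [True, True]),
        ([[1.0, 1.0]], [0.0], [False])]
psi = compose_merged(phi1, phi2)

merged_matrix = psi[len(phi1) - 1][0]    # matrix of S_1 o T_L
T_L, S_1 = phi1[-1][0], phi2[0][0]
print(realize(psi, [1.0, 2.0]), num_neurons(psi), nnz(merged_matrix))
```

Here the merged layer has `nnz(S_1) * max_row_nnz(T_L)` nonzero entries, so the second estimate of Lemma A.3 is attained, and the neuron count of `psi` equals \(N(\Phi _1) + N(\Phi _2)\) for this example.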
Proof of Lemma A.3
The stated estimates follow from the definitions by direct computation and are thus left to the reader. For instance, the main observation for proving that \(\Vert B A \Vert _{\ell ^0} \le \Vert B \Vert _{\ell ^{0,\infty }} \cdot \Vert A \Vert _{\ell ^0}\) is that
\(\square \)
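As a sanity check (not a proof), the two estimates in the first part of Lemma A.3 can be tested on random sparse matrices. The sketch below assumes that \(\Vert \cdot \Vert _{\ell ^{0,\infty }}\) is the maximal number of nonzero entries per column and \(\Vert \cdot \Vert _{\ell ^{0,\infty }_{*}}\) the maximal number per row, which matches the bounds used for \(T_1 : {\mathbb {R}}\rightarrow {\mathbb {R}}^{N_1}\) in this appendix, and that only the matrix part of an affine-linear map is counted. The helper names are hypothetical.

```python
import random

def nnz(M):
    """Number of nonzero entries."""
    return sum(1 for row in M for v in row if v != 0)

def max_col_nnz(M):
    """Assumed reading of the l^{0,infty} quantity: nonzeros per column."""
    return max(sum(1 for row in M if row[j] != 0) for j in range(len(M[0])))

def max_row_nnz(M):
    """Assumed reading of the l^{0,infty}_* quantity: nonzeros per row."""
    return max(sum(1 for v in row if v != 0) for row in M)

def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
            for i in range(len(B))]

def random_sparse(rows, cols, density, rng):
    """Integer matrix whose entries are nonzero with the given probability."""
    return [[rng.choice([-2, -1, 1, 2]) if rng.random() < density else 0
             for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
violations = 0
for _ in range(200):
    p, q, r = rng.randint(1, 6), rng.randint(1, 6), rng.randint(1, 6)
    A = random_sparse(q, p, 0.4, rng)    # matrix of T : R^p -> R^q
    B = random_sparse(r, q, 0.4, rng)    # matrix of S : R^q -> R^r
    C = matmul(B, A)
    ok_first = nnz(C) <= max_col_nnz(B) * nnz(A)
    ok_second = nnz(C) <= nnz(B) * max_row_nnz(A)
    violations += not (ok_first and ok_second)
print("violations:", violations)
```

Each nonzero entry of \(A\) contributes at most one column of \(B\) to the product, and each nonzero entry of \(B\) contributes at most one row of \(A\); this is why no violation can occur.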
1.6 Proof of Lemma 2.19
We start with an auxiliary lemma.
Lemma A.4
Consider two activation functions \(\varrho ,\sigma \) such that \(\sigma = {\mathtt {R}}(\Psi _{\sigma })\) for some \( \Psi _{\sigma } \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \) with \(L(\Psi _{\sigma }) = \ell \in {\mathbb {N}}\), \(w \in {\mathbb {N}}_{0}\), \(m \in {\mathbb {N}}\). Furthermore, assume that \(\sigma \not \equiv \mathrm {const}\).
Then, for any \(d \in {\mathbb {N}}\) and \(\alpha _{i} \in \{\mathrm {id}_{{\mathbb {R}}},\sigma \}\), \(1 \le i \le d\) we have \(\alpha _{1} \otimes \cdots \otimes \alpha _{d} = {\mathtt {R}}(\Phi )\) for some network
satisfying \(\Vert U_1\Vert _{\ell ^{0,\infty }} \le m\), \(\Vert U_1\Vert _{\ell ^{0,\infty }_{*}} \le 1\), \(\Vert U_{\ell }\Vert _{\ell ^{0,\infty }} \le 1\), and \(\Vert U_{\ell }\Vert _{\ell ^{0,\infty }_{*}} \le m\).
If \(\Psi _\sigma \) is a strict network and \(\alpha _i = \sigma \) for all i, then \(\Phi \) can be chosen to be a strict network.\(\blacktriangleleft \)
Proof of Lemma A.4
First, we show that any \(\alpha \in \{\mathrm {id}_{{\mathbb {R}}},\sigma \}\) satisfies \(\alpha = {\mathtt {R}}(\Psi _{\alpha })\) for some network
with \(\Vert U_1^{\alpha }\Vert _{\ell ^{0,\infty }} \le m\), \(\Vert U_1^{\alpha }\Vert _{\ell ^{0,\infty }_{*}} \le 1\), \(\Vert U_{\ell }^{\alpha }\Vert _{\ell ^{0,\infty }} \le 1\) and \(\Vert U_{\ell }^{\alpha }\Vert _{\ell ^{0,\infty }_{*}} \le m\).
For \(\alpha = \sigma \), we have \(\alpha = {\mathtt {R}}(\Psi _{\sigma })\) where \(\Psi _\sigma \) is of the form \( \Psi _{\sigma } = \big ( (T_{1}, \beta _{1}), \ldots , (T_{\ell }, \beta _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \). For \(\alpha = \mathrm {id}_{{\mathbb {R}}}\), observe that \(\alpha = {\mathtt {R}}(\Psi _{\mathrm {id}_{{\mathbb {R}}}})\) with
where it is easy to see that \(N(\Psi _{\mathrm {id}_{{\mathbb {R}}}}) = \ell - 1 \le m\) and \(W(\Psi _{\mathrm {id}_{{\mathbb {R}}}}) = \ell \le w\). Indeed, Eq. (2.1) shows that \(\ell = L(\Psi _\sigma ) \le 1 + N(\Psi _\sigma ) \le 1 + m\). On the other hand, since \(\sigma \not \equiv \mathrm {const}\), Corollary 2.10 shows that \(\ell = L(\Psi _\sigma ) \le W(\Psi _\sigma ) \le w\).
Denoting by \(N_{i}\) the number of neurons in the i-th layer of \(\Psi _{\sigma }\) (where layer 0 is the input layer, and layer \(\ell \) the output layer), we get because of \(\Psi _{\sigma } \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m}\) that \(N_{i} \le m\) for \(1 \le i \le \ell -1\). Furthermore, since \(T_{1}: {\mathbb {R}}\rightarrow {\mathbb {R}}^{N_{1}}\), we have \(\Vert T_{1}\Vert _{\ell ^{0,\infty }} \le N_{1} \le m\) and \(\Vert T_{1}\Vert _{\ell ^{0,\infty }_{*}} \le 1\). Similarly, as \(T_{\ell }: {\mathbb {R}}^{N_{\ell -1}} \rightarrow {\mathbb {R}}\) we have \(\Vert T_{\ell }\Vert _{\ell ^{0,\infty }} \le 1\) and \(\Vert T_{\ell }\Vert _{\ell ^{0,\infty }_{*}} \le m\). The same bounds trivially hold for \(T'_{1}\) and \(T'_{\ell }\).
We now prove the claim of the lemma by induction on d. The result is trivial for \(d=1\) using \(\Phi = \Psi _{\alpha _{1}}\). Assuming it is true for \(d \in {\mathbb {N}}\), we prove it for \(d+1\).
Define \(\alpha = \alpha _{1} \otimes \cdots \otimes \alpha _{d}\) and \({\overline{\alpha }} = \alpha _{1} \otimes \cdots \otimes \alpha _{d+1} = \alpha \otimes \alpha _{d+1}\). By induction, there are networks \( \Psi _{1} = \big ( (V_{1},\lambda _{1}),\ldots ,(V_{\ell },\lambda _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,d,d}_{dw,\ell ,dm} \) and \( \Psi _{2} = \big ( (W_{1},\mu _{1}),\ldots ,(W_{\ell },\mu _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \) such that \({\mathtt {R}}(\Psi _1)=\alpha \) and \({\mathtt {R}}(\Psi _2)=\alpha _{d+1}\) and such that \(\Vert V_1\Vert _{\ell ^{0,\infty }} \le m\), \(\Vert V_1\Vert _{\ell ^{0,\infty }_*} \le 1\), \(\Vert V_\ell \Vert _{\ell ^{0,\infty }} \le 1\), and \(\Vert V_\ell \Vert _{\ell ^{0,\infty }_*} \le m\), and likewise for \(W_1\) instead of \(V_1\) and \(W_\ell \) instead of \(V_\ell \).
Define \(U_{i} := V_{i} \otimes W_{i}\) and \(\gamma _{i}:= \lambda _{i} \otimes \mu _{i}\) for \(1 \le i \le \ell \), and \(\Phi := \big ( (U_{1},\gamma _{1}),\ldots ,(U_{\ell },\gamma _{\ell }) \big )\). One can check that \({\mathtt {R}}(\Phi ) = {\overline{\alpha }}\). Moreover, Lemma A.3 shows that \(\Vert U_{i}\Vert _{\ell ^0} = \Vert V_{i}\Vert _{\ell ^0}+\Vert W_{i}\Vert _{\ell ^0}\) for \(1 \le i \le \ell \), whence \(W(\Phi ) = W(\Psi _1) + W(\Psi _2) \le dw+w = (d+1)w\) and similarly \(N(\Phi ) = N(\Psi _{1}) + N(\Psi _{2}) \le (d+1)m\). Finally, Lemma A.3 shows that
Clearly, if \(\Psi _\sigma \) is strict, and if \(\alpha _i = \sigma \) for all i, then the same induction shows that \(\Phi \) can be chosen to be a strict network. \(\square \)
Proof of Lemma 2.19
For the first statement with \(\ell =2\), consider \(f = {\mathtt {R}}(\Psi )\) for some
In case of \(K = 1\), we trivially have \(\Psi \in {\mathcal {NN}}_{W,L,N}^{\varrho ,d,k}\), so that we can assume \(K \ge 2\) in the following.
Denoting by \(N_{i}\) the number of neurons at the i-th layer of \(\Psi \), Lemma A.4 yields for each \({i \in \{1,\dots ,K-1\}}\) a network \( \Phi _{i} = \big ( (U_{1}^{i}, \gamma _{i}), (U_{2}^{i}, \mathrm {id}_{{\mathbb {R}}^{N_{i}}}) \big ) \in {\mathcal {NN}}^{\varrho ,N_{i},N_{i}}_{N_{i}w,2,N_{i}m} \) satisfying \(\alpha _i = {\mathtt {R}}(\Phi _{i})\) and \(\gamma _{i}: {\mathbb {R}}^{N(\Phi _{i})} \rightarrow {\mathbb {R}}^{N(\Phi _{i})}\) with \(N(\Phi _{i}) \le N_{i}m\) and finally \(\Vert U_{1}^{i}\Vert _{\ell ^{0,\infty }} \le m\) and \(\Vert U_{2}^{i}\Vert _{\ell ^{0,\infty }_{*}} \le m\). With \(T_{1} := U_{1}^{1} \circ S_{1}\), \(T_{K} := S_{K} \circ U_{2}^{K-1}\), \(T_{i} := U_{1}^{i} \circ S_{i} \circ U_{2}^{i-1}\) for \(2 \le i \le K-1\) and
one can check that \(f = {\mathtt {R}}(\Phi )\).
By Lemma A.3, \( \Vert T_{i}\Vert _{\ell ^{0}} \le \Vert U_{1}^{i}\Vert _{\ell ^{0,\infty }} \Vert S_{i}\Vert _{\ell ^{0}} \Vert U_{2}^{i-1}\Vert _{\ell ^{0,\infty }_{*}} \le m^{2}\Vert S_{i}\Vert _{\ell ^{0}} \) for \(2 \le i \le K-1\), and the same overall bound also holds for \(i \in \{1,K\}\). As a result, we get \(L(\Phi ) = K \le L\) as well as
For the second statement, we prove by induction on \(L \in {\mathbb {N}}\) that \({\mathtt {NN}}_{W,L,N}^{\sigma ,d,k} \subset {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1 + (L-1)\ell , N(1+m)}\).
For \(L = 1\), it is easy to see \({\mathtt {NN}}_{W,1,N}^{\sigma ,d,k} = {\mathtt {NN}}^{\varrho ,d,k}_{W,1,N}\), simply because on the last (and for \(L=1\) only) layer, the activation function is always given by \(\mathrm {id}_{{\mathbb {R}}^k}\). Thus, the claim follows from the trivial inclusion \({\mathtt {NN}}_{W,1,N}^{\varrho ,d,k} \subset {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1, N(1+m)}\), since \(m \ge 1\).
Now, assuming the claim holds true for L, we prove it for \(L+1\). Consider \(f \in {\mathtt {NN}}^{\sigma ,d,k}_{W,L+1,N}\). In case of \({f \in {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N}}\), we get \( f \in {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1 + (L-1)\ell , N(1+m)} \subset {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1 + ( (L+1)-1)\ell , N(1+m)} \) by the induction hypothesis. In the remaining case where \(f \notin {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N}\), there is a network \(\Psi \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L+1,N}\) of the form \( \Psi = \big ( (S_{1},\alpha _{1}),\ldots ,(S_L,\alpha _L),(S_{L+1},\mathrm {id}_{{\mathbb {R}}^{k}}) \big ) \) such that \(f = {\mathtt {R}}(\Psi )\). Observe that \(S_{L+1}: {\mathbb {R}}^{{\overline{k}}} \rightarrow {\mathbb {R}}^{k}\) with \({\overline{k}} := N_L\) the number of neurons of the last hidden layer. Defining \( \Psi _{1} := \big ( (S_{1}, \alpha _1), \ldots , (S_{L-1}, \alpha _{L-1}), (S_L, \mathrm {id}_{{\mathbb {R}}^{{\overline{k}}}}) \big ), \) we have \(\Psi _{1} \in {\mathcal {NN}}^{\sigma ,d,{\overline{k}}}_{{\overline{W}},L,{\overline{N}}}\) where \({\overline{W}} := W(\Psi _{1})\) and \({\overline{N}} := N(\Psi _{1})\) satisfy
Define \(g := {\mathtt {R}}(\Psi _1)\), so that \(f = S_{L+1} \circ \alpha _L \circ g\). We now exhibit a \(\varrho \)-network \(\Phi \) (instead of the \(\sigma \)-network \(\Psi \)) of controlled complexity such that \(f = {\mathtt {R}}(\Phi )\). As \(g := {\mathtt {R}}(\Psi _{1}) \in {\mathtt {NN}}^{\sigma ,d,{\overline{k}}}_{{\overline{W}},L,{\overline{N}}}\), the induction hypothesis shows that \(g = {\mathtt {R}}(\Phi _{1})\) for some network
Moreover, Lemma A.4 shows that \(\alpha _L = {\mathtt {R}}(\Phi _{2})\) for a network
with \(\Vert U_\ell \Vert _{\ell ^{0,\infty }_*} \le m\). By construction, we have \(f = S_{L+1} \circ \alpha _{L} \circ g = {\mathtt {R}}(\Phi )\) for the network
To conclude, we observe that \( L(\Phi ) = K + \ell \le 1 + (L-1)\ell + \ell = 1 + \big ( (L+1) - 1 \big ) \ell \), as well as
Finally, we also have \( N(\Phi ) = N(\Phi _1) + {\overline{k}} + N(\Phi _2) \le {\overline{N}} (1 + m) + {\overline{k}} + {\overline{k}} \cdot m = ({\overline{N}} + {\overline{k}}) (1 + m) \le N (1+m) \). \(\square \)
1.7 Proof of Lemma 2.20
Let \( \Psi = \big ( (S_{1},\alpha _{1}),\ldots , (S_{K-1}, \alpha _{K-1}), (S_K,\mathrm {id}_{{\mathbb {R}}^{k}}) \big ) \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L,N} \) be arbitrary and \({g = {\mathtt {R}}(\Psi )}\). We prove that there is some \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}_{W+(s-1)N,1+s(L-1),sN}\) such that \(g = {\mathtt {R}}(\Phi )\). This is easy to see if \(s=1\) or \(K=1\); hence, we now assume \(K \ge 2\) and \(s \ge 2\). Denoting by \(N_{\ell }\) the number of neurons at the \(\ell \)-th layer of \(\Psi \), for \(1 \le \ell \le K-1\), we have \(\alpha _{\ell } = \alpha _{\ell }^{(1)} \otimes \ldots \otimes \alpha _{\ell }^{(N_{\ell })}\) where \(\alpha _{\ell }^{(i)} \in \{\mathrm {id}_{{\mathbb {R}}},\sigma \}\). For \(1 \le \ell \le K-1\), \(1 \le j \le N_{\ell }\), \(1 \le i \le s\), define
and let \(\beta _{s(\ell -1)+i} := \beta _{s(\ell -1)+i}^{(1)} \otimes \ldots \otimes \beta _{s(\ell -1)+i}^{(N_{\ell })}\). Define also \(T_{s(\ell -1)+1} := S_{\ell }: {\mathbb {R}}^{N_{\ell -1}} \rightarrow {\mathbb {R}}^{N_{\ell }}\) and \(T_{s(\ell -1)+i} := \mathrm {id}_{{\mathbb {R}}^{N_{\ell }}}\) for \(2 \le i \le s\). It is painless to check that
and hence,
That is to say, \(g = {\mathtt {R}}(\Phi )\) with
where we compute
We conclude as claimed that \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}_{W+(s-1)N,1+s(L-1),sN}\). Finally, if \(\Psi \) is strict, then so is \(\Phi \). \(\square \)
1.8 Proof of Lemma 2.21
For \(f \in {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N}\), there is \( \Phi = \big ( (S_{1},\alpha _{1}),\ldots ,(S_{L'},\alpha _{L'}) \big ) \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L',N} \) with \(L(\Phi ) = L' \le L\) and such that \(f = {\mathtt {R}}(\Phi )\). Replace each occurrence of the activation function \(\sigma \) by \(\sigma _{h}\) in the nonlinearities \(\alpha _{j}\) to define a \(\sigma _{h}\)-network \(\Phi _{h} := \big ( (S_{1},\alpha _{1}^{(h)}),\ldots ,(S_{L'},\alpha _{L'}^{(h)}) \big ) \in {\mathcal {NN}}^{\sigma _{h},d,k}_{W,L',N}\) and its realization \(f_{h} := {\mathtt {R}}(\Phi _{h})\in {\mathtt {NN}}^{\sigma _{h},d,k}_{W,L',N}\). Since \(\sigma \) is continuous and \(\sigma _{h} \rightarrow \sigma \) locally uniformly on \({\mathbb {R}}\) as \(h \rightarrow 0\), we get by Lemma A.7 (which is proved independently below) that \(f_{h} \rightarrow f\) locally uniformly on \({\mathbb {R}}^{d}\). To conclude for \(\ell =2\), observe that \(\sigma _{h} = {\mathtt {R}}(\Psi _{h})\) with \(\Psi _{h} \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m}\) and \(L(\Psi _{h}) = \ell \), whence Lemma 2.19 yields
For arbitrary \(\ell \), we similarly conclude that
1.9 Proof of Lemmas 2.22 and 2.25
In this section, we provide a unified proof for Lemmas 2.22 and 2.25. To be able to handle both claims simultaneously, the following concept will be important.
Definition A.5
For each \(d,k \in {\mathbb {N}}\), let us fix a subset \({\mathcal {G}}_{d,k} \subset \{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\) and a topology \({\mathcal {T}}_{d,k}\) on the space of all functions \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\). Let \({\mathcal {G}}:= ({\mathcal {G}}_{d,k})_{d,k \in {\mathbb {N}}}\) and \({\mathcal {T}}:= ({\mathcal {T}}_{d,k})_{d,k \in {\mathbb {N}}}\). The tuple \(({\mathcal {G}},{\mathcal {T}})\) is called a network compatible topology family if it satisfies the following:
(1) We have \(\{ T : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \,\mid \, T \text { affine-linear} \} \subset {\mathcal {G}}_{d,k}\) for all \(d,k \in {\mathbb {N}}\).
(2) If \(p \in {\mathbb {N}}\) and for each \(i \in \{1,\dots ,p\}\), we are given a sequence \((f_i^{(n)})_{n \in {\mathbb {N}}_0}\) of functions \(f_i^{(n)} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) satisfying \(f_i^{(0)} \in {\mathcal {G}}_{1,1}\) and \(f_i^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{1,1}} f_i^{(0)}\), then \( f_1^{(n)} \otimes \cdots \otimes f_p^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{p,p}} f_1^{(0)} \otimes \cdots \otimes f_p^{(0)} \) and \(f_1^{(0)} \otimes \cdots \otimes f_p^{(0)} \in {\mathcal {G}}_{p,p}\).
(3) If \(f_n : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) and \(g_n : {\mathbb {R}}^k \rightarrow {\mathbb {R}}^\ell \) for all \(n \in {\mathbb {N}}_0\) and if \(f_0 \in {\mathcal {G}}_{d,k}\) and \(g_0 \in {\mathcal {G}}_{k,\ell }\) as well as \(f_n \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d,k}} f_0\) and \(g_n \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{k,\ell }} g_0\), then \(g_0 \circ f_0 \in {\mathcal {G}}_{d,\ell }\) and \(g_n \circ f_n \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d,\ell }} g_0 \circ f_0\). \(\blacktriangleleft \)
Remark
Roughly speaking, the above definition introduces certain topologies \({\mathcal {T}}_{d,k}\) and certain sets of “good functions” \({\mathcal {G}}_{d,k}\) such that—for limit functions that are “good”—convergence in the topology is compatible with taking \(\otimes \)-products and with composition.
By induction, it is easy to see that if \(p \in {\mathbb {N}}\) and if for each \(i \in \{1,\dots ,p\}\), we are given a sequence \((f_i^{(n)})_{n \in {\mathbb {N}}_0}\) with \(f_i^{(n)} : {\mathbb {R}}^{d_{i-1}} \rightarrow {\mathbb {R}}^{d_i}\) and \(f_i^{(0)} \in {\mathcal {G}}_{d_{i-1}, d_i}\) as well as \(f_i^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_{i-1},d_i}} f_i^{(0)}\), then also \(f_p^{(0)} \circ \cdots \circ f_1^{(0)} \in {\mathcal {G}}_{d_0, d_p}\), as well as \( f_p^{(n)} \circ \cdots \circ f_1^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_0,d_p}} f_p^{(0)} \circ \cdots \circ f_1^{(0)} \). Indeed, the base case of the induction is contained in Definition A.5. Now, assuming that the claim holds for \(p \in {\mathbb {N}}\), we prove it for \(p+1\). To this end, let \(F_1^{(n)} := f_p^{(n)} \circ \cdots \circ f_1^{(n)}\) and \(F_2^{(n)} := f_{p+1}^{(n)}\). By induction, we know \(F_1^{(0)} \in {\mathcal {G}}_{d_0, d_p}\) and \(F_1^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_0,d_p}} F_1^{(0)}\). Since also \(F_2^{(0)} = f_{p+1}^{(0)} \in {\mathcal {G}}_{d_p, d_{p+1}}\), Definition A.5 implies \(F_2^{(0)} \circ F_1^{(0)} \in {\mathcal {G}}_{d_0, d_{p+1}}\) and \(F_2^{(n)} \circ F_1^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_0, d_{p+1}}} F_2^{(0)} \circ F_1^{(0)}\), which is precisely the claim for \(p+1\) instead of p. \(\blacklozenge \)
We now have the following important result:
Proposition A.6
Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), and let \(({\mathcal {G}}, {\mathcal {T}})\) be a network compatible topology family satisfying the following:
- \(\varrho \in {\mathcal {G}}_{1,1}\);
- There is some \(n \in {\mathbb {N}}\) such that for each \(m \in {\mathbb {N}}\), there are affine-linear maps \(E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^n\) and \(D_m : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) such that \(F_m := D_m \circ (\varrho \otimes \cdots \otimes \varrho ) \circ E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}\) satisfies \(F_m \xrightarrow [m\rightarrow \infty ]{{\mathcal {T}}_{1,1}} \mathrm {id}_{{\mathbb {R}}}\).
Then, we have for arbitrary \(d,k \in {\mathbb {N}}\), \(W,N \in {\mathbb {N}}_0 \cup \{\infty \}\) and \(L \in {\mathbb {N}}\cup \{\infty \}\) the inclusion
$$\begin{aligned} {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset \overline{{\mathtt {SNN}}^{\varrho ,d,k}_{n^2 W, L, n N}} , \end{aligned}$$
where the closure is a sequential closure which is taken with respect to the topology \({\mathcal {T}}_{d,k}\). \(\blacktriangleleft \)
Remark
Before we give the proof of Proposition A.6, we explain a convention that will be used in the proof. Precisely, in the definition of \(W(\Phi )\), we always assume that the affine-linear maps \(T_\ell \) are of the form \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\). Clearly, the expressivity of networks will not change if instead of the spaces \({\mathbb {R}}^{N_1},\dots , {\mathbb {R}}^{N_{L - 1}}\), one uses finite-dimensional vector spaces \(V_1, \dots , V_{L-1}\) with \(\dim V_i = N_i\). The only non-trivial question is the interpretation of \(\Vert T_\ell \Vert _{\ell ^0}\) for an affine-linear map \(T_\ell : V_{\ell - 1} \rightarrow V_\ell \), since for the case of \({\mathbb {R}}^{N_{\ell }}\), we chose the standard basis for obtaining the matrix representation of \(T_\ell \), while for general vector spaces \(V_\ell \), there is no such canonical choice of basis. Yet, in the proof below, we will consider the case \(V_\ell = {\mathbb {R}}^{n_1} \times \cdots \times {\mathbb {R}}^{n_{m}}\). In this case, there is a canonical way of identifying \(V_\ell \) with \({\mathbb {R}}^{N_\ell }\) for \(N_\ell = \sum _{j=1}^m n_j\), and there is also a canonical choice of “standard basis” in the space \(V_\ell \). We will use this convention in the proof below to simplify the notation. \(\blacklozenge \)
Proof of Proposition A.6
Let \(\Phi \in {\mathcal {NN}}_{W,L,N}^{\varrho ,d,k}\). We will construct a sequence \((\Phi _m)_{m \in {\mathbb {N}}} \subset {\mathcal {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k}\) satisfying \({\mathtt {R}}(\Phi _m) \xrightarrow [m\rightarrow \infty ]{{\mathcal {T}}_{d,k}} {\mathtt {R}}(\Phi )\). To this end, note that \(\Phi = \big ( (T_1, \alpha _1), \dots , (T_K, \alpha _K) \big )\) for some \(K \le L\) and that there are \(N_0, \dots , N_K \in {\mathbb {N}}\) (with \(N_0 = d\) and \(N_K = k\)) such that \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is affine-linear for each \(\ell \in \{1,\dots ,K\}\).
Let us first consider the special case \(K = 1\). By definition of a neural network, we have \(\alpha _K = \mathrm {id}_{{\mathbb {R}}^k}\), so that \(\Phi \) is already a strict \(\varrho \)-network. Therefore, we can choose \( \Phi _m := \Phi \in {\mathcal {SNN}}_{W,L,N}^{\varrho ,d,k} \subset {\mathcal {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k} \) for all \(m \in {\mathbb {N}}\).
From now on, we assume \(K \ge 2\). For brevity, set \(\varrho _1 := \varrho \) and \(\varrho _2 := \mathrm {id}_{{\mathbb {R}}}\), as well as \(D(1) := 1\) and \(D(2) := n\), and furthermore,
By definition of a generalized \(\varrho \)-network, for each \(\ell \in \{1,\dots ,K\}\) there are \(\iota _1^{(\ell )}, \dots , \iota _{N_\ell }^{(\ell )} \in \{1,2\}\) with \(\alpha _\ell = \varrho _{\iota _1^{(\ell )}} \otimes \cdots \otimes \varrho _{\iota _{N_\ell }^{(\ell )}}\), and with \(\iota _j^{(K)} = 2\) for all \(j \in \{1,\dots ,N_K\}\). Now, define \(V_0 := {\mathbb {R}}^d ={\mathbb {R}}^{N_{0}}\), \(V_K := {\mathbb {R}}^k = {\mathbb {R}}^{N_K}\), and
Since we eventually want to obtain strict networks \(\Phi _m\), furthermore set
Using these maps, finally define \(\beta _K := \mathrm {id}_{{\mathbb {R}}^k}\), as well as
Finally, for \(\ell \in \{1,\dots ,K\}\) and \(m \in {\mathbb {N}}\), define affine-linear maps
The crucial observation is that by assumption regarding the maps \(D_m, E_m\), we have
Finally, for the construction of the strict networks \(\Phi _m\), we define for \(m \in {\mathbb {N}}\)
and then set \(\Phi _m := \big ( (S_1^{(m)}, \beta _1), \dots , (S_K^{(m)}, \beta _K) \big )\). Because of \(D(\iota _i^{(\ell )}) \in \{1,n\}\), we obtain
Furthermore, by the second part of Lemma A.3 and in view of the product structure of \(P_\ell ^{(m)}\), we have
for arbitrary \(\ell \in \{1,\dots ,K\}\), simply because \(E^{(m)}_j : {\mathbb {R}}\rightarrow {\mathbb {R}}^{D(j)}\) for \(j \in \{1,2\}\). Likewise,
because \(D_j^{(m)} : {\mathbb {R}}^{D(j)} \rightarrow {\mathbb {R}}\) for \(j \in \{1,2\}\). By the first part of Lemma A.3, we thus see for \(2 \le \ell \le K - 1\) that
Similar arguments yield \(\Vert S_1^{(m)}\Vert _{\ell ^0} \le n \cdot \Vert T_1\Vert _{\ell ^0} \le n^2 \cdot \Vert T_1\Vert _{\ell ^0}\) and \(\Vert S_K^{(m)}\Vert _{\ell ^0} \le n \cdot \Vert T_K\Vert _{\ell ^0} \le n^2 \cdot \Vert T_K\Vert _{\ell ^0}\). All in all, this implies \(W(\Phi _m) \le n^2 \cdot W(\Phi ) \le n^2 W\), as desired.
Now, since \(\varrho _1 = \varrho \in {\mathcal {G}}_{1,1}\) by the assumptions of the current proposition, since \(\varrho _2 = \mathrm {id}_{{\mathbb {R}}} \in {\mathcal {G}}_{1,1}\) as an affine-linear map, and since \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family, we see for all \(1 \le \ell \le K - 1\) that \( \alpha _\ell = \varrho _{\iota _1^{(\ell )}} \otimes \cdots \otimes \varrho _{\iota _{N_\ell }^{(\ell )}} \in {\mathcal {G}}_{N_\ell ,N_\ell } \) and furthermore that
Finally, since \(\beta _K = \mathrm {id}_{{\mathbb {R}}^k} = \alpha _K \in {\mathcal {G}}_{k, k}\), and since \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family and thus compatible with compositions (as long as the “factors” of the limit are “good,” which is satisfied here, since \(\alpha _\ell \in {\mathcal {G}}_{N_\ell , N_\ell }\) as we just saw and since \(T_\ell \in {\mathcal {G}}_{N_{\ell - 1}, N_\ell }\) as an affine-linear map), we see that
and hence \({\mathtt {R}}(\Phi ) \in \overline{{\mathtt {SNN}}^{\varrho ,d,k}_{n^2 W, L, n N}}\). \(\square \)
Now, we use Proposition A.6 to prove Lemma 2.25.
Proof of Lemma 2.25
For \(d,k \in {\mathbb {N}}\), let \({\mathcal {G}}_{d,k} := \{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\), and let \({\mathcal {T}}_{d,k} = 2^{{\mathcal {G}}_{d,k}}\) be the discrete topology on the set \(\{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\). This means that every set is open, so that the only convergent sequences are those that are eventually constant. It is easy to see that \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family and \(\varrho \in {\mathcal {G}}_{1,1}\).
Finally, by assumption of Lemma 2.25, there are \(a_i, b_i, c_i \in {\mathbb {R}}\) for \(i \in \{1,\dots ,n\}\) and some \(c \in {\mathbb {R}}\) such that \(x = c + \sum _{i=1}^n a_i \, \varrho (b_i \, x + c_i)\) for all \(x \in {\mathbb {R}}\). If we define \(E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^n, x \mapsto (b_1 \, x + c_1, \dots , b_n \, x + c_n)\) and \(D_m : {\mathbb {R}}^n \rightarrow {\mathbb {R}}, y \mapsto c + \sum _{i=1}^n a_i \, y_i\), then \(E_m, D_m\) are affine-linear, and \(\mathrm {id}_{{\mathbb {R}}} = D_m \circ (\varrho \otimes \cdots \otimes \varrho ) \circ E_m\) for all \(m \in {\mathbb {N}}\). Thus, all assumptions of Proposition A.6 are satisfied, so that this proposition implies \( {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset \overline{{\mathtt {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k}} = {\mathtt {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k} \) for all \(d,k \in {\mathbb {N}}\), \(W,N \in {\mathbb {N}}_0 \cup \{\infty \}\) and \(L \in {\mathbb {N}}\cup \{\infty \}\). Here, we used that the (sequential) closure of a set M with respect to the discrete topology is simply the set M itself. \(\square \)
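As a concrete illustration of the representation \(x = c + \sum _{i=1}^n a_i \, \varrho (b_i \, x + c_i)\) used in this proof, the following Python sketch checks it in one special case. The choice of the ReLU \(\varrho (x) = \max \{0,x\}\) with \(n = 2\), as well as all function and variable names, are assumptions made here for illustration; they are not part of the lemma.

```python
def relu(x: float) -> float:
    """Assumed activation function: the ReLU."""
    return max(0.0, x)

# Assumed coefficients realizing x = c + sum_i a_i * relu(b_i * x + c_i), n = 2.
A = (1.0, -1.0)   # outer weights a_i
B = (1.0, -1.0)   # inner weights b_i
C = (0.0, 0.0)    # inner shifts c_i
C0 = 0.0          # outer shift c

def encode(x: float) -> list:
    """Affine-linear map E : R -> R^n, x -> (b_i * x + c_i)_i."""
    return [b * x + c for b, c in zip(B, C)]

def decode(y: list) -> float:
    """Affine-linear map D : R^n -> R, y -> c + sum_i a_i * y_i."""
    return C0 + sum(a * yi for a, yi in zip(A, y))

def identity_via_relu(x: float) -> float:
    """Evaluate D o (relu x relu) o E at x."""
    return decode([relu(t) for t in encode(x)])
```

Under these assumptions, `identity_via_relu` reproduces its argument, which is the one-neuron-to-\(n\)-neurons replacement that Proposition A.6 exploits.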
Finally, we will use Proposition A.6 to provide a proof of Lemma 2.22. To this end, the following lemma is essential.
Lemma A.7
Let \((f_n)_{n \in {\mathbb {N}}_0}\) and \((g_n)_{n \in {\mathbb {N}}_0}\) be sequences of functions \(f_n : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) and \(g_n : {\mathbb {R}}^k \rightarrow {\mathbb {R}}^\ell \). Assume that \(f_0, g_0\) are continuous and that \(f_n \xrightarrow [n\rightarrow \infty ]{} f_0\) and \(g_n \xrightarrow [n\rightarrow \infty ]{} g_0\) with locally uniform convergence. Then, \(g_0 \circ f_0\) is continuous, and \(g_n \circ f_n \xrightarrow [n\rightarrow \infty ]{} g_0 \circ f_0\) with locally uniform convergence.\(\blacktriangleleft \)
Proof
Locally uniform convergence on \({\mathbb {R}}^d\) is equivalent to uniform convergence on bounded sets. Furthermore, the continuous function \(f_0\) is bounded on each bounded set \(K \subset {\mathbb {R}}^d\); by uniform convergence, this implies that \(K' := \{ f_0(x) :x \in K \} \cup \{ f_n (x) :n \in {\mathbb {N}}\text { and } x \in K \} \subset {\mathbb {R}}^k\) is bounded as well. Hence, the continuous function \(g_0\) is uniformly continuous on \(K'\). From these observations, the claim follows easily; the details are left to the reader. \(\square \)
Given this auxiliary result, we can now prove Lemma 2.22.
Proof of Lemma 2.22
For \(d,k \in {\mathbb {N}}\), define \({\mathcal {G}}_{d,k} := \{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \,\mid \, f \text { continuous} \}\), and let \({\mathcal {T}}_{d,k}\) denote the topology of locally uniform convergence on \(\{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\). We claim that \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family. Indeed, the first condition in Definition A.5 is trivial, and the third condition holds thanks to Lemma A.7. Finally, it is not hard to see that if \(f_i^{(n)} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) satisfy \(f_i^{(n)} \rightarrow f_i^{(0)}\) locally uniformly for all \(i \in \{1,\dots ,p\}\), then \( f_1^{(n)} \otimes \cdots \otimes f_p^{(n)} \xrightarrow [n\rightarrow \infty ]{} f_1^{(0)} \otimes \cdots \otimes f_p^{(0)} \) locally uniformly. This proves the second condition in Definition A.5.
We want to apply Proposition A.6 with \(n = 2\). We have \(\varrho \in {\mathcal {G}}_{1,1}\), since \(\varrho \) is continuous by the assumptions of Lemma 2.22. Thus, it remains to construct sequences \((E_m)_{m \in {\mathbb {N}}}, (D_m)_{m \in {\mathbb {N}}}\) of affine-linear maps \(E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^2\) and \(D_m : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}\) such that \(D_m \circ (\varrho \otimes \varrho ) \circ E_m \rightarrow \mathrm {id}_{{\mathbb {R}}}\) with locally uniform convergence. Once these are constructed, Proposition A.6 shows that \({\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset \overline{{\mathtt {SNN}}^{\varrho ,d,k}_{4W, L, 2N}}\), where the closure is with respect to locally uniform convergence. This is precisely what is claimed in Lemma 2.22.
To construct \(E_m, D_m\), let us set \(a := \varrho ' (x_0) \ne 0\). By definition of the derivative, for arbitrary \(m \in {\mathbb {N}}\) and \(\varepsilon _m := |a|/m\), there is some \(\delta _m > 0\) satisfying
Now, define affine-linear maps
and set \(F_m := D_m \circ (\varrho \otimes \varrho ) \circ E_m\).
Finally, let \(x \in {\mathbb {R}}\) be arbitrary with \(0 < |x| \le \sqrt{m}\), and set \(h := \delta _m \cdot x / \sqrt{m}\), so that \(0 < |h| \le \delta _m\). By multiplying Eq. (A.5) with |h|/|a|, we then get
where the last step used that \(|x| \le \sqrt{m}\). This estimate is trivially valid for \(x = 0\). Put differently, we have thus shown \(|F_m (x) - x|\le 1/\sqrt{m}\) for all \(x \in {\mathbb {R}}\) with \(|x| \le \sqrt{m}\). That is, \(F_m \xrightarrow [m\rightarrow \infty ]{} \mathrm {id}_{{\mathbb {R}}}\) with locally uniform convergence. \(\square \)
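The estimate \(|F_m (x) - x| \le 1/\sqrt{m}\) can be checked numerically in a simple case. The sketch below is only an illustration under stated assumptions: it takes \(\varrho = \tanh \), \(x_0 = 0\) (so that \(a = 1\)), the choice \(\delta _m = 1/\sqrt{m}\) (admissible here because \(|\tanh (h)/h - 1| \le h^2/3\)), and one possible form of the affine-linear maps, namely \(E_m (x) = (x_0 + \delta _m x / \sqrt{m}, x_0)\) and \(D_m (y_1,y_2) = \frac{\sqrt{m}}{a \, \delta _m} (y_1 - y_2)\). These concrete maps and all names in the code are hypothetical.

```python
import math

def make_F(m, rho=math.tanh, x0=0.0, a=1.0):
    """Return F_m = D_m o (rho x rho) o E_m for the assumed maps E_m, D_m."""
    root = math.sqrt(m)
    delta = 1.0 / root  # assumed admissible delta_m for tanh at x0 = 0

    def E(x):
        # Affine-linear map R -> R^2.
        return (x0 + delta * x / root, x0)

    def D(y1, y2):
        # Affine-linear map R^2 -> R: a rescaled difference.
        return root / (a * delta) * (y1 - y2)

    def F(x):
        u, v = E(x)
        return D(rho(u), rho(v))

    return F

def max_error(m, samples=201):
    """Largest value of |F_m(x) - x| on a grid in [-sqrt(m), sqrt(m)]."""
    F = make_F(m)
    root = math.sqrt(m)
    xs = [-root + 2.0 * root * i / (samples - 1) for i in range(samples)]
    return max(abs(F(x) - x) for x in xs)
```

In this example, `max_error(m)` stays below `1 / math.sqrt(m)` on the sampled grid, while the interval \([-\sqrt{m}, \sqrt{m}]\) on which the bound holds grows with \(m\).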
A.10 Proof of Lemma 2.24
We will need the following lemma that will also be used elsewhere.
Lemma A.8
For \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) and \(a \in {\mathbb {R}}\), let \(T_a f : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto T_a f (x) = f(x-a)\). Furthermore, for \(n \in {\mathbb {N}}_0\), let \(X^n : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto x^n\) and \(V_n := \mathrm {span} \{T_a X^n \, :\, a \in {\mathbb {R}}\}\), with the convention \(X^0 \equiv 1\).
We have \(V_n = {\mathbb {R}}_{\deg \le n}[x]\); that is, \(V_n\) is the space of all polynomials of degree at most n.\(\blacktriangleleft \)
Proof
Clearly, \(V_n \subset {\mathbb {R}}_{\deg \le n} [x] =: V\), where \(\dim V = n+1\). Therefore, it suffices to show that \(V_n\) contains \(n+1\) linearly independent elements. In fact, we show that whenever \(a_1,\dots ,a_{n+1} \in {\mathbb {R}}\) are pairwise distinct, then the family \((T_{a_i} X^n)_{i=1,\dots ,n+1} \subset V_n\) is linearly independent.
To see this, suppose that \(\theta _1,\dots ,\theta _{n+1} \in {\mathbb {R}}\) are such that \(0 \equiv \sum _{i=1}^{n+1} \theta _i \, T_{a_i} X^n\). A direct computation using the binomial theorem shows that this implies \( 0 \equiv \sum _{\ell =0}^n \big [ \left( {\begin{array}{c}n\\ \ell \end{array}}\right) (-1)^\ell X^{n-\ell } \sum _{i=1}^{n+1} \theta _i a_i^\ell \big ] \).
By comparing the coefficients of \(X^{n-\ell }\), this leads to \(0 = \big ( \sum _{i=1}^{n+1} a_i^\ell \, \theta _i \big )_{\ell =0,\dots ,n} = A^T \theta \), where \(\theta = (\theta _1,\dots ,\theta _{n+1}) \in {\mathbb {R}}^{n+1}\), and where the Vandermonde matrix \(A := (a_i^j)_{i=1,\dots ,n+1, j=0,\dots ,n} \in {\mathbb {R}}^{(n+1) \times (n+1)}\) is invertible; see [34, Equation (4-15)]. Hence, \(\theta = 0\), showing that \((T_{a_i} X^n)_{i=1,\dots ,n+1}\) is a linearly independent family. \(\square \)
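The linear independence of the shifted monomials can also be verified in exact arithmetic for small examples. The following sketch is an illustration only; the helper names are hypothetical. It expands \((x - a_i)^n\) in the monomial basis and computes the rank of the resulting coefficient matrix by Gaussian elimination over the rationals.

```python
from fractions import Fraction
from math import comb

def shifted_monomial_coeffs(a, n):
    """Coefficients of (x - a)^n with respect to 1, x, ..., x^n."""
    return [Fraction(comb(n, k)) * Fraction(-a) ** (n - k) for k in range(n + 1)]

def rank(rows):
    """Rank of a matrix with Fraction entries via Gaussian elimination."""
    rows = [list(r) for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def shifts_span_polynomials(shifts, n):
    """True if the polynomials (x - a)^n, a in shifts, span all of degree <= n."""
    return rank([shifted_monomial_coeffs(a, n) for a in shifts]) == n + 1
```

For pairwise distinct shifts the rank equals \(n+1\), in line with the Vandermonde argument; repeating a shift makes the family linearly dependent.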
Proof of Lemma 2.24
First, note
Next, Lemma A.8 shows that \(V_r = {\mathbb {R}}_{\deg \le r}[x]\) has dimension \(r+1\). Thus, given any polynomial \({f \in {\mathbb {R}}_{\deg \le r}[x]}\), there are \(a_1, \dots , a_{r+1} \in {\mathbb {R}}\) and \(b_1, \dots , b_{r+1} \in {\mathbb {R}}\) such that for all \(x \in {\mathbb {R}}\)
\(\square \)
A.11 Proof of Lemma 2.26
For Part (1), define \(w_{j} := 6n(2^{j}-1)\) and \(m_{j} := (2n+1)(2^j-1)-1\). We will prove below by induction on \(j \in {\mathbb {N}}\) that \(M_{2^{j}} \in {\mathtt {NN}}^{\varrho ,2^{j},1}_{w_j,2j,m_j}\). Let us see first that this implies the result. For arbitrary \(d \in {\mathbb {N}}_{\ge 2}\) and \(j = \lceil \log _{2} d \rceil \), it is not hard to see that
is affine-linear with \(\Vert P\Vert _{\ell ^{0,\infty }_*}=1\) [cf. Eq. (2.4)] and that \(M_{d} = M_{2^{j}} \circ P\). Using Lemma 2.18-(1) we get \(M_{d} \in {\mathtt {NN}}^{\varrho ,d,1}_{w_j,2j,m_j}\) as claimed.
We now proceed to the induction. As a preliminary, note that by assumption there are \(a \in {\mathbb {R}}\), \(\alpha _1, \dots , \alpha _n \in {\mathbb {R}}\) and \(\beta _1, \dots , \beta _n \in {\mathbb {R}}\) such that for all \(x \in {\mathbb {R}}\)
Put differently, the affine-linear maps \(T_1 : {\mathbb {R}}\rightarrow {\mathbb {R}}^{n}, x \mapsto (x-\alpha _{\ell })_{\ell =1}^{n}\) and \({T_2 : {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}, y \mapsto a + \sum _{\ell =1}^{n} \beta _{\ell } \, y_{\ell }}\) satisfy \(x^2 = T_2 \circ (\varrho \otimes \cdots \otimes \varrho ) \circ T_1 (x)\) for all \(x \in {\mathbb {R}}\), where the \(\otimes \)-product has n factors. Since \({x \cdot y = \tfrac{1}{4} \big ( (x+y)^2 - (x-y)^2 \big )}\) for all \(x,y \in {\mathbb {R}}\), if we define the maps \({T_0 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}^2, (x,y) \mapsto (x + y, x-y)}\) and \(T_3 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, (u,v) \mapsto \frac{1}{4} (u - v)\), then for all \(x,y \in {\mathbb {R}}\)
where \({S_1 := (T_1 \otimes T_1) \circ T_0 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}^{2n}}\) and \(S_2 := T_3 \circ (T_2 \otimes T_2) : {\mathbb {R}}^{2n} \rightarrow {\mathbb {R}}\). As \(\Vert S_1\Vert _{\ell ^{0}} \le 4n\) and \(\Vert S_2\Vert _{\ell ^{0}} \le 2n\), we obtain \(M_{2} = {\mathtt {R}}(\Phi _{1})\) where \( \Phi _{1} = \big ( (S_{1}, \varrho \otimes \cdots \otimes \varrho ),(S_{2},\mathrm {id}) \big ) \in {\mathcal {SNN}}_{6n, 2, 2n}^{\varrho , 2, 1} \). This establishes the claim of the induction for \(j=1\): \(M_{2} \in {\mathtt {SNN}}_{6n, 2, 2n}^{\varrho , 2, 1} \subset {\mathtt {NN}}_{w_j, 2j, m_j}^{\varrho , 2, 1}\) for \(j = 1\).
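The reduction of multiplication to squaring can be sketched in a few lines. In the code below, the exact square `square` stands in for the subnetwork \(T_2 \circ (\varrho \otimes \cdots \otimes \varrho ) \circ T_1\); this substitution, like the function names, is an assumption made for illustration and does not depend on a particular activation function.

```python
def square(x: float) -> float:
    """Stand-in for the network block T_2 o (rho x ... x rho) o T_1 computing x^2."""
    return x * x

def T0(x: float, y: float) -> tuple:
    """Linear map (x, y) -> (x + y, x - y)."""
    return (x + y, x - y)

def T3(u: float, v: float) -> float:
    """Linear map (u, v) -> (u - v) / 4."""
    return 0.25 * (u - v)

def M2(x: float, y: float) -> float:
    """Product x * y via the polarization identity xy = ((x+y)^2 - (x-y)^2) / 4."""
    s, d = T0(x, y)
    return T3(square(s), square(d))
```

The two calls to `square` correspond to the two copies of the squaring block, which explains the \(2n\) neurons in the hidden layer of \(\Phi _1\).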
We proceed to the actual induction step. Define the affine maps \(U_1, U_2 : {\mathbb {R}}^{2^{j+1}} \rightarrow {\mathbb {R}}^{2^{j}}\) by
With these definitions, observe that \( M_{2^{j+1}}(x) = M_{2^{j}}({\overline{x}}) M_{2^{j}}(x') = M_{2} \big ( M_{2^{j}}(U_{1}(x)),M_{2^{j}}(U_{2}(x)) \big ) \).
By the induction hypothesis, there is a network \( \Phi _{j} = \big ( (V_{1},\alpha _{1}), \ldots , (V_{L},\mathrm {id}) \big ) \in {\mathcal {NN}}_{w_{j}, 2j, m_{j}}^{\varrho , 2^{j}, 1} \) with \(L(\Phi _{j}) = L \le 2j\) such that \(M_{2^{j}} = {\mathtt {R}}(\Phi _{j})\). Since \(\Vert U_{i}\Vert _{\ell ^{0,\infty }_*}=1\), the second part of Lemma A.3 shows \(\Vert V_1 \circ U_i\Vert _{\ell ^0} \le \Vert V_{1}\Vert _{\ell ^{0}}\), whence \(M_{2^{j}} \circ U_i = {\mathtt {R}}(\Psi _{i})\), where \( \Psi _{i} = \big ( (V_{1} \circ U_{i}, \alpha _{1}), (V_{2}, \alpha _{2}), \ldots , (V_{L},\mathrm {id}) \big ) \) satisfies \(W(\Psi _{i}) \le W(\Phi _{j})\), \(N(\Psi _{i}) \le N(\Phi _{j})\), \(L(\Psi _{i}) = L\), and \(\Psi _{i} \in {\mathcal {NN}}^{\varrho ,2^{j+1},1}_{w_{j},2j,m_{j}}\). Thus, Lemma A.1 shows that \( f := (M_{2^{j}} \circ U_1, M_{2^{j}} \circ U_2) \in {\mathtt {NN}}_{2w_{j},2j,2m_{j}}^{\varrho ,2^{j+1},2} \). Since \(M_{2} \in {\mathtt {NN}}^{\varrho ,2,1}_{6n,2,2n}\), Lemma 2.18-(2) shows that \(M_{2^{j+1}} = M_{2} \circ f \in {\mathtt {NN}}^{\varrho ,2^{j+1},1}_{2w_{j}+6n,2j+2,2m_{j}+2n+2}\).
To conclude the proof of Part (1), note that \(2w_{j}+6n = 12 n(2^{j}-1) + 6n = 6 n(2^{j+1}-1) = w_{j+1}\) and \(2m_{j}+2n+2 = 2(2n+1)(2^{j}-1)+2n = (2n+1) (2^{j+1}-2)+2n+1-1 = m_{j+1}\).
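The bookkeeping of Part (1) can be double-checked mechanically. The following sketch (with hypothetical function names) verifies the two recursions for \(w_j\) and \(m_j\) and mimics the construction itself: the input is padded with ones up to length \(2^j\) with \(j = \lceil \log _2 d \rceil \), and the product is then evaluated along a balanced binary tree. Ordinary multiplication is used in place of the network realizing \(M_2\), which is an assumption made to keep the example short.

```python
def w(j: int, n: int) -> int:
    """Weight budget w_j = 6n(2^j - 1)."""
    return 6 * n * (2 ** j - 1)

def m(j: int, n: int) -> int:
    """Neuron budget m_j = (2n+1)(2^j - 1) - 1."""
    return (2 * n + 1) * (2 ** j - 1) - 1

def recursion_holds(j: int, n: int) -> bool:
    """Check 2 w_j + 6n = w_{j+1} and 2 m_j + 2n + 2 = m_{j+1}."""
    return (2 * w(j, n) + 6 * n == w(j + 1, n)
            and 2 * m(j, n) + 2 * n + 2 == m(j + 1, n))

def product_tree(xs):
    """Product of d >= 2 numbers via padding with ones and a binary tree."""
    xs = list(xs)
    j = (len(xs) - 1).bit_length()          # equals ceil(log2(d)) for d >= 2
    xs += [1.0] * (2 ** j - len(xs))        # padding map: append ones

    def rec(v):
        if len(v) == 2:
            return v[0] * v[1]              # stand-in for the network M_2
        half = len(v) // 2
        return rec(v[:half]) * rec(v[half:])

    return rec(xs)
```

The recursion depth of `rec` equals \(j\), matching the depth bound \(2j\) obtained from stacking \(j\) copies of the two-layer network for \(M_2\).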
To prove Part (2), we recall from Part (1) that \(M_{2} : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, (x,y) \mapsto x \cdot y\) satisfies \(M_{2} = {\mathtt {R}}(\Psi )\) with \(\Psi \in {\mathcal {SNN}}_{6n, 2, 2n}^{\varrho , 2, 1}\) and \(L(\Psi ) = 2\). Next, let \(P^{(i)} : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}\times {\mathbb {R}}, (x, y) \mapsto (x, y_i)\) for each \(i \in \{1,\dots ,k\}\), and note that \(P^{(i)}\) is linear with \(\Vert P^{(i)}\Vert _{\ell ^{0,\infty }} = 1 = \Vert P^{(i)}\Vert _{\ell ^{0,\infty }_*}\). Lemma 2.18-(1) shows that \(M_{2} \circ P^{(i)} = {\mathtt {R}}(\Psi _{i})\) where \(\Psi _{i} \in {\mathcal {SNN}}^{\varrho ,1+k,1}_{6n,2,2n}\) and \(L(\Psi _{i}) = L(\Psi )=2\). To conclude, observe \({(M_{2} \circ P^{(i)}) (x,y) = x \cdot y_i = [m(x,y)]_i}\) for \({m : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}^k, (x,y) \mapsto x \cdot y}\). Therefore, Lemma 2.17-(2) shows that \(m = (M_{2} \circ P^{(1)}, \dots , M_{2} \circ P^{(k)}) \in {\mathtt {NN}}^{\varrho , 1+k, k}_{6kn, 2,2kn}\), as desired. \(\square \)
Appendix B. Proofs for Sect. 3
B.1 Proof of Lemma 3.1
Let \(f \in X\). For the sake of brevity, set \(\varepsilon _n := E(f, \Sigma _n)_X\) and \(\delta _n := E(f, \Sigma _n')_{X}\) for \(n \in {\mathbb {N}}_0\). First, observe that \(\varepsilon _n \le \Vert f\Vert _X = \delta _0\) for all \(n \in {\mathbb {N}}_0\). Furthermore, we have by assumption that \(\varepsilon _{cm} \le C \cdot \delta _m\) for all \(m \in {\mathbb {N}}\). Now, setting \(m_n := \lfloor \frac{n - 1}{c} \rfloor \in {\mathbb {N}}\) for \(n \in {\mathbb {N}}_{\ge c + 1}\), note that \(n - 1 \ge c \, m_n\), and hence \(\varepsilon _{n-1} \le \varepsilon _{c \, m_n} \le C \cdot \delta _{m_n}\). Therefore, we see
Next, note for \(n \in {\mathbb {N}}_{\ge c + 1}\) that \(m_n \ge 1\) and \(m_n \ge \frac{n - 1}{c} - 1\), whence \(n \le c \, m_n + c + 1 \le (2 c + 1) m_n\). Therefore, \(n^\alpha \le (2c + 1)^\alpha m_n^\alpha \). Likewise, since \(m_n \le n\), we have \(n^{-1} \le m_n^{-1}\) for all \(n \in {\mathbb {N}}_{\ge c + 1}\).
There are now two cases. First, if \(q < \infty \), and if we set \(K := K(\alpha ,q,c) := \sum _{n=1}^c n^{\alpha q - 1}\), then
Further, for \(n \in {\mathbb {N}}_{\ge c + 1}\) satisfying \(m_n = m\) for some \(m \in {\mathbb {N}}\), we have \(m \le \frac{n-1}{c} < m+1\), which easily implies \(|\{ n \in {\mathbb {N}}_{\ge c + 1} :m_n = m\}| \le |\{ n \in {\mathbb {N}}:c m + 1 \le n < c m + c + 1 \}| = c\). Thus,
Overall, we thus see for \(q < \infty \) that
where the constant \(K + C^q (2c+1)^{\alpha q} c\) only depends on \(\alpha ,q,c,C\).
The adaptations for the (easier) case \(q = \infty \) are left to the reader. \(\square \)
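The two elementary facts about the index map \(n \mapsto m_n = \lfloor \frac{n-1}{c} \rfloor \) that drive this proof, namely \(c \, m_n \le n - 1\) together with \(n \le (2c+1) m_n\), and the bound \(c\) on the size of each fibre \(\{ n :m_n = m \}\), can be checked by brute force. The sketch below is only an illustration; the function names are hypothetical.

```python
def m_n(n: int, c: int) -> int:
    """Index map m_n = floor((n - 1) / c)."""
    return (n - 1) // c

def index_bounds_hold(c: int, n_max: int) -> bool:
    """Check m_n >= 1, c * m_n <= n - 1 and n <= (2c + 1) * m_n for c + 1 <= n <= n_max."""
    return all(
        m_n(n, c) >= 1
        and c * m_n(n, c) <= n - 1
        and n <= (2 * c + 1) * m_n(n, c)
        for n in range(c + 1, n_max + 1)
    )

def fibre_sizes(c: int, n_max: int) -> dict:
    """Number of n in [c + 1, n_max] with m_n = m, for each occurring m."""
    counts = {}
    for n in range(c + 1, n_max + 1):
        counts[m_n(n, c)] = counts.get(m_n(n, c), 0) + 1
    return counts
```

Each value \(m\) is attained by at most \(c\) indices \(n\), which is what produces the factor \(c\) in the final constant.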
B.2 Proof of Lemma 3.20
For \(p \in (0,\infty )\), the claim is clear, since it is well known that \(L_{p}(\Omega ;{\mathbb {R}}^{k})\) is complete, and since one can extend each \(g \in L_{p}(\Omega ;{\mathbb {R}}^{k})\) by zero to a function \(f \in L_p({\mathbb {R}}^d;{\mathbb {R}}^k)\) satisfying \(g = f|_{\Omega }\).
Now, we consider the case \(p = \infty \). We first prove completeness of . Let be a Cauchy sequence. It is well known that there is a continuous function \(f : \Omega \rightarrow {\mathbb {R}}^k\) such that \(f_n \rightarrow f\) uniformly. In fact (see, for instance, [63, Theorem 12.8]), f is uniformly continuous. It remains to show that f vanishes at infinity. Let \(\varepsilon > 0\) be arbitrary, and choose \(n \in {\mathbb {N}}\) such that \(\Vert f - f_n\Vert _{\sup } \le \frac{\varepsilon }{2}\). Since \(f_n\) vanishes at \(\infty \), there is \(R > 0\) such that \(|f_n(x)| \le \frac{\varepsilon }{2}\) for \(x \in \Omega \) with \(|x| \ge R\). Therefore, \(|f(x)| \le \varepsilon \) for such x, proving that , while follows from the uniform convergence \(f_n \rightarrow f\).
Finally, we prove that . By considering components it is enough to prove that . To see that , simply note (see Footnote 7) that if \(f \in C_0 ({\mathbb {R}}^d)\), then f is not only continuous, but in fact uniformly continuous. Therefore, \(f|_{\Omega }\) is also uniformly continuous (and vanishes at infinity), whence .
For proving , we will use the notion of the one-point compactification \(Z_\infty := \{\infty \} \cup Z\) of a locally compact Hausdorff space Z (where we assume that \(\infty \notin Z\)); see [26, Proposition 4.36]. The topology on \(Z_\infty \) is given by \( {\mathcal {T}}_Z := \{ U :U \subset Z \text { open} \} \cup \{ Z_\infty {\setminus } K :K \subset Z \text { compact} \} \). Then, \((Z_\infty ,{\mathcal {T}}_Z)\) is a compact Hausdorff space and the topology induced on Z as a subspace of \(Z_\infty \) coincides with the original topology on Z; see [26, Proposition 4.36]. Furthermore, if \(A \subset Z\) is closed, then a direct verification shows that the relative topology on \(A_\infty \) as a subset of \(Z_\infty \) coincides with the topology \({\mathcal {T}}_A\).
Now, let . Since g is uniformly continuous, it follows (see [3, Lemma 3.11]) that there is a uniformly continuous function \({\widetilde{g}} : A \rightarrow {\mathbb {R}}\) satisfying \(g = {\widetilde{g}}|_{\Omega }\), with \(A := {\overline{\Omega }} \subset {\mathbb {R}}^{d}\) the closure of \(\Omega \) in \({\mathbb {R}}^d\).
Since \(g \in C_0(\Omega )\), it is not hard to see that \({\widetilde{g}} \in C_0(A)\). Hence, [26, Proposition 4.36] shows that the function \(G : A_\infty \rightarrow {\mathbb {R}}\) defined by \(G(x) = {\widetilde{g}}(x)\) for \(x \in A\) and \(G(\infty ) = 0\) is continuous. Since \(A_\infty \subset ({\mathbb {R}}^d)_\infty \) is compact, the Tietze extension theorem (see [26, Theorem 4.34]) shows that there is a continuous extension \(H : ({\mathbb {R}}^d)_\infty \rightarrow {\mathbb {R}}\) of G. Again by [26, Proposition 4.36], this implies that \(f := H|_{{\mathbb {R}}^d} \in C_0({\mathbb {R}}^d)\). By construction, we have \(g = f|_{\Omega }\). \(\square \)
B.3 Proof of Theorem 3.23
B.3.1 Proof of Claims 1a-1b
We use the following lemma.
Lemma B.1
Let \({\mathcal {C}}\) be one of the following classes of functions:
- locally bounded functions;
- Borel-measurable functions;
- continuous functions;
- Lipschitz continuous functions;
- locally Lipschitz continuous functions.
If the activation function \(\varrho \) belongs to \({\mathcal {C}}\), then any \(f \in {\mathtt {NN}}^{\varrho ,d,k}\) also belongs to \({\mathcal {C}}\).\(\blacktriangleleft \)
Proof
First, note that each affine-linear map \(T : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) belongs to all of the mentioned classes. Furthermore, note that since \({\mathbb {R}}^d\) is locally compact, a function \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) is locally bounded [locally Lipschitz] if and only if f is bounded [Lipschitz continuous] on each bounded set. From this, it easily follows that each class \({\mathcal {C}}\) is closed under composition.
Finally, it is not hard to see that if \(f_1, \dots , f_n : {\mathbb {R}}\rightarrow {\mathbb {R}}\) all belong to the class \({\mathcal {C}}\), then so does \(f_1 \otimes \cdots \otimes f_n : {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\).
Combining these facts with the definition of the realization of a neural network, we get the claim. \(\square \)
As \(\varrho \) is locally bounded and Borel measurable, by Lemma B.1 each \(g \in {\mathtt {NN}}^{\varrho ,d,k}\) is locally bounded and measurable. As \(\Omega \) is bounded, we get \(g|_{\Omega } \in L_{p}(\Omega ;{\mathbb {R}}^{k})\) for all \(p \in (0,\infty ]\), and hence if \(p < \infty \). This establishes claim 1a. Finally, if \(p = \infty \), then by our additional assumption that \(\varrho \) is continuous, g is continuous by Lemma B.1. On the compact set \({\overline{\Omega }}\), g is thus uniformly continuous and bounded, so that \(g|_{\Omega }\) is uniformly continuous and bounded as well, that is, . This establishes claim 1b. \(\square \)
B.3.2 Proof of Claims 1c-1d
We first consider the case \(p < \infty \). Let and \(\varepsilon > 0\). For each \(i \in \{1,\dots ,k\}\), extend the i-th component function \(f_i\) by zero to a function \(g_i \in L_p({\mathbb {R}}^d)\). As is well known (see, for instance, [25, Chapter VI, Theorem 2.31]), \(C_c^\infty ({\mathbb {R}}^d)\) is dense in \(L_p({\mathbb {R}}^d)\), so that we find \(h_i \in C_c^\infty ({\mathbb {R}}^d)\) satisfying \(\Vert g_i - h_i \Vert _{L_p} < \varepsilon \). Choose \(R > 0\) satisfying \({{\text {supp}}}(h_i) \subset [-R,R]^d\) and \({\Omega \subset [-R,R]^d}\). By the universal approximation theorem (Theorem 3.22), we can find \(\gamma _i \in {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,2,\infty } \subset {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\) satisfying \(\Vert h_i - \gamma _i \Vert _{L_\infty ([-R,R]^d)} \le \varepsilon / (4R)^{d/p}\). Note that the inclusion \({\mathtt {NN}}^{\varrho ,d,1}_{\infty ,2,\infty } \subset {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\) used above is (only) true since we are considering generalized neural networks, and since \(L \ge 2\).
Using the elementary estimate \((a + b)^p \le (2 \max \{a, b \})^p \le 2^p (a^p + b^p)\), we see
which easily implies \(\Vert \gamma _i - g_i\Vert _{L_p ([-R,R]^d)}^p \le 2^p (\varepsilon ^p + \Vert h_i - g_i\Vert _{L_p([-R,R]^d)}^p) \le 2^{1 + p} \varepsilon ^p\).
Lemma 2.17 shows that \(\gamma := (\gamma _1, \dots , \gamma _k) \in {\mathtt {NN}}^{\varrho ,d,k}_{\infty ,L,\infty }\), whence by claims 1a-1b of Theorem 3.23. Finally, since \(g_i|_{\Omega } = f_i\), we have
Since \(\varepsilon > 0\) was arbitrary, this proves the desired density.
Now, we consider the case \(p = \infty \). Let . Lemma 3.20 shows that there is a continuous function \(g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) such that \(f = g|_{\Omega }\). Since \(L \ge 2\), we can apply the universal approximation theorem (Theorem 3.22) to each of the component functions \(g_i\) of \(g = (g_1,\dots ,g_k)\) to obtain functions \(\gamma _i \in {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,2,\infty } \subset {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\) satisfying \(\Vert g_i - \gamma _i \Vert _{L_\infty ([-R,R]^d)} \le \varepsilon \), where we chose \(R > 0\) so large that \(\Omega \subset [-R,R]^d\). Lemma 2.17 shows that \(\gamma := (\gamma _1, \dots , \gamma _k) \in {\mathtt {NN}}^{\varrho ,d,k}_{\infty ,L,\infty }\), whence by claims 1a-1b of Theorem 3.23, since \(\varrho \) is continuous. Finally, since \(g_i |_{\Omega } = f_i\), we have
Since \(\varepsilon > 0\) was arbitrary, this proves the desired density. \(\square \)
B.3.3 Proof of Claim (2)
Set . Lemma 2.17 easily shows that \({\mathcal {V}}\) is a vector space. Furthermore, Lemma 2.18 shows that if \(f \in {\mathcal {V}}\), \(A \in \mathrm {GL}({\mathbb {R}}^d)\), and \(b \in {\mathbb {R}}^d\), then \(f (A \bullet + b) \in {\mathcal {V}}\) as well. Clearly, all these properties also hold for \(\overline{{\mathcal {V}}}\) instead of \({\mathcal {V}}\), where the closure is taken in \(X_p({\mathbb {R}}^d)\).
It suffices to show that \({\mathcal {V}}\) is dense in . Indeed, suppose for the moment that this is true. Let be arbitrary. By applying Lemma 3.20 to each of the component functions \(f_i\) of f, we see for each \(i \in \{1,\dots ,k\}\) that there is a function such that \(f_i = F_i |_{\Omega }\). Now, let \(\varepsilon > 0\) be arbitrary, and set \(p_0 := \min \{1,p\}\). Since \({\mathcal {V}}\) is dense in , there is for each \(i \in \{1,\dots ,k\}\) a function \(G_i \in {\mathcal {V}}\) such that \(\Vert G_i - F_i\Vert _{L_p}^{p_0} \le \varepsilon ^{p_0} / k\). Lemma 2.17 shows , and it is not hard to see that , and hence, . As \(\varepsilon > 0\) and were arbitrary, this proves that is dense in , as desired.
It remains to show that is dense. To prove this, we distinguish three cases:
Case 1 (\(p \in [1,\infty )\)): First, given the existence of the “radially decreasing \(L_1\)-majorant” \(\mu \) for g, [11, Lemma A.2] shows that \(P|g| \in L_\infty ({\mathbb {R}}^d) \subset L_p^{\mathrm {loc}}({\mathbb {R}}^d)\), where P|g| is a certain periodization of |g| whose precise definition is immaterial for us. Since \(g \in L_p ({\mathbb {R}}^d)\) and \(P|g| \in L_p^{\mathrm {loc}}({\mathbb {R}}^d)\), and \(\int _{{\mathbb {R}}^d} g(x) \, dx \ne 0\), [11, Corollary 1] implies that \({\mathcal {V}}_0 := \mathrm {span}\{ g_{j,k} :j \in {\mathbb {N}}, k \in {\mathbb {Z}}^d \}\) is dense in \(L_p({\mathbb {R}}^d)\), where \(g_{j,k}(x) = 2^{jd/p} \cdot g(2^j x - k)\). As a consequence of the properties of the space \({\mathcal {V}}\) that we mentioned above, and since \(g \in \overline{{\mathcal {V}}}\), we have \({\mathcal {V}}_0 \subset \overline{{\mathcal {V}}}\). Hence, \({\mathcal {V}} \subset L_p ({\mathbb {R}}^d)\) is dense, and we have since \(p < \infty \).
Case 2 (\(p \in (0,1)\)): Since \(g \in L_1({\mathbb {R}}^d) \cap L_p({\mathbb {R}}^d)\) with \(\int _{{\mathbb {R}}^d} g(x) \, d x \ne 0\), [39, Theorem 4 and Proposition 5(a)] show that \({\mathcal {V}}_0 \subset L_p({\mathbb {R}}^d)\) is dense, where the space \({\mathcal {V}}_0\) is defined precisely as for \(p \in [1,\infty )\). The rest of the proof is as for \(p \in [1,\infty )\).
Case 3 (\(p = \infty \)): Note . Let us assume toward a contradiction that \({\mathcal {V}}\) is not dense in \(C_0({\mathbb {R}}^d)\). By the Hahn–Banach theorem (see, for instance, [26, Theorem 5.8]), there is a bounded linear functional \(\varphi \in (C_0({\mathbb {R}}^d))^*\) such that \(\varphi \not \equiv 0\), but \(\varphi \equiv 0\) on \(\overline{{\mathcal {V}}}\).
By the Riesz representation theorem for \(C_0\) (see [26, Theorem 7.17]), there is a finite real-valued Borel-measure \(\mu \) on \({\mathbb {R}}^d\) such that \(\varphi (f) = \int _{{\mathbb {R}}^d} f(x) \, d \mu (x)\) for all \(f \in C_0({\mathbb {R}}^d)\). Thanks to the Jordan decomposition theorem (see [26, Theorem 3.4]), there are finite positive Borel measures \(\mu _+\) and \(\mu _-\) such that \(\mu = \mu _+ - \mu _-\).
Let \(f \in C_0 ({\mathbb {R}}^d)\) be arbitrary. For \(a > 0\), define \(g_a : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto a^d \, g(a x)\), and note \(T_x g_a \in \overline{{\mathcal {V}}}\) (and hence \(\varphi (T_x g_a) = 0\)) for all \(x \in {\mathbb {R}}^d\), where \(T_x g_a (y) = g_a (y-x)\). By Fubini’s theorem and the change of variables \(y = -z\), we get
for all \(a \ge 1\). Here, Fubini’s theorem was applied to each of the integrals \(\int (f *g_a)(x) \, d \mu _{\pm } (x)\), which is justified since
Now, since \(f \in C_0({\mathbb {R}}^d)\) is bounded and uniformly continuous, [26, Theorem 8.14] shows \(f *g_a \rightarrow f\) uniformly as \(a \rightarrow \infty \). Therefore, (B.1) implies \( \varphi (f) = \int _{{\mathbb {R}}^d} f(x) \, d \mu (x) = \lim _{a \rightarrow \infty } \int _{{\mathbb {R}}^d} (f *g_a) (x) \, d \mu (x) = 0 \), since \(\mu \) is a finite measure. This implies \(\varphi \equiv 0\) on \(C_0 ({\mathbb {R}}^d)\), which is the desired contradiction. \(\square \)
B.4 Proof of Lemma 3.26
Part (1):
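Define
\[ t : {\mathbb {R}}\rightarrow {\mathbb {R}}, \quad x \mapsto \sigma \big ( \tfrac{x}{\varepsilon } \big ) - \sigma \big ( 1 + \tfrac{x - 1}{\varepsilon } \big ) . \]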
A straightforward calculation using the properties of \(\sigma \) shows that
We claim that \(0 \le t \le 1\). To see this, first note that if \(r \ge 1\), then \(\sigma (x - r) \le \sigma (x)\) for all \(x \in {\mathbb {R}}\). Indeed, if \(x \le r\), then \(\sigma (x - r) = 0 \le \sigma (x)\); otherwise, if \(x > r\), then \(x \ge 1\), and hence \(\sigma (x - r) \le 1 = \sigma (x)\). Since \(r := \frac{1}{\varepsilon } - 1 \ge 1\), we thus see that \(t(x) = \sigma (\frac{x}{\varepsilon }) - \sigma (\frac{x}{\varepsilon } - r) \ge 0\) for all \(x \in {\mathbb {R}}\). Finally, we trivially have \(t(x) \le \sigma (\frac{x}{\varepsilon }) \le 1\) for all \(x \in {\mathbb {R}}\).
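Now, if we define
\[ g_0 : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, \quad x \mapsto \sigma \Big ( 1 + \sum _{i=1}^d t(x_i) - d \Big ) , \]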
we see \(0 \le g_0 \le 1\). Furthermore, for \(x \in [\varepsilon , 1-\varepsilon ]^d\), we have \(t(x_i) = 1\) for all \(i \in \{1,\dots ,d\}\), whence \(g_0(x) = 1\). Likewise, if \(x \notin [0,1]^d\), then \(t (x_i) = 0\) for at least one \(i \in \{1,\dots ,d\}\). Since \(0 \le t (x_i) \le 1\) for all i, this implies \(\sum _{i=1}^d t (x_i) - d \le -1\), and thus \(g_0(x) = 0\). All in all, and because of \(0 \le g_0 \le 1\), these considerations imply that \({{\text {supp}}}(g_0) \subset [0,1]^{d}\) and
Now, for proving the general case of Part (1), let \(h := g_0\), while \(h := t\) in case of \(d = 1\). As a consequence of Eqs. (B.2) and (B.3) and of \(0 \le t \le 1\), we then see that Condition (3.10) is satisfied in both cases. Thus, all that needs to be shown is that \(h = g_0 \in {\mathtt {NN}}^{\varrho ,d,1}_{2dW(N+1), 2L-1, (2d+1)N}\) or that \(h = t \in {\mathtt {NN}}^{\varrho ,1,1}_{2W,L,2N}\) in case of \(d = 1\).
We will verify both of these properties in the proof of Part (2) of the lemma.
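The properties of \(t\) and \(g_0\) derived in Part (1) can be tested numerically. The sketch below assumes the clipped ramp \(\sigma (x) = \min \{1, \max \{0, x\}\}\), which satisfies the properties of \(\sigma \) used above (\(\sigma = 0\) on \((-\infty ,0]\), \(\sigma = 1\) on \([1,\infty )\), and \(0 \le \sigma \le 1\)); this concrete \(\sigma \) and the function names are assumptions for illustration. The formulas implemented are \(t(x) = \sigma (\frac{x}{\varepsilon }) - \sigma (\frac{x}{\varepsilon } - r)\) with \(r = \frac{1}{\varepsilon } - 1\), and \(g_0 = \sigma \circ G\) with \(G(x) = 1 + \sum _{i=1}^d t(x_i) - d\) as in the proof of Part (2).

```python
def sigma(x: float) -> float:
    """Assumed squashing-type function: 0 on (-inf, 0], 1 on [1, inf)."""
    return min(1.0, max(0.0, x))

def t(x: float, eps: float) -> float:
    """t(x) = sigma(x / eps) - sigma(1 + (x - 1) / eps), a trapezoid on [0, 1]."""
    return sigma(x / eps) - sigma(1.0 + (x - 1.0) / eps)

def g0(xs, eps: float) -> float:
    """g0(x) = sigma(1 + sum_i t(x_i) - d), approximating the indicator of [0, 1]^d."""
    d = len(xs)
    return sigma(1.0 + sum(t(xi, eps) for xi in xs) - d)
```

With \(\varepsilon = \frac{1}{4}\), for instance, `t` equals one on \([\frac{1}{4}, \frac{3}{4}]\) and vanishes outside \([0,1]\), and `g0` equals one on \([\frac{1}{4}, \frac{3}{4}]^d\) and vanishes outside \([0,1]^d\).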
Part (2): We first establish the claim for the special case \([a,b]= [0,1]^{d}\). With \(\lambda \) denoting the d-dimensional Lebesgue measure, and with h as constructed in Part (1), we deduce from (3.10) that
Since the right-hand side vanishes as \(\varepsilon \rightarrow 0\), this proves the claim for the special case \([a,b] = [0,1]^d\), once we show \(h = {\mathtt {R}}(\Phi )\) for \(\Phi \) with appropriately many layers, neurons, and nonzero weights.
By assumption on \(\sigma \), there is \(L_0 \le L\) such that \(\sigma = {\mathtt {R}}(\Phi _\sigma )\) for some \(\Phi _\sigma \in {\mathcal {NN}}^{\varrho ,1,1}_{W,L_0,N}\) with \(L(\Phi _\sigma ) = L_0\).
For \(i \in \{1,\dots ,d\}\) set \(f_{i, 1} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \sigma (\frac{x_i}{\varepsilon })\) and \(f_{i, 2} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto -\sigma (1 + \frac{x_i - 1}{\varepsilon })\). By Lemma 2.18-(1), there exist \(\Psi _{i,1}, \Psi _{i,2} \in {\mathcal {NN}}^{\varrho ,d,1}_{W,L_0,N}\) with \(L(\Psi _{i,1}) = L(\Psi _{i,2}) = L_0\) for any \(i \in \{1,\dots ,d\}\) such that \(f_{i,1} = {\mathtt {R}}(\Psi _{i,1})\) and \(f_{i,2} = {\mathtt {R}}(\Psi _{i,2})\).
Lemma 2.17-(3) then shows that
satisfies \(F = {\mathtt {R}}(\Phi _F)\) for some \(\Phi _F \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW,L_0,2dN}\) with \(L(\Phi _F) = L_0\). Hence, Lemma 2.18-(1) shows that \(G : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto 1 + \sum _{i=1}^d t(x_i) - d\) satisfies \(G = {\mathtt {R}}(\Phi _G)\) for some \(\Phi _G \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW,L_0,2dN}\) with \(L(\Phi _G) = L_0\).
In case of \(d = 1\), set \(L' := L_0\) and recall that \(h = t = F\), where we saw above that \(F = {\mathtt {R}}(\Phi _F)\) and \(\Phi _F \in {\mathcal {NN}}^{\varrho ,1,1}_{2W,L_0,2N}\) with \(L(\Phi _F) = L_0\).
For general \(d \in {\mathbb {N}}\) set \(L' := 2 L_0 - 1\) and recall that \(h = g_0 = \sigma \circ G\).
Hence, Lemma 2.18-(3) shows \(h = {\mathtt {R}}(\Phi _h)\) for some \(\Phi _h \in {\mathcal {NN}}^{\varrho ,d,1}\) with \(L(\Phi _h) = L'\), \(N(\Phi _h) \le (2d+1)N\) and \(W(\Phi _h) \le 2d W + \max \{ 2 d N, d \} W \le 2 d W (N+1)\).
It remains to transfer the result from \([0,1]^d\) to the general case [a, b]. To this end, define the invertible affine-linear map
A direct calculation shows \({\mathbb {1}}_{[0,1]^d} \circ T_0 = {\mathbb {1}}_{T_0^{-1} [0,1]^d} = {\mathbb {1}}_{[a,b]}\). Since \(\Vert T_{0}\Vert _{\ell ^{0,\infty }_{*}} =1\), the first part of Lemma 2.18 shows that \(g := h \circ T_0 = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW (N+1), 2 L_0 - 1, (2d+1) N}\) with \(L(\Phi ) = 2 L_0 - 1 = L'\) (resp. \(g := h \circ T_0 = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathcal {NN}}^{\varrho ,1,1}_{2W, L_0, 2N}\) with \(L(\Phi ) = L_0 = L'\) in case of \(d = 1\)) with h as above. Moreover, by an application of the change of variables formula, we get
As seen above, the first factor can be made arbitrarily small by choosing \(\varepsilon \in (0, \frac{1}{2})\) suitably. Since the second factor is constant, this proves the claim. \(\square \)
Appendix C. Proofs for Sect. 4
C.1 Proof of Lemma 4.9
We begin with three auxiliary results.
Lemma C.1
Let \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be continuously differentiable. Define \(f_h : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto h^{-1} \cdot (f(x+h) - f(x))\) for \(h \in {\mathbb {R}}{\setminus } \{0\}\). Then, \(f_h \rightarrow f'\) as \(h \rightarrow 0\) with locally uniform convergence on \({\mathbb {R}}\).\(\blacktriangleleft \)
Proof
This is an easy consequence of the mean-value theorem, using that \(f'\) is locally uniformly continuous. For more details, we refer to [40, Theorem 4.14]. \(\square \)
Since \(\varrho _{r+1}\) is continuously differentiable with \(\varrho _{r+1} ' = (r+1) \, \varrho _r\), the preceding lemma immediately implies the following result.
Corollary C.2
For \(r \in {\mathbb {N}}\), \(h > 0\), \( \sigma _h : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto (r+1)^{-1} \cdot h^{-1} \cdot \big ( \varrho _{r+1} (x + h) - \varrho _{r+1} (x) \big ) \), we have \(\sigma _{h} = {\mathtt {R}}(\Psi _{h})\) where \(\Psi _{h} \in {\mathcal {SNN}}_{4,2,2}^{\varrho _{r+1},1,1}\), \(L(\Psi _{h}) = 2\), and \(\lim _{h \rightarrow 0}\sigma _h = \varrho _{r}\) with locally uniform convergence on \({\mathbb {R}}\).
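The convergence stated in Corollary C.2 can be observed numerically. The following is a minimal Python sketch, assuming \(\varrho _r(x) = \max \{0,x\}^r\); the names `rho`, `sigma_h` and `max_error`, as well as the grid parameters, are illustrative choices and not notation from the paper.

```python
def rho(r, x):
    """Power of the ReLU: rho_r(x) = max(0, x)**r (assumed definition)."""
    return max(0.0, x) ** r


def sigma_h(r, h, x):
    """Scaled difference quotient of rho_{r+1}, as in Corollary C.2."""
    return (rho(r + 1, x + h) - rho(r + 1, x)) / ((r + 1) * h)


def max_error(r, h, radius=2.0, steps=400):
    """Largest value of |sigma_h - rho_r| on a uniform grid over [-radius, radius]."""
    grid = [-radius + 2.0 * radius * i / steps for i in range(steps + 1)]
    return max(abs(sigma_h(r, h, x) - rho(r, x)) for x in grid)


if __name__ == "__main__":
    for r in (1, 2, 3):
        # The reported errors should shrink as h decreases.
        print(r, [max_error(r, h) for h in (1e-1, 1e-2, 1e-3)])
```

On a fixed compact interval the reported error shrinks with \(h\), which is what locally uniform convergence predicts; the sketch is only a sanity check and not a substitute for the proof.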
We need one more auxiliary result for the proof of Lemma 4.9.
Corollary C.3
For any \(d,k,r \in {\mathbb {N}}\), \(j \in {\mathbb {N}}_{0}\), \(W,N \in {\mathbb {N}}_{0}\), \(L \in {\mathbb {N}}\), we have
where closure is with respect to locally uniform convergence on \({\mathbb {R}}^d\).
Proof
We prove by induction on \(\delta \) that the result holds for any \(0 \le j \le \delta \). This is trivial for \(\delta =0\). By Corollary C.2, we can apply Lemma 2.21 to \(\varrho := \varrho _{r+1}\) and \(\sigma := \varrho _{r}\) (which is continuous) with \(w = 4\), \(\ell =2\), \(m=2\). This yields for any \(W,N \in {\mathbb {N}}_{0}\), \(L \in {\mathbb {N}}\) that \( {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+1},d,k}_{4W,L,2N}}, \) which shows that our induction hypothesis is valid for \(\delta = 1\). Assume now that the hypothesis holds for some \(\delta \in {\mathbb {N}}\), and consider \(W,N \in {\mathbb {N}}_{0}\), \(r,L \in {\mathbb {N}}\), \(0 \le j \le \delta +1\). If \(j \le \delta \) then the induction hypothesis yields (C.1), so it only remains to check the case \(j = \delta +1\). By the induction hypothesis, for \(r' = r+\delta \), \(W' = 4^{\delta }W\), \(N' = 2^{\delta }N\), \(j=1\) we have \( {\mathtt {NN}}^{\varrho _{r+\delta },d,k}_{4^{\delta }W,L,2^{\delta }N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+\delta +1},d,k}_{4^{\delta +1}W,L,2^{\delta +1}N}}. \) Finally, \( {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+\delta },d,k}_{4^{\delta }W,L,2^{\delta }N}} \subset \overline{{\mathtt {NN}}^{\varrho _{r+\delta +1},d,k}_{4^{\delta +1}W,L,2^{\delta +1}N}} \) by the induction hypothesis for \(j=\delta \). \(\square \)
Proof of Lemma 4.9
The proof is by induction on n. For \(n=1\), \(\varrho \) is a polynomial of degree at most r. By Lemma 2.24, \(\varrho _{r}\) can represent any such polynomial with \(2(r+1)\) terms, whence \(\varrho \in {\mathtt {NN}}^{\varrho _{r},1,1}_{4(r+1),2,2(r+1)}\). When \(r=1\), \(\varrho \) is an affine function; hence, there are \(a,b \in {\mathbb {R}}\) such that \(\varrho (x) = b+ax = b+ a\varrho _{1}(x)-a\varrho _{1}(-x)\) for all x, showing that \(\varrho \in {\mathtt {SNN}}^{\varrho _{1},1,1}_{4,2,2} = {\mathtt {SNN}}^{\varrho _1,1,1}_{2(n+1),2,n+1}\).
Assuming the result true for \(n \in {\mathbb {N}}\), we prove it for \(n+1\). Consider \(\varrho \) made of \(n+1\) polynomial pieces: \({\mathbb {R}}\) is the disjoint union of \(n+1\) intervals \(I_{i}\), \(0 \le i \le n\) and there are polynomials \(p_{i}\) such that \(\varrho (x) = p_{i}(x)\) on the interval \(I_{i}\) for \(0 \le i \le n\). Without loss of generality, order the intervals by increasing “position” and define \({\bar{\varrho }}(x) = \varrho (x)\) for \(x \in \cup _{i=0}^{n-1} I_{i} = {\mathbb {R}}{\setminus } I_{n}\), and \({\bar{\varrho }}(x) = p_{n-1}(x)\) on \(I_{n}\). It is not hard to see that \({\bar{\varrho }}\) is continuous and made of n polynomial pieces, the last one being \(p_{n-1}(x)\) on \(I_{n-1} \cup I_{n}\). Observe that \(\varrho (x) = {\bar{\varrho }}(x) + f(x - t_{n})\) where \(\{t_{n}\} = \overline{I_{n-1}} \cap \overline{I_{n}}\) is the breakpoint between the intervals \(I_{n-1}\) and \(I_{n}\), and
Note that \(q(x) := p_{n}(x + t_n) - p_{n-1}(x + t_n)\) satisfies \(q(0) = f(0) = 0\), since \(\varrho \) is continuous. Because q is a polynomial of degree at most r, there are \(a_i \in {\mathbb {R}}\) such that \(q(x) = \sum _{i=1}^{r} a_i \, x^i\). This shows that \(f = \sum _{i=1}^{r} a_i \varrho _i\). In case of \(r = 1\), this shows that \(f \in {\mathtt {SNN}}^{\varrho _1,1,1}_{2,2,1}\). For \(r \ge 2\), since \(\varrho _i \in {\mathtt {NN}}^{\varrho _i,1,1}_{2,2,1}\), Corollary C.3 yields \( \varrho _i \in \overline{{\mathtt {NN}}^{\varrho _r,1,1}_{2 \cdot 4^{r-i},2,2^{r-i}}} \), where the closure is with respect to the topology of locally uniform convergence. Observing that \(2\sum _{i=1}^{r}4^{r-i} = 2 \cdot (4^{r}-1)/3 = w\) and \(\sum _{i=1}^{r}2^{r-i} = 2^{r}-1 = m\), Lemma 2.17-(3) implies (see Footnote 8) that \(f \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{w,2,m}}\). Since \(P: {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto x+t_n\) is affine with \(\Vert P\Vert _{\ell ^{0,\infty }} = \Vert P\Vert _{\ell ^{0,\infty }_{*}}=1\), by the induction hypothesis, Lemma 2.18-(1) and Lemma 2.17-(3) again, we get
For \(r=1\), it is not hard to see \( \varrho \in {\mathtt {SNN}}^{\varrho _1,1,1}_{2(n+1)+2,2,n+1+1} = {\mathtt {SNN}}^{\varrho _1,1,1}_{2((n+1)+1),2,(n+1)+1} \). \(\square \)
C.2 Proof of Lemma 4.10
First we show that if \(s \in {\mathbb {N}}\) and if \(\varrho \in {\mathtt {Spline}}^s\) is not a polynomial, then there are \(\alpha ,\beta ,t_{0} \in {\mathbb {R}}\), \(\varepsilon >0\) and p a polynomial of degree at most \(s-1\) such that
Consider any \(t_{0} \in {\mathbb {R}}\). Since \(\varrho \in {\mathtt {Spline}}^{s}\), there are \(\varepsilon > 0\) and two polynomials \(p_{-},p_{+}\) of degree at most s, with matching \(s-1\) first derivatives at \(t_{0}\), such that
Since \(\varrho \) is not a polynomial, there is \(t_{0}\) such that the s-th derivatives of \(p_{\pm }\) at \(t_{0}\) do not match, i.e., \(a_{-} := p^{(s)}_{-}(t_{0})/s! \ne p^{(s)}_{+}(t_{0})/s! =: a_{+}\). A Taylor expansion yields
where \(q(z) := \sum _{n=0}^{s-1} p_{\pm }^{(n)}(t_{0})z^{n}/n!\) is a polynomial of degree at most \(s-1\). As a result, for \(|z| \le \varepsilon \)
Since \(a_{+} \ne a_{-}\), setting \(\alpha := a_{+}/(a_{+}^{2}-a_{-}^{2})\) and \(\beta := (-1)^{s+1} a_{-}/(a_{+}^{2}-a_{-}^{2})\), as well as \(p(z) := \alpha q(z)+\beta q(-z)\), we get, as claimed, \( \varrho _s(z) = \alpha \varrho (z+t_{0}) + \beta \varrho (-z+t_{0}) -p(z) \) for every \(|z| \le \varepsilon \).
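The algebra behind this identity can be checked on a toy example. The following minimal Python sketch uses exact rational arithmetic and a hypothetical spline with breakpoint \(t_0 = 0\), equal to \(q(z) + a_{+} z^s\) for \(z \ge 0\) and \(q(z) + a_{-} z^s\) for \(z < 0\). The function names and the test spline are illustrative assumptions; the sketch additionally assumes \(a_{+}^2 \ne a_{-}^2\), so that the denominators are nonzero.

```python
from fractions import Fraction


def rho_s(s, z):
    """rho_s(z) = max(0, z)**s."""
    return z ** s if z > 0 else Fraction(0)


def check_identity(s, a_plus, a_minus, q_coeffs, zs):
    """Check rho_s(z) == alpha*rho(z) + beta*rho(-z) - p(z) for a toy spline with t_0 = 0.

    q_coeffs lists the coefficients of q (degree at most s - 1), lowest degree first.
    Assumes a_plus**2 != a_minus**2 (illustrative restriction).
    """
    a_plus, a_minus = Fraction(a_plus), Fraction(a_minus)

    def q(z):
        return sum(c * z ** n for n, c in enumerate(q_coeffs))

    def rho(z):
        # Piecewise polynomial of degree s whose leading coefficient jumps at 0.
        return q(z) + (a_plus if z >= 0 else a_minus) * z ** s

    denom = a_plus ** 2 - a_minus ** 2
    alpha = a_plus / denom
    beta = (-1) ** (s + 1) * a_minus / denom

    def p(z):
        return alpha * q(z) + beta * q(-z)

    return all(rho_s(s, z) == alpha * rho(z) + beta * rho(-z) - p(z) for z in zs)


if __name__ == "__main__":
    zs = [Fraction(i, 7) for i in range(-14, 15)]
    print(check_identity(2, 3, 1, [1, 1], zs))
    print(check_identity(3, 2, -1, [1, 0, 5], zs))
```

In this toy setting the identity holds for all \(z\), since the spline has a single breakpoint; in the proof it is only needed for \(|z| \le \varepsilon \).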
Now, consider \(r \in {\mathbb {N}}\). Given \(R > 0\) we now set
with \(\alpha ,\beta ,t_{0},\varepsilon ,p\) from (C.2). Observe that \( \varrho _r(x) = (R/\varepsilon )^{r} \varrho _r(\varepsilon x/R) = f_{R}(x) \) for all \(x \in [-R,R]\), so that \(f_{R}\) converges locally uniformly to \(\varrho _{r}\) on \({\mathbb {R}}\).
We show by induction on \(r \in {\mathbb {N}}\) that \(f_{R} \in {\mathtt {NN}}^{\varrho ,1,1}_{w,2,m}\) where \(w = w(r),m = m(r) \in {\mathbb {N}}\) only depend on r. For \(r=1\), this trivially holds as the polynomial p in (C.2) is a constant; hence \(f_{R} \in {\mathtt {NN}}^{\varrho ,1,1}_{4,2,2}\).
Assuming the result true for some \(r \in {\mathbb {N}}\), we now prove it for \(r+1\). Consider \(\varrho \in {\mathtt {Spline}}^{r+1}\) that is not a polynomial. The polynomial p in (C.2) with \(s=r+1\) is of degree at most r; hence, by Lemma 2.24 there are \(c,a_{i},b_{i},c_{i} \in {\mathbb {R}}\) such that \( p(x) = c+ \sum _{i=1}^{r+1} a_{i} \, \varrho _r(b_{i} x + c_{i}) \) for all \(x \in {\mathbb {R}}\). Now, observe that since \(\varrho \in {\mathtt {Spline}}^{r+1}\) is not a polynomial, its derivative satisfies \(\varrho ' \in {\mathtt {Spline}}^{r}\) and is not a polynomial either. The induction hypothesis yields \(\varrho _r \in \overline{{\mathtt {NN}}^{\varrho ',1,1}_{w,2,m}}\) for \(w=w(r),m=m(r) \in {\mathbb {N}}\). It is not hard to check that this implies \(p \in \overline{{\mathtt {NN}}^{\varrho ',1,1}_{2(r+1)w,2,(r+1)m}}\). Finally, as \(\varrho '(x)\) is the locally uniform limit of \((\varrho (x+h)-\varrho (x))/h\) as \(h \rightarrow 0\) (see Lemma C.1), we obtain \(p \in \overline{{\mathtt {NN}}^{\varrho ,1,1}_{4(r+1)w,2,2(r+1)m}}\) thanks to Lemma 2.21. Combined with the definition of \(f_{R}\) we obtain \(f_{R} \in \overline{{\mathtt {NN}}^{\varrho ,1,1}_{4(r+1)w+4,2,2(r+1)m+2}}\).
Finally, we quantify w, m: First of all, note that \(w(1) = 4 \le 5\) and \(m(1) = 2 \le 3\); furthermore, \(w(r+1) \le 4(r+1) w(r)+4 \le 5(r+1) w(r)\) and \(m(r+1) \le 2(r+1)m(r)+2 \le 3(r+1)m(r)\). An induction therefore yields \(w(r) \le 5^r r!\) and \(m(r) \le 3^r r!\). \(\square \)
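The two bounds can be confirmed for small r by iterating the recursions in their worst case (equality). This is a minimal Python sketch; the function name `complexity_bounds` is an illustrative choice.

```python
from math import factorial


def complexity_bounds(r_max):
    """Iterate w(r+1) = 4(r+1)w(r) + 4 and m(r+1) = 2(r+1)m(r) + 2 from w(1) = 4, m(1) = 2."""
    w, m = 4, 2
    table = {1: (w, m)}
    for r in range(1, r_max):
        w = 4 * (r + 1) * w + 4
        m = 2 * (r + 1) * m + 2
        table[r + 1] = (w, m)
    return table


if __name__ == "__main__":
    for r, (w, m) in complexity_bounds(6).items():
        # Compare the worst-case values with the claimed bounds 5^r r! and 3^r r!.
        print(r, w, 5 ** r * factorial(r), m, 3 ** r * factorial(r))
```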
C.3 Proof of Lemma 4.11
Step 1: In this step, we construct \(\theta _{R,\delta } \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,\ell ,m}\) satisfying
with \(\ell =3\) (resp. \(\ell =2\) if \(d=1\)) and with w, m only depending on d and r.
The affine map \( P: {\mathbb {R}}^{d}\rightarrow {\mathbb {R}}^{d}, x = (x_{i})_{i=1}^{d} \mapsto \left( \tfrac{x_{i}}{2(R+\delta )}+\tfrac{1}{2}\right) _{i=1}^{d} \) satisfies \(\Vert P\Vert _{\ell ^{0,\infty }} = \Vert P\Vert _{\ell ^{0,\infty }_*} = 1\). For \(x \in {\mathbb {R}}^d\), we have \(x \in [-R-\delta ,R+\delta ]^{d}\) if and only if \(P(x) \in [0,1]^{d}\), and \(x \in [-R,R]^{d}\) if and only if \(P(x) \in [\varepsilon ,1-\varepsilon ]^{d}\), where \(\varepsilon := \tfrac{\delta }{2(R+\delta )}\); thus, \({\mathbb {1}}_{[-R,R]^d} (P^{-1} x) = {\mathbb {1}}_{[\varepsilon , 1-\varepsilon ]^d}(x)\) for all \(x \in {\mathbb {R}}^d\).
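The endpoint computations behind these equivalences can be checked coordinate-wise. The following minimal Python sketch evaluates the map P at the endpoints of the two intervals using exact rational arithmetic; the helper names `P` and `margin` and the sample values of \(R\) and \(\delta \) are illustrative.

```python
from fractions import Fraction


def P(x, R, delta):
    """Coordinate-wise affine map x -> x / (2(R + delta)) + 1/2 from Step 1."""
    return Fraction(x) / (2 * (Fraction(R) + Fraction(delta))) + Fraction(1, 2)


def margin(R, delta):
    """Distance of P(-R) from 0, which equals the distance of P(R) from 1."""
    return P(-R, R, delta)


if __name__ == "__main__":
    R, delta = 3, 1
    # Images of the outer endpoints -R-delta, R+delta and the inner endpoints -R, R.
    print(P(-R - delta, R, delta), P(R + delta, R, delta))
    print(P(-R, R, delta), P(R, R, delta), margin(R, delta))
```

The outer endpoints are sent to 0 and 1, and the inner endpoints to \(\delta /(2(R+\delta ))\) and \(1 - \delta /(2(R+\delta ))\).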
Next, by combining Lemmas 4.4 and 3.26 (see in particular Eq. (3.10)), we obtain \(f \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,\ell ,m}\) (with the above-mentioned properties for \(w,\ell ,m\) and \(m \ge d\)) such that \( |f(x)-{\mathbb {1}}_{[0,1]^{d}} (x)| \le {\mathbb {1}}_{[0,1]^{d} {\setminus } [\varepsilon ,1-\varepsilon ]^{d}} \) for all \(x \in {\mathbb {R}}^d\). Therefore, the function \(\theta _{R,\delta } := f \circ P\) satisfies
for all \(x \in {\mathbb {R}}^{d}\). Finally, by Lemma 2.18-(1), we have \(\theta _{R,\delta } \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,\ell ,m}\).
Step 2: Consider \(g \in {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N}\) and define \(g_{R,\delta } (x) := \theta _{R,\delta } (x) \cdot g(x)\) for all \(x \in {\mathbb {R}}^d\). The desired estimate (4.6) is an easy consequence of (C.3). It only remains to show that one can implement \(g_{R,\delta }\) with a \(\varrho _{r}\)-network of controlled complexity.
Since we assume \(W \ge 1\), we can use Lemma 2.14; combining it with Eq. (2.1), we get \(g \in {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L',N'}\) with \(L' = \min \{ L,W,N+1 \}\) and \(N' = \min \{ N,W \}\). Lemma 2.17-(2) yields \((\theta _{R,\delta },g) \in {\mathtt {NN}}^{\varrho _r,d,k+1}_{w', L'', m'}\) with \(L'' = \max \{ L',\ell \}\) as well as \(w' = W + w + \min \{ d,k \} \cdot (L''-1)\) and \(m' = N' + m + \min \{ d,k \} \cdot (L''-1)\). Since \(L''-1 = \max \{ L'-1,\ell -1 \} \le \max \{ W-1,\ell -1 \} \le W+\ell -2\) and \(N' \le W\), we get
where \(c_{1},c_{2}\) only depend on d, k, r.
As \(r \ge 2\), Lemma 2.24 shows that \(\varrho _{r}\) can represent any polynomial of degree two with \({n=2(r+1)}\) terms. Thus, Lemma 2.26 shows that the multiplication map \(m : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}^k, (x,y) \mapsto x \cdot y\) satisfies \(m \in {\mathtt {NN}}^{\varrho _r, 1+k, k}_{12k(r+1), 2,4k(r+1)}\). Finally, Lemma 2.18-(3) proves that \(g_{R,\delta } = m \circ (\theta _{R,\delta },g) \in {\mathtt {NN}}^{\varrho _r,d,k}_{w'',L''',m''}\), where \({L''' = L''+1}\) and \(m'' = m'+ 4k(r+1) = N' + m + \min \{ d,k \} \cdot (L''-1) + 4k(r+1)\) as well as \({w'' = w' + \max \{ m',d \} \cdot 12 k(r+1)}\).
As \(L'' = \max \{ L',\ell \} \le \max \{ L,\ell \}\) we have \(L''' \le \max \{ L+1,4 \}\) (respectively \(L''' \le \max \{ L+1,3 \}\) if \(d=1\)). Furthermore, since \(m' \ge m \ge d\) we have \(\max \{ m',d \} = m'\). Because of \(W \ge 1\), we thus see that
where \(c_{3},c_{4}\) only depend on d, k, r. Finally, \(L''-1 = \max \{ L'-1,\ell -1 \} \le \max \{ N,\ell -1 \} \le N + \ell - 1\). Since \(N' \le N\), we get \(m'' \le N \cdot (1+\min \{ d,k \}) + c_{5} \le c_{6} N\) where again \(c_{5},c_{6}\) only depend on d, k, r. To conclude, we set \(c := \max \{ c_{4},c_{6} \}\). \(\square \)
C.4 Proof of Proposition 4.12
When \(r=1\) and \(\varrho \in {\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,m}\), the result follows from Lemma 2.19.
Now, consider \(f \in {\mathtt {NN}}_{W,L,N}^{\varrho , d, k}\) such that \(f|_{\Omega } \in X\). Since \(\varrho \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,m}}\), Lemma 2.21 shows that
For bounded \(\Omega \), locally uniform convergence implies convergence in \(X_p(\Omega ;{\mathbb {R}}^k)\) for all \(p \in (0,\infty ]\), hence the result.
For unbounded \(\Omega \), we need to work a bit harder. First, we deal with the degenerate case where \(W=0\) or \(N=0\). If \(W=0\), then by Lemma 2.13, f is a constant map; hence, \(f \in {\mathtt {NN}}^{\varrho _r,d,k}_{0,1,0}\). If \(N=0\), then f is affine-linear with \(\Vert f\Vert _{\ell ^{0}} \le W\); hence, \(f\in {\mathtt {NN}}^{\varrho _r,d,k}_{W,1,0}\). In both cases, the result trivially holds.
From now on, we assume that \(W,N \ge 1\). Consider \(\varepsilon > 0\). By the dominated convergence theorem (in case of \(p < \infty \)) or our special choice of \(X_\infty (\Omega ;{\mathbb {R}}^k)\) [cf. Eq. (1.3)] (in case of \(p = \infty \)), we see that there is some \(R \ge 1\) such that
Denoting by \(\lambda (\cdot )\) the Lebesgue measure, (C.4) implies that there is \(g \in {\mathtt {NN}}_{Wm^{2}, L,Nm}^{\varrho _r, d, k}\) such that
Consider \(c = c(d,k,r)\), \(\ell = \min \{ d+1, 3 \}\), \(L' = \max \{ L+1,\ell \}\) and the function \(g_{R,1} \in {\mathtt {NN}}^{\varrho _r, d, k}_{cWm^{2},L',cNm}\) from Lemma 4.11. By (4.6) and the fact that \(\Vert \cdot \Vert _{L_p}^{\min \{1,p\}}\) is subadditive, we see
To estimate the final term, note that
Because of \(2^{\min \{1,p\}} \le 2\), this implies \( \Big ( \Vert \ 2 \cdot |g| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d} \Vert _{L_p (\Omega )} \Big )^{\min \{1,p\}} \le \tfrac{\varepsilon ^{\min \{1,p\}}}{2} \). Overall, we thus see that \(\Vert f - g_{R,1} \Vert _{L_p (\Omega ; {\mathbb {R}}^k)} \le \varepsilon < \infty \). Because of \(f|_{\Omega } \in X\), this implies in particular that \(g_{R,1}|_{\Omega } \in X\). Since \(\varepsilon > 0\) was arbitrary, we get as desired that \( f|_{\Omega } \in \overline{{\mathtt {NN}}^{\varrho _r, d, k}_{cWm^{2}, L', cNm} \cap X}^{X} \), where the closure is taken in X. \(\square \)
Appendix D. Proofs for Section 5
D.1 Proof of Lemma 5.2
In light of (4.1), we have \(\beta _{+}^{(t)} \in {\mathtt {NN}}^{\varrho _t,1,1}_{2(t+2),2,t+2}\). This yields the result for \(d=1\), including when \(t=1\).
For \(d \ge 2\) and \(t \ge \min \{d,2\} = 2\), define \(f_j:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) by \(f_j := \beta ^{(t)}_+ \circ \pi _j\) with \(\pi _j:{\mathbb {R}}^d\rightarrow {\mathbb {R}}, x \mapsto x_j\), \(j=1,\ldots ,d\). By Lemma 2.18–(1) together with the fact that \(\Vert \pi _j\Vert _{\ell ^{0,\infty }_*} = 1\), we get \(f_j \in {\mathtt {NN}}^{\varrho _t,d,1}_{2(t+2),2,t+2}\). Form the vector function \(f := (f_1,f_2,\ldots ,f_d)\). Using Lemma 2.17-(2), we deduce \(f\in {\mathtt {NN}}^{\varrho _t,d,d}_{2d(t+2),2,d(t+2)}\).
As \(t \ge 2\), by Lemma 2.24, \(\varrho _t\) can represent any polynomial of degree two with \(n = 2(t+1)\) terms. Hence, for \(d \ge 2\), by Lemma 2.26 the multiplication function \(M_d : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, (x_1,\dots ,x_d) \mapsto x_1 \cdots x_d\) satisfies \(M_{d} \in {\mathtt {NN}}^{\varrho _t,d,1}_{6n(2^{j}-1),2j,(2n+1)(2^{j}-1)-1}\) with \(j := \lceil \log _{2} d \rceil \). By definition, \(2^{j-1} < d \le 2^{j}\); hence, \(2^{j}-1 \le 2(d-1)\) and \(6 n (2^{j}-1) \le 12 n(d-1) = 24(t+1)(d-1)\), as well as
so that \(M_{d} \in {\mathtt {NN}}^{\varrho _t,d,1}_{24(t+1)(d-1),2j,(8t+10)(d-1)-1}\). As \(\beta _d^{(t)} = M_{d} \circ f\), by Lemma 2.18–(2), we get
To conclude, we observe that
D.2 Proof of Theorem 5.5
We divide the proof into three steps.
Step 1 (Recalling results from [19]): Using the tensor B-splines \(\beta _d^{(t)}\) introduced in Eq. (5.5), define \(N := N^{(\tau )} := \beta _d^{(\tau -1)}\) for \(\tau \in {\mathbb {N}}\), and note that this coincides with the definition of N in [19, Equation (4.1)]. Next, as in [19, Equations (4.2) and (4.3)], for \(k \in {\mathbb {N}}_0\) and \(j \in {\mathbb {Z}}^d\), define \(N_k^{(\tau )} (x) := N^{(\tau )} (2^k x)\) and \(N_{j,k}^{(\tau )} (x) := N^{(\tau )} (2^k x - j)\). Furthermore, let \(\Omega _0 := (- \tfrac{1}{2}, \tfrac{1}{2})^d\) denote the unit cube, and set
and finally \(s_k^{(\tau )} (f)_p := \inf _{g \in \Sigma _k^{(\tau )}} \Vert f - g\Vert _{L_p}\) for \(f \in X_p (\Omega _0)\) and \(k \in {\mathbb {N}}_0\). Setting \(\lambda ^{(\tau ,p)} := \tau - 1 + \min \{ 1, p^{-1} \}\), [19, Theorem 5.1] shows
Here, \(\Vert (c_k)_{k \in {\mathbb {N}}_0} \Vert _{\ell _q^\alpha } = \Vert (2^{\alpha k} \, c_k)_{k \in {\mathbb {N}}_0}\Vert _{\ell ^q}\); see [19, Equation (5.1)].
Step 2 (Proving the embedding \(B_{p,q}^{d \alpha } (\Omega _0) \hookrightarrow A_q^\alpha (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))\)): Define \(\Sigma ({\mathcal {D}}_d^t) := (\Sigma _n ({\mathcal {D}}_d^t))_{n \in {\mathbb {N}}_0}\). In this step, we show that \(B_{p,q}^{d \alpha } (\Omega _0) \hookrightarrow A_q^\alpha (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))\) for any \(\tau \in {\mathbb {N}}\) and all \(p,q \in (0,\infty ]\) and \(\alpha > 0\) with \(0< d \alpha < \lambda ^{(\tau ,p)}\).
To this end, we first show that if we choose \(X = X_{p}(\Omega _0)\), then the family \(\Sigma ({\mathcal {D}}_d^{\tau -1})\) satisfies the properties (P1)–(P5). To see this, we first have to show \(\Sigma _n({\mathcal {D}}_d^{\tau -1}) \subset X_p(\Omega _0)\). For \(p < \infty \), this is trivial, since \(N^{(\tau )} = \beta _d^{(\tau -1)}\) is bounded and measurable. For \(p = \infty \), this holds as well, since if \(\tau \ge 2\), then \(N^{(\tau )} = \beta _d^{(\tau -1)}\) is continuous; finally, the case \(\tau = 1\) cannot occur for \(p = \infty \), since this would imply
Next, Properties (P1)–(P4) are trivially satisfied. Finally, the density of \(\bigcup _{n=0}^\infty \Sigma _n ({\mathcal {D}}_{d}^{\tau -1})\) in \(X_p(\Omega _0)\) is well known for \(\tau = 1\), since then \(\beta _d^{(\tau -1)} = {\mathbb {1}}_{[0,1)^d}\) and \(p < \infty \). For \(\tau \ge 2\), the density follows with the same arguments that were used for the case \(p = \infty \) in Section B.3.3.
Next, note that \({{\text {supp}}}N^{(\tau )} \subset [0,\tau ]^d\) and thus \({{\text {supp}}}N^{(\tau )}_{j,k} \subset 2^{-k} (j + [0,\tau ]^d)\). Therefore, if \(j \in \Lambda ^{(\tau )}(k)\), then \(\varnothing \ne \Omega _0 \cap {{\text {supp}}}N_{j,k}\), so that there is some \(x \in \Omega _0 \cap 2^{-k}(j + [0,\tau ]^d)\). This implies \({j \in {\mathbb {Z}}^d \cap [-2^{k-1} - \tau , 2^{k-1}]^d}\), and thus, \(|\Lambda ^{{(\tau )}}(k)| \le (2^k + \tau + 1)^d\). Directly by definition of \(\Sigma _n({\mathcal {D}}_d^t)\) and \(\Sigma _k^{(\tau )}\), this implies
Next, since we are assuming \(0< \alpha d < \lambda ^{(\tau ,p)}\), Eq. (D.1) yields a constant \(C_1 = C_1 (p,q,\alpha ,\tau ,d) > 0\) such that \( \Vert f\Vert _{L_p} + \big \Vert \big (s_k^{(\tau )} (f)_p \big )_{k \in {\mathbb {N}}_0} \big \Vert _{\ell _q^{d \alpha }} \le C_1 \cdot \Vert f\Vert _{B_{p,q}^{d \alpha }(\Omega _0)} \) for all \(f \in B_{p,q}^{d \alpha }(\Omega _0)\). Therefore, we see for \(f \in B_{p,q}^{d \alpha }(\Omega _0)\) and \(q < \infty \) that
At the step marked with \((*)\), we used that Eq. (D.2) yields \( \Sigma _{n-1}({\mathcal {D}}_d^{\tau -1}) \supset \Sigma _{(2^k+\tau +1)^d} ({\mathcal {D}}_d^{\tau -1}) \supset \Sigma _k^{(\tau )} \) for all \(n \ge 1 + (2^k + \tau + 1)^d\), and furthermore that if \(1 + (2^k + \tau + 1)^d \le n \le (2^{k+1} + \tau + 1)^d\), then \(2^{dk} \le n \le (\tau +3)^d \cdot 2^{dk}\), so that \(n^{\alpha q - 1} \le C_3 2^{dk (\alpha q - 1)}\) for some constant \(C_3 = C_3(d,\tau ,\alpha ,q)\), and finally that \( \sum _{n= (2^k + \tau + 1)^d + 1}^{(2^{k+1} + \tau + 1)^d} 1 \le (2^{k+1} + \tau + 1)^d \le (\tau +3)^d \cdot 2^{dk} \).
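The elementary inequalities used at the step \((*)\) can be confirmed for small parameters. This minimal Python sketch checks, for a given block of indices n, the two-sided bound on n and the bound on the number of terms in the block; the function name `check_block` and the parameter ranges are illustrative.

```python
def check_block(k, tau, d):
    """Check 2^{dk} <= n <= (tau+3)^d 2^{dk} on the block of indices n used at (*).

    The block runs from 1 + (2^k + tau + 1)^d to (2^{k+1} + tau + 1)^d, so it
    suffices to test its two endpoints; the block length is bounded as well.
    """
    lo = (2 ** k + tau + 1) ** d + 1
    hi = (2 ** (k + 1) + tau + 1) ** d
    scale = 2 ** (d * k)
    bound = (tau + 3) ** d * scale
    return scale <= lo and hi <= bound and (hi - lo + 1) <= bound


if __name__ == "__main__":
    print(all(check_block(k, tau, d)
              for k in range(12) for tau in range(1, 6) for d in range(1, 4)))
```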
For \(q = \infty \), the proof is similar. Setting \(\ell _k := (2^k + \tau + 1)^d\) for brevity, we see with similar estimates as above that
Overall, we have shown \(B_{p,q}^{d \alpha }(\Omega _0) \hookrightarrow A_q^\alpha (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))\) for \(\tau \in {\mathbb {N}}\), \(p,q \in (0,\infty ]\) and \(0< \alpha d < \lambda ^{(\tau ,p)}\).
Step 3 (Proving the embeddings (5.9) and (5.10)): In case of \(d = 1\), let us set \(r_0 := r\), while \(r_0\) is as in the statement of the theorem for \(d > 1\). Since \(\Omega \) is bounded and \(\Omega _0 = (-\tfrac{1}{2}, \tfrac{1}{2})^d\), there is some \(R > 0\) such that \(\Omega \subset R \cdot \Omega _0\). Let us fix \(p,q \in (0,\infty ]\) and \(s > 0\) such that \(d s < r_0 + \min \{1, p^{-1} \}\).
Since \(\Omega \) and \(R \cdot \Omega _0\) are bounded Lipschitz domains, there exists a (not necessarily linear) extension operator \({\mathcal {E}} : B^{d s}_{p,q} (\Omega ) \rightarrow B^{d s}_{p,q} (R \Omega _0)\) with the properties \(({\mathcal {E}} f)|_{\Omega } = f\) and \(\Vert {\mathcal {E}} f\Vert _{B^{d s}_{p,q}(R \Omega _0)} \le C \cdot \Vert f\Vert _{B^{d s}_{p,q}(\Omega )}\) for all \(f \in B^{d s}_{p,q}(\Omega )\). Indeed, for \(p \in [1,\infty ]\) this follows from [37, Section 4, Corollary 1], since this corollary yields an extension operator \({\mathcal {E}} : X_p (\Omega ) \rightarrow X_p (R \Omega _0)\) with the additional property that the j-th modulus of continuity \(\omega _j\) satisfies \(\omega _j (t, {\mathcal {E}} f)_{R \Omega _0} \le M_j \cdot \omega _j (t, f)_{\Omega }\) for all \(j \in {\mathbb {N}}\), all \(f \in X_p(\Omega )\), and all \(t \in [0,1]\). In view of the definition of the Besov spaces (see in particular [21, Chapter 2, Theorem 10.1]), this easily implies the result. Finally, in case of \(p \in (0,1)\), the existence of the extension operator follows from [20, Theorem 6.1]. In addition to the existence of the extension operator, we will also need that the dilation operator \(D_1 : B^{d s}_{p,q}(R \Omega _0) \rightarrow B^{d s}_{p,q} (\Omega _0), f \mapsto f(R \bullet )\) is well-defined and bounded, say \(\Vert D_1\Vert \le C_1\); this follows directly from the definition of the Besov spaces.
We first prove Eq. (5.9), that is, we consider the case \(d = 1\). To this end, define \(\tau := r + 1 \in {\mathbb {N}}\), let \(f \in B^{s}_{p,q}(\Omega )\) be arbitrary, and set \(f_1 := D_1 ({\mathcal {E}} f) \in B^{s}_{p,q}(\Omega _0)\). By applying Step 2 with \(\alpha = s\) (and noting that \(0< d \alpha = s < r + \min \{1,p^{-1}\} = \lambda ^{(\tau ,p)}\)), we get \(f_1 \in A_q^s (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{r}))\), with \(\Vert f_1\Vert _{A_q^s (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{r}))} \le C C_1 C_2 \cdot \Vert f\Vert _{B^{d s}_{p,q}(\Omega )}\), where the constant \(C_2\) is provided by Step 2.
Next, we note that \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge 2 = 2 + 2 \lceil \log _2 d \rceil \) and \(r \ge 1 = \min \{d,2\}\), so that Corollary 5.4-(2) shows . But it is an easy consequence of Lemma 2.18-(1) that the dilation operator is well-defined and bounded. Hence, we see that with . Now, note \(D_2 f_1(x) = f_1 (x/R) = {\mathcal {E}} f (x) = f(x)\) for all \(x \in \Omega \subset R \Omega _0\), and hence \(f = (D_2 f_1)|_{\Omega }\). Thus, Remark 3.17 implies that with , as claimed.
Now, we prove Eq. (5.10). To this end, define \(\tau := r_0 + 1 \in {\mathbb {N}}\), let \(f \in B^{s d}_{p,q}(\Omega )\) be arbitrary, and set \(f_1 := D_1 ({\mathcal {E}} f) \in B^{d s}_{p,q}(\Omega _0)\). Applying Step 2 with \(\alpha = s\) (noting \({0< d \alpha = d s < r_0 + \min \{1, p^{-1}\} = \lambda ^{(\tau ,p)}}\)), we get \(f_1 \in A_q^s (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{r_0}))\), with \( \Vert f_1\Vert _{A_q^s (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{r_0}))} \le C C_1 C_2 \cdot \Vert f\Vert _{B^{d s}_{p,q}(\Omega )} \), where the constant \(C_2\) is provided by Step 2.
Next, we claim that . Indeed, if \(r \ge 2\) and \(L \ge 2 + 2 \lceil \log _2 d \rceil \), then this follows from Corollary 5.4-(2). Otherwise, we have \(r_0 = 0\) and \(L \ge 3 \ge \min \{d+1, 3\}\), so that the claim follows from Corollary 5.4-(1); here, we note that \(p < \infty \), since we would otherwise get the contradiction \(0< \alpha d < r_0 + \min \{1, p^{-1} \} = 0\). Therefore, with . The rest of the argument is exactly as in the case \(d = 1\). \(\square \)
D.3 Proof of Lemma 5.10
Lemma 5.10 shows that deeper networks can implement the sawtooth function \(\Delta _j\) using fewer connections/neurons than shallower networks. The reason for this is indicated by the following lemma.
Lemma D.1
For arbitrary \(j \in {\mathbb {N}}\), we have \(\Delta _j \circ \Delta _1 = \Delta _{j+1}\).\(\blacktriangleleft \)
Proof
It suffices to verify the identity on [0, 1], since if \(x \in {\mathbb {R}}{\setminus } [0,1]\), then \(\Delta _1 (x) = 0 = \Delta _{j+1} (x)\), so that \(\Delta _j(\Delta _1(x)) = \Delta _j (0) = 0 = \Delta _{j+1} (x)\). We now distinguish two cases for \(x \in [0,1]\).
Case 1: \(x \in [0,\tfrac{1}{2}]\). This implies \(\Delta _1 (x) = 2x\), and hence (recall the definition of \(\Delta _{j}\) in Eq. (5.11))
In the last equality we used that \(2^j x - k \le 2^{j-1} - k \le 0\) for \(k \ge 2^{j-1}\), so that \(\Delta _1 (2^j x - k) = 0\) for those k.
Case 2: \(x \in (\tfrac{1}{2}, 1]\).
Observe that \(\Delta _j (x) = \Delta _j (1-x)\) for all \(x \in {\mathbb {R}}\) and \(j \in {\mathbb {N}}\). Since \(x' := 1-x \in [0,1/2]\), this identity and Case 1 yield \( \Delta _j \circ \Delta _1 (x) = \Delta _j \circ \Delta _1 (1-x) = \Delta _{j+1}(1-x) = \Delta _{j+1}(x) \). \(\square \)
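Lemma D.1 is easy to test numerically. The following minimal Python sketch evaluates both sides on a rational grid with exact arithmetic, assuming that \(\Delta _1\) is the hat function (equal to \(2x\) on \([0,\tfrac{1}{2}]\), \(2(1-x)\) on \([\tfrac{1}{2},1]\), and zero elsewhere) and that \(\Delta _j = \sum _{k=0}^{2^{j-1}-1} \Delta _1(2^{j-1} \bullet - k)\); the function names and the grid are illustrative.

```python
from fractions import Fraction


def hat(x):
    """Delta_1: 2x on [0, 1/2], 2(1 - x) on (1/2, 1], zero elsewhere."""
    if 0 <= x <= Fraction(1, 2):
        return 2 * x
    if Fraction(1, 2) < x <= 1:
        return 2 * (1 - x)
    return Fraction(0)


def sawtooth(j, x):
    """Delta_j(x) = sum over k = 0, ..., 2^(j-1) - 1 of Delta_1(2^(j-1) x - k)."""
    return sum(hat(2 ** (j - 1) * x - k) for k in range(2 ** (j - 1)))


def lemma_d1_holds(j, denominator=64):
    """Check Delta_j(Delta_1(x)) == Delta_{j+1}(x) on a rational grid covering [-1, 2]."""
    grid = [Fraction(i, denominator) for i in range(-denominator, 2 * denominator + 1)]
    return all(sawtooth(j, hat(x)) == sawtooth(j + 1, x) for x in grid)


if __name__ == "__main__":
    print([lemma_d1_holds(j) for j in range(1, 6)])
```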
Using Lemma D.1, we can now provide the proof of Lemma 5.10.
Proof of Lemma 5.10
Part (1): Write \(j = k (L-1) + s\) for suitable \(k \in {\mathbb {N}}_0\) and \(0 \le s \le L - 2\). Note that this implies \(k \le j / (L-1)\). Thanks to Lemma D.1, we have \(\Delta _j = \Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k\), where \(\Delta _k\) occurs \(L-2\) times. Furthermore, since \(\Delta _k : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is affine with \(2 + 2^k\) pieces (see Fig. 4, and note that we consider \(\Delta _k\) as a function on all of \({\mathbb {R}}\), not just on [0, 1]), Lemma 4.9 shows that \(\Delta _k \in {\mathtt {NN}}^{\varrho _1,1,1}_{\infty ,2,3+2^k}\). By the same reasoning, we get \(\Delta _{k+s} \in {\mathtt {NN}}_{\infty ,2,3+2^{k+s}}^{\varrho _1,1,1}\). Now, a repeated application of Lemma 2.18-(3) shows that
Finally, \(\Delta _j \in {\mathtt {NN}}^{\varrho _1,1,1}_{\infty ,L,C_L \cdot 2^{j/(L-1)}}\) with \(C_L := 4 \, L + 2^{L-1}\) since
Part (2): Set \(\kappa := \lfloor L/2\rfloor \) and write \(j = k \kappa + s\) for \(k \in {\mathbb {N}}_0\) and \(0 \le s \le \kappa - 1\). Note that \(k \le j / \kappa = j / \lfloor L/2 \rfloor \). As above, \(\Delta _j = \Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k\), where \(\Delta _k\) occurs \(\kappa - 1\) times, and since \(\Delta _k : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is affine with \(2 + 2^k\) pieces, using Lemma 4.9 again shows that \(\Delta _k \in {\mathtt {NN}}^{\varrho _1,1,1}_{6+2^{k+1},2,\infty }\), and
\(\Delta _{k+s} \in {\mathtt {NN}}_{6+2^{k+s+1},2,\infty }^{\varrho _1,1,1}\). Now, a repeated application of Lemma 2.18-(2) shows that
Finally, \(\Delta _j \in {\mathtt {NN}}^{\varrho _1,1,1}_{C_{L} 2^{j/\lfloor L/2\rfloor },L,\infty }\), as \(2+2(\kappa -1) = 2\kappa \le L\), \(s+1 \le \kappa \le L/2 \le L-1\) (since \(L \ge 2\)) and
\(\square \)
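The decomposition used in Part (1) can be illustrated as follows. This minimal Python sketch writes \(\Delta _1\) as a combination of three ReLUs, splits \(j = k(L-1) + s\), and checks that the composition \(\Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k\) reproduces \(\Delta _j\) on a grid. The function `neuron_budget` simply adds up the per-factor bounds \(3 + 2^k\) from Lemma 4.9; this bookkeeping is an assumption made for illustration and is not claimed to match the exact constants of the proof. All function names are hypothetical.

```python
from fractions import Fraction


def relu(x):
    return x if x > 0 else Fraction(0)


def hat(x):
    """Delta_1 as a combination of three ReLUs (one hidden layer)."""
    return 2 * relu(x) - 4 * relu(x - Fraction(1, 2)) + 2 * relu(x - 1)


def sawtooth(j, x):
    """Delta_j(x) = sum over k = 0, ..., 2^(j-1) - 1 of Delta_1(2^(j-1) x - k)."""
    return sum(hat(2 ** (j - 1) * x - k) for k in range(2 ** (j - 1)))


def split_depth(j, L):
    """Write j = k (L - 1) + s with 0 <= s <= L - 2, as in Part (1)."""
    return divmod(j, L - 1)


def composed(j, L, x):
    """Evaluate Delta_{k+s} o Delta_k o ... o Delta_k with L - 2 inner factors (needs k >= 1)."""
    k, s = split_depth(j, L)
    for _ in range(L - 2):
        x = sawtooth(k, x)
    return sawtooth(k + s, x)


def neuron_budget(j, L):
    """Sum of the per-factor bounds 3 + 2^k (illustrative bookkeeping only)."""
    k, s = split_depth(j, L)
    return (L - 2) * (3 + 2 ** k) + (3 + 2 ** (k + s))


if __name__ == "__main__":
    grid = [Fraction(i, 128) for i in range(129)]
    print(all(composed(7, 3, x) == sawtooth(7, x) for x in grid))
    # Deep (L = 3) versus shallow (L = 2) budget for j = 7.
    print(neuron_budget(7, 3), neuron_budget(7, 2))
```

For \(j = 7\), the depth-3 budget is much smaller than the depth-2 budget \(3 + 2^7\), which is the qualitative effect that Lemma 5.10 quantifies.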
D.4 Proof of Lemma 5.12
For \(h \in {\mathbb {R}}^d\), we define the translation operator \(T_h\) by \((T_h f)(x) = f(x - h)\) for \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\). With this, the h-difference operator of order k is given by \(D_h^k = (D_h)^k\), where \(D_h := (T_{-h} - \mathrm {id})\). For later use, we note for \(a > 0\) that \(D_h [f(a \bullet )](x) = (D_{a h}f)(a x)\), as can be verified by a direct calculation. By induction, this implies \(D_h^k [f(a \bullet )] = (D_{a h}^k f)(a \bullet )\) for all \(k \in {\mathbb {N}}\). Furthermore, \(T_x D_h^k = D_h^k T_x\) for all \(x,h \in {\mathbb {R}}^d\) and \(k \in {\mathbb {N}}\).
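The scaling and commutation identities stated above can be checked on a test function. The following minimal Python sketch (one-dimensional case) represents the operators as higher-order functions; the names `translate`, `difference`, `dilate` and the two checker functions are illustrative, and exact rational arithmetic is used so that equality can be tested directly.

```python
from fractions import Fraction


def translate(h):
    """Return T_h, acting on functions by (T_h f)(x) = f(x - h)."""
    return lambda f: (lambda x: f(x - h))


def difference(h, k=1):
    """Return D_h^k with D_h = T_{-h} - id, i.e. (D_h f)(x) = f(x + h) - f(x)."""
    def apply_once(f):
        return lambda x: f(x + h) - f(x)

    def operator(f):
        for _ in range(k):
            f = apply_once(f)
        return f

    return operator


def dilate(a):
    """Return the map f -> f(a * .)."""
    return lambda f: (lambda x: f(a * x))


def scaling_identity_holds(f, a, h, k, xs):
    """Check D_h^k [f(a .)](x) == (D_{ah}^k f)(a x) at the points xs."""
    return all(difference(h, k)(dilate(a)(f))(x) == difference(a * h, k)(f)(a * x) for x in xs)


def commutation_holds(f, t, h, k, xs):
    """Check (T_t D_h^k f)(x) == (D_h^k T_t f)(x) at the points xs."""
    return all(translate(t)(difference(h, k)(f))(x) == difference(h, k)(translate(t)(f))(x)
               for x in xs)


if __name__ == "__main__":
    f = lambda x: x ** 3 - 2 * x
    xs = [Fraction(-1), Fraction(1, 3), Fraction(2)]
    print(scaling_identity_holds(f, Fraction(3), Fraction(1, 4), 2, xs))
    print(commutation_holds(f, Fraction(1, 2), Fraction(1, 4), 2, xs))
```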
A direct computation shows
Next, note that \((T_{-1/4} - \mathrm {id}) (T_{1/4} + \mathrm {id}) = T_{-1/4} - T_{1/4}\), and hence, since \(T_{-1/4}\) and \(T_{1/4}\) commute,
Moreover by induction on \(\ell \in {\mathbb {N}}_{0}\), we see that
Define \(h_j := 2^{-(j+1)}\), so that \(2^{j-1} h_j = 1/4\). Since \(\Delta _j = \sum _{k=0}^{2^{j-1} - 1} (T_k \Delta _1)(2^{j-1} \bullet )\) [cf. Eq. (5.11)], Equations (D.3) and (D.4) and the properties from the beginning of the proof yield for \(x \in {\mathbb {R}}\) that
Recall for \(g \in X_p (\Omega )\) that the r-th modulus of continuity of g is given by
Let \(e_1 = (1,0,\dots ,0) \in {\mathbb {R}}^d\). For \(h = h_j \, e_1\), we have \(\Omega _{2,h} \supset (0,\frac{1}{2}) \times (0,1)^{d-1}\) since \(\Omega = (0,1)^{d}\). Next, because of \({{{\text {supp}}}\, \widetilde{\Delta _1} = [0, \tfrac{1}{2}]}\), the family \((T_{i/2} \widetilde{\Delta _1})_{i \in {\mathbb {Z}}}\) has pairwise disjoint supports (up to null-sets), and
Combining these observations with the fact that \( (T_{\frac{i}{2}} \widetilde{\Delta _1})(2^{j-1} \bullet ) = \widetilde{\Delta _1}(2^{j-1} \bullet - i/2) = \Delta _1(2^j \bullet -i)/2 \), Eq. (D.5) yields for \(p < \infty \) that
and hence, \(\Vert D_{h_j e_1}^2 \Delta _{j,d}\Vert _{L_p(\Omega _{2,h_je_1})} \ge C_p\), where \(C_p := 2^{-1/p} \, \Vert \Delta _1\Vert _{L_p}\) for \(p < \infty \). Since \(\Omega _{2, h_j e_1} \subset \Omega = (0,1)^d\) has at most measure 1, we have \(\Vert \cdot \Vert _{L_1(\Omega _{2, h_j e_1})} \le \Vert \cdot \Vert _{L_\infty (\Omega _{2, h_j e_1})}\); hence, the same holds for \(p=\infty \) with \(C_\infty := C_1\). By definition, this implies \(\omega _2 (\Delta _{j,d})_p (t) \ge C_p\) for \(t \ge |h_je_1| = 2^{-(j+1)}\).
Overall, we get by definition of the Besov quasi-norms in case of \(q < \infty \) that
and hence, \(\Vert \Delta _{j,d}\Vert _{B^{{s'}}_{p,q}(\Omega )} \ge \frac{C_p}{({{s'}} q)^{1/q}} \, 2^{{{s'}} ( j+1)}\) for all \(j \in {\mathbb {N}}\). In case of \(q = \infty \), we see similarly that
for all \(j \in {\mathbb {N}}\). In both cases, we used that \({{s'}} < 2\) to ensure that we can use the modulus of continuity of order 2 to compute the Besov quasi-norm. Finally, since \({{s'}} \le s\), we have \(B^s_{p,q}(\Omega ) \hookrightarrow B^{{s'}}_{p,q}(\Omega )\); see Eq. (5.4). This easily implies the claim. \(\square \)
D.5 Proof of Lemma 5.19
In this section, we prove Lemma 5.19, based on results of Telgarsky [64].
Telgarsky makes extensive use of two special classes of functions: First, a function \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is called \((t,\beta )\)-poly (where \(t \in {\mathbb {N}}\) and \(\beta \in {\mathbb {N}}_0\)) if there is a partition of \({\mathbb {R}}\) into t intervals \(I_1,\dots ,I_t\) such that \(\sigma |_{I_j}\) is a polynomial of degree at most \(\beta \) for each \(j \in \{1,\dots ,t\}\). In the language of Definition 4.6, these are precisely those functions which belong to \({\mathtt {PPoly}}_{t}^{\beta }({\mathbb {R}})\). The second important class consists of the \((t,\alpha ,\beta )\)-semi-algebraic functions \(f : {\mathbb {R}}^k \rightarrow {\mathbb {R}}\) (where \(t \in {\mathbb {N}}\) and \(\alpha ,\beta \in {\mathbb {N}}_0\)). The definition of this class (see [64, Definition 2.1]) is somewhat technical. Fortunately, we do not need the definition; all we need is the following result:
Lemma D.2
(see [64, Lemma 2.3-(1)]) If \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\beta )\)-poly and \(q : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is a (multivariate) polynomial of degree at most \(\alpha \in {\mathbb {N}}_0\), then \(\sigma \circ q\) is \((t,\alpha ,\alpha \beta )\)-semi-algebraic.\(\blacktriangleleft \)
In most of our proofs, we will mainly be interested in knowing that a function \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\alpha )\)-poly for certain \(t,\alpha \). The following lemma gives a sufficient condition for this to be the case.
Lemma D.3
(see [64, Lemma 3.6]) If \(f : {\mathbb {R}}^k \rightarrow {\mathbb {R}}\) is \((s,\alpha ,\beta )\)-semi-algebraic and if \(g_1,\dots ,g_k : {\mathbb {R}}\rightarrow {\mathbb {R}}\) are \((t,\gamma )\)-poly, then the function \(f \circ (g_1,\dots ,g_k) : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \(\big ( s t (1 + \alpha \gamma ) \cdot k , \beta \gamma \big )\)-poly.\(\blacktriangleleft \)
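The parameter bookkeeping in Lemmas D.2 and D.3 is purely arithmetic, so it can be tracked mechanically. The following Python sketch encodes the two rules exactly as stated above; the function names are hypothetical and only serve as an illustration. As a check, it composes \(\varrho _r\) (which is (2, r)-poly) with a univariate affine map (which is (1, 1)-poly) and recovers the count \((2(1+1), r)\) that appears at the start of the inductions in this section.

```python
def semi_algebraic_from_poly(t, beta, alpha):
    """Lemma D.2: a (t, beta)-poly function composed with a polynomial of
    degree at most alpha is (t, alpha, alpha * beta)-semi-algebraic."""
    return (t, alpha, alpha * beta)


def compose_poly(s, alpha, beta, t, gamma, k):
    """Lemma D.3: an (s, alpha, beta)-semi-algebraic function of k variables,
    composed with k functions that are (t, gamma)-poly, is
    (s * t * (1 + alpha * gamma) * k, beta * gamma)-poly."""
    return (s * t * (1 + alpha * gamma) * k, beta * gamma)


def neuron_after_affine(r):
    """Parameters of x -> rho_r(a * x + b), following the induction start."""
    # rho_r is (2, r)-poly; composing with a degree-1 polynomial gives (2, 1, r).
    s, alpha, beta = semi_algebraic_from_poly(t=2, beta=r, alpha=1)
    # The affine map x -> a * x + b is (1, 1)-poly, and there is k = 1 input.
    return compose_poly(s, alpha, beta, t=1, gamma=1, k=1)


print(neuron_after_affine(3))  # (4, 3)
```

The sketch only propagates the upper bounds from the two lemmas; it does not compute the actual number of polynomial pieces of any given function.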
For proving Lemma 5.19, we begin with the easier case where we count neurons instead of weights.
Proof of the second part of Lemma 5.19
We want to show that for any depth \(L \in {\mathbb {N}}_{\ge 2}\) and degree \(r \in {\mathbb {N}}\) there is a constant \(\Lambda _{L,r} \in {\mathbb {N}}\) such that each function \(f \in {\mathtt {NN}}^{\varrho _r, 1, 1}_{\infty ,L,N}\) is \((\Lambda _{L,r}N^{L-1},r^{L-1})\)-poly. To show this, let \(\Phi \in {\mathcal {NN}}^{\varrho _r,1,1}_{\infty ,L,N}\) with \(f = {\mathtt {R}}(\Phi )\), say \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_K, \alpha _K) \big )\), where necessarily \(K \le L\), and where each \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is affine-linear.
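For \(\ell \in \{1,\dots ,K\}\) and \(j \in \{1,\dots ,N_\ell \}\), we let \(f_j^{(\ell )} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) denote the output of neuron j in the \(\ell \)-th layer. Formally, let \(f_j^{(1)} : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \big ( \alpha _1 (T_1 \, x) \big )_j\), and inductively \(f_j^{(\ell +1)} : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \Big ( \alpha _{\ell +1} \big ( T_{\ell +1} \, ( f_1^{(\ell )}(x), \dots , f_{N_\ell }^{(\ell )}(x) ) \big ) \Big )_j\) for \(1 \le \ell \le K-1\) (Eq. (D.6)).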
We prove below by induction on \(\ell \in \{1,\dots ,K\}\) that there is a constant \(C_{\ell ,r} \in {\mathbb {N}}\) which only depends on \(\ell ,r\) and such that \(f_j^{(\ell )}\) is \(\big (C_{\ell ,r} \prod _{t=0}^{\ell -1} N_t, r^{\gamma (\ell )}\big )\)-poly, where \(\gamma (\ell ) := \min \{\ell , L-1\}\). Once this is shown, we see that \(f = {\mathtt {R}}(\Phi ) = f_1^{(K)}\) is \(\big (C_{K,r} \prod _{t=0}^{K-1} N_t, r^{L-1}\big )\)-poly. Then, because of \(N_0 = 1\) and \(N_t \le N\) for \(1 \le t \le K-1\), we see that \(C_{K,r} \prod _{t=0}^{K-1} N_t \le C_{K,r} \, N^{K-1} \le \Lambda _{L,r} \, N^{L-1}\), where \(\Lambda _{L,r} := \max _{1 \le K \le L} C_{K,r}\). Therefore, f is indeed \((\Lambda _{L,r} \, N^{L-1}, r^{L-1})\)-poly.
Start of induction (\(\ell = 1\)): Note that \(L \ge 2\), so that \(\gamma (\ell ) = \ell = 1\). We have \(T_1 x = a x + b\) for certain \(a,b \in {\mathbb {R}}^{N_1}\) and \(\alpha _1 = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_1)}\) for certain \(\varrho ^{(j)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\). Thus, \(\varrho ^{(j)}\) is (2, r)-poly, and thus (2, 1, r)-semi-algebraic according to Lemma D.2. Therefore, Lemma D.3 shows because of \(f_j^{(1)}(x) = \varrho ^{(j)} (b_j + a_j x)\) that \(f_j^{(1)}\) is \((2(1+1), r)\)-poly, for any \(j \in \{1,\dots ,N_1\}\). Because of \(N_0 = 1\), the claim holds for \(C_{1,r} := 4\).
Induction step (\(\ell \rightarrow \ell + 1\)): Suppose that \(\ell \in \{1,\dots ,K-1\}\) is such that the claim holds. Note that \(\ell \le K-1 \le L-1\), so that \(\gamma (\ell ) = \ell \).
We have \(T_{\ell + 1} \, y = A \, y + b\) for certain \(A \in {\mathbb {R}}^{N_{\ell + 1} \times N_\ell }\) and \(b \in {\mathbb {R}}^{N_{\ell + 1}}\), and \(\alpha _{\ell + 1} = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_{\ell + 1})}\) for certain \(\varrho ^{(j)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\), where \(\varrho ^{(j)} = \mathrm {id}_{\mathbb {R}}\) for all \(j \in \{1,\dots ,N_{\ell + 1}\}\) in case of \(\ell = K - 1\). Hence, \(\varrho ^{(j)}\) is (2, r)-poly, and even (2, 1)-poly in case of \(\ell = K-1\). Moreover, each of the polynomials \({ p_{j,\ell } : {\mathbb {R}}^{N_\ell } \rightarrow {\mathbb {R}}, y \mapsto (A \, y + b)_j = b_j + \sum _{t=1}^{N_\ell } A_{j,t} \, y_t }\) is of degree at most 1; hence, by Lemma D.2, \(\varrho ^{(j)} \circ p_{j,\ell }\) is (2, 1, r)-semi-algebraic, and even (2, 1, 1)-semi-algebraic in case of \(\ell = K-1\).
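Each function \(f_t^{(\ell )}\) is \((C_{\ell ,r} \prod _{t=0}^{\ell -1} N_t, r^{\ell })\)-poly by the induction hypothesis. By Lemma D.3, since \(f_j^{(\ell +1)} = (\varrho ^{(j)} \circ p_{j,\ell }) \circ (f_1^{(\ell )}, \dots , f_{N_\ell }^{(\ell )})\), it follows that \(f_j^{(\ell +1)}\) is \((P, r^{\ell +1})\)-poly [respectively, \((P,r^{\ell })\)-poly if \(\ell = K-1\)], where \(P = 2 \cdot C_{\ell ,r} \prod _{t=0}^{\ell -1} N_t \cdot (1 + r^{\ell }) \cdot N_\ell = C_{\ell +1,r} \prod _{t=0}^{\ell } N_t\) with \(C_{\ell +1,r} := 2 \, (1 + r^{\ell }) \, C_{\ell ,r}\).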
Finally, note in case of \(\ell < K{-}1\) that \(\ell {+} 1 \le K {-} 1 \le L{-}1\), and hence, \(\gamma (\ell +1) {=} \ell {+}1\), while in case of \(\ell {=} K-1\) we have \(\ell \le \min \{\ell {+}1, L{-}1\} {=} \gamma (\ell {+}1)\). Therefore, each \(f_j^{(\ell +1)}\) is \((C_{\ell +1,r} \cdot \prod _{t=0}^{(\ell +1) - 1} N_t, r^{\gamma (\ell +1)})\)-poly. This completes the induction, and thus the proof. \(\square \)
The proof of the first part of Lemma 5.19 uses the same basic arguments as in the preceding proof, but in a more careful way. In particular, we will also need the following elementary lemma.
Lemma D.4
Let \(k \in {\mathbb {N}}\), and for each \(i \in \{1,\dots ,k\}\) let \(f_i : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be \((t_i,\alpha )\)-poly and continuous. Then the function \(\sum _{i=1}^k f_i\) is \((t,\alpha )\)-poly, where \(t = 1 - k + \sum _{i=1}^k t_i\).\(\blacktriangleleft \)
Proof
For each \(i \in \{1,\dots ,k\}\), there are “breakpoints” \( b_0^{(i)} := - \infty< b_1^{(i)}< \cdots< b_{t_i - 1}^{(i)} < \infty =: b_{t_i}^{(i)} \) such that \(f_i |_{{\mathbb {R}}\cap [b_j^{(i)}, b_{j+1}^{(i)}]}\) is a polynomial of degree at most \(\alpha \) for each \(0 \le j \le t_{i} - 1\). Here, we used the continuity of \(f_i\) to ensure that we can use closed intervals.
Now, let \(M := \bigcup _{i=1}^k \{b_1^{(i)}, \ldots , b_{t_i - 1}^{(i)}\}\). We have \(|M| \le \sum _{i=1}^k (t_i - 1) = t - 1\), with t as in the statement of the lemma. Thus, \(M = \{b_1,\dots ,b_s\}\) for some \(0 \le s \le t - 1\), where \( b_0 := - \infty< b_1< \cdots< b_s < \infty =: b_{s+1}. \) It is easy to see that \(F := \sum _{i=1}^k f_i\) is such that \(F|_{{\mathbb {R}}\cap [b_j, b_{j+1}]}\) is a polynomial of degree at most \(\alpha \) for each \(0 \le j \le s\). Thus, F is \((s+1,\alpha )\)-poly and therefore also \((t,\alpha )\)-poly. \(\square \)
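The counting argument in this proof can be illustrated with a few lines of Python. In the sketch below, each continuous piecewise polynomial \(f_i\) is represented only by the list of its finite breakpoints (so a \((t_i,\alpha )\)-poly function has \(t_i - 1\) of them); this representation and the function names are assumptions made for illustration. The sum is polynomial between consecutive points of the merged breakpoint set M, so it needs at most \(|M| + 1\) pieces, which never exceeds the value t from Lemma D.4.

```python
def merged_piece_count(breakpoint_lists):
    """Upper bound on the pieces of the sum: one more than the number of
    distinct breakpoints in the union M of all breakpoint sets."""
    merged = set()
    for breakpoints in breakpoint_lists:
        merged.update(breakpoints)
    return len(merged) + 1


def lemma_d4_bound(breakpoint_lists):
    """The value t = 1 - k + sum_i t_i, where t_i = len(breakpoints_i) + 1."""
    k = len(breakpoint_lists)
    return 1 - k + sum(len(breakpoints) + 1 for breakpoints in breakpoint_lists)


# Three functions with t_1 = 3, t_2 = 3, t_3 = 2 pieces; the breakpoint 1.0 is shared.
example = [[0.0, 1.0], [1.0, 2.0], [0.5]]
print(merged_piece_count(example), lemma_d4_bound(example))  # 5 6
```

Shared breakpoints are the only reason the merged count can fall strictly below t, as in the example.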
Proof of the first part of Lemma 5.19
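Let us first consider an arbitrary network \(\Phi \in {\mathcal {NN}}^{\varrho _r, 1, 1}_{W,L,\infty }\) satisfying \(L(\Phi ) = L\). Let \(L_0 := \lfloor L/2 \rfloor \in {\mathbb {N}}_0\). We claim that \({\mathtt {R}}(\Phi )\) is \(\big ( \max \{1, \Lambda _{L,r} \, W^{L_0} \}, r^{L-1} \big )\)-poly (Eq. (D.7)), where the constant \(\Lambda _{L,r} \in {\mathbb {N}}\) only depends on L, r.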
In case of \(L = 1\), this is trivial, since then \({\mathtt {R}}(\Phi ) : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is affine-linear. Thus, we will assume \(L \ge 2\) in what follows. Note that this entails \(L_0 \ge 1\).
Let \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_L, \alpha _L) \big )\), where \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is affine-linear. We first consider the special case that \(\Vert T_\ell \Vert _{\ell ^0} = 0\) for some \(\ell \in \{1,\dots ,L\}\). In this case, Lemma 2.9 shows that \({\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}\). This trivially implies that \({\mathtt {R}}(\Phi )\) is \((\max \{1, \Lambda _{L,r} \, W^{L_0} \}, r^{L-1})\)-poly. Thus, we can assume in the following that \(\Vert T_\ell \Vert _{\ell ^0} \ne 0\) for all \(\ell \in \{1,\dots ,L\}\). As in the proof of the second part of Lemma 5.19, we define \(f_j^{(\ell )} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) to be the function computed by neuron \(j \in \{1,\dots ,N_\ell \}\) in layer \(\ell \in \{1,\dots ,L\}\), cf. Eq. (D.6).
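Step 1. We let \(L_1 := \lfloor \tfrac{L - 1}{2} \rfloor \in {\mathbb {N}}_0\), and we show by induction on \(t \in \{0,1,\dots , L_1\}\) that \(f_j^{(2t+1)}\) is \(\big ( C_{t,r} \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{\gamma (t)} \big )\)-poly for all \(j \in \{1,\dots ,N_{2t+1}\}\) (Eq. (D.8)),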
where \(\gamma (t) := \min \{L-1, 2 t + 1\}\) and where the constant \(C_{t,r} \in {\mathbb {N}}\) only depends on t, r. Here, we use the convention that the empty product satisfies \(\prod _{\ell =1}^0 \Vert T_{2\ell }\Vert _{\ell ^0} = 1\).
Induction start (\(t=0\)): We have \(T_1 x = a x + b\) for certain \(a,b \in {\mathbb {R}}^{N_1}\) and \(\alpha _1 = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_1)}\) for certain \(\varrho ^{(j)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\). In any case, \(\varrho ^{(j)}\) is (2, r)-poly, and hence, (2, 1, r)-semi-algebraic by Lemma D.2. Now, note \( f_j^{(2t+1)}(x) = f_j^{(1)}(x) = \varrho ^{(j)} \big ( (T_1 x)_j \big ) = \varrho ^{(j)} (a_j x + b_j) \), so that Lemma D.3 shows that \(f_j^{(2t+1)}\) is \((2 (1 + 1), r)\)-poly. Thus, Eq. (D.8) holds for \(t = 0\) if we choose \(C_{0,r} := 4\). Here, we used that \(L \ge 2\) and \(t=0\), so that \(L-1 \ge 2t + 1\) and hence \(\gamma (t) = 2 t + 1 = 1\).
Induction step \((t \rightarrow t+1)\): Let \(t \in {\mathbb {N}}_0\) such that \(t+1 \le \tfrac{L-1}{2}\) and such that Eq. (D.8) holds for t. We have \(T_{2t + 2} \bullet = A \bullet + b\) for certain \(A \in {\mathbb {R}}^{N_{2t+2} \times N_{2t+1}}\) and \(b \in {\mathbb {R}}^{N_{2t+2}}\), and furthermore \(\alpha _{2t+2} = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_{2t+2})}\) for certain \(\varrho ^{(j)} \in \{ \mathrm {id}_{\mathbb {R}}, \varrho _r \}\).
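Recall from Appendix A that \(A_{j,-} \in {\mathbb {R}}^{1 \times N_{2t+1}}\) denotes the j-th row of A. For \(j \in \{1,\dots ,N_{2t+2}\}\), we claim that \(f_j^{(2t+2)}\) is constant if \(A_{j,-} = 0\), and that \(f_j^{(2t+2)}\) is \(\big ( C_{t,r}' \cdot M_j \cdot \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2} \big )\)-poly if \(A_{j,-} \ne 0\) (Eq. (D.9)),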
where \(M_j := \Vert A_{j,-}\Vert _{\ell ^0}\), and where the constant \(C_{t,r}' \in {\mathbb {N}}\) only depends on t, r.
The first case where \(A_{j,-}=0\) is trivial. For proving the second case where \(A_{j,-} \ne 0\), let us define \(\Omega _j := \{ i \in \{1,\dots ,N_{2t+1}\} :A_{j,i} \ne 0 \}\), say \(\Omega _j = \{ i_1, \dots , i_{M_j} \}\) with (necessarily) pairwise distinct \(i_1,\dots ,i_{M_j}\). By introducing the polynomial \({p_{j,t} : {\mathbb {R}}^{M_j} \rightarrow {\mathbb {R}}, y \mapsto b_j + \sum _{m=1}^{M_j} A_{j,i_m} y_m}\), we can then write \(f_j^{(2t+2)}(x) = (\varrho ^{(j)} \circ p_{j,t}) \big ( f_{i_1}^{(2t+1)}(x), \dots , f_{i_{M_j}}^{(2t+1)}(x) \big )\) for all \(x \in {\mathbb {R}}\).
Since \(\varrho ^{(j)}\) is (2, r)-poly and \(p_{j,t}\) is a polynomial of degree at most 1, Lemma D.2 shows that \(\varrho ^{(j)} \circ p_{j,t}\) is (2, 1, r)-semi-algebraic. Furthermore, by the induction hypothesis we know that each function \(f_{i_m}^{(2t+1)}\) is \((C_{t,r} \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+1})\)-poly, where we used that \(\gamma (t)=2t+1\) since \(t+1 \le (L-1)/2\). Therefore—in view of the preceding displayed equation—Lemma D.3 shows that the function \(f_j^{(2t+2)}\) is indeed \((C_{t,r}' \cdot M_j \cdot \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})\)-poly, where \(C_{t,r}' := 2 C_{t,r} \cdot (1 + r^{2t+1})\).
We now estimate the number of polynomial pieces of the function \(f_i^{(2t+3)}\) for \(i \in \{1,\dots ,N_{2t+3}\}\). To this end, let \(B \in {\mathbb {R}}^{N_{2t+3} \times N_{2t+2}}\) and \(c \in {\mathbb {R}}^{N_{2t+3}}\) such that \(T_{2t+3} = B \bullet + c\), and choose \(\sigma ^{(i)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\) such that \(\alpha _{2t+3} = \sigma ^{(1)} \otimes \cdots \otimes \sigma ^{(N_{2t+3})}\). For \(i \in \{1,\dots ,N_{2t+3}\}\), let us define \(G_{i,t} := \sum _{j \in \{1,\dots ,N_{2t+2}\}, \, A_{j,-} \ne 0} B_{i,j} \, f_j^{(2t+2)}\).
In view of Eq. (D.9), Lemma D.4 shows that \(G_{i,t}\) is \((P,r^{2t+2})\)-poly, where
Here, we used that \(\Vert T_{2t+2} \Vert _{\ell ^0} \ne 0\) and hence \(A \ne 0\), so that \(|\{ j \in \{1,\dots ,N_{2t+2}\} : A_{j,-} \ne 0\}| \ge 1\).
Next, note because of Eq. (D.9) and by definition of \(G_{i,t}\) that there is some \(\theta _{i,t} \in {\mathbb {R}}\) satisfying \(f_i^{(2t+3)} = \sigma ^{(i)} \circ (\theta _{i,t} + G_{i,t})\).
Now there are two cases: If \(2t + 3 > L-1\), then \(2t+3 = L\), since \(t+1 \le \tfrac{L-1}{2}\). Therefore, \(\sigma ^{(i)} = \mathrm {id}_{\mathbb {R}}\), so that we see that \(f_i^{(2t+3)} = \theta _{i,t} + G_{i,t}\) is \((C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})\)-poly, where \(2t+2 = L-1 = \gamma (t+1)\).
If \(2t + 3 \le L-1\), then \(\gamma (t+1) = 2t + 3\). Furthermore, each \(\sigma ^{(i)}\) is (2, r)-poly and hence (2, 1, r)-semi-algebraic by Lemma D.2. In view of the preceding displayed equation, and since \(G_{i,t}\) is \({(C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})}\)-poly, Lemma D.3 shows that \(f_i^{(2t+3)}\) is \(\big ( 2 (1 + r^{2t+2}) C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+3} \big )\)-poly.
In each case, with \(C_{t+1,r} := 2 (1 + r^{2t+2}) C_{t,r}'\), we see that Eq. (D.8) holds for \(t+1\) instead of t.
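To get a feeling for the size of the constants produced in Step 1, the two recursions \(C_{t,r}' = 2 C_{t,r} (1 + r^{2t+1})\) and \(C_{t+1,r} = 2 (1 + r^{2t+2}) C_{t,r}'\), started at \(C_{0,r} = 4\), can simply be iterated. The following Python sketch does this; the function name is hypothetical and the code only evaluates the recursions stated above.

```python
def step1_constants(r, t_max):
    """Return [C_{0,r}, ..., C_{t_max,r}] from the recursions of Step 1."""
    c = 4  # C_{0,r} from the induction start
    constants = [c]
    for t in range(t_max):
        c_prime = 2 * c * (1 + r ** (2 * t + 1))  # C'_{t,r}, after layer 2t+2
        c = 2 * (1 + r ** (2 * t + 2)) * c_prime  # C_{t+1,r}, after layer 2t+3
        constants.append(c)
    return constants


print(step1_constants(r=1, t_max=2))  # [4, 64, 1024]
print(step1_constants(r=2, t_max=1))  # [4, 240]
```

For the ReLU (\(r = 1\)) the constants grow by a factor of 16 per step, so only the dependence on the weights through \(\prod _{\ell } \Vert T_{2\ell }\Vert _{\ell ^0}\) matters for the rate in W.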
Step 2. We now complete the proof of Eq. (D.7), by distinguishing whether L is odd or even.
If L is odd: In this case \(L_1 = \lfloor \tfrac{L-1}{2} \rfloor = \tfrac{L-1}{2}\), so that we can use Eq. (D.8) for the choice \(t = \tfrac{L-1}{2}\) to see that \({\mathtt {R}}(\Phi ) = f_1^{(L)} = f_1^{(2t+1)}\) is \((P, r^{L-1})\)-poly, where
If L is even: In this case, set \(t := \tfrac{L}{2} - 1 \in \{ 0,1,\dots ,L_1 \}\), and note \(2t+1 = L-1 =\gamma (t)\). Hence, with \(A \in {\mathbb {R}}^{1 \times N_{L-1}}\) and \(b \in {\mathbb {R}}\) such that \(T_L = A \bullet + b\), we have
Therefore, thanks to Eq. (D.8), Lemma D.4 shows that \({\mathtt {R}}(\Phi )\) is \((P,r^{2t+1})\)-poly, where
In the second inequality we used \(|\{ k \in \{1,\dots ,N_{L-1}\} :A_{1,k} \ne 0\}| = \Vert A\Vert _{\ell ^{0}} = \Vert T_L\Vert _{\ell ^0} \ge 1\). We have thus established Eq. (D.7) in all cases.
Step 3. It remains to prove the actual claim. Let \(f \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,L,\infty }\) be arbitrary, whence \(f = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathcal {NN}}^{\varrho _r,1,1}_{W,K,\infty }\) with \(L(\Phi ) = K\) for some \(K \in {\mathbb {N}}_{\le L}\). In view of Eq. (D.7), this implies that \(f = {\mathtt {R}}(\Phi )\) is \((\max \{1, \Lambda _{K,r} \, W^{\lfloor K/2 \rfloor }\}, r^{K-1})\)-poly. If we set \(\Theta _{L,r} := \max _{1 \le K \le L} \Lambda _{K,r}\), then, since \(\lfloor K/2 \rfloor \le \lfloor L/2 \rfloor \) and \(r^{K-1} \le r^{L-1}\), it follows that f is \((\max \{1, \Theta _{L,r} \, W^{\lfloor L/2 \rfloor }\}, r^{L-1})\)-poly, as desired. \(\square \)
Appendix E. The spaces \(W_q^\alpha \) and \(N_q^\alpha \) are distinct
In this section, we show that for a fixed depth \(L \ge 3\) and \(\Omega = (0,1)^d\) the approximation spaces defined in terms of the number of weights and in terms of the number of neurons are distinct; that is, we show
The proof is based on several results by Telgarsky [64], which we first collect. The first essential concept is the notion of the crossing number of a function.
Definition E.1
For any piecewise polynomial function \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) with finitely many pieces, define \({\widetilde{f}} : {\mathbb {R}}\rightarrow \{0,1\}, x \mapsto {\mathbb {1}}_{f(x) \ge 1/2}\). Thanks to our assumption on f, the sets \({\widetilde{f}}^{-1} (\{0\}) \subset {\mathbb {R}}\) and \({\widetilde{f}}^{-1} (\{1\}) \subset {\mathbb {R}}\) are finite unions of (possibly degenerate) intervals. For \(i \in \{0,1\}\), denote by \(I_f^{(i)} \subset 2^{{\mathbb {R}}}\) the set of connected components of \({\widetilde{f}}^{-1} (\{i\})\). Finally, set \(I_f := I_f^{(0)} \cup I_f^{(1)}\) and define the crossing number \(\mathrm {Cr}(f)\) of f as \(\mathrm {Cr}(f) := |I_f| \in {\mathbb {N}}\). \(\blacktriangleleft \)
The following result gives a bound on the crossing number of f, based on bounds on the complexity of f. Here, we again use the notion of \((t,\beta )\)-poly functions as introduced at the beginning of Appendix D.5.
Lemma E.2
[64, Lemma 3.3] If \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\alpha )\)-poly, then \(\mathrm {Cr}(f) \le t (1 + \alpha )\).\(\blacktriangleleft \)
Finally, we will need the following result which tells us that if \(\mathrm {Cr}(f) \gg \mathrm {Cr}(g)\), then the functions \({\widetilde{f}},{\widetilde{g}}\) introduced in Definition E.1 differ on a large number of intervals \(I \in I_f\).
Lemma E.3
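[64, Lemma 3.1] Let \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) and \(g : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be piecewise polynomial with finitely many pieces. Then \(\frac{1}{\mathrm {Cr}(f)} \, \big | \big \{ I \in I_f : {\widetilde{f}}(x) \ne {\widetilde{g}}(x) \text { for all } x \in I \big \} \big | \ge \frac{1}{2} \Big ( 1 - 2 \, \frac{\mathrm {Cr}(g)}{\mathrm {Cr}(f)} \Big )\).\(\blacktriangleleft \)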
The first step to proving Eq. (E.1) will be the following estimate:
Lemma E.4
Let \(p \in (0,\infty ]\). There is a constant \(C_p > 0\) such that the error of best approximation [cf. Eq. (3.1)] of the “sawtooth function” \(\Delta _j\) [cf. Eq. (5.11)] by piecewise polynomials satisfies
For proving this lower bound, we first need to determine the crossing number of \(\Delta _j\).
Lemma E.5
Let \(j \in {\mathbb {N}}\) and \(\Delta _j : {\mathbb {R}}\rightarrow {\mathbb {R}}\) as in Eq. (5.11). We have \(\mathrm {Cr}(\Delta _j) = 1 + 2^j\) and
Proof
The formal proof is omitted as it involves tedious but straightforward computations; graphically, the claimed properties are straightforward consequences of Fig. 4. \(\square \)
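The value \(\mathrm {Cr}(\Delta _j) = 1 + 2^j\) can at least be checked numerically for small j. The following Python sketch does so under an explicit assumption: since Eq. (5.11) is not reproduced in this appendix, the code takes \(\Delta _j\) to be the j-fold composition of the hat function \(\Delta _1\) that equals 2x on [0, 1/2], 2(1 - x) on [1/2, 1], and 0 elsewhere. All function names are hypothetical. The crossing number is estimated by sampling the indicator \({\mathbb {1}}_{f(x) \ge 1/2}\) on a dyadic grid and counting maximal runs; exact rational arithmetic avoids rounding issues at the points where \(\Delta _j = 1/2\).

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def hat(x):
    """Assumed hat function Delta_1: 2x on [0, 1/2], 2(1 - x) on [1/2, 1], else 0."""
    if 0 <= x <= HALF:
        return 2 * x
    if HALF < x <= 1:
        return 2 * (1 - x)
    return Fraction(0)


def sawtooth(j, x):
    """Assumed sawtooth Delta_j: the j-fold composition of the hat function."""
    for _ in range(j):
        x = hat(x)
    return x


def sampled_crossing_number(f, lo, hi, step):
    """Count maximal runs of the indicator 1_{f >= 1/2} on a uniform grid.
    This matches Cr(f) only if the grid resolves every interval of I_f."""
    n = int((hi - lo) / step)
    labels = [f(lo + i * step) >= HALF for i in range(n + 1)]
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def crossing_number_of_sawtooth(j):
    # All intervals of I_{Delta_j} inside [0, 1] have length at least 2^-(j+1),
    # so a grid of width 2^-(j+2) on [-1, 2] hits each of them.
    step = Fraction(1, 2 ** (j + 2))
    return sampled_crossing_number(lambda x: sawtooth(j, x), Fraction(-1), Fraction(2), step)


for j in range(1, 6):
    print(j, crossing_number_of_sawtooth(j), 1 + 2 ** j)
```

A sampling-based count can miss degenerate or very short intervals for general piecewise polynomials; it is reliable here only because the interval lengths of the sawtooth are known in advance.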
Proof of Lemma E.4
Let \(j,\alpha \in {\mathbb {N}}\) and let \(N \in {\mathbb {N}}\) with \(N \le \frac{2^{j} + 1}{4(1+\alpha )}\) and \(f \in {\mathtt {PPoly}}_N^\alpha \) be arbitrary. Lemma E.2 shows \(\mathrm {Cr}(f) \le N(1 + \alpha ) \le \frac{2^{j} + 1}{4}\), so that Lemma E.5 implies \({\theta := 1 - 2 \frac{\mathrm {Cr}(f)}{\mathrm {Cr}(\Delta _j)} = 1 - 2 \frac{\mathrm {Cr}(f)}{1 + 2^j} \ge \tfrac{1}{2}}\). Now, recall the notation of Definition E.1, and set \(G := \big \{ I \in I_{\Delta _j} : \widetilde{\Delta _j}(x) \ne {\widetilde{f}}(x) \text { for all } x \in I \big \}\).
By Lemma E.3, \(\frac{1}{\mathrm {Cr}(\Delta _j)} |G| \ge \frac{\theta }{2} \ge \frac{1}{4}\), which means \(|G| \ge \frac{1 + 2^j}{4} \ge 2^{j-2}\), since we have \(\mathrm {Cr}(\Delta _j) = 1 + 2^j\).
For arbitrary \(I \in G\), we have \(\widetilde{\Delta _j}(x) \ne {\widetilde{f}}(x)\) for all \(x \in I\), so that either \(f(x) < \tfrac{1}{2} \le \Delta _j (x)\) or \(\Delta _j(x) < \tfrac{1}{2} \le f(x)\). In both cases, we get \(|\Delta _j (x) - f(x)| \ge |\Delta _j(x) - \tfrac{1}{2}|\). Furthermore, recall that \(0 \le \Delta _j \le 1\), so that \(|\Delta _j(x) - \tfrac{1}{2}| \le \tfrac{1}{2} \le 1\). Because of \(\Vert \Delta _j - f\Vert _{L_p ( (0,1) )} \ge \Vert \Delta _j - f\Vert _{L_1 ( (0,1) )}\) for \(p \ge 1\), it is sufficient to prove the result for \(0 < p \le 1\). For this range of p, we see that
Overall, we get \(|\Delta _j(x) - f(x)|^p \ge |\Delta _j(x) - \tfrac{1}{2}|^p \ge |\Delta _j(x) - \tfrac{1}{2}|\) for all \(x \in I\) and \(I \in G\). Thus,
This implies \(\Vert \Delta _j - f\Vert _{L_p ( (0,1) )} \ge 2^{-5/p} =: C_p\). \(\square \)
As a consequence of the lower bound in Lemma E.4, we can now prove lower bounds for the neural network approximation space norms of the multivariate sawtooth function \(\Delta _{j,d}\) [cf. Definition 5.9].
Proposition E.6
Consider \(\Omega = [0,1]^{d}\), \(r \in {\mathbb {N}}\), \(L \in {\mathbb {N}}_{\ge 2}\), \(\alpha \in (0, \infty )\), \(p,q \in (0,\infty ]\). There is a constant \({C = C(d,r,L,\alpha ,p,q) > 0}\) such that
Proof
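According to Lemma 5.19, there is a constant \(C_1 = C_1 (r,L) \in {\mathbb {N}}\) such that \({\mathtt {NN}}^{\varrho _r,1,1}_{W,L,\infty } \subset {\mathtt {PPoly}}_{C_1 \cdot W^{\lfloor L/2 \rfloor }}^{\beta }\) and \({\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,L,N} \subset {\mathtt {PPoly}}_{C_1 \cdot N^{L-1}}^{\beta }\) for all \(W,N \in {\mathbb {N}}\), where \(\beta := r^{L-1}\) (Eq. (E.2)).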
We first prove the estimate regarding the \(W_q^\alpha \) quasi-norm. To this end, note that there is a constant \(C_2 = C_2(L,\beta ,C_1) = C_2(L,r) > 0\) such that \( \big ( \tfrac{2^{j+1}}{4 C_1 (1 + \beta )} \big )^{1/\lfloor L/2 \rfloor } = C_2 \cdot 2^{(j+1) / \lfloor L/2 \rfloor } \). Now, let \(W \in {\mathbb {N}}_0\) with \(W \le C_2 \cdot 2^{(j+1) / \lfloor L/2 \rfloor }\) and \(F \in {\mathtt {NN}}^{\varrho _{r},d,1}_{W,L,\infty }\) be arbitrary. Define \(F_{x'} : {\mathbb {R}}\rightarrow {\mathbb {R}}, t \mapsto F ((t, x'))\) for \(x' \in [0,1]^{d-1}\).
According to Lemma 2.18-(1) and Eq. (E.2), we have \( F_{x'} \in {\mathtt {NN}}_{W,L,\infty }^{\varrho _r,1,1} \subset {\mathtt {PPoly}}_{C_1 \cdot W^{\lfloor L/2 \rfloor }}^{\beta }. \) Since \( C_1 \cdot W^{\lfloor L/2 \rfloor } \le C_1 \cdot \tfrac{2^{j+1}}{4 C_1 (1+\beta )} = \tfrac{2^{j+1}}{4(1+\beta )}, \) Lemma E.4 yields a constant \(C_3 = C_3 (p) > 0\) such that \(C_3 \le \Vert \Delta _j - F_{x'}\Vert _{L_p ( (0,1) )}\). For \(p < \infty \), Fubini’s theorem shows that
Therefore,
Since \(\Vert \bullet \Vert _{L_\infty (\Omega )} \ge \Vert \bullet \Vert _{L_1(\Omega )}\), this also holds for \(p = \infty \).
In light of the embedding (3.2), it is sufficient to lower bound the \(W_q^\alpha \) quasi-norm when \(q = \infty \). In this case, we have
as desired. This completes the proof of the lower bound for the \(W_q^\alpha \) quasi-norm.
The lower bound for the \(N_q^\alpha \) quasi-norm can be derived similarly. First, in the same way that we proved Eq. (E.3), one can show that
for a suitable constant \(C'_2 = C'_2 (L,r) > 0\). The remainder of the argument is then almost identical to that for the \(W_q^\alpha \) quasi-norm, and is thus omitted. \(\square \)
As our final preparation for showing that the spaces \(W_q^\alpha \) and \(N_q^\alpha \) are distinct for \(L \ge 3\) (Lemma 3.10), we will show that the lower bound derived in Proposition E.6 is sharp and extends to arbitrary measurable \(\Omega \) with non-empty interior.
Theorem E.7
Let \(p,q \in (0,\infty ]\), \(\alpha > 0\), \(r \in {\mathbb {N}}\), \(L \in {\mathbb {N}}_{\ge 2}\), and let \(\Omega \subset {\mathbb {R}}^d\) be a bounded admissible domain with non-empty interior. Consider \(y \in {\mathbb {R}}^d\) and \(s > 0\) satisfying \(y + [0,s]^d \subset \Omega \) and define \(\Delta _j^{(y,s)} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \Delta _{j,d} \big ( s^{-1} (x - y) \big )\).
Then there are \(C_1, C_2 > 0\) such that for all \(j \in {\mathbb {N}}\) the function \(\Delta _j^{(y,s)}\) satisfies
Proof
For the upper bound, since \(\Omega \) is bounded, Theorem 4.7 [Eq. (4.3), which also holds for \(N_q^\alpha \) instead of \(W_q^\alpha \)] shows that it suffices to prove the claim for \(r = 1\). Since \(T_{y,s} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{d}, x \mapsto s^{-1} (x - y)\) satisfies \(\Vert T_{y,s}\Vert _{\ell ^{0,\infty }_*} = 1\), a combination of Lemmas 5.10 and 2.18-(1) shows that there is a constant \(C_L > 0\) satisfying
Furthermore, \(\Delta _j^{(y,s)} \in X_p (\Omega )\) since \(\Omega \) is bounded and \(\Delta _j^{(y,s)}\) is bounded and continuous. Thus, the Bernstein inequality (5.1) yields a constant \(K_1 > 0\) such that
for all \(j \in {\mathbb {N}}\); similarly, we get a constant \(K_2 > 0\) such that
for all \(j \in {\mathbb {N}}\). Considering \(C_2 := \max \{ K_1,K_2 \} \cdot C_L^\alpha \) establishes the desired upper bound.
For the lower bound, consider arbitrary \(W,N \in {\mathbb {N}}_{0}\), \(F \in {\mathtt {NN}}^{\varrho _{r},d,1}_{W,L,N}\), and observe that by Lemma 2.18-(1) we have \(F' := F \circ T_{y,s}^{-1} \in {\mathtt {NN}}^{\varrho _{r},d,1}_{W,L,N}\). In view of Proposition E.6, the lower bound follows from the inequality
\(\square \)
We can now prove Lemma 3.10.
Proof of Lemma 3.10
Ad (1) If , then the linear map
is well-defined. Furthermore, this map has a closed graph. Indeed, if \(f_n \rightarrow f\) in and \(f_n = \iota f_n \rightarrow g\) in , then the embeddings and (see Proposition 3.2 and Theorem 4.7) imply that \(f_n \rightarrow f\) in \(L_{p_1}\) and \(f_n \rightarrow g\) in \(L_{p_2}\). But \(L_p\)-convergence implies convergence in measure, so that we get \(f = g\).
Now, the closed graph theorem (which applies to F-spaces (see [59, Theorem 2.15]), hence to quasi-Banach spaces, since these are F-spaces (see [66, Remark after Lemma 2.1.5])) shows that \(\iota \) is continuous. Here, we used that the approximation classes and are quasi-Banach spaces; this is proved independently in Theorem 3.27.
Since \(\Omega \) has non-empty interior, there are \(y \in {\mathbb {R}}^d\) and \(s > 0\) such that \(y + [0,s]^d \subset \Omega \). The continuity of \(\iota \), combined with Theorem E.7, implies for the functions \(\Delta _j^{(y,s)}\) from Theorem E.7 for all \(j \in {\mathbb {N}}\) that
where the implicit constants are independent of j. Hence, \(\beta / (L' - 1) \le \alpha / \lfloor L/2 \rfloor \); that is, \(L' - 1 \ge \tfrac{\beta }{\alpha } \cdot \lfloor L/2 \rfloor \).
Ad (2) Exactly as in the argument above, we get for all \(j \in {\mathbb {N}}\) that
with implied constants independent of j. Hence, \(\alpha / \lfloor L/2 \rfloor \le \beta / (L' - 1)\); that is, \(\lfloor L/2 \rfloor \ge \frac{\alpha }{\beta } \cdot (L' - 1)\).
Proof of the “in particular” part: If , then Parts (1) and (2) show (because of \(\alpha = \beta \)) that \(L - 1 = \lfloor L /2 \rfloor \). Since \(L \in {\mathbb {N}}_{\ge 2}\), this is only possible for \(L = 2\). \(\square \)
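The final step rests on the elementary fact that \(L - 1 = \lfloor L/2 \rfloor \) forces \(L = 2\): for \(L \ge 3\) one has \(L - 1 > L/2 \ge \lfloor L/2 \rfloor \). The following Python sketch is only a finite sanity check of this arithmetic over a range of depths (the function name and the range are arbitrary choices for illustration); it is not a substitute for the one-line argument.

```python
def depths_with_equal_exponents(max_depth):
    """All depths 2 <= L <= max_depth with L - 1 == floor(L / 2)."""
    return [L for L in range(2, max_depth + 1) if L - 1 == L // 2]


print(depths_with_equal_exponents(1000))  # [2]
```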
As a further consequence of Lemma E.4, we can now prove the non-triviality of the neural network approximation spaces, as formalized in Theorem 4.16.
Proof of Theorem 4.16
In view of the embedding (see Lemma 3.9), it suffices to prove the claim for . Furthermore, it is enough to consider the case \(q = \infty \), since Eq. (3.2) shows that . Next, in view of Remark 3.17, it suffices to consider the case \(k=1\). Finally, thanks to Theorem 4.7, it is enough to prove the claim for the special case \(\varrho = \varrho _r\) (for fixed but arbitrary \(r \in {\mathbb {N}}\)).
Since \(\Omega \) has non-empty interior, there are \(y \in {\mathbb {R}}^d\) and \(s > 0\) such that \(y + [0,s]^d \subset \Omega \). Let us fix \(\varphi \in C_c ({\mathbb {R}}^d)\) satisfying \(0 \le \varphi \le 1\) and \(\varphi |_{y + [0,s]^d} \equiv 1\). With \(\Delta _{j}^{(y,s)}\) as in Theorem E.7, define for \(j \in {\mathbb {N}}\)
Note that \(g_j \in C_c ({\mathbb {R}}^d)\), and hence, \(g_j |_{\Omega } \in X\). Furthermore, since \(0 \le \Delta _j^{(y,s)} \le 1\), it is easy to see \(\Vert g_j|_{\Omega }\Vert _{X} \le \Vert g_j\Vert _{L_p({\mathbb {R}}^d)} \le \Vert \varphi \Vert _{L_p({\mathbb {R}}^d)} =: C\) for all \(j \in {\mathbb {N}}\).
By Theorem 4.2 and Proposition 3.2, we know that is a well-defined quasi-Banach space satisfying . Let us assume toward a contradiction that the claim of Theorem 4.16 fails; this means . Using the same “closed graph theorem arguments” as in the proof of Lemma 3.10, we see that this implies for all and a fixed constant \(C' > 0\). In particular, this implies for all \(j \in {\mathbb {N}}\). In the remainder of the proof, we will show that as \(j \rightarrow \infty \), which then provides the desired contradiction.
To prove this, choose \(N_0 \in {\mathbb {N}}\) satisfying \({\mathscr {L}}(N_0) \ge 2\), and let \(N \in {\mathbb {N}}_{\ge N_0}\) and \(f \in {\mathtt {NN}}^{\varrho _r,d,1}_{\infty ,{\mathscr {L}}(N),N}\) be arbitrary. Reasoning as in the proof of Theorem E.7, since \(\varphi \equiv 1\) on \(y + [0,s]^{d}\), we see that if we set \(T_{y,s} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d, x \mapsto s^{-1} (x-y)\), then
Now, given any \(x' \in {\mathbb {R}}^{d-1}\), let us set \( f_{x'} : {\mathbb {R}}\rightarrow {\mathbb {R}}, t \mapsto (f \circ T^{-1}_{y,s}) ((t, x')) \). As a consequence of Lemma 2.18-(1), we see \(f_{x'} \in {\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,{\mathscr {L}}(N), N}\). According to Part 2 of Lemma 5.19, there is a constant \(K_N \in {\mathbb {N}}\) such that \(f_{x'} \in {\mathtt {PPoly}}_{K_N}^{r^{{\mathscr {L}}(N) - 1}}\). Hence, Lemma E.4 yields a constant \(C_2 = C_2(p) > 0\) such that \(\Vert \Delta _j - f_{x'}\Vert _{L_p ( (0,1) )} \ge C_2\) as soon as \(2^{j} + 1 \ge 4 \, K_N \cdot (1 + r^{{\mathscr {L}}(N) - 1}) =: K_N '\). Because of \(2^{j} + 1 \ge j\), this is satisfied if \(j \ge K_N '\). In case of \(p < \infty \), Fubini’s theorem shows
whence \( \Vert g_{j}-f\Vert _{L_{p}(\Omega )} \ge s^{d/p} \Vert \Delta _{j,d} - f \circ T^{-1}_{y,s}\Vert _{L_p ([0,1]^{d})} \ge C_2 \cdot s^{d/p} \). For \(p = \infty \), the same estimate remains true because \(\Vert \bullet \Vert _{L_1([0,1]^d)} \le \Vert \bullet \Vert _{L_\infty ([0,1]^d)}\). Since \(f \in {\mathtt {NN}}^{\varrho _r,d,1}_{\infty ,{\mathscr {L}}(N),N}\) was arbitrary, we have shown
Directly from the definition of the norm , this implies that for arbitrary \(N \in {\mathbb {N}}_{\ge N_0}\)
This proves as \(j \rightarrow \infty \), and thus completes the proof. \(\square \)
Cite this article
Gribonval, R., Kutyniok, G., Nielsen, M. et al. Approximation Spaces of Deep Neural Networks. Constr Approx 55, 259–367 (2022). https://doi.org/10.1007/s00365-021-09543-4
DOI: https://doi.org/10.1007/s00365-021-09543-4
Keywords
- Deep neural networks
- Sparsely connected networks
- Approximation spaces
- Besov spaces
- Direct estimates
- Inverse estimates
- Piecewise polynomials
- ReLU activation function