Approximation Spaces of Deep Neural Networks

Published in: Constructive Approximation

Abstract

We study the expressivity of deep neural networks. Measuring a network’s complexity by its number of connections or by its number of neurons, we consider the class of functions for which the error of best approximation with networks of a given complexity decays at a certain rate when increasing the complexity budget. Using results from classical approximation theory, we show that this class can be endowed with a (quasi)-norm that makes it a linear function space, called approximation space. We establish that allowing the networks to have certain types of “skip connections” does not change the resulting approximation spaces. We also discuss the role of the network’s nonlinearity (also known as activation function) on the resulting spaces, as well as the role of depth. For the popular ReLU nonlinearity and its powers, we relate the newly constructed spaces to classical Besov spaces. The established embeddings highlight that some functions of very low Besov smoothness can nevertheless be well approximated by neural networks, if these networks are sufficiently deep.

Notes

  1. See, e.g., [4, Section 3] for reminders on quasi-norms and quasi-Banach spaces.

  2. A function \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is a squashing function if it is non-decreasing with \(\lim _{x \rightarrow -\infty } \sigma (x) = 0\) and \(\lim _{x \rightarrow \infty } \sigma (x) = 1\); see [36, Definition 2.3].

  3. Note that \(4 = 1 \!\mod 3\) and hence \(4^n - 1 = 0 \!\mod 3\), so that \(w \in {\mathbb {N}}\).

  4. Notice the restriction to \(W,N \ge 1\); in fact, the result of Lemma 4.11 as stated cannot hold for \(W=0\) or \(N=0\).

  5. Here, the term “domain” is to be understood as an open connected set.

  6. With the convention \(\lfloor \infty /2\rfloor = \infty -1 = \infty \).

  7. For instance, [26, Proposition 4.35] shows that each function in \(C_0({\mathbb {R}}^d)\) is a uniform limit of continuous, compactly supported functions, [27, Proposition (2.6)] shows that such functions are uniformly continuous, while [63, Theorem 12.8] shows that the uniform continuity is preserved by the uniform limit.

  8. This implicitly uses that \(\varrho _i\) is not affine-linear, so that \( \varrho _i \in \overline{{\mathtt {NN}}^{\varrho _r,1,1}_{2 \cdot 4^{r-i},2,2^{r-i}} {\setminus } {\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,1,\infty }} \).

References

  1. Adams, R.A., Fournier, J.J.F.: Sobolev Spaces. Pure and Applied Mathematics (Amsterdam), vol. 140, 2nd edn. Elsevier/Academic Press, Amsterdam (2003)

  2. Adler, J., Öktem, O.: Solving ill-posed inverse problems using iterative deep neural networks. Inverse Probl. 33, 124007 (2017)

  3. Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis: A Hitchhiker’s Guide, third edn. Springer, Berlin (2006)

  4. Almira, J.M., Luther, U.: Generalized approximation spaces and applications. Math. Nachr. 263(264), 3–35 (2004)

  5. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930–945 (1993)

  6. Barron, A.R.: Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14(1), 115–133 (1994)

  7. Bartlett, P.L., Harvey, N., Liaw, C., Mehrabian, A.: Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv (2017)

  8. Beck, C., Becker, S., Grohs, P., Jaafari, N., Jentzen, A.: Solving stochastic differential equations and Kolmogorov equations by means of deep learning (2018). arXiv preprint arXiv:1806.00421

  9. Bölcskei, H., Grohs, P., Kutyniok, G., Petersen, P.: Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1, 8–45 (2019)

  10. Bubba, T.A., Kutyniok, G., Lassas, M., März, M., Samek, W., Siltanen, S., Srinivasan, V.: Learning the invisible: a hybrid deep learning-shearlet framework for limited angle computed tomography. Inverse Probl. 35(6), 064002 (2019)

  11. Bui, H.-Q., Laugesen, R.S.: Affine systems that span Lebesgue spaces. J. Fourier Anal. Appl. 11(5), 533–556 (2005)

  12. Candès, E.J.: Ridgelets: Theory and Applications. Ph.D. thesis, Stanford University (1998)

  13. Chui, C.K., Li, X., Mhaskar, H.N.: Neural networks for localized approximation. Math. Comput. 63(208), 607–623 (1994)

  14. Cohen, N., Sharir, O., Shashua, A.: On the expressive power of deep learning: a tensor analysis. In: Conference on Learning Theory, pp. 698–728 (2016)

  15. Cohen, N., Shashua, A.: Convolutional rectifier networks as generalized tensor decompositions. In: International Conference on Machine Learning, pp. 955–963 (2016)

  16. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

  17. DeVore, R.A.: Nonlinear approximation. In: Acta Numerica, pp. 51–150. Cambridge Univ. Press, Cambridge (1998)

  18. DeVore, R.A., Oskolkov, K.I., Petrushev, P.P.: Approximation by feed-forward neural networks. Ann. Numer. Math. 4, 261–287 (1996)

  19. DeVore, R.A., Popov, V.A.: Interpolation of Besov spaces. Trans. Am. Math. Soc. 305(1), 397–414 (1988)

  20. DeVore, R.A., Sharpley, R.C.: Besov spaces on domains in \( {R}^d\). Trans. Am. Math. Soc. 335(2), 843–864 (1993)

  21. DeVore, R.A., Lorentz, G.G.: Constructive approximation. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 303. Springer, Berlin (1993)

  22. Elad, M.: Deep, Deep Trouble. Deep Learning’s Impact on Image Processing, Mathematics, and Humanity. SIAM News (2017)

  23. Eldan, R., Shamir, O.: The power of depth for feedforward neural networks. In: Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, pp. 907–940. (2016)

  24. Ellacott, S.W.: Aspects of the numerical analysis of neural networks. Acta Numer. 3, 145–202 (1994)

  25. Elstrodt, J.: Maß- und Integrationstheorie. Springer Spektrum, 8th edn. Springer, Berlin, Heidelberg (2018)

  26. Folland, G.B.: Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics, 2nd edn. Wiley, Amsterdam (1999)

  27. Folland, G.B.: A Course in Abstract Harmonic Analysis. Studies in Advanced Mathematics, 2nd edn. CRC Press, Boca Raton (1995)

  28. Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Springer, Berlin (2012)

  29. Funahashi, K.-I.: On the approximate realization of continuous mappings by neural networks. Neural Netw. 2(3), 183–192 (1989)

  30. Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: Proceedings of the 27th Annual International Conference on Machine Learning, pp. 399–406 (2010)

  31. Håstad, J.T.: Computational limitations for small-depth circuits. ACM Doctoral Dissertation Award (1986) (1987)

  32. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  33. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pp. 1026–1034. IEEE Computer Society, Washington, DC, USA (2015)

  34. Hoffman, K., Kunze, R.: Linear Algebra, 2nd edn. Prentice-Hall Inc, Englewood Cliffs (1971)

  35. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)

  36. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  37. Johnen, H., Scherer, K.: On the equivalence of the \(K\)-functional and moduli of continuity and some applications. In: Constructive Theory of Functions of Several Variables (Proc. Conf., Math. Res. Inst., Oberwolfach, 1976), pp. 119–140. Lecture Notes in Math., Vol. 571. Springer, Berlin (1977)

  38. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, NIPS’12, pp. 1097–1105. Curran Associates Inc, USA (2012)

  39. Laugesen, R.S.: Affine synthesis onto \(L^p\) when \(0<p\le 1\). J. Fourier Anal. Appl. 14(2), 235–266 (2008)

  40. Lax, P.D., Terrell, M.S.: Calculus with Applications. Undergraduate Texts in Mathematics, 2nd edn. Springer, New York (2014)

  41. Le Magoarou, L., Gribonval, R.: Flexible multi-layer sparse approximations of matrices and applications. IEEE J. Sel. Top. Signal Process. 10(4), 688–700 (2016). https://doi.org/10.1109/JSTSP.2016.2543461

  42. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  43. Leshno, M., Lin, V.Ya., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6), 861–867 (1993)

  44. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the ICML, vol. 30, pp. 3 (2013)

  45. Maiorov, V., Pinkus, A.: Lower bounds for approximation by MLP neural networks. Neurocomputing 25(1), 81–91 (1999)

  46. Mallat, S.: Understanding deep convolutional networks. Philos. Trans. R. Soc. A 374(2065), 20150203–16 (2016)

  47. Mardt, A., Pasquali, L., Wu, H., Noé, F.: Vampnets: deep learning of molecular kinetics. Nat. Commun. 9, 5 (2018)

  48. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5(4), 115–133 (1943)

  49. Mhaskar, H.N.: Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1(1), 61–80 (1993)

  50. Mhaskar, H.N.: Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. 8(1), 164–177 (1996)

  51. Mhaskar, H.N., Poggio, T.: Deep vs. shallow networks: an approximation theory perspective. Anal. Appl. 14(06), 829–848 (2016)

  52. Mhaskar, H.N., Micchelli, C.A.: Degree of approximation by neural and translation networks with a single hidden layer. Adv. Appl. Math. 16(2), 151–183 (1995)

  53. Nguyen-Thien, T., Tran-Cong, T.: Approximation of functions and their derivatives: a neural network implementation with applications. Appl. Math. Model. 23(9), 687–704 (1999)

  54. Orhan, A.E., Pitkow, X.: Skip Connections Eliminate Singularities (2017). arXiv preprint arXiv:1701.09175

  55. Petersen, P., Voigtlaender, F.: Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108, 296–330 (2018)

  56. Petrushev, P.P.: Direct and converse theorems for spline and rational approximation and Besov spaces. In: Function Spaces and Applications (Lund, 1986), volume 1302 of Lecture Notes in Math., pp. 363–377. Springer, Berlin (1988)

  57. Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999)

  58. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. Springer, Cham (2015)

  59. Rudin, W.: Functional Analysis. International Series in Pure and Applied Mathematics, 2nd edn. McGraw-Hill Inc, New York (1991)

  60. Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function (2017). arXiv preprint arXiv:1708.06633

  61. Schütt, K.T., Arbabzadah, F., Chmiela, S., Müller, K.R., Tkatchenko, A.: Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 13890 (2017)

  62. Shaham, U., Cloninger, A., Coifman, R.R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44(3), 537–557 (2018)

  63. Naimpally, S.A., Peters, J.F.: Topology with Applications. World Scientific, Singapore (2013)

  64. Telgarsky, M.: Benefits of depth in neural networks (2016). arXiv preprint arXiv:1602.04485

  65. Unser, M.A.: Splines: a perfect fit for signal and image processing. IEEE Signal Process. Mag. 16(6), 22–38 (1999)

  66. Voigtlaender, F.: Embedding Theorems for Decomposition Spaces with Applications to Wavelet Coorbit Spaces. PhD thesis, RWTH Aachen University (2015). http://publications.rwth-aachen.de/record/564979

  67. Wu, Z., Shen, C., Hengel, A.V.D.: Wider or deeper: revisiting the resnet model for visual recognition (2016). arXiv preprint arXiv:1611.10080

  68. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017). https://doi.org/10.1016/j.neunet.2017.07.002

  69. Yarotsky, D.: Optimal approximation of continuous functions by very deep ReLU networks (2018). arXiv preprint arXiv:1802.03620

Author information

Corresponding author

Correspondence to Gitta Kutyniok.

Additional information

Communicated by Wolfgang Dahmen, Ronald A. Devore, and Philipp Grohs.

This work was conducted while R.G. was with Univ Rennes, Inria, CNRS, IRISA.

G.K. acknowledges partial support by the Bundesministerium für Bildung und Forschung (BMBF) through the Berliner Zentrum for Machine Learning (BZML), Project AP4, RTG DAEDALUS (RTG 2433), Projects P1 and P3, RTG BIOQIC (RTG 2260), Projects P4 and P9, and by the Berlin Mathematics Research Center MATH+, Projects EF1-1 and EF1-4. G.K. and F.V. acknowledge support by the European Commission-Project DEDALE (Contract No. 665044) within the H2020 Framework.

Appendices

Appendix A. Proofs for Section 2

For a matrix \(A \in {\mathbb {R}}^{n \times d}\), we write \(A^T \in {\mathbb {R}}^{d \times n}\) for the transpose of A. For \(i \in \{1,\dots ,n\}\), we write \(A_{i,-} \in {\mathbb {R}}^{1 \times d}\) for the i-th row of A, while \(A_{{(i)}} \in {\mathbb {R}}^{(n-1) \times d}\) denotes the matrix obtained by deleting the i-th row of A. We use the same notation \(b_{(i)}\) for vectors \(b\in {\mathbb {R}}^{n}\cong {\mathbb {R}}^{n \times 1}\). Finally, for \(j \in \{1,\dots ,d\}\), \(A_{[j]} \in {\mathbb {R}}^{n \times (d-1)}\) denotes the matrix obtained by removing the j-th column of A.
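
To fix this notation computationally, the following minimal NumPy sketch (the matrix, the vector, and the indices are purely illustrative, and are 0-based rather than 1-based) realizes \(A^T\), \(A_{i,-}\), \(A_{(i)}\), \(b_{(i)}\), and \(A_{[j]}\).

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)      # illustrative A in R^{3 x 4}
b = np.array([1.0, 2.0, 3.0])          # illustrative b in R^3
i, j = 1, 2                            # 0-based indices

A_transpose = A.T                      # A^T in R^{4 x 3}
row_i       = A[i, :]                  # A_{i,-}, the i-th row of A
A_del_row   = np.delete(A, i, axis=0)  # A_{(i)}: A with its i-th row removed
b_del_i     = np.delete(b, i)          # b_{(i)}: b with its i-th entry removed
A_del_col   = np.delete(A, j, axis=1)  # A_{[j]}: A with its j-th column removed
```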

1.1 Proof of Lemma 2.6

Write \(N_0 (\Phi ) := {d_{\mathrm {in}}}(\Phi ) + {d_{\mathrm {out}}}(\Phi ) + N(\Phi )\) for the total number of neurons of the network \(\Phi \), including the “non-hidden” neurons.

The proof is by contradiction. Assume that there is a network \(\Phi \) for which the claim fails. Among all such networks, consider one with minimal value of \(N_0(\Phi )\), i.e., such that the claim holds for all networks \(\Psi \) with \(N_0(\Psi ) < N_0(\Phi )\). Let us write \(\Phi = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )\) with \(T_\ell \, x = A^{(\ell )} x + b^{(\ell )}\), for certain \(A^{(\ell )} \in {\mathbb {R}}^{N_\ell \times N_{\ell -1}}\) and \(b^{(\ell )} \in {\mathbb {R}}^{N_\ell }\).

Let us first consider the case that

$$\begin{aligned} \forall \, \, \ell \in \{1,\dots ,L\} \,\, \forall \,\, i \in \{1,\dots ,N_\ell \} \, : \, A^{(\ell )}_{i, -} \ne 0. \end{aligned}$$
(A.1)

By (A.1), we get \(\Vert A^{(\ell )}\Vert _{\ell ^0} \ge N_\ell \ge \Vert b^{(\ell )}\Vert _{\ell ^0}\), so that

$$\begin{aligned} W_0(\Phi ) = \sum _{\ell = 1}^L (\Vert A^{(\ell )}\Vert _{\ell ^0} + \Vert b^{(\ell )}\Vert _{\ell ^0}) \le 2 \cdot \sum _{\ell = 1}^L \Vert A^{(\ell )}\Vert _{\ell ^0} = 2 W(\Phi ) \le {d_{\mathrm {out}}}(\Phi ) + 2 W(\Phi ). \end{aligned}$$

Hence, with \({\widetilde{\Phi }} = \Phi \), \(\Phi \) satisfies the claim of the lemma, in contradiction to our assumption.

Thus, there is some \(\ell _0 \in \{1,\dots ,L\}\) and some \(i \in \{1,\dots ,N_{\ell _0}\}\) satisfying \(A^{(\ell _0)}_{i, -} = 0\). In other words, there is a neuron that is not connected to the previous layers. Intuitively, one can “remove it” without changing \({\mathtt {R}}(\Phi )\). This is what we now show formally.

Let us write \(\alpha _\ell = \bigotimes _{j=1}^{N_\ell } \varrho _j^{(\ell )}\) for certain \(\varrho _j^{(\ell )} \in \{\mathrm {id}_{{\mathbb {R}}}, \varrho \}\), and set \(\theta _\ell := \alpha _\ell \circ T_\ell \), so that \({\mathtt {R}}(\Phi ) = \theta _L \circ \cdots \circ \theta _1\). By our choice of \(\ell _0\) and i, note

$$\begin{aligned} \big ( \theta _{\ell _0} (x) \big )_i= & {} \varrho _i^{(\ell _0)} \left( (A^{(\ell _0)} x + b^{(\ell _0)})_i \right) \nonumber \\= & {} \varrho _i^{(\ell _0)} \left( \langle A^{(\ell _0)}_{i, -}, x \rangle + b_i^{(\ell _0)} \right) = \varrho _i^{(\ell _0)} (b_i^{(\ell _0)}) =: c \in {\mathbb {R}}, \end{aligned}$$
(A.2)

for arbitrary \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\). After these initial observations, we now distinguish four cases:

Case 1 (Neuron on the output layer of size \({d_{\mathrm {out}}}(\Phi ) = 1\)): We have \(\ell _0 = L\) and \(N_L = 1\), so that necessarily \(i = 1\). In view of Eq. (A.2), we then have \({\mathtt {R}}(\Phi ) \equiv c\). Thus, if we choose the affine-linear map \(S_1 : {\mathbb {R}}^{N_0}\rightarrow {\mathbb {R}}^1, x\mapsto c\), and set \(\gamma _1 := \mathrm {id}_{\mathbb {R}}\), then the strict \(\varrho \)-network \({\widetilde{\Phi }} := \big ( (S_1, \gamma _1) \big )\) satisfies \({\mathtt {R}}(\, {\widetilde{\Phi }} \,) \equiv c \equiv {\mathtt {R}}(\Phi )\), and \(L(\, {\widetilde{\Phi }} \,) = 1 \le L(\Phi )\), as well as \(W_0(\, {\widetilde{\Phi }} \,) =1={d_{\mathrm {out}}}(\Phi ) \le {d_{\mathrm {out}}}(\Phi ) +2 W(\Phi )\) and \(N(\, {\widetilde{\Phi }} \,) = 0 \le N(\Phi )\). Thus, \(\Phi \) satisfies the claim of the lemma, contradicting our assumption.

Case 2 (Neuron on the output layer of size \({d_{\mathrm {out}}}(\Phi )>1\)): We have \(\ell _0 = L\) and \(N_L > 1\). Define

$$\begin{aligned} B^{(\ell )} := A^{(\ell )}, \quad c^{(\ell )} := b^{(\ell )}, \quad \text {and } \quad \beta _\ell := \alpha _\ell \quad \text {for} \quad \ell \in \{1,\dots ,L - 1\}. \end{aligned}$$

We then set \(B^{(L)} := A^{(L)}_{(i)} \in {\mathbb {R}}^{(N_L - 1) \times N_{L-1}}\) and \(c^{(L)} := b^{(L)}_{(i)} \in {\mathbb {R}}^{N_{L} - 1}\), as well as \(\beta _L := \mathrm {id}_{{\mathbb {R}}^{N_L - 1}}\).

Setting \(S_\ell \, x := B^{(\ell )} x+c^{(\ell )}\) for \(x \in {\mathbb {R}}^{N_{\ell - 1}}\), the network \(\Phi _0 := \big ( (S_1, \beta _1), \dots , (S_L,\beta _L) \big )\) then satisfies \({\mathtt {R}}(\Phi _0) (x) = \big ( {\mathtt {R}}(\Phi ) (x) \big )_{(i)}\) for all \(x \in {\mathbb {R}}^{N_0}\), and \(N_0 (\Phi _0) = N_0 (\Phi ) - 1 < N_0 (\Phi )\). Furthermore, if \(\Phi \) is strict, then so is \(\Phi _0\).

By the “minimality” assumption on \(\Phi \), there is thus a network \({\widetilde{\Phi }}_0\) (which is strict if \(\Phi \) is strict) with \({\mathtt {R}}(\, {\widetilde{\Phi }} \,_0) = {\mathtt {R}}(\Phi _0)\) and such that \(L' := L(\, {\widetilde{\Phi }} \,_0) \le L(\Phi _0) = L(\Phi )\), as well as \(N (\, {\widetilde{\Phi }} \,_0) \le N (\Phi _0) = N(\Phi )\), and

$$\begin{aligned} W (\, {\widetilde{\Phi }} \,_0) \le W_0 (\, {\widetilde{\Phi }} \,_0) \le {d_{\mathrm {out}}}(\Phi _0) + 2 \cdot W(\Phi _0) \le {d_{\mathrm {out}}}(\Phi ) - 1 + 2 \cdot W(\Phi ). \end{aligned}$$

Let us write \({\widetilde{\Phi }}_0 = \big ( (U_1, \gamma _1), \dots , (U_{L'}, \gamma _{L'}) \big )\), with affine-linear maps \(U_\ell : {\mathbb {R}}^{M_{\ell - 1}} \rightarrow {\mathbb {R}}^{M_\ell }\), so that \(U_\ell \, x = C^{(\ell )} x + d^{(\ell )}\) for \(\ell \in \{1,\dots ,L'\}\) and \(x \in {\mathbb {R}}^{M_{\ell - 1}}\). Note that \(M_{L'} = N_L - 1\), and define

$$\begin{aligned} {\widetilde{C}}^{(L')} := \left( \begin{matrix} C^{(L')}_{1, -} \\ \vdots \\ C^{(L')}_{i-1, -} \\ 0 \\ C^{(L')}_{i, -} \\ \vdots \\ C^{(L')}_{M_{L'}, -} \end{matrix} \right) \in {\mathbb {R}}^{N_L \times M_{L'-1}} \quad \text {and} \quad {\widetilde{d}}^{(L')} := \left( \begin{matrix} d^{(L')}_1 \\ \vdots \\ d^{(L')}_{i-1} \\ c \\ d^{(L')}_{i} \\ \vdots \\ d^{(L')}_{M_{L'}} \end{matrix} \right) \in {\mathbb {R}}^{N_L} , \end{aligned}$$

as well as \({\widetilde{\gamma }}_{L'} := \mathrm {id}_{{\mathbb {R}}^{N_L}}\), and \({\widetilde{U}}_{L'} : {\mathbb {R}}^{M_{L' - 1}} \rightarrow {\mathbb {R}}^{N_L}, x \mapsto {\widetilde{C}}^{(L')} x + {\widetilde{d}}^{(L')}\), and finally,

$$\begin{aligned} {\widetilde{\Phi }} := \big ( (U_1, \gamma _1), \dots , (U_{L' - 1}, \gamma _{L' - 1}), ({\widetilde{U}}_{L'}, {\widetilde{\gamma }}_{L'}) \big ). \end{aligned}$$

By virtue of Eq. (A.2), we then have \({\mathtt {R}}(\, {\widetilde{\Phi }} \,) = {\mathtt {R}}(\Phi )\), and if \(\Phi \) is strict, then so is \(\Phi _0\) and thus also \({\widetilde{\Phi }}_0\) and \({\widetilde{\Phi }}\). Furthermore, we have \(L (\, {\widetilde{\Phi }} \,) = L' \le L(\Phi )\), and \(N(\, {\widetilde{\Phi }} \,) = N({\widetilde{\Phi }}_0) \le N(\Phi )\), as well as \(W (\, {\widetilde{\Phi }} \,) \le W_0 (\, {\widetilde{\Phi }} \,) \le 1 + W_0 (\, {\widetilde{\Phi }} \,_0) \le {d_{\mathrm {out}}}(\Phi ) + 2 W(\Phi )\). Thus, \(\Phi \) satisfies the claim of the lemma, contradicting our assumption.

Case 3 (Hidden neuron on layer \(\ell _0\) with \(N_{\ell _0} = 1\)): We have \(1 \le \ell _0 < L\) and \(N_{\ell _0} = 1\). In this case, Eq. (A.2) implies \(\theta _{\ell _0} \equiv c\), whence \({\mathtt {R}}(\Phi ) = \theta _L \circ \cdots \circ \theta _1 \equiv {\widetilde{c}}\) for some \({\widetilde{c}} \in {\mathbb {R}}^{N_L}\).

Thus, if we choose the affine map \(S_1 : {\mathbb {R}}^{N_0} \rightarrow {\mathbb {R}}^{N_L}, x \mapsto {\widetilde{c}}\), then the strict \(\varrho \)-network \({\widetilde{\Phi }} = \big ( (S_1, \gamma _1) \big )\) satisfies \({\mathtt {R}}({\widetilde{\Phi }}) \equiv {\widetilde{c}} \equiv {\mathtt {R}}(\Phi )\) and \(L({\widetilde{\Phi }}) = 1 \le L(\Phi )\), as well as \(W_0 ({\widetilde{\Phi }}) \le d_{\mathrm {out}} (\Phi ) \le d_{\mathrm {out}} (\Phi ) + 2 \, W(\Phi )\) and \(N({\widetilde{\Phi }}) = 0 \le N(\Phi )\). Thus, \(\Phi \) satisfies the claim of the lemma, in contradiction to our choice of \(\Phi \).

Case 4 (Hidden neuron on layer \(\ell _0\) with \(N_{\ell _0} > 1\)): In this case, we have \(1 \le \ell _0 < L\) and \(N_{\ell _0} > 1\). Define \(S_\ell := T_\ell \) and \(\beta _\ell := \alpha _\ell \) for \(\ell \in \{1,\dots ,L\} {\setminus } \{\ell _0, \ell _0 + 1\}\), and let us choose \({S_{\ell _0} : {\mathbb {R}}^{N_{\ell _0 - 1}} \rightarrow {\mathbb {R}}^{N_{\ell _0} - 1}, x \mapsto B^{(\ell _0)} x + c^{(\ell _0)}}\), where

$$\begin{aligned} B^{(\ell _0)} := A^{(\ell _0)}_{(i)}, \quad c^{(\ell _0)} := b^{(\ell _0)}_{(i)}, \quad \text {and} \quad \beta _{\ell _0} := \varrho _1^{(\ell _0)} \otimes \cdots \otimes \varrho _{i - 1}^{(\ell _0)} \otimes \varrho _{i+1}^{(\ell _0)} \otimes \cdots \otimes \varrho _{N_{\ell _0}}^{(\ell _0)} . \end{aligned}$$

Finally, for \(x \in {\mathbb {R}}^{N_{\ell _0} - 1}\), let \( \iota _c (x) := \left( x_1,\dots , x_{i-1}, c, x_i,\dots , x_{N_{\ell _0} - 1} \right) ^{\mathrm{T}} \in {\mathbb {R}}^{N_{\ell _0}} , \) and set \(\beta _{\ell _{0} + 1} := \alpha _{\ell _0 + 1}\), as well as

$$\begin{aligned}&S_{\ell _0 + 1} : {\mathbb {R}}^{N_{\ell _0} - 1} \rightarrow {\mathbb {R}}^{N_{\ell _0 + 1}}, x \mapsto A_{[i]}^{(\ell _0 + 1)} \, x+ c \cdot A^{(\ell _0 + 1)} e_{i} + b^{(\ell _0 + 1)} \\&\quad \quad = A^{(\ell _0 + 1)} (\iota _c (x) ) + b^{(\ell _0 + 1)}, \end{aligned}$$

where \(e_i\) is the i-th element of the standard basis of \({\mathbb {R}}^{N_{\ell _0}}\).

Setting \(\vartheta _{\ell } := \beta _\ell \circ S_\ell \) and recalling that \(\theta _\ell = \alpha _\ell \circ T_\ell \) for \(\ell \in \{1,\dots ,L\}\), we then have \(\vartheta _{\ell _0} (x) = (\theta _{\ell _0} (x) )_{(i)}\) for all \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\). By virtue of Eq. (A.2), this implies \(\theta _{\ell _0} (x) = \iota _c ( \vartheta _{\ell _0} (x) )\), so that

$$\begin{aligned} S_{\ell _0 + 1} (\vartheta _{\ell _0} (x) )= & {} A^{(\ell _0 + 1)} \big (\iota _c ( \vartheta _{\ell _0 } (x) )\big ) + b^{(\ell _0 + 1)} \\= & {} A^{(\ell _0 + 1)} (\theta _{\ell _0} (x) ) + b^{(\ell _0 + 1)} = T_{\ell _0 + 1} (\theta _{\ell _0} (x) ) . \end{aligned}$$

Recalling that \(\beta _{\ell _0 + 1} = \alpha _{\ell _0 + 1}\), we thus see \(\vartheta _{\ell _0 + 1} \circ \vartheta _{\ell _0} = \theta _{\ell _0 + 1} \circ \theta _{\ell _0}\), which then easily shows \({\mathtt {R}}(\Phi _0) = {\mathtt {R}}(\Phi )\) for \(\Phi _0 := \big ( (S_1, \beta _1),\dots , (S_L, \beta _L) \big )\). Note that if \(\Phi \) is strict, then so is \(\Phi _0\). Furthermore, we have \(N_{0}(\Phi _0) = N_{0}(\Phi ) - 1 < N_{0}(\Phi )\) so that by “minimality” of \(\Phi \), there is a network \({\widetilde{\Phi }}_0\) (which is strict if \(\Phi \) is strict) satisfying \({\mathtt {R}}(\, {\widetilde{\Phi }}_0\,)={\mathtt {R}}(\Phi _0)={\mathtt {R}}(\Phi )\) and furthermore \(L(\, {\widetilde{\Phi }}_0 \,) \le L(\Phi _0) = L(\Phi )\), as well as \(N(\, {\widetilde{\Phi }}_0\, ) \le N(\Phi _0) \le N(\Phi )\), and finally \( W (\, {\widetilde{\Phi }}_0 \,) \le W_0 (\, {\widetilde{\Phi }}_0 \,) \le {d_{\mathrm {out}}}(\Phi _0) + 2 W(\Phi _0) \le {d_{\mathrm {out}}}(\Phi ) + 2 W(\Phi ). \) Thus, the claim holds for \(\Phi \), contradicting our assumption. \(\square \)
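
To make the matrix manipulations of Case 2 concrete, the following minimal NumPy sketch (function names, the layer representation, and the 0-based index \(i\) are illustrative only) performs the two operations used there: dropping the i-th output coordinate, whose incoming weights vanish so that it outputs a constant \(c\), and re-inserting that constant into the last layer of the simplified network via a zero row and the bias entry \(c\).

```python
import numpy as np

def drop_constant_output(A_L, b_L, i):
    """The i-th output neuron has no incoming connections (A_L[i, :] == 0); since
    the output activation of a generalized network is the identity, it outputs the
    constant c = b_L[i].  Remove it, keeping track of c."""
    c = b_L[i]
    return np.delete(A_L, i, axis=0), np.delete(b_L, i), c

def reinsert_constant_output(C, d, i, c):
    """Re-insert the constant output coordinate at (0-based) position i by adding
    a zero row to the weight matrix and the entry c to the bias vector, mirroring
    the definitions of the modified last-layer matrix and bias above."""
    C_new = np.insert(C, i, np.zeros(C.shape[1]), axis=0)
    d_new = np.insert(d, i, c)
    return C_new, d_new
```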

1.2 Proof of Lemma 2.14

We begin by showing \({\mathtt {NN}}_{W, L,W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,W,W}^{\varrho ,d,k}\). Let \(f \in {\mathtt {NN}}_{W, L,W}^{\varrho ,d,k}\). By definition, there is \(\Phi \in {\mathcal {NN}}_{W, L,W}^{\varrho ,d,k}\) such that \(f = {\mathtt {R}}(\Phi )\). Note that \(W(\Phi ) \le W\), and let us distinguish two cases: If \(L(\Phi ) \le W(\Phi )\) then \(L(\Phi ) \le W\), whence in fact \(\Phi \in {\mathcal {NN}}_{W, W, W}^{\varrho ,d,k}\) and \(f \in {\mathtt {NN}}_{W, W, W}^{\varrho ,d,k}\) as claimed. Otherwise, \(W(\Phi ) < L(\Phi )\) and by Corollary 2.10 we have \(f = {\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}^{k}\). Therefore, Lemma 2.13 shows that \(f \in {\mathtt {NN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,W,W}^{\varrho ,d,k}\), where the inclusion holds by definition of these sets.

The inclusion \({\mathtt {NN}}_{W,L,W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,L,\infty }^{\varrho ,d,k}\) is trivial. Similarly, if \(L \ge W\), then trivially \({\mathtt {NN}}_{W, W, W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,L,W}^{\varrho ,d,k}\).

Thus, it remains to show \({\mathtt {NN}}_{W,L,\infty }^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,L,W}^{\varrho ,d,k}\). To prove this, we will show that for each network \( \Phi = \big ( (T_{1}, \alpha _{1}), \dots , (T_{K}, \alpha _{K}) \big ) \in {\mathcal {NN}}_{W,L,\infty }^{\varrho ,d,k} \) (so that necessarily \(K \le L\)) with \(N (\Phi ) > W\), one can find a neural network \(\Phi ' \in {\mathcal {NN}}_{W,L, \infty }^{\varrho ,d,k}\) with \({\mathtt {R}}(\Phi ') = {\mathtt {R}}(\Phi )\), and such that \(N(\Phi ') < N(\Phi )\). If \(\Phi \) is strict, then we show that \(\Phi '\) can also be chosen to be strict. The desired inclusion can then be obtained by repeating this “compression” step until one reaches the point where \(N(\Phi ') \le W\).

For each \(\ell \in \{1,\dots ,K\}\), let \(b^{(\ell )} \in {\mathbb {R}}^{N_\ell }\) and \(A^{(\ell )} \in {\mathbb {R}}^{N_{\ell } \times N_{\ell -1}}\) be such that \(T_\ell = A^{(\ell )} \bullet + b^{(\ell )}\). Since \(\Phi \in {\mathcal {NN}}_{W,L,\infty }^{\varrho ,d,k}\), we have \(W(\Phi )\le W\). In combination with \(N(\Phi ) > W\), this implies

$$\begin{aligned} \sum _{\ell = 1}^{K-1} N_\ell = N(\Phi ) > W \ge W(\Phi ) = \sum _{\ell = 1}^K \Vert A^{(\ell )} \Vert _{\ell _0} \ge \sum _{\ell = 1}^{K-1} \sum _{i = 1}^{N_\ell } \Vert A^{(\ell )}_{i, -} \Vert _{\ell ^0} . \end{aligned}$$

Therefore, \(K > 1\), and there must be some \(\ell _0 \in \{1,\dots ,K-1\}\) and \(i \in \{1,\dots ,N_{\ell _0}\}\) with \(A^{(\ell _0)}_{i, -} = 0\). We now distinguish two cases:

Case 1 (Single neuron on layer \(\ell _{0}\)): We have \(N_{\ell _0} = 1\). In this case, \(A^{(\ell _0)} = 0\) and hence \(T_{\ell _0} \equiv b^{(\ell _0)}\). Therefore, \({\mathtt {R}}(\Phi )\) is constant; say \({\mathtt {R}}(\Phi ) \equiv c \in {\mathbb {R}}^k\). Choose \(S_{1} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k, x \mapsto c\), and \(\beta _{1} := \mathrm {id}_{{\mathbb {R}}^k}\). Then, \({\mathtt {R}}(\Phi ) \equiv c \equiv {\mathtt {R}}(\Phi ')\) for the strict \(\varrho \)-network \( \Phi ' := \big ( (S_{1},\beta _{1}) \big ) \in {\mathcal {NN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathcal {NN}}_{W,L,\infty }^{\varrho ,d,k} \), which indeed satisfies \(N(\Phi ') = 0 \le W < N(\Phi )\).

Case 2 (Multiple neurons on layer \(\ell _{0}\)): We have \(N_{\ell _0} > 1\). Recall that \(\ell _0 \in \{1,\dots ,K-1\}\), so that \(\ell _0 + 1 \in \{1,\dots ,K\}\). Now, define \(S_{\ell } := T_{\ell }\) and \(\beta _{\ell } := \alpha _{\ell }\) for \(\ell \in \{1,\dots ,K\} {\setminus } \{\ell _0, \ell _0 + 1\}\). Further, define

$$\begin{aligned}&S_{\ell _0} : {\mathbb {R}}^{N_{\ell _0 - 1}} \rightarrow {\mathbb {R}}^{N_{\ell _0}-1}, \quad \! \text {with} \! \\&\quad (S_{\ell _0} \, x)_{j} := \! {\left\{ \begin{array}{ll} (T_{\ell _0} \, x)_j , &{} \text {if } j < i, \\ (T_{\ell _0} \, x)_{j+1}, &{} \text {if } j \ge i \end{array}\right. } \quad \! \text {for } j \in \{1,\dots ,N_{\ell _0}-1\} . \end{aligned}$$

Using the notation \(A_{(i)}, b_{(i)}\) from the beginning of Appendix A, this means \(S_{\ell _0} \, x=A^{(\ell _0)}_{(i)} x + b^{(\ell _0)}_{(i)} = (T_{\ell _0} \, x)_{(i)}\).

Finally, writing \(\alpha _{\ell } = \varrho _1^{(\ell )} \otimes \cdots \otimes \varrho _{N_\ell }^{(\ell )}\) for \(\ell \in \{1,\dots ,K\}\), define \(\beta _{\ell _0 + 1} := \alpha _{\ell _0 + 1}\), as well as

$$\begin{aligned} \beta _{\ell _0} := \varrho _1^{(\ell _0)} \otimes \cdots \otimes \varrho _{i-1}^{(\ell _0)} \otimes \varrho _{i+1}^{(\ell _0)} \otimes \cdots \otimes \varrho _{N_{\ell _0}}^{(\ell _0)} \quad : \quad {\mathbb {R}}^{N_{\ell _0}-1} \rightarrow {\mathbb {R}}^{N_{\ell _0}-1} , \end{aligned}$$

and

$$\begin{aligned} S_{\ell _0 + 1} : {\mathbb {R}}^{N_{\ell _0}-1} \rightarrow {\mathbb {R}}^{N_{\ell _0 + 1}}, y&\mapsto T_{\ell _0 + 1} \left( y_1, \dots , y_{i-1}, \varrho _i^{(\ell _0)}(b_i^{(\ell _0)}), y_i, \dots , y_{N_{\ell _0}-1} \right) \\&= A_{[i]}^{(\ell _0 + 1)} y + b^{(\ell _0 + 1)} + \varrho _i^{(\ell _0)} (b_i^{(\ell _0)}) \cdot A^{(\ell _0 + 1)} \, e_i, \end{aligned}$$

where \(e_i \in {\mathbb {R}}^{N_{\ell _0}}\) denotes the i-th element of the standard basis, and where \(A_{[i]}\) is the matrix obtained from a given matrix A by removing its i-th column.

Now, for arbitrary \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\), let \(y := S_{\ell _0} \, x \in {\mathbb {R}}^{N_{\ell _0} - 1}\) and \(z := T_{\ell _0} \, x \in {\mathbb {R}}^{N_{\ell _0}}\). Because of \(A^{(\ell _0)}_{i, -} = 0\), we then have \(z_i = b_i^{(\ell _0)}\). Further, by definition of \(S_{\ell _0}\), we have \(y_{j}=(T_{\ell _0} \, x)_j = z_j\) for \(j<i\), and \(y_j =(T_{\ell _0} \, x)_{j+1}=z_{j+1}\) for \(j\ge i\). All in all, this shows

$$\begin{aligned} S_{\ell _0 + 1} \big ( \beta _{\ell _0} (S_{\ell _0} x) \big )&= S_{\ell _0 + 1} (\beta _{\ell _0} (y) ) \\&= T_{\ell _0 + 1} \left( \varrho _1^{(\ell _0)} (y_1), \dots , \varrho _{i-1}^{(\ell _0)} (y_{i-1}), \varrho _i^{(\ell _0)} (b_i^{(\ell _0)}), \right. \\&\quad \left. \varrho _{i+1}^{(\ell _0)} (y_i), \dots , \varrho _{N_{\ell _0}}^{(\ell _0)} (y_{N_{\ell _0} - 1}) \right) \\&= T_{\ell _0 + 1} \left( \varrho _1^{(\ell _0)} (z_1), \dots , \varrho _{i-1}^{(\ell _0)} (z_{i-1}), \varrho _i^{(\ell _0)} (z_i),\right. \\&\quad \left. \varrho _{i+1}^{(\ell _0)} (z_{i+1}), \dots , \varrho _{N_{\ell _0}}^{(\ell _0)} (z_{N_{\ell _0}}) \right) \\&= T_{\ell _0 + 1} \big ( \alpha _{\ell _0} (z) \big ) = T_{\ell _0 + 1} \big ( \alpha _{\ell _0} (T_{\ell _0} x) \big ) . \end{aligned}$$

Recall that this holds for all \(x \in {\mathbb {R}}^{N_{\ell _0 - 1}}\). From this, it is not hard to see \({\mathtt {R}}(\Phi ) = {\mathtt {R}}(\Phi ')\) for the network \( \Phi ' := \big ( (S_{1}, \beta _{1}), \dots , (S_{K}, \beta _{K}) \big ) \in {\mathcal {NN}}_{\infty , K,\infty }^{\varrho ,d,k} \subset {\mathcal {NN}}_{\infty , L,\infty }^{\varrho ,d,k} \). Note that \(\Phi '\) is a strict network if \(\Phi \) is strict. Finally, directly from the definition of \(\Phi '\), we see \(W(\Phi ') \le W(\Phi ) \le W\), so that \(\Phi ' \in {\mathcal {NN}}_{W, L,\infty }^{\varrho ,d,k}\). Also, \(N(\Phi ') = N(\Phi ) - 1 < N(\Phi )\), as desired. \(\square \)
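
The "compression" step of Case 2 (and likewise Case 4 in the proof of Lemma 2.6) admits a compact computational description. The following NumPy sketch (the names and the layer representation are illustrative, not part of the formal setup) removes a hidden neuron without incoming connections and folds its constant output into the next layer, mirroring the definitions of \(S_{\ell _0}\) and \(S_{\ell _0 + 1}\).

```python
import numpy as np

def remove_disconnected_neuron(A1, b1, rho1, A2, b2, i):
    """Remove hidden neuron i (0-based) of layer l0, assuming A1[i, :] == 0.
    (A1, b1) and (A2, b2) are the affine maps of layers l0 and l0 + 1, and
    rho1 is the list of neuron-wise activations of layer l0."""
    assert not A1[i, :].any(), "neuron i must have no incoming connections"
    c = rho1[i](b1[i])                   # its output is the constant rho_i(b_i)
    A1_new = np.delete(A1, i, axis=0)    # A^{(l0)} with row i removed
    b1_new = np.delete(b1, i)            # b^{(l0)} with entry i removed
    rho1_new = rho1[:i] + rho1[i + 1:]
    b2_new = b2 + c * A2[:, i]           # fold c * A^{(l0+1)} e_i into the bias
    A2_new = np.delete(A2, i, axis=1)    # A^{(l0+1)} with column i removed
    return A1_new, b1_new, rho1_new, A2_new, b2_new
```

One application of this step decreases the number of neurons by one without changing the realization; iterating it until no zero rows remain yields the claimed network.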

1.3 Proof of Lemma 2.16

Write \(\Phi = \big ( (T_1,\alpha _1),\dots ,(T_{L},\alpha _{L}) \big )\) with \(L = L(\Phi )\). If \(L_0 = 0\), we can simply choose \(\Psi = \Phi \). Thus, let us assume \(L_0 > 0\), and distinguish two cases:

Case 1: If \(k \le d\), so that \(c = k\), set

$$\begin{aligned} \Psi := \Big ( (T_1, \alpha _1), \dots , (T_{L}, \alpha _L), \underbrace{(\mathrm {id}_{{\mathbb {R}}^k}, \mathrm {id}_{{\mathbb {R}}^k}), \dots , (\mathrm {id}_{{\mathbb {R}}^k}, \mathrm {id}_{{\mathbb {R}}^k})}_{L_0 \text { terms}} \Big ) , \end{aligned}$$

and note that the affine map \(T := \mathrm {id}_{{\mathbb {R}}^{k}}\) satisfies \(\Vert T\Vert _{\ell ^{0}} = k=c\), and hence \(W(\Psi ) = W(\Phi ) + c \, L_0\). Furthermore, \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi )\), \(L(\Psi ) = L(\Phi ) + L_0\), and \(N(\Psi ) = N(\Phi ) + c L_0\). Here, we used crucially that the definition of generalized neural networks allows us to use the identity as the activation function for some neurons.

Case 2: If \(d < k\), so that \(c = d\), the proof proceeds as in the previous case, but with

$$\begin{aligned}&\Psi := \Big ( \underbrace{(\mathrm {id}_{{\mathbb {R}}^d}, \mathrm {id}_{{\mathbb {R}}^d}), \dots , (\mathrm {id}_{{\mathbb {R}}^d}, \mathrm {id}_{{\mathbb {R}}^d})}_{L_0 \text { terms}} \, , (T_1, \alpha _1), \dots , (T_{L}, \alpha _{L}) \Big ) .&\square \end{aligned}$$
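
Both cases can be summarized by the following sketch (with a hypothetical list-of-layers representation in which each layer is a triple of a weight matrix, a bias vector, and a tuple of neuron-wise activation labels); it relies on generalized networks being allowed to use the identity activation.

```python
import numpy as np

def pad_with_identity_layers(layers, d, k, L0):
    """Sketch of Lemma 2.16: extend a generalized network with input dimension d
    and output dimension k by L0 identity layers acting on R^c, c = min(d, k).
    Each added layer costs c connections and c neurons."""
    c = min(d, k)
    id_layer = (np.eye(c), np.zeros(c), ("id",) * c)
    if k <= d:   # Case 1: append the identity layers after the last layer
        return layers + [id_layer] * L0
    else:        # Case 2: prepend them before the first layer
        return [id_layer] * L0 + layers
```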

1.4 Proof of Lemma 2.17

For the proof of the first part, denoting \(\Phi = \big ( (T_{1}, \alpha _{1}), \dots , (T_L, \alpha _{L}) \big )\), we set \( \Psi := \big ( (T_{1}, \alpha _{1}), \dots , (c \cdot T_L, \alpha _{L}) \big ) \). By Definition 2.1, we have \(\alpha _{L} = \mathrm {id}_{{\mathbb {R}}^{k}}\); hence, one easily sees \({\mathtt {R}}(\Psi ) = c \cdot {\mathtt {R}}(\Phi )\). If \(\Phi \) is strict, then so is \(\Psi \). By construction, \(\Phi \) and \(\Psi \) have the same number of layers and neurons, and \(W(\Psi ) \le W(\Phi )\) with equality if \(c \ne 0\).

For the second and third part, we proceed by induction, using two auxiliary claims.

Lemma A.1

Let \(\Psi _1 \in {\mathcal {NN}}^{\varrho ,d,k_{1}}\) and \(\Psi _2 \in {\mathcal {NN}}^{\varrho ,d,k_{2}}\). There is a network \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k_{1}+k_{2}}\) with \({L(\Psi ) = \max \{ L(\Psi _1), L(\Psi _2) \}}\) such that \({\mathtt {R}}(\Psi ) = g\), where \( g : {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{k_{1}+k_{2}}, x \mapsto \big ( {\mathtt {R}}(\Psi _{1})(x),{\mathtt {R}}(\Psi _{2})(x) \big ) \). Furthermore, setting \(c := \min \big \{ d,\max \{ k_{1},k_{2} \} \big \}\), \(\Psi \) can be chosen to satisfy

$$\begin{aligned} W(\Psi )&\le W(\Psi _{1}) + W(\Psi _{2}) + c \cdot |L(\Psi _2) - L(\Psi _1)| \\ N(\Psi )&\le N(\Psi _1) + N(\Psi _2) + c \cdot |L(\Psi _2) - L(\Psi _1)| . \end{aligned}$$

\(\blacktriangleleft \)

Lemma A.2

Let \(\Psi _{1}, \Psi _{2} \in {\mathcal {NN}}^{\varrho ,d,k}\). There is \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\) with \(L(\Psi ) = \max \{ L(\Psi _1), L(\Psi _2) \}\) such that \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Psi _{1}) + {\mathtt {R}}(\Psi _{2})\) and, with \(c = \min \{d,k\}\),

$$\begin{aligned} W(\Psi )&\le W(\Psi _{1}) + W(\Psi _{2}) + c \cdot |L(\Psi _{2})-L(\Psi _{1})| \\ N(\Psi )&\le N(\Psi _1) + N(\Psi _2) + c \cdot |L(\Psi _2) - L(\Psi _1)| . \end{aligned}$$

\(\blacktriangleleft \)

Proof of Lemmas A.1 and A.2

Set \(L := \max \{ L(\Psi _1), L(\Psi _2) \}\) and \(L_i := L(\Psi _i)\) for \(i \in \{1,2\}\). By Lemma 2.16 applied to \(\Psi _i\) and \(L_0 = L - L_i \in {\mathbb {N}}_0\), we get for each \(i \in \{1,2\}\) a network \({\Psi _i ' \in {\mathcal {NN}}^{\varrho ,d,k_{i}}}\) with \({\mathtt {R}}(\Psi _i ') = {\mathtt {R}}(\Psi _i)\) and such that \(L(\Psi _i ') = L\), as well as \(W(\Psi _i ') \le W(\Psi _i) + c (L - L_i)\) and furthermore \(N(\Psi _i') \le N(\Psi _i) + c (L - L_i)\). By choice of L, we have \((L - L_1) + (L - L_2) = |L_1 - L_2|\), whence \(W(\Psi _1 ') + W(\Psi _2 ') \le W(\Psi _1) + W(\Psi _2) + c \, |L_1 - L_2|\), and \(N(\Psi _1 ') + N(\Psi _2 ') \le N(\Psi _1) + N(\Psi _2) + c \, |L_1 - L_2|\).

First, we deal with the pathological case \(L = 1\). In this case, each \(\Psi '_i\) is of the form \(\Psi '_i = \big ( ( T_i, \mathrm {id}_{{\mathbb {R}}^k}) \big )\), with \(T_i : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) an affine-linear map. For proving Lemma A.1, we set \(\Psi := \big ( (T,\mathrm {id}_{{\mathbb {R}}^{k_1+k_2}}) \big )\) with the affine-linear map \(T:{\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{k_1+k_2},\ x \mapsto \big ( T_1(x),T_2(x) \big )\), so that \({\mathtt {R}}(\Psi ) = g\). For proving Lemma A.2, we set \(\Psi := \big ( (T, \mathrm {id}_{{\mathbb {R}}^k}) \big )\) with \(T = T_1+T_2\), so that \( {\mathtt {R}}(\Psi ) = T_1 + T_2 = {\mathtt {R}}(\Psi '_1) + {\mathtt {R}}(\Psi '_2) = {\mathtt {R}}(\Psi _1) + {\mathtt {R}}(\Psi _2) \). Finally, we see for both cases that \(N(\Psi ) = 0 = N(\Psi '_1) + N(\Psi '_2)\) and

$$\begin{aligned} W(\Psi ) = \left\| T \right\| _{\ell ^0} \le \left\| T_1 \right\| _{\ell ^0} + \left\| T_2 \right\| _{\ell ^0} = W(\Psi '_1) + W(\Psi '_2) . \end{aligned}$$

This establishes the result for the case \(L=1\).

For \(L > 1\), write \(\Psi _1 ' = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )\) and \(\Psi _2 ' = \big ( (S_1, \beta _1), \dots , (S_L, \beta _L) \big )\) with affine-linear maps \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) and \(S_\ell : {\mathbb {R}}^{M_{\ell -1}} \rightarrow {\mathbb {R}}^{M_\ell }\) for \(\ell \in \{1,\dots ,L\}\). Let us define \(\theta _\ell := \alpha _\ell \otimes \beta _\ell \) for \(\ell \in \{1,\dots ,L\}\)—except for \(\ell = L\) when proving Lemma A.2, in which case we set \(\theta _L := \mathrm {id}_{{\mathbb {R}}^k}\). Next, set

$$\begin{aligned}&R_1 : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{N_1 + M_1}, x \mapsto (T_1 x , S_1 x) \quad \text {and}\\&\quad R_\ell : {\mathbb {R}}^{N_{\ell -1} + M_{\ell -1}} \rightarrow {\mathbb {R}}^{N_\ell + M_\ell }, (x,y) \mapsto (T_\ell \, x , S_\ell \, y) \end{aligned}$$

for \(2 \le \ell \le L\)—except if \(\ell = L\) when proving Lemma A.2. In this latter case, we instead define \(R_L\) as \( {R_L : {\mathbb {R}}^{N_{L-1} + M_{L-1}} \rightarrow {\mathbb {R}}^{k}, (x,y) \mapsto T_L \, x + S_L \, y} \). Finally, set \(\Psi := \big ( (R_1, \theta _1), \dots , (R_L, \theta _L) \big )\).

When proving Lemma A.1, it is straightforward to verify that \(\Psi \) satisfies

$$\begin{aligned} {\mathtt {R}}(\Psi ) (x) = \big ( {\mathtt {R}}(\Psi _1') (x), {\mathtt {R}}(\Psi _2')(x) \big ) = \big ( {\mathtt {R}}(\Psi _1) (x), {\mathtt {R}}(\Psi _2)(x) \big ) = g(x) \qquad \forall \, x \in {\mathbb {R}}^d \, . \end{aligned}$$

Similarly, when proving Lemma A.2, one can easily check that \( {\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Psi '_1) + {\mathtt {R}}(\Psi '_2) = {\mathtt {R}}(\Psi _1) + {\mathtt {R}}(\Psi _2) \).

Further, for arbitrary \(\ell \in \{1,\dots ,L\}\), we have \(\Vert R_\ell \Vert _{\ell ^0} \le \Vert T_\ell \Vert _{\ell ^0}+ \Vert S_\ell \Vert _{\ell ^0}\) so that

$$\begin{aligned} W(\Psi ) = \sum _{\ell =1}^{L} \Vert R_\ell \Vert _{\ell ^0} \le \sum _{\ell =1}^{L} (\Vert T_\ell \Vert _{\ell ^0}+ \Vert S_\ell \Vert _{\ell ^0}) = W(\Psi _1 ')+W(\Psi '_2) . \end{aligned}$$

Finally, \(N(\Psi ) = \sum _{\ell =1}^{L-1} (N_{\ell }+M_{\ell }) = N(\Psi '_1)+N(\Psi '_2)\). Given the estimates for \(W(\Psi _1') + W(\Psi _2')\) and \(N(\Psi _1') + N(\Psi _2')\) stated at the beginning of the proof, this yields the claim. \(\square \)
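
For concreteness, the following NumPy sketch (illustrative representation: each layer is a pair of an affine map \((A, b)\) and a list of activation labels) carries out the construction for two networks of equal depth \(L \ge 2\) with a common input dimension; the pathological case \(L = 1\) is handled separately above.

```python
import numpy as np

def combine_networks(layers1, layers2, mode="cartesian"):
    """Combine two generalized networks of equal depth L >= 2 layer by layer:
    R_1(x) = (T_1 x, S_1 x); R_l(x, y) = (T_l x, S_l y) for 1 < l < L; and R_L is
    either (T_L x, S_L y) ("cartesian", Lemma A.1) or T_L x + S_L y ("sum", A.2)."""
    L = len(layers1)
    assert L == len(layers2) and L >= 2
    combined = []
    for l, (((A1, c1), act1), ((A2, c2), act2)) in enumerate(zip(layers1, layers2), 1):
        if l == 1:                       # both halves read the same input x
            R = (np.vstack([A1, A2]), np.concatenate([c1, c2]))
            acts = act1 + act2
        elif l == L and mode == "sum":   # add the two outputs
            R = (np.hstack([A1, A2]), c1 + c2)
            acts = ["id"] * len(c1)
        else:                            # block-diagonal: the halves act separately
            Z12 = np.zeros((A1.shape[0], A2.shape[1]))
            Z21 = np.zeros((A2.shape[0], A1.shape[1]))
            R = (np.block([[A1, Z12], [Z21, A2]]), np.concatenate([c1, c2]))
            acts = act1 + act2
        combined.append((R, acts))
    return combined
```

Stacking, block-diagonal combination, and horizontal concatenation do not create new nonzero entries, which is why \(W(\Psi ) \le W(\Psi _1 ') + W(\Psi _2 ')\) in the computation above.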

Let us now return to the proof of Parts 2 and 3 of Lemma 2.17. Set \(f_{i} := {\mathtt {R}}(\Phi _{i})\) and \(L_i := L(\Phi _i)\). We first show that we can without loss of generality assume \(L_1 \le \dots \le L_n\). To see this, note that there is a permutation \(\sigma \in S_n\) such that if we set \(\Gamma _j := \Phi _{\sigma (j)}\), then \(L(\Gamma _1) \le \dots \le L(\Gamma _n)\). Furthermore, \(\sum _{j=1}^n {\mathtt {R}}(\Gamma _j) = \sum _{j=1}^n {\mathtt {R}}(\Phi _j)\). Finally, there is a permutation matrix \(P \in \mathrm {GL}({\mathbb {R}}^K)\) such that

$$\begin{aligned} P \circ \big ( {\mathtt {R}}(\Gamma _1), \dots , {\mathtt {R}}(\Gamma _n) \big ) = \big ( {\mathtt {R}}(\Phi _1), \dots , {\mathtt {R}}(\Phi _n) \big ) = (f_1, \dots , f_n) = g . \end{aligned}$$

Since the permutation matrix P has exactly one nonzero entry per row and column, we have \(\Vert P\Vert _{\ell ^{0,\infty }} = 1\) in the notation of Eq. (2.4). Therefore, the first part of Lemma 2.18 (which will be proven independently) shows that \(g \in {\mathtt {NN}}^{\varrho ,d,K}_{W,L,N}\), provided that \(\big ( {\mathtt {R}}(\Gamma _1), \dots , {\mathtt {R}}(\Gamma _n) \big ) \in {\mathtt {NN}}^{\varrho ,d,K}_{W,L,N}\). These considerations show that we can assume \(L(\Phi _1) \le \dots \le L(\Phi _n)\) without loss of generality.

We now prove the following claim by induction on \(j \in \{1,\dots ,n\}\): There is \(\Theta _{j} \in {\mathcal {NN}}^{\varrho ,d,K_{j}}\) satisfying \(W(\Theta _{j}) \le \sum _{i=1}^{j} W(\Phi _{i}) + c \, (L_j - L_1)\), and \(N(\Theta _{j}) = \sum _{i=1}^{j} N(\Phi _{i}) + c \, (L_j - L_1)\), as well as \({L(\Theta _{j}) = L_j}\), and such that \({\mathtt {R}}(\Theta _j) = g_{j} := \sum _{i=1}^{j} f_{i}\) and \(K_{j} := k\) for the summation, respectively such that \({{\mathtt {R}}(\Theta _{j}) = g_{j} := (f_1, \dots , f_j)}\) and \(K_{j} := \sum _{i=1}^{j} k_{i}\) for the Cartesian product. Here, c is as in the corresponding claim of Lemma 2.17.

Specializing to \(j=n\) then yields the conclusion of Lemma 2.17.

We now proceed to the induction. The claim trivially holds for \(j=1\)—just take \(\Theta _1 = \Phi _1\). Assuming that the claim holds for some \(j \in \{1,\dots ,n-1\}\), we define \(\Psi _{1} := \Theta _{j}\) and \(\Psi _{2} := \Phi _{j+1}\). Note that \(L(\Psi _1) = L(\Theta _j) = L_j \le L_{j+1} = L(\Psi _2)\). For the summation, by Lemma A.2 there is a network \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\) with \(L(\Psi ) = L_{j+1}\) and \( {\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Psi _{1}) + {\mathtt {R}}(\Psi _{2}) = {\mathtt {R}}(\Theta _{j}) + {\mathtt {R}}(\Phi _{j+1}) = g_{j}+ f_{j+1} = g_{j+1} \), and such that

$$\begin{aligned} W(\Psi )\le & {} W(\Psi _{1}) + W(\Psi _{2}) + c' \cdot |L(\Psi _2) - L(\Psi _1)| \\\le & {} W(\Theta _{j}) + W(\Phi _{j+1}) + c' \cdot (L_{j+1} - L_j) \end{aligned}$$

and likewise \(N(\Psi ) \le N(\Theta _j) + N(\Phi _{j+1}) + c' \cdot (L_{j+1} - L_j)\), where \(c' = \min \{d,k\} = c\). For the Cartesian product, Lemma A.1 yields a network \(\Psi \in {\mathcal {NN}}^{\varrho ,d,K_{j}+k_{j+1}} = {\mathcal {NN}}^{\varrho ,d,K_{j+1}}\) satisfying

$$\begin{aligned} {\mathtt {R}}(\Psi ) = \big ( {\mathtt {R}}(\Psi _{1}),{\mathtt {R}}(\Psi _{2}) \big ) = \big ( {\mathtt {R}}(\Theta _{j}), {\mathtt {R}}(\Phi _{j+1}) \big ) = g_{j+1} \end{aligned}$$

and such that, setting \(c' := \min \big \{ d, \max \{ K_{j}, k_{j+1} \} \big \} \le \min \{d,K-1\} = c\), we have

$$\begin{aligned} W(\Psi )\le & {} W(\Psi _{1})+W(\Psi _{2}) + c' \cdot | L(\Psi _{2}) - L(\Psi _{1}) | \\= & {} W(\Theta _{j}) + W(\Phi _{j+1}) + c' \cdot (L_{j+1} - L_j) \end{aligned}$$

and \(N(\Psi ) \le N(\Theta _j) + N(\Phi _{j+1}) + c' \cdot (L_{j+1} - L_j)\).

With \(\Theta _{j+1} := \Psi \), we get \({\mathtt {R}}(\Theta _{j+1}) = g_{j+1}\), \(L(\Theta _{j+1}) = L_{j+1}\) and, by the induction hypothesis,

$$\begin{aligned} W(\Theta _{j+1})\le & {} \sum _{i=1}^{j} W(\Phi _{i}) + c \, (L_j - L_1) + W(\Phi _{j+1}) + c \, (L_{j+1} - L_j) \\= & {} \sum _{i=1}^{j+1} W(\Phi _{i}) + c \, (L_{j+1} - L_1) . \end{aligned}$$

Similarly, \( N(\Theta _{j+1}) \le \sum _{i=1}^{j+1} N(\Phi _{i}) + c \cdot (L_{j+1} - L_1) \). This completes the induction and the proof. \(\square \)

1.5 Proof of Lemma 2.18

We prove each part of the lemma individually.

Part (2): Let \( \Phi _1 = \big ( (T_1, \alpha _1), \dots , (T_{L_1}, \alpha _{L_1}) \big ) \in {\mathcal {NN}}^{\varrho , d, d_1} \) and \( \Phi _2 = \big ( (S_1, \beta _1), \dots , (S_{L_2}, \beta _{L_2}) \big ) \in {\mathcal {NN}}^{\varrho , d_1, d_2} \). Define

$$\begin{aligned} \Psi := \big ( (T_1, \alpha _1), \dots , (T_{L_1}, \alpha _{L_1}), (S_1, \beta _1), \dots , (S_{L_2}, \beta _{L_2}) \big ) . \end{aligned}$$

We emphasize that \(\Psi \) is indeed a generalized \(\varrho \)-network, since all \(T_\ell \) and all \(S_\ell \) are affine-linear (with “fitting” dimensions), and since all \(\alpha _\ell \) and all \(\beta _\ell \) are \(\otimes \)-products of \(\varrho \) and \(\mathrm {id}_{{\mathbb {R}}}\), with \(\beta _{L_2} = \mathrm {id}_{{\mathbb {R}}^{d_2}}\). Furthermore, we clearly have \(L(\Psi ) = L_1 + L_2 = L(\Phi _1) + L(\Phi _2)\), and

$$\begin{aligned} W(\Psi ) = \sum _{\ell =1}^{L_1} \Vert T_\ell \Vert _{\ell ^0} + \sum _{\ell '=1}^{L_2} \Vert S_{\ell '}\Vert _{\ell ^0} = W(\Phi _1) + W(\Phi _2). \end{aligned}$$

Clearly, \(N(\Psi ) = N(\Phi _1) + d_1 + N(\Phi _2)\). Finally, the property \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1)\) is a direct consequence of the definition of the realization of neural networks.

Part (1): Let \(\Phi = \big ( (T_{1},\alpha _{1}), \dots , (T_L,\alpha _{L}) \big ) \in {\mathcal {NN}}^{\varrho ,d,k}\). We give the proof for \(Q \circ {\mathtt {R}}(\Phi )\), since the proof for \({\mathtt {R}}(\Phi ) \circ P\) is similar but simpler; the general statement in the lemma then follows from the identity \( Q \circ {\mathtt {R}}(\Phi ) \circ P = (Q \circ {\mathtt {R}}(\Phi )) \circ P = {\mathtt {R}}(\Psi _1) \circ P \).

We first treat the special case \(\Vert Q\Vert _{\ell ^{0,\infty }}=0\) which implies \(\Vert Q\Vert _{\ell ^{0}} = 0\), and hence, \(Q \circ {\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}^{k_1}\). Choose \(N_0,\dots ,N_L\) such that \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) for \(\ell \in \{1,\dots ,L\}\), and define \(S_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }, x \mapsto 0\) for \(\ell \in \{1,\dots ,L-1\}\) and \(S_L : {\mathbb {R}}^{N_{L-1}} \rightarrow {\mathbb {R}}^{k_1}, x \mapsto c\). It is then not hard to see that the network \(\Psi := \big ( (S_1,\alpha _1),\dots ,(S_L,\alpha _L) \big )\) satisfies \(L(\Psi ) = L(\Phi )\) and \(N(\Psi ) = N(\Phi )\), as well as \(W(\Psi ) = 0\) and \({\mathtt {R}}(\Psi ) \equiv c = Q \circ {\mathtt {R}}(\Phi )\).

We now consider the case \(\Vert Q\Vert _{\ell ^{0,\infty }} \ge 1\). Define \(U_{\ell } := T_{\ell }\) for \(\ell \in \{1,\dots ,L-1\}\) and \({U_{L} := Q \circ T_{L}}\). By Definition 2.1, we have \(\alpha _{L} = \mathrm {id}_{{\mathbb {R}}^{k}}\), whence \({ \Psi := \big ( (U_{1},\alpha _{1}), \dots , (U_{L-1}, \alpha _{L-1}), (U_L, \mathrm {id}_{{\mathbb {R}}^{k_{1}}}) \big ) \in {\mathcal {NN}}_{\infty ,L,N(\Phi )}^{\varrho ,d,k_1} }\) satisfies \({\mathtt {R}}(\Psi ) = Q \circ {\mathtt {R}}(\Phi )\). To control \(W(\Psi )\), we use the following lemma. The proof is slightly deferred.

Lemma A.3

Let \(p,q,r \in {\mathbb {N}}\) be arbitrary.

  (1) For arbitrary affine-linear maps \(T : {\mathbb {R}}^p \rightarrow {\mathbb {R}}^q\) and \(S : {\mathbb {R}}^q \rightarrow {\mathbb {R}}^r\), we have

    $$\begin{aligned} \Vert S \circ T \Vert _{\ell ^0} \le \Vert S \Vert _{\ell ^{0,\infty }} \cdot \Vert T \Vert _{\ell ^0} \quad \text {and} \quad \Vert S \circ T \Vert _{\ell ^0} \le \Vert S \Vert _{\ell ^0} \cdot \Vert T \Vert _{\ell ^{0,\infty }_{*}} . \end{aligned}$$
  (2) For affine-linear maps \(T_1, \dots , T_n\), we have \(\Vert T_1 \otimes \cdots \otimes T_n\Vert _{\ell ^0} \le \sum _{i=1}^n \Vert T_i\Vert _{\ell ^0}\), as well as

    $$\begin{aligned} \Vert T_1 \otimes \cdots \otimes T_n \Vert _{\ell ^{0,\infty }}\le & {} \max _{i \in \{1,\dots ,n\}} \Vert T_i \Vert _{\ell ^{0,\infty }} \quad \text {and} \\ \Vert T_1 \otimes \cdots \otimes T_n \Vert _{\ell ^{0,\infty }_{*}}\le & {} \max _{i \in \{1,\dots ,n\}} \Vert T_i \Vert _{\ell ^{0,\infty }_{*}} . \blacktriangleleft \end{aligned}$$

Let us continue with the proof from above. By definition, \( \Vert U_{\ell }\Vert _{\ell ^{0}} = \Vert T_{\ell }\Vert _{\ell ^{0}} \le \Vert Q \Vert _{\ell ^{0,\infty }} \cdot \Vert T_{\ell }\Vert _{\ell ^{0}} \) for \(\ell \in \{1,\dots ,L-1\}\). By Lemma A.3, we also have \(\Vert U_{L}\Vert _{\ell ^{0}} \le \Vert Q \Vert _{\ell ^{0,\infty }} \cdot \Vert T_{L}\Vert _{\ell ^{0}}\), and hence,

$$\begin{aligned} W(\Psi ) = \sum _{\ell =1}^L \Vert U_\ell \Vert _{\ell ^0} \le \Vert Q \Vert _{\ell ^{0,\infty }} \, \sum _{\ell =1}^L \Vert T_\ell \Vert _{\ell ^0} = \Vert Q \Vert _{\ell ^{0,\infty }} \cdot W(\Phi ). \end{aligned}$$

Finally, if \(\Phi \) is strict, then \(\Psi \) is strict as well; thus, the claim also holds with \({\mathtt {SNN}}\) instead of \({\mathtt {NN}}\).

Part (3): Let \( \Phi _1 = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big ) \in {\mathcal {NN}}^{\varrho , d, d_1} \) and \( \Phi _2 = \big ( (S_1, \beta _1), \dots , (S_K, \beta _K) \big ) \in {\mathcal {NN}}^{\varrho , d_1, d_2} \).

We distinguish two cases: First, if \(L = 1\), then \({\mathtt {R}}(\Phi _1) = T_1\). Since \(T_1 : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{d_1}\), this implies \(\Vert T_1\Vert _{\ell ^{0,\infty }_*} \le d\). Thus, Part (1) shows that

$$\begin{aligned} {\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1) = {\mathtt {R}}(\Phi _2) \circ T_1\in & {} {\mathtt {NN}}_{d \cdot W(\Phi _2), K, N(\Phi _2)}^{\varrho ,d,d_2} \\\subset & {} {\mathtt {NN}}_{W(\Phi _1) + N \cdot W(\Phi _2), L + K - 1, N(\Phi _1) + N(\Phi _2)}^{\varrho ,d,d_2}, \end{aligned}$$

where \(N := \max \{ N(\Phi _1), d\}\).

Let us now assume that \(L > 1\). In this case, define

$$\begin{aligned} \Psi := \big ( (T_1, \alpha _1), \dots , (T_{L-1},\alpha _{L-1}), (S_1 \circ T_L, \beta _1), (S_2, \beta _2) \dots , (S_K,\beta _K) \big ) . \end{aligned}$$

It is not hard to see that \(N(\Psi ) \le N(\Phi _1) + N(\Phi _2)\) and—because of \(\alpha _L = \mathrm {id}_{{\mathbb {R}}^{d_1}}\)—that

$$\begin{aligned} {\mathtt {R}}(\Psi ) = (\beta _K \circ S_K) \circ \cdots \circ (\beta _1 \circ S_1) \circ (\alpha _L \circ T_L) \circ \cdots \circ (\alpha _1 \circ T_1) = {\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1). \end{aligned}$$

Note \(T_\ell : {\mathbb {R}}^{M_{\ell - 1}} \rightarrow {\mathbb {R}}^{M_\ell }\) for certain \(M_0,\dots ,M_L \in {\mathbb {N}}\). Since \(L > 1\), we have \(M_{L - 1} \le N(\Phi _1) \le N\). Furthermore, since \(T_L : {\mathbb {R}}^{M_{L-1}} \rightarrow {\mathbb {R}}^{M_L}\), we get \(\Vert T_L\Vert _{\ell ^{0,\infty }_*} \le M_{L-1} \le N\) directly from the definition. Thus, Lemma A.3 shows \( \Vert S_1 \circ T_L\Vert _{\ell ^0} \le \Vert S_1\Vert _{\ell ^0} \cdot \Vert T_L\Vert _{\ell ^{0,\infty }_*} \le N \cdot \Vert S_1\Vert _{\ell ^0} \). Therefore, and since \(N \ge 1\), we see that

$$\begin{aligned} W(\Psi )= & {} \sum _{\ell =1}^{L-1} \Vert T_\ell \Vert _{\ell ^0} + \Vert S_1 \circ T_L\Vert _{\ell ^0} + \sum _{\ell =2}^{K} \Vert S_\ell \Vert _{\ell ^0} \\\le & {} W(\Phi _1) + N \cdot \Vert S_1\Vert _{\ell ^0} + N \cdot \sum _{\ell =2}^{K} \Vert S_\ell \Vert _{\ell ^0} = W(\Phi _1) + N \cdot W(\Phi _2). \end{aligned}$$

Finally, note that if \(\Phi _1,\Phi _2\) are strict networks, then so is \(\Psi \). \(\square \)
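
The depth-saving composition of Part (3) can be summarized as follows (an illustrative sketch with the same hypothetical list-of-layers representation as before): since the output activation of \(\Phi _1\) is the identity, its last affine map can be merged with the first affine map of \(\Phi _2\) into a single layer.

```python
import numpy as np

def compose_networks(layers1, layers2):
    """Sketch of Lemma 2.18, Part (3): realize R(Phi2) o R(Phi1) with L + K - 1
    layers by merging the last affine map (A_L, b_L) of Phi1 with the first
    affine map (C_1, d_1) of Phi2 into x -> C_1 (A_L x + b_L) + d_1."""
    (A_L, b_L), _ = layers1[-1]          # the output activation of Phi1 is the identity
    (C_1, d_1), act = layers2[0]
    merged = ((C_1 @ A_L, C_1 @ b_L + d_1), act)
    return layers1[:-1] + [merged] + layers2[1:]
```

For Part (2), no merging takes place: the two layer lists are simply concatenated, so the depths add.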

Proof of Lemma A.3

The stated estimates follow directly from the definitions by direct computations and are thus left to the reader. For instance, the main observation for proving that \(\Vert B A \Vert _{\ell ^0} \le \Vert B \Vert _{\ell ^{0,\infty }} \cdot \Vert A \Vert _{\ell ^0}\) is that each column of the product satisfies

$$\begin{aligned} (B A) \, e_j = B \, (A \, e_j) = \sum _{k \,:\, A_{k,j} \ne 0} A_{k,j} \cdot B \, e_k , \end{aligned}$$

so that \(\Vert (B A) \, e_j \Vert _{\ell ^0} \le \Vert B \Vert _{\ell ^{0,\infty }} \cdot \Vert A \, e_j \Vert _{\ell ^0}\); summing over j yields the claim. \(\square \)
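
As a quick numerical sanity check of these estimates, the following sketch evaluates both bounds on random sparse matrices. It assumes, consistently with the way these quantities are used above, that \(\Vert \cdot \Vert _{\ell ^{0,\infty }}\) counts the maximal number of nonzero entries per column and \(\Vert \cdot \Vert _{\ell ^{0,\infty }_{*}}\) the maximal number per row; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse(m, n, density=0.3):
    """Random m x n matrix in which roughly a `density` fraction of the entries is nonzero."""
    return rng.normal(size=(m, n)) * (rng.random((m, n)) < density)

def nnz(M):     return int(np.count_nonzero(M))                 # ||M||_{ell^0}
def col_max(M): return int(np.count_nonzero(M, axis=0).max())   # assumed ||M||_{ell^{0,infty}}
def row_max(M): return int(np.count_nonzero(M, axis=1).max())   # assumed ||M||_{ell^{0,infty}_*}

B, A = sparse(5, 4), sparse(4, 6)          # linear parts of S and T
assert nnz(B @ A) <= col_max(B) * nnz(A)   # ||S o T|| <= ||S||_{ell^{0,infty}} * ||T||_{ell^0}
assert nnz(B @ A) <= nnz(B) * row_max(A)   # ||S o T|| <= ||S||_{ell^0} * ||T||_{ell^{0,infty}_*}
```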

1.6 Proof of Lemma 2.19

We start with an auxiliary lemma.

Lemma A.4

Consider two activation functions \(\varrho ,\sigma \) such that \(\sigma = {\mathtt {R}}(\Psi _{\sigma })\) for some \( \Psi _{\sigma } \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \) with \(L(\Psi _{\sigma }) = \ell \in {\mathbb {N}}\), \(w \in {\mathbb {N}}_{0}\), \(m \in {\mathbb {N}}\). Furthermore, assume that \(\sigma \not \equiv \mathrm {const}\).

Then, for any \(d \in {\mathbb {N}}\) and \(\alpha _{i} \in \{\mathrm {id}_{{\mathbb {R}}},\sigma \}\), \(1 \le i \le d\) we have \(\alpha _{1} \otimes \cdots \otimes \alpha _{d} = {\mathtt {R}}(\Phi )\) for some network

$$\begin{aligned} \Phi = \big ( (U_1, \gamma _1), \ldots , (U_{\ell }, \gamma _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,d,d}_{dw,\ell ,dm} \end{aligned}$$

satisfying \(\Vert U_1\Vert _{\ell ^{0,\infty }} \le m\), \(\Vert U_1\Vert _{\ell ^{0,\infty }_{*}} \le 1\), \(\Vert U_{\ell }\Vert _{\ell ^{0,\infty }} \le 1\), and \(\Vert U_{\ell }\Vert _{\ell ^{0,\infty }_{*}} \le m\).

If \(\Psi _\sigma \) is a strict network and \(\alpha _i = \sigma \) for all i, then \(\Phi \) can be chosen to be a strict network.\(\blacktriangleleft \)

Proof of Lemma A.4

First, we show that any \(\alpha \in \{\mathrm {id}_{{\mathbb {R}}},\sigma \}\) satisfies \(\alpha = {\mathtt {R}}(\Psi _{\alpha })\) for some network

$$\begin{aligned} \Psi _{\alpha } = \big ( (U_{1}^{\alpha }, \gamma _{1}^{\alpha }), \ldots , (U_{\ell }^{\alpha }, \gamma _{\ell }^{\alpha }) \big ) \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \end{aligned}$$

with \(\Vert U_1^{\alpha }\Vert _{\ell ^{0,\infty }} \le m\), \(\Vert U_1^{\alpha }\Vert _{\ell ^{0,\infty }_{*}} \le 1\), \(\Vert U_{\ell }^{\alpha }\Vert _{\ell ^{0,\infty }} \le 1\) and \(\Vert U_{\ell }^{\alpha }\Vert _{\ell ^{0,\infty }_{*}} \le m\).

For \(\alpha = \sigma \), we have \(\alpha = {\mathtt {R}}(\Psi _{\sigma })\) where \(\Psi _\sigma \) is of the form \( \Psi _{\sigma } = \big ( (T_{1}, \beta _{1}), \ldots , (T_{\ell }, \beta _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \). For \(\alpha = \mathrm {id}_{{\mathbb {R}}}\), observe that \(\alpha = {\mathtt {R}}(\Psi _{\mathrm {id}_{{\mathbb {R}}}})\) with

$$\begin{aligned} \Psi _{\mathrm {id}_{{\mathbb {R}}}} := \big ( (T_1', \mathrm {id}_{{\mathbb {R}}}), \ldots , (T_\ell ', \mathrm {id}_{{\mathbb {R}}}) \big ) := \big ( (\mathrm {id}_{{\mathbb {R}}}, \mathrm {id}_{{\mathbb {R}}}), \ldots , (\mathrm {id}_{{\mathbb {R}}}, \mathrm {id}_{{\mathbb {R}}}) \big ), \end{aligned}$$

where it is easy to see that \(N(\Psi _{\mathrm {id}_{{\mathbb {R}}}}) = \ell - 1 \le m\) and \(W(\Psi _{\mathrm {id}_{{\mathbb {R}}}}) = \ell \le w\). Indeed, Eq. (2.1) shows that \(\ell = L(\Psi _\sigma ) \le 1 + N(\Psi _\sigma ) \le 1 + m\). On the other hand, since \(\sigma \not \equiv \mathrm {const}\), Corollary 2.10 shows that \(\ell = L(\Psi _\sigma ) \le W(\Psi _\sigma ) \le w\).

Denoting by \(N_{i}\) the number of neurons in the i-th layer of \(\Psi _{\sigma }\) (where layer 0 is the input layer, and layer \(\ell \) the output layer), we get from \(\Psi _{\sigma } \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m}\) that \(N_{i} \le m\) for \(1 \le i \le \ell -1\). Furthermore, since \(T_{1}: {\mathbb {R}}\rightarrow {\mathbb {R}}^{N_{1}}\), we have \(\Vert T_{1}\Vert _{\ell ^{0,\infty }} \le N_{1} \le m\) and \(\Vert T_{1}\Vert _{\ell ^{0,\infty }_{*}} \le 1\). Similarly, as \(T_{\ell }: {\mathbb {R}}^{N_{\ell -1}} \rightarrow {\mathbb {R}}\) we have \(\Vert T_{\ell }\Vert _{\ell ^{0,\infty }} \le 1\) and \(\Vert T_{\ell }\Vert _{\ell ^{0,\infty }_{*}} \le m\). The same bounds trivially hold for \(T'_{1}\) and \(T'_{\ell }\).

We now prove the claim of the lemma by induction on d. The result is trivial for \(d=1\) using \(\Phi = \Psi _{\alpha _{1}}\). Assuming it is true for \(d \in {\mathbb {N}}\), we prove it for \(d+1\).

Define \(\alpha = \alpha _{1} \otimes \cdots \otimes \alpha _{d}\) and \({\overline{\alpha }} = \alpha _{1} \otimes \cdots \otimes \alpha _{d+1} = \alpha \otimes \alpha _{d+1}\). By induction, there are networks \( \Psi _{1} = \big ( (V_{1},\lambda _{1}),\ldots ,(V_{\ell },\lambda _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,d,d}_{dw,\ell ,dm} \) and \( \Psi _{2} = \big ( (W_{1},\mu _{1}),\ldots ,(W_{\ell },\mu _{\ell }) \big ) \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \) such that \({\mathtt {R}}(\Psi _1)=\alpha \) and \({\mathtt {R}}(\Psi _2)=\alpha _{d+1}\) and such that \(\Vert V_1\Vert _{\ell ^{0,\infty }} \le m\), \(\Vert V_1\Vert _{\ell ^{0,\infty }_*} \le 1\), \(\Vert V_\ell \Vert _{\ell ^{0,\infty }} \le 1\), and \(\Vert V_\ell \Vert _{\ell ^{0,\infty }_*} \le m\), and likewise for \(W_1\) instead of \(V_1\) and \(W_\ell \) instead of \(V_\ell \).

Define \(U_{i} := V_{i} \otimes W_{i}\) and \(\gamma _{i}:= \lambda _{i} \otimes \mu _{i}\) for \(1 \le i \le \ell \), and \(\Phi := \big ( (U_{1},\gamma _{1}),\ldots ,(U_{\ell },\gamma _{\ell }) \big )\). One can check that \({\mathtt {R}}(\Phi ) = {\overline{\alpha }}\). Moreover, Lemma A.3 shows that \(\Vert U_{i}\Vert _{\ell ^0} = \Vert V_{i}\Vert _{\ell ^0}+\Vert W_{i}\Vert _{\ell ^0}\) for \(1 \le i \le \ell \), whence \(W(\Phi ) = W(\Psi _1) + W(\Psi _2) \le dw+d = (d+1)w\) and similarly \(N(\Phi ) = N(\Psi _{1}) + N(\Psi _{2}) \le (d+1)m\). Finally, Lemma A.3 shows that

$$\begin{aligned} \Vert U_{1}\Vert _{\ell ^{0,\infty }}&\le \max \big \{ \Vert V_{1}\Vert _{\ell ^{0,\infty }}, \Vert W_{1}\Vert _{\ell ^{0,\infty }} \big \} \le m, \quad \\ \Vert U_{1}\Vert _{\ell ^{0,\infty }_{*}}&\le \max \big \{ \Vert V_{1}\Vert _{\ell ^{0,\infty }_{*}}, \Vert W_{1}\Vert _{\ell ^{0,\infty }_{*}} \big \} \le 1, \\ \Vert U_{\ell }\Vert _{\ell ^{0,\infty }}&\le \max \big \{ \Vert V_{\ell }\Vert _{\ell ^{0,\infty }}, \Vert W_{\ell }\Vert _{\ell ^{0,\infty }} \big \} \le 1, \quad \\ \Vert U_{\ell }\Vert _{\ell ^{0,\infty }_{*}}&\le \max \big \{ \Vert V_{\ell }\Vert _{\ell ^{0,\infty }_{*}}, \Vert W_{\ell }\Vert _{\ell ^{0,\infty }_{*}} \big \} \le m . \end{aligned}$$

Clearly, if \(\Psi _\sigma \) is strict, and if \(\alpha _i = \sigma \) for all i, then the same induction shows that \(\Phi \) can be chosen to be a strict network. \(\square \)

Proof of Lemma 2.19

For the first statement with \(\ell =2\), consider \(f = {\mathtt {R}}(\Psi )\) for some

$$\begin{aligned} \Psi = \big ( (S_{1}, \alpha _{1}), \ldots , (S_{K-1}, \alpha _{K-1}), (S_K, \mathrm {id}_{{\mathbb {R}}^{k}}) \big ) \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L,N}. \end{aligned}$$

In case of \(K = 1\), we trivially have \(\Psi \in {\mathcal {NN}}_{W,L,N}^{\varrho ,d,k}\), so that we can assume \(K \ge 2\) in the following.

Denoting by \(N_{i}\) the number of neurons at the i-th layer of \(\Psi \), Lemma A.4 yields for each \({i \in \{1,\dots ,K-1\}}\) a network \( \Phi _{i} = \big ( (U_{1}^{i}, \gamma _{i}), (U_{2}^{i}, \mathrm {id}_{{\mathbb {R}}^{N_{i}}}) \big ) \in {\mathcal {NN}}^{\varrho ,N_{i},N_{i}}_{N_{i}w,2,N_{i}m} \) satisfying \(\alpha _i = {\mathtt {R}}(\Phi _{i})\) and \(\gamma _{i}: {\mathbb {R}}^{N(\Phi _{i})} \rightarrow {\mathbb {R}}^{N(\Phi _{i})}\) with \(N(\Phi _{i}) \le N_{i}m\) and finally \(\Vert U_{1}^{i}\Vert _{\ell ^{0,\infty }} \le m\) and \(\Vert U_{2}^{i}\Vert _{\ell ^{0,\infty }_{*}} \le m\). With \(T_{1} := U_{1}^{1} \circ S_{1}\), \(T_{K} := S_{K} \circ U_{2}^{K-1}\), \(T_{i} := U_{1}^{i} \circ S_{i} \circ U_{2}^{i-1}\) for \(2 \le i \le K-1\) and

$$\begin{aligned} \Phi := \big ( (T_1, \gamma _1), \ldots , (T_{K-1},\gamma _{K-1}), (T_K, \mathrm {id}_{{\mathbb {R}}^k}) \big ) , \end{aligned}$$

one can check that \(f = {\mathtt {R}}(\Phi )\).

By Lemma A.3, \( \Vert T_{i}\Vert _{\ell ^{0}} \le \Vert U_{1}^{i}\Vert _{\ell ^{0,\infty }} \Vert S_{i}\Vert _{\ell ^{0}} \Vert U_{2}^{i-1}\Vert _{\ell ^{0,\infty }_{*}} \le m^{2}\Vert S_{i}\Vert _{\ell ^{0}} \) for \(2 \le i \le K-1\), and the same overall bound also holds for \(i \in \{1,K\}\). As a result, we get \(L(\Phi ) = K \le L\) as well as

$$\begin{aligned} \frac{W(\Phi )}{m^2}&= \sum _{i=1}^{K} \frac{\Vert T_{i}\Vert _{\ell ^{0}}}{m^2} \le \sum _{i=1}^{K} \Vert S_{i}\Vert _{\ell ^{0}} = W(\Psi ) \le W \quad \text {and} \quad \\ \frac{N(\Phi )}{m}&= \sum _{i=1}^{K-1} \frac{N(\Phi _i)}{m} \le \sum _{i=1}^{K-1} N_{i} = N(\Psi ) \le N . \end{aligned}$$

For the second statement, we prove by induction on \(L \in {\mathbb {N}}\) that \({\mathtt {NN}}_{W,L,N}^{\sigma ,d,k} \subset {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1 + (L-1)\ell , N(1+m)}\).

For \(L = 1\), it is easy to see \({\mathtt {NN}}_{W,1,N}^{\sigma ,d,k} = {\mathtt {NN}}^{\varrho ,d,k}_{W,1,N}\), simply because on the last (and for \(L=1\) only) layer, the activation function is always given by \(\mathrm {id}_{{\mathbb {R}}^k}\). Thus, the claim follows from the trivial inclusion \({\mathtt {NN}}_{W,1,N}^{\varrho ,d,k} \subset {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1, N(1+m)}\), since \(m \ge 1\).

Now, assuming the claim holds true for L, we prove it for \(L+1\). Consider \(f \in {\mathtt {NN}}^{\sigma ,d,k}_{W,L+1,N}\). In case of \({f \in {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N}}\), we get \( f \in {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1 + (L-1)\ell , N(1+m)} \subset {\mathtt {NN}}^{\varrho ,d,k}_{mW + Nw , 1 + ( (L+1)-1)\ell , N(1+m)} \) by the induction hypothesis. In the remaining case where \(f \notin {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N}\), there is a network \(\Psi \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L+1,N}\) of the form \( \Psi = \big ( (S_{1},\alpha _{1}),\ldots ,(S_L,\alpha _L),(S_{L+1},\mathrm {id}_{{\mathbb {R}}^{k}}) \big ) \) such that \(f = {\mathtt {R}}(\Psi )\). Observe that \(S_{L+1}: {\mathbb {R}}^{{\overline{k}}} \rightarrow {\mathbb {R}}^{k}\) with \({\overline{k}} := N_L\) the number of neurons of the last hidden layer. Defining \( \Psi _{1} := \big ( (S_{1}, \alpha _1), \ldots , (S_{L-1}, \alpha _{L-1}), (S_L, \mathrm {id}_{{\mathbb {R}}^{{\overline{k}}}}) \big ), \) we have \(\Psi _{1} \in {\mathcal {NN}}^{\sigma ,d,{\overline{k}}}_{{\overline{W}},L,{\overline{N}}}\) where \({\overline{W}} := W(\Psi _{1})\) and \({\overline{N}} := N(\Psi _{1})\) satisfy

$$\begin{aligned} {\overline{W}} + \Vert S_{L+1}\Vert _{\ell ^{0}} \le W(\Psi ) \le W \quad \text {and} \quad {\overline{N}} + {\overline{k}} \le N(\Psi ) \le N . \end{aligned}$$

Define \(g := {\mathtt {R}}(\Psi _1)\), so that \(f = S_{L+1} \circ \alpha _L \circ g\). We now exhibit a \(\varrho \)-network \(\Phi \) (instead of the \(\sigma \)-network \(\Psi \)) of controlled complexity such that \(f = {\mathtt {R}}(\Phi )\). As \(g := {\mathtt {R}}(\Psi _{1}) \in {\mathtt {NN}}^{\sigma ,d,{\overline{k}}}_{{\overline{W}},L,{\overline{N}}}\), the induction hypothesis shows that \(g = {\mathtt {R}}(\Phi _{1})\) for some network

$$\begin{aligned} \Phi _{1} = \big ( (T_{1}, \beta _{1}), \ldots , (T_{K-1}, \beta _{K-1}), (T_{K}, \mathrm {id}_{{\mathbb {R}}^{{\overline{k}}}}) \big ) \in {\mathcal {NN}}^{\varrho ,d,{\overline{k}}}_{m {\overline{W}} + {\overline{N}} w, 1 + (L-1)\ell , {\overline{N}}(1+m)} . \end{aligned}$$

Moreover, Lemma A.4 shows that \(\alpha _L = {\mathtt {R}}(\Phi _{2})\) for a network

$$\begin{aligned} \Phi _{2} = \big ( (U_{1}, \gamma _{1}), \ldots , (U_{\ell -1}, \gamma _{\ell -1}), (U_{\ell }, \mathrm {id}_{{\mathbb {R}}^{{\overline{k}}}}) \big ) \in {\mathcal {NN}}^{\varrho ,{\overline{k}},{\overline{k}}}_{{\overline{k}}w, \ell ,{\overline{k}}m} \end{aligned}$$

with \(\Vert U_\ell \Vert _{\ell ^{0,\infty }_*} \le m\). By construction, we have \(f = S_{L+1} \circ \alpha _{L} \circ g = {\mathtt {R}}(\Phi )\) for the network

$$\begin{aligned}&\Phi := \big ( (T_1, \beta _1), \dots , (T_{K-1}, \beta _{K-1}), (T_K, \mathrm {id}_{{\mathbb {R}}^{{\overline{k}}}}), \\&\quad (U_1, \gamma _1), \dots , (U_{\ell -1}, \gamma _{\ell -1}), (S_{L+1} \circ U_\ell , \mathrm {id}_{{\mathbb {R}}^k}) \big ). \end{aligned}$$

To conclude, we observe that \( L(\Phi ) = K + \ell \le 1 + (L-1)\ell + \ell = 1 + \big ( (L+1) - 1 \big ) \ell \), as well as

$$\begin{aligned} W(\Phi )&= W(\Phi _1) + \big ( W(\Phi _2) - \Vert U_\ell \Vert _{\ell ^0} \big ) + \Vert S_{L+1} \circ U_\ell \Vert _{\ell ^0} \\ ({\scriptstyle {\text {Lemma}~A.3}})&\le m {\overline{W}} + {\overline{N}} w + W(\Phi _2) + \Vert S_{L+1}\Vert _{\ell ^0} \cdot \Vert U_\ell \Vert _{\ell ^{0,\infty }_*} \\&\le m {\overline{W}} + {\overline{N}} w + {\overline{k}} w + m \cdot \Vert S_{L+1}\Vert _{\ell ^0} \le m W + N w. \end{aligned}$$

Finally, we also have \( N(\Phi ) = N(\Phi _1) + {\overline{k}} + N(\Phi _2) \le {\overline{N}} (1 + m) + {\overline{k}} + {\overline{k}} \cdot m = ({\overline{N}} + {\overline{k}}) (1 + m) \le N (1+m) \). \(\square \)

1.7 Proof of Lemma 2.20

Let \( \Psi = \big ( (S_{1},\alpha _{1}),\ldots , (S_{K-1}, \alpha _{K-1}), (S_K,\mathrm {id}_{{\mathbb {R}}^{k}}) \big ) \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L,N} \) be arbitrary and \({g = {\mathtt {R}}(\Psi )}\). We prove that there is some \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}_{W+(s-1)N,1+s(L-1),sN}\) such that \(g = {\mathtt {R}}(\Phi )\). This is easy to see if \(s=1\) or \(K=1\); hence, we now assume \(K \ge 2\) and \(s \ge 2\). Denoting by \(N_{\ell }\) the number of neurons at the \(\ell \)-th layer of \(\Psi \), for \(1 \le \ell \le K-1\), we have \(\alpha _{\ell } = \alpha _{\ell }^{(1)} \otimes \ldots \otimes \alpha _{\ell }^{(N_{\ell })}\) where \(\alpha _{\ell }^{(i)} \in \{\mathrm {id}_{{\mathbb {R}}},\sigma \}\). For \(1 \le \ell \le K-1\), \(1 \le j \le N_{\ell }\), \(1 \le i \le s\), define

$$\begin{aligned} \beta _{s(\ell -1)+i}^{(j)} := {\left\{ \begin{array}{ll} \varrho , &{} \text {if } \alpha _{\ell }^{(j)} = \sigma , \\ \mathrm {id}_{{\mathbb {R}}}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

and let \(\beta _{s(\ell -1)+i} := \beta _{s(\ell -1)+i}^{(1)} \otimes \ldots \otimes \beta _{s(\ell -1)+i}^{(N_{\ell })}\). Define also \(T_{s(\ell -1)+1} := S_{\ell }: {\mathbb {R}}^{N_{\ell -1}} \rightarrow {\mathbb {R}}^{N_{\ell }}\) and \(T_{s(\ell -1)+i} := \mathrm {id}_{{\mathbb {R}}^{N_{\ell }}}\) for \(2 \le i \le s\). It is painless to check that

$$\begin{aligned} \alpha _{\ell } \circ S_{\ell }&= \beta _{s(\ell -1)+s} \circ T_{s(\ell -1)+s} \circ \cdots \circ \beta _{s(\ell -1)+2} \circ T_{s(\ell -1)+2} \circ \beta _{s(\ell -1)+1} \circ T_{s(\ell -1)+1} \\&= \beta _{s\ell } \circ T_{s\ell } \circ \cdots \circ \beta _{s(\ell -1)+1} \circ T_{s(\ell -1)+1} , \end{aligned}$$

and hence,

$$\begin{aligned} g = S_{K} \circ \alpha _{K-1} \circ S_{K-1} \circ \cdots \circ \alpha _{1} \circ S_{1} = S_{K} \circ \beta _{s(K-1)} \circ T_{s(K-1)} \circ \cdots \circ \beta _{1} \circ T_{1} . \end{aligned}$$

That is to say, \(g = {\mathtt {R}}(\Phi )\) with

$$\begin{aligned} \Phi := \big ( (T_{1},\beta _{1}),\ldots ,(T_{s(K-1)},\beta _{s(K-1)}),(S_{K},\mathrm {id}_{{\mathbb {R}}^{k}}) \big )\in & {} {\mathcal {NN}}^{\varrho ,d,k}_{W',1+s(K-1),sN} \\\subset & {} {\mathcal {NN}}^{\varrho ,d,k}_{W', 1 + s(L-1), sN} , \end{aligned}$$

where we compute

$$\begin{aligned} W'&:= \Vert S_{K} \Vert _{\ell ^{0}} + \sum _{j =1}^{s(K-1)} \Vert T_{j} \Vert _{\ell ^{0}} = \Vert S_{K}\Vert _{\ell ^{0}} + \sum _{\ell =1}^{K-1} \sum _{i=1}^{s} \Vert T_{s(\ell -1)+i} \Vert _{\ell ^{0}} \\&= \Vert S_{K} \Vert _{\ell ^{0}} + \sum _{\ell =1}^{K-1} \Big ( \Vert T_{s(\ell -1)+1} \Vert _{\ell ^{0}} + \sum _{i=2}^{s} \Vert T_{s(\ell -1)+i} \Vert _{\ell ^{0}} \Big ) \\&= \Vert S_{K} \Vert _{\ell ^{0}} + \sum _{\ell =1}^{K-1} \left( \Vert S_\ell \Vert _{\ell ^{0}} + (s-1) N_{\ell } \right) = \sum _{\ell =1}^{K} \Vert S_\ell \Vert _{\ell ^{0}} + (s-1) \sum _{\ell =1}^{K-1} N_{\ell } \\&= W(\Psi ) + (s-1) N(\Psi ) \le W+(s-1)N . \end{aligned}$$

We conclude as claimed that \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}_{W+(s-1)N,1+s(L-1),sN}\). Finally, if \(\Psi \) is strict, then so is \(\Phi \). \(\square \)

1.8 Proof of Lemma 2.21

For \(f \in {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N}\), there is \( \Phi = \big ( (S_{1},\alpha _{1}),\ldots ,(S_{L'},\alpha _{L'}) \big ) \in {\mathcal {NN}}^{\sigma ,d,k}_{W,L',N} \) with \(L(\Phi ) = L' \le L\) and such that \(f = {\mathtt {R}}(\Phi )\). Replace each occurrence of the activation function \(\sigma \) by \(\sigma _{h}\) in the nonlinearities \(\alpha _{j}\) to define a \(\sigma _{h}\)-network \(\Phi _{h} := \big ( (S_{1},\alpha _{1}^{(h)}),\ldots ,(S_{L'},\alpha _{L'}^{(h)}) \big ) \in {\mathcal {NN}}^{\sigma _{h},d,k}_{W,L',N}\) and its realization \(f_{h} := {\mathtt {R}}(\Phi _{h})\in {\mathtt {NN}}^{\sigma _{h},d,k}_{W,L',N}\). Since \(\sigma \) is continuous and \(\sigma _{h} \rightarrow \sigma \) locally uniformly on \({\mathbb {R}}\) as \(h \rightarrow 0\), we get by Lemma A.7 (which is proved independently below) that \(f_{h} \rightarrow f\) locally uniformly on \({\mathbb {R}}^{d}\). To conclude for \(\ell =2\), observe that \(\sigma _{h} = {\mathtt {R}}(\Psi _{h})\) with \(\Psi _{h} \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m}\) and \(L(\Psi _{h}) = \ell \), whence Lemma 2.19 yields

$$\begin{aligned} f_{h} \in {\mathtt {NN}}^{\sigma _{h},d,k}_{W,L',N} \subset {\mathtt {NN}}^{\varrho ,d,k}_{Wm^{2},L',Nm} \subset {\mathtt {NN}}^{\varrho ,d,k}_{Wm^{2},L,Nm} . \end{aligned}$$

For arbitrary \(\ell \), we similarly conclude that

$$\begin{aligned}&f_{h} \in {\mathtt {NN}}^{\sigma _{h},d,k}_{W,L',N} \subset {\mathtt {NN}}^{\varrho ,d,k}_{W+Nw,1+(L'-1)(\ell +1),N(2+m)} \subset {\mathtt {NN}}^{\varrho ,d,k}_{W+Nw,1+(L-1)(\ell +1),N(2+m)} . \quad \square \end{aligned}$$

1.9 Proof of Lemmas 2.22 and 2.25

In this section, we provide a unified proof for Lemmas 2.22 and 2.25. To be able to handle both claims simultaneously, the following concept will be important.

Definition A.5

For each \(d,k \in {\mathbb {N}}\), let us fix a subset \({\mathcal {G}}_{d,k} \subset \{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\) and a topology \({\mathcal {T}}_{d,k}\) on the space of all functions \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\). Let \({\mathcal {G}}:= ({\mathcal {G}}_{d,k})_{d,k \in {\mathbb {N}}}\) and \({\mathcal {T}}:= ({\mathcal {T}}_{d,k})_{d,k \in {\mathbb {N}}}\). The tuple \(({\mathcal {G}},{\mathcal {T}})\) is called a network compatible topology family if it satisfies the following:

  1. (1)

    We have \(\{ T : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \,\mid \, T \text { affine-linear} \} \subset {\mathcal {G}}_{d,k}\) for all \(d,k \in {\mathbb {N}}\).

  2. (2)

    If \(p \in {\mathbb {N}}\) and for each \(i \in \{1,\dots ,p\}\), we are given a sequence \((f_i^{(n)})_{n \in {\mathbb {N}}_0}\) of functions \(f_i^{(n)} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) satisfying \(f_i^{(0)} \in {\mathcal {G}}_{1,1}\) and \(f_i^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{1,1}} f_i^{(0)}\), then \( f_1^{(n)} \otimes \cdots \otimes f_p^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{p,p}} f_1^{(0)} \otimes \cdots \otimes f_p^{(0)} \) and \(f_1^{(0)} \otimes \cdots \otimes f_p^{(0)} \in {\mathcal {G}}_{p,p}\).

  3. (3)

    If \(f_n : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) and \(g_n : {\mathbb {R}}^k \rightarrow {\mathbb {R}}^\ell \) for all \(n \in {\mathbb {N}}_0\) and if \(f_0 \in {\mathcal {G}}_{d,k}\) and \(g_0 \in {\mathcal {G}}_{k,\ell }\) as well as \(f_n \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d,k}} f_0\) and \(g_n \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{k,\ell }} g_0\), then \(g_0 \circ f_0 \in {\mathcal {G}}_{d,\ell }\) and \(g_n \circ f_n \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d,\ell }} g_0 \circ f_0\). \(\blacktriangleleft \)

Remark

Roughly speaking, the above definition introduces certain topologies \({\mathcal {T}}_{d,k}\) and certain sets of “good functions” \({\mathcal {G}}_{d,k}\) such that—for limit functions that are “good”—convergence in the topology is compatible with taking \(\otimes \)-products and with composition.

By induction, it is easy to see that if \(p \in {\mathbb {N}}\) and if for each \(i \in \{1,\dots ,p\}\), we are given a sequence \((f_i^{(n)})_{n \in {\mathbb {N}}_0}\) with \(f_i^{(n)} : {\mathbb {R}}^{d_{i-1}} \rightarrow {\mathbb {R}}^{d_i}\) and \(f_i^{(0)} \in {\mathcal {G}}_{d_{i-1}, d_i}\) as well as \(f_i^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_{i-1},d_i}} f_i^{(0)}\), then also \(f_p^{(0)} \circ \cdots \circ f_1^{(0)} \in {\mathcal {G}}_{d_0, d_p}\), as well as \( f_p^{(n)} \circ \cdots \circ f_1^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_0,d_p}} f_p^{(0)} \circ \cdots \circ f_1^{(0)} \). Indeed, the base case of the induction is contained in Definition A.5. Now, assuming that the claim holds for \(p \in {\mathbb {N}}\), we prove it for \(p+1\). To this end, let \(F_1^{(n)} := f_p^{(n)} \circ \cdots \circ f_1^{(n)}\) and \(F_2^{(n)} := f_{p+1}^{(n)}\). By induction, we know \(F_1^{(0)} \in {\mathcal {G}}_{d_0, d_p}\) and \(F_1^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_0,d_p}} F_1^{(0)}\). Since also \(F_2^{(0)} = f_{p+1}^{(0)} \in {\mathcal {G}}_{d_p, d_{p+1}}\), Definition A.5 implies \(F_2^{(0)} \circ F_1^{(0)} \in {\mathcal {G}}_{d_0, d_{p+1}}\) and \(F_2^{(n)} \circ F_1^{(n)} \xrightarrow [n\rightarrow \infty ]{{\mathcal {T}}_{d_0, d_{p+1}}} F_2^{(0)} \circ F_1^{(0)}\), which is precisely the claim for \(p+1\) instead of p. \(\blacklozenge \)

We now have the following important result:

Proposition A.6

Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), and let \(({\mathcal {G}}, {\mathcal {T}})\) be a network compatible topology family satisfying the following

  • \(\varrho \in {\mathcal {G}}_{1,1}\);

  • There is some \(n \in {\mathbb {N}}\) such that for each \(m \in {\mathbb {N}}\), there are affine-linear maps \(E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^n\) and \(D_m : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) such that \(F_m := D_m \circ (\varrho \otimes \cdots \otimes \varrho ) \circ E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}\) satisfies \(F_m \xrightarrow [m\rightarrow \infty ]{{\mathcal {T}}_{1,1}} \mathrm {id}_{{\mathbb {R}}}\).

Then, we have for arbitrary \(d,k \in {\mathbb {N}}\), \(W,N \in {\mathbb {N}}_0 \cup \{\infty \}\) and \(L \in {\mathbb {N}}\cup \{\infty \}\) the inclusion

$$\begin{aligned} {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset \overline{{\mathtt {SNN}}^{\varrho ,d,k}_{n^2 W, L, n N}} , \end{aligned}$$

where the closure is a sequential closure which is taken with respect to the topology \({\mathcal {T}}_{d,k}\). \(\blacktriangleleft \)

Remark

Before we give the proof of Proposition A.6, we explain a convention that will be used in the proof. Precisely, in the definition of \(W(\Phi )\), we always assume that the affine-linear maps \(T_\ell \) are of the form \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\). Clearly, the expressivity of networks will not change if instead of the spaces \({\mathbb {R}}^{N_1},\dots , {\mathbb {R}}^{N_{L - 1}}\), one uses finite-dimensional vector spaces \(V_1, \dots , V_{L-1}\) with \(\dim V_i = N_i\). The only non-trivial question is the interpretation of \(\Vert T_\ell \Vert _{\ell ^0}\) for an affine-linear map \(T_\ell : V_{\ell - 1} \rightarrow V_\ell \), since for the case of \({\mathbb {R}}^{N_{\ell }}\), we chose the standard basis for obtaining the matrix representation of \(T_\ell \), while for general vector spaces \(V_\ell \), there is no such canonical choice of basis. Yet, in the proof below, we will consider the case \(V_\ell = {\mathbb {R}}^{n_1} \times \cdots \times {\mathbb {R}}^{n_{m}}\). In this case, there is a canonical way of identifying \(V_\ell \) with \({\mathbb {R}}^{N_\ell }\) for \(N_\ell = \sum _{j=1}^m n_j\), and there is also a canonical choice of “standard basis” in the space \(V_\ell \). We will use this convention in the proof below to simplify the notation. \(\blacklozenge \)

Proof of Proposition A.6

Let \(\Phi \in {\mathcal {NN}}_{W,L,N}^{\varrho ,d,k}\). We will construct a sequence \((\Phi _m)_{m \in {\mathbb {N}}} \subset {\mathcal {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k}\) satisfying \({\mathtt {R}}(\Phi _m) \xrightarrow [m\rightarrow \infty ]{{\mathcal {T}}_{d,k}} {\mathtt {R}}(\Phi )\). To this end, note that \(\Phi = \big ( (T_1, \alpha _1), \dots , (T_K, \alpha _K) \big )\) for some \(K \le L\) and that there are \(N_0, \dots , N_K \in {\mathbb {N}}\) (with \(N_0 = d\) and \(N_K = k\)) such that \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is affine-linear for each \(\ell \in \{1,\dots ,K\}\).

Let us first consider the special case \(K = 1\). By definition of a neural network, we have \(\alpha _K = \mathrm {id}_{{\mathbb {R}}^k}\), so that \(\Phi \) is already a strict \(\varrho \)-network. Therefore, we can choose \( \Phi _m := \Phi \in {\mathcal {SNN}}_{W,L,N}^{\varrho ,d,k} \subset {\mathcal {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k} \) for all \(m \in {\mathbb {N}}\).

From now on, we assume \(K \ge 2\). For brevity, set \(\varrho _1 := \varrho \) and \(\varrho _2 := \mathrm {id}_{{\mathbb {R}}}\), as well as \(D(1) := 1\) and \(D(2) := n\), and furthermore,

$$\begin{aligned}&E_1^{(m)} \! := \! \mathrm {id}_{{\mathbb {R}}} : {\mathbb {R}}\rightarrow {\mathbb {R}}^{D(1)} \quad \text {and} \quad&E_2^{(m)} \! := \! E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^{D(2)} , \\ \text {as well as} \quad&D_1^{(m)} \! := \! \mathrm {id}_{{\mathbb {R}}} : {\mathbb {R}}^{D(1)} \rightarrow {\mathbb {R}}\quad \text {and} \quad&D_2^{(m)} \! := \! D_m : {\mathbb {R}}^{D(2)} \rightarrow {\mathbb {R}}. \end{aligned}$$

By definition of a generalized \(\varrho \)-network, for each \(\ell \in \{1,\dots ,K\}\) there are \(\iota _1^{(\ell )}, \dots , \iota _{N_\ell }^{(\ell )} \in \{1,2\}\) with \(\alpha _\ell = \varrho _{\iota _1^{(\ell )}} \otimes \cdots \otimes \varrho _{\iota _{N_\ell }^{(\ell )}}\), and with \(\iota _j^{(K)} = 2\) for all \(j \in \{1,\dots ,N_K\}\). Now, define \(V_0 := {\mathbb {R}}^d ={\mathbb {R}}^{N_{0}}\), \(V_K := {\mathbb {R}}^k = {\mathbb {R}}^{N_K}\), and

$$\begin{aligned} V_\ell := {\mathbb {R}}^{D(\iota _1^{(\ell )})} \times \cdots \times {\mathbb {R}}^{D(\iota _{N_\ell }^{(\ell )})} \cong {\mathbb {R}}^{\sum _{i=1}^{N_{\ell }} D(\iota _{i}^{(\ell )})} \quad \text {for} \quad 1 \le \ell \le K-1. \end{aligned}$$

Since we eventually want to obtain strict networks \(\Phi _m\), furthermore set

$$\begin{aligned} \beta ^{(1)} := \varrho : {\mathbb {R}}^{D(1)} \rightarrow {\mathbb {R}}^{D(1)} \qquad \text {and} \qquad \beta ^{(2)} := \varrho \otimes \cdots \otimes \varrho : {\mathbb {R}}^{D(2)} \rightarrow {\mathbb {R}}^{D(2)} . \end{aligned}$$

Using these maps, finally define \(\beta _K := \mathrm {id}_{{\mathbb {R}}^k}\), as well as

$$\begin{aligned} \beta _\ell := \beta ^{(\iota _1^{(\ell )})} \otimes \cdots \otimes \beta ^{(\iota _{N_\ell }^{(\ell )})} : V_{\ell } \rightarrow V_{\ell } \quad \text {for} \quad 1 \le \ell \le K-1 . \end{aligned}$$

Finally, for \(\ell \in \{1,\dots ,K\}\) and \(m \in {\mathbb {N}}\), define affine-linear maps

$$\begin{aligned} P_\ell ^{(m)}&:= E_{\iota _1^{(\ell )}}^{(m)} \otimes \cdots \otimes E_{\iota _{N_\ell }^{(\ell )}}^{(m)} : {\mathbb {R}}^{N_{\ell }} \rightarrow V_\ell \qquad \text {and} \qquad \\ Q_\ell ^{(m)}&:= D_{\iota _1^{(\ell )}}^{(m)} \otimes \cdots \otimes D_{\iota _{N_\ell }^{(\ell )}}^{(m)} : V_\ell \rightarrow {\mathbb {R}}^{N_\ell } . \end{aligned}$$

The crucial observation is that by assumption regarding the maps \(D_m, E_m\), we have

$$\begin{aligned} \begin{aligned}&D_2^{(m)} \circ \beta ^{(2)} \circ E_2^{(m)} = F_m \xrightarrow [m \rightarrow \infty ]{{\mathcal {T}}_{1,1}} \mathrm {id}_{{\mathbb {R}}} = \varrho _2 , \\ \text {and} \quad&D_1^{(m)} \circ \beta ^{(1)} \circ E_1^{(m)} = \mathrm {id}_{{\mathbb {R}}} \circ \varrho \circ \mathrm {id}_{{\mathbb {R}}} = \varrho = \varrho _1 . \end{aligned} \end{aligned}$$
(A.3)

Finally, for the construction of the strict networks \(\Phi _m\), we define for \(m \in {\mathbb {N}}\) the affine-linear maps

$$\begin{aligned} S_1^{(m)} := P_1^{(m)} \circ T_1 , \qquad S_K^{(m)} := T_K \circ Q_{K-1}^{(m)} , \qquad \text {and} \qquad S_\ell ^{(m)} := P_\ell ^{(m)} \circ T_\ell \circ Q_{\ell -1}^{(m)} \quad \text {for } 2 \le \ell \le K-1 , \end{aligned}$$

and then set \(\Phi _m := \big ( (S_1^{(m)}, \beta _1), \dots , (S_K^{(m)}, \beta _K) \big )\). Because of \(D(\iota _{i}^{(\ell )}) \in \{1,n\}\), we obtain

$$\begin{aligned} N(\Phi _m) = \sum _{\ell = 1}^{K-1} \dim V_\ell = \sum _{\ell =1}^{K-1} \sum _{i=1}^{N_\ell } D(\iota _i^{(\ell )}) \le \sum _{\ell =1}^{K-1} n N_\ell = n N(\Phi ) \le n N . \end{aligned}$$

Furthermore, by the second part of Lemma A.3 and in view of the product structure of \(P_\ell ^{(m)}\), we have

$$\begin{aligned} \Vert P_\ell ^{(m)} \Vert _{\ell ^{0,\infty }} \le \max \big \{ \Vert E^{(m)}_1\Vert _{\ell ^{0,\infty }}, \Vert E^{(m)}_2\Vert _{\ell ^{0,\infty }} \big \} \le \max \{ D(1), D(2) \} \le n , \end{aligned}$$

for arbitrary \(\ell \in \{1,\dots ,K\}\), simply because \(E^{(m)}_j : {\mathbb {R}}\rightarrow {\mathbb {R}}^{D(j)}\) for \(j \in \{1,2\}\). Likewise,

$$\begin{aligned} \Vert Q_{\ell }^{(m)} \Vert _{\ell ^{0,\infty }_*} \le \max \big \{ \Vert D^{(m)}_1 \Vert _{\ell ^{0,\infty }_{*}}, \Vert D^{(m)}_2 \Vert _{\ell ^{0,\infty }_{*}} \big \} \le \max \{ D(1), D(2) \} \le n , \end{aligned}$$

because \(D_j^{(m)} : {\mathbb {R}}^{D(j)} \rightarrow {\mathbb {R}}\) for \(j \in \{1,2\}\). By the first part of Lemma A.3, we thus see for \(2 \le \ell \le K - 1\) that

$$\begin{aligned} \Vert S_\ell ^{(m)} \Vert _{\ell ^0} \le \Vert P_\ell ^{(m)}\Vert _{\ell ^{0,\infty }} \cdot \Vert T_\ell \Vert _{\ell ^0} \cdot \Vert Q_{\ell - 1}^{(m)} \Vert _{\ell ^{0,\infty }_*} \le n^2 \cdot \Vert T_\ell \Vert _{\ell ^0} . \end{aligned}$$

Similar arguments yield \(\Vert S_1^{(m)}\Vert _{\ell ^0} \le n \cdot \Vert T_1\Vert _{\ell ^0} \le n^2 \cdot \Vert T_1\Vert _{\ell ^0}\) and \(\Vert S_K^{(m)}\Vert _{\ell ^0} \le n \cdot \Vert T_K\Vert _{\ell ^0} \le n^2 \cdot \Vert T_K\Vert _{\ell ^0}\). All in all, this implies \(W(\Phi _m) \le n^2 \cdot W(\Phi ) \le n^2 W\), as desired.

Now, since \(\varrho _1 = \varrho \in {\mathcal {G}}_{1,1}\) by the assumptions of the current proposition, since \(\varrho _2 = \mathrm {id}_{{\mathbb {R}}} \in {\mathcal {G}}_{1,1}\) as an affine-linear map, and since \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family, we see for all \(1 \le \ell \le K - 1\) that \( \alpha _\ell = \varrho _{\iota _1^{(\ell )}} \otimes \cdots \otimes \varrho _{\iota _{N_\ell }^{(\ell )}} \in {\mathcal {G}}_{N_\ell ,N_\ell } \) and furthermore that

$$\begin{aligned} \begin{aligned}&Q_\ell ^{(m)} \circ \beta _\ell \circ P_\ell ^{(m)} = \left( D_{\iota _1^{(\ell )}}^{(m)} \circ \beta ^{(\iota _1^{(\ell )})} \circ E_{\iota _1^{(\ell )}}^{(m)} \right) \otimes \cdots \otimes \left( D_{\iota _{N_\ell }^{(\ell )}}^{(m)} \circ \beta ^{(\iota _{N_\ell }^{(\ell )})} \circ E_{\iota _{N_\ell }^{(\ell )}}^{(m)} \right) \\&\quad ({\scriptstyle {\text {Eq.}~(\mathrm{A.3}) \text { and compatibility of } ({\mathcal {G}},{\mathcal {T}}) \text { with }\otimes }}) \xrightarrow [m\rightarrow \infty ]{{\mathcal {T}}_{N_\ell , N_\ell }} \varrho _{\iota _1^{(\ell )}} \otimes \cdots \otimes \varrho _{\iota _{N_\ell }^{(\ell )}} = \alpha _\ell . \end{aligned} \end{aligned}$$
(A.4)

Finally, since \(\beta _K = \mathrm {id}_{{\mathbb {R}}^k} = \alpha _K \in {\mathcal {G}}_{k, k}\), and since \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family and thus compatible with compositions (as long as the “factors” of the limit are “good,” which is satisfied here, since \(\alpha _\ell \in {\mathcal {G}}_{N_\ell , N_\ell }\) as we just saw and since \(T_\ell \in {\mathcal {G}}_{N_{\ell - 1}, N_\ell }\) as an affine-linear map), we see that

$$\begin{aligned} {\mathtt {R}}(\Phi _m)&= \beta _K \circ S_K^{(m)} \circ \cdots \circ \beta _1 \circ S_1^{(m)} \\&= \alpha _K \circ T_K \circ (Q_{K-1}^{(m)} \circ \beta _{K-1} \circ P_{K-1}^{(m)}) \circ T_{K-1} \circ \cdots \circ (Q_1^{(m)} \circ \beta _1 \circ P_1^{(m)}) \circ T_1 \\ ({\scriptstyle {\text {Eq.}~(\mathrm{A.4})}})&\xrightarrow [m\rightarrow \infty ]{{\mathcal {T}}_{d,k}} \alpha _K \circ T_K \circ \alpha _{K-1} \circ T_{K-1} \circ \cdots \circ \alpha _1 \circ T_1 = {\mathtt {R}}(\Phi ) , \end{aligned}$$

and hence \({\mathtt {R}}(\Phi ) \in \overline{{\mathtt {SNN}}^{\varrho ,d,k}_{n^2 W, L, n N}}\). \(\square \)

Now, we use Proposition A.6 to prove Lemma 2.25.

Proof of Lemma 2.25

For \(d,k \in {\mathbb {N}}\), let \({\mathcal {G}}_{d,k} := \{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\), and let \({\mathcal {T}}_{d,k} = 2^{{\mathcal {G}}_{d,k}}\) be the discrete topology on the set \(\{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\). This means that every set is open, so that the only convergent sequences are those that are eventually constant. It is easy to see that \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family and \(\varrho \in {\mathcal {G}}_{1,1}\).

Finally, by assumption of Lemma 2.25, there are \(a_i, b_i, c_i \in {\mathbb {R}}\) for \(i \in \{1,\dots ,n\}\) and some \(c \in {\mathbb {R}}\) such that \(x = c + \sum _{i=1}^n a_i \, \varrho (b_i \, x + c_i)\) for all \(x \in {\mathbb {R}}\). If we define \(E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^n, x \mapsto (b_1 \, x + c_1, \dots , b_n \, x + c_n)\) and \(D_m : {\mathbb {R}}^n \rightarrow {\mathbb {R}}, y \mapsto c + \sum _{i=1}^n a_i \, y_i\), then \(E_m, D_m\) are affine-linear, and \(\mathrm {id}_{{\mathbb {R}}} = D_m \circ (\varrho \otimes \cdots \otimes \varrho ) \circ E_m\) for all \(m \in {\mathbb {N}}\). Thus, all assumptions of Proposition A.6 are satisfied, so that this proposition implies \( {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset \overline{{\mathtt {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k}} = {\mathtt {SNN}}_{n^2 W, L, nN}^{\varrho ,d,k} \) for all \(d,k \in {\mathbb {N}}\), \(W,N \in {\mathbb {N}}_0 \cup \{\infty \}\) and \(L \in {\mathbb {N}}\cup \{\infty \}\). Here, we used that the (sequential) closure of a set M with respect to the discrete topology is simply the set M itself. \(\square \)
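For instance, the ReLU activation \(\varrho (x) = \max \{0, x\}\) satisfies this assumption with \(n = 2\) and \(c = 0\), since

$$\begin{aligned} x = 1 \cdot \varrho (1 \cdot x + 0) + (-1) \cdot \varrho \big ( (-1) \cdot x + 0 \big ) = \varrho (x) - \varrho (-x) \qquad \text {for all } x \in {\mathbb {R}}, \end{aligned}$$

so that in this case Lemma 2.25 yields \( {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset {\mathtt {SNN}}^{\varrho ,d,k}_{4 W, L, 2 N} \).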

Finally, we will use Proposition A.6 to provide a proof of Lemma 2.22. To this end, the following lemma is essential.

Lemma A.7

Let \((f_n)_{n \in {\mathbb {N}}_0}\) and \((g_n)_{n \in {\mathbb {N}}_0}\) be sequences of functions \(f_n : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) and \(g_n : {\mathbb {R}}^k \rightarrow {\mathbb {R}}^\ell \). Assume that \(f_0, g_0\) are continuous and that \(f_n \xrightarrow [n\rightarrow \infty ]{} f_0\) and \(g_n \xrightarrow [n\rightarrow \infty ]{} g_0\) with locally uniform convergence. Then, \(g_0 \circ f_0\) is continuous, and \(g_n \circ f_n \xrightarrow [n\rightarrow \infty ]{} g_0 \circ f_0\) with locally uniform convergence.\(\blacktriangleleft \)

Proof

Locally uniform convergence on \({\mathbb {R}}^d\) is equivalent to uniform convergence on bounded sets. Furthermore, the continuous function \(f_0\) is bounded on each bounded set \(K \subset {\mathbb {R}}^d\); by uniform convergence, this implies that \(K' := \{ f_0(x) :x \in K \} \cup \{ f_n (x) :n \in {\mathbb {N}}\text { and } x \in K \} \subset {\mathbb {R}}^k\) is bounded as well. Hence, the continuous function \(g_0\) is uniformly continuous on \(K'\). From these observations, the claim follows easily; the details are left to the reader. \(\square \)

Given this auxiliary result, we can now prove Lemma 2.22.

Proof of Lemma 2.22

For \(d,k \in {\mathbb {N}}\), define \({\mathcal {G}}_{d,k} := \{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \,\mid \, f \text { continuous} \}\), and let \({\mathcal {T}}_{d,k}\) denote the topology of locally uniform convergence on \(\{ f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k \}\). We claim that \(({\mathcal {G}},{\mathcal {T}})\) is a network compatible topology family. Indeed, the first condition in Definition A.5 is trivial, and the third condition holds thanks to Lemma A.7. Finally, it is not hard to see that if \(f_i^{(n)} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) satisfy \(f_i^{(n)} \rightarrow f_i^{(0)}\) locally uniformly for all \(i \in \{1,\dots ,p\}\), then \( f_1^{(n)} \otimes \cdots \otimes f_p^{(n)} \xrightarrow [n\rightarrow \infty ]{} f_1^{(0)} \otimes \cdots \otimes f_p^{(0)} \) locally uniformly. This proves the second condition in Definition A.5.

We want to apply Proposition A.6 with \(n = 2\). We have \(\varrho \in {\mathcal {G}}_{1,1}\), since \(\varrho \) is continuous by the assumptions of Lemma 2.22. Thus, it remains to construct sequences \((E_m)_{m \in {\mathbb {N}}}, (D_m)_{m \in {\mathbb {N}}}\) of affine-linear maps \(E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^2\) and \(D_m : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}\) such that \(D_m \circ (\varrho \otimes \varrho ) \circ E_m \rightarrow \mathrm {id}_{{\mathbb {R}}}\) with locally uniform convergence. Once these are constructed, Proposition A.6 shows that \({\mathtt {NN}}^{\varrho ,d,k}_{W,L,N} \subset \overline{{\mathtt {SNN}}^{\varrho ,d,k}_{4W, L, 2N}}\), where the closure is with respect to locally uniform convergence. This is precisely what is claimed in Lemma 2.22.

To construct \(E_m, D_m\), let us set \(a := \varrho ' (x_0) \ne 0\). By definition of the derivative, for arbitrary \(m \in {\mathbb {N}}\) and \(\varepsilon _m := |a|/m\), there is some \(\delta _m > 0\) satisfying

$$\begin{aligned} \left| \big ( \varrho (x_0 + h) - \varrho (x_0) \big ) / h - a \right| \le \varepsilon _m = |a| / m \qquad \forall \, \, h \in {\mathbb {R}}\text { with } 0 < |h| \le \delta _m . \nonumber \\ \end{aligned}$$
(A.5)

Now, define affine-linear maps

$$\begin{aligned}&E_m : {\mathbb {R}}\rightarrow {\mathbb {R}}^2, x \mapsto \left( x_0 + m^{-1/2} \cdot \delta _m \cdot x , \, x_0 \right) ^T \\&\quad \text {and} \quad D_m : {\mathbb {R}}^2\rightarrow {\mathbb {R}}, (y_1, y_2) \mapsto \sqrt{m} \cdot (y_1 - y_2) / (a \cdot \delta _m) , \end{aligned}$$

and set \(F_m := D_m \circ (\varrho \otimes \varrho ) \circ E_m\).

Finally, let \(x \in {\mathbb {R}}\) be arbitrary with \(0 < |x| \le \sqrt{m}\), and set \(h := \delta _m \cdot x / \sqrt{m}\), so that \(0 < |h| \le \delta _m\). By multiplying Eq. (A.5) with |h|/|a|, we then get

$$\begin{aligned}&\left| a^{-1} \cdot \big ( \varrho (x_0 + h) - \varrho (x_0) \big ) - h \right| \le \frac{|h|}{m} \\&\quad ({\scriptstyle {\text {multiply by } \sqrt{m} / \delta _m}}) \Longrightarrow \left| \frac{\sqrt{m}}{a \cdot \delta _m} \left( \varrho \left( x_0 + \frac{\delta _m \cdot x}{\sqrt{m}} \right) - \varrho (x_0) \right) - x \right| \\&\quad \le \frac{|h|}{\delta _m \cdot \sqrt{m}} = \frac{|x|}{m} \le \frac{1}{\sqrt{m}} , \end{aligned}$$

where the last step used that \(|x| \le \sqrt{m}\). This estimate is trivially valid for \(x = 0\). Put differently, we have thus shown \(|F_m (x) - x|\le 1/\sqrt{m}\) for all \(x \in {\mathbb {R}}\) with \(|x| \le \sqrt{m}\). That is, \(F_m \xrightarrow [m\rightarrow \infty ]{} \mathrm {id}_{{\mathbb {R}}}\) with locally uniform convergence. \(\square \)

1.10 Proof of Lemma 2.24

We will need the following lemma that will also be used elsewhere.

Lemma A.8

For \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) and \(a \in {\mathbb {R}}\), let \(T_a f : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto T_a f (x) = f(x-a)\). Furthermore, for \(n \in {\mathbb {N}}_0\), let \(X^n : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto x^n\) and \(V_n := \mathrm {span} \{T_a X^n \, :\, a \in {\mathbb {R}}\}\), with the convention \(X^0 \equiv 1\).

We have \(V_n = {\mathbb {R}}_{\deg \le n}[x]\); that is, \(V_n\) is the space of all polynomials of degree at most n.\(\blacktriangleleft \)

Proof

Clearly, \(V_n \subset {\mathbb {R}}_{\deg \le n} [x] =: V\), where \(\dim V = n+1\). Therefore, it suffices to show that \(V_n\) contains \(n+1\) linearly independent elements. In fact, we show that whenever \(a_1,\dots ,a_{n+1} \in {\mathbb {R}}\) are pairwise distinct, then the family \((T_{a_i} X^n)_{i=1,\dots ,n+1} \subset V_n\) is linearly independent.

To see this, suppose that \(\theta _1,\dots ,\theta _{n+1} \in {\mathbb {R}}\) are such that \(0 \equiv \sum _{i=1}^{n+1} \theta _i \, T_{a_i} X^n\). A direct computation using the binomial theorem shows that this implies \( 0 \equiv \sum _{\ell =0}^n \big [ \left( {\begin{array}{c}n\\ \ell \end{array}}\right) (-1)^\ell X^{n-\ell } \sum _{i=1}^{n+1} \theta _i a_i^\ell \big ] \).

By comparing the coefficients of \(X^{n-\ell }\) for \(\ell = 0,\dots ,n\), this leads to \(0 = \big ( \sum _{i=1}^{n+1} a_i^\ell \, \theta _i \big )_{\ell =0,\dots ,n} = A^T \theta \), where \(\theta = (\theta _1,\dots ,\theta _{n+1}) \in {\mathbb {R}}^{n+1}\), and where the Vandermonde matrix \(A := (a_i^j)_{i=1,\dots ,n+1, j=0,\dots ,n} \in {\mathbb {R}}^{(n+1) \times (n+1)}\) is invertible; see [34, Equation (4-15)]. Hence, \(\theta = 0\), showing that \((T_{a_i} X^n)_{i=1,\dots ,n+1}\) is a linearly independent family. \(\square \)
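For instance, in the simplest case \(n = 1\) of Lemma A.8, any two translates \(T_{a_1} X^1, T_{a_2} X^1\) with \(a_1 \ne a_2\) already span \({\mathbb {R}}_{\deg \le 1}[x]\), since

$$\begin{aligned} (T_{a_1} X^1)(x) - (T_{a_2} X^1)(x) = (x - a_1) - (x - a_2) = a_2 - a_1 \ne 0 , \end{aligned}$$

so that \(V_1\) contains a nonzero constant function as well as \(x \mapsto x - a_1\), and hence all polynomials of degree at most one.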

Proof of Lemma 2.24

First, note

$$\begin{aligned}&\varrho _r (x) + (-1)^r \, \varrho _r (-x) \nonumber \\&\quad = {\left\{ \begin{array}{ll} \varrho _r (x) = (x_+)^r = x^r , &{} \text {if } x \ge 0 \, \\ (-1)^r \varrho _r (-x) = (-1)^r [(-x)_+]^r = (-1)^r (-x)^r = x^r, &{} \text {if } x < 0 . \end{array}\right. } \end{aligned}$$
(A.6)

Next, Lemma A.8 shows that \(V_r = {\mathbb {R}}_{\deg \le r}[x]\) has dimension \(r+1\). Thus, given any polynomial \({f \in {\mathbb {R}}_{\deg \le r}[x]}\), there are \(a_1, \dots , a_{r+1} \in {\mathbb {R}}\) and \(b_1, \dots , b_{r+1} \in {\mathbb {R}}\) such that for all \(x \in {\mathbb {R}}\)

$$\begin{aligned} f(x) = \sum _{\ell = 1}^{r+1} a_\ell \cdot (T_{b_\ell } X^r)(x) {\mathop {=}\limits ^{ (\mathrm{A.6})}} \sum _{\ell = 1}^{r+1} a_\ell \cdot [\varrho _r (x - b_\ell ) + (-1)^r \varrho _r \big ( - ({x - b_\ell }) \big )]. \end{aligned}$$
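For instance, for \(r = 1\) (that is, for the ReLU \(\varrho _1\)) and \(f = X^1\), one can take \(a_1 = 1\), \(b_1 = 0\), and \(a_2 = 0\) (with \(b_2\) arbitrary), which yields

$$\begin{aligned} x = 1 \cdot \big [ \varrho _1 (x - 0) + (-1)^1 \, \varrho _1 \big ( -(x - 0) \big ) \big ] = \varrho _1 (x) - \varrho _1 (-x) \qquad \text {for all } x \in {\mathbb {R}}. \end{aligned}$$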

\(\square \)

1.11 Proof of Lemma 2.26

For Part (1), define \(w_{j} := 6n(2^{j}-1)\) and \(m_{j} := (2n+1)(2^j-1)-1\). We will prove below by induction on \(j \in {\mathbb {N}}\) that \(M_{2^{j}} \in {\mathtt {NN}}^{\varrho ,2^{j},1}_{w_j,2j,m_j}\). Let us see first that this implies the result. For arbitrary \(d \in {\mathbb {N}}_{\ge 2}\) and \(j = \lceil \log _{2} d \rceil \), it is not hard to see that

$$\begin{aligned} P: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{2^{j}}, x \mapsto (x,1_{2^{j}-d}) = (x,0_{2^{j}-d}) + (0_{d},1_{2^{j}-d}) \end{aligned}$$

is affine-linear with \(\Vert P\Vert _{\ell ^{0,\infty }_*}=1\) [cf. Eq. (2.4)] and that \(M_{d} = M_{2^{j}} \circ P\). Using Lemma 2.18-(1) we get \(M_{d} \in {\mathtt {NN}}^{\varrho ,d,1}_{w_j,2j,m_j}\) as claimed.
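For instance, for \(d = 3\) we have \(j = \lceil \log _2 3 \rceil = 2\), and the padding map takes the explicit form

$$\begin{aligned} P : {\mathbb {R}}^3 \rightarrow {\mathbb {R}}^4 , \quad x \mapsto (x_1, x_2, x_3, 1) , \qquad \text {so that} \qquad M_4 \big ( P(x) \big ) = x_1 \cdot x_2 \cdot x_3 \cdot 1 = M_3 (x) . \end{aligned}$$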

We now proceed to the induction. As a preliminary, note that by assumption there are \(a \in {\mathbb {R}}\), \(\alpha _1, \dots , \alpha _n \in {\mathbb {R}}\) and \(\beta _1, \dots , \beta _n \in {\mathbb {R}}\) such that for all \(x \in {\mathbb {R}}\)

$$\begin{aligned} x^2 = a + \sum _{\ell = 1}^{n} \beta _\ell \, \varrho (x - \alpha _\ell ). \end{aligned}$$

Put differently, the affine-linear maps \(T_1 : {\mathbb {R}}\rightarrow {\mathbb {R}}^{n}, x \mapsto (x-\alpha _{\ell })_{\ell =1}^{n}\) and \({T_2 : {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}, y \mapsto a + \sum _{\ell =1}^{n} \beta _{\ell } \, y_{\ell }}\) satisfy \(x^2 = T_2 \circ (\varrho \otimes \cdots \otimes \varrho ) \circ T_1 (x)\) for all \(x \in {\mathbb {R}}\), where the \(\otimes \)-product has n factors. Since \({x \cdot y = \tfrac{1}{4} \big ( (x+y)^2 - (x-y)^2 \big )}\) for all \(x,y \in {\mathbb {R}}\), if we define the maps \({T_0 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}^2, (x,y) \mapsto (x + y, x-y)}\) and \(T_3 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, (u,v) \mapsto \frac{1}{4} (u - v)\), then for all \(x,y \in {\mathbb {R}}\)

$$\begin{aligned} x \cdot y = \tfrac{1}{4} \big ( (x+y)^2 - (x-y)^2 \big ) = \big ( S_2 \circ (\varrho \otimes \cdots \otimes \varrho ) \circ S_1 \big ) (x,y) , \end{aligned}$$

where \({S_1 := (T_1 \otimes T_1) \circ T_0 : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}^{2n}}\) and \(S_2 := T_3 \circ (T_2 \otimes T_2) : {\mathbb {R}}^{2n} \rightarrow {\mathbb {R}}\). As \(\Vert S_1\Vert _{\ell ^{0}} \le 4n\) and \(\Vert S_2\Vert _{\ell ^{0}} \le 2n\), we obtain \(M_{2} = {\mathtt {R}}(\Phi _{1})\) where \( \Phi _{1} = \big ( (S_{1}, \varrho \otimes \cdots \otimes \varrho ),(S_{2},\mathrm {id}) \big ) \in {\mathcal {NN}}_{6n, 2, 2n}^{\varrho , 2, 1}. \) This establishes our induction hypothesis for \(j=1\): \(M_{2} \in {\mathtt {SNN}}_{6n, 2, 2n}^{\varrho , 2, 1} \subset {\mathtt {NN}}_{w_j, 2j, m_j}^{\varrho , 2, 1}\) for \(j = 1\).
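For instance, if \(\varrho (x) = x^2\), then the assumption above holds with \(n = 1\), \(a = 0\), \(\alpha _1 = 0\), and \(\beta _1 = 1\), and the construction reduces to

$$\begin{aligned} S_1 (x,y) = (x+y, \, x-y) , \qquad S_2 (u,v) = \tfrac{1}{4} (u - v) , \qquad \big ( S_2 \circ (\varrho \otimes \varrho ) \circ S_1 \big ) (x,y) = \tfrac{1}{4} \big ( (x+y)^2 - (x-y)^2 \big ) = x y , \end{aligned}$$

with \(\Vert S_1\Vert _{\ell ^{0}} = 4\) and \(\Vert S_2\Vert _{\ell ^{0}} = 2\).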

We proceed to the actual induction step. Define the affine maps \(U_1, U_2 : {\mathbb {R}}^{2^{j+1}} \rightarrow {\mathbb {R}}^{2^{j}}\) by

$$\begin{aligned} U_1(x) := (x_{1}, \ldots , x_{2^{j}}) =: {\overline{x}} \quad \text {and} \quad U_2(x) := (x_{2^{j}+1}, \ldots , x_{2^{j+1}}) =: x' \quad \text {for} \quad x \in {\mathbb {R}}^{2^{j+1}}. \end{aligned}$$

With these definitions, observe that \( M_{2^{j+1}}(x) = M_{2^{j}}({\overline{x}}) M_{2^{j}}(x') = M_{2} \big ( M_{2^{j}}(U_{1}(x)),M_{2^{j}}(U_{2}(x)) \big ) \).

By the induction hypothesis, there is a network \( \Phi _{j} = \big ( (V_{1},\alpha _{1}), \ldots , (V_{L},\mathrm {id}) \big ) \in {\mathcal {NN}}_{w_{j}, 2j, m_{j}}^{\varrho , 2^{j}, 1} \) with \(L(\Phi _{j}) = L \le 2j\) such that \(M_{2^{j}} = {\mathtt {R}}(\Phi _{j})\). Since \(\Vert U_{i}\Vert _{\ell ^{0,\infty }_*}=1\), the second part of Lemma A.3 shows \(\Vert V_1 \circ U_i\Vert _{\ell ^0} \le \Vert V_{1}\Vert _{\ell ^{0}}\), whence \(M_{2^{j}} \circ U_i = {\mathtt {R}}(\Psi _{i})\), where \( \Psi _{i} = \big ( (V_{1} \circ U_{i}, \alpha _{1}), (V_{2}, \alpha _{2}), \ldots , (V_{L},\mathrm {id}) \big ) \) satisfies \(W(\Psi _{i}) \le W(\Phi _{j})\), \(N(\Psi _{i}) \le N(\Phi _{j})\), \(L(\Psi _{i}) = L\), and \(\Psi _{i} \in {\mathcal {NN}}^{\varrho ,2^{j},1}_{w_{j},2j,m_{j}}\). Thus, Lemma A.1 shows that \( f := (M_{2^{j}} \circ U_1, M_{2^{j}} \circ U_2) \in {\mathtt {NN}}_{2w_{j},2j,2m_{j}}^{\varrho ,2^{j+1},2} \). Since \(M_{2} \in {\mathtt {NN}}^{\varrho ,2,1}_{6n,2,2n}\), Lemma 2.18-(2) shows that \(M_{2^{j+1}} = M_{2} \circ f \in {\mathtt {NN}}^{\varrho ,2^{j+1},1}_{2w_{j}+6n,2j+2,2m_{j}+2n+2}\).

To conclude the proof of Part (1), note that \(2w_{j}+6n = 12 n(2^{j}-1) + 6n = 6 n(2^{j+1}-1) = w_{j+1}\) and \(2m_{j}+2n+2 = 2(2n+1)(2^{j}-1)+2n = (2n+1) (2^{j+1}-2)+2n+1-1 = m_{j+1}\).
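For instance, for \(n = 1\) we have \(w_1 = 6\) and \(m_1 = 2\), and indeed

$$\begin{aligned} 2 w_1 + 6 n = 18 = 6 n (2^{2}-1) = w_2 \qquad \text {and} \qquad 2 m_1 + 2 n + 2 = 8 = (2n+1)(2^{2}-1) - 1 = m_2 . \end{aligned}$$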

To prove Part (2), we recall from Part (1) that \(M_{2} : {\mathbb {R}}^2 \rightarrow {\mathbb {R}}, (x,y) \mapsto x \cdot y\) satisfies \(M_{2} = {\mathtt {R}}(\Psi )\) with \(\Psi \in {\mathcal {SNN}}_{6n, 2, 2n}^{\varrho , 2, 1}\) and \(L(\Psi ) = 2\). Next, let \(P^{(i)} : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}\times {\mathbb {R}}, (x, y) \mapsto (x, y_i)\) for each \(i \in \{1,\dots ,k\}\), and note that \(P^{(i)}\) is linear with \(\Vert P^{(i)}\Vert _{\ell ^{0,\infty }} = 1 = \Vert P^{(i)}\Vert _{\ell ^{0,\infty }_*}\). Lemma 2.18-(1) shows that \(M_{2} \circ P^{(i)} = {\mathtt {R}}(\Psi _{i})\) where \(\Psi _{i} \in {\mathcal {SNN}}^{\varrho ,1+k,1}_{6n,2,2n}\) and \(L(\Psi _{i}) = L(\Psi )=2\). To conclude, observe \({(M_{2} \circ P^{(i)}) (x,y) = x \cdot y_i = [m(x,y)]_i}\) for \({m : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}^k, (x,y) \mapsto x \cdot y}\). Therefore, Lemma 2.17-(2) shows that \(m = (M_{2} \circ P^{(1)}, \dots , M_{2} \circ P^{(k)}) \in {\mathtt {NN}}^{\varrho , 1+k, k}_{6kn, 2,2kn}\), as desired. \(\square \)

Appendix B. Proofs for Sect. 3

1.1 Proof of Lemma 3.1

Let \(f \in X\). For the sake of brevity, set \(\varepsilon _n := E(f, \Sigma _n)_X\) and \(\delta _n := E(f, \Sigma _n')_{X}\) for \(n \in {\mathbb {N}}_0\). First, observe that \(\varepsilon _n \le \Vert f\Vert _X = \delta _0\) for all \(n \in {\mathbb {N}}_0\). Furthermore, we have by assumption that \(\varepsilon _{cm} \le C \cdot \delta _m\) for all \(m \in {\mathbb {N}}\). Now, setting \(m_n := \lfloor \frac{n - 1}{c} \rfloor \in {\mathbb {N}}\) for \(n \in {\mathbb {N}}_{\ge c + 1}\), note that \(n - 1 \ge c \, m_n\), and hence \(\varepsilon _{n-1} \le \varepsilon _{c \, m_n} \le C \cdot \delta _{m_n}\). Therefore, we see

$$\begin{aligned} \varepsilon _{n-1} \le \delta _0 \text { if } 1 \le n \le c \qquad \text {and} \qquad \varepsilon _{n-1} \le C \cdot \delta _{m_n} \text { if } n \ge c+1. \end{aligned}$$

Next, note for \(n \in {\mathbb {N}}_{\ge c + 1}\) that \(m_n \ge 1\) and \(m_n \ge \frac{n - 1}{c} - 1\), whence \(n \le c \, m_n + c + 1 \le (2 c + 1) m_n\). Therefore, \(n^\alpha \le (2c + 1)^\alpha m_n^\alpha \). Likewise, since \(m_n \le n\), we have \(n^{-1} \le m_n^{-1}\) for all \(n \in {\mathbb {N}}_{\ge c + 1}\).
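For instance, for \(c = 2\) and \(n = 7\) we get \(m_7 = \lfloor 6/2 \rfloor = 3\), and indeed

$$\begin{aligned} n = 7 \le c \, m_n + c + 1 = 9 \le (2c+1) \, m_n = 15 . \end{aligned}$$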

There are now two cases. First, if \(q < \infty \), and if we set \(K := K(\alpha ,q,c) := \sum _{n=1}^c n^{\alpha q - 1}\), then

$$\begin{aligned} \sum _{n=1}^{c} n^{\alpha q - 1} \, \varepsilon _{n-1}^{q} \le \delta _0^{q} \cdot \sum _{n=1}^{c} n^{\alpha q - 1} = K \cdot \delta _0^{q} . \end{aligned}$$

Further, for \(n \in {\mathbb {N}}_{\ge c + 1}\) satisfying \(m_n = m\) for some \(m \in {\mathbb {N}}\), we have \(m \le \frac{n-1}{c} < m+1\), which easily implies \(|\{ n \in {\mathbb {N}}_{\ge c + 1} :m_n = m\}| \le |\{ n \in {\mathbb {N}}:c m + 1 \le n < c m + c + 1 \}| = c\). Thus,

$$\begin{aligned} \sum _{n=c+1}^{\infty } n^{\alpha q - 1} \, \varepsilon _{n-1}^{q} \le C^{q} \, (2c+1)^{\alpha q} \sum _{n=c+1}^{\infty } m_n^{\alpha q - 1} \, \delta _{m_n}^{q} \le C^{q} \, (2c+1)^{\alpha q} \, c \cdot \sum _{m=1}^{\infty } m^{\alpha q - 1} \, \delta _{m}^{q} . \end{aligned}$$

Overall, we thus see for \(q < \infty \) that

$$\begin{aligned} \sum _{n=1}^{\infty } n^{\alpha q - 1} \, \varepsilon _{n-1}^{q} \le K \cdot \delta _0^{q} + C^{q} (2c+1)^{\alpha q} \, c \cdot \sum _{m=1}^{\infty } m^{\alpha q - 1} \, \delta _{m}^{q} \le \big ( K + C^{q} (2c+1)^{\alpha q} c \big ) \cdot \Big ( \delta _0^{q} + \sum _{m=1}^{\infty } m^{\alpha q - 1} \, \delta _{m}^{q} \Big ) , \end{aligned}$$

where the constant \(K + C^q (2c+1)^{\alpha q} c\) only depends on \(\alpha ,q,c,C\).

The adaptations for the (easier) case \(q = \infty \) are left to the reader. \(\square \)

1.2 Proof of Lemma 3.20

For \(p \in (0,\infty )\), the claim is clear, since it is well known that \(L_{p}(\Omega ;{\mathbb {R}}^{k})\) is complete, and since one can extend each \(g \in L_{p}(\Omega ;{\mathbb {R}}^{k})\) by zero to a function \(f \in L_{p}({\mathbb {R}}^{d};{\mathbb {R}}^{k})\) satisfying \(g = f|_{\Omega }\).

Now, we consider the case \(p = \infty \). We first prove completeness of . Let be a Cauchy sequence. It is well known that there is a continuous function \(f : \Omega \rightarrow {\mathbb {R}}^k\) such that \(f_n \rightarrow f\) uniformly. In fact (see, for instance, [63, Theorem 12.8]), f is uniformly continuous. It remains to show that f vanishes at infinity. Let \(\varepsilon > 0\) be arbitrary, and choose \(n \in {\mathbb {N}}\) such that \(\Vert f - f_n\Vert _{\sup } \le \frac{\varepsilon }{2}\). Since \(f_n\) vanishes at \(\infty \), there is \(R > 0\) such that \(|f_n(x)| \le \frac{\varepsilon }{2}\) for \(x \in \Omega \) with \(|x| \ge R\). Therefore, \(|f(x)| \le \varepsilon \) for such x, proving that , while follows from the uniform convergence \(f_n \rightarrow f\).

Finally, we prove that . By considering components it is enough to prove that . To see that , simply note that (see Footnote 7) if \(f \in C_0 ({\mathbb {R}}^d)\), then f is not only continuous, but in fact uniformly continuous. Therefore, \(f|_{\Omega }\) is also uniformly continuous (and vanishes at infinity), whence .

For proving , we will use the notion of the one-point compactification \(Z_\infty := \{\infty \} \cup Z\) of a locally compact Hausdorff space Z (where we assume that \(\infty \notin Z\)); see [26, Proposition 4.36]. The topology on \(Z_\infty \) is given by \( {\mathcal {T}}_Z := \{ U :U \subset Z \text { open} \} \cup \{ Z_\infty {\setminus } K :K \subset Z \text { compact} \} \). Then, \((Z_\infty ,{\mathcal {T}}_Z)\) is a compact Hausdorff space and the topology induced on Z as a subspace of \(Z_\infty \) coincides with the original topology on Z; see [26, Proposition 4.36]. Furthermore, if \(A \subset Z\) is closed, then a direct verification shows that the relative topology on \(A_\infty \) as a subset of \(Z_\infty \) coincides with the topology \({\mathcal {T}}_A\).

Now, let . Since g is uniformly continuous, it follows (see [3, Lemma 3.11]) that there is a uniformly continuous function \({\widetilde{g}} : A \rightarrow {\mathbb {R}}\) satisfying \(g = {\widetilde{g}}|_{\Omega }\), with \(A := {\overline{\Omega }} \subset {\mathbb {R}}^{d}\) the closure of \(\Omega \) in \({\mathbb {R}}^d\).

Since \(g \in C_0(\Omega )\), it is not hard to see that \({\widetilde{g}} \in C_0(A)\). Hence, [26, Proposition 4.36] shows that the function \(G : A_\infty \rightarrow {\mathbb {R}}\) defined by \(G(x) = {\widetilde{g}}(x)\) for \(x \in A\) and \(G(\infty ) = 0\) is continuous. Since \(A_\infty \subset ({\mathbb {R}}^d)_\infty \) is compact, the Tietze extension theorem (see [26, Theorem 4.34]) shows that there is a continuous extension \(H : ({\mathbb {R}}^d)_\infty \rightarrow {\mathbb {R}}\) of G. Again by [26, Proposition 4.36], this implies that \(f := H|_{{\mathbb {R}}^d} \in C_0({\mathbb {R}}^d)\). By construction, we have \(g = f|_{\Omega }\). \(\square \)

1.3 Proof of Theorem 3.23

1.3.1 Proof of Claims 1a-1b

We use the following lemma.

Lemma B.1

Let \({\mathcal {C}}\) be one of the following classes of functions:

  • locally bounded functions;

  • Borel-measurable functions;

  • continuous functions;

  • Lipschitz continuous functions;

  • locally Lipschitz continuous functions.

If the activation function \(\varrho \) belongs to \({\mathcal {C}}\), then any \(f \in {\mathtt {NN}}^{\varrho ,d,k}\) also belongs to \({\mathcal {C}}\).\(\blacktriangleleft \)

Proof

First, note that each affine-linear map \(T : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) belongs to all of the mentioned classes. Furthermore, note that since \({\mathbb {R}}^d\) is locally compact, a function \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) is locally bounded [locally Lipschitz] if and only if f is bounded [Lipschitz continuous] on each bounded set. From this, it easily follows that each class \({\mathcal {C}}\) is closed under composition.

Finally, it is not hard to see that if \(f_1, \dots , f_n : {\mathbb {R}}\rightarrow {\mathbb {R}}\) all belong to the class \({\mathcal {C}}\), then so does \(f_1 \otimes \cdots \otimes f_n : {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\).

Combining these facts with the definition of the realization of a neural network, we get the claim. \(\square \)
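For instance, since the ReLU \(\varrho _1 (x) = \max \{0, x\}\) is Lipschitz continuous, Lemma B.1 shows that every \(f \in {\mathtt {NN}}^{\varrho _1,d,k}\) is Lipschitz continuous as well.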

As \(\varrho \) is locally bounded and Borel measurable, by Lemma B.1 each \(g \in {\mathtt {NN}}^{\varrho ,d,k}\) is locally bounded and measurable. As \(\Omega \) is bounded, we get \(g|_{\Omega } \in L_{p}(\Omega ;{\mathbb {R}}^{k})\) for all \(p \in (0,\infty ]\), and hence if \(p < \infty \). This establishes claim 1a. Finally, if \(p = \infty \), then by our additional assumption that \(\varrho \) is continuous, g is continuous by Lemma B.1. On the compact set \({\overline{\Omega }}\), g is thus uniformly continuous and bounded, so that \(g|_{\Omega }\) is uniformly continuous and bounded as well, that is, . This establishes claim 1b. \(\square \)

1.3.2 Proof of claims 1c-1d

We first consider the case \(p < \infty \). Let and \(\varepsilon > 0\). For each \(i \in \{1,\dots ,k\}\), extend the i-th component function \(f_i\) by zero to a function \(g_i \in L_p({\mathbb {R}}^d)\). As is well known (see, for instance, [25, Chapter VI, Theorem 2.31]), \(C_c^\infty ({\mathbb {R}}^d)\) is dense in \(L_p({\mathbb {R}}^d)\), so that we find \(h_i \in C_c^\infty ({\mathbb {R}}^d)\) satisfying \(\Vert g_i - h_i \Vert _{L_p} < \varepsilon \). Choose \(R > 0\) satisfying \({{\text {supp}}}(h_i) \subset [-R,R]^d\) and \({\Omega \subset [-R,R]^d}\). By the universal approximation theorem (Theorem 3.22), we can find \(\gamma _i \in {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,2,\infty } \subset {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\) satisfying \(\Vert h_i - \gamma _i \Vert _{L_\infty ([-R,R]^d)} \le \varepsilon / (4R)^{d/p}\). Note that the inclusion \({\mathtt {NN}}^{\varrho ,d,1}_{\infty ,2,\infty } \subset {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\) used above is (only) true since we are considering generalized neural networks, and since \(L \ge 2\).

Using the elementary estimate \((a + b)^p \le (2 \max \{a, b \})^p \le 2^p (a^p + b^p)\), we see

$$\begin{aligned} |\gamma _i (x) - g_i (x)|^p\le & {} \big ( |\gamma _i(x) - h_i(x)| + |h_i(x) - g_i(x)| \big )^p \\\le & {} 2^p \Big ( \frac{\varepsilon ^p}{(4R)^{d}} + |h_i(x) - g_i(x)|^p \Big ) \quad \forall \, x \in [-R,R]^d , \end{aligned}$$

which easily implies \(\Vert \gamma _i - g_i\Vert _{L_p ([-R,R]^d)}^p \le 2^p (\varepsilon ^p + \Vert h_i - g_i\Vert _{L_p([-R,R]^d)}^p) \le 2^{1 + p} \varepsilon ^p\).

Lemma 2.17 shows that \(\gamma := (\gamma _1, \dots , \gamma _k) \in {\mathtt {NN}}^{\varrho ,d,k}_{\infty ,L,\infty }\), whence by claims 1a-1b of Theorem 3.23. Finally, since \(g_i|_{\Omega } = f_i\), we have

$$\begin{aligned} \Vert f - \gamma |_{\Omega } \Vert _{L_p (\Omega )}^p \le \sum _{i=1}^k \Vert g_i - \gamma _i \Vert _{L_p ([-R,R]^d)}^p \le 2^{1 + p} k \cdot \varepsilon ^p . \end{aligned}$$

Since \(\varepsilon > 0\) was arbitrary, this proves the desired density.

Now, we consider the case \(p = \infty \). Let . Lemma 3.20 shows that there is a continuous function \(g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) such that \(f = g|_{\Omega }\). Since \(L \ge 2\), we can apply the universal approximation theorem (Theorem 3.22) to each of the component functions \(g_i\) of \(g = (g_1,\dots ,g_k)\) to obtain functions \(\gamma _i \in {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,2,\infty } \subset {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\) satisfying \(\Vert g_i - \gamma _i \Vert _{L_\infty ([-R,R]^d)} \le \varepsilon \), where we chose \(R > 0\) so large that \(\Omega \subset [-R,R]^d\). Lemma 2.17 shows that \(\gamma := (\gamma _1, \dots , \gamma _k) \in {\mathtt {NN}}^{\varrho ,d,k}_{\infty ,L,\infty }\), whence by claims 1a-1b of Theorem 3.23, since \(\varrho \) is continuous. Finally, since \(g_i |_{\Omega } = f_i\), we have

$$\begin{aligned} \sup _{x \in \Omega } \Vert f(x) - \gamma (x) \Vert _{\ell ^\infty } \le \sup _{x \in [-R,R]^d} \,\, \max _{i \in \{1,\dots ,k\}} | g_i(x) - \gamma _i(x)| \le \varepsilon . \end{aligned}$$

Since \(\varepsilon > 0\) was arbitrary, this proves the desired density. \(\square \)

1.3.3 Proof of Claim (2)

Set \({\mathcal {V}} := {\mathtt {NN}}^{\varrho ,d,1}_{\infty ,L,\infty }\). Lemma 2.17 easily shows that \({\mathcal {V}}\) is a vector space. Furthermore, Lemma 2.18 shows that if \(f \in {\mathcal {V}}\), \(A \in \mathrm {GL}({\mathbb {R}}^d)\), and \(b \in {\mathbb {R}}^d\), then \(f (A \bullet + b) \in {\mathcal {V}}\) as well. Clearly, all these properties also hold for \(\overline{{\mathcal {V}}}\) instead of \({\mathcal {V}}\), where the closure is taken in \(X_p({\mathbb {R}}^d)\).

It suffices to show that \({\mathcal {V}}\) is dense in . Indeed, suppose for the moment that this is true. Let be arbitrary. By applying Lemma 3.20 to each of the component functions \(f_i\) of f, we see for each \(i \in \{1,\dots ,k\}\) that there is a function such that \(f_i = F_i |_{\Omega }\). Now, let \(\varepsilon > 0\) be arbitrary, and set \(p_0 := \min \{1,p\}\). Since \({\mathcal {V}}\) is dense in , there is for each \(i \in \{1,\dots ,k\}\) a function \(G_i \in {\mathcal {V}}\) such that \(\Vert G_i - F_i\Vert _{L_p}^{p_0} \le \varepsilon ^{p_0} / k\). Lemma 2.17 shows , and it is not hard to see that , and hence, . As \(\varepsilon > 0\) and were arbitrary, this proves that is dense in , as desired.

It remains to show that is dense. To prove this, we distinguish three cases:

Case 1 (\(p \in [1,\infty )\)): First, since g possesses a “radially decreasing \(L_1\)-majorant” \(\mu \), [11, Lemma A.2] shows that \(P|g| \in L_\infty ({\mathbb {R}}^d) \subset L_p^{\mathrm {loc}}({\mathbb {R}}^d)\), where P|g| is a certain periodization of |g| whose precise definition is immaterial for us. Since \(g \in L_p ({\mathbb {R}}^d)\) and \(P|g| \in L_p^{\mathrm {loc}}({\mathbb {R}}^d)\), and \(\int _{{\mathbb {R}}^d} g(x) \, dx \ne 0\), [11, Corollary 1] implies that \({\mathcal {V}}_0 := \mathrm {span}\{ g_{j,k} :j \in {\mathbb {N}}, k \in {\mathbb {Z}}^d \}\) is dense in \(L_p({\mathbb {R}}^d)\), where \(g_{j,k}(x) = 2^{jd/p} \cdot g(2^j x - k)\). As a consequence of the properties of the space \({\mathcal {V}}\) that we mentioned above, and since \(g \in \overline{{\mathcal {V}}}\), we have \({\mathcal {V}}_0 \subset \overline{{\mathcal {V}}}\). Hence, \({\mathcal {V}} \subset L_p ({\mathbb {R}}^d)\) is dense; since \(p < \infty \), this establishes the desired density.

Case 2 (\(p \in (0,1)\)): Since \(g \in L_1({\mathbb {R}}^d) \cap L_p({\mathbb {R}}^d)\) with \(\int _{{\mathbb {R}}^d} g(x) \, d x \ne 0\), [39, Theorem 4 and Proposition 5(a)] show that \({\mathcal {V}}_0 \subset L_p({\mathbb {R}}^d)\) is dense, where the space \({\mathcal {V}}_0\) is defined precisely as for \(p \in [1,\infty )\). The rest of the proof is as for \(p \in [1,\infty )\).

Case 3 (\(p = \infty \)): Note that \(X_\infty ({\mathbb {R}}^d) = C_0 ({\mathbb {R}}^d)\). Let us assume toward a contradiction that \({\mathcal {V}}\) is not dense in \(C_0({\mathbb {R}}^d)\). By the Hahn–Banach theorem (see, for instance, [26, Theorem 5.8]), there is a bounded linear functional \(\varphi \in (C_0({\mathbb {R}}^d))^*\) such that \(\varphi \not \equiv 0\), but \(\varphi \equiv 0\) on \(\overline{{\mathcal {V}}}\).

By the Riesz representation theorem for \(C_0\) (see [26, Theorem 7.17]), there is a finite real-valued Borel-measure \(\mu \) on \({\mathbb {R}}^d\) such that \(\varphi (f) = \int _{{\mathbb {R}}^d} f(x) \, d \mu (x)\) for all \(f \in C_0({\mathbb {R}}^d)\). Thanks to the Jordan decomposition theorem (see [26, Theorem 3.4]), there are finite positive Borel measures \(\mu _+\) and \(\mu _-\) such that \(\mu = \mu _+ - \mu _-\).

Let \(f \in C_0 ({\mathbb {R}}^d)\) be arbitrary. For \(a > 0\), define \(g_a : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto a^d \, g(a x)\), and note \(T_x g_a \in \overline{{\mathcal {V}}}\) (and hence \(\varphi (T_x g_a) = 0\)) for all \(x \in {\mathbb {R}}^d\), where \(T_x g_a (y) = g_a (y-x)\). By Fubini’s theorem and the change of variables \(y = -z\), we get

$$\begin{aligned} \begin{aligned} \int _{{\mathbb {R}}^d} (f *g_a)(x) \, d \mu (x)&= \int _{{\mathbb {R}}^d} \int _{{\mathbb {R}}^d} f(z) \, g_a (x-z) \, d z \, d \mu (x) \\&= \int _{{\mathbb {R}}^d} f(-y) \int _{{\mathbb {R}}^d} g_a (y + x) \, d \mu (x) \, d y \\&= \int _{{\mathbb {R}}^d} f(-y) \, \varphi ( T_{-y} \, g_a ) \, d y = 0 \end{aligned} \end{aligned}$$
(B.1)

for all \(a \ge 1\). Here, Fubini’s theorem was applied to each of the integrals \(\int (f *g_a)(x) \, d \mu _{\pm } (x)\), which is justified since

$$\begin{aligned} \int \int |f(z) \, g_a (x-z)| \, d z \, d \mu _{\pm } (x)&\le \mu _{\pm }({\mathbb {R}}^d) \, \Vert f\Vert _{L_{\infty }} \, \Vert T_{z} g_a\Vert _{L_{1}} \\&= \mu _{\pm }({\mathbb {R}}^d) \, \Vert f\Vert _{L_{\infty }} \, \Vert g_a\Vert _{L_1} < \infty . \end{aligned}$$

Now, since \(f \in C_0({\mathbb {R}}^d)\) is bounded and uniformly continuous, [26, Theorem 8.14] shows \(f *g_a \rightarrow f\) uniformly as \(a \rightarrow \infty \). Therefore, (B.1) implies \( \varphi (f) = \int _{{\mathbb {R}}^d} f(x) \, d \mu (x) = \lim _{a \rightarrow \infty } \int _{{\mathbb {R}}^d} (f *g_a) (x) \, d \mu (x) = 0 \), since \(\mu \) is a finite measure. This implies \(\varphi \equiv 0\) on \(C_0 ({\mathbb {R}}^d)\), which is the desired contradiction. \(\square \)

1.4 Proof of Lemma 3.26

Part (1):

Define

$$\begin{aligned} t : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \sigma \big ( x/\varepsilon \big )-\sigma \big ( 1+(x-1)/\varepsilon \big ) . \end{aligned}$$

A straightforward calculation using the properties of \(\sigma \) shows that

$$\begin{aligned} t(x) = {\left\{ \begin{array}{ll} 0, &{} \text {if } x \in {\mathbb {R}}{\setminus } [0,1], \\ 1, &{} \text {if } x \in [\varepsilon , 1-\varepsilon ]. \end{array}\right. } \end{aligned}$$
(B.2)
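
For instance, since \(\sigma \equiv 0\) on \((-\infty ,0]\) and \(\sigma \equiv 1\) on \([1,\infty )\) (these are the properties of \(\sigma \) used below) and since \(\varepsilon \in (0, \tfrac{1}{2})\), we have for \(x \in [\varepsilon , 1-\varepsilon ]\) that

$$\begin{aligned} \frac{x}{\varepsilon } \ge 1 \quad \text {and} \quad 1 + \frac{x-1}{\varepsilon } \le 1 - 1 = 0 , \qquad \text {so that} \quad t(x) = 1 - 0 = 1 ; \end{aligned}$$

for \(x \in {\mathbb {R}}{\setminus } [0,1]\), both arguments of \(\sigma \) lie in \((-\infty ,0)\) or both lie in \([1,\infty )\), so that \(t(x) = 0\).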

We claim that \(0 \le t \le 1\). To see this, first note that if \(r \ge 1\), then \(\sigma (x - r) \le \sigma (x)\) for all \(x \in {\mathbb {R}}\). Indeed, if \(x \le r\), then \(\sigma (x - r) = 0 \le \sigma (x)\); otherwise, if \(x > r\), then \(x \ge 1\), and hence \(\sigma (x - r) \le 1 = \sigma (x)\). Since \(r := \frac{1}{\varepsilon } - 1 \ge 1\), we thus see that \(t(x) = \sigma (\frac{x}{\varepsilon }) - \sigma (\frac{x}{\varepsilon } - r) \ge 0\) for all \(x \in {\mathbb {R}}\). Finally, we trivially have \(t(x) \le \sigma (\frac{x}{\varepsilon }) \le 1\) for all \(x \in {\mathbb {R}}\).

Now, if we define

$$\begin{aligned} g_0 : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \sigma \left( 1+ \sum _{i=1}^d t (x_i)-d \right) , \end{aligned}$$

we see \(0 \le g_0 \le 1\). Furthermore, for \(x \in [\varepsilon , 1-\varepsilon ]^d\), we have \(t(x_i) = 1\) for all \(i \in \{1,\dots ,d\}\), whence \(g_0(x) = 1\). Likewise, if \(x \notin [0,1]^d\), then \(t (x_i) = 0\) for at least one \(i \in \{1,\dots ,d\}\). Since \(0 \le t (x_i) \le 1\) for all i, this implies \(\sum _{i=1}^d t (x_i) - d \le -1\), and thus \(g_0(x) = 0\). All in all, and because of \(0 \le g_0 \le 1\), these considerations imply that \({{\text {supp}}}(g_0) \subset [0,1]^{d}\) and

$$\begin{aligned} |g_0(x) - {\mathbb {1}}_{[0,1]^d} (x)| \le {\mathbb {1}}_{[0,1]^d {\setminus } [\varepsilon , 1-\varepsilon ]^d} (x) \quad \forall \, \, x \in {\mathbb {R}}^d . \end{aligned}$$
(B.3)

Now, for proving the general case of Part (1), let \(h := g_0\), while \(h := t\) in case of \(d = 1\). As a consequence of Eqs. (B.2) and (B.3) and of \(0 \le t \le 1\), we then see that Condition (3.10) is satisfied in both cases. Thus, all that needs to be shown is that \(h = g_0 \in {\mathtt {NN}}^{\varrho ,d,1}_{2dW(N+1), 2L-1, (2d+1)N}\) or that \(h = t \in {\mathtt {NN}}^{\varrho ,1,1}_{2W,L,2N}\) in case of \(d = 1\).

We will verify both of these properties in the proof of Part (2) of the lemma.

Part (2): We first establish the claim for the special case \([a,b]= [0,1]^{d}\). With \(\lambda \) denoting the d-dimensional Lebesgue measure, and with h as constructed in Part (1), we deduce from (3.10) that

$$\begin{aligned} \Vert h - {\mathbb {1}}_{[0,1]^d} \Vert _{L_p}^{p} \le \lambda ([0,1]^d {\setminus } [\varepsilon , 1-\varepsilon ]^d) = [1 - (1 - 2\varepsilon )^d] . \end{aligned}$$

Since the right-hand side vanishes as \(\varepsilon \rightarrow 0\), this proves the claim for the special case \([a,b] = [0,1]^d\), once we show \(h = {\mathtt {R}}(\Phi )\) for \(\Phi \) with appropriately many layers, neurons, and nonzero weights.
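
Quantitatively, Bernoulli's inequality \((1 - 2\varepsilon )^{d} \ge 1 - 2 d \varepsilon \) gives

$$\begin{aligned} \Vert h - {\mathbb {1}}_{[0,1]^d} \Vert _{L_p}^{p} \le 1 - (1 - 2\varepsilon )^d \le 2 d \varepsilon , \end{aligned}$$

so that, for any prescribed \(\eta > 0\), choosing \(\varepsilon \le \eta ^p / (2d)\) already yields \(\Vert h - {\mathbb {1}}_{[0,1]^d} \Vert _{L_p} \le \eta \).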

By assumption on \(\sigma \), there is \(L_0 \le L\) such that \(\sigma = {\mathtt {R}}(\Phi _\sigma )\) for some \(\Phi _\sigma \in {\mathcal {NN}}^{\varrho ,1,1}_{W,L_0,N}\) with \(L(\Phi _\sigma ) = L_0\).

For \(i \in \{1,\dots ,d\}\) set \(f_{i, 1} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \sigma (\frac{x_i}{\varepsilon })\) and \(f_{i, 2} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto -\sigma (1 + \frac{x_i - 1}{\varepsilon })\). By Lemma 2.18-(1), there exist \(\Psi _{i,1}, \Psi _{i,2} \in {\mathcal {NN}}^{\varrho ,d,1}_{W,L_0,N}\) with \(L(\Psi _{i,1}) = L(\Psi _{i,2}) = L_0\) for any \(i \in \{1,\dots ,d\}\) such that \(f_{i,1} = {\mathtt {R}}(\Psi _{i,1})\) and \(f_{i,2} = {\mathtt {R}}(\Psi _{i,2})\).

Lemma 2.17-(3) then shows that

$$\begin{aligned} F : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \sum _{i = 1}^d t(x_i) = \sum _{i=1}^d f_{i,1}(x) + \sum _{i=1}^d f_{i,2}(x) \end{aligned}$$

satisfies \(F = {\mathtt {R}}(\Phi _F)\) for some \(\Phi _F \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW,L_0,2dN}\) with \(L(\Phi _F) = L_0\). Hence, Lemma 2.18-(1) shows that \(G : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto 1 + \sum _{i=1}^d t(x_i) - d\) satisfies \(G = {\mathtt {R}}(\Phi _G)\) for some \(\Phi _G \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW,L_0,2dN}\) with \(L(\Phi _G) = L_0\).

In case of \(d = 1\), set \(L' := L_0\) and recall that \(h = t = F\), where we saw above that \(F = {\mathtt {R}}(\Phi _F)\) and \(\Phi _F \in {\mathcal {NN}}^{\varrho ,1,1}_{2W,L_0,2N}\) with \(L(\Phi _F) = L_0\).

For general \(d \in {\mathbb {N}}\) set \(L' := 2 L_0 - 1\) and recall that \(h = g_0 = \sigma \circ G\).

Hence, Lemma 2.18-(3) shows \(h = {\mathtt {R}}(\Phi _h)\) for some \(\Phi _h \in {\mathcal {NN}}^{\varrho ,d,1}\) with \(L(\Phi _h) = L'\), \(N(\Phi _h) \le (2d+1)N\) and \(W(\Phi _h) \le 2d W + \max \{ 2 d N, d \} W \le 2 d W (N+1)\).

It remains to transfer the result from \([0,1]^d\) to the general case [ab]. To this end, define the invertible affine-linear map

$$\begin{aligned} T_0 : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d, x \mapsto \left( (b_i - a_i)^{-1} \cdot (x_i - a_i) \right) _{i \in \{1,\dots ,d\}} . \end{aligned}$$

A direct calculation shows \({\mathbb {1}}_{[0,1]^d} \circ T_0 = {\mathbb {1}}_{T_0^{-1} [0,1]^d} = {\mathbb {1}}_{[a,b]}\). Since \(\Vert T_{0}\Vert _{\ell ^{0,\infty }_{*}} =1\), the first part of Lemma 2.18 shows that \(g := h \circ T_0 = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW (N+1), 2 L_0 - 1, (2d+1) N}\) with \(L(\Phi ) = 2 L_0 - 1 = L'\) (resp. \(g := h \circ T_0 = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathcal {NN}}^{\varrho ,1,1}_{2W, L_0, 2N}\) with \(L(\Phi ) = L_0 = L'\) in case of \(d = 1\)) with h as above. Moreover, by an application of the change of variables formula, we get

$$\begin{aligned} \Vert g - {\mathbb {1}}_{[a,b]} \Vert _{L_p}&= \Vert h \circ T_0 - {\mathbb {1}}_{[0,1]^d} \circ T_0 \Vert _{L_p} \\&= \big | \det {{\text {diag}}}\big ( (b_i - a_i)^{-1} \big )_{i \in \{1,\dots ,d\}} \big |^{-1/p} \cdot \Vert h - {\mathbb {1}}_{[0,1]^d} \Vert _{L_p} \\&= \Vert h - {\mathbb {1}}_{[0,1]^d} \Vert _{L_p} \cdot \prod _{i=1}^d (b_i - a_i)^{1/p} . \end{aligned}$$

As seen above, the first factor can be made arbitrarily small by choosing \(\varepsilon \in (0, \frac{1}{2})\) suitably. Since the second factor is constant, this proves the claim. \(\square \)

Appendix C. Proofs for Section 4

1.1 Proof of Lemma 4.9

We begin with three auxiliary results.

Lemma C.1

Let \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be continuously differentiable. Define \(f_h : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto h^{-1} \cdot (f(x+h) - f(x))\) for \(h \in {\mathbb {R}}{\setminus } \{0\}\). Then, \(f_h \rightarrow f'\) as \(h \rightarrow 0\) with locally uniform convergence on \({\mathbb {R}}\).\(\blacktriangleleft \)

Proof

This is an easy consequence of the mean-value theorem, using that \(f'\) is locally uniformly continuous. For more details, we refer to [40, Theorem 4.14]. \(\square \)

Since \(\varrho _{r+1}\) is continuously differentiable with \(\varrho _{r+1} ' = (r+1) \, \varrho _r\), the preceding lemma immediately implies the following result.

Corollary C.2

For \(r \in {\mathbb {N}}\), \(h > 0\), \( \sigma _h : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto (r+1)^{-1} \cdot h^{-1} \cdot \big ( \varrho _{r+1} (x + h) - \varrho _{r+1} (x) \big ) \), we have \(\sigma _{h} = {\mathtt {R}}(\Psi _{h})\) where \(\Psi _{h} \in {\mathcal {SNN}}_{4,2,2}^{\varrho _{r+1},1,1}\), \(L(\Psi _{h}) = 2\), and \(\lim _{h \rightarrow 0}\sigma _h = \varrho _{r}\) with locally uniform convergence on \({\mathbb {R}}\).
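
Indeed, since \(\varrho _{r+1}' = (r+1) \, \varrho _r\), Lemma C.1 applied to \(f = \varrho _{r+1}\) yields

$$\begin{aligned} \sigma _h (x) = \frac{\varrho _{r+1}(x+h) - \varrho _{r+1}(x)}{(r+1) \, h} \longrightarrow \frac{\varrho _{r+1}'(x)}{r+1} = \varrho _r (x) \qquad (h \rightarrow 0) \end{aligned}$$

with locally uniform convergence; moreover, \(\sigma _h\) is built from two shifted copies of \(\varrho _{r+1}\), which accounts for the stated complexity of \(\Psi _h\).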

We need one more auxiliary result for the proof of Lemma 4.9.

Corollary C.3

For any \(d,k,r \in {\mathbb {N}}\), \(j \in {\mathbb {N}}_{0}\), \(W,N \in {\mathbb {N}}_{0}\), \(L \in {\mathbb {N}}\), we have

$$\begin{aligned} {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+j},d,k}_{4^{j}W,L,2^{j}N}} \end{aligned}$$
(C.1)

where closure is with respect to locally uniform convergence on \({\mathbb {R}}^d\).

Proof

We prove by induction on \(\delta \) that the result holds for any \(0 \le j \le \delta \). This is trivial for \(\delta =0\). By Corollary C.2, we can apply Lemma 2.21 to \(\varrho := \varrho _{r+1}\) and \(\sigma := \varrho _{r}\) (which is continuous) with \(w = 4\), \(\ell =2\), \(m=2\). This yields for any \(W,N \in {\mathbb {N}}_{0}\), \(L \in {\mathbb {N}}\) that \( {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+1},d,k}_{4W,L,2N}}, \) which shows that our induction hypothesis is valid for \(\delta = 1\). Assume now that the hypothesis holds for some \(\delta \in {\mathbb {N}}\), and consider \(W,N \in {\mathbb {N}}_{0}\), \(r,L \in {\mathbb {N}}\), \(0 \le j \le \delta +1\). If \(j \le \delta \) then the induction hypothesis yields (C.1), so it only remains to check the case \(j = \delta +1\). By the induction hypothesis, for \(r' = r+\delta \), \(W' = 4^{\delta }W\), \(N' = 2^{\delta }N\), \(j=1\) we have \( {\mathtt {NN}}^{\varrho _{r+\delta },d,k}_{4^{\delta }W,L,2^{\delta }N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+\delta +1},d,k}_{4^{\delta +1}W,L,2^{\delta +1}N}}. \) Finally, \( {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N} \subset \overline{{\mathtt {NN}}^{\varrho _{r+\delta },d,k}_{4^{\delta }W,L,2^{\delta }N}} \subset \overline{{\mathtt {NN}}^{\varrho _{r+\delta +1},d,k}_{4^{\delta +1}W,L,2^{\delta +1}N}} \) by the induction hypothesis for \(j=\delta \). \(\square \)

Proof of Lemma 4.9

The proof is by induction on n. For \(n=1\), \(\varrho \) is a polynomial of degree at most r. By Lemma 2.24, \(\varrho _{r}\) can represent any such polynomial with \(2(r+1)\) terms, whence \(\varrho \in {\mathtt {NN}}^{\varrho _{r},1,1}_{4(r+1),2,2(r+1)}\). When \(r=1\), \(\varrho \) is an affine function; hence, there are \(a,b \in {\mathbb {R}}\) such that \(\varrho (x) = b+ax = b+ a\varrho _{1}(x)-a\varrho _{1}(-x)\) for all x, showing that \(\varrho \in {\mathtt {SNN}}^{\varrho _{1},1,1}_{4,2,2} = {\mathtt {SNN}}^{\varrho _1,1,1}_{2(n+1),2,n+1}\).

Assuming the result true for \(n \in {\mathbb {N}}\), we prove it for \(n+1\). Consider \(\varrho \) made of \(n+1\) polynomial pieces: \({\mathbb {R}}\) is the disjoint union of \(n+1\) intervals \(I_{i}\), \(0 \le i \le n\) and there are polynomials \(p_{i}\) such that \(\varrho (x) = p_{i}(x)\) on the interval \(I_{i}\) for \(0 \le i \le n\). Without loss of generality, order the intervals by increasing “position” and define \({\bar{\varrho }}(x) = \varrho (x)\) for \(x \in \cup _{i=0}^{n-1} I_{i} = {\mathbb {R}}{\setminus } I_{n}\), and \({\bar{\varrho }}(x) = p_{n-1}(x)\) on \(I_{n}\). It is not hard to see that \({\bar{\varrho }}\) is continuous and made of n polynomial pieces, the last one being \(p_{n-1}(x)\) on \(I_{n-1} \cup I_{n}\). Observe that \(\varrho (x) = {\bar{\varrho }}(x) + f(x - t_{n})\) where \(\{t_{n}\} = \overline{I_{n-1}} \cap \overline{I_{n}}\) is the breakpoint between the intervals \(I_{n-1}\) and \(I_{n}\), and

$$\begin{aligned} f(x) := \varrho (x + t_n) - {\bar{\varrho }}(x + t_n) = {\left\{ \begin{array}{ll} 0 &{} \text {for}\ x < 0 \\ p_{n}(x + t_n)-p_{n-1}(x + t_n) &{} \text {for}\ x \ge 0. \end{array}\right. } \end{aligned}$$

Note that \(q(x) := p_{n}(x + t_n) - p_{n-1}(x + t_n)\) satisfies \(q(0) = f(0) = 0\), since \(\varrho \) is continuous. Because q is a polynomial of degree at most r, there are \(a_i \in {\mathbb {R}}\) such that \(q(x) = \sum _{i=1}^{r} a_i \, x^i\). This shows that \(f = \sum _{i=1}^{r} a_i \varrho _i\). In case of \(r = 1\), this shows that \(f \in {\mathtt {SNN}}^{\varrho _1,1,1}_{2,2,1}\). For \(r \ge 2\), since \(\varrho _i \in {\mathtt {NN}}^{\varrho _i,1,1}_{2,2,1}\), Corollary C.3 yields \( \varrho _i \in \overline{{\mathtt {NN}}^{\varrho _r,1,1}_{2 \cdot 4^{r-i},2,2^{r-i}}} \), where the closure is with respect to the topology of locally uniform convergence. Observing that \(2\sum _{i=1}^{r}4^{r-i} = 2 \cdot (4^{r}-1)/3 = w\) and \(\sum _{i=1}^{r}2^{r-i} = 2^{r}-1 = m\), Lemma 2.17-(3) implies that \(f \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{w,2,m}}\) (see Footnote 8). Since \(P: {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto x+t_n\) is affine with \(\Vert P\Vert _{\ell ^{0,\infty }} = \Vert P\Vert _{\ell ^{0,\infty }_{*}}=1\), by the induction hypothesis, Lemma 2.18-(1) and Lemma 2.17-(3) again, we get

$$\begin{aligned} \varrho (\bullet ) = {\bar{\varrho }}(\bullet ) + f(\cdot - t_n)\in & {} \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{4(r+1)+(n-1)w+w,2,2(r+1)+(n-1)m+m}}\\= & {} \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{4(r+1)+(n+1-1)w,2,2(r+1)+(n+1-1)m}}. \end{aligned}$$

For \(r=1\), it is not hard to see \( \varrho \in {\mathtt {SNN}}^{\varrho _1,1,1}_{2(n+1)+2,2,n+1+1} = {\mathtt {SNN}}^{\varrho _1,1,1}_{2((n+1)+1),2,(n+1)+1} \). \(\square \)
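
For illustration, consider the induction step in the case \(r = 1\) for \(\varrho (x) = |x|\), which consists of two affine pieces (so \(n = 1\)) with breakpoint \(t_1 = 0\): the construction gives \({\bar{\varrho }}(x) = -x = -\varrho _1(x) + \varrho _1(-x)\) and \(q(x) = x - (-x) = 2x\), hence \(f = 2 \varrho _1\), and indeed

$$\begin{aligned} |x| = {\bar{\varrho }}(x) + f(x - t_1) = -\varrho _1(x) + \varrho _1(-x) + 2 \varrho _1(x) , \end{aligned}$$

which uses at most \((n+1)+1 = 3\) neurons and \(2((n+1)+1) = 6\) nonzero weights, in accordance with the claim for \(n+1 = 2\) pieces.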

1.2 Proof of Lemma 4.10

First we show that if \(s \in {\mathbb {N}}\) and if \(\varrho \in {\mathtt {Spline}}^s\) is not a polynomial, then there are \(\alpha ,\beta ,t_{0} \in {\mathbb {R}}\), \(\varepsilon >0\) and p a polynomial of degree at most \(s-1\) such that

$$\begin{aligned} \varrho _{s}(z) = \alpha \varrho (t_{0}+z) + \beta \varrho (t_{0}-z) -p(z) \quad \forall z \in [-\varepsilon ,+\varepsilon ]. \end{aligned}$$
(C.2)

Consider any \(t_{0} \in {\mathbb {R}}\). Since \(\varrho \in {\mathtt {Spline}}^{s}\), there are \(\varepsilon > 0\) and two polynomials \(p_{-},p_{+}\) of degree at most s, with matching \(s-1\) first derivatives at \(t_{0}\), such that

$$\begin{aligned} \varrho (x) = {\left\{ \begin{array}{ll} p_{+}(x)&{} \text {for}\ x \in [t_{0},t_{0}+\varepsilon ]\\ p_{-}(x)&{} \text {for}\ x \in [t_{0}-\varepsilon , t_{0}]. \end{array}\right. } \end{aligned}$$

Since \(\varrho \) is not a polynomial, there is \(t_{0}\) such that the s-th derivatives of \(p_{\pm }\) at \(t_{0}\) do not match, i.e., \(a_{-} := p^{(s)}_{-}(t_{0})/s! \ne p^{(s)}_{+}(t_{0})/s! =: a_{+}\). A Taylor expansion yields

$$\begin{aligned} \varrho (t_0+z) = {\left\{ \begin{array}{ll} q(z)+ a_{+} z^s&{} \text {for}\ z \in [0,\varepsilon ]\\ q(z)+ a_{-} z^s&{} \text {for}\ z \in [-\varepsilon ,0], \end{array}\right. } \end{aligned}$$

where \(q(z) := \sum _{n=0}^{s-1} p_{\pm }^{(n)}(t_{0})z^{n}/n!\) is a polynomial of degree at most \(s-1\). As a result, for \(|z| \le \varepsilon \)

$$\begin{aligned}&a_{+} \cdot [\varrho (t_{0}+z) - q(z)] - (-1)^{s} \, a_{-} \cdot [\varrho (t_{0}-z) - q(-z)] \\&\quad = {\left\{ \begin{array}{ll} (a_{+}^{2}-a_{-}^{2}) \cdot z^s &{} \text {for}\ z \in [0,\varepsilon ] \\ 0 &{} \text {for}\ z \in [-\varepsilon ,0] \end{array}\right. } = (a_{+}^{2}-a_{-}^{2}) \cdot \varrho _s(z). \end{aligned}$$

Since \(a_{+} \ne a_{-}\), setting \(\alpha := a_{+}/(a_{+}^{2}-a_{-}^{2})\) and \(\beta := (-1)^{s+1} a_{-}/(a_{+}^{2}-a_{-}^{2})\), as well as \(p(z) := \alpha q(z)+\beta q(-z)\) we get as claimed \( \varrho _s(z) = \alpha \varrho (z+t_{0}) + \beta \varrho (-z+t_{0}) -p(z) \) for every \(|z| \le \varepsilon \).

Now, consider \(r \in {\mathbb {N}}\). Given \(R > 0\) we now set

$$\begin{aligned} f_{R}(x) := (\tfrac{R}{\varepsilon })^{r} \left[ \alpha \varrho (\varepsilon x/R+t_{0}) +\beta \varrho (-\varepsilon x/R+t_{0}) - p(\varepsilon x / R) \right] \end{aligned}$$

with \(\alpha ,\beta ,t_{0},\varepsilon ,p\) from (C.2). Observe that \( \varrho _r(x) = (R/\varepsilon )^{r} \varrho _r(\varepsilon x/R) = f_{R}(x) \) for all \(x \in [-R,R]\), so that \(f_{R}\) converges locally uniformly to \(\varrho _{r}\) on \({\mathbb {R}}\) as \(R \rightarrow \infty \).

We show by induction on \(r \in {\mathbb {N}}\) that \(f_{R} \in {\mathtt {NN}}^{\varrho ,1,1}_{w,2,m}\) where \(w = w(r),m = m(r) \in {\mathbb {N}}\) only depend on r. For \(r=1\), this trivially holds as the polynomial p in (C.2) is a constant; hence \(f_{R} \in {\mathtt {NN}}^{\varrho ,1,1}_{4,2,2}\).

Assuming the result true for some \(r \in {\mathbb {N}}\), we now prove it for \(r+1\). Consider \(\varrho \in {\mathtt {Spline}}^{r+1}\) that is not a polynomial. The polynomial p in (C.2) with \(s=r+1\) is of degree at most r; hence, by Lemma 2.24 there are \(c,a_{i},b_{i},c_{i} \in {\mathbb {R}}\) such that \( p(x) = c+ \sum _{i=1}^{r+1} a_{i} \, \varrho _r(b_{i} x + c_{i}) \) for all \(x \in {\mathbb {R}}\). Now, observe that since \(\varrho \in {\mathtt {Spline}}^{r+1}\) is not a polynomial, its derivative satisfies \(\varrho ' \in {\mathtt {Spline}}^{r}\) and is not a polynomial either. The induction hypothesis yields \(\varrho _r \in \overline{{\mathtt {NN}}^{\varrho ',1,1}_{w,2,m}}\) for \(w=w(r),m=m(r) \in {\mathbb {N}}\). It is not hard to check that this implies \(p \in \overline{{\mathtt {NN}}^{\varrho ',1,1}_{2(r+1)w,2,(r+1)m}}\). Finally, as \(\varrho '(x)\) is the locally uniform limit of \((\varrho (x+h)-\varrho (x))/h\) as \(h \rightarrow 0\) (see Lemma C.1), we obtain \(p \in \overline{{\mathtt {NN}}^{\varrho ,1,1}_{4(r+1)w,2,2(r+1)m}}\) thanks to Lemma 2.21. Combined with the definition of \(f_{R}\) we obtain \(f_{R} \in \overline{{\mathtt {NN}}^{\varrho ,1,1}_{4(r+1)w+4,2,2(r+1)m+2}}\).

Finally, we quantify \(w,m\): First of all, note that \(w(1) = 4 \le 5\) and \(m(1) = 2 \le 3\); furthermore, \(w(r+1) \le 4(r+1) w(r)+4 \le 5(r+1) w(r)\) and \(m(r+1) \le 2(r+1)m(r)+2 \le 3(r+1)m(r)\). An induction therefore yields \(w(r) \le 5^r r!\) and \(m(r) \le 3^r r!\); indeed, \(w(r+1) \le 5(r+1) w(r) \le 5(r+1) \cdot 5^r r! = 5^{r+1} (r+1)!\), and similarly for m. \(\square \)

1.3 Proof of Lemma 4.11

Step 1: In this step, we construct \(\theta _{R,\delta } \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,\ell ,m}\) satisfying

$$\begin{aligned} |\theta _{R,\delta } (x) - {\mathbb {1}}_{[-R,R]^{d}} (x)| \le 2 \cdot {\mathbb {1}}_{[-R-\delta ,R+\delta ]^{d} {\setminus } [-R,R]^{d}} \qquad \forall \, x \in {\mathbb {R}}^d , \end{aligned}$$
(C.3)

with \(\ell =3\) (resp. \(\ell =2\) if \(d=1\)) and with wm only depending on d and r.

The affine map \( P: {\mathbb {R}}^{d}\rightarrow {\mathbb {R}}^{d}, x = (x_{i})_{i=1}^{d} \mapsto \left( \tfrac{x_{i}}{2(R+\delta )}+\tfrac{1}{2}\right) _{i=1}^{d} \) satisfies \(\Vert P\Vert _{\ell ^{0,\infty }} = \Vert P\Vert _{\ell ^{0,\infty }_*} = 1\). For \(x \in {\mathbb {R}}^d\), we have \(x \in [-R-\delta ,R+\delta ]^{d}\) if and only if \(P(x) \in [0,1]^{d}\), and \(x \in [-R,R]^{d}\) if and only if \(P(x) \in [\varepsilon ,1-\varepsilon ]^{d}\), where \(\varepsilon := \tfrac{\delta }{2(R+\delta )}\); thus, \({\mathbb {1}}_{[-R,R]^d} (P^{-1} x) = {\mathbb {1}}_{[\varepsilon , 1-\varepsilon ]^d}(x)\) for all \(x \in {\mathbb {R}}^d\).
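
Indeed, for the boundary values \(x_i = \pm R\), one computes

$$\begin{aligned} P(x)_i = \frac{\pm R}{2(R+\delta )} + \frac{1}{2} = \frac{(R+\delta ) \pm R}{2(R+\delta )} \in \Big \{ \frac{\delta }{2(R+\delta )}, \, 1 - \frac{\delta }{2(R+\delta )} \Big \} = \{ \varepsilon , 1-\varepsilon \} , \end{aligned}$$

and similarly \(x_i = \pm (R+\delta )\) is mapped to \(\{0,1\}\), which yields the two stated equivalences.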

Next, by combining Lemmas 4.4 and 3.26 (see in particular Eq. (3.10)), we obtain \(f \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,\ell ,m}\) (with the above-mentioned properties for \(w,\ell ,m\) and \(m \ge d\)) such that \( |f(x)-{\mathbb {1}}_{[0,1]^{d}} (x)| \le {\mathbb {1}}_{[0,1]^{d} {\setminus } [\varepsilon ,1-\varepsilon ]^{d}} \) for all \(x \in {\mathbb {R}}^d\). Therefore, the function \(\theta _{R,\delta } := f \circ P\) satisfies

$$\begin{aligned} |\theta _{R,\delta } (x) - {\mathbb {1}}_{[-R,R]^d}(x)|&= |f(P x) - {\mathbb {1}}_{[-R,R]^d} (P^{-1} P x)| \\&\le |f (P x) - {\mathbb {1}}_{[0,1]^d}(P x)| + |{\mathbb {1}}_{[0,1]^d} (P x) - {\mathbb {1}}_{[\varepsilon ,1-\varepsilon ]^d} (P x)| \\&\le 2 \cdot {\mathbb {1}}_{[-R-\delta , R+\delta ]^d {\setminus } [-R,R]^d} (x) \end{aligned}$$

for all \(x \in {\mathbb {R}}^{d}\). Finally, by Lemma 2.18-(1), we have \(\theta _{R,\delta } \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,\ell ,m}\).

Step 2: Consider \(g \in {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N}\) and define \(g_{R,\delta } (x) := \theta _{R,\delta } (x) \cdot g(x)\) for all \(x \in {\mathbb {R}}^d\). The desired estimate (4.6) is an easy consequence of (C.3). It only remains to show that one can implement \(g_{R,\delta }\) with a \(\varrho _{r}\)-network of controlled complexity.

Since we assume \(W \ge 1\), we can use Lemma 2.14; combining it with Eq. (2.1), we get \(g \in {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L',N'}\) with \(L' = \min \{ L,W,N+1 \}\) and \(N' = \min \{ N,W \}\). Lemma 2.17-(2) yields \((\theta _{R,\delta },g) \in {\mathtt {NN}}^{\varrho _r,d,k+1}_{w', L'', m'}\) with \(L'' = \max \{ L',\ell \}\) as well as \(w' = W + w + \min \{ d,k \} \cdot (L''-1)\) and \(m' = N' + m + \min \{ d,k \} \cdot (L''-1)\). Since \(L''-1 = \max \{ L'-1,\ell -1 \} \le \max \{ W-1,\ell -1 \} \le W+\ell -2\) and \(N' \le W\), we get

$$\begin{aligned} w'&\le W + w + \min \{ d,k \} \cdot (W+\ell -2) = W \cdot (1+\min \{ d,k \}) +c_{1} \\ m'&\le W + m + \min \{ d,k \} \cdot (W+\ell -2) = W \cdot (1+\min \{ d,k \}) +c_{2}. \end{aligned}$$

where \(c_{1},c_{2}\) only depend on dkr.

As \(r \ge 2\), Lemma 2.24 shows that \(\varrho _{r}\) can represent any polynomial of degree two with \({n=2(r+1)}\) terms. Thus, Lemma 2.26 shows that the multiplication map \(M : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}^k, (x,y) \mapsto x \cdot y\) satisfies \(M \in {\mathtt {NN}}^{\varrho _r, 1+k, k}_{12k(r+1), 2,4k(r+1)}\). Finally, Lemma 2.18-(3) proves that \(g_{R,\delta } = M \circ (\theta _{R,\delta },g) \in {\mathtt {NN}}^{\varrho _r,d,k}_{w'',L''',m''}\), where \({L''' = L''+1}\) and \(m'' = m'+ 4k(r+1) = N' + m + \min \{ d,k \} \cdot (L''-1) + 4k(r+1)\) as well as \({w'' = w' + \max \{ m',d \} \cdot 12 k(r+1)}\).

As \(L'' = \max \{ L',\ell \} \le \max \{ L,\ell \}\) we have \(L''' \le \max \{ L+1,4 \}\) (respectively \(L''' \le \max \{ L+1,3 \}\) if \(d=1\)). Furthermore, since \(m' \ge m \ge d\) we have \(\max \{ m',d \} = m'\). Because of \(W \ge 1\), we thus see that

$$\begin{aligned} w'' = w'+m' \cdot 12 k(r+1) \le W \cdot (1 + \min \{ d,k \}) \cdot (1 + 12k(r+1)) + c_{3} \le c_{4}W \end{aligned}$$

where \(c_{3},c_{4}\) only depend on dkr. Finally, \(L''-1 = \max \{ L'-1,\ell -1 \} \le \max \{ N,\ell -1 \} \le N + \ell - 1\). Since \(N' \le N\), we get \(m'' \le N \cdot (1+\min \{ d,k \}) + c_{5} \le c_{6} N\) where again \(c_{5},c_{6}\) only depend on dkr. To conclude, we set \(c := \max \{ c_{4},c_{6} \}\). \(\square \)

1.4 Proof of Proposition 4.12

When \(r=1\) and \(\varrho \in {\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,m}\), the result follows from Lemma 2.19.

Now, consider \(f \in {\mathtt {NN}}_{W,L,N}^{\varrho , d, k}\) such that \(f|_{\Omega } \in X\). Since \(\varrho \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,m}}\), Lemma 2.21 shows that

$$\begin{aligned}&{\mathtt {NN}}_{W, L,N}^{\varrho , d, k} \subset \overline{{\mathtt {NN}}_{Wm^{2}, L,Nm}^{\varrho _r, d, k}},\nonumber \\&\quad \text { with closure in the topology of locally uniform convergence on } {\mathbb {R}}^d. \end{aligned}$$
(C.4)

For bounded \(\Omega \), locally uniform convergence implies convergence in \(X_p (\Omega ; {\mathbb {R}}^k)\) for all \(p \in (0,\infty ]\), and hence the result follows.

For unbounded \(\Omega \), we need to work a bit harder. First, we deal with the degenerate case where \(W=0\) or \(N=0\). If \(W=0\), then by Lemma 2.13, f is a constant map; hence, \(f \in {\mathtt {NN}}^{\varrho _r,d,k}_{0,1,0}\). If \(N=0\), then f is affine-linear with \(\Vert f\Vert _{\ell ^{0}} \le W\); hence, \(f\in {\mathtt {NN}}^{\varrho _r,d,k}_{W,1,0}\). In both cases, the result trivially holds.

From now on, we assume that \(W,N \ge 1\). Consider \(\varepsilon > 0\). By the dominated convergence theorem (in case of \(p < \infty \)) or our special choice of \(X_\infty (\Omega ; {\mathbb {R}}^k)\) [cf. Eq. (1.3)] (in case of \(p = \infty \)), we see that there is some \(R \ge 1\) such that

$$\begin{aligned} \Vert f - f \cdot {\mathbb {1}}_{[-R,R]^{d}}\Vert _{L_p(\Omega ; {\mathbb {R}}^k)} \le \varepsilon ' := \frac{\varepsilon }{8^{1 / \min \{1,p\}}}. \end{aligned}$$

Denoting by \(\lambda (\cdot )\) the Lebesgue measure, (C.4) implies that there is \(g \in {\mathtt {NN}}_{Wm^{2}, L,Nm}^{\varrho _r, d, k}\) such that

$$\begin{aligned} \Vert f - g\Vert _{L_\infty ([-R-1,R+1]^{d};{\mathbb {R}}^{k})} \le \varepsilon ' \big / \, \big [ \lambda ([-R-1,R+1]^{d}) \big ]^{1/p} . \end{aligned}$$

Consider \(c = c(d,k,r)\), \(\ell = \min \{ d+1, 3 \}\), \(L' = \max \{ L+1,\ell \}\) and the function \(g_{R,1} \in {\mathtt {NN}}^{\varrho _r, d, k}_{cWm^{2},L',cNm}\) from Lemma 4.11. By (4.6) and the fact that \(\Vert \cdot \Vert _{L_p}^{\min \{1,p\}}\) is subadditive, we see

$$\begin{aligned} \Vert f - g_{R,1}\Vert _{L_p (\Omega ;{\mathbb {R}}^k)}^{\min \{1,p\}}&\le \Vert f - f \cdot {\mathbb {1}}_{[-R,R]^d} \Vert _{L_p (\Omega ; {\mathbb {R}}^k)}^{\min \{1,p\}} + \Vert (f - g) {\mathbb {1}}_{[-R,R]^d} \Vert _{L_p (\Omega ; {\mathbb {R}}^k)}^{\min \{1,p\}} \\&\quad + \Vert g \cdot {\mathbb {1}}_{[-R,R]^d} - g_{R,1} \Vert _{L_p (\Omega ;{\mathbb {R}}^k)}^{\min \{1,p\}} \\&\le \frac{\varepsilon ^{\min \{1,p\}}}{8} + \left( \Vert f - g\Vert _{L_\infty ([-R-1,R+1]^d;{\mathbb {R}}^{k})} \cdot [ \lambda ([-R,R]^d) ]^{1/p} \right) ^{\min \{1,p\}} \\&\quad + \left( \Vert \ 2 \cdot |g| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d}\Vert _{L_p (\Omega )} \right) ^{\min \{1,p\}} \\&\le \frac{\varepsilon ^{\min \{1,p\}}}{2} + \left( \Vert \ 2 \cdot |g| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d}\Vert _{L_p (\Omega )} \right) ^{\min \{1,p\}} . \end{aligned}$$

To estimate the final term, note that

$$\begin{aligned}&\Big ( \Vert \ |g| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d} \Vert _{L_p (\Omega )} \Big )^{\min \{1,p\}} \\&\le \Big ( \Vert \ |g-f| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d}\Vert _{L_p (\Omega )} \Big )^{\min \{1,p\}} \\&\quad + \Big ( \Vert \ |f| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d} \Vert _{L_p (\Omega )} \Big )^{\min \{1,p\}} \\&\le \Big ( \Vert f - g \Vert _{L_\infty ([-R-1,R+1]^d;{\mathbb {R}}^{k})} \cdot [\lambda ([-R-1,R+1]^d)]^{1/p} \Big )^{\min \{1,p\}} \\&\quad + \Big ( \Vert f - f \cdot {\mathbb {1}}_{[-R,R]^d}\Vert _{L_p (\Omega ; {\mathbb {R}}^k)} \Big )^{\min \{1,p\}} \\&\le \frac{\varepsilon ^{\min \{1,p\}}}{8} + \frac{\varepsilon ^{\min \{1,p\}}}{8} . \end{aligned}$$

Because of \(2^{\min \{1,p\}} \le 2\), this implies \( \Big ( \Vert \ 2 \cdot |g| \cdot {\mathbb {1}}_{[-R-1,R+1]^d {\setminus } [-R,R]^d} \Vert _{L_p (\Omega )} \Big )^{\min \{1,p\}} \le \tfrac{\varepsilon ^{\min \{1,p\}}}{2} \). Overall, we thus see that \(\Vert f - g_{R,1} \Vert _{L_p (\Omega ; {\mathbb {R}}^k)} \le \varepsilon < \infty \). Because of \(f|_{\Omega } \in X\), this implies in particular that \(g_{R,1}|_{\Omega } \in X\). Since \(\varepsilon > 0\) was arbitrary, we get as desired that \( f|_{\Omega } \in \overline{{\mathtt {NN}}^{\varrho _r, d, k}_{cWm^{2}, L', cNm} \cap X}^{X} \), where the closure is taken in X. \(\square \)

Appendix D. Proofs for Section 5

1.1 Proof of Lemma 5.2

In light of (4.1), we have \(\beta _{+}^{(t)} \in {\mathtt {NN}}^{\varrho _t,1,1}_{2(t+2),2,t+2}\). This yields the result for \(d=1\), including when \(t=1\).

For \(d \ge 2\) and \(t \ge \min \{d,2\} = 2\), define \(f_j:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) by \(f_j := \beta ^{(t)}_+ \circ \pi _j\) with \(\pi _j:{\mathbb {R}}^d\rightarrow {\mathbb {R}}, x \mapsto x_j\), \(j=1,\ldots ,d\). By Lemma 2.18–(1) together with the fact that \(\Vert \pi _j\Vert _{\ell ^{0,\infty }_*} = 1\), we get \(f_j \in {\mathtt {NN}}^{\varrho _t,d,1}_{2(t+2),2,t+2}\). Form the vector function \(f := (f_1,f_2,\ldots ,f_d)\). Using Lemma 2.17-(2), we deduce \(f\in {\mathtt {NN}}^{\varrho _t,d,d}_{2d(t+2),2,d(t+2)}\).

As \(t \ge 2\), by Lemma 2.24, \(\varrho _t\) can represent any polynomial of degree two with \(n = 2(t+1)\) terms. Hence, for \(d \ge 2\), by Lemma 2.26 the multiplication function \(M_d : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, (x_1,\dots ,x_d) \mapsto x_1 \cdots x_d\) satisfies \(M_{d} \in {\mathtt {NN}}^{\varrho _t,d,1}_{4n(2^{j}-1),2j,(2n+1)(2^{j}-1)-1}\) with \(j := \lceil \log _{2} d \rceil \). By definition, \(2^{j-1} < d \le 2^{j}\); hence, \(2^{j}-1 \le 2(d-1)\) and \(6 n (2^{j}-1) \le 12 n(d-1) = 24(t+1)(d-1)\), as well as

$$\begin{aligned} (2n+1)(2^{j}-1)-1 \le (4n+2)(d-1)-1 = (8t+10)(d-1)-1 , \end{aligned}$$

so that \(M_{d} \in {\mathtt {NN}}^{\varrho _t,d,1}_{24(t+1)(d-1),2j,(8t+10)(d-1)-1}\). As \(\beta _d^{(t)} = M_{d} \circ f\), by Lemma 2.18–(2), we get

$$\begin{aligned} \beta _d^{(t)} \in {\mathtt {NN}}^{\varrho _t,d,1}_{2d(t+2)+24(t+1)(d-1),2j+2,d(t+2)+(8t+10)(d-1)-1+d}. \end{aligned}$$

To conclude, we observe that

$$\begin{aligned}&2 d(t + 2) + 24(t+1)(d-1) \le d(2t+4+24t+24) = d(26t+28) \le 28d(t+1), \\&d(t+2)+(8t+10)(d-1)-1+d \le d(t+2+8t+10+1) \\&\quad = d(9t+13) \le 13d(t+1). \square \end{aligned}$$

1.2 Proof of Theorem 5.5

We divide the proof into three steps.

Step 1 (Recalling results from [19]): Using the tensor B-splines \(\beta _d^{(t)}\) introduced in Eq. (5.5), define \(N := N^{(\tau )} := \beta _d^{(\tau -1)}\) for \(\tau \in {\mathbb {N}}\), and note that this coincides with the definition of N in [19, Equation (4.1)]. Next, as in [19, Equations (4.2) and (4.3)], for \(k \in {\mathbb {N}}_0\) and \(j \in {\mathbb {Z}}^d\), define \(N_k^{(\tau )} (x) := N^{(\tau )} (2^k x)\) and \(N_{j,k}^{(\tau )} (x) := N^{(\tau )} (2^k x - j)\). Furthermore, let \(\Omega _0 := (- \tfrac{1}{2}, \tfrac{1}{2})^d\) denote the unit cube, and set

$$\begin{aligned} \Lambda ^{(\tau )}(k) := \big \{ j \in {\mathbb {Z}}^d :N_{j,k}^{(\tau )}|_{\Omega _0} \not \equiv 0 \big \} \quad \text {and} \quad \Sigma _k^{(\tau )} := \mathrm {span} \{ N_{j,k}^{(\tau )} :j \in \Lambda ^{{(\tau )}}(k) \big \}, \end{aligned}$$

and finally \(s_k^{(\tau )} (f)_p := \inf _{g \in \Sigma _k^{(\tau )}} \Vert f - g\Vert _{L_p}\) for \(f \in X_p (\Omega _0)\) and \(k \in {\mathbb {N}}_0\). Setting \(\lambda ^{(\tau ,p)} := \tau - 1 + \min \{ 1, p^{-1} \}\), [19, Theorem 5.1] shows

$$\begin{aligned}&\Vert f\Vert _{B_{p,q}^\alpha (\Omega _0)} \asymp \Vert f\Vert _{L_p} + \big \Vert \big ( s_k^{(\tau )} (f)_p \big )_{k \in {\mathbb {N}}_0} \big \Vert _{\ell _q^\alpha } \quad \forall \, p,q \in (0,\infty ], \alpha \in (0,\lambda ^{(\tau ,p)}),\nonumber \\&\quad \text { and } f \in B_{p,q}^\alpha (\Omega _0). \end{aligned}$$
(D.1)

Here, \(\Vert (c_k)_{k \in {\mathbb {N}}_0} \Vert _{\ell _q^\alpha } = \Vert (2^{\alpha k} \, c_k)_{k \in {\mathbb {N}}_0}\Vert _{\ell ^q}\); see [19, Equation (5.1)].

Step 2 (Proving the embedding \(B_{p,q}^{d \alpha } (\Omega _0) \hookrightarrow A_q^\alpha (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))\)): Define \(\Sigma ({\mathcal {D}}_d^t) := (\Sigma _n ({\mathcal {D}}_d^t))_{n \in {\mathbb {N}}_0}\). In this step, we show that \(B_{p,q}^{d \alpha } (\Omega _0) \hookrightarrow A_q^\alpha (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))\) for any \(\tau \in {\mathbb {N}}\) and all \(p,q \in (0,\infty ]\) and \(\alpha > 0\) with \(0< d \alpha < \lambda ^{(\tau ,p)}\).

To this end, we first show that if we choose \(X = X_{p}(\Omega _0)\), then the family \(\Sigma ({\mathcal {D}}_d^{\tau -1})\) satisfies the properties (P1)–(P5). To see this, we first have to show \(\Sigma _n({\mathcal {D}}_d^{\tau -1}) \subset X_p(\Omega _0)\). For \(p < \infty \), this is trivial, since \(N^{(\tau )} = \beta _d^{(\tau -1)}\) is bounded and measurable. For \(p = \infty \), this holds as well, since if \(\tau \ge 2\), then \(N^{(\tau )} = \beta _d^{(\tau -1)}\) is continuous; finally, the case \(\tau = 1\) cannot occur for \(p = \infty \), since this would imply

$$\begin{aligned} 0< d \alpha < \lambda ^{(\tau ,p)} = \tau - 1 + \min \{ 1, p^{-1} \} = 0. \end{aligned}$$

Next, Properties (P1)–(P4) are trivially satisfied. Finally, the density of \(\bigcup _{n=0}^\infty \Sigma _n ({\mathcal {D}}_{d}^{\tau -1})\) in \(X_p(\Omega _0)\) is well known for \(\tau = 1\), since then \(\beta _d^{(\tau -1)} = \beta _d^{(0)} = {\mathbb {1}}_{[0,1)^d}\) and \(p < \infty \). For \(\tau \ge 2\), the density follows with the same arguments that were used for the case \(p = \infty \) in Section B.3.3.

Next, note that \({{\text {supp}}}N^{(\tau )} \subset [0,\tau ]^d\) and thus \({{\text {supp}}}N^{(\tau )}_{j,k} \subset 2^{-k} (j + [0,\tau ]^d)\). Therefore, if \(j \in \Lambda ^{(\tau )}(k)\), then \(\varnothing \ne \Omega _0 \cap {{\text {supp}}}N_{j,k}^{(\tau )}\), so that there is some \(x \in \Omega _0 \cap 2^{-k}(j + [0,\tau ]^d)\). This implies \({j \in {\mathbb {Z}}^d \cap [-2^{k-1} - \tau , 2^{k-1}]^d}\), and thus, \(|\Lambda ^{{(\tau )}}(k)| \le (2^k + \tau + 1)^d\), since each coordinate of such a j can take at most \(2^k + \tau + 1\) integer values. Directly by definition of \(\Sigma _n({\mathcal {D}}_d^t)\) and \(\Sigma _k^{(\tau )}\), this implies

$$\begin{aligned} \Sigma _k^{(\tau )} \subset \Sigma _{(2^k + \tau + 1)^d} ({\mathcal {D}}_d^{\tau -1}) \qquad \forall \, k \in {\mathbb {N}}_0 . \end{aligned}$$
(D.2)

Next, since we are assuming \(0< \alpha d < \lambda ^{(\tau ,p)}\), Eq. (D.1) yields a constant \(C_1 = C_1 (p,q,\alpha ,\tau ,d) > 0\) such that \( \Vert f\Vert _{L_p} + \big \Vert \big (s_k^{(\tau )} (f)_p \big )_{k \in {\mathbb {N}}_0} \big \Vert _{\ell _q^{d \alpha }} \le C_1 \cdot \Vert f\Vert _{B_{p,q}^{d \alpha }(\Omega _0)} \) for all \(f \in B_{p,q}^{d \alpha }(\Omega _0)\). Therefore, we see for \(f \in B_{p,q}^{d \alpha }(\Omega _0)\) and \(q < \infty \) that

$$\begin{aligned}&\Vert f\Vert _{A_q^\alpha (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))}^q \\&\quad = \sum _{n=1}^{\infty } n^{-1} \cdot [n^\alpha \cdot E(f, \Sigma _{n-1}({\mathcal {D}}_d^{\tau -1}))_{L_p(\Omega _0)}]^q \\&\quad \le \Vert f\Vert _{L_p}^q \sum _{n=1}^{(\tau +2)^d} n^{\alpha q - 1} + \sum _{k=0}^{\infty } \sum _{n = (2^k + \tau + 1)^d + 1}^{(2^{k+1} + \tau + 1)^d} n^{\alpha q - 1} [E (f, \Sigma _{n-1}({\mathcal {D}}_d^{\tau -1}))_{L_p(\Omega _0)}]^q \\&\qquad \overset{(*)}{\le } C_2 \cdot \Vert f\Vert _{L_p}^q + C_4 \sum _{k=0}^\infty 2^{kd} 2^{dk (\alpha q - 1)} [s_k^{(\tau )}(f)_p]^q \\&\quad \le (C_2 + C_4) \cdot \big ( \Vert f\Vert _{L_p} + \big \Vert \big ( s_k^{(\tau )} (f)_p \big )_{k \in {\mathbb {N}}_0} \big \Vert _{\ell _q^{d\alpha }} \big )^q \le C_1^q \cdot (C_2 + C_4) \cdot \Vert f\Vert _{B_{p,q}^{d\alpha }(\Omega _0)}^q . \end{aligned}$$

At the step marked with \((*)\), we used that Eq. (D.2) yields \( \Sigma _{n-1}({\mathcal {D}}_d^{\tau -1}) \supset \Sigma _{(2^k+\tau +1)^d} ({\mathcal {D}}_d^{\tau -1}) \supset \Sigma _k^{(\tau )} \) for all \(n \ge 1 + (2^k + \tau + 1)^d\), and furthermore that if \(1 + (2^k + \tau + 1)^d \le n \le (2^{k+1} + \tau + 1)^d\), then \(2^{dk} \le n \le (\tau +3)^d \cdot 2^{dk}\), so that \(n^{\alpha q - 1} \le C_3 2^{dk (\alpha q - 1)}\) for some constant \(C_3 = C_3(d,\tau ,\alpha ,q)\), and finally that \( \sum _{n= (2^k + \tau + 1)^d + 1}^{(2^{k+1} + \tau + 1)^d} 1 \le (2^{k+1} + \tau + 1)^d \le (\tau +3)^d \cdot 2^{dk} \).

For \(q = \infty \), the proof is similar. Setting \(\ell _k := (2^k + \tau + 1)^d\) for brevity, we see with similar estimates as above that

$$\begin{aligned}&\Vert f\Vert _{A_\infty ^\alpha (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))} \\&\quad = \max \Big \{ \max _{0 \le n \le (\tau +2)^d} n^\alpha \, E(f, \Sigma _{n-1}({\mathcal {D}}_d^{\tau -1}))_{L_p(\Omega _0)}, \quad \sup _{k \in {\mathbb {N}}_0} \max _{\ell _k + 1 \le n \le \ell _{k+1}} n^\alpha \, E(f, \Sigma _{n-1}({\mathcal {D}}_d^{\tau -1}))_{L_p (\Omega _0)} \Big \} \\&\quad \le \max \Big \{ (\tau +2)^{\alpha d} \, \Vert f\Vert _{L_p(\Omega _0)}, \quad \sup _{k \in {\mathbb {N}}_0} (\tau +3)^{\alpha d} \, 2^{\alpha d k} s_k^{(\tau )}(f)_p \Big \} \\&\quad \le (\tau +3)^{\alpha d} \big ( \Vert f\Vert _{L_p(\Omega _0)} + \Vert (s_k^{(\tau )}(f)_p)_{k \in {\mathbb {N}}_0} \Vert _{\ell _q^{d \alpha }} \big ) \le C_1 \, (\tau +3)^{\alpha d} \, \Vert f\Vert _{B_{p,\infty }^{d \alpha }(\Omega _0)}. \end{aligned}$$

Overall, we have shown \(B_{p,q}^{d \alpha }(\Omega _0) \hookrightarrow A_q^\alpha (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{\tau -1}))\) for \(\tau \in {\mathbb {N}}\), \(p,q \in (0,\infty ]\) and \(0< \alpha d < \lambda ^{(\tau ,p)}\).

Step 3 (Proving the embeddings (5.9) and (5.10)): In case of \(d = 1\), let us set \(r_0 := r\), while \(r_0\) is as in the statement of the theorem for \(d > 1\). Since \(\Omega \) is bounded and \(\Omega _0 = (-\tfrac{1}{2}, \tfrac{1}{2})^d\), there is some \(R > 0\) such that \(\Omega \subset R \cdot \Omega _0\). Let us fix \(p,q \in (0,\infty ]\) and \(s > 0\) such that \(d s < r_0 + \min \{1, p^{-1} \}\).

Since \(\Omega \) and \(R \cdot \Omega _0\) are bounded Lipschitz domains, there exists a (not necessarily linear) extension operator \({\mathcal {E}} : B^{d s}_{p,q} (\Omega ) \rightarrow B^{d s}_{p,q} (R \Omega _0)\) with the properties \(({\mathcal {E}} f)|_{\Omega } = f\) and \(\Vert {\mathcal {E}} f\Vert _{B^{d s}_{p,q}(R \Omega _0)} \le C \cdot \Vert f\Vert _{B^{d s}_{p,q}(\Omega )}\) for all \(f \in B^{d s}_{p,q}(\Omega )\). Indeed, for \(p \in [1,\infty ]\) this follows from [37, Section 4, Corollary 1], since this corollary yields an extension operator \({\mathcal {E}} : X_p (\Omega ) \rightarrow X_p (R \Omega _0)\) with the additional property that the j-th modulus of continuity \(\omega _j\) satisfies \(\omega _j (t, {\mathcal {E}} f)_{R \Omega _0} \le M_j \cdot \omega _j (t, f)_{\Omega }\) for all \(j \in {\mathbb {N}}\), all \(f \in X_p(\Omega )\), and all \(t \in [0,1]\). In view of the definition of the Besov spaces (see in particular [21, Chapter 2, Theorem 10.1]), this easily implies the result. Finally, in case of \(p \in (0,1)\), the existence of the extension operator follows from [20, Theorem 6.1]. In addition to the existence of the extension operator, we will also need that the dilation operator \(D_1 : B^{d s}_{p,q}(R \Omega _0) \rightarrow B^{d s}_{p,q} (\Omega _0), f \mapsto f(R \bullet )\) is well-defined and bounded, say \(\Vert D_1\Vert \le C_1\); this follows directly from the definition of the Besov spaces.

We first prove Eq. (5.9), that is, we consider the case \(d = 1\). To this end, define \(\tau := r + 1 \in {\mathbb {N}}\), let \(f \in B^{s}_{p,q}(\Omega )\) be arbitrary, and set \(f_1 := D_1 ({\mathcal {E}} f) \in B^{s}_{p,q}(\Omega _0)\). By applying Step 2 with \(\alpha = s\) (and noting that \(0< d \alpha = s < r + \min \{1,p^{-1}\} = \lambda ^{(\tau ,p)}\)), we get \(f_1 \in A_q^s (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{r}))\), with \(\Vert f_1\Vert _{A_q^s (X_p(\Omega _0), \Sigma ({\mathcal {D}}_d^{r}))} \le C C_1 C_2 \cdot \Vert f\Vert _{B^{d s}_{p,q}(\Omega )}\), where the constant \(C_2\) is provided by Step 2.

Next, we note that \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge 2 = 2 + 2 \lceil \log _2 d \rceil \) and \(r \ge 1 = \min \{d,2\}\), so that Corollary 5.4-(2) shows . But it is an easy consequence of Lemma 2.18-(1) that the dilation operator is well-defined and bounded. Hence, we see that with . Now, note \(D_2 f_1(x) = f_1 (x/R) = {\mathcal {E}} f (x) = f(x)\) for all \(x \in \Omega \subset R \Omega _0\), and hence \(f = (D_2 f_1)|_{\Omega }\). Thus, Remark 3.17 implies that with , as claimed.

Now, we prove Eq. (5.10). To this end, define \(\tau := r_0 + 1 \in {\mathbb {N}}\), let \(f \in B^{s d}_{p,q}(\Omega )\) be arbitrary, and set \(f_1 := D_1 ({\mathcal {E}} f) \in B^{d s}_{p,q}(\Omega _0)\). Applying Step 2 with \(\alpha = s\) (noting \({0< d \alpha = d s < r_0 + \min \{1, p^{-1}\} = \lambda ^{(\tau ,p)}}\)), we get \(f_1 \in A_q^s (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{r_0}))\), with \( \Vert f_1\Vert _{A_q^s (X_p (\Omega _0), \Sigma ({\mathcal {D}}_d^{r_0}))} \le C C_1 C_2 \cdot \Vert f\Vert _{B^{d s}_{p,q}(\Omega )} \), where the constant \(C_2\) is provided by Step 2.

Next, we claim that . Indeed, if \(r \ge 2\) and \(L \ge 2 + 2 \lceil \log _2 d \rceil \), then this follows from Corollary 5.4-(2). Otherwise, we have \(r_0 = 0\) and \(L \ge 3 \ge \min \{d+1, 3\}\), so that the claim follows from Corollary 5.4-(1); here, we note that \(p < \infty \), since we would otherwise get the contradiction \(0< \alpha d < r_0 + \min \{1, p^{-1} \} = 0\). Therefore, with . The rest of the argument is exactly as in the case \(d = 1\). \(\square \)

1.3 Proof of Lemma 5.10

Lemma 5.10 shows that deeper networks can implement the sawtooth function \(\Delta _j\) using fewer connections/neurons than shallower networks. The reason for this is indicated by the following lemma.

Lemma D.1

For arbitrary \(j \in {\mathbb {N}}\), we have \(\Delta _j \circ \Delta _1 = \Delta _{j+1}\).\(\blacktriangleleft \)

Proof

It suffices to verify the identity on [0, 1], since if \(x \in {\mathbb {R}}{\setminus } [0,1]\), then \(\Delta _1 (x) = 0 = \Delta _{j+1} (x)\), so that \(\Delta _j(\Delta _1(x)) = \Delta _j (0) = 0 = \Delta _{j+1} (x)\). We now distinguish two cases for \(x \in [0,1]\).

Case 1: \(x \in [0,\tfrac{1}{2}]\). This implies \(\Delta _1 (x) = 2x\), and hence (recall the definition of \(\Delta _{j}\) in Eq. (5.11))

$$\begin{aligned} \Delta _j (\Delta _1(x)) = \sum _{k=0}^{2^{j-1} - 1} \Delta _1 \big ( 2^{j-1} 2x - k \big ) = \sum _{k=0}^{2^{j - 1} - 1} \Delta _1(2^{(j+1) - 1} x - k) = \Delta _{j+1}(x). \end{aligned}$$

In the last equality we used that \(2^j x - k \le 2^{j-1} - k \le 0\) for \(k \ge 2^{j-1}\), so that \(\Delta _1 (2^j x - k) = 0\) for those k.

Case 2: \(x \in (\tfrac{1}{2}, 1]\).

Observe that \(\Delta _j (x) = \Delta _j (1-x)\) for all \(x \in {\mathbb {R}}\) and \(j \in {\mathbb {N}}\). Since \(x' := 1-x \in [0,1/2]\), this identity and Case 1 yield \( \Delta _j \circ \Delta _1 (x) = \Delta _j \circ \Delta _1 (1-x) = \Delta _{j+1}(1-x) = \Delta _{j+1}(x) \). \(\square \)
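
For instance, for \(j = 1\) and \(x \in [0, \tfrac{1}{4}]\) we have \(\Delta _1 (x) = 2x \in [0, \tfrac{1}{2}]\), so that

$$\begin{aligned} \Delta _1 (\Delta _1 (x)) = \Delta _1 (2x) = 4x \qquad \text {and} \qquad \Delta _2 (x) = \Delta _1 (2x) + \Delta _1 (2x - 1) = 4x + 0 = 4x , \end{aligned}$$

in accordance with the identity \(\Delta _1 \circ \Delta _1 = \Delta _2\).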

Using Lemma D.1, we can now provide the proof of Lemma 5.10.

Proof of Lemma 5.10

Part (1): Write \(j = k (L-1) + s\) for suitable \(k \in {\mathbb {N}}_0\) and \(0 \le s \le L - 2\). Note that this implies \(k \le j / (L-1)\). Thanks to Lemma D.1, we have \(\Delta _j = \Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k\), where \(\Delta _k\) occurs \(L-2\) times. Furthermore, since \(\Delta _k : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is continuous and piecewise affine with \(2 + 2^k\) pieces (see Fig. 4, and note that we consider \(\Delta _k\) as a function on all of \({\mathbb {R}}\), not just on [0, 1]), Lemma 4.9 shows that \(\Delta _k \in {\mathtt {NN}}^{\varrho _1,1,1}_{\infty ,2,3+2^k}\). By the same reasoning, we get \(\Delta _{k+s} \in {\mathtt {NN}}_{\infty ,2,3+2^{k+s}}^{\varrho _1,1,1}\). Now, a repeated application of Lemma 2.18-(3) shows that

$$\begin{aligned} \Delta _j = \Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k \in {\mathtt {NN}}_{\infty ,L,(L-2)(3 + 2^k) + 3 + 2^{k+s}}^{\varrho _1,1,1} . \end{aligned}$$

Finally, \(\Delta _j \in {\mathtt {NN}}^{\varrho _1,1,1}_{\infty ,L,C_L \cdot 2^{j/(L-1)}}\) with \(C_L := 4 \, L + 2^{L-1}\) since

$$\begin{aligned} (L-2)(3+2^k)+3+2^{k+s}= & {} 3(L-1) + (L-2+2^{s})2^{k} \le (4L-5 + 2^{L-2}) 2^k \\\le & {} (4L+2^{L-1}) 2^{j/(L-1)} = C_{L} \cdot 2^{j/(L-1)}. \end{aligned}$$

Part (2): Set \(\kappa := \lfloor L/2\rfloor \) and write \(j = k \kappa + s\) for \(k \in {\mathbb {N}}_0\) and \(0 \le s \le \kappa - 1\). Note that \(k \le j / \kappa = j / \lfloor L/2 \rfloor \). As above, \(\Delta _j = \Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k\), where \(\Delta _k\) occurs \(\kappa - 1\) times, and since \(\Delta _k : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is continuous and piecewise affine with \(2 + 2^k\) pieces, using Lemma 4.9 again shows that \(\Delta _k \in {\mathtt {NN}}^{\varrho _1,1,1}_{6+2^{k+1},2,\infty }\), and

\(\Delta _{k+s} \in {\mathtt {NN}}_{6+2^{k+s+1},2,\infty }^{\varrho _1,1,1}\). Now, a repeated application of Lemma 2.18-(2) shows that

$$\begin{aligned} \Delta _j = \Delta _{k+s} \circ \Delta _k \circ \cdots \circ \Delta _k \in {\mathtt {NN}}_{6+2^{k+s+1} + (\kappa - 1)(6 + 2^{k+1}), 2 + 2 \cdot (\kappa -1), \infty }^{\varrho _1,1,1} . \end{aligned}$$

Finally, \(\Delta _j \in {\mathtt {NN}}^{\varrho _1,1,1}_{C_{L} 2^{j/\lfloor L/2\rfloor },L,\infty }\), as \(2+2(\kappa -1) = 2\kappa \le L\), \(s+1 \le \kappa \le L/2 \le L-1\) (since \(L \ge 2\)) and

$$\begin{aligned} 6+2^{k+s+1} + (\kappa - 1)(6 + 2^{k+1})= & {} 6\kappa + (2^{s+1}+2)2^{k} \le (3L+2^{L-1}+2) 2^{j / \lfloor L/2 \rfloor } \\\le & {} C_L \cdot 2^{j / \lfloor L/2 \rfloor }. \end{aligned}$$

\(\square \)

1.4 Proof of Lemma 5.12

For \(h \in {\mathbb {R}}^d\), we define the translation operator \(T_h\) by \((T_h f)(x) = f(x - h)\) for \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\). With this, the h-difference operator of order k is given by \(D_h^k = (D_h)^k\), where \(D_h := (T_{-h} - \mathrm {id})\). For later use, we note for \(a > 0\) that \(D_h [f(a \bullet )](x) = (D_{a h}f)(a x)\), as can be verified by a direct calculation. By induction, this implies \(D_h^k [f(a \bullet )] = (D_{a h}^k f)(a \bullet )\) for all \(k \in {\mathbb {N}}\). Furthermore, \(T_x D_h^k = D_h^k T_x\) for all \(x,h \in {\mathbb {R}}^d\) and \(k \in {\mathbb {N}}\).
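
These identities follow from direct calculations; for instance, for the first one,

$$\begin{aligned} D_h [f(a \bullet )](x) = f\big ( a (x+h) \big ) - f(a x) = f(a x + a h) - f(a x) = (D_{a h} f)(a x) . \end{aligned}$$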

A direct computation shows

$$\begin{aligned} \Delta _1 = \widetilde{\Delta _1} + 2 \widetilde{\Delta _1}(\bullet - \tfrac{1}{4}) + \widetilde{\Delta _1} (\bullet - \tfrac{1}{2}) = (T_{1/4} + \mathrm {id})^2 \widetilde{\Delta _1} \quad \text {where} \quad \widetilde{\Delta _1} := \frac{1}{2} \Delta _1 (2 \bullet ). \end{aligned}$$

Next, note that \((T_{-1/4} - \mathrm {id}) (T_{1/4} + \mathrm {id}) = T_{-1/4} - T_{1/4}\), and hence, since \(T_{-1/4}\) and \(T_{1/4}\) commute,

$$\begin{aligned} D_{1/4}^2 \Delta _1= & {} (T_{-1/4} - \mathrm {id})^2 (T_{1/4} + \mathrm {id})^2 \widetilde{\Delta _1} = (T_{-1/4} - T_{1/4})^2 \widetilde{\Delta _1} \nonumber \\= & {} (T_{-1/2} - 2 \, \mathrm {id}+ T_{1/2}) \widetilde{\Delta _1}. \end{aligned}$$
(D.3)

Moreover by induction on \(\ell \in {\mathbb {N}}_{0}\), we see that

$$\begin{aligned} \sum _{k=0}^{\ell } T_k (T_{-\frac{1}{2}} - 2 \, \mathrm {id}+ T_{\frac{1}{2}}) = T_{-\frac{1}{2}} + T_{\frac{2\ell + 1}{2}} + 2 \sum _{i=0}^{2\ell } (-1)^{i-1} \, T_{\frac{i}{2}}. \end{aligned}$$
(D.4)
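
For instance, for \(\ell = 0\) both sides of (D.4) coincide, since

$$\begin{aligned} T_{-\frac{1}{2}} + T_{\frac{1}{2}} + 2 \sum _{i=0}^{0} (-1)^{i-1} \, T_{\frac{i}{2}} = T_{-\frac{1}{2}} + T_{\frac{1}{2}} - 2 \, T_0 = T_{-\frac{1}{2}} - 2 \, \mathrm {id}+ T_{\frac{1}{2}} ; \end{aligned}$$

the induction step follows by adding \(T_{\ell + 1} \, (T_{-\frac{1}{2}} - 2 \, \mathrm {id}+ T_{\frac{1}{2}})\) to both sides.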

Define \(h_j := 2^{-(j+1)}\), so that \(2^{j-1} h_j = 1/4\). Since \(\Delta _j = \sum _{k=0}^{2^{j-1} - 1} (T_k \Delta _1)(2^{j-1} \bullet )\) [cf. Eq. (5.11)], Equations (D.3) and (D.4) and the properties from the beginning of the proof yield for \(x \in {\mathbb {R}}\) that

$$\begin{aligned} \begin{aligned} (D_{h_j}^2 \Delta _j)(x)&= \sum _{k=0}^{2^{j-1}-1} [D_{2^{j-1} h_j}^2 (T_k \Delta _1)](2^{j-1} x) = \left[ \sum _{k=0}^{2^{j-1} - 1} T_k (D_{1/4}^2 \Delta _1) \right] (2^{j-1} x) \\&= (T_{-\frac{1}{2}} \widetilde{\Delta _1})(2^{j-1} x) + (T_{\frac{2^j - 1}{2}} \widetilde{\Delta _1})(2^{j-1} x) + 2 \sum _{i=0}^{2^j - 2} (-1)^{i-1} (T_{\frac{i}{2}} \widetilde{\Delta _1}) (2^{j-1} x). \end{aligned} \end{aligned}$$
(D.5)

Recall for \(g \in X_p (\Omega )\) that the r-th modulus of continuity of g is given by

$$\begin{aligned}&\omega _r (g)_p (t) := \sup _{h \in {\mathbb {R}}^d, |h| \le t} \Vert D_h^r g\Vert _{X_p (\Omega _{r,h})} \\&\quad \text {where} \quad \Omega _{r,h} := \{ x \in \Omega :x + u h \in \Omega \text { for each } u \in [0,r] \}. \end{aligned}$$

Let \(e_1 = (1,0,\dots ,0) \in {\mathbb {R}}^d\). For \(h = h_j \, e_1\), we have \(\Omega _{2,h} \supset (0,\frac{1}{2}) \times (0,1)^{d-1}\) since \(\Omega = (0,1)^{d}\). Next, because of \({{{\text {supp}}}\, \widetilde{\Delta _1} = [0, \tfrac{1}{2}]}\), the family \((T_{i/2} \widetilde{\Delta _1})_{i \in {\mathbb {Z}}}\) has pairwise disjoint supports (up to null-sets), and

$$\begin{aligned} {{\text {supp}}}\big [ (T_{\frac{i}{2}} \widetilde{\Delta _1})(2^{j-1} \bullet ) \big ] = 2^{-j} (i + [0,1]) \subset \big [ 0, \tfrac{1}{2} \big ] \qquad \text {for} \quad 0 \le i \le 2^{j-1} - 1. \end{aligned}$$

Combining these observations with the fact that \( (T_{\frac{i}{2}} \widetilde{\Delta _1})(2^{j-1} \bullet ) = \widetilde{\Delta _1}(2^{j-1} \bullet - i/2) = \Delta _1(2^j \bullet -i)/2 \), Eq. (D.5) yields for \(p < \infty \) that

$$\begin{aligned} \Vert D_{h_j \, e_1}^2 \Delta _{j,d} \Vert _{L_p (\Omega _{2, h_j e_1})}^p&\ge \sum _{i = 0}^{2^{j-1} - 1} 2^p \Vert (T_{\frac{i}{2}} \widetilde{\Delta _1}) (2^{j-1} \bullet ) \Vert _{L_p (2^{-j} (i + [0,1]))}^p \\&= \sum _{i = 0}^{2^{j-1} - 1} \Vert \Delta _1 (2^{j} \bullet - i) \Vert _{L_p (2^{-j} (i + [0,1]))}^p \\&= \sum _{i = 0}^{2^{j-1} - 1} 2^{-j} \Vert \Delta _1 \Vert _{L_p ([0,1])}^p = \frac{\Vert \Delta _1\Vert _{L_p}^p}{2}, \end{aligned}$$

and hence, \(\Vert D_{h_j e_1}^2 \Delta _{j,d}\Vert _{L_p(\Omega _{2,h_je_1})} \ge C_p\), where \(C_p := 2^{-1/p} \, \Vert \Delta _1\Vert _{L_p}\) for \(p < \infty \). Since \(\Omega _{2, h_j e_1} \subset \Omega = (0,1)^d\) has at most measure 1, we have \(\Vert \cdot \Vert _{L_1(\Omega _{2, h_j e_1})} \le \Vert \cdot \Vert _{L_\infty (\Omega _{2, h_j e_1})}\); hence, the same holds for \(p=\infty \) with \(C_\infty := C_1\). By definition, this implies \(\omega _2 (\Delta _{j,d})_p (t) \ge C_p\) for \(t \ge |h_je_1| = 2^{-(j+1)}\).

Overall, we get by definition of the Besov quasi-norms in case of \(q < \infty \) that

$$\begin{aligned} \Vert \Delta _{j,d}\Vert _{B^{{s'}}_{p,q}(\Omega )}^q\ge & {} \int _0^\infty [t^{-{{s'}}} \omega _2 (\Delta _{j,d})_p (t)]^q \frac{\mathrm{d}t}{t} \ge C_p^q \cdot \int _{2^{-(j+1)}}^\infty t^{-{{s'}} q - 1} \, \mathrm{d}t\\= & {} \frac{C_p^q}{{{s'}} q} \cdot 2^{{{s'}} q (j+1)}, \end{aligned}$$

and hence, \(\Vert \Delta _{j,d}\Vert _{B^{{s'}}_{p,q}(\Omega )} \ge \frac{C_p}{({{s'}} q)^{1/q}} \, 2^{{{s'}} ( j+1)}\) for all \(j \in {\mathbb {N}}\). In case of \(q = \infty \), we see similarly that

$$\begin{aligned} \Vert \Delta _{j,d}\Vert _{B^{{s'}}_{p,q}(\Omega )} \ge \sup _{t \in (0,\infty )} t^{-{{s'}}} \, \omega _2 (\Delta _{j,d})_p (t) \ge C_p \cdot (2^{-(j+1)})^{-{{s'}}} = C_p \cdot 2^{{{s'}} (j+1)} \end{aligned}$$

for all \(j \in {\mathbb {N}}\). In both cases, we used that \({{s'}} < 2\) to ensure that we can use the modulus of continuity of order 2 to compute the Besov quasi-norm. Finally, note because of \({{s'}} \le s\) that \(B^s_{p,q}(\Omega ) \hookrightarrow B^{{s'}}_{p,q}(\Omega )\); see Eq. (5.4). This easily implies the claim. \(\square \)

1.5 Proof of Lemma 5.19

In this section, we prove Lemma 5.19, based on results of Telgarsky [64].

Telgarsky makes extensive use of two special classes of functions: First, a function \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is called \((t,\beta )\)-poly (where \(t \in {\mathbb {N}}\) and \(\beta \in {\mathbb {N}}_0\)) if there is a partition of \({\mathbb {R}}\) into t intervals \(I_1,\dots ,I_t\) such that \(\sigma |_{I_j}\) is a polynomial of degree at most \(\beta \) for each \(j \in \{1,\dots ,t\}\). In the language of Definition 4.6, these are precisely those functions which belong to \({\mathtt {PPoly}}_{t}^{\beta }({\mathbb {R}})\). The second important class consists of the \((t,\alpha ,\beta )\)-semi-algebraic functions \(f : {\mathbb {R}}^k \rightarrow {\mathbb {R}}\) (where \(t \in {\mathbb {N}}\) and \(\alpha ,\beta \in {\mathbb {N}}_0\)). The definition of this class (see [64, Definition 2.1]) is somewhat technical. Luckily, we do not need the definition; all we need to know is the following result:

Lemma D.2

(see [64, Lemma 2.3-(1)]) If \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\beta )\)-poly and \(q : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is a (multivariate) polynomial of degree at most \(\alpha \in {\mathbb {N}}_0\), then \(\sigma \circ q\) is \((t,\alpha ,\alpha \beta )\)-semi-algebraic.\(\blacktriangleleft \)

In most of our proofs, we will mainly be interested in knowing that a function \(\sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\alpha )\)-poly for certain \(t,\alpha \). The following lemma gives a sufficient condition for this to be the case.

Lemma D.3

(see [64, Lemma 3.6]) If \(f : {\mathbb {R}}^k \rightarrow {\mathbb {R}}\) is \((s,\alpha ,\beta )\)-semi-algebraic and if \(g_1,\dots ,g_k : {\mathbb {R}}\rightarrow {\mathbb {R}}\) are \((t,\gamma )\)-poly, then the function \(f \circ (g_1,\dots ,g_k) : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \(\big ( s t (1 + \alpha \gamma ) \cdot k , \beta \gamma \big )\)-poly.\(\blacktriangleleft \)
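
For orientation, these two lemmas are typically combined as follows: since \(\varrho _r\) is \((2,r)\)-poly and \(y \mapsto a y + b\) is a polynomial of degree at most 1, Lemma D.2 shows that \(\varrho _r (a \, \bullet + b)\) is \((2,1,r)\)-semi-algebraic; if furthermore \(g : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\gamma )\)-poly, then Lemma D.3 (with \(k = 1\)) shows that

$$\begin{aligned} x \mapsto \varrho _r \big ( a \, g(x) + b \big ) \quad \text {is} \quad \big ( 2 t (1 + \gamma ), \; r \gamma \big )\text {-poly} . \end{aligned}$$

This is the pattern used repeatedly in the proofs below.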

For proving Lemma 5.19, we begin with the easier case where we count neurons instead of weights.

Proof of the second part of Lemma 5.19

We want to show that for any depth \(L \in {\mathbb {N}}_{\ge 2}\) and degree \(r \in {\mathbb {N}}\) there is a constant \(\Lambda _{L,r} \in {\mathbb {N}}\) such that each function \(f \in {\mathtt {NN}}^{\varrho _r, 1, 1}_{\infty ,L,N}\) is \((\Lambda _{L,r}N^{L-1},r^{L-1})\)-poly. To show this, let \(\Phi \in {\mathcal {NN}}^{\varrho _r,1,1}_{\infty ,L,N}\) with \(f = {\mathtt {R}}(\Phi )\), say \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_K, \alpha _K) \big )\), where necessarily \(K \le L\), and where each \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is affine-linear.

For \(\ell \in \{1,\dots ,K\}\) and \(j \in \{1,\dots ,N_\ell \}\), we let \(f_j^{(\ell )} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) denote the output of neuron j in the \(\ell \)-th layer. Formally, let \(f_j^{(1)} : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \big ( \alpha _1 (T_1 \, x) \big )_j\), and inductively

$$\begin{aligned}&f_j^{(\ell +1)} : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \Big [ \alpha _{\ell + 1} \Big ( T_{\ell +1} \big ( f_k^{(\ell )}(x) \big )_{k \in \{1,\dots ,N_\ell \}} \Big ) \Big ]_j \nonumber \\&\quad \text { for } \quad 1 \le \ell \le K-1 \text { and } 1 \le j \le N_{\ell + 1}. \end{aligned}$$
(D.6)

We prove below by induction on \(\ell \in \{1,\dots ,K\}\) that there is a constant \(C_{\ell ,r} \in {\mathbb {N}}\), depending only on \(\ell \) and r, such that \(f_j^{(\ell )}\) is \(\big (C_{\ell ,r} \prod _{t=0}^{\ell -1} N_t, r^{\gamma (\ell )}\big )\)-poly, where \(\gamma (\ell ) := \min \{\ell , L-1\}\). Once this is shown, we see that \(f = {\mathtt {R}}(\Phi ) = f_1^{(K)}\) is \(\big (C_{K,r} \prod _{t=0}^{K-1} N_t, r^{L-1}\big )\)-poly. Then, because of \(N_0 = 1\), we see that

$$\begin{aligned} C_{K,r} \prod _{t=0}^{K-1} N_t \le \Lambda _{L,r} \prod _{t=1}^{K-1} N_t \le \Lambda _{L,r} \prod _{t=1}^{K-1} N(\Phi ) \le \Lambda _{L,r} \cdot [N(\Phi )]^{K-1} \le \Lambda _{L,r} \cdot N^{L-1} , \end{aligned}$$

where \(\Lambda _{L,r} := \max _{1 \le K \le L} C_{K,r}\). Therefore, f is indeed \((\Lambda _{L,r} \, N^{L-1}, r^{L-1})\)-poly.

Start of induction (\(\ell = 1\)): Note that \(L \ge 2\), so that \(\gamma (\ell ) = \ell = 1\). We have \(T_1 x = a x + b\) for certain \(a,b \in {\mathbb {R}}^{N_1}\) and \(\alpha _1 = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_1)}\) for certain \(\varrho ^{(j)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\). Thus, \(\varrho ^{(j)}\) is (2, r)-poly, and hence (2, 1, r)-semi-algebraic according to Lemma D.2. Since \(f_j^{(1)}(x) = \varrho ^{(j)} (b_j + a_j x)\), Lemma D.3 therefore shows that \(f_j^{(1)}\) is \((2(1+1), r)\)-poly for every \(j \in \{1,\dots ,N_1\}\). Because of \(N_0 = 1\), the claim holds for \(C_{1,r} := 4\).

Induction step (\(\ell \rightarrow \ell + 1\)): Suppose that \(\ell \in \{1,\dots ,K-1\}\) is such that the claim holds. Note that \(\ell \le K-1 \le L-1\), so that \(\gamma (\ell ) = \ell \).

We have \(T_{\ell + 1} \, y = A \, y + b\) for certain \(A \in {\mathbb {R}}^{N_{\ell + 1} \times N_\ell }\) and \(b \in {\mathbb {R}}^{N_{\ell + 1}}\), and \(\alpha _{\ell + 1} = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_{\ell + 1})}\) for certain \(\varrho ^{(j)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\), where \(\varrho ^{(j)} = \mathrm {id}_{\mathbb {R}}\) for all \(j \in \{1,\dots ,N_{\ell + 1}\}\) in case of \(\ell = K - 1\). Hence, \(\varrho ^{(j)}\) is (2, r)-poly, and even (2, 1)-poly in case of \(\ell = K-1\). Moreover, each of the polynomials \({ p_{j,\ell } : {\mathbb {R}}^{N_\ell } \rightarrow {\mathbb {R}}, y \mapsto (A \, y + b)_j = b_j + \sum _{t=1}^{N_\ell } A_{j,t} \, y_t }\) is of degree at most 1; hence, by Lemma D.2, \(\varrho ^{(j)} \circ p_{j,\ell }\) is (2, 1, r)-semi-algebraic, and even (2, 1, 1)-semi-algebraic in case of \(\ell = K-1\).

Each function \(f_t^{(\ell )}\) is \((C_{\ell ,r} \prod _{t=0}^{\ell -1} N_t, r^{\ell })\)-poly by the induction hypothesis. By Lemma D.3, since

$$\begin{aligned} f_j^{(\ell + 1)} (x)= & {} \varrho ^{(j)} \Big ( \big [ A \, \big ( f_t^{(\ell )} (x) \big )_{t \in \{1,\dots ,N_\ell \}} + b \big ]_j \Big ) \\= & {} (\varrho ^{(j)} \circ p_{j,\ell }) \big ( f_1^{(\ell )} (x), \dots , f_{N_\ell }^{(\ell )} (x) \big ), \end{aligned}$$

it follows that \(f_j^{(\ell +1)}\) is \((P, r^{\ell +1})\)-poly [respectively, \((P,r^{\ell })\)-poly if \(\ell = K-1\)], where

$$\begin{aligned} P \le 2 C_{\ell ,r} (1 + r^{\ell }) \cdot N_\ell \cdot \prod _{t=0}^{\ell -1} N_t =: C_{\ell + 1, r} \cdot \prod _{t=0}^{(\ell +1) - 1} N_t . \end{aligned}$$

Finally, note in case of \(\ell < K{-}1\) that \(\ell {+} 1 \le K {-} 1 \le L{-}1\), and hence, \(\gamma (\ell +1) {=} \ell {+}1\), while in case of \(\ell {=} K-1\) we have \(\ell \le \min \{\ell {+}1, L{-}1\} {=} \gamma (\ell {+}1)\). Therefore, each \(f_j^{(\ell +1)}\) is \((C_{\ell +1,r} \cdot \prod _{t=0}^{(\ell +1) - 1} N_t, r^{\gamma (\ell +1)})\)-poly. This completes the induction, and thus the proof. \(\square \)
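
As a quick sanity check of the bound just proved (not needed in the sequel): for depth \(L = 2\) and \(r = 1\) (so that \(\varrho _1\) is the ReLU, as elsewhere in the paper), the statement says that every \(f \in {\mathtt {NN}}^{\varrho _1, 1, 1}_{\infty ,2,N}\) is \((\Lambda _{2,1} \, N, 1)\)-poly, that is, piecewise affine with \(O(N)\) pieces. This is consistent with the elementary observation that a shallow network of the form

$$\begin{aligned} f(x) = c_0 + \sum _{j=1}^{N} c_j \, \varrho _1 (a_j x + b_j) \end{aligned}$$

has at most one breakpoint per neuron (at \(x = -b_j / a_j\) whenever \(a_j \ne 0\)), and hence at most \(N+1\) affine pieces; neurons using the identity activation only contribute an affine term and create no breakpoints.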

The proof of the first part of Lemma 5.19 uses the same basic arguments as in the preceding proof, but in a more careful way. In particular, we will also need the following elementary lemma.

Lemma D.4

Let \(k \in {\mathbb {N}}\), and for each \(i \in \{1,\dots ,k\}\) let \(f_i : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be \((t_i,\alpha )\)-poly and continuous. Then the function \(\sum _{i=1}^k f_i\) is \((t,\alpha )\)-poly, where \(t = 1 - k + \sum _{i=1}^k t_i\).\(\blacktriangleleft \)

Proof

For each \(i \in \{1,\dots ,k\}\), there are “breakpoints” \( b_0^{(i)} := - \infty< b_1^{(i)}< \cdots< b_{t_i - 1}^{(i)} < \infty =: b_{t_i}^{(i)} \) such that \(f_i |_{{\mathbb {R}}\cap [b_j^{(i)}, b_{j+1}^{(i)}]}\) is a polynomial of degree at most \(\alpha \) for each \(0 \le j \le t_{i} - 1\). Here, we used the continuity of \(f_i\) to ensure that we can use closed intervals.

Now, let \(M := \bigcup _{i=1}^k \{b_1^{(i)}, \ldots , b_{t_i - 1}^{(i)}\}\). We have \(|M| \le \sum _{i=1}^k (t_i - 1) = t - 1\), with t as in the statement of the lemma. Thus, \(M = \{b_1,\dots ,b_s\}\) for some \(0 \le s \le t - 1\), where \( b_0 := - \infty< b_1< \cdots< b_s < \infty =: b_{s+1}. \) It is easy to see that \(F := \sum _{i=1}^k f_i\) is such that \(F|_{{\mathbb {R}}\cap [b_j, b_{j+1}]}\) is a polynomial of degree at most \(\alpha \) for each \(0 \le j \le s\). Thus, F is \((s+1,\alpha )\)-poly and therefore also \((t,\alpha )\)-poly. \(\square \)
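
For illustration (this example plays no role in the remaining proofs), take \(k = 2\) with \(f_1(x) = |x|\) and \(f_2(x) = |x - 1|\): both are continuous and (2, 1)-poly, with breakpoint sets \(\{0\}\) and \(\{1\}\), respectively. Their sum is affine on each of \((-\infty ,0]\), [0, 1], and \([1,\infty )\), hence (3, 1)-poly, which matches

$$\begin{aligned} t = 1 - k + \sum _{i=1}^{k} t_i = 1 - 2 + 2 + 2 = 3 . \end{aligned}$$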

Proof of the first part of Lemma 5.19

Let us first consider an arbitrary network \(\Phi \in {\mathcal {NN}}^{\varrho _r, 1, 1}_{W,L,\infty }\) satisfying \(L(\Phi ) = L\). Let \(L_0 := \lfloor L/2 \rfloor \in {\mathbb {N}}_0\). We claim that

$$\begin{aligned} {\mathtt {R}}(\Phi ) \text { is } \big (\max \{1, \Lambda _{L,r} \, W^{L_0} \}, r^{L-1}\big )\text {-poly} \text { where } \Lambda _{L,r} \in {\mathbb {N}}\text { only depends on } L,r.\nonumber \\ \end{aligned}$$
(D.7)

In case of \(L = 1\), this is trivial, since then \({\mathtt {R}}(\Phi ) : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is affine-linear. Thus, we will assume \(L \ge 2\) in what follows. Note that this entails \(L_0 \ge 1\).

Let \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_L, \alpha _L) \big )\), where \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is affine-linear. We first consider the special case that \(\Vert T_\ell \Vert _{\ell ^0} = 0\) for some \(\ell \in \{1,\dots ,L\}\). In this case, Lemma 2.9 shows that \({\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}\). This trivially implies that \({\mathtt {R}}(\Phi )\) is \((\max \{1, \Lambda _{L,r} \, W^{L_0} \}, r^{L-1})\)-poly. Thus, we can assume in the following that \(\Vert T_\ell \Vert _{\ell ^0} \ne 0\) for all \(\ell \in \{1,\dots ,L\}\). As in the proof of the second part of Lemma 5.19, we define \(f_j^{(\ell )} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) to be the function computed by neuron \(j \in \{1,\dots ,N_\ell \}\) in layer \(\ell \in \{1,\dots ,L\}\), cf. Eq. (D.6).

Step 1. We let \(L_1 := \lfloor \tfrac{L - 1}{2} \rfloor \in {\mathbb {N}}_0\), and we show by induction on \(t \in \{0,1,\dots , L_1\}\) that

$$\begin{aligned} f_j^{(2 t + 1)} \text { is } \Big ( C_{t,r} \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{\gamma (t)} \Big )\text {-poly} \quad \forall \, t \in \{0,1,\dots ,L_1\} \text { and } j \in \{1,\dots ,N_{2t + 1}\} ,\nonumber \\ \end{aligned}$$
(D.8)

where \(\gamma (t) := \min \{L-1, 2 t + 1\}\) and where the constant \(C_{t,r} \in {\mathbb {N}}\) only depends on tr. Here, we use the convention that the empty product satisfies \(\prod _{\ell =1}^0 \Vert T_{2\ell }\Vert _{\ell ^0} = 1\).

Induction start (\(t=0\)): We have \(T_1 x = a x + b\) for certain \(a,b \in {\mathbb {R}}^{N_1}\) and \(\alpha _1 = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_1)}\) for certain \(\varrho ^{(j)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\). In any case, \(\varrho ^{(j)}\) is (2, r)-poly, and hence, (2, 1, r)-semi-algebraic by Lemma D.2. Now, note \( f_j^{(2t+1)}(x) = f_j^{(1)}(x) = \varrho ^{(j)} \big ( (T_1 x)_j \big ) = \varrho ^{(j)} (a_j x + b_j) \), so that Lemma D.3 shows that \(f_j^{(2t+1)}\) is \((2 (1 + 1), r)\)-poly. Thus, Eq. (D.8) holds for \(t = 0\) if we choose \(C_{0,r} := 4\). Here, we used that \(L \ge 2\) and \(t=0\), so that \(L-1 \ge 2t + 1\) and hence \(\gamma (t) = 2 t + 1 = 1\).

Induction step \((t \rightarrow t+1)\): Let \(t \in {\mathbb {N}}_0\) such that \(t+1 \le \tfrac{L-1}{2}\) and such that Eq. (D.8) holds for t. We have \(T_{2t + 2} \bullet = A \bullet + b\) for certain \(A \in {\mathbb {R}}^{N_{2t+2} \times N_{2t+1}}\) and \(b \in {\mathbb {R}}^{N_{2t+2}}\), and furthermore \(\alpha _{2t+2} = \varrho ^{(1)} \otimes \cdots \otimes \varrho ^{(N_{2t+2})}\) for certain \(\varrho ^{(j)} \in \{ \mathrm {id}_{\mathbb {R}}, \varrho _r \}\).

Recall from Appendix A that \(A_{j,-} \in {\mathbb {R}}^{1 \times N_{2t+1}}\) denotes the j-th row of A. For \(j \in \{1,\dots ,N_{2t+2}\}\), we claim that

$$\begin{aligned} {\left\{ \begin{array}{ll} f_j^{(2t+2)} \equiv \varrho ^{(j)}(b_j), &{} \text {if } A_{j,-} = 0,\\ f_j^{(2t+2)} \text { is } (C_{t,r}' \cdot M_j \cdot \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})\text {-poly}, &{} \text {if } A_{j,-} \ne 0, \end{array}\right. } \end{aligned}$$
(D.9)

where \(M_j := \Vert A_{j,-}\Vert _{\ell ^0}\), and where the constant \(C_{t,r}' \in {\mathbb {N}}\) only depends on tr.

The first case where \(A_{j,-}=0\) is trivial. For proving the second case where \(A_{j,-} \ne 0\), let us define \(\Omega _j := \{ i \in \{1,\dots ,N_{2t+1}\} :A_{j,i} \ne 0 \}\), say \(\Omega _j = \{ i_1, \dots , i_{M_j} \}\) with (necessarily) pairwise distinct \(i_1,\dots ,i_{M_j}\). By introducing the polynomial \({p_{j,t} : {\mathbb {R}}^{M_j} \rightarrow {\mathbb {R}}, y \mapsto b_j + \sum _{m=1}^{M_j} A_{j,i_m} y_m}\), we can then write

$$\begin{aligned} f_j^{(2t+2)}(x)= & {} \varrho ^{(j)} \Big ( b_j + A_{j,-} \big ( f_k^{(2t+1)} (x) \big )_{k \in \{1,\dots ,N_{2t+1}\}} \Big ) \\= & {} (\varrho ^{(j)} \circ p_{j,t}) \big ( f_{i_1}^{(2t+1)} (x), \dots , f_{i_{M_j}}^{(2t+1)} (x) \big ) . \end{aligned}$$

Since \(\varrho ^{(j)}\) is (2, r)-poly and \(p_{j,t}\) is a polynomial of degree at most 1, Lemma D.2 shows that \(\varrho ^{(j)} \circ p_{j,t}\) is (2, 1, r)-semi-algebraic. Furthermore, by the induction hypothesis we know that each function \(f_{i_m}^{(2t+1)}\) is \((C_{t,r} \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+1})\)-poly, where we used that \(\gamma (t)=2t+1\) since \(t+1 \le (L-1)/2\). Therefore—in view of the preceding displayed equation—Lemma D.3 shows that the function \(f_j^{(2t+2)}\) is indeed \((C_{t,r}' \cdot M_j \cdot \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})\)-poly, where \(C_{t,r}' := 2 C_{t,r} \cdot (1 + r^{2t+1})\).

We now estimate the number of polynomial pieces of the function \(f_i^{(2t+3)}\) for \(i \in \{1,\dots ,N_{2t+3}\}\). To this end, let \(B \in {\mathbb {R}}^{N_{2t+3} \times N_{2t+2}}\) and \(c \in {\mathbb {R}}^{N_{2t+3}}\) such that \(T_{2t+3} = B \bullet + c\), and choose \(\sigma ^{(i)} \in \{\mathrm {id}_{\mathbb {R}}, \varrho _r\}\) such that \(\alpha _{2t+3} = \sigma ^{(1)} \otimes \cdots \otimes \sigma ^{(N_{2t+3})}\). For \(i \in \{1,\dots ,N_{2t+3}\}\), let us define

$$\begin{aligned} G_{i,t} : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \sum _{j \in \{1,\dots ,N_{2t+2}\} \text { such that } A_{j,-} \ne 0} B_{i,j} \, f_j^{(2t+2)} (x) . \end{aligned}$$

In view of Eq. (D.9), Lemma D.4 shows that \(G_{i,t}\) is \((P,r^{2t+2})\)-poly, where

$$\begin{aligned} P&\le 1 - |\{ j \in \{1,\dots ,N_{2t+2}\} :A_{j,-} \ne 0 \}|\\&\quad + C_{t,r}' \cdot \Big ( \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0} \Big ) \sum _{j \in \{1,\dots ,N_{2t+2}\} \text { such that } A_{j,-} \ne 0} M_j \\&\le C_{t,r}' \cdot \Big ( \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0} \Big ) \cdot \Vert A\Vert _{\ell ^0} = C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0} \end{aligned}$$

Here, we used that \(\Vert T_{2t+2} \Vert _{\ell ^0} \ne 0\) and hence \(A \ne 0\), so that \(|\{ j \in \{1,\dots ,N_{2t+2}\} : A_{j,-} \ne 0\}| \ge 1\).

Next, note because of Eq. (D.9) and by definition of \(G_{i,t}\) that there is some \(\theta _{i,t} \in {\mathbb {R}}\) satisfying

$$\begin{aligned} f_i^{(2t+3)}(x) = \sigma ^{(i)} \Big ( c_i + \sum _{j=1}^{N_{2t+2}} B_{i,j} \, f_j^{(2t+2)} (x) \Big ) = \sigma ^{(i)} \big ( \theta _{i,t} + G_{i,t}(x) \big ) \quad \forall \, x \in {\mathbb {R}}. \end{aligned}$$

Now there are two cases: If \(2t + 3 > L-1\), then \(2t+3 = L\), since \(t+1 \le \tfrac{L-1}{2}\). Therefore, \(\sigma ^{(i)} = \mathrm {id}_{\mathbb {R}}\), so that we see that \(f_i^{(2t+3)} = \theta _{i,t} + G_{i,t}\) is \((C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})\)-poly, where \(2t+2 = L-1 = \gamma (t+1)\).

If \(2t + 3 \le L-1\), then \(\gamma (t+1) = 2t + 3\). Furthermore, each \(\sigma ^{(i)}\) is (2, r)-poly and hence (2, 1, r)-semi-algebraic by Lemma D.2. In view of the preceding displayed equation, and since \(G_{i,t}\) is \({(C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+2})}\)-poly, Lemma D.3 shows that \(f_i^{(2t+3)}\) is \(\big ( 2 (1 + r^{2t+2}) C_{t,r}' \cdot \prod _{\ell =1}^{t+1} \Vert T_{2\ell }\Vert _{\ell ^0}, r^{2t+3} \big )\)-poly.

In each case, with \(C_{t+1,r} := 2 (1 + r^{2t+2}) C_{t,r}'\), we see that Eq. (D.8) holds for \(t+1\) instead of t.

Step 2. We now complete the proof of Eq. (D.7), by distinguishing whether L is odd or even.

If L is odd: In this case \(L_1 = \lfloor \tfrac{L-1}{2} \rfloor = \tfrac{L-1}{2}\), so that we can use Eq. (D.8) for the choice \(t = \tfrac{L-1}{2}\) to see that \({\mathtt {R}}(\Phi ) = f_1^{(L)} = f_1^{(2t+1)}\) is \((P, r^{L-1})\)-poly, where

$$\begin{aligned} P \le C_{t,r} \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0} \le C_{t,r} \prod _{\ell =1}^{(L-1)/2} \!\!\! W(\Phi ) \le C_{t,r} \cdot [W(\Phi )]^{(L-1)/2} \le C_{t,r} \cdot W^{\lfloor L/2 \rfloor }. \end{aligned}$$

If L is even: In this case, set \(t := \tfrac{L}{2} - 1 \in \{ 0,1,\dots ,L_1 \}\), and note \(2t+1 = L-1 =\gamma (t)\). Hence, with \(A \in {\mathbb {R}}^{1 \times N_{L-1}}\) and \(b \in {\mathbb {R}}\) such that \(T_L = A \bullet + b\), we have

$$\begin{aligned} {\mathtt {R}}(\Phi )(x) = T_L \big ( f_k^{(2t+1)} (x) \big )_{k \in \{1,\dots ,N_{L-1}\}} = b + \sum _{k \in \{1,\dots ,N_{L-1}\} \text { such that } A_{1,k} \ne 0} A_{1,k} \, f_k^{(2t+1)}(x) . \end{aligned}$$

Therefore, thanks to Eq. (D.8), Lemma D.4 shows that \({\mathtt {R}}(\Phi )\) is \((P,r^{2t+1})\)-poly, where

$$\begin{aligned} P&\le 1 - |\{ k \in \{1,\dots ,N_{L-1}\} :A_{1,k} \ne 0 \}| + C_{t,r} \sum _{\begin{array}{c} k \in \{1,\dots ,N_{L-1}\} \\ \text {such that } A_{1,k} \ne 0 \end{array}} \prod _{\ell =1}^t \Vert T_{2\ell }\Vert _{\ell ^0} \\&\le C_{t,r} \cdot \Vert A\Vert _{\ell ^0} \cdot \prod _{\ell =1}^{\frac{L}{2} - 1} \Vert T_{2\ell }\Vert _{\ell ^0} = C_{t,r} \prod _{\ell =1}^{L/2} \Vert T_{2\ell }\Vert _{\ell ^0} \\&\le C_{t,r} \cdot [W(\Phi )]^{L/2} = C_{t,r} \cdot [W(\Phi )]^{\lfloor L/2 \rfloor } \le C_{t,r} \cdot W^{\lfloor L/2 \rfloor } . \end{aligned}$$

In the second inequality we used \(|\{ k \in \{1,\dots ,N_{L-1}\} :A_{1,k} \ne 0\}| = \Vert A\Vert _{\ell ^{0}} = \Vert T_L\Vert _{\ell ^0} \ge 1\). We have thus established Eq. (D.7) in all cases.

Step 3. It remains to prove the actual claim. Let \(f \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,L,\infty }\) be arbitrary, whence \(f = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,K,\infty }\) with \(L(\Phi ) = K\) for some \(K \in {\mathbb {N}}_{\le L}\). In view of Eq. (D.7), this implies that \(f = {\mathtt {R}}(\Phi )\) is \((\max \{1, \Lambda _{K,r} \, W^{\lfloor K/2 \rfloor }\}, r^{K-1})\)-poly. If we set \(\Theta _{L,r} := \max _{1 \le K \le L} \Lambda _{K,r}\), then this easily implies that f is \((\max \{1, \Theta _{L,r} \, W^{\lfloor L/2 \rfloor }\}, r^{L-1})\)-poly, as desired. \(\square \)
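
To put the two parts of Lemma 5.19 side by side (a purely illustrative remark): for \(r = 1\) and \(L = 3\), the first part shows that functions in \({\mathtt {NN}}^{\varrho _1,1,1}_{W,3,\infty }\) are piecewise affine with \(O(W^{\lfloor 3/2 \rfloor }) = O(W)\) pieces, whereas the second part only gives \(O(N^{3-1}) = O(N^2)\) pieces for functions in \({\mathtt {NN}}^{\varrho _1,1,1}_{\infty ,3,N}\). It is precisely this discrepancy between the exponents \(\lfloor L/2 \rfloor \) and \(L-1\),

$$\begin{aligned} \lfloor L/2 \rfloor = L - 1 \quad \text {if and only if} \quad L \le 2 , \qquad \text {while} \qquad \lfloor L/2 \rfloor < L - 1 \quad \text {for } L \ge 3 , \end{aligned}$$

that drives the distinction of the associated approximation spaces established in Appendix E below.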

Appendix E. The spaces \(W_q^\alpha \) and \(N_q^\alpha \) are distinct

In this section, we show that for a fixed depth \(L \ge 3\) and \(\Omega = (0,1)^d\) the approximation spaces defined in terms of the number of weights and in terms of the number of neurons are distinct; that is, we show

$$\begin{aligned} W_q^\alpha \ne N_q^\alpha . \end{aligned}$$
(E.1)

The proof is based on several results by Telgarsky [64], which we first collect. The first essential concept is the notion of the crossing number of a function.

Definition E.1

For any piecewise polynomial function \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) with finitely many pieces, define \({\widetilde{f}} : {\mathbb {R}}\rightarrow \{0,1\}, x \mapsto {\mathbb {1}}_{f(x) \ge 1/2}\). Thanks to our assumption on f, the sets \({\widetilde{f}}^{-1} (\{0\}) \subset {\mathbb {R}}\) and \({\widetilde{f}}^{-1} (\{1\}) \subset {\mathbb {R}}\) are finite unions of (possibly degenerate) intervals. For \(i \in \{0,1\}\), denote by \(I_f^{(i)} \subset 2^{{\mathbb {R}}}\) the set of connected components of \({\widetilde{f}}^{-1} (\{i\})\). Finally, set \(I_f := I_f^{(0)} \cup I_f^{(1)}\) and define the crossing number \(\mathrm {Cr}(f)\) of f as \(\mathrm {Cr}(f) := |I_f| \in {\mathbb {N}}\). \(\blacktriangleleft \)
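
As a simple illustration of this definition (not used later): for the identity \(f = \mathrm {id}_{\mathbb {R}}\), we have \({\widetilde{f}} = {\mathbb {1}}_{[1/2, \infty )}\), so that

$$\begin{aligned} I_f^{(0)} = \big \{ (-\infty , \tfrac{1}{2}) \big \} , \qquad I_f^{(1)} = \big \{ [\tfrac{1}{2}, \infty ) \big \} , \qquad \text {and hence} \qquad \mathrm {Cr}(f) = 2 . \end{aligned}$$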

The following result gives a bound on the crossing number of f, based on bounds on the complexity of f. Here, we again use the notion of \((t,\beta )\)–poly functions as introduced at the beginning of Appendix D.5.

Lemma E.2

[64, Lemma 3.3] If \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is \((t,\alpha )\)–poly, then \(\mathrm {Cr}(f) \le t (1 + \alpha )\).\(\blacktriangleleft \)
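
The bound in Lemma E.2 can be attained. For instance, \(f(x) = x^2\) is (1, 2)-poly, and \({\widetilde{f}}^{-1}(\{1\}) = (-\infty , -1/\sqrt{2}\,] \cup [1/\sqrt{2}, \infty )\) and \({\widetilde{f}}^{-1}(\{0\}) = (-1/\sqrt{2}, 1/\sqrt{2})\) have two and one connected components, respectively, so that

$$\begin{aligned} \mathrm {Cr}(f) = 3 = 1 \cdot (1 + 2) = t \, (1 + \alpha ) . \end{aligned}$$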

Finally, we will need the following result which tells us that if \(\mathrm {Cr}(f) \gg \mathrm {Cr}(g)\), then the functions \({\widetilde{f}},{\widetilde{g}}\) introduced in Definition E.1 differ on a large number of intervals \(I \in I_f\).

Lemma E.3

[64, Lemma 3.1] Let \(f : {\mathbb {R}}\rightarrow {\mathbb {R}}\) and \(g : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be piecewise polynomial with finitely many pieces. Then

$$\begin{aligned} \frac{1}{\mathrm {Cr}(f)} \cdot \Big | \big \{ I \in I_f :\forall x \in I : {\widetilde{f}}(x) \ne {\widetilde{g}} (x) \big \} \Big | \ge \frac{1}{2} \Big ( 1 - 2 \frac{\mathrm {Cr}(g)}{\mathrm {Cr}(f)} \Big ). \blacktriangleleft \end{aligned}$$

The first step to proving Eq. (E.1) will be the following estimate:

Lemma E.4

Let \(p \in (0,\infty ]\). There is a constant \(C_p > 0\) such that the error of best approximation [cf. Eq. (3.1)] of the “sawtooth function” \(\Delta _j\) [cf. Eq. (5.11)] by piecewise polynomials satisfies

$$\begin{aligned} E(\Delta _j , {\mathtt {PPoly}}_N^\alpha )_{L_p ( (0,1) )} \ge C_p \qquad \forall \, j,\alpha \in {\mathbb {N}},\quad \forall \, 1 \le N \le \frac{2^{j} + 1}{4(1+\alpha )}. \blacktriangleleft \end{aligned}$$

For proving this lower bound, we first need to determine the crossing number of \(\Delta _j\).

Lemma E.5

Let \(j \in {\mathbb {N}}\) and \(\Delta _j : {\mathbb {R}}\rightarrow {\mathbb {R}}\) as in Eq. (5.11). We have \(\mathrm {Cr}(\Delta _j) = 1 + 2^j\) and

$$\begin{aligned} \int _{I \cap [0,1]} \big | \Delta _j (x) - \frac{1}{2} \big | \, d x \ge 2^{-j-3} \qquad \forall \, I \in I_{\Delta _j}. \blacktriangleleft \end{aligned}$$

Proof

The formal proof is omitted, as it involves tedious but elementary computations; graphically, the claimed properties can be read off directly from Fig. 4. \(\square \)

Proof of Lemma E.4

Let \(j,\alpha \in {\mathbb {N}}\) and let \(N \in {\mathbb {N}}\) with \(N \le \frac{2^{j} + 1}{4(1+\alpha )}\) and \(f \in {\mathtt {PPoly}}_N^\alpha \) be arbitrary. Lemma E.2 shows \(\mathrm {Cr}(f) \le N(1 + \alpha ) \le \frac{2^{j} + 1}{4}\), so that Lemma E.5 implies \({\theta := 1 - 2 \frac{\mathrm {Cr}(f)}{\mathrm {Cr}(\Delta _j)} = 1 - 2 \frac{\mathrm {Cr}(f)}{1 + 2^j} \ge \tfrac{1}{2}}\). Now, recall the notation of Definition E.1, and set

$$\begin{aligned} G := \big \{ I \in I_{\Delta _j} \,\big |\, \forall \, x \in I : \widetilde{\Delta _j}(x) \ne {\widetilde{f}}(x) \big \}. \end{aligned}$$

By Lemma E.3, \(\frac{1}{\mathrm {Cr}(\Delta _j)} |G| \ge \frac{\theta }{2} \ge \frac{1}{4}\), which means \(|G| \ge \frac{1 + 2^j}{4} \ge 2^{j-2}\), since we have \(\mathrm {Cr}(\Delta _j) = 1 + 2^j\).

For arbitrary \(I \in G\), we have \(\widetilde{\Delta _j}(x) \ne {\widetilde{f}}(x)\) for all \(x \in I\), so that either \(f(x) < \tfrac{1}{2} \le \Delta _j (x)\) or \(\Delta _j(x) < \tfrac{1}{2} \le f(x)\). In both cases, we get \(|\Delta _j (x) - f(x)| \ge |\Delta _j(x) - \tfrac{1}{2}|\). Furthermore, recall that \(0 \le \Delta _j \le 1\), so that \(|\Delta _j(x) - \tfrac{1}{2}| \le \tfrac{1}{2} \le 1\). Because of \(\Vert \Delta _j - f\Vert _{L_p ( (0,1) )} \ge \Vert \Delta _j - f\Vert _{L_1 ( (0,1) )}\) for \(p \ge 1\), it is sufficient to prove the result for \(0 < p \le 1\). For this range of p, we see that

$$\begin{aligned} \big | \Delta _j(x) - \tfrac{1}{2} \big | = \big | \Delta _j (x) - \tfrac{1}{2} \big |^{1 - p} \cdot \big | \Delta _j (x) - \tfrac{1}{2} \big |^{p} \le \big | \Delta _j (x) - \tfrac{1}{2} \big |^{p} . \end{aligned}$$

Overall, we get \(|\Delta _j(x) - f(x)|^p \ge |\Delta _j(x) - \tfrac{1}{2}|^p \ge |\Delta _j(x) - \tfrac{1}{2}|\) for all \(x \in I\) and \(I \in G\). Thus,

$$\begin{aligned} \int _{[0,1]} |\Delta _j(x) - f(x)|^p \, d x&\ge \sum _{I \in G} \int _{I \cap [0,1]} | \Delta _j(x) - f(x)|^p \, dx \\&\ge \sum _{I \in G} \int _{I \cap [0,1]} \Big | \Delta _j(x) - \frac{1}{2} \Big | \, d x \\ ({\scriptstyle {\text {Lemma}~E.5}})&\ge \sum _{I \in G} 2^{-j-3} = |G| \cdot 2^{-j-3} \ge 2^{j-2} \cdot 2^{-j-3} = 2^{-5}. \end{aligned}$$

This implies \(\Vert \Delta _j - f\Vert _{L_p ( (0,1) )} \ge 2^{-5/p} =: C_p\). \(\square \)
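
Informally, Lemma E.4 says that any approximation of \(\Delta _j\) of nontrivial accuracy by piecewise polynomials of bounded degree must use on the order of \(2^j\) pieces. Combined with the piece-counting bounds from Lemma 5.19 (restated as Eq. (E.2) below), this forces a minimal network size: writing \(C_1 = C_1(r,L) \in {\mathbb {N}}\) for the constant from Lemma 5.19, we have for \(f \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,L,\infty }\) with \(W \ge 1\) that

$$\begin{aligned} \Vert \Delta _j - f \Vert _{L_p ( (0,1) )} < C_p \quad \Longrightarrow \quad C_1 \cdot W^{\lfloor L/2 \rfloor } > \frac{2^{j} + 1}{4 \, (1 + r^{L-1})} , \end{aligned}$$

and analogously \(C_1 \cdot N^{L-1} > \tfrac{2^{j} + 1}{4 (1 + r^{L-1})}\) for \(f \in {\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,L,N}\) with \(N \ge 1\). This is precisely the mechanism exploited in Proposition E.6 below.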

As a consequence of the lower bound in Lemma E.4, we can now prove lower bounds for the neural network approximation space norms of the multivariate sawtooth function \(\Delta _{j,d}\) [cf. Definition 5.9].

Proposition E.6

Consider \(\Omega = [0,1]^{d}\), \(r \in {\mathbb {N}}\), \(L \in {\mathbb {N}}_{\ge 2}\), \(\alpha \in (0, \infty )\), \(p,q \in (0,\infty ]\). There is a constant \({C = C(d,r,L,\alpha ,p,q) > 0}\) such that

Proof

According to Lemma 5.19, there is a constant \(C_1 = C_1 (r,L) \in {\mathbb {N}}\) such that

$$\begin{aligned} {\mathtt {NN}}^{\varrho _r, 1, 1}_{W,L,\infty } \subset {\mathtt {PPoly}}_{C_1 \cdot W^{\lfloor L/2 \rfloor }}^{\beta } \quad \text {and} \quad {\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,L,N} \subset {\mathtt {PPoly}}_{C_1 \cdot N^{L-1}}^{\beta } \quad \text {where} \quad \beta := r^{L-1}.\nonumber \\ \end{aligned}$$
(E.2)

We first prove the estimate regarding \(W_q^\alpha \). To this end, note that there is a constant \(C_2 = C_2(L,\beta ,C_1) = C_2(L,r) > 0\) such that \( \big ( \tfrac{2^{j} + 1}{4 C_1 (1 + \beta )} \big )^{1/\lfloor L/2 \rfloor } \ge C_2 \cdot 2^{(j+1) / \lfloor L/2 \rfloor } \) for all \(j \in {\mathbb {N}}\). Now, let \(W \in {\mathbb {N}}_0\) with \(W \le C_2 \cdot 2^{(j+1) / \lfloor L/2 \rfloor }\) and \(F \in {\mathtt {NN}}^{\varrho _{r},d,1}_{W,L,\infty }\) be arbitrary. Define \(F_{x'} : {\mathbb {R}}\rightarrow {\mathbb {R}}, t \mapsto F ((t, x'))\) for \(x' \in [0,1]^{d-1}\).

According to Lemma 2.18-(1) and Eq. (E.2), we have \( F_{x'} \in {\mathtt {NN}}_{W,L,\infty }^{\varrho _r,1,1} \subset {\mathtt {PPoly}}_{C_1 \cdot W^{\lfloor L/2 \rfloor }}^{\beta }. \) Since \( C_1 \cdot W^{\lfloor L/2 \rfloor } \le C_1 \cdot \tfrac{2^{j} + 1}{4 C_1 (1+\beta )} = \tfrac{2^{j} + 1}{4(1+\beta )}, \) Lemma E.4 yields a constant \(C_3 = C_3 (p) > 0\) such that \(C_3 \le \Vert \Delta _j - F_{x'}\Vert _{L_p ( (0,1) )}\). For \(p < \infty \), Fubini’s theorem shows that

$$\begin{aligned} \begin{aligned} \Vert \Delta _{j,d} - F \Vert _{L_p(\Omega )}^p&\ge \int _{[0,1]^{d-1}} \int _{[0,1]} \Big | \Delta _{j}(x_{1}) - F((x_1, x')) \Big |^p \, d x_1 \, d x' \\&= \int _{[0,1]^{d-1}} \Vert \Delta _j - F_{x'} \Vert _{L_p ( (0,1) )}^p \, d x' \ge C_3^p \cdot \int _{[0,1]^{d-1}} \, d x' = C_3^p. \end{aligned} \end{aligned}$$

Therefore,

$$\begin{aligned} E(\Delta _{j,d}, {\mathtt {NN}}^{\varrho _r, d, 1}_{W,L,\infty })_{L_p (\Omega )} \ge C_3 > 0 \quad \forall \, W \in {\mathbb {N}}_0 \text { satisfying } W \le C_2 \cdot 2^{(j+1) / \lfloor L / 2 \rfloor } .\nonumber \\ \end{aligned}$$
(E.3)

Since \(\Vert \bullet \Vert _{L_\infty (\Omega )} \ge \Vert \bullet \Vert _{L_1(\Omega )}\), this also holds for \(p = \infty \).

In light of the embedding (3.2), it is sufficient to establish the lower bound for \(q = \infty \). In this case, we have

as desired. This completes the proof of the lower bound regarding \(W_q^\alpha \).

The lower bound for \(N_q^\alpha \) can be derived similarly. First, in the same way that we proved Eq. (E.3), one can show that

$$\begin{aligned} E(\Delta _{j,d}, {\mathtt {NN}}^{\varrho _r, d, 1}_{\infty ,L,N})_{L_p (\Omega )} \ge C_3 > 0 \quad \forall \, N \in {\mathbb {N}}_0 \text { satisfying } N \le C'_2 \cdot 2^{(j+1) / (L-1)} , \end{aligned}$$

for a suitable constant \(C'_2 = C'_2 (L,r) > 0\). The remainder of the argument is then almost identical to the one for \(W_q^\alpha \), and is thus omitted. \(\square \)

As our final preparation for showing that the spaces \(W_q^\alpha \) and \(N_q^\alpha \) are distinct for \(L \ge 3\) (Lemma 3.10), we will show that the lower bound derived in Proposition E.6 is sharp and extends to bounded admissible domains \(\Omega \subset {\mathbb {R}}^d\) with non-empty interior.

Theorem E.7

Let \(p,q \in (0,\infty ]\), \(\alpha > 0\), \(r \in {\mathbb {N}}\), \(L \in {\mathbb {N}}_{\ge 2}\), and let \(\Omega \subset {\mathbb {R}}^d\) be a bounded admissible domain with non-empty interior. Consider \(y \in {\mathbb {R}}^d\) and \(s > 0\) satisfying \(y + [0,s]^d \subset \Omega \) and define

$$\begin{aligned} \Delta _j^{(y,s)} : {\mathbb {R}}^d \rightarrow [0,1], x \mapsto \Delta _{j,d}\Big (\frac{x-y}{s}\Big ) \quad \text {for } j \in {\mathbb {N}}. \end{aligned}$$

Then there are \(C_1, C_2 > 0\) such that for all \(j \in {\mathbb {N}}\) the function \(\Delta _j^{(y,s)}\) satisfies

Proof

For the upper bound, since \(\Omega \) is bounded, Theorem 4.7 [Eq. (4.3), which also holds for \(N_q^\alpha \) instead of \(W_q^\alpha \)] shows that it suffices to prove the claim for \(r = 1\). Since \(T_{y,s} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{d}, x \mapsto s^{-1} (x - y)\) satisfies \(\Vert T_{y,s}\Vert _{\ell ^{0,\infty }_*} = 1\), a combination of Lemmas 5.10 and  2.18-(1) shows that there is a constant \(C_L > 0\) satisfying

$$\begin{aligned} \Delta _j^{(y,s)} \in {\mathtt {NN}}^{\varrho _1, d, 1}_{\infty , L, \lfloor C_L \cdot 2^{j / (L-1)} \rfloor } \qquad \text {and} \qquad \Delta _j^{(y,s)} \in {\mathtt {NN}}^{\varrho _1, d, 1}_{\lfloor C_L \cdot 2^{j / \lfloor L/2 \rfloor } \rfloor , L, \infty } \qquad \forall \, j \in {\mathbb {N}}. \end{aligned}$$

Furthermore, \(\Delta _j^{(y,s)} \in X_p (\Omega )\) since \(\Omega \) is bounded and \(\Delta _j^{(y,s)}\) is bounded and continuous. Thus, the Bernstein inequality (5.1) yields a constant \(K_1 > 0\) such that

for all \(j \in {\mathbb {N}}\); similarly, we get a constant \(K_2 > 0\) such that

for all \(j \in {\mathbb {N}}\). Taking \(C_2 := \max \{ K_1,K_2 \} \cdot C_L^\alpha \) then establishes the desired upper bound.

For the lower bound, consider arbitrary \(W,N \in {\mathbb {N}}_{0}\), \(F \in {\mathtt {NN}}^{\varrho _{r},d,1}_{W,L,N}\), and observe that by Lemma 2.18-(1) we have \(F' := F \circ T_{y,s}^{-1} \in {\mathtt {NN}}^{\varrho _{r},d,1}_{W,L,N}\). In view of Proposition E.6, the lower bound follows from the inequality

$$\begin{aligned}&\Vert \Delta _{j}^{(y,s)} \!-\! F\Vert _{L_{p}(\Omega )} \!\ge \! \Vert \Delta _{j}^{(y,s)} \!- F\Vert _{L_{p}(y+[0,s]^{d})}\\&\quad \!=\! \Vert \Delta _{j,d} \circ T_{y,s} -F' \circ T_{y,s}\Vert _{L_{p}(y+[0,s]^{d})} \!=\! s^{d/p} \, \Vert \Delta _{j,d} - F'\Vert _{L_{p}([0,1]^{d})}. \end{aligned}$$

\(\square \)

We can now prove Lemma 3.10.

Proof of Lemma 3.10

Ad (1) If , then the linear map

is well-defined. Furthermore, this map has a closed graph. Indeed, if \(f_n \rightarrow f\) in and \(f_n = \iota f_n \rightarrow g\) in , then the embeddings and (see Proposition 3.2 and Theorem 4.7) imply that \(f_n \rightarrow f\) in \(L_{p_1}\) and \(f_n \rightarrow g\) in \(L_{p_2}\). But \(L_p\)-convergence implies convergence in measure, so that we get \(f = g\).

Now, the closed graph theorem, which applies to F-spaces (see [59, Theorem 2.15]) and hence to quasi-Banach spaces, since these are F-spaces (see [66, Remark after Lemma 2.1.5]), shows that \(\iota \) is continuous. Here, we used that the approximation classes and are quasi-Banach spaces; this is proved independently in Theorem 3.27.

Since \(\Omega \) has non-empty interior, there are \(y \in {\mathbb {R}}^d\) and \(s > 0\) such that \(y + [0,s]^d \subset \Omega \). The continuity of \(\iota \), combined with Theorem E.7, implies for the functions \(\Delta _j^{(y,s)}\) from Theorem E.7 for all \(j \in {\mathbb {N}}\) that

where the implicit constants are independent of j. Hence, \(\beta / (L' - 1) \le \alpha / \lfloor L/2 \rfloor \); that is, \(L' - 1 \ge \tfrac{\beta }{\alpha } \cdot \lfloor L/2 \rfloor \).

Ad (2) Exactly as in the argument above, we get for all \(j \in {\mathbb {N}}\) that

with implied constants independent of j. Hence, \(\alpha / \lfloor L/2 \rfloor \le \beta / (L' - 1)\); that is, \(\lfloor L/2 \rfloor \ge \frac{\alpha }{\beta } \cdot (L' - 1)\).

Proof of the “in particular” part: If , then Parts (1) and (2) show (because of \(\alpha = \beta \)) that \(L - 1 = \lfloor L /2 \rfloor \). Since \(L \in {\mathbb {N}}_{\ge 2}\), this is only possible for \(L = 2\). \(\square \)

As a further consequence of Lemma E.4, we can now prove the non-triviality of the neural network approximation spaces, as formalized in Theorem 4.16.

Proof of Theorem 4.16

In view of the embedding (see Lemma 3.9), it suffices to prove the claim for . Furthermore, it is enough to consider the case \(q = \infty \), since Eq. (3.2) shows that . Next, in view of Remark 3.17, it suffices to consider the case \(k=1\). Finally, thanks to Theorem 4.7, it is enough to prove the claim for the special case \(\varrho = \varrho _r\) (for fixed but arbitrary \(r \in {\mathbb {N}}\)).

Since \(\Omega \) has non-empty interior, there are \(y \in {\mathbb {R}}^d\) and \(s > 0\) such that \(y + [0,s]^d \subset \Omega \). Let us fix \(\varphi \in C_c ({\mathbb {R}}^d)\) satisfying \(0 \le \varphi \le 1\) and \(\varphi |_{y + [0,s]^d} \equiv 1\). With \(\Delta _{j}^{(y,s)}\) as in Theorem E.7, define for \(j \in {\mathbb {N}}\)

$$\begin{aligned} g_j : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \Delta _j^{(y,s)} (x) \cdot \varphi (x) . \end{aligned}$$

Note that \(g_j \in C_c ({\mathbb {R}}^d)\), and hence, \(g_j |_{\Omega } \in X\). Furthermore, since \(0 \le \Delta _j^{(y,s)} \le 1\), it is easy to see \(\Vert g_j|_{\Omega }\Vert _{X} \le \Vert g_j\Vert _{L_p({\mathbb {R}}^d)} \le \Vert \varphi \Vert _{L_p({\mathbb {R}}^d)} =: C\) for all \(j \in {\mathbb {N}}\).

By Theorem 4.2 and Proposition 3.2, we know that is a well-defined quasi-Banach space satisfying . Let us assume toward a contradiction that the claim of Theorem 4.16 fails; this means . Using the same “closed graph theorem arguments” as in the proof of Lemma 3.10, we see that this implies for all and a fixed constant \(C' > 0\). In particular, this implies for all \(j \in {\mathbb {N}}\). In the remainder of the proof, we will show that as \(j \rightarrow \infty \), which then provides the desired contradiction.

To prove , choose \(N_0 \in {\mathbb {N}}\) satisfying \({\mathscr {L}}(N_0) \ge 2\), and let \(N \in {\mathbb {N}}_{\ge N_0}\) and \(f \in {\mathtt {NN}}^{\varrho _r,d,1}_{\infty ,{\mathscr {L}}(N),N}\) be arbitrary. Reasoning as in the proof of Theorem E.7, since \(\varphi \equiv 1\) on \(y + [0,s]^{d}\), we see that if we set \(T_{y,s} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d, x \mapsto s^{-1} (x-y)\), then

$$\begin{aligned} \Vert g_{j} - f\Vert _{L_{p}(\Omega )}\ge & {} \Vert g_{j} - f\Vert _{L_{p}(y+[0,s]^{d})} = \Vert \Delta ^{(y,s)}_{j} - f\Vert _{L_{p}(y+[0,s]^{d})} \\= & {} s^{d/p} \cdot \Vert \Delta _{j,d} - f \circ T^{-1}_{y,s}\Vert _{L_{p}([0,1]^{d})}. \end{aligned}$$

Now, given any \(x' \in {\mathbb {R}}^{d-1}\), let us set \( f_{x'} : {\mathbb {R}}\rightarrow {\mathbb {R}}, t \mapsto (f \circ T^{-1}_{y,s}) ((t, x')) \). As a consequence of Lemma 2.18-(1), we see \(f_{x'} \in {\mathtt {NN}}^{\varrho _r,1,1}_{\infty ,{\mathscr {L}}(N), N}\). According to Part 2 of Lemma 5.19, there is a constant \(K_N \in {\mathbb {N}}\) such that \(f_{x'} \in {\mathtt {PPoly}}_{K_N}^{r^{{\mathscr {L}}(N) - 1}}\). Hence, Lemma E.4 yields a constant \(C_2 = C_2(p) > 0\) such that \(\Vert \Delta _j - f_{x'}\Vert _{L_p ( (0,1) )} \ge C_2\) as soon as \(2^{j} + 1 \ge 4 \, K_N \cdot (1 + r^{{\mathscr {L}}(N) - 1}) =: K_N '\). Because of \(2^{j} + 1 \ge j\), this is satisfied if \(j \ge K_N '\). In case of \(p < \infty \), Fubini’s theorem shows

$$\begin{aligned} \Vert \Delta _{j,d} - f \circ T^{-1}_{y,s}\Vert _{L_p([0,1]^{d})}^p&\ge \int _{[0,1]^{d-1}} \int _{[0,1]} \Big | \Delta _j (t) - f_{x'}(t) \Big |^p \, d t \, d x' \\&= \int _{[0,1]^{d-1}} \Vert \Delta _j - f_{x'}\Vert _{L_p ( (0,1) )}^p \, d x' \ge C_2^p, \end{aligned}$$

whence \( \Vert g_{j}-f\Vert _{L_{p}(\Omega )} \ge s^{d/p} \Vert \Delta _{j,d} - f \circ T^{-1}_{y,s}\Vert _{L_p ([0,1]^{d})} \ge C_2 \cdot s^{d/p} \). For \(p = \infty \), the same estimate remains true because \(\Vert \bullet \Vert _{L_p([0,1]^d)} \le \Vert \bullet \Vert _{L_\infty ([0,1]^d)}\). Since \(f \in {\mathtt {NN}}^{\varrho _r,d,1}_{\infty ,{\mathscr {L}}(N),N}\) was arbitrary, we have shown

$$\begin{aligned} E(g_j, {\mathtt {NN}}^{\varrho _r,d,1}_{\infty ,{\mathscr {L}}(N),N})_{L_p(\Omega )} \ge C_2 \cdot s^{d/p} =: C_3 \qquad \forall \, N \in {\mathbb {N}}_{\ge N_0} \text { and } j \ge K_N '. \end{aligned}$$

Directly from the definition of the norm , this implies that for arbitrary \(N \in {\mathbb {N}}_{\ge N_0}\)

This proves as \(j \rightarrow \infty \), and thus completes the proof. \(\square \)
