Topological Properties of the Set of Functions Generated by Neural Networks of Fixed Size

We analyze the topological properties of the set of functions that can be implemented by neural networks of a fixed size. Surprisingly, this set has many undesirable properties. It is highly non-convex, except possibly for a few exotic activation functions. Moreover, the set is not closed with respect to L^p-norms, 0 < p < ∞, for all practically used activation functions, and also not closed with respect to the L^∞-norm for all practically used activation functions except for the ReLU and the parametric ReLU. Finally, the function that maps a family of weights to the function computed by the associated network is not inverse stable for every practically used activation function. In other words, if f_1, f_2 are two functions realized by neural networks and if f_1, f_2 are close in the sense that ‖f_1 − f_2‖_{L^∞} ≤ ε for ε > 0, it is, regardless of the size of ε, usually not possible to find weights w_1, w_2 close together such that each f_i is realized by a neural network with weights w_i. Overall, our findings identify potential causes for issues in the training procedure of deep learning such as no guaranteed convergence, explosion of parameters, and slow convergence.


Introduction
Neural networks, introduced in 1943 by McCulloch and Pitts [49], are the basis of every modern machine learning algorithm based on deep learning [30,43,63]. The term deep learning describes a variety of methods that are based on the data-driven manipulation of the weights of a neural network. Since these methods perform spectacularly well in practice, they have become the state-of-the-art technology for a host of applications including image classification [36,41,65], speech recognition [22,34,69], game intelligence [64,66,70], and many more.
This success of deep learning has encouraged many scientists to pick up research in the area of neural networks after the field had gone dormant for decades. In particular, quite a few mathematicians have recently investigated the properties of different neural network architectures, hoping that this can explain the effectiveness of deep learning techniques. In this context, mathematical analysis has mainly been conducted in the context of statistical learning theory [20], where the overall success of a learning method is determined by the approximation properties of the underlying function class, the feasibility of optimizing over this class, and the generalization capabilities of the class, when only training with finitely many samples.
In the approximation theoretical part of deep learning research, one analyzes the expressiveness of deep neural network architectures. The universal approximation theorem [21,35,45] demonstrates that neural networks can approximate any continuous function, as long as one uses networks of increasing complexity for the approximation. If one is interested in approximating more specific function classes than the class of all continuous functions, then one can often quantify more precisely how large the networks have to be to achieve a given approximation accuracy for functions from the restricted class. Examples of such results are [7,14,51,52,57,71]. Some articles [18,54,57,62,72] study in particular in which sense deep networks have a superior expressiveness compared to their shallow counterparts, thereby partially explaining the efficiency of networks with many layers in deep learning.
Another line of research studies the training procedures employed in deep learning. Given a set of training samples, the training process is an optimization problem over the parameters of a neural network, where a loss function is minimized. The loss function is typically a nonlinear, non-convex function of the weights of the network, rendering the optimization of this function highly challenging [8,13,38]. Nonetheless, in applications, neural networks are often trained successfully through a variation of stochastic gradient descent. In this regard, the energy landscape of the problem was studied and found to allow convergence to a global optimum, if the problem is sufficiently overparametrized; see [1,16,27,56,67].
The third large area of mathematical research on deep neural networks is analyzing the so-called generalization error of deep learning. In the framework of statistical learning theory [20,53], the discrepancy between the empirical loss and the expected loss of a classifier is called the generalization error. Specific bounds for this error for the class of deep neural networks were analyzed for instance in [4,11], and in more specific settings for instance in [9,10].
In this work, we study neural networks from a different point of view. Specifically, we study the structure of the set of functions implemented by neural networks of fixed size. These sets are naturally (nonlinear) subsets of classical function spaces like L^p(Ω) and C(Ω) for compact sets Ω.
Due to the size of the networks being fixed, our analysis is inherently non-asymptotic. Therefore, our viewpoint is fundamentally different from the analysis in the framework of statistical learning theory. Indeed, in approximation theory, the expressive power of networks growing in size is analyzed. In optimization, one studies the convergence properties of iterative algorithms, usually of some form of stochastic gradient descent. Finally, when considering the generalization capabilities of deep neural networks, one mainly studies how and with which probability the empirical loss of a classifier converges to the expected loss, for increasing numbers of random training samples and depending on the sizes of the underlying networks.
Given this clear delineation from the classical fields, we will see that our point of view yields interpretable results describing phenomena in deep learning that are not directly explained by the classical approaches. We will describe these results and their interpretations in detail in Sects. 1.1-1.3. We will use standard notation throughout most of the paper without explicitly introducing it. We do, however, collect a list of used symbols and notions in Appendix A.
To not interrupt the flow of reading, we have deferred several auxiliary results to Appendix B and all proofs and related statements to Appendices C-E.
Before we continue, we formally introduce the notion of spaces of neural networks of fixed size.

Neural Networks of Fixed Size: Basic Terminology
To state our results, it will be necessary to distinguish between a neural network as a set of weights and the associated function implemented by the network, which we call its realization. To explain this distinction, let us fix numbers L, N_0, N_1, ..., N_L ∈ N. We say that a family Φ = ((A_ℓ, b_ℓ))_{ℓ=1}^L of matrices A_ℓ ∈ R^{N_ℓ × N_{ℓ−1}} and vectors b_ℓ ∈ R^{N_ℓ} is a neural network with architecture S = (N_0, N_1, ..., N_L); given an activation function ϱ : R → R, its realization R_ϱ(Φ) is the function obtained by alternatingly applying the affine maps x ↦ A_ℓ x + b_ℓ and the componentwise application of ϱ, where no activation is applied after the last affine map. In what follows, we study topological properties of sets of realizations of neural networks with a fixed size. Naturally, there are multiple conventions to specify the size of a network. We will study the set of realizations of networks with a given architecture S and activation function ϱ; that is, the set RNN_ϱ(S) := {R_ϱ(Φ) : Φ ∈ NN(S)}, where NN(S) denotes the set of all networks with architecture S. In the context of machine learning, this point of view is natural, since one usually prescribes the network architecture, and during training only adapts the weights of the network.
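For concreteness, the following minimal NumPy sketch (our own illustration; the function and variable names are not taken from the paper) implements the map from a family of weights Φ = ((A_1, b_1), ..., (A_L, b_L)) with architecture S to its realization, applying the activation componentwise after every affine map except the last one.

```python
import numpy as np

def realization(phi, rho):
    """Map a family of weights phi = [(A_1, b_1), ..., (A_L, b_L)] to its realization.

    The activation rho is applied componentwise after every affine map except
    the last one (the convention assumed throughout this sketch).
    """
    def f(x):
        for A, b in phi[:-1]:
            x = rho(A @ x + b)
        A_L, b_L = phi[-1]
        return A_L @ x + b_L
    return f

# Example: architecture S = (2, 3, 1) with the ReLU activation.
rng = np.random.default_rng(0)
S = (2, 3, 1)
phi = [(rng.standard_normal((S[l + 1], S[l])), rng.standard_normal(S[l + 1]))
       for l in range(len(S) - 1)]
relu = lambda t: np.maximum(t, 0.0)
f = realization(phi, relu)
print(f(np.array([0.5, -1.0])))  # a single output value, shape (1,)
```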
Before we continue, let us note that the set NN(S) of all neural networks (that is, the network weights) with a fixed architecture forms a finite-dimensional vector space, which we equip with the norm ‖·‖_total. While the activation function ϱ can in principle be chosen arbitrarily, a couple of particularly useful activation functions have been established in the literature. We proceed by listing some of the most common activation functions, a few of their properties, as well as references to articles using these functions in the context of deep learning. We note that all activation functions listed below are non-constant, monotonically increasing, globally Lipschitz continuous functions. This property is much stronger than the assumption of local Lipschitz continuity that we will require in many of our results. Furthermore, all functions listed below belong to the class C^∞(R \ {0}).
In the remainder of this introduction, we discuss our results concerning the topological properties of the sets of realizations of neural networks with fixed architecture and their interpretation in the context of deep learning. Then, we give an overview of related work. We note at this point that it is straightforward to generalize all of the results in this paper to neural networks for which one only prescribes the total number of neurons and layers and not the specific architecture.
For simplicity, we will always assume in the remainder of this introduction that Ω ⊂ R^{N_0} is compact with non-empty interior.

Non-convexity of the Set of Realizations
We will show in Sect. 2 (Theorem 2.1) that, for a given architecture S, the set RNN_ϱ(S) is not convex, except possibly when the activation function is a polynomial, which is clearly not the case for any of the activation functions that are commonly used in practice. In fact, for a large class of activation functions (including the ReLU and the standard sigmoid activation function), the set RNN_ϱ(S) turns out to be highly non-convex in the sense that for every r ∈ [0, ∞), the set of functions having uniform distance at most r from RNN_ϱ(S) is not convex. We prove this result in Theorem 2.2 and Remark 2.3. This non-convexity is undesirable, since for non-convex sets, there do not necessarily exist well-defined projection operators onto them. In classical statistical learning theory [20], the property that the so-called regression function can be uniquely projected onto a convex (and compact) hypothesis space greatly simplifies the learning problem; see [20, Sect. 7]. Furthermore, in applications where the realization of a network, rather than its set of weights, is the quantity of interest (for example when a network is used as an Ansatz for the solution of a PDE, as in [24,42]), our results show that the Ansatz space is non-convex. This non-convexity is inconvenient if one aims for a convergence proof of the underlying optimization algorithm, since one cannot apply convexity-based fixed-point theorems. Concretely, if a neural network is optimized by stochastic gradient descent so as to satisfy a certain PDE, then it is interesting to see whether there even exists a network at which the iteration stops. In other words, one might ask whether gradient descent on the set of neural networks (potentially with bounded weights) has a fixed point. If the space of neural networks were convex and compact, then the fixed-point theorem of Schauder would guarantee the existence of such a fixed point.
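As a minimal concrete illustration of this non-convexity (our own example, not taken from the paper), consider the architecture S = (1, 1, 1) with the ReLU activation ϱ(x) = max{0, x} on Ω = [−1, 1]. Both f_1(x) = ϱ(x) and f_2(x) = ϱ(−x) are realizations of networks with architecture S, but their midpoint (f_1 + f_2)/2 = |x|/2 is not: every realization x ↦ a ϱ(wx + b) + c of a network with this architecture is monotone on R (non-decreasing if aw ≥ 0 and non-increasing if aw ≤ 0), whereas |x|/2 is strictly decreasing on [−1, 0] and strictly increasing on [0, 1].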

(Non-)Closedness of the Set of Realizations
For any fixed architecture S, we show in Sect. 3 (Theorem 3.1) that the set RNN_ϱ(S) is not a closed subset of L^p(μ) for 0 < p < ∞, under very mild assumptions on the measure μ and the activation function ϱ. The assumptions concerning ϱ are satisfied for all activation functions used in practice.
For the case p = ∞, the situation is more involved: For all activation functions that are commonly used in practice, except for the (parametric) ReLU, the associated sets RNN_ϱ(S) are non-closed also with respect to the uniform norm; see Theorem 3.3. For the (parametric) ReLU, however, the question of closedness of the sets RNN_ϱ(S) remains mostly open. Nonetheless, in two special cases, we prove in Sect. 3.4 that the sets RNN_ϱ(S) are closed. In particular, for neural network architectures with two layers only, Theorem 3.8 establishes the closedness of RNN_ϱ(S), where ϱ is the (parametric) ReLU.
A practical consequence of the observation of non-closedness can be identified with the help of the following argument, which is made precise in Sect. 3.3: We show that the set of realizations of neural networks with a fixed architecture and all affine linear maps bounded in a suitable norm is always closed. As a consequence, we observe the following phenomenon of exploding weights: If a function f does not have a best approximation in RNN_ϱ(S), that is, if there does not exist f* ∈ RNN_ϱ(S) such that ‖f − f*‖_{L^p(μ)} = inf_{g ∈ RNN_ϱ(S)} ‖f − g‖_{L^p(μ)} =: τ_f, then for any sequence of networks (Φ_n)_{n∈N} with architecture S which satisfies ‖f − R_ϱ(Φ_n)‖_{L^p(μ)} → τ_f, the weights of the networks Φ_n cannot remain uniformly bounded as n → ∞. In words, if f does not have a best approximation in the set of neural networks of fixed size, then every sequence of realizations approximately minimizing the distance to f will have exploding weights. Since RNN_ϱ(S) is not closed, there do exist functions f which do not have a best approximation in RNN_ϱ(S).
Certainly, the presence of large coefficients will make the numerical optimization increasingly unstable. Thus, exploding weights in the sense described above are highly undesirable in practice.
The argument above discusses an approximation problem in an L p -norm. In practice, one usually only minimizes "empirical norms". We will demonstrate in Proposition 3.6 that also in this situation, for increasing numbers of samples, the weights of the neural networks that minimize the empirical norms necessarily explode under certain assumptions. Note that the setup of having a fixed architecture and a potentially unbounded number of training samples is common in applications where neural networks are trained to solve partial differential equations. There, training samples are generated during the training process [25,42].
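The following small NumPy experiment (our own sketch, not a construction from the paper, in the spirit of Proposition 3.6) illustrates this effect in the simplest possible setting: a target that jumps at 0 is fitted on a sample grid by a one-neuron sigmoid network x ↦ a·sigmoid(w·x) + c. For every fixed inner weight w, the best outer weights are computed by least squares; the empirical error keeps decreasing as w grows, and no finite w attains the infimum, so any sequence of (approximate) empirical minimizers must have w → ∞.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-np.clip(t, -50.0, 50.0)))

# Samples of a target that jumps at 0; note that 0 itself is not a sample point.
x = np.linspace(-1.0, 1.0, 200)
y = (x > 0).astype(float)

for w in [1.0, 10.0, 100.0, 1000.0]:
    # For a fixed inner weight w, fit the outer weights (a, c) of x -> a*sigmoid(w*x) + c
    # by ordinary least squares.
    design = np.column_stack([sigmoid(w * x), np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    err = np.mean((design @ coef - y) ** 2)
    print(f"inner weight w = {w:7.1f}   empirical squared error = {err:.2e}")

# The error keeps decreasing as w grows; no finite w attains the infimum,
# so approximate empirical minimizers necessarily have exploding inner weights.
```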

Failure of Inverse Stability of the Realization Map
As our final result, we study (in Sect. 4) the stability of the realization map R_ϱ introduced in Eq. (1.1), which maps a family of weights to its realization. Even though this map will turn out to be continuous from the finite-dimensional parameter space to L^p(Ω) for any p ∈ (0, ∞], we will show that it is not inverse stable. In other words, for two realizations that are very close in the uniform norm, there do not always exist network weights associated with these realizations that have a small distance. In fact, Theorem 4.2 even shows that there exists a sequence of realizations of networks converging uniformly to 0, but such that every sequence of weights with these realizations is necessarily unbounded.
For both of these results, continuity and failure of inverse stability, we only need to assume that the activation function ϱ is Lipschitz continuous and not constant.
These properties of the realization map pinpoint a potential problem that can occur when training a neural network: Let us consider a regression problem, where a network is iteratively updated by a (stochastic) gradient descent algorithm trying to minimize a loss function. It is then possible that at some iterate the loss function exhibits a very small error, even though the associated network parameters have a large distance to the optimal parameters. This issue is especially severe since a small error term leads to small steps if gradient descent methods are used in the optimization. Consequently, convergence to the very distant optimal weights will be slow even if the energy landscape of the optimization problem happens to be free of spurious local minima.

Structural properties
The aforementioned properties of non-convexity and non-closedness have, to some extent, been studied before. Classical results analyze the spaces of shallow neural networks, that is, RNN_ϱ(S) for S = (d, N_1, 1), so that L = 2. For such sets of shallow networks, a property that has been extensively studied is to what extent RNN_ϱ(S) has the best approximation property. Here, we say that RNN_ϱ(S) has the best approximation property if for every function f ∈ L^p(Ω) there exists at least one F(f) ∈ RNN_ϱ(S) satisfying ‖f − F(f)‖_{L^p(Ω)} = inf_{g ∈ RNN_ϱ(S)} ‖f − g‖_{L^p(Ω)}. In [40] it was shown that even if a minimizer always exists, the map f ↦ F(f) is necessarily discontinuous. Furthermore, at least for the Heaviside activation function, there does exist a (non-unique) best approximation; see [39].
Additionally, [28, Proposition 4.1] demonstrates, for shallow networks as before, that for the logistic activation function ϱ(x) = (1 + e^{−x})^{−1}, the set RNN_ϱ(S) does not have the best approximation property in C(Ω). In the proof of this statement, it was also shown that RNN_ϱ(S) is not closed. Furthermore, it is claimed that this result should hold for every nonlinear activation function. The previously mentioned result of [39] and Theorem 3.8 below disprove this conjecture for the Heaviside and ReLU activation functions, respectively.

Other notions of (non-)convexity
In deep learning, one chooses a loss function L : C(Ω) → [0, ∞), which is then minimized over the set of neural networks RNN_ϱ(S) with fixed architecture S. A typical loss function is the empirical square loss, that is, L(f) = (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2 for given training samples (x_i, y_i) ∈ Ω × R. In practice, one solves the minimization problem over the weights of the network; that is, one attempts to minimize the function L ∘ R_ϱ : NN(S) → [0, ∞). In this context, to assess the hardness of this optimization problem, one studies whether L ∘ R_ϱ is convex, the degree to which it is non-convex, and whether one can find remedies to alleviate the problem of non-convexity; see for instance [5,6,27,37,50,56,59,67,73].
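The following short sketch (our own illustration, with hypothetical sample data and names) shows the object actually optimized in practice: the composition Φ ↦ L(R_ϱ(Φ)) of the empirical square loss with the realization map, viewed as a function on the finite-dimensional weight space.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def realization(phi):
    # phi = [(A_1, b_1), ..., (A_L, b_L)]; activation applied on all but the last layer
    def f(x):
        for A, b in phi[:-1]:
            x = relu(A @ x + b)
        A_L, b_L = phi[-1]
        return (A_L @ x + b_L)[0]
    return f

def empirical_square_loss(f, samples):
    # L(f) = (1/m) * sum_i (f(x_i) - y_i)^2
    return np.mean([(f(x) - y) ** 2 for x, y in samples])

# Hypothetical training samples (x_i, y_i) and a network with architecture S = (1, 2, 1).
samples = [(np.array([t]), np.sin(t)) for t in np.linspace(-1.0, 1.0, 20)]
phi = [(np.array([[1.0], [-1.0]]), np.array([0.0, 0.0])),
       (np.array([[0.5, -0.5]]), np.array([0.0]))]
print(empirical_square_loss(realization(phi), samples))  # value of (L ∘ R_rho)(phi)
```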
It is important to emphasize that this notion of non-convexity describes properties of the loss function, in contrast to the non-convexity of the sets of functions that we analyze in this work.

In the following, we establish the results announced above for the practically used activation functions listed in Table 1. First, we examine the convexity of the set RNN_ϱ(S):

Theorem 2.1 Let S = (d, N_1, ..., N_L) be a neural network architecture with L ∈ N_{≥2} and let Ω ⊂ R^d have non-empty interior. Moreover, let ϱ : R → R be locally Lipschitz continuous.
If RNN_ϱ(S) is convex, then ϱ is a polynomial.
Remark (1) It is easy to see that all of the activation functions in Table 1 are locally Lipschitz continuous, and that none of them is a polynomial. Thus, the associated sets of realizations are never convex.
(2) In the case where ϱ is a polynomial, the set RNN_ϱ(S) might or might not be convex. Indeed, if S = (1, N, 1) and ϱ(x) = x^m, then it is not hard to see that whether RNN_ϱ(S) is convex depends on the interplay of N and m.
Proof The detailed proof of Theorem 2.1 is the subject of "Appendix C.1". Let us briefly outline the proof strategy: We show in Proposition C.6 that RNN_ϱ^{R^d}(S) (and hence also RNN_ϱ(S)) contains infinitely many linearly independent functions if ϱ is not a polynomial.
In applications, the non-convexity of RNN_ϱ(S) might not be as problematic as it first seems. If, for instance, the set RNN_ϱ(S) + B_δ(0) of functions that can be approximated up to error δ > 0 by a neural network with architecture S were convex, then one could argue that the non-convexity of RNN_ϱ(S) is not severe. Indeed, in practice, neural networks are only trained to minimize a certain empirical loss function, with resulting bounds on the generalization error which are typically of size ε = O(m^{−1/2}), with m denoting the number of training samples. In this setting, one is not really interested in "completely minimizing" the (empirical) loss function, but would be content with finding a function for which the empirical loss is ε-close to the global minimum. Hence, one could argue that one is effectively working with a hypothesis space of the form RNN_ϱ(S) + B_δ(0), containing all functions that can be represented up to an error of δ by neural networks of architecture S.
[Fragment of Table 1: exponential linear unit, inverse square root linear unit, inverse square root unit, sigmoid/logistic, arctan, softplus ln(1 + exp(x)), together with their analyticity/boundedness properties and references [12,15,29,33,46,47].]
To quantify this potentially more relevant notion of convexity of neural networks, we define, for a subset A of a vector space Y, the convex hull of A as co(A) := ⋂ {B ⊂ Y : B is convex and B ⊃ A}.
Hence, the notion of ε-convexity asks whether the convex hull of a set is contained in an ε-enlargement of this set. Note that if RNN_ϱ(S) is dense in C(Ω), then its closure is trivially ε-convex for all ε > 0. Our main result regarding the ε-convexity of neural network sets shows that this is the only case in which RNN_ϱ(S) is ε-convex for every ε > 0.

(Non-)Closedness of the Set of Realizations
Let ∅ ≠ Ω ⊂ R^d be compact with non-empty interior. In the present section, we analyze whether the neural network realization set RNN_ϱ(S) with S = (d, N_1, ..., N_{L−1}, 1) is closed in C(Ω), or in L^p(μ) for p ∈ (0, ∞) and any measure μ satisfying a mild "non-atomicness" condition. For the L^p-spaces, the answer is simple: Under very mild assumptions on the activation function ϱ, we will see that RNN_ϱ(S) is never closed in L^p(μ). In particular, this holds for all of the activation functions listed in Table 1. Closedness in C(Ω), however, is more subtle: For this setting, we will identify several different classes of activation functions for which the set RNN_ϱ(S) is not closed in C(Ω). As we will see, these classes of activation functions cover all those functions listed in Table 1, except for the ReLU and the parametric ReLU. For these two activation functions, we were unable to determine whether the set RNN_ϱ(S) is closed in C(Ω) in general, but we conjecture this to be true. Only for the case L = 2 could we show that these sets are indeed closed.
Closedness of RNN_ϱ(S) is a highly desirable property, as we will demonstrate in Sect. 3.3. Indeed, we establish that if X = L^p(μ) or X = C(Ω), then, for all functions f ∈ X that do not possess a best approximation within R := RNN_ϱ(S), the weights of approximating networks necessarily explode. In other words, if (R_ϱ(Φ_n))_{n∈N} ⊂ R is such that ‖f − R_ϱ(Φ_n)‖_X converges to inf_{g∈R} ‖f − g‖_X for n → ∞, then ‖Φ_n‖_total → ∞. Such functions without a best approximation in R necessarily exist if R is not closed. Moreover, even in practical applications, where empirical error terms instead of L^p(μ)-norms are minimized, the absence of closedness implies exploding weights, as we show in Proposition 3.6.
Finally, we note that for simplicity, all "non-closedness" results in this section are formulated only for compact rectangles of the form Ω = [−B, B]^d; but our arguments easily generalize to any compact set Ω ⊂ R^d with non-empty interior.

Non-closedness in L^p(Ω)
We start by examining the closedness with respect to L^p-norms for p ∈ (0, ∞). In fact, for all B > 0 and all widely used activation functions (including all activation functions presented in Table 1), the set RNN_ϱ^{[−B,B]^d}(S) is not closed in L^p(μ), for any p ∈ (0, ∞) and any "sufficiently non-atomic" measure μ on [−B, B]^d. To be more precise, the following is true: Theorem 3.1 Let S = (d, N_1, ..., N_{L−1}, 1) be a neural network architecture with L ∈ N_{≥2}. Let ϱ : R → R be a function satisfying the following conditions: (i) ϱ is continuous and increasing; (ii) there is some x_0 ∈ R such that ϱ is differentiable at x_0 with ϱ'(x_0) ≠ 0; (iii) there is some r > 0 such that ϱ|_{(−∞,−r)∪(r,∞)} is differentiable; (iv) at least one of the following two conditions is satisfied: Remark If supp μ is countable, then μ = Σ_{x ∈ supp μ} μ({x}) δ_x is a countable sum of Dirac measures, meaning that μ is purely atomic. In particular, if μ is non-atomic (meaning that μ({x}) = 0 for all x ∈ [−B, B]^d), then supp μ is uncountable and the theorem is applicable.
Proof For the proof of the theorem, we refer to "Appendix D.1". The main proof idea consists in the approximation of a (discontinuous) step function which cannot be represented by a neural network with a continuous activation function.
Corollary 3.2 Theorem 3.1 applies to all activation functions listed in Table 1. In any case where ϱ is bounded, one can take N_{L−1} = 1; otherwise, one can take N_{L−1} = 2.
Proof For a proof of this statement, we refer to "Appendix D.2".
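To see the mechanism behind Theorem 3.1 in the simplest case (a sketch of our own, not the construction from the appendix), the realizations x ↦ sigmoid(n·x) of tiny two-layer networks converge in L^1([−1, 1]) to a jump function, which has no continuous representative and therefore cannot itself be the realization of a network with a continuous activation function.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-np.clip(t, -50.0, 50.0)))
step = lambda x: (x > 0).astype(float)      # the discontinuous limit function

x = np.linspace(-1.0, 1.0, 200001)          # fine grid for a crude quadrature on [-1, 1]
for n in [1, 10, 100, 1000]:
    h_n = sigmoid(n * x)                    # realization of a tiny network with inner weight n
    l1_dist = 2.0 * np.mean(np.abs(h_n - step(x)))   # approximate L^1([-1, 1]) distance
    print(f"n = {n:5d}   ||h_n - step||_L1 ≈ {l1_dist:.4f}")

# The L^1 distances tend to 0, but the limit is discontinuous, so the set of
# realizations of this fixed architecture is not closed in L^1.
```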

Non-closedness in C([−B, B]^d) for Many Widely Used Activation Functions
We have seen in Theorem 3.1 that, under reasonably mild assumptions on the activation function ϱ, which are satisfied for all commonly used activation functions, the set RNN_ϱ^{[−B,B]^d}(S) is not closed in any L^p-space with p ∈ (0, ∞). However, the argument of the proof of Theorem 3.1 breaks down if one considers closedness with respect to the ‖·‖_sup-norm. Therefore, we will analyze this setting more closely in this section. More precisely, in Theorem 3.3, we present several criteria regarding the activation function ϱ which imply that the set RNN_ϱ^{[−B,B]^d}(S) is not closed in C([−B, B]^d). We remark that in all these results, ϱ will be assumed to be at least C^1. Developing similar criteria for non-differentiable functions is an interesting topic for future research.
Before we formulate Theorem 3.3, we need the notion of a function f : R → R being approximately homogeneous of order (r, q) for r, q ∈ N_0; roughly speaking, such a function behaves like x^r as x → +∞ and like x^q as x → −∞, up to multiplicative constants and bounded errors. Now the following theorem holds: Theorem 3.3 Let S = (d, N_1, ..., N_{L−1}, 1) be a neural network architecture with L ∈ N_{≥2}, let B > 0, and let ϱ : R → R. Assume that at least one of the following three conditions is satisfied: (ii) N_{L−1} ≥ 2 and ϱ is bounded, analytic, and not constant. (iii) ϱ is approximately homogeneous of order (r, q) for certain r, q ∈ N_0 with r ≠ q, and ϱ ∈ C^{max{r,q}}(R).

Then the set RN N [−B,B] d (S) is not closed in the space C([−B, B] d ).
Proof For the proof of the statement, we refer to "Appendix D.3". In particular, the proof of the statement under Condition (i) can be found in "Appendix D.3.1". Its main idea consists of the uniform approximation, by neural networks, of a function which cannot be represented by neural networks with activation function ϱ, due to its lack of sufficient regularity. For the proof of the statement under Condition (ii), we refer to "Appendix D.3.2". The main proof idea consists in the uniform approximation of an unbounded analytic function which cannot be represented by a neural network with activation function ϱ, since ϱ itself is bounded. Finally, the proof of the statement under Condition (iii) can be found in "Appendix D.3.3". Its main idea consists in the approximation of the function x ↦ x_+^{max{r,q}}, which is not of class C^{max{r,q}}.

Corollary 3.4 Theorem 3.3 applies to all activation functions listed in Table 1 except for the ReLU and the parametric ReLU. To be more precise,
(1) Condition (i) is fulfilled by the function x → max{0, x} k for k ≥ 2, and by the exponential linear unit, the softsign function, and the inverse square root linear unit.
(2) Condition (ii) is fulfilled by the inverse square root unit, the sigmoid function, the tanh function, and the arctan function.
(3) Condition (iii) is fulfilled by the softplus function.
Proof For the proof of this statement, we refer to "Appendix D.4". In particular, for the proof of (1), we refer to "Appendix D.4.1", the proof of (2) is clear, and for the proof of (3), we refer to "Appendix D.4.2".

The Phenomenon of Exploding Weights
We have just seen that the realization set RNN_ϱ^{[−B,B]^d}(S) is not closed in L^p(μ) for any p ∈ (0, ∞) and every practically relevant activation function. Furthermore, for a variety of activation functions, we have seen that this set is not closed in C([−B, B]^d) either. The situation is substantially different if the weights are taken from a compact subset: Proposition 3.5 Let S = (d, N_1, ..., N_L) be a neural network architecture, let Ω ⊂ R^d be compact, let furthermore p ∈ (0, ∞), and let ϱ : R → R be continuous. For C > 0, let Θ_C denote the set of all networks Φ ∈ NN(S) with ‖Φ‖_total ≤ C. Then the set R_ϱ(Θ_C) is compact in C(Ω) as well as in L^p(μ), for any finite Borel measure μ on Ω and any p ∈ (0, ∞).

Proof
The proof of this statement is based on the continuity of the realization map and can be found in "Appendix D.5".
Proposition 3.5 helps to explain the phenomenon of exploding network weights that is sometimes observed during the training of neural networks. Indeed, let us assume that f does not have a best approximation in R := RNN_ϱ(S); in fact, one can take any f that lies in the closure of R but not in R itself. Next, recall from Proposition 3.5 that the subset of R that contains only realizations of networks with uniformly bounded weights is compact.
Hence, we conclude the following: For every sequence of networks (Φ_n)_{n∈N} with architecture S for which f_n := R_ϱ(Φ_n) satisfies ‖f − f_n‖ → inf_{g∈R} ‖f − g‖, we must have ‖Φ_n‖_total → ∞, since otherwise, by compactness, (f_n)_{n∈N} would have a subsequence that converges to some h ∈ R_ϱ(Θ_C) ⊂ R, and h would then be a best approximation of f within R, contradicting our assumption. In other words, the weights of the networks Φ_n necessarily explode.
The argument above only deals with the approximation problem in the space L^p(μ). In practice, one is often not concerned with these norms, but instead wants to minimize an empirical loss function over R. For the empirical square loss, this loss function takes the form (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2, where the samples (x_i, y_i) are drawn independently according to a probability distribution σ on Ω × R. By the strong law of large numbers, for each fixed measurable function f, the empirical loss function converges almost surely to the expected loss

E_σ(f) = ∫_{Ω×R} (f(x) − y)^2 dσ(x, y).   (3.1)

This expected loss is related to an L^2 minimization problem. Indeed, [20, Proposition 1] shows that there is a measurable function f_σ : Ω → R, called the regression function, such that the expected risk from Eq. (3.1) satisfies

E_σ(f) = ‖f − f_σ‖²_{L²(σ_Ω)} + E_σ(f_σ)   for each measurable f : Ω → R.   (3.2)

Here, σ_Ω is the marginal probability distribution of σ on Ω, and E_σ(f_σ) is called the Bayes risk; it is the minimal expected loss achievable by any (measurable) function.
In this context of a statistical learning problem, we have the following result regarding exploding weights: ∼ σ ; all probabilities below will be with respect to this family of random variables.
then ‖Φ_N‖_total → ∞ in probability as N → ∞. Remark A compact way of stating Proposition 3.6 is that, if f_σ has no best approximation in RNN_ϱ(S) with respect to ‖·‖_{L²(σ_Ω)}, then the weights of the minimizers (or approximate minimizers) of the empirical square loss explode for increasing numbers of samples.
Since σ is unknown in applications, it is indeed possible that f_σ has no best approximation in the set of neural networks. As just one example, this is true if σ_Ω is any Borel probability measure on Ω and if σ is the distribution of (X, g(X)), where X ∼ σ_Ω and g ∈ L²(σ_Ω) is bounded and lies in the closure of RNN_ϱ(S) but not in RNN_ϱ(S) itself, with the closure taken in L²(σ_Ω). The existence of such a function g is guaranteed by Theorem 3.1 if supp σ_Ω is uncountable.
Proof For the proof of Proposition 3.6, we refer to "Appendix D.6". The proof is based on classical arguments of statistical learning theory as given in [20].

Closedness of ReLU Networks in C([−B, B] d )
In this subsection, we analyze the closedness of sets of realizations of neural networks with respect to the ReLU or the parametric ReLU activation function in C(Ω), mostly for the case Ω = [−B, B]^d. We conjecture that the set of (realizations of) ReLU networks of a fixed complexity is closed in C(Ω), but we were not able to prove such a result in full generality. In two special cases, namely when the networks have only two layers, or when at least the scaling weights are bounded, we can show that the associated set of ReLU realizations is closed in C(Ω); see below.
We begin by analyzing the set of realizations with uniformly bounded scaling weights and possibly unbounded biases, before proceeding with the analysis of two layer ReLU networks.
If a network Φ satisfies ‖Φ‖_scaling ≤ C for some C > 0, we say that the network has C-bounded scaling weights. Note that this does not require the biases b_ℓ of the network to satisfy |b_ℓ| ≤ C.
Our first goal in this subsection is to show that if ϱ denotes the ReLU, if S = (d, N_1, ..., N_L), if C > 0, and if Ω ⊂ R^d is measurable and bounded, then the set of realizations of networks with architecture S and C-bounded scaling weights is closed. Here, and in the remainder of the paper, we use the norm ‖f‖_{L^p(μ; R^{N_L})} = ‖ |f| ‖_{L^p(μ)} for vector-valued L^p-spaces. The norm on C(Ω; R^{N_L}) is defined similarly. The difference between the following proposition and Proposition 3.5 is that in the following proposition, the "shift weights" (the biases) of the networks can be potentially unbounded. Therefore, the resulting set is merely closed, not compact.

Remark
In fact, the proof shows that each subset of Proof For the proof of the statement, we refer to "Appendix D.7". The main idea is to show that for every sequence ( n ) n∈N ⊂ N N (S) of neural networks with uniformly bounded scaling weights and with R ( n ) L 1 (μ) ≤ M, there exist a subsequence ( n k ) k∈N of ( n ) n∈N and neural networks ( n k ) k∈N with uniformly bounded scaling weights and biases such that R n k = R n k . The rest then follows from Proposition 3.5.
As our second result in this section, we show that the set of realizations of two-layer neural networks with arbitrary scaling weights and biases is closed in C(Ω) if the activation function is the parametric ReLU. It is a fascinating question for further research whether this also holds for deeper networks.
Proof For the proof of the statement, we refer to "Appendix D.8"; here we only sketch the main idea: First, note that each The proof is based on a careful-and quite technical-analysis of the singularity hyperplanes of the functions a ( α i , x + β i ), that is, the hyperplanes α i , x + β i = 0 on which these functions are not differentiable. More precisely, given a uniformly convergent N 0 , 1)), we analyze how the singularity hyperplanes of the functions f n behave as n → ∞, in order to show that the limit is again of the same form as the f n . For more details, we refer to the actual proof.

Failure of Inverse Stability of the Realization Map
In this section, we study the properties of the realization map R . First of all, we observe that the realization map is continuous.

The realization map R_ϱ is continuous. If ϱ is locally Lipschitz continuous, then so is R_ϱ. Finally, if ϱ is globally Lipschitz continuous, then there is a constant C
Proof For the proof of this statement, we refer to "Appendix E.1".
In general, the realization map is not injective; that is, there can be networks Φ ≠ Ψ such that R_ϱ(Φ) = R_ϱ(Ψ); simple instances of this are easy to construct (one is sketched below). In this section, our main goal is to determine whether, up to the failure of injectivity, the realization map is a homeomorphism onto its range; mathematically, this means that we want to determine whether the realization map is a quotient map. We will see that this is not the case.
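One concrete instance of this non-injectivity (our own illustration, exploiting the positive homogeneity of the ReLU) rescales one layer and undoes the rescaling in the next: the weights differ, yet the realizations agree.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def realize(phi, x):
    for A, b in phi[:-1]:
        x = relu(A @ x + b)
    A_L, b_L = phi[-1]
    return A_L @ x + b_L

A1, b1 = np.array([[1.0, -2.0]]), np.array([0.5])
A2, b2 = np.array([[3.0]]), np.array([-1.0])
c = 10.0
phi = [(A1, b1), (A2, b2)]
psi = [(c * A1, c * b1), (A2 / c, b2)]           # different weights ...

xs = [np.array(v) for v in ([0.3, 0.1], [-1.0, 2.0], [0.7, -0.4])]
print([np.allclose(realize(phi, x), realize(psi, x)) for x in xs])  # ... identical realization
```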
To this end, we will prove, for fixed Ψ, that even if R_ϱ(Φ) is very close to R_ϱ(Ψ), it is in general not true that R_ϱ(Φ) = R_ϱ(Φ') for some network weights Φ' close to Ψ. Precisely, this follows from the following theorem applied with Ψ = 0 and Φ = Φ_n. Then there is a sequence (Φ_n)_{n∈N} of networks with architecture S and the following properties: in particular, the realizations R_ϱ(Φ_n) converge uniformly to 0. Finally, if (Φ_n)_{n∈N} is a sequence of networks with architecture S and the preceding properties, then the following holds: For each sequence of networks (Ψ_n)_{n∈N} with architecture S and R_ϱ(Ψ_n) = R_ϱ(Φ_n), we have ‖Ψ_n‖_scaling → ∞.
Proof For the proof of the statement, we refer to "Appendix E.2". The proof is based on the fact that the Lipschitz constant of the realization of a network essentially yields a lower bound on the · scaling norm of every neural network with this realization. We construct neural networks n the realizations of which have small amplitude but high Lipschitz constants. The associated realizations uniformly converge to 0, but every associated neural network must have exploding weights.
We finally rephrase the preceding result in more topological terms: Proof For the proof of the statement, we refer to "Appendix E.3".

Appendix A: Notation
The symbol N denotes the natural numbers N = {1, 2, 3, . . . }, whereas N 0 = {0}∪N stands for the natural numbers including zero. Moreover, we set N ≥d :={n ∈ N : n ≥ d} for d ∈ N. The number of elements of a set M will be denoted by |M| ∈ N 0 ∪ {∞}. Furthermore, we write n:={k ∈ N : k ≤ n} for n ∈ N 0 . In particular, 0 = ∅. For two sets A, B, a map f : A → B, and C ⊂ A, we write f | C for the restriction of f to C. For a set A, we denote by The algebraic dual space of a K-vector space Y (with K = R or K = C), that is the space of all linear functions ϕ : Y → K, will be denoted by Y * . In contrast, if Y is a topological vector space, we denote by Y the topological dual space of Y, which consists of all functions ϕ ∈ Y * that are continuous.
Given functions ( f i ) i∈n with f i : X i → Y i , we consider three different types of products between these maps: The Cartesian product of f 1 , . . . , f n is The tensor product of f 1 , . . . , f n is defined if Y 1 , . . . , Y n ⊂ C, and is then given by Finally, the direct sum of f 1 , . . . , f n is defined if X 1 = · · · = X n , and given by The closure of a subset A of a topological space will be denoted by A, while the As an example, the Euclidean scalar product on R d is given by x, y = d i=1 x i y i . We denote the Euclidean norm by |x| : For n ∈ N and ∅ = ⊂ R d , we denote by C( ; R n ) the space of all continuous functions defined on with values in R n . If is compact, then (C( ; R n ), · sup ) denotes the Banach space of R n -valued continuous functions equipped with the supremum norm, where we use the Euclidean norm on R n . If n = 1, then we shorten the notation to C( ). We note that on C( ), the supremum norm coincides with the L ∞ ( )-norm, if for all x ∈ and for all ε > 0 we have that λ( ∩ B ε (x)) > 0, where λ denotes the Lebesgue measure on R d . For any nonempty set U ⊂ R, we say that a function for all such x, y, we say that f is strictly increasing. The terms "decreasing" and "strictly decreasing" are defined analogously.
The Schwartz space will be denoted by S(R d ) and the space of tempered distributions by S (R d ). The associated bilinear dual pairing will be denoted by ·, · S ,S . We refer to [26, Sects. 8.1-8.3 and 9.2] for more details on the spaces S(R d ) and S (R d ). Finally, the Dirac delta distribution δ x at x ∈ R d is given by

Appendix B: Auxiliary Results: Operations with Neural Networks
This part of the appendix is devoted to auxiliary results that are connected with basic operations one can perform with neural networks and which we will frequently make use of in the proofs below. We start by showing that one can "enlarge" a given neural network in such a way that the realizations of the original network and the enlarged network coincide. To be more precise, the following holds: Here, 0 m 1 ×m 2 and 0 k denote the zero-matrix in R m 1 ×m 2 and the zero vector in R k , respectively. Clearly, R ( ) = R ( ). This yields the claim.
Another operation that we can perform with networks is concatenation, as given in the following definition.
be two neural networks such that the input layer of 1 has the same dimension as the output layer of 2 . Then, 1 2 denotes the following L 1 + L 2 − 1 layer network: Then, we call 1 2 the concatenation of 1 and 2 .
One directly verifies that for every : R → R the definition of concatenation is reasonable, that is, if d i is the dimension of the input layer of i , i = 1, 2, and if We close this section by showing that under mild assumptions on -which are always satisfied in practice-and on the network architecture, one can construct a neural network which locally approximates the identity mapping id R d to arbitrary accuracy. Similarly, one can obtain a neural network the realization of which approximates the projection onto the i-th coordinate. The main ingredient of the proof is the approximation , which holds for |x| small enough and where x 0 is chosen such that (x 0 ) = 0.
are monotonically increasing in every coordinate and for all j ∈ {1, . . . , d}.
Proof We first consider the special case L = 1. Here, we can take B which implies that all claimed properties are satisfied. Thus, we can assume in the following that L ≥ 2.
We set r 0 := (x 0 ) and s 0 := (x 0 ). Next, for C > 0, we define To see this, first note by definition of the derivative that there is some δ > 0 with Here we implicitly used that s 0 = (x 0 ) = 0 to ensure that the right-hand side is a positive multiple of |t|. Now, set C 0 :=(B + L)/δ, and let C ≥ C 0 be arbitrary. Note Hence, if we set t:=x/C, then |t| ≤ δ. Therefore, Using these preliminary observations, we now construct the neural networks B and d, d)). To shorten the notation, let : . We obtain C ∈ N N ((d, . . . , d)) (with L layers) and where C is applied t ≤ L times. Therefore, since ε = ε/(d L), we conclude for As we saw above, C is differentiable at 0 with C (0) = 0 and C (0) = 1. By induction, we thus get d dx x=0 ( C • · · · • C )(x) = 1, where the composition has at most L factors. Thanks to Eq. (B.1), this shows that R ( C ) is totally differentiable at 0, with D(R ( C ))(0) = id R d , as claimed.
Also by Eq. (B.1), we see that for every j ∈ {1, . . . , d}, R ( C )(x) j is constant in all but the j-th coordinate. Additionally, if is increasing, then s 0 > 0, so that C is also increasing, and hence R ( C ) j is increasing in the j-th coordinate, since compositions of increasing functions are increasing. Hence, B ε := C satisfies the desired properties.
We proceed with the second part of the proposition. We first prove the statement where C is applied L − 1 times. Exactly as in the proof of the first part, this implies for C ≥ C 0 that Setting B ε,1 := C and repeating the previous arguments yields the claim for i = 1. Permuting the columns of A 1 yields the result for arbitrary i ∈ {1, . . . , d}. Now, let be increasing. Then, s 0 > 0, and thus C is increasing for every C > 0. Since R ( C ) is the composition of componentwise monotonically increasing functions, the claim regarding the monotonicity follows.
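The approximation underlying Proposition B.3 can be checked numerically (a sketch under the stated assumptions, using the sigmoid with x_0 = 0, so that ϱ'(x_0) = 1/4 ≠ 0; the specific constants are our own choice): for large C, the one-neuron map x ↦ C (ϱ(x_0 + x/C) − ϱ(x_0)) / ϱ'(x_0) is uniformly close to the identity on a bounded interval.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
x0, drho = 0.0, 0.25                     # sigmoid'(0) = 1/4 != 0

x = np.linspace(-2.0, 2.0, 4001)
for C in [1.0, 10.0, 100.0, 1000.0]:
    approx_id = C * (sigmoid(x0 + x / C) - sigmoid(x0)) / drho
    print(f"C = {C:7.1f}   sup-error of the approximate identity ≈ {np.max(np.abs(approx_id - x)):.2e}")

# The error tends to 0 as C grows, which is how a network layer with a differentiable
# activation can approximately pass its input through unchanged.
```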

C.1. Proof of Theorem 2.1
We first establish the star-shapedness of the set of all realizations of neural networks, which is a direct consequence of the fact that the set is invariant under scalar multiplication. The following proposition provides the details. N 1 , . . . , N L ) be a neural network architecture, let ⊂ R d , and let : R → R. Then, the set RN N (S) is closed under scalar multiplication and is star-shaped with respect to the origin. Our next goal is to show that RN N (S) cannot contain infinitely many linearly independent centers.
As a preparation, we prove two related results which show that the class RN N (S) is "small". The main assumption for guaranteeing this is that the activation function should be locally Lipschitz continuous.
Proof Since is locally Lipschitz continuous, Proposition 4.1 (which will be proved completely independently) shows that the realization map As a composition of locally Lipschitz continuous functions, the map is locally Lipschitz continuous, and satisfies . But it is well known (see for instance [2]Theorem 5.9), that a locally Lipschitz continuous function between Euclidean spaces of the same dimension maps sets of Lebesgue measure zero to sets of Lebesgue measure zero. Hence, (RN N (S)) ⊂ R M is a set of Lebesgue measure zero.
As a corollary, we can now show that the class of neural network realizations cannot contain a subspace of large dimension. N 1 , . . . , N L ) be a neural network architecture, set N 0 :=d, and let : R → R be locally Lipschitz continuous.
Assume toward a contradiction that the claim of the corollary does not hold; then there exists a subspace V ⊂ C( ; We claim that the linear map Since x ∈ and ∈ N L were arbitrary, this means f ≡ 0. Therefore, 0 is injective and thus surjective. Now, let us define :={x 1 , . . . , x D+1 }, and note that ⊂ R d is compact. Set M:=D + 1, and define It is straightforward to verify that is Lipschitz continuous. Therefore, Lemma C.2 shows that the set (RN N (S)) ⊂ R M is a null-set. However, This yields the desired contradiction. Now, the announced estimate for the number of linearly independent centers of the set of all network realizations of a fixed size is a direct consequence.

Since RN N (S) is closed under multiplication with scalars, this implies
Indeed, this follows by induction on M, using the following observation: Since the family R ( k ) k∈M is linearly independent, we see In view of Corollary C.3, this yields the desired contradiction.
Next, we analyze the convexity of RN N (S). As a direct consequence of Proposition C.4, we see that RN N (S) is never convex if RN N (S) contains more than a certain number of linearly independent functions. Proof Every element of a convex set is a center. Thus the result follows directly from Proposition C.4.
Corollary C.5 claims that if a set of realizations of neural networks with fixed size contains more than a fixed number of linearly independent functions, then it cannot be convex. Since RN N R d (S) is translation invariant, it is very likely that RN N R d (S) (and hence also RN N (S)) contains infinitely many linearly independent functions. In fact, our next result shows under minor regularity assumptions on that if the set RN N (S) does not contain infinitely many linearly independent functions, then is necessarily a polynomial. S = (d, N 1 , . . . , N L ) be a neural network architecture with L ∈ N ≥2 . Moreover, let : R → R be continuous. Assume that there exists x 0 ∈ R such that is differentiable at x 0 with (x 0 ) = 0.

Proposition C.6 Let
Further assume that ⊂ R d has nonempty interior, and that RN N (S) does not contain infinitely many linearly independent functions. Then, is a polynomial.

Proof
Step 1 Set S := (d, N 1 , . . . , N L−1 , 1). We first show that RN N (S ) does not contain infinitely many linearly independent functions. To see this, first note that the map which maps an R N L -valued function to its first component, is linear, well-defined, and surjective.
Hence, if there were infinitely many linearly independent functions ( f n ) n∈N in the set RN N (S ), then we could find (g n ) n∈N in RN N (S) such that f n = g n . But then the (g n ) n∈N are necessarily linearly independent, contradicting the hypothesis of the theorem.
Step 2 We show that G := RN N R d (S ) does not contain infinitely many linearly independent functions.
To see this, first note that since F := RN N (S ) does not contain infinitely many linearly independent functions (Step 1), elementary linear algebra shows that there is a finite-dimensional subspace V ⊂ C( ; R) satisfying F ⊂ V . Let D := dim V , and assume toward a contradiction that there are D + 1 linearly independent functions f 1 , . . . , f D+1 ∈ G, and set W : We claim that the map is surjective. Since dim W = D + 1, it suffices to show that is injective. If this was not true, there would be some This contradiction shows that is injective, and hence surjective. Now, since has nonempty interior, there is some b ∈ and some r > 0 such that y := b + r x ∈ for all ∈ D + 1. Define It is not hard to see g ∈ G, and hence g | ∈ F ⊂ V for all ∈ D + 1. Now, define the linear operator Since the functions f 1 , . . . , f D+1 span the space W , this implies (V ) ⊃ (W ) = R D+1 , in contradiction to being linear and dim V = D < D + 1. This contradiction shows that G does not contain infinitely many linearly independent functions.
Step 3 From the previous step, we know that G = RN N R d (S ) does not contain infinitely many linearly independent functions. In this step, we show that this implies that the activation function is a polynomial.
Clearly, RN N * S , is dilation-and translation invariant; that is, if f ∈ RN N * S , , then also f (a ·) ∈ RN N * S , and f (· − x) ∈ RN N * S , for arbitrary a > 0 and x ∈ R. Furthermore, by Step 2, we see that RN N * S , does not contain infinitely many linearly independent functions. Therefore, V := span RN N * S , is a finite-dimensional translation-and dilation invariant subspace of C(R). Thanks to the translation invariance, it follows from [3] that there exists some r ∈ N, and certain λ j ∈ C, k j ∈ N 0 for j = 1, . . . , r such that where span C denotes the linear span, with C as the underlying field. Clearly, we can assume (k j , λ j ) = (k , λ ) for j = .
Step 4 Let N := max j∈{1,...,r } k j . We now claim that V is contained in the space C deg≤N [X ] of (complex) polynomials of degree at most N .
Step  N N ((d, N 1 , . . . , N L−1 )) such that In particular, this implies because of δ ≤ 1 that with acting componentwise. By (C.3), it follows that there is some ε,B ∈ N N (S ) satisfying From (C.4) and Step 4, we thus see where the closure is taken with respect to the sup norm, and where we implicitly used that the space on the right-hand side is a closed subspace of C ([−B, B]), since it is a finite dimensional subspace.
Since | [−B,B] is a polynomial of degree at most N ; we see that the N + 1-th derivative of satisfies (N +1) ≡ 0 on (−B, B), for arbitrary B > 0. Thus, (N +1) ≡ 0, meaning that is a polynomial.
In the above proof, we used the following elementary lemma, whose proof we provide for completeness. Lemma C.7 For k ∈ N 0 and λ ∈ C, define f k,λ : R → C, x → x k e λx .
Proof Let us assume toward a contradiction that for some coefficient vector (a 1 , . . . , a N  Note that this implies k < k j for all ∈ I \ { j}, since (k , λ ) = (k j , λ j ) and hence k = k j for ∈ I \ { j}.
Consider the differential operator where a j c k j = 0 and where deg q < k j , since k j > k for all ∈ I \ { j}. This is the desired contradiction.
As our final ingredient for the proof of Theorem 2.1, we show that every non-constant locally Lipschitz function satisfies the technical assumptions of Proposition C.6.  B] is not constant, the preceding formula shows that there has to be some x 0 ∈ (−B, B) such that (x 0 ) = 0; in particular, this means that is differentiable at x 0 . Now, a combination of Corollary C.5, Proposition C.6, and Lemma C.8 proves Theorem 2.1. For the application of Lemma C.8, note that if is constant, then is a polynomial, so that the conclusion of Theorem 2.1 also holds in this case.

C.2. Proof of Theorem 2.2
We first show in the following lemma that if RN N (S) is convex, then RN N (S) is dense in C( ). The proof of Theorem 2.2 is given thereafter. S = (d, N 1 , . . . , N L−1 , 1) be a neural network architecture with L ≥ 2. Let ⊂ R d be compact and let : R → R be continuous but not a polynomial. Finally, assume that there is some x 0 ∈ R such that is differentiable at

Lemma C.9 Let
Proof Since RN N (S) is convex and closed under scalar multiplication, RN N (S) forms a closed linear subspace of C( ). Below, we will show that As shown in [45], this then entails that RN N (S) (and hence also RN N (S)) is dense in C( ), since is not a polynomial.

C.3. Non-dense Network Sets
In this section, we review criteria on which ensure that RN N (S) = C( ).
Precisely, we will show that this is true if : R → R is computable by elementary operations, which means that there is some N ∈ N and an algorithm that takes x ∈ R as input and returns (x) after no more than N of the following operations: -applying the exponential function exp : R → R; -applying one of the arithmetic operations +, −, ×, and / on real numbers; -jumps conditioned on comparisons of real numbers using the following operators: <, >, ≤, ≥, =, =.
Then, a combination of [ Using this result, we can now show that the realization sets of networks with activation functions that are computable by elementary operations are never dense in L p ( ) or C( ). = (d, N 1 , . . . , N L−1 , 1) be a neural network architecture. Let ⊂ R d be any measurable set with nonempty interior, and let Y denote either L p ( ) (for some p ∈ [1, ∞)), or C( ). In case of Y = C( ), assume additionally that is compact.

Proposition C.10 Let : R → R be continuous and computable by elementary operations. Moreover, let S
Then Proof The considerations from before the statement of the proposition show that Therefore, all we need to show is that if F ⊂ C( ) is a function class for which F ∩ Y is dense in Y, then Pdim(F) = ∞.
For Y = C( ), this is easy: Let m ∈ N be arbitrary, choose distinct points x 1 , . . . , x m ∈ , and note that for each b ∈ {0, 1} m , there is g b ∈ C( ) satisfying g b (x j ) = b j for all j ∈ m. By density, for each Since m ∈ N was arbitrary, Pdim(F) = ∞.
For Y = L p ( ), one can modify this argument as follows: Since has nonempty interior, there are x 0 ∈ and r > 0 such that x 0 + r [0, 1] d ⊂ . Let m ∈ N be arbitrary, and for j ∈ m define M j : so that we can choose for each j ∈ m some x j ∈ M j \ b∈{0,1} m b . We then have Note that the following activation functions are computable by elementary operations: any piecewise polynomial function (in particular, the ReLU and the parametric ReLU), the exponential linear unit, the softsign (since the absolute value can be computed using a case distinction), the sigmoid, and the tanh. Thus, the preceding proposition applies to each of these activation functions.

D.1. Proof of Theorem 3.1
The proof of Theorem 3.1 is crucially based on the following lemma:

Lemma D.1 Let μ be a finite Borel measure on [−B, B] d with uncountable support suppμ. For x
In this step, we show that there is some This follows from a result in [68], where the following is shown: For x * ∈ R d and v ∈ S d−1 , as well as δ, η > 0, write and C(x * , v; δ, η) := x * + C(v; δ, η).
Then, for each uncountable set E ⊂ R d and for all but countably many x * ∈ E, there is some v ∈ S d−1 such that E ∩ C(x * , v; δ, η) and E ∩ C(x * , −v; δ, η) are both uncountable for all δ, η > 0. Now, if η < 1, then any v). From this it is easy to see that if x * , v are as provided by the result in [68] We remark that strictly speaking, the proof in [68] is only provided for E ⊂ R 3 , but the proof extends almost verbatim to R d . A direct proof of the existence of x * , v can be found in [58].
Step 2 Hence, μ(U n,± ) > 0. Since f = g μ-almost everywhere, there exist x n,± ∈ U n,± with f (x n,± ) = g(x n,± ). This implies g(x n,+ ) = c and g(x n,− ) = c . But since x n,± ∈ B 1/n (x * ), we have x n,± → x * , so that the continuity of g implies g(x * ) = lim n g(x n,+ ) = c and g(x * ) = lim n g(x n,− ) = c , in contradiction to c = c .  RN N ((d, N 1 , . . . , N L−1 , 1)) ⊂ RN N (S) such that the sequence converges (in L p (μ)) to a bounded, discontinuous limit f ∈ L ∞ (μ), meaning that f does not have a continuous representative, even after possibly changing it on a μ-nullset. Since RN N For the construction of the sequence, let x * ∈ suppμ and v ∈ S d−1 as provided by Lemma D.1. Extend v to an orthonormal basis (v, w 1 , . . . , w d−1 ) of R d , and define Next, using Proposition B.3, choose a neural network ∈ N N ((d, 1, . . . , 1)) with L − 1 layers such that Let J 0 :=R ( ). Since J 0 (0) = 0 and ∂ J 0 ∂ x 1 (0) = 1, we see directly from the definition of the partial derivative that for each δ ∈ (0, B ), there are x δ ∈ (−δ, 0) and y δ ∈ (0, δ) such that J 0 (x δ , 0, . . . , 0) < J 0 (0) = 0 and J 0 (y δ , 0, . . . , 0) > J 0 (0) = 0. Furthermore, Properties (3) and (4) from above show that J 0 (x) only depends on x 1 and that t → J 0 (t, 0, . . . , 0) is increasing. In combination, these observations imply that and note that ∈ N N ((d, 1, . . . , 1)) with L − 1 layers, and Combining this with the definition of A and with Eq. (D.2), and noting that A(x − x * ) ∈ for x ∈ , we see that J := R ( ) We now distinguish the cases given in Assumption (iv)(a) and (b) of Theorem 3.1. Case 1 is unbounded, so that necessarily Assumption (iv)(a) of Theorem 3.1 holds, and N L−1 = 2. For n ∈ N let n = (A n N N ((1, 2, 1)) be given by N 1 , . . . , N L−1 , 1)). Now, let us define Then, since h n is continuous and hence bounded on the compact set , we see that h n ∈ L p (μ) for every n ∈ N and all p ∈ (0, ∞]. We now show that (h n ) n∈N converges to a discontinuous limit. To see this, first consider x ∈ ∩ H + (x * , v). Since J (x) > 0 by (D.3), there exists some N x ∈ N such that for all n ≥ N x , the estimate n J (x) − 1 > r holds, where r > 0 is as in Assumption (iii) of Theorem 3.1. Hence, by the mean value theorem, there exists some since ξ x n → ∞ as n → ∞, n ≥ N x . Analogously, it follows for x ∈ ∩ H − (x * , v) that lim n→∞ h n (x) = λ . Hence, setting γ := (0) − (−1), we see for each x ∈ that We now claim that there is some To see this, note because of (x) → λ as x → ∞ and because of By what was shown in the preceding paragraph, we get |h n | ≤ M and hence also |h| ≤ M for all n ∈ N. Hence, by the dominated convergence theorem, we see for any p ∈ (0, ∞) that lim n→∞ h n − h L p (μ) = 0. But since λ = λ , Lemma D.1 shows that h doesn't have a continuous representative, even after changing it on a μ-null-set.
This yields the required non-continuity of a limit point as discussed at the beginning of the proof.
Case 2: ϱ is bounded, so that N_{L−1} = 1. Since ϱ is monotonically increasing, there exist c, c′ ∈ R such that lim_{x→∞} ϱ(x) = c and lim_{x→−∞} ϱ(x) = c′. By the monotonicity and since ϱ is not constant (because of ϱ′(x_0) ≠ 0), we have c > c′. For each n ∈ N, we now consider the neural network obtained by combining a suitable network in NN((1, 1, 1)) with the network realizing J; the result has architecture (d, N_1, . . . , N_{L−1}, 1), and we denote it by Φ_n. Now, let us define h_n := R(Φ_n) and note h_n(x) = ϱ(nJ(x)) for all x ∈ Ω.
Since each of the h_n is continuous and Ω is compact, we again have h_n ∈ L^p(μ) for all p ∈ (0, ∞]. By the boundedness of ϱ, we get |h_n(x)| ≤ C for all n ∈ N and x ∈ Ω and a suitable C > 0, so that also h is bounded. Together with the dominated convergence theorem, this implies for any p ∈ (0, ∞) that lim_{n→∞} ‖h_n − h‖_{L^p(μ)} = 0. Since c ≠ c′, Lemma D.1 shows that h does not have a continuous representative (with respect to equality μ-almost everywhere). This yields the required non-continuity of a limit point as discussed at the beginning of the proof.
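A minimal one-dimensional illustration of the mechanism in Case 2 (this example is not part of the proof; the logistic sigmoid s, the interval [−1, 1], and Lebesgue measure are chosen only for concreteness):

% Fixed architecture (1, 1, 1); realizations h_n(x) = s(n x) with the
% bounded, increasing logistic sigmoid s(t) = 1 / (1 + e^{-t}).
\[
  h_n(x) = s(nx) \xrightarrow[n \to \infty]{} h(x) :=
  \begin{cases}
    1, & x > 0,\\
    \tfrac12, & x = 0,\\
    0, & x < 0,
  \end{cases}
  \qquad x \in [-1, 1],
\]
% pointwise, with |h_n| <= 1, so dominated convergence gives
% ||h_n - h||_{L^p([-1,1])} -> 0 for every p in (0, infinity), although
% h has no continuous representative. Hence this set of realizations of a
% fixed architecture is not closed in L^p([-1,1]).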

D.2. Proof of Corollary 3.2
It is not hard to verify that all functions listed in Table 1 are continuous and increasing. Furthermore, each activation function ϱ listed in Table 1 is not constant and satisfies the remaining conditions. This shows that ϱ|_{(−∞,−r)∪(r,∞)} is differentiable for any r > 0, and that there is some x_0 = x_0(ϱ) ∈ R such that ϱ′(x_0) ≠ 0.

For the exponential linear unit
the quotient rule shows that for x < 0 we have
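For reference, one common normalization of the exponential linear unit and its derivative on the negative half-line reads as follows (the normalization used in Table 1 may differ; the symbol ϱ_ELU is introduced only for this display):

% Exponential linear unit (ELU), one common normalization.
\[
  \varrho_{\mathrm{ELU}}(x) =
  \begin{cases}
    x, & x \ge 0,\\
    e^{x} - 1, & x < 0,
  \end{cases}
  \qquad
  \varrho_{\mathrm{ELU}}'(x) = e^{x} \quad \text{for } x < 0 .
\]
% In particular, this function is continuous, increasing, and differentiable
% away from the origin, consistent with the properties verified above.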

D.3.1. Proof of Theorem 3.3 Under Condition (i)
We  N N ((1, 2, 1)) be given by Note that there is some x * ∈ R such that (x * ) = 0, since otherwise ≡ 0 and hence ∈ C ∞ (R). Thus, for each n ∈ N, Proposition B.3 yields the existence of a neural network 2 n ∈ N N ((d, 1, . . . , 1)) with L − 1 layers such that We set n := 1 n 2 n ∈ N N (S ) and f n :=R ( n ). For x ∈ , we then have Now, by the Lipschitz continuity of (λ·) and Eq. (D.7), we conclude that This implies for every x ∈ that Here, the last step used that |ξ x n − x 1 | ≤ n −1 ≤ 1, so that Since f n → λ uniformly, where λ / ∈ C m ( ), and hence λ / ∈ RN N (S), we thus see that RN N (S) is not closed in C( ).

D.3.2. Proof of Theorem 3.3 Under Condition (ii)
Since ϱ is not constant, there is some x^* ∈ R such that ϱ′(x^*) ≠ 0. For n ∈ N, let us define Φ_n^1 as above. With this choice, we have the corresponding representation. For any x ∈ R, the mean-value theorem yields a point between x^* and x^* + x/n satisfying the stated identity. Since ϱ′ is continuous at x^*, we conclude the desired convergence. Next, for each n ∈ N, Proposition B.3 yields a neural network Φ_n^2 ∈ NN((d, 1, . . . , 1)) with L − 1 layers with the stated properties. We set Φ_n := Φ_n^1 ⊙ Φ_n^2 ∈ NN(S) and note for all x ∈ Ω the corresponding identity. By the Lipschitz continuity of R_R(Φ_n^1) on [−(B + 1), B + 1], and using (D.9), we conclude an estimate so that an application of (D.8) yields the claimed uniform convergence. It remains to show that F|_Ω is not an element of RNN(S). This is accomplished once we show that there do not exist any N_1, . . . , N_{L−1} ∈ N such that F|_Ω is an element of RNN((d, N_1, . . . , N_{L−1}, 1)). Toward a contradiction, we assume that there exist N_1, . . . , N_{L−1} ∈ N such that F|_Ω = R(Φ^3) for a network Φ^3 ∈ NN((d, N_1, . . . , N_{L−1}, 1)). Since F and R_{R^d}(Φ^3) are both analytic functions that coincide on Ω = [−B, B]^d, they must be equal on all of R^d. However, F is unbounded (since ϱ′(x^*) ≠ 0), while R_{R^d}(Φ^3) is bounded as a consequence of ϱ being bounded. This produces the desired contradiction.

D.3.3. Proof of Theorem 3.3 Under Condition (iii)
Let ϱ ∈ C^{max{r,q}}(R) be approximately homogeneous of order (r, q) with r ≠ q. For simplicity, let us assume that r > q; we will briefly comment on the case q > r at the end of the proof.
Note that r ≥ 1, since r, q ∈ N_0 with r > q. Let (x)_+ := max{x, 0} for x ∈ R. We start by showing the estimate that, overall, implies (D.10). We observe that x ↦ (x)_+^r fails to be smooth at the origin. Hence, the proof is complete if we can construct a sequence (Φ_n)_{n∈N} of neural networks in NN((d, 1, . . . , 1)) (with L layers) such that the ϱ-realizations R(Φ_n) converge uniformly to the corresponding limit function. By the preceding considerations, this is clearly possible, as can be seen by the same arguments used in the proofs of the previous results. For invoking these arguments, note that max{r, q} ≥ 1, so that ϱ ∈ C^1(R). Also, since ϱ is approximately homogeneous of order (r, q) with r ≠ q, ϱ cannot be constant, and hence ϱ′(x_0) ≠ 0 for some x_0 ∈ R. For completeness, let us briefly consider the case q > r that was omitted at the beginning of the proof. In this case, one argues analogously; here, we use that q − r ≥ 1, since r, q ∈ N_0 with q > r. The proof then proceeds as before, with the analogous limit function taking the role of (x)_+^r.

D.5. Proof of Proposition 3.5
The set under consideration is closed and bounded in the normed space (NN(S), ‖·‖_{NN(S)}). Thus, the Heine-Borel theorem implies its compactness. By Proposition 4.1 (which will be proved independently), the map R : (NN(S), ‖·‖_{NN(S)}) → (C(Ω), ‖·‖_sup) is continuous. As a consequence, the image of this set under R is compact in C(Ω). Because of the compactness of Ω, C(Ω) is continuously embedded into L^p(μ) for every p ∈ (0, ∞) and any finite Borel measure μ on Ω. This implies that the image is compact in L^p(μ) as well.
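The continuous embedding used in the last step can be made quantitative; the following one-line estimate (recorded here only as a reminder, for a finite Borel measure μ on the compact set Ω) is all that is needed:

% For p in (0, infinity) and a finite Borel measure mu on Omega:
\[
  \lVert f \rVert_{L^p(\mu)}
  = \Bigl( \int_\Omega |f|^p \, d\mu \Bigr)^{1/p}
  \le \mu(\Omega)^{1/p} \, \lVert f \rVert_{\sup}
  \qquad \text{for all } f \in C(\Omega),
\]
% so uniform convergence on Omega implies convergence in L^p(mu),
% and compact subsets of C(Omega) remain compact in L^p(mu).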

D.6. Proof of Proposition 3.6
With (Φ_N)_{N∈N} as in the statement of Proposition 3.6, we want to show that ‖Φ_N‖_total → ∞ in probability. By definition, this means that for each fixed C > 0, the probability of the event that ‖Φ_N‖_total ≥ C tends to 1 as N → ∞. For brevity, for Z > 0 as in Proposition 3.5, let us write R_Z for the associated (compact) set of realizations.
By compactness of R C , we can choose g ∈ R C satisfying Define M := inf h∈RN N (S) f σ − h 2 L 2 (σ ) . Since by assumption the infimum defining M is not attained, we have f σ − g 2 L 2 (σ ) > M, so that there are For N ∈ N and ε > 0, let us denote by (1) N ,ε the event where [20]Theorem B shows for arbitrary ε > 0 that P( We now claim that c N ⊂ N ,δ/3 . Once we prove this, we get N ,δ/3 , assume toward a contradiction that there exists a training sample ω: Using the decomposition of the expected loss from Eq. (3.2), we thus see By rearranging and recalling the choice of h and δ, we finally see which is the desired contradiction.

D.7. Proof of Proposition 3.7
The main ingredient of the proof will be to show that one can replace a given sequence of networks with C-bounded scaling weights by another sequence with C-bounded scaling weights that also has bounded biases. Then one can apply Proposition 3.5.

Lemma D.2 Let S = (d, N_1, . . . , N_L) be a neural network architecture, let C > 0, and let Ω ⊂ R^d be measurable and bounded. Let μ be a finite Borel measure on Ω with μ(Ω) > 0. Finally, let ϱ : R → R, x ↦ max{0, x} denote the ReLU activation function.

Let (Φ_n)_{n∈N} be a sequence of networks in NN(S) with C-bounded scaling weights and such that there exists some M > 0 with ‖R(Φ_n)‖_{L^1(μ)} ≤ M for all n ∈ N.
Then there is an infinite set I ⊂ N and a family of networks (Ψ_n)_{n∈I} ⊂ NN(S) with C-bounded scaling weights which satisfies R(Ψ_n) = R(Φ_n) for n ∈ I and such that ‖Ψ_n‖_total ≤ C′ for all n ∈ I and a suitable constant C′ > 0.
Proof Set N 0 := d. Since is bounded, there is some R > 0 with x ∞ ≤ R for all x ∈ . In the following, we will use without further comment the estimate Since is bounded, and since | (x)| ≤ |x| for all x ∈ R, there is thus a constant C L−1 > 0 such that if we set then β (n) (x) ∞ ≤ C L−1 for all x ∈ and all n ∈ I . For arbitrary i ∈ {1, . . . , N L } and x ∈ , this implies Since by assumption R ( n ) L 1 (μ) ≤ M and μ( ) > 0, we see that c For brevity, set T (n) : and L :=id R N L , and let := × · · · × denote the N -fold Cartesian product of for ∈ {1, . . . , L − 1}. Furthermore, let us define β n : Additionally, observe for n ∈ I m , ∈ {1, . . . , m} and x ∈ R N −1 that Combining these observations, and recalling that is bounded, we easily see that there is some R > 0 with β n (x) ∞ ≤ R for all x ∈ and n ∈ I m . Next, since c can find (by compactness) an infinite subset I m . Our goal is to construct vectors d (n) , e (n) ∈ R N m+1 , matrices C (n) ∈ R N m+1 ×N m , and an infinite subset I m+1 ⊂ I (0) m such that C (n) max ≤ C for all n ∈ I m+1 , such that d (n) n∈I m+1 is a bounded family, and such that we have for all n ∈ I m+1 .
Thus, it remains to construct d (n) , e (n) , C (n) for n ∈ I m+1 (and the set I m+1 itself) as described around Eq. (D.12). To this end, for n ∈ I , and e (n) as well as To see that these choices indeed fulfill the conditions outlined around Eq. (D.12) for a suitable choice of I m+1 ⊂ I for all k ∈ N m+1 and all x ∈ R N m with x ∞ ≤ R . As a final preparation, note that m+1 = × · · · × is a Cartesian product of ReLU functions, since m ≤ L − 2. Now, for k ∈ {1, . . . , N m+1 } there are three cases: where the last step used our choice of d (n) , e (n) , C (n) , and the fact that where the last step used our choice of d (n) , e (n) , C (n) .
Case 3: We have (c m+1 ) k ∈ R. In this case, set n k :=1, and note by our choice of d (n) , e (n) , C (n) for n ∈ I (0) Overall, we have thus shown that Eq. (D.12) is satisfied for all n ∈ I m+1 , where is clearly an infinite set, since I (0) m is.
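The mechanism behind this case distinction can be illustrated for a single ReLU neuron (this sketch is only an illustration, not part of the proof of the lemma; the input bound R′ and the weight w are hypothetical names, while C is the bound on the scaling weights):

% One ReLU neuron on a bounded input range |t| <= R' with |w| <= C.
\[
  \varrho(wt + b) =
  \begin{cases}
    wt + b, & \text{if } b \ge C R' \quad (\text{since } wt + b \ge b - CR' \ge 0),\\
    0, & \text{if } b \le -C R' \quad (\text{since } wt + b \le b + CR' \le 0).
  \end{cases}
\]
% If the bias diverges to +infinity, the neuron is eventually affine on the
% bounded input range, so the bias can be replaced by CR' and the constant
% difference absorbed into the next layer (weighted by that layer's
% corresponding entry); if it diverges to -infinity, the neuron eventually
% outputs 0, so the bias can be replaced by -CR'. Finite limiting biases are
% simply kept.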
Using Lemma D.2, we can now easily show that the set RNN_C(S) is closed in L^p(μ; R^{N_L}) and in C(Ω; R^{N_L}): Let Y denote either L^p(μ; R^{N_L}) for some p ∈ [1, ∞] and some finite Borel measure μ on Ω, or C(Ω; R^{N_L}), where we assume in the latter case that Ω is compact and set μ = δ_{x_0} for a fixed x_0 ∈ Ω. Note that we can assume μ(Ω) > 0, since otherwise the claim is trivial. Let (f_n)_{n∈N} be a sequence in RNN_C(S) which satisfies f_n → f for some f ∈ Y, with convergence in Y. Thus, f_n = R(Φ_n) for a suitable sequence (Φ_n)_{n∈N} in NN(S) with C-bounded scaling weights.
Since (f_n)_{n∈N} = (R(Φ_n))_{n∈N} is convergent in Y, it is also bounded in Y. But since Ω is bounded and μ is a finite measure, it is not hard to see that Y ↪ L^1(μ), so that we get ‖R(Φ_n)‖_{L^1(μ)} ≤ M for all n ∈ N and a suitable constant M > 0.
Therefore, Lemma D.2 yields an infinite set I ⊂ N and networks (Ψ_n)_{n∈I} ⊂ NN(S) with C-bounded scaling weights such that f_n = R(Ψ_n) and ‖Ψ_n‖_total ≤ C′ for all n ∈ I and a suitable C′ > 0.
Hence, (Ψ_n)_{n∈I} is a bounded, infinite family in the finite-dimensional vector space NN(S). Thus, there is a further infinite set I_1 ⊂ I such that Ψ_n → Ψ ∈ NN(S) as n → ∞ within I_1.
But since Ω is bounded, say Ω ⊂ [−R, R]^d, the realization map is continuous (even locally Lipschitz continuous); see Proposition 4.1, which will be proved independently. Hence, f = lim_{I_1 ∋ n → ∞} R(Ψ_n) = R(Ψ), and since Ψ again has C-bounded scaling weights, this shows f ∈ RNN_C(S), as desired.

D.8. Proof of Theorem 3.8
For the proof of Theorem 3.8, we will use a careful analysis of the singularity hyperplanes of functions of the form x ↦ ϱ_a(⟨α, x⟩ + β), that is, of the hyperplane on which such a function is not differentiable. To simplify this analysis, we first introduce convenient terminology and collect quite a few auxiliary results.
Furthermore, for a ≥ 0 and with ϱ_a : R → R, x ↦ max{x, ax} denoting the parametric ReLU, we set h^{(a)}_{α,β} : R^d → R, x ↦ ϱ_a(⟨α, x⟩ + β) for α ∈ R^d and β ∈ R.

Proof By discarding those (α_j, β_j) for which x_0 ∉ S_{α_j,β_j}, we can assume that x_0 ∈ S_{α_j,β_j} for all j ∈ {1, . . . , N}.
Assume toward a contradiction that the claim of the lemma is false; that is, that (D.14) holds, where α^⊥ := {z ∈ R^d : ⟨z, α⟩ = 0}. Since α^⊥ is a closed subset of R^d and thus a complete metric space, and since the right-hand side of (D.14) is a countable (in fact, finite) union of closed sets, the Baire category theorem (see [26, Theorem 5.9]) shows that there are j ∈ {1, . . . , N} and ε > 0 such that the corresponding set contains a relatively open ball of α^⊥. But since V is a vector space, this easily implies V = α^⊥; that is, ⟨z, α_j⟩ = 0 for all z ∈ α^⊥. In other words, α^⊥ ⊂ α_j^⊥, and then α^⊥ = α_j^⊥ by a dimension argument, since α, α_j ≠ 0.
We claim that there is some ε > 0 with the stated property for the sets S_{α_j,β_j}. To see this, note for j ∈ N \ J that x_0 ∈ S_{α_j,β_j}, and hence, since ⟨z, α_j⟩ = 0 for all j ∈ N \ J by choice of z, the corresponding contributions do not depend on t. By combining all our observations, we see that the resulting identity holds for all t ∈ R. This easily shows that f is not differentiable at t = 0, since the right-sided derivative is 1, while the left-sided derivative is a ≠ 1. This is the desired contradiction.
By Lemma D.4 there is some ε > 0 such that there exists where the right-hand side is differentiable at x 0 , since each summand is easily seen to be differentiable on the open set V , with x 0 ∈ V ∩ U .

Proof of Theorem 3.8
Let (Φ_n)_{n∈N} be a sequence of networks with architecture (d, N_0, 1) such that f_n := R_a(Φ_n) converges uniformly to some f ∈ C(Ω). Our goal is to prove f ∈ RNN_a((d, N_0, 1)). The proof of this is divided into seven steps.
Step 1 (Normalizing the rows of the first layer): Our first goal is to normalize the rows of the matrices A^n_1; that is, we want to change the parametrization of the network such that ‖(A^n_1)_{i,−}‖_2 = 1 for all i ∈ {1, . . . , N_0}. To see that this is possible, consider arbitrary A ∈ R^{M_1×M_2} \ {0} and b ∈ R^{M_1}; then we obtain, by the positive homogeneity of ϱ_a, for all C > 0 the corresponding rescaling identity. This identity shows that for each n ∈ N we can find a network with the same ϱ_a-realization f_n whose first-layer rows are normalized; replacing Φ_n by this network, we may and will assume that ‖(A^n_1)_{i,−}‖_2 = 1 for all i ∈ {1, . . . , N_0} and R_a(Φ_n) = f_n for all n ∈ N.
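The positive homogeneity used for this normalization is elementary; spelled out for a single neuron with row α ≠ 0 (this display only records the identity):

% Positive homogeneity of the parametric ReLU: for C > 0,
%   varrho_a(Ct) = max{Ct, aCt} = C max{t, at} = C varrho_a(t).
\[
  \varrho_a\bigl(\langle \alpha, x \rangle + \beta\bigr)
  = \lVert \alpha \rVert_2 \,
    \varrho_a\!\Bigl(\Bigl\langle \tfrac{\alpha}{\lVert \alpha \rVert_2},\, x \Bigr\rangle
    + \tfrac{\beta}{\lVert \alpha \rVert_2}\Bigr).
\]
% The factor ||alpha||_2 can then be absorbed into the corresponding entry
% of the second-layer matrix, leaving the realization unchanged.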
Step 2 (Extracting a partially convergent subsequence): By the Theorem of Bolzano-Weierstraß, there is a common subsequence of (A n 1 ) n∈N and (b n 1 ) n∈N , denoted by (A n k 1 ) k∈N and (b n k 1 ) k∈N , converging to A 1 ∈ R N 0 ×d and b 1 ∈ [−∞, ∞] N 0 , respectively.
For j ∈ {1, . . . , N_0}, let a_{k,j} ∈ R^d denote the j-th row of A^{n_k}_1, and let a_j ∈ R^d denote the j-th row of A_1. Note that ‖a_{k,j}‖_2 = ‖a_j‖_2 = 1 for all j ∈ {1, . . . , N_0} and k ∈ N. Next, let the index set J ⊂ {1, . . . , N_0} be defined accordingly, where Ω° = (−B, B)^d denotes the interior of Ω. Additionally, let J^c := {1, . . . , N_0} \ J, and for j, ℓ ∈ J^c write j ≈ ℓ iff (a_j, (b_1)_j) ∼ (a_ℓ, (b_1)_ℓ), with the relation ∼ introduced in Definition D.3. Note that this makes sense, since (b_1)_j ∈ R if j ∈ J^c. Clearly, ≈ is an equivalence relation on J^c. Let (J_i)_{i=1,...,r} denote the equivalence classes of ≈. For each i ∈ {1, . . . , r}, choose α^{(i)} ∈ S^{d−1} and β^{(i)} ∈ R such that for each j ∈ J_i there is a (unique) σ_j ∈ {±1} with (a_j, (b_1)_j) = σ_j · (α^{(i)}, β^{(i)}).
Step 3 (Handling the case of distinct singularity hyperplanes): Note that r ≤ |J^c| ≤ N_0. Before we continue with the general case, let us consider the special case where equality occurs, that is, where r = N_0. This means that J = ∅ (and hence (b_1)_j ∈ R and Ω° ∩ S_{a_j,(b_1)_j} ≠ ∅ for all j ∈ {1, . . . , N_0}), and that each equivalence class J_i has precisely one element; that is, (a_j, (b_1)_j) is not equivalent to (a_ℓ, (b_1)_ℓ) for j, ℓ ∈ {1, . . . , N_0} with j ≠ ℓ.
Therefore, Lemma D.6 shows that the functions (h_j|_{Ω°})_{j=1,...,N_0+1}, where we define h_j := h^{(a)}_{a_j,(b_1)_j}|_Ω for j ∈ {1, . . . , N_0} and h_{N_0+1} : Ω → R, x ↦ 1, are linearly independent. In particular, these functions are linearly independent when considered on all of Ω. Thus, we can define a norm ‖·‖_* on R^{N_0+1} by virtue of the associated linear combinations. Since all norms on the finite-dimensional vector space R^{N_0+1} are equivalent, there is some τ > 0 with ‖c‖_* ≥ τ · ‖c‖_1 for all c ∈ R^{N_0+1}. Now, recall that a_{k,j} → a_j and b^{n_k}_1 → b_1 ∈ R^{N_0} as k → ∞. Since Ω is bounded, this implies, for arbitrary j ∈ {1, . . . , N_0}, that h^{(k)}_j → h_j uniformly on Ω. Since f_{n_k} = R_a(Φ_{n_k}) = b^{n_k}_2 + Σ_{j=1}^{N_0} (A^{n_k}_2)_{1,j} h^{(k)}_j converges uniformly on Ω, we thus see that the sequence consisting of (A^{n_k}_2, b^{n_k}_2) ∈ R^{1×N_0} × R ≅ R^{N_0+1} is bounded. Thus, there is a further subsequence (n_{k_ℓ})_{ℓ∈N} such that A^{n_{k_ℓ}}_2 → A_2 ∈ R^{1×N_0} and b^{n_{k_ℓ}}_2 → b_2 ∈ R as ℓ → ∞. But this implies, as desired, that f ∈ RNN_a((d, N_0, 1)).
Step 4 (Showing that the j-th neuron is eventually affine-linear, for j ∈ J ): Since Step 3 shows that the claim holds in case of r = N 0 , we will from now on consider only the case where r < N 0 .
For j ∈ J , there are two cases: In case of (b 1 ) j ∈ [0, ∞], define Next, for arbitrary 0 < δ < B, we define δ : for all i ∈ r and all 0 < δ ≤ δ 0 . For the remainder of this step, we will consider a fixed δ ∈ (0, δ 0 ], and we claim that there is some N 2 = N 2 (δ) > 0 such that for all j ∈ J , k ≥ N 2 and x ∈ δ , where sign x = 1 if x > 0, sign x = −1 if x < 0, and sign 0 = 0. Note that once this is shown, it is not hard to see that there is some To prove Eq. (D.17), we distinguish two cases for each j ∈ J ; by definition of J , these are the only two possible cases: Case 1: We have (b 1 ) j ∈ {±∞}. In this case, the first part of Eq. (D.17) is trivially satisfied. To prove the second part, note that because of (b n k for all x ∈ = [−B, B] d and k ≥ k j . Now, since the function x → a k, j , x +(b n k 1 ) j is continuous, since is connected (in fact convex), and since 0 ∈ , this implies sign( a k, j , x + (b n k 1 ) j ) = sign(b n k 1 ) j for all x ∈ and k ≥ k j . Case 2: We have (b 1 ) j ∈ R, but S a j ,(b 1 ) j ∩ • = ∅, and hence S a j ,(b 1 ) j ∩ δ = ∅.
To prove the second part, note that because of a k, j → a j and (b n k 1 ) j → (b 1 ) j as k → ∞, there is some k j = k j (ε j,δ ) = k j (δ) ∈ N such that a k, j − a j 2 ≤ ε j,δ /(4d B) and |(b n k 1 ) j − (b 1 ) j | ≤ ε j,δ /4 for all k ≥ k j . Therefore, With the same argument as at the end of Case 1, we thus see sign( a k, j , x +(b n k 1 ) j ) = sign(b n k 1 ) j for all x ∈ δ and k ≥ k j (δ). Together, the two cases prove that Eq. (D.17) holds for N 2 (δ):= max j∈J k j (δ).
Step 6 (Proving the "almost convergence" of the sum of all j-th neurons for j ∈ J i ): In combination with Eq. (D.18), we see r +1 : R d → R being affine-linear. Recall from Step 4 that • δ 0 ∩ S α (i) ,β (i) = ∅ for all i ∈ r , by choice of δ 0 . Therefore, Lemma D.4 shows (because of U Let us fix some x i ∈ K i and some r i > 0 such that • ; this is possible, since the set on the right-hand side contains ( ) . Therefore, as a consequence of the preceding step, we see that there is some N (i) 5 ∈ N such that g (k) is affine-linear on B r i (x i ) for all ∈ r \ {i} and all k ≥ N (i) 5 . Thus, setting N 5 := max{N 3 (δ 0 ), max i=1,...,r N (i) 5 }, we see as a consequence of Eq. (D. 19) and because of B r i (x i ) ⊂ • δ 0 that for each i ∈ r and any k ≥ N 5 , there is an affine-linear map q Next, note that Step 5 implies for arbitrary ε > 0 that for all k large enough (depending on ε), g (i) and continuous on ⊃ B r i (x i ), and we have x i ∈ B r i (x i ) ∩ S α (i) ,β (i) = ∅. Thus, Lemma D.8 shows that there are c i ∈ R, ζ i ∈ R d , and κ i ∈ R such that (D.21) We now intend to make use of the following elementary fact: If (ψ k ) k∈N is a sequence of maps ψ k : R d → R, if ⊂ R d is such that each ψ k is affine-linear on , and if U ⊂ is a nonempty open subset such that ψ(x):= lim k→∞ ψ k (x) ∈ R exists for all x ∈ U , then ψ can be uniquely extended to an affine-linear map ψ : R d → R, and we have ψ k (x) → ψ(x) for all x ∈ , even with locally uniform convergence. Essentially, what is used here is that the vector space of affine-linear maps R d → R is finite-dimensional, so that the (Hausdorff) topology of pointwise convergence on U coincides with that of locally uniform convergence on ; see [61,Theorem 1.21].
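The elementary fact just invoked can also be seen by a short linear-algebra argument; the points z_0, . . . , z_d and the coefficients u_k, t_k below are hypothetical names used only for this sketch:

% Write psi_k(x) = <u_k, x> + t_k and pick affinely independent points
% z_0, ..., z_d in the open set U. The linear system
\[
  \psi_k(z_i) = \langle u_k, z_i \rangle + t_k , \qquad i = 0, \dots, d,
\]
% has a coefficient matrix depending only on z_0, ..., z_d, and this matrix
% is invertible by affine independence. Convergence of the values psi_k(z_i)
% therefore forces (u_k, t_k) -> (u, t), so that psi_k converges to
% psi = <u, .> + t pointwise on R^d and uniformly on every bounded set,
% in particular locally uniformly.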
To use this observation, note that Eqs. (D.20) and (D.21) show that g^{(k)}_i converges pointwise to G_i on B_{r_i}(x_i). Furthermore, since x_i ∈ S_{α^{(i)},β^{(i)}}, it is not hard to see that there is some ε_0 > 0 for which the sets U^{(ε,±)}_{α^{(i)},β^{(i)}} have the required properties. But the latter set is dense in Ω (since its complement is a null-set), and f and ψ + Σ_{i=1}^r G_i are continuous on Ω. Hence, f = ψ + Σ_{i=1}^r G_i on all of Ω. Recalling from Steps 3 and 4 that r < N_0, this implies f ∈ RNN_a((d, r + 1, 1)) ⊂ RNN_a((d, N_0, 1)), as claimed. Here, we implicitly used that realizations of the smaller architecture are also realizations of the larger one (see Lemma B.1).

Appendix E: Proofs of the Results in Sect. 4

E.1. Proof of Proposition 4.1
Step 1: We first show that if ( f n ) n∈N and (g n ) n∈N are sequences of continuous functions f n : R d → R N and g n : R N → R D that satisfy f n → f and g n → g with locally uniform convergence, then also g n • f n → g • f locally uniformly. To see this, let R, ε > 0 be arbitrary. On B R (0) ⊂ R d , we then have f n → f uniformly. In particular, C:= sup n∈N sup |x|≤R | f n (x)| < ∞; here, we implicitly used that f and all f n are continuous, and hence bounded on B R (0). But on B C (0) ⊂ R N , we have g n → g uniformly, so that there is some n 1 ∈ N with |g n (y) − g(y)| < ε for all n ≥ n 1 and all y ∈ R N with |y| ≤ C. Furthermore, g is uniformly continuous on B C (0), so that there is some δ > 0 with |g(y) − g(z)| < ε for all y, z ∈ B C (0) with |y − z| ≤ δ. Finally, by the uniform convergence of f n → f on B R (0), we get some n 2 ∈ N with | f n (x) − f (x)| ≤ δ for all n ≥ n 2 and all x ∈ R d with |x| ≤ R.
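For n ≥ max{n_1, n_2} and |x| ≤ R, the choices just made combine via the triangle inequality; the following display only spells out this conclusion of Step 1:

% For x in B_R(0) and n >= max{n_1, n_2}:
\[
  |g_n(f_n(x)) - g(f(x))|
  \le |g_n(f_n(x)) - g(f_n(x))| + |g(f_n(x)) - g(f(x))|
  < \varepsilon + \varepsilon ,
\]
% using |f_n(x)| <= C and n >= n_1 for the first term, and
% |f_n(x) - f(x)| <= delta together with the uniform continuity of g on
% B_C(0) for the second term. Since R > 0 and epsilon > 0 were arbitrary,
% g_n o f_n -> g o f locally uniformly.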
Step 2: We show that R is continuous. Assume that a sequence of neural networks (Φ_n)_{n∈N} ⊂ NN((d, N_1, . . . , N_L)) converges to some Φ ∈ NN((d, N_1, . . . , N_L)). For ℓ ∈ {1, . . . , L − 1}, define the layer maps α^{(n)}_ℓ and α_ℓ, where ϱ^{×N_ℓ} := ϱ × · · · × ϱ denotes the N_ℓ-fold Cartesian product of ϱ. Likewise, define the affine output maps α^{(n)}_L and α_L. By what was shown in Step 1, it is not hard to see for every ℓ ∈ {1, . . . , L} that α^{(n)}_ℓ → α_ℓ locally uniformly as n → ∞. By another (inductive) application of Step 1, this shows R(Φ_n) = α^{(n)}_L ∘ · · · ∘ α^{(n)}_1 → α_L ∘ · · · ∘ α_1 = R(Φ) with locally uniform convergence. Since Ω is compact, this implies uniform convergence on Ω, and thus completes the proof of the first claim.
Step 4: Let ϱ be Lipschitz with Lipschitz constant M, where we assume without loss of generality that M ≥ 1. With the functions from the preceding step, it is not hard to see that each ϱ^{×N_ℓ} is M-Lipschitz, where we use the ‖·‖_∞-norm on R^{N_ℓ}. Let Φ = ((A_1, b_1), . . . , (A_L, b_L)) ∈ NN(S), and let α_1, . . . , α_L denote the associated layer maps; each α_ℓ is Lipschitz with constant M · N_{ℓ−1} · ‖Φ‖_scaling. Thus, we finally see that R(Φ) = α_L ∘ · · · ∘ α_1 is Lipschitz with Lipschitz constant M^L · N_0 · · · N_{L−1} · ‖Φ‖^L_scaling. This proves the final claim of the proposition when choosing the ℓ^∞-norm on R^d and R^{N_L}. Of course, another norm than the ℓ^∞-norm can be chosen, at the cost of possibly enlarging the constant C in the statement of the proposition.
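The per-layer bound entering this product can be verified in one line (recorded here only for convenience; ‖A_ℓ‖_max denotes the largest absolute entry of A_ℓ, which is at most ‖Φ‖_scaling):

% One hidden layer alpha_l(x) = rho^{x N_l}(A_l x + b_l) in the sup-norm:
\[
  \lVert \alpha_\ell(x) - \alpha_\ell(y) \rVert_\infty
  \le M \, \lVert A_\ell (x - y) \rVert_\infty
  \le M \, N_{\ell-1} \, \lVert A_\ell \rVert_{\max} \, \lVert x - y \rVert_\infty
  \le M \, N_{\ell-1} \, \lVert \Phi \rVert_{\mathrm{scaling}} \, \lVert x - y \rVert_\infty ,
\]
% since each row of A_l has N_{l-1} entries. Multiplying these constants
% over the L layers gives Lip(R(Phi)) <= M^L N_0 ... N_{L-1} ||Phi||_scaling^L;
% the same bound covers the affine output layer because M >= 1.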

E.2. Proof of Theorem 4.2
Step 1: For a > 0, define f_a : R → R, x ↦ ϱ(x + a) − 2ϱ(x) + ϱ(x − a). Our claim in this step is that there is some a > 0 with f_a ≢ const.
Let us assume toward a contradiction that this fails; that is, f_a ≡ c_a for suitable constants c_a ∈ R and all a > 0. Since ϱ is Lipschitz continuous, it is at most of linear growth, so that ϱ is a tempered distribution. We will now make use of the Fourier transform, which we define by f̂(ξ) = ∫_R f(x) e^{−2πixξ} dx for f ∈ L^1(R), as in [26, 31], where it is also explained how the Fourier transform is extended to the space of tempered distributions. Elementary properties of the Fourier transform of tempered distributions (see [31, Proposition 2.3.22]) show c_a · δ_0 = f̂_a = ϱ̂ · g_a with g_a : R → R, ξ ↦ e^{2πiaξ} − 2 + e^{−2πiaξ}.
Next, setting z(ξ) := e^{2πiaξ} ≠ 0, we observe that g_a(ξ) = z(ξ) − 2 + 1/z(ξ) = (z(ξ) − 1)^2 / z(ξ), so that g_a(ξ) ≠ 0 as long as z(ξ) ≠ 1, that is, as long as ξ ∉ a^{−1}Z. Let ϕ ∈ C_c^∞(R) with 0 ∉ supp ϕ be fixed, but arbitrary. This implies supp ϕ ⊂ R \ a^{−1}Z for some sufficiently small a > 0. Since g_a vanishes nowhere on the compact set supp ϕ, it is not hard to see that there is some smooth, compactly supported function h with h · g_a ≡ 1 on the support of ϕ. All in all, we thus get ⟨ϱ̂, ϕ⟩_{S′,S} = ⟨ϱ̂ · g_a, h · ϕ⟩_{S′,S} = ⟨f̂_a, h · ϕ⟩_{S′,S} = c_a · h(0) · ϕ(0) = 0.
Since ϕ ∈ C_c^∞(R) with 0 ∉ supp ϕ was arbitrary, we have shown supp ϱ̂ ⊂ {0}. But by [31, Corollary 2.4.2], this implies that ϱ is a polynomial. Since the only globally Lipschitz continuous polynomials are affine-linear, ϱ must be affine-linear, contradicting the prerequisites of the theorem.
Step 2: In this step, we construct certain continuous functions F_n : R^d → R which satisfy Lip(F_n|_Ω) → ∞ and F_n → 0 uniformly on R^d. We will then use these functions in the next step to construct the desired networks Φ_n.
We first note that each function f_a from Step 1 is bounded. In fact, if ϱ is M-Lipschitz, then |f_a(x)| ≤ 2Ma for all x ∈ R. Next, recall that ϱ is Lipschitz continuous and not affine-linear. Therefore, Lemma C.8 shows that there is some t_0 ∈ R such that ϱ is differentiable at t_0 with ϱ′(t_0) ≠ 0. Therefore, Proposition B.3 shows that there is a neural network in NN((1, . . . , 1)) with L − 1 layers whose realization ψ (over all of R) is differentiable at the origin with ψ(0) = 0 and ψ′(0) = 1. By definition, this means that there is a function δ : R → R such that ψ(x) = x + x · δ(x) and δ(x) → 0 = δ(0) as x → 0. Next, since Ω has nonempty interior, there exist x_0 ∈ R^d and r > 0 with x_0 + [−r, r]^d ⊂ Ω. Let us now choose a > 0 with f_a ≢ const (the existence of such an a > 0 is implied by the previous step), and define F_n : R^d → R, x ↦ ψ(n^{−1} · f_a(n^2 · (x − x_0)_1)).
Since f_a is not constant, there are b, c ∈ R with b < c and f_a(b) ≠ f_a(c). Because of δ(x) → 0 as x → 0, we see that there is some κ > 0 and some n_1 ∈ N with the required estimate for all x ∈ R^d. Thus, with the concatenation operation introduced in Definition B.2, the network Φ^{(1)}_n, obtained by concatenating Φ^{(0)}_n with the network from above, satisfies R(Φ^{(1)}_n) = F_n|_Ω. Furthermore, it is not hard to see that Φ^{(1)}_n has L layers and the architecture (d, 3, 1, . . . , 1). From this and because N_1 ≥ 3, Lemma B.1 yields a network Φ_n with architecture (d, N_1, . . . , N_{L−1}, 1) and R(Φ_n) = F_n|_Ω. By Step 2, this implies R(Φ_n) = F_n|_Ω → 0 uniformly on Ω, as well as Lip(R(Φ_n)) → ∞ as n → ∞.
Step 4: In this step, we establish the final property stated in the theorem. For this, let us assume toward a contradiction that there is a family of networks (Ψ_n)_{n∈N} with architecture S and R(Ψ_n) = R(Φ_n), some C > 0, and a subsequence (Ψ_{n_r})_{r∈N} with ‖Ψ_{n_r}‖_scaling ≤ C for all r ∈ N. In view of the last part of Proposition 4.1, there is a constant C′ = C′(Ω, S) > 0 with Lip(R(Φ_{n_r})) = Lip(R(Ψ_{n_r})) ≤ C′ · ‖Ψ_{n_r}‖^L_scaling ≤ C′ · C^L, in contradiction to Lip(R(Φ_n)) → ∞.

E.3. Proof of Corollary 4.3
Let us denote the range of the realization map by R. By definition (see [44, p. 65]), the realization map is a quotient map onto R if and only if a set U ⊂ R is open in R precisely when its preimage under the realization map is open in NN(S). Clearly, by switching to complements, we can equivalently replace "open" by "closed" everywhere. Now, choose a sequence of neural networks (Φ_n)_{n∈N} as in Theorem 4.2, and set F_n := R_{R^d}(Φ_n). Since Lip(F_n|_Ω) → ∞, we have F_n|_Ω ≢ 0 for all n ≥ n_0 with n_0 ∈ N suitable. Define M := {F_n|_Ω : n ≥ n_0} ⊂ R. Note that M ⊂ R ⊂ C(Ω) is not closed, since F_n|_Ω → 0 uniformly, but 0 ∈ R \ M. Hence, once we show that R^{−1}(M) is closed, we will have shown that the realization map is not a quotient map.
Thus, let (Φ_n)_{n∈N} be a sequence in R^{−1}(M) and assume Φ_n → Φ as n → ∞.
In particular, ‖Φ_n‖_scaling ≤ C for some C > 0 and all n ∈ N. We want to show Φ ∈ R^{−1}(M) as well. Since Φ_n ∈ R^{−1}(M), there is for each n ∈ N some r_n ∈ N with R(Φ_n) = F_{r_n}|_Ω. Now there are two cases: Case 1: The family (r_n)_{n∈N} is infinite. But in view of Proposition 4.1, we have Lip(F_{r_n}|_Ω) = Lip(R(Φ_n)) ≤ C′ · ‖Φ_n‖^L_scaling ≤ C′ · C^L for a suitable constant C′ = C′(Ω, S), in contradiction to the fact that Lip(F_{r_n}|_Ω) → ∞ as r_n → ∞. Thus, this case cannot occur.