Abstract
We analyze the topological properties of the set of functions that can be implemented by neural networks of a fixed size. Surprisingly, this set has many undesirable properties. It is highly non-convex, except possibly for a few exotic activation functions. Moreover, the set is not closed with respect to \(L^p\)-norms, \(0< p < \infty \), for all practically used activation functions, and also not closed with respect to the \(L^\infty \)-norm for all practically used activation functions except for the ReLU and the parametric ReLU. Finally, the function that maps a family of weights to the function computed by the associated network is not inverse stable for every practically used activation function. In other words, if \(f_1, f_2\) are two functions realized by neural networks and if \(f_1, f_2\) are close in the sense that \(\Vert f_1 - f_2\Vert _{L^\infty } \le \varepsilon \) for \(\varepsilon > 0\), it is, regardless of the size of \(\varepsilon \), usually not possible to find weights \(w_1, w_2\) close together such that each \(f_i\) is realized by a neural network with weights \(w_i\). Overall, our findings identify potential causes for issues in the training procedure of deep learning such as no guaranteed convergence, explosion of parameters, and slow convergence.
1 Introduction
Neural networks, introduced in 1943 by McCulloch and Pitts [49], are the basis of every modern machine learning algorithm based on deep learning [30, 43, 63]. The term deep learning describes a variety of methods that are based on the data-driven manipulation of the weights of a neural network. Since these methods perform spectacularly well in practice, they have become the state-of-the-art technology for a host of applications including image classification [36, 41, 65], speech recognition [22, 34, 69], game intelligence [64, 66, 70], and many more.
This success of deep learning has encouraged many scientists to pick up research in the area of neural networks after the field had gone dormant for decades. In particular, quite a few mathematicians have recently investigated the properties of different neural network architectures, hoping that this can explain the effectiveness of deep learning techniques. This mathematical analysis has mainly been conducted within the framework of statistical learning theory [20], where the overall success of a learning method is determined by the approximation properties of the underlying function class, the feasibility of optimizing over this class, and the generalization capabilities of the class when training with only finitely many samples.
In the approximation theoretical part of deep learning research, one analyzes the expressiveness of deep neural network architectures. The universal approximation theorem [21, 35, 45] demonstrates that neural networks can approximate any continuous function, as long as one uses networks of increasing complexity for the approximation. If one is interested in approximating more specific function classes than the class of all continuous functions, then one can often quantify more precisely how large the networks have to be to achieve a given approximation accuracy for functions from the restricted class. Examples of such results are [7, 14, 51, 52, 57, 71]. Some articles [18, 54, 57, 62, 72] study in particular in which sense deep networks have a superior expressiveness compared to their shallow counterparts, thereby partially explaining the efficiency of networks with many layers in deep learning.
Another line of research studies the training procedures employed in deep learning. Given a set of training samples, the training process is an optimization problem over the parameters of a neural network, where a loss function is minimized. The loss function is typically a nonlinear, non-convex function of the weights of the network, rendering the optimization of this function highly challenging [8, 13, 38]. Nonetheless, in applications, neural networks are often trained successfully through a variation of stochastic gradient descent. In this regard, the energy landscape of the problem was studied and found to allow convergence to a global optimum, if the problem is sufficiently overparametrized; see [1, 16, 27, 56, 67].
The third large area of mathematical research on deep neural networks is analyzing the so-called generalization error of deep learning. In the framework of statistical learning theory [20, 53], the discrepancy between the empirical loss and the expected loss of a classifier is called the generalization error. Specific bounds for this error for the class of deep neural networks were analyzed for instance in [4, 11], and in more specific settings for instance in [9, 10].
In this work, we study neural networks from a different point of view. Specifically, we study the structure of the set of functions implemented by neural networks of fixed size. These sets are naturally (nonlinear) subspaces of classical function spaces like \(L^p(\Omega )\) and \(C(\Omega )\) for compact sets \(\Omega \).
Due to the size of the networks being fixed, our analysis is inherently non-asymptotic. Therefore, our viewpoint is fundamentally different from the analysis in the framework of statistical learning theory. Indeed, in approximation theory, the expressive power of networks growing in size is analyzed. In optimization, one studies the convergence properties of iterative algorithms—usually that of some form of stochastic gradient descent. Finally, when considering the generalization capabilities of deep neural networks, one mainly studies how and with which probability the empirical loss of a classifier converges to the expected loss, for increasing numbers of random training samples and depending on the sizes of the underlying networks.
Given this clear delineation from the classical fields, we will see that our point of view yields interpretable results describing phenomena in deep learning that are not directly explained by the classical approaches. We will describe these results and their interpretations in detail in Sects. 1.2–1.4.
We will use standard notation throughout most of the paper without explicitly introducing it. We do, however, collect a list of used symbols and notions in Appendix A. To not interrupt the flow of reading, we have deferred several auxiliary results to Appendix B and all proofs and related statements to Appendices C–E.
Before we continue, we formally introduce the notion of spaces of neural networks of fixed size.
1.1 Neural Networks of Fixed Size: Basic Terminology
To state our results, it will be necessary to distinguish between a neural network as a set of weights and the associated function implemented by the network, which we call its realization. To explain this distinction, let us fix numbers \(L, N_0, N_1, \dots , N_{L} \in \mathbb {N}\). We say that a family \(\Phi = \big ( (A_\ell ,b_\ell ) \big )_{\ell = 1}^L\) of matrix-vector tuples of the form \(A_\ell \in \mathbb {R}^{N_{\ell } \times N_{\ell -1}}\) and \(b_\ell \in \mathbb {R}^{N_\ell }\) is a neural network. We call \(S{:}{=}(N_0, N_1, \dots , N_L)\) the architecture of \(\Phi \); furthermore \(N(S){:}{=}\sum _{\ell = 0}^L N_\ell \) is called the number of neurons of S and \(L = L(S)\) is the number of layers of S. We call \(d{:}{=}N_0\) the input dimension of \(\Phi \), and throughout this introduction we assume that the output dimension \(N_L\) of the networks is equal to one. For a given architecture S, we denote by \(\mathcal {NN}(S)\) the set of neural networks with architecture S.
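The quantities attached to an architecture can be computed directly from the tuple \(S\). The following minimal sketch assumes that \(S\) is given as a Python tuple; the function name `describe_architecture` and the dictionary keys are hypothetical. It also reports the number of scalar weights \(\sum _{\ell = 1}^L (N_{\ell -1} + 1) N_{\ell }\), which is the dimension of \(\mathcal {NN}(S)\).

```python
def describe_architecture(S):
    """Return basic quantities of an architecture S = (N_0, N_1, ..., N_L)."""
    L = len(S) - 1  # number of layers L(S)
    return {
        "layers": L,
        "neurons": sum(S),  # N(S) = N_0 + ... + N_L
        "input_dim": S[0],  # d = N_0
        "output_dim": S[-1],  # N_L
        # Each layer contributes an N_l x N_{l-1} matrix and a bias in R^{N_l}.
        "parameters": sum((S[l - 1] + 1) * S[l] for l in range(1, L + 1)),
    }


info = describe_architecture((3, 5, 4, 1))
print(info)
```

For the example architecture \((3, 5, 4, 1)\), this gives three layers, 13 neurons and \(4 \cdot 5 + 6 \cdot 4 + 5 \cdot 1 = 49\) parameters.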
Defining the realization of such a network \(\Phi = \big ( (A_\ell ,b_\ell ) \big )_{\ell =1}^L\) requires two additional ingredients: a so-called activation function \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\), and a domain of definition \(\Omega \subset \mathbb {R}^{N_0}\). Given these, the realization of \(\Phi \) is the function
\[ \mathrm {R}_\varrho ^\Omega (\Phi ) : \Omega \rightarrow \mathbb {R}^{N_L}, \quad x \mapsto x_L , \]
where \(x_L\) results from the following scheme:
\[ x_0 {:}{=} x, \qquad x_\ell {:}{=} \varrho (A_\ell \, x_{\ell -1} + b_\ell ) \quad \text {for } \ell = 1, \dots , L-1, \qquad x_L {:}{=} A_L \, x_{L-1} + b_L , \]
and where \(\varrho \) acts componentwise; that is, \(\varrho (x_1,\dots ,x_d) := (\varrho (x_1),\dots ,\varrho (x_d))\). In what follows, we study topological properties of sets of realizations of neural networks with a fixed size. Naturally, there are multiple conventions to specify the size of a network. We will study the set of realizations of networks with a given architecture S and activation function \(\varrho \); that is, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S) {:}{=}\{\mathrm {R}_\varrho ^\Omega (\Phi ) :\Phi \in \mathcal {NN}(S) \}\). In the context of machine learning, this point of view is natural, since one usually prescribes the network architecture, and during training only adapts the weights of the network.
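The distinction between a network (a family of weights) and its realization (a function) can be made concrete in a few lines. The sketch below is illustrative only: it assumes the usual convention that \(\varrho \) is applied componentwise after every affine map except the last one, it stores matrices as lists of rows, and the names `affine` and `realize` as well as the example weights are hypothetical.

```python
def relu(t):
    """The ReLU activation function."""
    return max(0.0, t)


def affine(A, b, x):
    """Compute A x + b for a matrix A (list of rows) and vectors b, x."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def realize(phi, x, rho=relu):
    """Evaluate the realization of phi = [(A_1, b_1), ..., (A_L, b_L)] at x."""
    for A, b in phi[:-1]:
        # Hidden layers: affine map followed by the componentwise activation.
        x = [rho(t) for t in affine(A, b, x)]
    A_L, b_L = phi[-1]
    # Last layer: affine map only, no activation.
    return affine(A_L, b_L, x)


# A network with architecture S = (2, 2, 1); the weights are arbitrary.
phi = [
    ([[1.0, -1.0], [0.0, 2.0]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.5]),
]
y = realize(phi, [1.0, 2.0])
print(y)
```

Here `phi` plays the role of \(\Phi \in \mathcal {NN}(S)\), while the function `lambda x: realize(phi, x)` plays the role of \(\mathrm {R}_\varrho ^\Omega (\Phi )\). Different weight families can yield the same function, which is why the two notions have to be kept apart.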
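Before we continue, let us note that the set \(\mathcal {NN}(S)\) of all neural networks (that is, the network weights) with a fixed architecture forms a finite-dimensional vector space, which we equip with the norm
\[ \Vert \Phi \Vert _{\mathcal {NN}(S)} {:}{=} \Vert \Phi \Vert _{\mathrm {scaling}} + \max _{\ell = 1,\dots ,L} \Vert b_\ell \Vert _{\max } , \]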
where \(\Vert \Phi \Vert _{\mathrm {scaling}} {:}{=}\max _{\ell = 1,\dots ,L } \Vert A_\ell \Vert _{\max }\). If the specific architecture of \(\Phi \) does not matter, we simply write \(\Vert \Phi \Vert _{\mathrm {total}}{:}{=}\Vert \Phi \Vert _{\mathcal {NN}(S)}\). In addition, if \(\varrho \) is continuous, we denote the realization map by
\[ \mathrm {R}_\varrho ^\Omega : \mathcal {NN}(S) \rightarrow C(\Omega ; \mathbb {R}^{N_L}), \quad \Phi \mapsto \mathrm {R}_\varrho ^\Omega (\Phi ) . \tag{1.1} \]
While the activation function \(\varrho \) can in principle be chosen arbitrarily, a couple of particularly useful activation functions have been established in the literature. We proceed by listing some of the most common activation functions, a few of their properties, as well as references to articles using these functions in the context of deep learning. We note that all activation functions listed below are non-constant, monotonically increasing, globally Lipschitz continuous functions. This property is much stronger than the assumption of local Lipschitz continuity that we will require in many of our results. Furthermore, all functions listed below belong to the class \(C^\infty (\mathbb {R}\setminus \{0\}).\)
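The stated properties (non-constant, monotonically increasing, globally Lipschitz) can be checked numerically for several of the activation functions named in this paper. The following sketch is a plausibility check on a finite grid, not a proof. The formulas are the standard textbook definitions and are assumptions here, as is the slope 0.1 chosen for the parametric ReLU; the helper name `check_properties` is hypothetical.

```python
import math

# Standard formulas; the parametric ReLU slope 0.1 is an arbitrary choice.
ACTIVATIONS = {
    "ReLU": lambda t: max(0.0, t),
    "parametric ReLU": lambda t: max(0.1 * t, t),
    "sigmoid": lambda t: 1.0 / (1.0 + math.exp(-t)),
    "tanh": math.tanh,
    "arctan": math.atan,
    "softplus": lambda t: math.log1p(math.exp(t)),
    "softsign": lambda t: t / (1.0 + abs(t)),
}


def check_properties(rho, lo=-5.0, hi=5.0, m=1000):
    """Check monotonicity and estimate the Lipschitz constant of rho on a grid."""
    grid = [lo + (hi - lo) * k / m for k in range(m + 1)]
    values = [rho(t) for t in grid]
    increasing = all(u <= v for u, v in zip(values, values[1:]))
    non_constant = values[0] != values[-1]
    # Largest difference quotient between neighbouring grid points.
    max_slope = max(
        (v - u) / (t - s)
        for s, t, u, v in zip(grid, grid[1:], values, values[1:])
    )
    return increasing, non_constant, max_slope


report = {name: check_properties(rho) for name, rho in ACTIVATIONS.items()}
for name, (increasing, non_constant, max_slope) in report.items():
    print(f"{name}: increasing={increasing}, slope<={max_slope:.3f}")
```

On the grid, every function in the dictionary is increasing, and no difference quotient exceeds 1, in line with a global Lipschitz constant of at most 1 for these particular formulas.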
In the remainder of this introduction, we discuss our results concerning the topological properties of the sets of realizations of neural networks with fixed architecture and their interpretation in the context of deep learning. Then, we give an overview of related work. We note at this point that it is straightforward to generalize all of the results in this paper to neural networks for which one only prescribes the total number of neurons and layers and not the specific architecture.
For simplicity, we will always assume in the remainder of this introduction that \(\Omega \subset \mathbb {R}^{N_0}\) is compact with non-empty interior.
1.2 Non-convexity of the Set of Realizations
We will show in Sect. 2 (Theorem 2.1) that, for a given architecture S, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not convex, except possibly when the activation function is a polynomial, which is clearly not the case for any of the activation functions that are commonly used in practice.
In fact, for a large class of activation functions (including the ReLU and the standard sigmoid activation function), the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) turns out to be highly non-convex in the sense that for every \(r \in [0,\infty )\), the set of functions having uniform distance at most \(r\) to some function in \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not convex. We prove this result in Theorem 2.2 and Remark 2.3.
This non-convexity is undesirable, since for non-convex sets, there do not necessarily exist well-defined projection operators onto them. In classical statistical learning theory [20], the property that the so-called regression function can be uniquely projected onto a convex (and compact) hypothesis space greatly simplifies the learning problem; see [20, Sect. 7]. Furthermore, in applications where the realization of a network (rather than its set of weights) is the quantity of interest, for example when a network is used as an Ansatz for the solution of a PDE as in [24, 42], our results show that the Ansatz space is non-convex. This non-convexity is inconvenient if one aims for a convergence proof of the underlying optimization algorithm, since one cannot apply convexity-based fixed-point theorems. Concretely, if a neural network is optimized by stochastic gradient descent so as to satisfy a certain PDE, then it is interesting to see if there even exists a network so that the iteration stops. In other words, one might ask whether gradient descent on the set of neural networks (potentially with bounded weights) has a fixed point. If the space of neural networks were convex and compact, then the fixed-point theorem of Schauder would guarantee the existence of such a fixed point.
1.3 (Non-)Closedness of the Set of Realizations
For any fixed architecture S, we show in Sect. 3 (Theorem 3.1) that the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not a closed subset of \(L^p (\mu )\) for \(0< p < \infty \), under very mild assumptions on the measure \(\mu \) and the activation function \(\varrho \). The assumptions concerning \(\varrho \) are satisfied for all activation functions used in practice.
For the case \(p = \infty \), the situation is more involved: For all activation functions that are commonly used in practice, except for the (parametric) ReLU, the associated sets \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) are non-closed also with respect to the uniform norm; see Theorem 3.3. For the (parametric) ReLU, however, the question of closedness of the sets \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) remains mostly open. Nonetheless, in two special cases, we prove in Sect. 3.4 that the sets \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) are closed. In particular, for neural network architectures with two layers only, Theorem 3.8 establishes the closedness of \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\), where \(\varrho \) is the (parametric) ReLU.
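A practical consequence of the observation of non-closedness can be identified with the help of the following argument that is made precise in Sect. 3.3: We show that, for every \(C > 0\), the set
\[ \big \{ \mathrm {R}_\varrho ^\Omega (\Phi ) \,:\, \Phi \in \mathcal {NN}(S), \ \Vert \Phi \Vert _{\mathrm {total}} \le C \big \} \]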
of realizations of neural networks with a fixed architecture and all affine linear maps bounded in a suitable norm is always closed. As a consequence, we observe the following phenomenon of exploding weights: If a function f is such that it does not have a best approximation in \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\), that is, if there does not exist \(f^*\in \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) such that
\[ \tau _f {:}{=} \inf _{g \in \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)} \Vert f - g \Vert _{L^p(\mu )} = \Vert f - f^* \Vert _{L^p(\mu )} , \]
then for any sequence of networks \((\Phi _n)_{n \in \mathbb {N}}\) with architecture S which satisfies \(\Vert f - \mathrm {R}_\varrho ^\Omega (\Phi _n)\Vert _{L^p(\mu )} \rightarrow \tau _f\), the weights of the networks \(\Phi _n\) cannot remain uniformly bounded as \(n \rightarrow \infty \). In words, if f does not have a best approximation in the set of neural networks of fixed size, then every sequence of realizations approximately minimizing the distance to f will have exploding weights. Since \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not closed, there do exist functions f which do not have a best approximation in \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\).
Certainly, the presence of large coefficients will make the numerical optimization increasingly unstable. Thus, exploding weights in the sense described above are highly undesirable in practice.
The argument above discusses an approximation problem in an \(L^p\)-norm. In practice, one usually only minimizes “empirical norms”. We will demonstrate in Proposition 3.6 that also in this situation, for increasing numbers of samples, the weights of the neural networks that minimize the empirical norms necessarily explode under certain assumptions. Note that the setup of having a fixed architecture and a potentially unbounded number of training samples is common in applications where neural networks are trained to solve partial differential equations. There, training samples are generated during the training process [25, 42].
1.4 Failure of Inverse Stability of the Realization Map
As our final result, we study (in Sect. 4) the stability of the realization map \(\mathrm {R}_\varrho ^\Omega \) introduced in Eq. (1.1), which maps a family of weights to its realization. Even though this map will turn out to be continuous from the finite dimensional parameter space to \(L^p (\Omega )\) for any \(p \in (0,\infty ]\), we will show that it is not inverse stable. In other words, for two realizations that are very close in the uniform norm, there do not always exist network weights associated with these realizations that have a small distance. In fact, Theorem 4.2 even shows that there exists a sequence of realizations of networks converging uniformly to 0, but such that every sequence of weights with these realizations is necessarily unbounded.
For both of these results—continuity and no inverse stability—we only need to assume that the activation function \(\varrho \) is Lipschitz continuous and not constant.
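One simple way to see how uniformly small realizations can force large weights is through Lipschitz constants. The sketch below is an illustration under our own assumptions (ReLU activation, architecture \((1, 2, 1)\), \(\Omega = [-1, 1]\), hypothetical helper names `g` and `sup_norm_and_slope`) and is not necessarily the construction used in the proof of Theorem 4.2. The functions \(g_n(x) = \varrho (nx) - \varrho (nx - 1/n)\) have sup norm \(1/n\), but slope \(n\) on \([0, 1/n^2]\). Since the Lipschitz constant of a ReLU realization is at most the product of the operator norms of the matrices \(A_\ell \), no bounded family of weights can realize all \(g_n\).

```python
import math


def relu(t):
    return max(0.0, t)


def g(n, x):
    """Realization of a ReLU network with architecture (1, 2, 1):
    A_1 = (n, n)^T, b_1 = (0, -1/n)^T, A_2 = (1, -1), b_2 = 0."""
    return relu(n * x) - relu(n * x - 1.0 / n)


def sup_norm_and_slope(n, m=4000):
    """Approximate the sup norm of g(n, .) on [-1, 1] and its slope on [0, 1/n^2]."""
    grid = [-1.0 + 2.0 * k / m for k in range(m + 1)]
    sup = max(abs(g(n, x)) for x in grid)
    # g(n, .) is linear on [0, 1/n^2]; one difference quotient gives its slope.
    width = 1.0 / n**2
    slope = (g(n, width) - g(n, 0.0)) / width
    return sup, slope


for n in (10, 100, 1000):
    sup, slope = sup_norm_and_slope(n)
    print(f"n={n}: sup norm ~ {sup:.4g}, slope ~ {slope:.4g}")
```

As \(n\) grows, the printed sup norms shrink like \(1/n\) while the slopes grow like \(n\), so the realizations converge uniformly to 0 although any weights realizing them must become large.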
These properties of the realization map pinpoint a potential problem that can occur when training a neural network: Let us consider a regression problem, where a network is iteratively updated by a (stochastic) gradient descent algorithm trying to minimize a loss function. It is then possible that at some iterate the loss function exhibits a very small error, even though the associated network parameters have a large distance to the optimal parameters. This issue is especially severe since a small error term leads to small steps if gradient descent methods are used in the optimization. Consequently, convergence to the very distant optimal weights will be slow even if the energy landscape of the optimization problem happens to be free of spurious local minima.
1.5 Related Work
Structural properties
The aforementioned properties of non-convexity and non-closedness have, to some extent, been studied before. Classical results analyze the spaces of shallow neural networks, that is, of \(\mathcal {RNN}_\varrho ^\Omega (S)\) for \(S = (d, N_1, 1)\), so that \(L = 2\). For such sets of shallow networks, a property that has been extensively studied is to what extent \(\mathcal {RNN}_\varrho ^\Omega (S)\) has the best approximation property. Here, we say that \(\mathcal {RNN}_\varrho ^\Omega (S)\) has the best approximation property, if for every function \(f \in L^p(\Omega )\), \(1 \le p \le \infty \), there exists a function \(F(f) \in \mathcal {RNN}_\varrho ^\Omega (S)\) such that \(\Vert f - F(f) \Vert _{L^p} = \inf _{g \in \mathcal {RNN}_\varrho ^\Omega (S)} \Vert f - g \Vert _{L^p}\). In [40] it was shown that even if a minimizer always exists, the map \(f \mapsto F(f)\) is necessarily discontinuous. Furthermore, at least for the Heaviside activation function, there does exist a (non-unique) best approximation; see [39].
Additionally, [28, Proposition 4.1] demonstrates, for shallow networks as before, that for the logistic activation function \(\varrho (x) = (1 + e^{-x})^{-1}\), the set \(\mathcal {RNN}_\varrho ^\Omega (S)\) does not have the best approximation property in \(C(\Omega )\). In the proof of this statement, it was also shown that \(\mathcal {RNN}_\varrho ^\Omega (S)\) is not closed. Furthermore, it is claimed that this result should hold for every nonlinear activation function. The previously mentioned result of [39] and Theorem 3.8 below disprove this conjecture for the Heaviside and ReLU activation functions, respectively.
Other notions of (non-)convexity
In deep learning, one chooses a loss function \({{\mathcal {L}}: C(\Omega ) \rightarrow [0,\infty )}\), which is then minimized over the set of neural networks \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) with fixed architecture S. A typical loss function is the empirical square loss, that is,
\[ {\mathcal {L}}(f) {:}{=} \sum _{i=1}^N |f(x_i) - y_i|^2 , \]
where \((x_i,y_i)_{i=1}^N \subset \Omega \times \mathbb {R}\), \(N \in \mathbb {N}\). In practice, one solves the minimization problem over the weights of the network; that is, one attempts to minimize the function \({\mathcal {L}} \circ \mathrm {R}_\varrho ^\Omega : \mathcal {NN}(S) \rightarrow [0,\infty )\). In this context, to assess the hardness of this optimization problem, one studies whether \({\mathcal {L}} \circ \mathrm {R}_\varrho ^\Omega \) is convex, the degree to which it is non-convex, and if one can find remedies to alleviate the problem of non-convexity, see for instance [5, 6, 27, 37, 50, 56, 59, 67, 73].
It is important to emphasize that this notion of non-convexity describes properties of the loss function, in contrast to the non-convexity of the sets of functions that we analyze in this work.
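A tiny example shows what non-convexity of \({\mathcal {L}} \circ \mathrm {R}_\varrho ^\Omega \) as a function of the weights looks like. The sketch below rests on our own assumptions: a ReLU network with architecture \((1, 1, 1)\), a single training sample, an unnormalized sum of squares as the loss, and hypothetical names `realize` and `square_loss`. Two weight configurations attain zero loss, yet their midpoint in parameter space does not.

```python
def relu(t):
    return max(0.0, t)


def realize(phi, x):
    """Realization of a ReLU network with architecture (1, 1, 1).

    phi = (a1, b1, a2, b2) encodes A_1 = a1, b_1 = b1, A_2 = a2, b_2 = b2.
    """
    a1, b1, a2, b2 = phi
    return a2 * relu(a1 * x + b1) + b2


def square_loss(phi, samples):
    """Empirical square loss of the realization of phi (sum over samples)."""
    return sum((realize(phi, x) - y) ** 2 for x, y in samples)


samples = [(1.0, 1.0)]
phi_a = (1.0, 0.0, 1.0, 0.0)  # realizes x -> relu(x)
phi_b = (2.0, 0.0, 0.5, 0.0)  # realizes the same function with rescaled weights
phi_mid = tuple((u + v) / 2 for u, v in zip(phi_a, phi_b))

loss_a = square_loss(phi_a, samples)
loss_b = square_loss(phi_b, samples)
loss_mid = square_loss(phi_mid, samples)
print(loss_a, loss_b, loss_mid)
```

Since the loss at the midpoint exceeds the average of the losses at the endpoints, the map from weights to loss is not convex. This concerns the loss as a function of the parameters and is a different statement from the non-convexity of the set of realizations studied in this work.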
2 Non-convexity of the Set of Realizations
In this section, we analyze the convexity of the set of all neural network realizations. In particular, we will show that this set is highly non-convex for all practically used activation functions listed in Table 1. First, we examine the convexity of the set \(\mathcal {R}\mathcal {N}\mathcal {N}^{\Omega }_{\varrho }(S)\):
Theorem 2.1
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture with \(L \in \mathbb {N}_{\ge 2}\) and let \(\Omega \subset \mathbb {R}^d\) with non-empty interior. Moreover, let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous.
If \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is convex, then \(\varrho \) is a polynomial.
Remark
(1) It is easy to see that all of the activation functions in Table 1 are locally Lipschitz continuous, and that none of them is a polynomial. Thus, the associated sets of realizations are never convex.
(2) In the case where \(\varrho \) is a polynomial, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) might or might not be convex. Indeed, if \(S = (1, N, 1)\) and \(\varrho (x) = x^m\), then it is not hard to see that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is convex if and only if \(N \ge m\).
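As a concrete instance of Theorem 2.1 for a non-polynomial activation function, consider the following worked example, which is a minimal illustration under the assumed choices \(S = (1,1,1)\), \(\Omega = [-1,1]\), and \(\varrho (x) = \max \{0, x\}\) (the ReLU). Every element of \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) has the form
\[ g(x) = a \, \varrho (w x + b) + c , \qquad a, w, b, c \in \mathbb {R}. \]
Such a function \(g\) is either affine on all of \([-1,1]\), or it is constant on a subinterval of positive length that has \(-1\) or \(1\) as an endpoint. Both \(f_1(x) = \varrho (x)\) and \(f_2(x) = \varrho (-x)\) are of this form, but their midpoint
\[ \tfrac{1}{2} f_1(x) + \tfrac{1}{2} f_2(x) = \tfrac{1}{2} |x| \]
is neither affine on \([-1,1]\) nor constant on any interval of positive length. Hence \(\tfrac{1}{2}(f_1 + f_2) \notin \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\), so this particular set of realizations is not convex.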
Proof
The detailed proof of Theorem 2.1 is the subject of “Appendix C.1”. Let us briefly outline the proof strategy:
1. We first show in Proposition C.1 that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is closed under scalar multiplication, hence star-shaped with respect to the origin, i.e., 0 is a center.
2. Next, using the local Lipschitz continuity of \(\varrho \), we establish in Proposition C.4 that the maximal number of linearly independent centers of the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is finite. Precisely, it is bounded by the number of parameters of the underlying neural networks, given by \(\sum _{\ell = 1}^L (N_{\ell -1} + 1) N_{\ell }\).
3. A direct consequence of Step 2 is that if \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is convex, then it can only contain a finite number of linearly independent functions; see Corollary C.5.
4. Finally, using that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S)\) is a translation-invariant subset of \(C(\mathbb {R}^d)\), we show in Proposition C.6 that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S)\) (and hence also \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\)) contains infinitely many linearly independent functions, if \(\varrho \) is not a polynomial.
\(\square \)
In applications, the non-convexity of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) might not be as problematic as it first seems. If, for instance, the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S) + B_\delta (0)\) of functions that can be approximated up to error \(\delta > 0\) by a neural network with architecture S was convex, then one could argue that the non-convexity of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) was not severe. Indeed, in practice, neural networks are only trained to minimize a certain empirical loss function, with resulting bounds on the generalization error which are typically of size \(\varepsilon = {\mathcal {O}}(m^{-1/2})\), with m denoting the number of training samples. In this setting, one is not really interested in “completely minimizing” the (empirical) loss function, but would be content with finding a function for which the empirical loss is \(\varepsilon \)-close to the global minimum. Hence, one could argue that one is effectively working with a hypothesis space of the form \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S) + B_\delta (0)\), containing all functions that can be represented up to an error of \(\delta \) by neural networks of architecture S.
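To quantify this potentially more relevant notion of convexity of neural networks, we define, for a subset A of a vector space \({\mathcal {Y}}\), the convex hull of A as
\[ \mathrm {co}(A) {:}{=} \bigcup _{n \in \mathbb {N}} \Big \{ \sum _{i=1}^n \lambda _i x_i \,:\, x_1, \dots , x_n \in A, \ \lambda _1, \dots , \lambda _n \in [0,1], \ \sum _{i=1}^n \lambda _i = 1 \Big \} . \]
For \(\varepsilon > 0\), we say that a subset A of a normed vector space \({\mathcal {Y}}\) is \(\varepsilon \)-convex in \(({\mathcal {Y}},\Vert \cdot \Vert _{{\mathcal {Y}}})\), if
\[ \mathrm {co}(A) \subset A + B_\varepsilon (0) . \]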
Hence, the notion of \(\varepsilon \)-convexity asks whether the convex hull of a set is contained in an enlargement of this set. Note that if \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is dense in \(C(\Omega )\), then its closure is trivially \(\varepsilon \)-convex for all \(\varepsilon > 0\). Our main result regarding the \(\varepsilon \)-convexity of neural network sets shows that this is the only case in which \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)}\) is \(\varepsilon \)-convex for any \(\varepsilon > 0\).
Theorem 2.2
Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L \ge 2\), and let \(\Omega \subset \mathbb {R}^d\) be compact. Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous but not a polynomial, and such that \(\varrho '(x_0) \ne 0\) for some \(x_0 \in \mathbb {R}\).
Assume that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is not dense in \(C(\Omega )\). Then there does not exist any \(\varepsilon > 0\) such that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is \(\varepsilon \)-convex in \(\big ( C(\Omega ), \Vert \cdot \Vert _{\sup } \big )\).
Remark
All closures in the theorem are taken in \(C(\Omega )\).
Proof
The proof of this theorem is the subject of “Appendix C.2”. It is based on showing that if \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is \(\varepsilon \)-convex for some \(\varepsilon > 0\), then in fact \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is convex, which we then use to show that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) contains all realizations of two-layer neural networks with activation function \(\varrho \). As shown in [45], this implies that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is dense in \(C(\Omega )\), since \(\varrho \) is not a polynomial. \(\square \)
Remark 2.3
While it is certainly natural to expect that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \ne C(\Omega )\) should hold for most activation functions \(\varrho \), giving a reference including large classes of activation functions such that the claim holds is not straightforward. We study this problem more closely in “Appendix C.3”.
To be more precise, from Proposition C.10 it follows that the ReLU, the parametric ReLU, the exponential linear unit, the softsign, the sigmoid, and the \(\tanh \) yield realization sets \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) which are not dense in \(L^p(\Omega )\) and in \(C(\Omega )\).
The only activation functions listed in Table 1 for which we do not know whether any of the realization sets \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is dense in \(L^p(\Omega )\) or in \(C(\Omega )\) are: the inverse square root linear unit, the inverse square root unit, the softplus, and the \(\arctan \) function. Of course, we expect that also for these activation functions, the resulting sets of realizations are never dense in \(L^p(\Omega )\) or in \(C(\Omega )\).
Finally, we would like to mention that if \(\Omega \subset \mathbb {R}^d\) has non-empty interior and if the input dimension satisfies \(d \ge 2\), then it follows from the results in [48] that if \(S = (d, N_1, 1)\) is a two-layer architecture, then \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is not dense in \(C(\Omega )\) or \(L^p(\Omega )\).
3 (Non-)Closedness of the Set of Realizations
Let \(\varnothing \ne \Omega \subset \mathbb {R}^d\) be compact with non-empty interior. In the present section, we analyze whether the neural network realization set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) with \(S=(d,N_1,\dots ,N_{L-1},1)\) is closed in \(C(\Omega )\), or in \(L^p (\mu )\), for \(p \in (0, \infty )\) and any measure \(\mu \) satisfying a mild “non-atomicness” condition. For the \(L^p\)-spaces, the answer is simple: Under very mild assumptions on the activation function \(\varrho \), we will see that \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is never closed in \(L^p (\mu )\). In particular, this holds for all of the activation functions listed in Table 1. Closedness in \(C(\Omega )\), however, is more subtle: For this setting, we will identify several different classes of activation functions for which the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is not closed in \(C(\Omega )\). As we will see, these classes of activation functions cover all those functions listed in Table 1, except for the ReLU and the parametric ReLU. For these two activation functions, we were unable to determine whether the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is closed in \(C(\Omega )\) in general, but we conjecture this to be true. Only for the case \(L = 2\), we could show that these sets are indeed closed.
Closedness of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is a highly desirable property as we will demonstrate in Sect. 3.3. Indeed, we establish that if \(X = L^p(\mu )\) or \(X = C(\Omega )\), then, for all functions \(f \in X\) that do not possess a best approximation within \({\mathcal {R}} = \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\), the weights of approximating networks necessarily explode. In other words, if \((\mathrm {R}_{\varrho }^{\Omega }(\Phi _n))_{n \in \mathbb {N}} \subset {\mathcal {R}}\) is such that \(\Vert f - \mathrm {R}_{\varrho }^{\Omega }(\Phi _n)\Vert _{X}\) converges to \(\inf _{g\in {\mathcal {R}}}\Vert f-g \Vert _{X}\) for \(n \rightarrow \infty \), then \(\Vert \Phi _n\Vert _{\mathrm {total}} \rightarrow \infty \). Such functions without a best approximation in \({\mathcal {R}}\) necessarily exist if \({\mathcal {R}}\) is not closed. Moreover, even in practical applications, where empirical error terms instead of \(L^p(\mu )\) norms are minimized, the absence of closedness implies exploding weights as we show in Proposition 3.6.
Finally, we note that for simplicity, all “non-closedness” results in this section are formulated for compact rectangles of the form \(\Omega = [-B,B]^d\) only; but our arguments easily generalize to any compact set \(\Omega \subset \mathbb {R}^d\) with non-empty interior.
3.1 Non-closedness in \(L^p(\mu )\)
We start by examining the closedness with respect to \(L^p\)-norms for \(p \in (0, \infty )\). In fact, for all \(B > 0\) and all widely used activation functions (including all activation functions presented in Table 1), the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)\) is not closed in \(L^p(\mu )\), for any \(p \in (0, \infty )\) and any “sufficiently non-atomic” measure \(\mu \) on \([-B,B]^d\). To be more precise, the following is true:
Theorem 3.1
Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L \in \mathbb {N}_{\ge 2}\). Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be a function satisfying the following conditions:
(i) \(\varrho \) is continuous and increasing;
(ii) There is some \(x_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho '(x_0) \ne 0\);
(iii) There is some \(r > 0\) such that \(\varrho |_{(-\infty , -r) \cup (r, \infty )}\) is differentiable;
(iv) At least one of the following two conditions is satisfied:
  (a) There are \(\lambda , \lambda ' \ge 0\) with \(\lambda \ne \lambda '\) such that \(\varrho '(x) \rightarrow \lambda \) as \(x \rightarrow \infty \), and \(\varrho '(x) \rightarrow \lambda '\) as \(x \rightarrow -\infty \), and we have \(N_{L-1} \ge 2\).
  (b) \(\varrho \) is bounded.
Finally, let \(B > 0\) and let \(\mu \) be a finite Borel measure on \([-B, B]^d\) for which the support \(\hbox {supp}\mu \) is uncountable. Then the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)\) is not closed in \(L^p(\mu )\) for any \(p \in (0, \infty )\). More precisely, there is a function \(f \in L^\infty (\mu )\) which satisfies \(f \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)} \setminus \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)\) for all \(p \in (0,\infty )\), where the closure is taken in \(L^p(\mu )\).
Remark
If \(\hbox {supp}\mu \) is countable, then \(\mu = \sum _{x \in \hbox {supp}\mu } \mu (\{ x \}) \, \delta _x\) is a countable sum of Dirac measures, meaning that \(\mu \) is purely atomic. In particular, if \(\mu \) is non-atomic (meaning that \(\mu (\{ x \}) = 0\) for all \(x \in [-B,B]^d\)), then \(\hbox {supp}\mu \) is uncountable and the theorem is applicable.
Proof
For the proof of the theorem, we refer to “Appendix D.1”. The main proof idea consists in the approximation of a (discontinuous) step function which cannot be represented by a neural network with continuous activation function. \(\square \)
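The proof idea can be illustrated numerically. The following minimal sketch makes several assumptions that are not part of the theorem: \(d = 1\), \(B = 1\), \(\mu \) is the Lebesgue measure, \(\varrho \) is the logistic sigmoid, and the network has a single hidden neuron with input weight \(n\). The function names are hypothetical. The sketch estimates the \(L^2\) distance between \(x \mapsto \varrho (nx)\) and the step function \(\mathbb {1}_{(0,\infty )}\); the distance shrinks only as the weight \(n\) grows, while the discontinuous limit is not the realization of any network with continuous activation function.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic sigmoid."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def step(x: float) -> float:
    """Discontinuous target: indicator function of (0, infinity)."""
    return 1.0 if x > 0 else 0.0


def lp_distance(n: float, p: float = 2.0, B: float = 1.0, m: int = 20000) -> float:
    """Midpoint-rule estimate of the L^p([-B, B]) distance between sigmoid(n x) and the step."""
    h = 2.0 * B / m
    total = 0.0
    for k in range(m):
        x = -B + (k + 0.5) * h
        total += abs(sigmoid(n * x) - step(x)) ** p * h
    return total ** (1.0 / p)


# Input weights n of the single hidden neuron (illustrative choice).
scales = [1, 10, 100, 1000]
errors = [lp_distance(n) for n in scales]
for n, e in zip(scales, errors):
    print(f"n = {n:5d}   estimated L^2 error = {e:.4f}")
```

The same experiment with the maximum of the pointwise error instead of the integral would not show convergence, which is consistent with the separate treatment of \(C([-B,B]^d)\) in Sect. 3.2.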
Corollary 3.2
The assumptions concerning the activation function \(\varrho \) in Theorem 3.1 are satisfied for all of the activation functions listed in Table 1. In any case where \(\varrho \) is bounded, one can take \(N_{L-1} = 1\); otherwise, one can take \(N_{L-1} = 2\).
Proof
For a proof of this statement, we refer to “Appendix D.2”. \(\square \)
3.2 Non-closedness in \(C([-B,B]^d)\) for Many Widely Used Activation Functions
We have seen in Theorem 3.1 that under reasonably mild assumptions on the activation function \(\varrho \)—which are satisfied for all commonly used activation functions—the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in any \(L^p\)-space where \(p \in (0, \infty )\). However, the argument of the proof of Theorem 3.1 breaks down if one considers closedness with respect to the \(\Vert \cdot \Vert _{\sup }\)-norm. Therefore, we will analyze this setting more closely in this section. More precisely, in Theorem 3.3, we present several criteria regarding the activation function \(\varrho \) which imply that the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \(C ([-B,B]^d)\). We remark that in all these results, \(\varrho \) will be assumed to be at least \(C^1\). Developing similar criteria for non-differentiable functions is an interesting topic for future research.
Before we formulate Theorem 3.3, we need the following notion: We say that a function \(f: \mathbb {R}\rightarrow \mathbb {R}\) is approximately homogeneous of order \((r,q) \in \mathbb {N}_0^2\) if there exists \(s > 0\) such that \(|f(x) - x^r| \le s\) for all \(x \ge 0\) and \(|f(x) - x^q| \le s\) for all \(x \le 0\). Now the following theorem holds:
Theorem 3.3
Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L\in \mathbb {N}_{\ge 2}\), let \(B > 0\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\). Assume that at least one of the following three conditions is satisfied:
(i) \(N_{L-1}\ge 2\) and \(\varrho \in C^1(\mathbb {R}) \setminus C^\infty (\mathbb {R}).\)
(ii) \(N_{L-1}\ge 2\) and \(\varrho \) is bounded, analytic, and not constant.
(iii) \(\varrho \) is approximately homogeneous of order \((r, q)\) for certain \(r,q \in \mathbb {N}_0\) with \(r \ne q\), and \(\varrho \in C^{\max \{r,q\}}(\mathbb {R})\).
Then the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in the space \(C([-B,B]^d)\).
Proof
For the proof of the statement, we refer to “Appendix D.3”. In particular, the proof of the statement under Condition (i) can be found in “Appendix D.3.1”. Its main idea consists of the uniform approximation of \(\varrho '\) (which cannot be represented by neural networks with activation function \(\varrho \), due to its lack of sufficient regularity) by neural networks. For the proof of the statement under Condition (ii), we refer to “Appendix D.3.2”. The main proof idea consists in the uniform approximation of an unbounded analytic function which cannot be represented by a neural network with activation function \(\varrho \), since \(\varrho \) itself is bounded. Finally, the proof of the statement under Condition (iii) can be found in “Appendix D.3.3”. Its main idea consists in the approximation of the function \(x\mapsto (x)^{\max \{r,q\}}_+ \not \in C^{\max \{r,q\}}.\) \(\square \)
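The idea behind Condition (i) can be made concrete with a difference quotient. The sketch below is an illustration under assumptions that are not taken from the appendix: \(d = 1\), \(L = 2\), \(B = 1\), and \(\varrho (x) = \max \{0,x\}^2\), which is \(C^1\) but not \(C^2\). The two-neuron network \(x \mapsto n \varrho (x + 1/n) - n \varrho (x)\) has output weights \(\pm n\) and approximates \(\varrho '(x) = 2\max \{0,x\}\) uniformly with error \(1/n\). Since \(\varrho '\) is not \(C^1\), whereas every realization with this activation function is \(C^1\), the uniform limit lies outside the realization set. All function names are hypothetical.

```python
def rho(x: float) -> float:
    """rho(x) = max(0, x)^2, which is C^1 but not C^2."""
    return max(0.0, x) ** 2


def rho_prime(x: float) -> float:
    """Derivative of rho; continuous, but not differentiable at 0."""
    return 2.0 * max(0.0, x)


def two_neuron_net(x: float, n: int) -> float:
    """Realization of a network with two hidden neurons and output weights n and -n."""
    return n * rho(x + 1.0 / n) - n * rho(x)


def sup_error(n: int, B: float = 1.0, m: int = 4000) -> float:
    """Maximal deviation from rho' on a uniform grid of [-B, B]."""
    xs = [-B + 2.0 * B * k / m for k in range(m + 1)]
    return max(abs(two_neuron_net(x, n) - rho_prime(x)) for x in xs)


for n in (1, 10, 100, 1000):
    print(f"n = {n:5d}   sup error = {sup_error(n):.6f}   output weights = +/-{n}")
```

In this example the uniform error behaves like \(1/n\), so the approximation improves exactly as the output weights grow without bound.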
Corollary 3.4
Theorem 3.3 applies to all activation functions listed in Table 1 except for the ReLU and the parametric ReLU. To be more precise,
(1) Condition (i) is fulfilled by the function \(x\mapsto \max \{0, x \}^k\) for \(k \ge 2\), and by the exponential linear unit, the softsign function, and the inverse square root linear unit.
(2) Condition (ii) is fulfilled by the inverse square root unit, the sigmoid function, the \(\tanh \) function, and the \(\arctan \) function.
(3) Condition (iii) (with \(r = 1\) and \(q = 0\)) is fulfilled by the softplus function.
Proof
For the proof of this statement, we refer to “Appendix D.4”. In particular, for the proof of (1), we refer to “Appendix D.4.1”, the proof of (2) is clear and for the proof of (3), we refer to “Appendix D.4.2”. \(\square \)
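Item (3) can be checked numerically against the definition of approximate homogeneity. The following sketch assumes the usual form \(x \mapsto \ln (1 + e^x)\) of the softplus function and the convention \(x^0 = 1\); the helper names and the grid are illustrative choices. It records the largest observed value of \(|\varrho (x) - x^r|\) for \(x \ge 0\) and of \(|\varrho (x) - x^q|\) for \(x \le 0\), with \(r = 1\) and \(q = 0\). A finite grid can only suggest, not prove, that a bound such as \(s = 1\) works.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable evaluation of ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def max_deviation(r: int, q: int, radius: float = 50.0, m: int = 10000) -> float:
    """Largest observed |f(x) - x^r| on [0, radius] and |f(x) - x^q| on [-radius, 0]."""
    worst = 0.0
    for k in range(m + 1):
        x = radius * k / m
        worst = max(
            worst,
            abs(softplus(x) - x ** r),        # branch x >= 0
            abs(softplus(-x) - (-x) ** q),    # branch x <= 0
        )
    return worst


s = max_deviation(r=1, q=0)
print(f"largest observed deviation: {s:.6f}")
```

On the positive half-line the deviation is at most \(\ln 2\), attained at \(x = 0\); on the negative half-line it approaches \(1\) as \(x \rightarrow -\infty \).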
3.3 The Phenomenon of Exploding Weights
We have just seen that the realization set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \(L^p(\mu )\) for any \(p \in (0,\infty )\) and every practically relevant activation function. Furthermore, for a variety of activation functions, we have seen that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \(C([-B,B]^d)\). The situation is substantially different if the weights are taken from a compact subset:
Proposition 3.5
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(\Omega \subset \mathbb {R}^d\) be compact, let furthermore \(p \in (0,\infty )\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous. For \(C > 0\), let
\[ \Theta _C {:}{=}\big \{ \Phi \in \mathcal {NN}(S) \,:\, \Vert \Phi \Vert _{\mathrm {total}} \le C \big \}. \]
Then the set \(\mathrm {R}_{\varrho }^{\Omega }(\Theta _C)\) is compact in \(C(\Omega )\) as well as in \(L^p(\mu )\), for any finite Borel measure \(\mu \) on \(\Omega \) and any \(p \in (0, \infty )\).
Proof
The proof of this statement is based on the continuity of the realization map and can be found in “Appendix D.5”. \(\square \)
Proposition 3.5 helps to explain the phenomenon of exploding network weights that is sometimes observed during the training of neural networks. Indeed, let us assume that \({\mathcal {R}} := \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \({\mathcal {Y}}\), where \({\mathcal {Y}} {:}{=}L^p (\mu )\) for a Borel measure \(\mu \) on \([-B,B]^d\), or \({\mathcal {Y}} {:}{=}C([-B,B]^d)\); as seen in Sects. 3.1 and 3.2, this is true under mild assumptions on \(\varrho \). Then, it follows that there exists a function \(f \in {\mathcal {Y}}\) which does not have a best approximation in \({\mathcal {R}}\), meaning that there does not exist any \(g \in {\mathcal {R}}\) satisfying
\[ \Vert f - g \Vert _{{\mathcal {Y}}} = \inf _{h \in {\mathcal {R}}} \Vert f - h \Vert _{{\mathcal {Y}}} \,=:\, M; \]
in fact, one can take any \(f \in \overline{{\mathcal {R}}} \setminus {\mathcal {R}}\). Next, recall from Proposition 3.5 that the subset of \({\mathcal {R}}\) that contains only realizations of networks with uniformly bounded weights is compact.
Hence, we conclude the following: For every sequence
\[ (f_n)_{n \in \mathbb {N}} = \big ( \mathrm {R}_{\varrho }^{\Omega }(\Phi _n) \big )_{n \in \mathbb {N}} \subset {\mathcal {R}} \]
satisfying \({\Vert f - f_n \Vert _{{\mathcal {Y}}} \rightarrow M}\), we must have \(\Vert \Phi _n\Vert _{\mathrm {total}} \rightarrow \infty \), since otherwise, by compactness, \((f_n)_{n \in \mathbb {N}}\) would have a subsequence that converges to some \(h \in \mathrm {R}_{\varrho }^{\Omega }(\Theta _C) \subset {\mathcal {R}}\). In other words, the weights of the networks \(\Phi _n\) necessarily explode.
The argument above only deals with the approximation problem in the space \(C([-B,B]^d)\) or in \(L^p(\mu )\) for \(p \in (0, \infty )\). In practice, one is often not concerned with these norms, but instead wants to minimize an empirical loss function over \({\mathcal {R}}\). For the empirical square loss, this loss function takes the form
\[ \frac{1}{N} \sum _{i=1}^N \big ( f(x_i) - y_i \big )^2 \]
for \(\big ( (x_i, y_i) \big )_{i=1}^N \subset \Omega \times \mathbb {R}\) drawn i.i.d. according to a probability distribution \(\sigma \) on \(\Omega \times \mathbb {R}\). By the strong law of large numbers, for each fixed measurable function f, the empirical loss function converges almost surely to the expected loss
\[ {\mathcal {E}}_\sigma (f) = \int _{\Omega \times \mathbb {R}} \big ( f(x) - y \big )^2 \, d\sigma (x,y). \tag{3.1} \]
This expected loss is related to an \(L^2\) minimization problem. Indeed, [20, Proposition 1] shows that there is a measurable function \(f_\sigma : \Omega \rightarrow \mathbb {R}\)—called the regression function—such that the expected risk from Eq. (3.1) satisfies
\[ {\mathcal {E}}_\sigma (f) = {\mathcal {E}}_\sigma (f_\sigma ) + \int _\Omega \big ( f(x) - f_\sigma (x) \big )^2 \, d\sigma _\Omega (x) \quad \text {for each } f \in L^2(\sigma _\Omega ). \]
Here, \(\sigma _\Omega \) is the marginal probability distribution of \(\sigma \) on \(\Omega \), and \({\mathcal {E}}_\sigma (f_\sigma )\) is called the Bayes risk; it is the minimal expected loss achievable by any (measurable) function.
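The relation between the expected square loss and the \(L^2(\sigma _\Omega )\) distance to the regression function, namely that the expected loss of \(f\) equals the Bayes risk plus \(\Vert f - f_\sigma \Vert _{L^2(\sigma _\Omega )}^2\), can be checked by simulation. The sketch below is a Monte Carlo illustration under assumed data: \(\Omega = [-1,1]\), inputs drawn uniformly, \(f_\sigma (x) = \sin (\pi x)\), bounded zero-mean noise, and the candidate \(f(x) = x\). None of these choices come from the text, and all names are hypothetical.

```python
import math
import random

random.seed(0)
N = 100_000  # number of simulated samples (illustrative)


def f_sigma(x: float) -> float:
    """Regression function of the simulated distribution (by construction)."""
    return math.sin(math.pi * x)


def f(x: float) -> float:
    """Candidate function whose loss is decomposed."""
    return x


# Samples (x_i, y_i): uniform inputs, labels f_sigma(x) plus bounded zero-mean noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [f_sigma(x) + random.uniform(-0.5, 0.5) for x in xs]


def empirical_risk(g) -> float:
    """Empirical square loss of g on the simulated samples."""
    return sum((g(x) - y) ** 2 for x, y in zip(xs, ys)) / N


risk_f = empirical_risk(f)
bayes = empirical_risk(f_sigma)
l2_gap = sum((f(x) - f_sigma(x)) ** 2 for x in xs) / N

print(f"loss of f                  : {risk_f:.4f}")
print(f"Bayes risk + squared L^2 gap: {bayes + l2_gap:.4f}")
```

The two printed numbers agree up to sampling error, because the mixed term averages out when the noise has mean zero given the input.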
In this context of a statistical learning problem, we have the following result regarding exploding weights:
Proposition 3.6
Let \(d \in \mathbb {N}\) and \(B, K > 0\). Let \(\Omega {:}{=}[-B,B]^d\). Moreover, let \(\sigma \) be a Borel probability measure on \(\Omega \times [-K,K]\). Further, let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture and \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous. Assume that the regression function \(f_\sigma \) is such that there does not exist a best approximation of \(f_\sigma \) in \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) with respect to \(\Vert \cdot \Vert _{L^2(\sigma _\Omega )}\). Let \(\big ( (x_i, y_i) \big )_{i \in \mathbb {N}} \overset{\mathrm {i.i.d.}}{\sim } \sigma \); all probabilities below will be with respect to this family of random variables.
If \((\Phi _N)_{N \in \mathbb {N}} \subset \mathcal {NN}(S)\) is a random sequence of neural networks (depending on \(\big ( (x_i, y_i) \big )_{i \in \mathbb {N}}\)) that satisfies
\[ \frac{1}{N} \sum _{i=1}^N \big ( \mathrm {R}_{\varrho }^{\Omega }(\Phi _N)(x_i) - y_i \big )^2 - \inf _{f \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \frac{1}{N} \sum _{i=1}^N \big ( f(x_i) - y_i \big )^2 \rightarrow 0 \quad \text {in probability as } N \rightarrow \infty , \]
then \(\Vert \Phi _N \Vert _{\mathrm {total}} \rightarrow \infty \) in probability as \(N \rightarrow \infty \).
Remark
A compact way of stating Proposition 3.6 is that, if \(f_\sigma \) has no best approximation in \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) with respect to \(\Vert \cdot \Vert _{L^2(\sigma _\Omega )}\), then the weights of the minimizers (or approximate minimizers) of the empirical square loss explode for increasing numbers of samples.
Since \(\sigma \) is unknown in applications, it is indeed possible that \(f_\sigma \) has no best approximation in the set of neural networks. As just one example, this is true if \(\sigma _\Omega \) is any Borel probability measure on \(\Omega \) and if \(\sigma \) is the distribution of (X, g(X)), where \(X \sim \sigma _\Omega \) and \(g \in L^2(\sigma _\Omega )\) is bounded and satisfies \(g \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \setminus \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\), with the closure taken in \(L^2(\sigma _\Omega )\). The existence of such a function g is guaranteed by Theorem 3.1 if \(\hbox {supp}\sigma _\Omega \) is uncountable.
Proof
For the proof of Proposition 3.6, we refer to “Appendix D.6”. The proof is based on classical arguments of statistical learning theory as given in [20]. \(\square \)
3.4 Closedness of ReLU Networks in \(C([-B,B]^d)\)
In this subsection, we analyze the closedness of sets of realizations of neural networks with respect to the ReLU or the parametric ReLU activation function in \(C(\Omega )\), mostly for the case \(\Omega = [-B, B]^d\). We conjecture that the set of (realizations of) ReLU networks of a fixed complexity is closed in \(C(\Omega )\), but were not able to prove such a result in full generality. In two special cases, namely when the networks have only two layers, or when at least the scaling weights are bounded, we can show that the associated set of ReLU realizations is closed in \(C(\Omega )\); see below.
We begin by analyzing the set of realizations with uniformly bounded scaling weights and possibly unbounded biases, before proceeding with the analysis of two-layer ReLU networks.
For \(\Phi = \big ( (A_1,b_1),\dots ,(A_L,b_L) \big ) \in \mathcal {NN}(S)\) satisfying \(\Vert \Phi \Vert _{\mathrm {scaling}} \le C\) for some \(C>0\), we say that the network \(\Phi \) has \(C\)-bounded scaling weights. Note that this does not require the biases \(b_\ell \) of the network to satisfy \(|b_\ell | \le C\).
Our first goal in this subsection is to show that if \(\varrho \) denotes the ReLU, if \(S = (d, N_1,\dots ,N_L)\), if \(C > 0\), and if \(\Omega \subset \mathbb {R}^d\) is measurable and bounded, then the set
\[ \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S) {:}{=}\big \{ \mathrm {R}_{\varrho }^{\Omega }(\Phi ) \,:\, \Phi \in \mathcal {NN}(S) \text { with } \Vert \Phi \Vert _{\mathrm {scaling}} \le C \big \} \]
is closed in \(C(\Omega ;\mathbb {R}^{N_L})\) and in \(L^p (\mu ; \mathbb {R}^{N_L})\) for arbitrary \(p \in [1, \infty ]\). Here, and in the remainder of the paper, we use the norm \(\Vert f\Vert _{L^p(\mu ;\mathbb {R}^{N_L})} = \Vert \, |f| \,\Vert _{L^p(\mu )}\) for vector-valued \(L^p\)-spaces. The norm on \(C(\Omega ; \mathbb {R}^{N_L})\) is defined similarly. The difference between the following proposition and Proposition 3.5 is that in the following proposition, the “shift weights” (the biases) of the networks can be potentially unbounded. Therefore, the resulting set is merely closed, not compact.
Proposition 3.7
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(C > 0\), and let \(\Omega \subset \mathbb {R}^d\) be Borel measurable and bounded. Finally, let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}, x \mapsto \max \{ 0, x \}\) denote the ReLU function.
Then the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) is closed in \(L^p (\mu ;\mathbb {R}^{N_L})\) for every \(p \in [1,\infty ]\) and any finite Borel measure \(\mu \) on \(\Omega \). If \(\Omega \) is compact, then \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) is also closed in \(C(\Omega ;\mathbb {R}^{N_L})\).
Remark
In fact, the proof shows that each subset of \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) which is bounded in \(L^1 (\mu ; \mathbb {R}^{N_L})\) (when \(\mu (\Omega ) > 0\)) is precompact in \(L^p(\mu ; \mathbb {R}^{N_L})\) and in \(C(\Omega ; \mathbb {R}^{N_L})\).
Proof
For the proof of the statement, we refer to “Appendix D.7”. The main idea is to show that for every sequence \((\Phi _n)_{n \in \mathbb {N}} \subset \mathcal {NN}(S)\) of neural networks with uniformly bounded scaling weights and with \(\Vert \mathrm {R}_\varrho ^\Omega (\Phi _n) \Vert _{L^1(\mu )} \le M\), there exist a subsequence \((\Phi _{n_k})_{k\in \mathbb {N}}\) of \((\Phi _n)_{n\in \mathbb {N}}\) and neural networks \(({\widetilde{\Phi }}_{n_k})_{k\in \mathbb {N}}\) with uniformly bounded scaling weights and biases such that \( \mathrm {R}^\Omega _\varrho \big ( {\widetilde{\Phi }}_{n_k} \big ) = \mathrm {R}^\Omega _\varrho \big ( {\Phi }_{n_k} \big ) \). The rest then follows from Proposition 3.5. \(\square \)
As our second result in this section, we show that the set of realizations of two-layer neural networks with arbitrary scaling weights and biases is closed in \(C([-B,B]^d),\) if the activation is the parametric ReLU. It is a fascinating question for further research whether this also holds for deeper networks.
Theorem 3.8
Let \(d, N_0 \in \mathbb {N},\) let \(B > 0\), and let \(a \ge 0\). Let \(\varrho _a:\mathbb {R}\rightarrow \mathbb {R}, x \mapsto \max \{x, ax \}\) be the parametric ReLU. Then \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1))\) is closed in \(C([-B,B]^d)\).
Proof
For the proof of the statement, we refer to “Appendix D.8”; here we only sketch the main idea: First, note that each \(f \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1))\) is of the form \(f(x) = c + \sum _{i=1}^{N_0} \varrho _a (\langle \alpha _i , x \rangle + \beta _i)\). The proof is based on a careful—and quite technical—analysis of the singularity hyperplanes of the functions \(\varrho _a (\langle \alpha _i, x \rangle + \beta _i)\), that is, the hyperplanes \(\langle \alpha _i, x \rangle + \beta _i = 0\) on which these functions are not differentiable. More precisely, given a uniformly convergent sequence \((f_n)_{n \in \mathbb {N}} \subset \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1))\), we analyze how the singularity hyperplanes of the functions \(f_n\) behave as \(n \rightarrow \infty \), in order to show that the limit is again of the same form as the \(f_n\). For more details, we refer to the actual proof. \(\square \)
4 Failure of Inverse Stability of the Realization Map
In this section, we study the properties of the realization map \(\mathrm {R}^{\Omega }_{\varrho }\). First of all, we observe that the realization map is continuous.
Proposition 4.1
Let \(\Omega \subset \mathbb {R}^d\) be compact and let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture. If the activation function \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) is continuous, then the realization map from Eq. (1.1) is continuous. If \(\varrho \) is locally Lipschitz continuous, then so is \(\mathrm {R}^{\Omega }_{\varrho }\).
Finally, if \(\varrho \) is globally Lipschitz continuous, then there is a constant \(C = C(\varrho , S) > 0\) such that
\[ \mathrm {Lip}\big ( \mathrm {R}^{\Omega }_{\varrho }(\Phi ) \big ) \le C \cdot \Vert \Phi \Vert _{\mathrm {scaling}}^L \quad \text {for all } \Phi \in \mathcal {NN}(S). \]
Proof
For the proof of this statement, we refer to “Appendix E.1”. \(\square \)
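In general, the realization map is not injective; that is, there can be networks \(\Phi \ne \Psi \) such that \(\mathrm {R}^{\Omega }_{\varrho } (\Phi ) = \mathrm {R}^{\Omega }_{\varrho } (\Psi )\); in fact, if for instance
\[ \Phi = \big ( (A_1,b_1), \dots , (A_{L-1},b_{L-1}), (0,0) \big ) \]
and
\[ \Psi = \big ( (B_1,c_1), \dots , (B_{L-1},c_{L-1}), (0,0) \big ), \]
then the realizations of \(\Phi ,\Psi \) are identical, since both vanish identically.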
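Non-injectivity is easy to reproduce in code. The sketch below is a hypothetical illustration, not the authors' notation: a network is stored as a list of pairs `(A, b)`, the activation function is assumed to be the ReLU, and the architecture is \((1, 2, 1)\). The two networks `phi` and `psi` differ in their first layer but share a vanishing last layer, so both realize the zero function.

```python
from typing import List, Sequence, Tuple

# A layer is a pair (A, b): weight matrix (list of rows) and bias vector.
Layer = Tuple[List[List[float]], List[float]]


def relu(t: float) -> float:
    return max(0.0, t)


def realize(network: Sequence[Layer], x: Sequence[float]) -> List[float]:
    """Evaluate the realization at x: affine maps, activation after all but the last layer."""
    v = list(x)
    for i, (A, b) in enumerate(network):
        v = [sum(a * vj for a, vj in zip(row, v)) + bi for row, bi in zip(A, b)]
        if i < len(network) - 1:
            v = [relu(t) for t in v]
    return v


# Two different weight configurations of architecture (1, 2, 1) with zero last layer.
phi: List[Layer] = [([[1.0], [-2.0]], [0.5, 0.0]), ([[0.0, 0.0]], [0.0])]
psi: List[Layer] = [([[3.0], [4.0]], [-1.0, 2.0]), ([[0.0, 0.0]], [0.0])]

for x in (-1.0, 0.0, 0.5, 1.0):
    print(x, realize(phi, [x]), realize(psi, [x]))
```

Other sources of non-injectivity, such as permuting the neurons of a hidden layer, could be demonstrated with the same `realize` helper.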
In this section, our main goal is to determine whether, up to the failure of injectivity, the realization map is a homeomorphism onto its range; mathematically, this means that we want to determine whether the realization map is a quotient map. We will see that this is not the case.
To this end, we will prove for fixed \(\Phi \) that even if \(\mathrm {R}^{\Omega }_{\varrho } (\Psi )\) is very close to \(\mathrm {R}^{\Omega }_{\varrho }(\Phi )\), it is not true in general that \( \mathrm {R}^{\Omega }_{\varrho } (\Psi ) = \mathrm {R}^{\Omega }_{\varrho } ({\widetilde{\Psi }}) \) for network weights \({\widetilde{\Psi }}\) close to \(\Phi \). Precisely, this follows from the following theorem for \(\Phi = 0\) and \(\Psi = \Phi _n\).
Theorem 4.2
Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be Lipschitz continuous, but not affine-linear. Let \(S = (N_0,\dots ,N_{L-1},1)\) be a network architecture with \(L \ge 2\), with \(N_0 = d\), and \(N_1 \ge 3\). Let \(\Omega \subset \mathbb {R}^d\) be bounded with nonempty interior.
Then there is a sequence \((\Phi _n)_{n \in \mathbb {N}}\) of networks with architecture S and the following properties:
1. We have \(\mathrm {R}^{\Omega }_{\varrho } (\Phi _n) \rightarrow 0\) uniformly on \(\Omega \).
2. We have \(\mathrm {Lip}(\mathrm {R}^{\Omega }_{\varrho }(\Phi _n)) \rightarrow \infty \) as \(n \rightarrow \infty \).
Finally, if \((\Phi _n)_{n \in \mathbb {N}}\) is a sequence of networks with architecture S and the preceding two properties, then the following holds: For each sequence of networks \((\Psi _n)_{n \in \mathbb {N}}\) with architecture S and \(\mathrm {R}^{\Omega }_{\varrho } (\Psi _n) = \mathrm {R}^{\Omega }_{\varrho } (\Phi _n)\), we have \(\Vert \Psi _n\Vert _{\mathrm {scaling}} \rightarrow \infty \).
Proof
For the proof of the statement, we refer to “Appendix E.2”. The proof is based on the fact that the Lipschitz constant of the realization of a network essentially yields a lower bound on the \(\Vert \cdot \Vert _{\mathrm {scaling}}\) norm of every neural network with this realization. We construct neural networks \(\Phi _n\) the realizations of which have small amplitude but high Lipschitz constants. The associated realizations uniformly converge to 0, but every associated neural network must have exploding weights. \(\square \)
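A sequence with small amplitude and large Lipschitz constant can be written down explicitly in a simple case. The following sketch is an illustration under assumptions that need not match the construction in the appendix: \(d = 1\), \(\Omega = [-1,1]\), \(\varrho \) is the ReLU, and three hidden neurons are used (in line with \(N_1 \ge 3\)). With the hat function \(t \mapsto \varrho (t) - 2\varrho (t-1) + \varrho (t-2)\), the function \(f_n(x) = \tfrac{1}{n}\,\mathrm {hat}(n^2 x)\) has supremum \(1/n\) and slope \(n\) near the origin. The function names are hypothetical.

```python
from typing import Tuple


def relu(t: float) -> float:
    return max(0.0, t)


def f_n(x: float, n: int) -> float:
    """(1/n) * hat(n^2 x) with hat(t) = relu(t) - 2 relu(t - 1) + relu(t - 2)."""
    t = n * n * x
    return (relu(t) - 2.0 * relu(t - 1.0) + relu(t - 2.0)) / n


def sup_and_slope(n: int) -> Tuple[float, float]:
    """Supremum and largest difference quotient of f_n on a grid of [-1, 1]."""
    h = 1.0 / (4 * n * n)  # grid fine enough to resolve the spike of width 2 / n^2
    values = [f_n(k * h, n) for k in range(-4 * n * n, 4 * n * n + 1)]
    sup = max(abs(v) for v in values)
    slope = max(abs(values[i + 1] - values[i]) / h for i in range(len(values) - 1))
    return sup, slope


for n in (1, 2, 4, 8, 16):
    sup, slope = sup_and_slope(n)
    print(f"n = {n:2d}   sup = {sup:.4f}   largest slope = {slope:.1f}")
```

Here the first-layer weight \(n^2\) grows while the realizations tend to zero uniformly, which matches the two properties required of the sequence \((\Phi _n)_{n \in \mathbb {N}}\) in Theorem 4.2.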
We finally rephrase the preceding result in more topological terms:
Corollary 4.3
Under the assumptions of Theorem 4.2, the realization map \(\mathrm {R}^{\Omega }_{\varrho }\) from Eq. (1.1) is not a quotient map when considered as a map onto its range.
Proof
For the proof of the statement, we refer to “Appendix E.3”. \(\square \)
Notes
A subset A of some vector space V is called star-shaped, if there exists some \(f\in A\) such that for all \(g \in A\), also \(\{\lambda f + (1 - \lambda )g :\lambda \in [0,1]\} \subset A\). The vector f is called a center of A.
References
Z. Allen-Zhu, Y. Li, and Z. Song, A Convergence Theory for Deep Learning via Over-Parameterization, Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 242–252.
H. Amann and J. Escher, Analysis III, Birkhäuser Verlag, Basel, 2009.
P. M. Anselone and J. Korevaar, Translation Invariant Subspaces of Finite Dimension, Proc. Amer. Math. Soc. 15 (1964), 747–752.
M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999.
F. Bach, Breaking the Curse of Dimensionality with Convex Neural Networks, J. Mach. Learn. Res. 18 (2017), no. 1, 629–681.
P. Baldi and K. Hornik, Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima, Neural Netw. 2 (1988), no. 1, 53–58.
A.R. Barron, Universal Approximation Bounds for Superpositions of a Sigmoidal Function, IEEE Trans. Inf. Theory 39 (1993), no. 3, 930–945.
P. L. Bartlett and S. Ben-David, Hardness Results for Neural Network Approximation Problems, Theor. Comput. Sci. 284 (2002), no. 1, 53–66.
P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, Spectrally-normalized margin bounds for neural networks, Adv. Neural Inf. Process. Syst. 30, 2017, pp. 6240–6249.
P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks, J. Mach. Learn. Res. 20 (2019), no. 63, 1–17.
P. L. Bartlett and S. Mendelson, Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, J. Mach. Learn. Res. 3 (2002), 463–482.
J. Bergstra, G. Desjardins, P. Lamblin, and Y. Bengio, Quadratic Polynomials Learn Better Image Features, Technical Report 1337, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal, 2009.
A. Blum and R.L. Rivest, Training a 3-node neural network is NP-complete, Adv. Neural Inf. Process. Syst. 2, 1989, pp. 494–501.
H. Bölcskei, P. Grohs, G. Kutyniok, and P. C. Petersen, Optimal Approximation with Sparsely Connected Deep Neural Networks, SIAM J. Math. Data Sci. 1 (2019), 8–45.
B. Carlile, G. Delamarter, P. Kinney, A. Marti, and B. Whitney, Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs), arXiv preprint arXiv:1710.09967 (2017).
L. Chizat and F. Bach, On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport, Adv. Neural Inf. Process. Syst. 31, 2018, pp. 3036–3046.
D. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, 2016.
N. Cohen, O. Sharir, and A. Shashua, On the Expressive Power of Deep Learning: A Tensor Analysis, Conference on learning theory, 2016, pp. 698–728.
D. L. Cohn, Measure Theory, Birkhäuser Verlag, Basel, 2013.
F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Am. Math. Soc. 39 (2002), 1–49.
G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signal 2 (1989), no. 4, 303–314.
G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, IEEE Audio, Speech, Language Process. 20 (2012), no. 1, 30–42.
J. Dieudonné, Foundations of Modern Analysis, Pure and Applied Mathematics, Vol. X, Academic Press, New York-London, 1960.
W. E, J. Han, and A. Jentzen, Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations, Commun. Math. Stat. 5 (2017), no. 4, 349–380.
W. E and B. Yu, The Deep Ritz Method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics 6 (2018), no. 1, 1–12.
G. B. Folland, Real Analysis, Pure and Applied Mathematics (New York), Wiley, New York, 1999.
C. D. Freeman and J. Bruna, Topology and Geometry of Half-Rectified Network Optimization, 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, conference track proceedings, 2017.
F. Girosi and T. Poggio, Networks and the best approximation property, Biol. Cybern. 63 (1990), no. 3, 169–176.
X. Glorot, A. Bordes, and Y. Bengio, Deep Sparse Rectifier Neural Networks, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2016. http://www.deeplearningbook.org.
L. Grafakos, Classical Fourier Analysis, Graduate Texts in Mathematics, vol. 249, Springer, New York, 2008.
S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall PTR, Upper Saddle River, 1998.
K. He, X. Zhang, S. Ren, and J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Process. Mag. 29 (2012), no. 6, 82–97.
K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989), no. 5, 359–366.
G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely Connected Convolutional Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 2261–2269.
A. Jacot, F. Gabriel, and C. Hongler, Neural Tangent Kernel: Convergence and Generalization in Neural Networks, Adv. Neural Inf. Process. Syst. 31, 2018, pp. 8571–8580.
S. Judd, Learning in Networks is Hard, Proceedings of IEEE International Conference on Neural Networks, 1987, pp. 685–692.
P. Kainen, V. Kurková, and A. Vogt, Best approximation by Heaviside perceptron networks, Neural Netw. 13 (2000), no. 7, 695–697.
P. C. Kainen, V. Kurková, and A. Vogt, Approximation by neural networks is not continuous, Neurocomputing 29 (1999), no. 1-3, 47–56.
A. Krizhevsky, I. Sutskever, and G.E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Adv. Neural Inf. Process. Syst. 25, 2012, pp. 1097–1105.
I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Trans. Neural Netw. 9 (1998), no. 5, 987–1000.
Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521 (2015), no. 7553, 436–444.
J. M. Lee, Introduction to Topological Manifolds, Graduate Texts in Mathematics, vol. 202, Springer, New York, 2011.
M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (1993), no. 6, 861–867.
B. Liao, C. Ma, L. Xiao, R. Lu, and L. Ding, An Arctan-Activated WASD Neural Network Approach to the Prediction of Dow Jones Industrial Average, Advances in neural networks - ISNN 2017 - 14th international symposium, ISNN 2017, Sapporo, Hakodate, and Muroran, Hokkaido, Japan, June 21-26, 2017, Proceedings, Part I, 2017, pp. 120–126.
A. Maas, Y. Hannun, and A. Ng, Rectifier Nonlinearities Improve Neural Network Acoustic Models, ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
V. E. Maiorov, Best approximation by ridge functions in \(L_p\)-spaces, Ukraïn. Mat. Zh. 62 (2010), no. 3, 396–408.
W. McCulloch and W. Pitts, A logical calculus of ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943), 115–133.
S. Mei, A. Montanari, and P.-M. Nguyen, A mean field view of the landscape of two-layer neural networks, Proc. Natl. Acad. Sci. USA 115 (2018), no. 33, E7665–E7671.
H. N. Mhaskar, Approximation properties of a multilayered feedforward artificial neural network, Adv. Comput. Math. 1 (1993), no. 1, 61–80.
H. N. Mhaskar, Neural networks for optimal approximation of smooth and analytic functions, Neural Comput. 8 (1996), no. 1, 164–177.
M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, 2012.
G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, On the Number of Linear Regions of Deep Neural Networks, Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, pp. 2924–2932.
V. Nair and G. Hinton, Rectified Linear Units Improve Restricted Boltzmann machines, Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 807–814.
Q. Nguyen and M. Hein, The Loss Surface of Deep and Wide Neural Networks, Proceedings of the 34th International Conference on Machine Learning-volume 70, 2017, pp. 2603–2612.
P. C. Petersen and F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Netw. 108 (2018), 296–330.
PhoemueX (https://math.stackexchange.com/users/151552/phoemuex), Uncountable closed set A, existence of point at which A accumulates “from two sides” of a hyperplane, 2020. URL: https://math.stackexchange.com/q/3513692 (version: 2020-01-18).
G. M. Rotskoff and E. Vanden-Eijnden, Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error, arXiv preprint arXiv:1805.00915 (2018).
W. Rudin, Real and Complex Analysis, McGraw-Hill Book Co., New York, 1987.
W. Rudin, Functional Analysis, International Series in Pure and Applied Mathematics, McGraw-Hill, Inc., New York, 1991.
I. Safran and O. Shamir, Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks, Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2979–2987.
J. Schmidhuber, Deep learning in neural networks: An overview, Neural Netw. 61 (2015), 85–117.
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, arXiv preprint arXiv:1712.01815 (2017).
Karen Simonyan and Andrew Zisserman, Very deep convolutional networks for large-scale image recognition, International conference on learning representations, 2015.
N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, Episodic Exploration for Deep Deterministic Policies for StarCraft Micromanagement, 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
L. Venturi, A. S. Bandeira, and J. Bruna, Neural Networks with Finite Intrinsic Dimension have no Spurious Valleys, arXiv preprint, arXiv:1802.06384 (2018).
A. J. Ward, The Structure of Non-Enumerable Sets of Points, J. London Math. Soc. 8 (1933), no. 2, 109–112.
C. Wu, P. Karanasou, M. J. F. Gales, and K. C. Sim, Stimulated Deep Neural Network for Speech Recognition, University of Cambridge, 2016.
G. N. Yannakakis and J. Togelius, Artificial Intelligence and Games, Springer, 2018.
D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Netw. 94 (2017), 103–114.
D. Yarotsky and A. Zhevnerchuk, The phase diagram of approximation rates for deep neural networks, arXiv preprint arXiv:1906.09477 (2019).
Y. Zhang, P. Liang, and M. J. Wainwright, Convexified Convolutional Neural Networks, Proceedings of the 34th International Conference on Machine Learning-volume 70, 2017, pp. 4044–4053.
Acknowledgements
Open access funding provided by University of Vienna. P.P. and M.R. were supported by the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics”. P.P. was supported by a DFG Research Fellowship “Shearlet-based energy functionals for anisotropic phase-field methods”. M.R. is supported by the Berlin Mathematical School. F.V. acknowledges support from the European Commission through DEDALE (contract no. 665044) within the H2020 Framework Program.
Additional information
Communicated by Francis Bach.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Notation
The symbol \(\mathbb {N}\) denotes the natural numbers \(\mathbb {N}= \{1,2,3,\dots \}\), whereas \(\mathbb {N}_0 = \{ 0 \} \cup \mathbb {N}\) stands for the natural numbers including zero. Moreover, we set \(\mathbb {N}_{\ge d} {:}{=}\{ n \in \mathbb {N}:n \ge d \}\) for \(d \in \mathbb {N}\). The number of elements of a set M will be denoted by \(|M| \in \mathbb {N}_0 \cup \{\infty \}\). Furthermore, we write \({\underline{n}} {:}{=}\{k \in \mathbb {N}\,:\, k \le n\}\) for \(n \in \mathbb {N}_0\). In particular, \({\underline{0}} = \varnothing \).
For two sets A, B, a map \(f : A \rightarrow B\), and \(C \subset A\), we write \(f|_{C}\) for the restriction of f to C. For a set A, we denote by \(\chi _A = {\mathbb {1}}_A\) the indicator function of A, so that \(\chi _A (x) = 1\) if \(x \in A\) and \(\chi _A (x) = 0\) otherwise. For any \(\mathbb {R}\)-vector space \({\mathcal {Y}}\) we write \(A + B {:}{=}\{ a + b \,:\, a \in A, b \in B \}\) and \(\lambda A {:}{=}\{\lambda a \,:\, a \in A\}\), for \(\lambda \in \mathbb {R}\) and subsets \(A,B \subset {\mathcal {Y}}\).
The algebraic dual space of a \({\mathbb {K}}\)-vector space \({\mathcal {Y}}\) (with \({\mathbb {K}} = \mathbb {R}\) or \({\mathbb {K}} = \mathbb {C}\)), that is the space of all linear functions \(\varphi : {\mathcal {Y}} \rightarrow {\mathbb {K}}\), will be denoted by \({\mathcal {Y}}^*\). In contrast, if \({\mathcal {Y}}\) is a topological vector space, we denote by \({\mathcal {Y}}'\) the topological dual space of \({\mathcal {Y}}\), which consists of all functions \(\varphi \in {\mathcal {Y}}^*\) that are continuous.
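Given functions \((f_i)_{i \in {\underline{n}}}\) with \(f_i : X_i \rightarrow Y_i\), we consider three different types of products between these maps: The Cartesian product of \(f_1, \dots , f_n\) is
\[ f_1 \times \cdots \times f_n : X_1 \times \cdots \times X_n \rightarrow Y_1 \times \cdots \times Y_n, \quad (x_1, \dots , x_n) \mapsto \big ( f_1(x_1), \dots , f_n(x_n) \big ) . \]
The tensor product of \(f_1, \dots , f_n\) is defined if \(Y_1, \dots , Y_n \subset \mathbb {C}\), and is then given by
\[ f_1 \otimes \cdots \otimes f_n : X_1 \times \cdots \times X_n \rightarrow \mathbb {C}, \quad (x_1, \dots , x_n) \mapsto f_1(x_1) \cdots f_n(x_n) . \]
Finally, the direct sum of \(f_1, \dots , f_n\) is defined if \(X_1 = \dots = X_n\), and given by
\[ f_1 \oplus \cdots \oplus f_n : X_1 \rightarrow Y_1 \times \cdots \times Y_n, \quad x \mapsto \big ( f_1(x), \dots , f_n(x) \big ) . \]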
The closure of a subset A of a topological space will be denoted by \({\overline{A}}\), while the interior of A is denoted by \(A^\circ \). For a metric space \(({\mathcal {U}}, d)\), we write \(B_\varepsilon (x){:}{=}\{y \in {\mathcal {U}}: d(x,y)< \varepsilon \}\) for the \(\varepsilon \)-ball around x, where \(x \in {\mathcal {U}}\) and \(\varepsilon > 0\). Furthermore, for a Lipschitz continuous function \(f : {\mathcal {U}}_1 \rightarrow {\mathcal {U}}_2\) between two metric spaces \({\mathcal {U}}_1\) and \({\mathcal {U}}_2\), we denote by \(\mathrm {Lip}(f)\) the smallest possible Lipschitz constant for f.
For \(d \in \mathbb {N}\) and a function \(f: A \rightarrow \mathbb {R}^{d}\) or a vector \(v \in \mathbb {R}^{d}\), we denote for \(j\in \{1, \dots , d\}\) the j-th component of f or v by \((f)_j\) or \(v_j\), respectively. As an example, the Euclidean scalar product on \(\mathbb {R}^d\) is given by \(\langle x,y \rangle = \sum _{i=1}^d x_i \, y_i\). We denote the Euclidean norm by \(|x| := \sqrt{\langle x,x \rangle }\) for \(x \in \mathbb {R}^d\). For a matrix \(A \in \mathbb {R}^{n \times d}\), let \(\Vert A \Vert _{\max } {:}{=}\max _{i = 1,\dots ,n} \,\, \max _{j = 1,\dots ,d} | A_{i,j} |\). The transpose of a matrix \(A \in \mathbb {R}^{n \times d}\) will be denoted by \(A^T \in \mathbb {R}^{d \times n}\). For \(A \in \mathbb {R}^{n \times d}\), \(i \in \{1,\dots ,n\}\) and \(j \in \{1,\dots ,d\}\), we denote by \(A_{i,-} \in \mathbb {R}^d\) the i-th row of A and by \(A_{-,j} \in \mathbb {R}^n\) the j-th column of A. The Euclidean unit sphere in \(\mathbb {R}^d\) will be denoted by \(S^{d-1} \subset \mathbb {R}^d\).
For \(n \in \mathbb {N}\) and \(\varnothing \ne \Omega \subset \mathbb {R}^d\), we denote by \(C(\Omega ; \mathbb {R}^n)\) the space of all continuous functions defined on \(\Omega \) with values in \(\mathbb {R}^n\). If \(\Omega \) is compact, then \((C(\Omega ; \mathbb {R}^n),\Vert \cdot \Vert _{\sup })\) denotes the Banach space of \(\mathbb {R}^n\)-valued continuous functions equipped with the supremum norm, where we use the Euclidean norm on \(\mathbb {R}^n\). If \(n = 1\), then we shorten the notation to \(C(\Omega )\).
We note that on \(C(\Omega )\), the supremum norm coincides with the \(L^\infty (\Omega )\)-norm, if for all \(x\in \Omega \) and for all \(\varepsilon > 0\) we have that \(\lambda (\Omega \cap B_{\varepsilon }(x))>0,\) where \(\lambda \) denotes the Lebesgue measure on \(\mathbb {R}^d\). For any nonempty set \(U \subset \mathbb {R}\), we say that a function \(f : U \rightarrow \mathbb {R}\) is increasing if \(f(x) \le f(y)\) for every \(x,y \in U\) with \(x < y\). If even \(f(x) < f(y)\) for all such x, y, we say that f is strictly increasing. The terms “decreasing” and “strictly decreasing” are defined analogously.
The Schwartz space will be denoted by \({\mathcal {S}}(\mathbb {R}^d)\) and the space of tempered distributions by \({\mathcal {S}}'(\mathbb {R}^d)\). The associated bilinear dual pairing will be denoted by \(\langle \cdot ,\cdot \rangle _{{\mathcal {S}}',{\mathcal {S}}}\). We refer to [26, Sects. 8.1–8.3 and 9.2] for more details on the spaces \({\mathcal {S}}(\mathbb {R}^d)\) and \({\mathcal {S}}'(\mathbb {R}^d)\). Finally, the Dirac delta distribution \(\delta _x\) at \(x \in \mathbb {R}^d\) is given by \(\delta _x : C(\mathbb {R}^d) \rightarrow \mathbb {R}, f \mapsto f(x)\).
Appendix B: Auxiliary Results: Operations with Neural Networks
This part of the appendix is devoted to auxiliary results that are connected with basic operations one can perform with neural networks and which we will frequently make use of in the proofs below.
We start by showing that one can “enlarge” a given neural network in such a way that the realizations of the original network and the enlarged network coincide. To be more precise, the following holds:
Lemma B.1
Let \(d, L\in \mathbb {N}\), \(\Omega \subset \mathbb {R}^d\), and \(\varrho :\mathbb {R}\rightarrow \mathbb {R}\). Also, let \(\Phi = \big ( (A_1, b_1), \dots , (A_L, b_L) \big )\) be a neural network with architecture \((d, N_1, \dots , N_L)\) and let \({\widetilde{N}}_1, \dots , {\widetilde{N}}_{L-1} \in \mathbb {N}\) such that \({\widetilde{N}}_\ell \ge N_\ell \) for all \(\ell = 1, \dots , L-1\). Then, there exists a neural network \({\widetilde{\Phi }}\) with architecture \((d,{\widetilde{N}}_1,\dots ,{\widetilde{N}}_{L-1},N_L)\) and such that \(\mathrm {R}^{\Omega }_{\varrho }(\Phi )=\mathrm {R}^{\Omega }_{\varrho }({\widetilde{\Phi }})\).
Proof
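Setting \(N_0 {:}{=}{\widetilde{N}}_0 {:}{=}d\), and \({\widetilde{N}}_L {:}{=}N_L\), we define \( {\widetilde{\Phi }} {:}{=}\left( ({\widetilde{A}}_1,{\widetilde{b}}_1), \dots , ({\widetilde{A}}_L,{\widetilde{b}}_L) \right) \) by
\[ {\widetilde{A}}_\ell {:}{=}\begin{pmatrix} A_\ell & 0_{N_\ell \times ({\widetilde{N}}_{\ell -1} - N_{\ell -1})} \\ 0_{({\widetilde{N}}_\ell - N_\ell ) \times N_{\ell -1}} & 0_{({\widetilde{N}}_\ell - N_\ell ) \times ({\widetilde{N}}_{\ell -1} - N_{\ell -1})} \end{pmatrix} \in \mathbb {R}^{{\widetilde{N}}_\ell \times {\widetilde{N}}_{\ell -1}} \]
and
\[ {\widetilde{b}}_\ell {:}{=}\begin{pmatrix} b_\ell \\ 0_{{\widetilde{N}}_\ell - N_\ell } \end{pmatrix} \in \mathbb {R}^{{\widetilde{N}}_\ell } \]
for \(\ell = 1, \dots , L\). Here, \(0_{m_1 \times m_2}\) and \(0_k\) denote the zero-matrix in \(\mathbb {R}^{m_1 \times m_2}\) and the zero vector in \(\mathbb {R}^k\), respectively; blocks with a vanishing dimension are omitted. The outputs of the added neurons in layer \(\ell \) are multiplied by the zero columns of \({\widetilde{A}}_{\ell +1}\) and hence do not influence the subsequent layers, so that \(\mathrm {R}_\varrho ^\Omega ({\widetilde{\Phi }}) = \mathrm {R}_\varrho ^\Omega (\Phi )\). This yields the claim. \(\square \)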
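The enlargement in Lemma B.1 can be checked numerically. The following is a minimal sketch in plain Python, assuming a network is stored as a list of pairs `(A, b)` with `A` a list of rows; the helper names `realize` and `pad_network` are hypothetical and not taken from the paper.

```python
import math

def realize(network, x, act=math.tanh):
    """Evaluate [(A_1, b_1), ..., (A_L, b_L)] at x; no activation after the last layer."""
    for idx, (A, b) in enumerate(network):
        x = [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]
        if idx < len(network) - 1:
            x = [act(v) for v in x]
    return x

def pad_network(network, widths):
    """Zero-pad the hidden layers to the given widths (one width per hidden layer)."""
    in_dims = [len(network[0][0][0])] + list(widths)   # enlarged N_0, ..., N_{L-1}
    out_dims = list(widths) + [len(network[-1][0])]    # enlarged N_1, ..., N_L
    padded = []
    for (A, b), n_in, n_out in zip(network, in_dims, out_dims):
        # Extend every existing row by zero columns, then append zero rows.
        A_new = [list(row) + [0.0] * (n_in - len(row)) for row in A]
        A_new += [[0.0] * n_in for _ in range(n_out - len(A))]
        b_new = list(b) + [0.0] * (n_out - len(b))
        padded.append((A_new, b_new))
    return padded

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Architecture (2, 2, 1), enlarged to (2, 5, 1).
phi = [([[1.0, -2.0], [0.5, 0.3]], [0.1, -0.2]),
       ([[2.0, -1.0]], [0.7])]
phi_big = pad_network(phi, widths=[5])
```

The sigmoid is included because its value at zero is nonzero: the added neurons then output 0.5, yet the realization is unchanged since the corresponding columns of the next weight matrix vanish.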
Another operation that we can perform with networks is concatenation, as given in the following definition.
Definition B.2
Let \(L_1, L_2 \in \mathbb {N}\) and let
\[ \Phi ^1 = \big ( (A_1^1, b_1^1), \dots , (A_{L_1}^1, b_{L_1}^1) \big ), \quad \Phi ^2 = \big ( (A_1^2, b_1^2), \dots , (A_{L_2}^2, b_{L_2}^2) \big ) \]
be two neural networks such that the input layer of \(\Phi ^1\) has the same dimension as the output layer of \(\Phi ^2\). Then, \(\Phi ^1 \bullet \Phi ^2\)
denotes the following \(L_1+L_2-1\) layer network:
\[ \Phi ^1 \bullet \Phi ^2 {:}{=}\big ( (A_1^2, b_1^2), \dots , (A_{L_2-1}^2, b_{L_2-1}^2), (A_1^1 A_{L_2}^2, \, A_1^1 b_{L_2}^2 + b_1^1), (A_2^1, b_2^1), \dots , (A_{L_1}^1, b_{L_1}^1) \big ) . \]
We call \(\Phi ^1 \bullet \Phi ^2\) the concatenation of \(\Phi ^1\) and \(\Phi ^2\).
One directly verifies that for every \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) the definition of concatenation is reasonable, that is, if \(d_i\) is the dimension of the input layer of \(\Phi ^i\), \(i = 1,2\), and if \(\Omega \subset \mathbb {R}^{d_2}\), then \(\mathrm {R}_\varrho ^\Omega (\Phi ^1 \bullet \Phi ^2) = \mathrm {R}_\varrho ^{\mathbb {R}^{d_1}} (\Phi ^1) \circ \mathrm {R}_\varrho ^\Omega (\Phi ^2)\). If \(\Phi ^2\) has architecture \((d, N_1, \dots , N_{L_2})\) and \(\Phi ^1\) has architecture \((N_{L_2}, {\widetilde{N}}_{1},\dots ,{\widetilde{N}}_{L_1-1}, {\widetilde{N}}_{L_1})\), then the neural network \(\Phi ^1 \bullet \Phi ^2\)
has architecture \((d, N_1, \dots , N_{L_2 - 1}, {\widetilde{N}}_{1}, \dots , {\widetilde{N}}_{L_1})\). Therefore,
.
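As an illustration of Definition B.2, the sketch below assumes that concatenation merges the last affine map of the second network with the first affine map of the first network, which is the only way to obtain \(L_1 + L_2 - 1\) layers with the architecture stated above. The list-of-pairs representation and the names `realize` and `concatenate` are hypothetical.

```python
import math

def realize(network, x, act=math.tanh):
    """Evaluate [(A_1, b_1), ..., (A_L, b_L)] at x; no activation after the last layer."""
    for idx, (A, b) in enumerate(network):
        x = [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]
        if idx < len(network) - 1:
            x = [act(v) for v in x]
    return x

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    return [sum(a * t for a, t in zip(row, v)) for row in A]

def concatenate(phi1, phi2):
    """Network whose realization is realize(phi1) composed with realize(phi2)."""
    A1, b1 = phi1[0]     # first affine map of phi1
    A2, b2 = phi2[-1]    # last affine map of phi2 (no activation follows it)
    merged = (matmul(A1, A2), [s + t for s, t in zip(matvec(A1, b2), b1)])
    return list(phi2[:-1]) + [merged] + list(phi1[1:])

# phi2 has architecture (2, 3, 2); phi1 has architecture (2, 2, 1).
phi2 = [([[1.0, 0.5], [-0.3, 0.8], [0.2, -1.0]], [0.0, 0.1, -0.1]),
        ([[0.4, -0.6, 1.0], [1.5, 0.2, -0.7]], [0.3, -0.2])]
phi1 = [([[1.0, -1.0], [0.5, 2.0]], [0.1, 0.0]),
        ([[0.7, -0.4]], [0.05])]
phi12 = concatenate(phi1, phi2)   # architecture (2, 3, 2, 1), three layers
```

Evaluating `realize(phi12, x)` and `realize(phi1, realize(phi2, x))` at a few points should give the same value up to rounding.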
We close this section by showing that under mild assumptions on \(\varrho \)—which are always satisfied in practice—and on the network architecture, one can construct a neural network which locally approximates the identity mapping \(\mathrm {id}_{\mathbb {R}^d}\) to arbitrary accuracy. Similarly, one can obtain a neural network the realization of which approximates the projection onto the i-th coordinate. The main ingredient of the proof is the approximation \( x \approx \frac{\varrho (x_0 + x) - \varrho (x_0)}{\varrho '(x_0)} , \) which holds for |x| small enough and where \(x_0\) is chosen such that \(\varrho '(x_0) \ne 0\).
Proposition B.3
Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous, and assume that there exists \(x_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho ' (x_0) \ne 0\). Then, for every \(\varepsilon> 0, d \in \mathbb {N}, B > 0\) and every \(L\in \mathbb {N}\) there exists a neural network \(\Phi _\varepsilon ^B\in \mathcal {NN}((d,d, \dots , d))\) with L layers such that
- \(\left| \mathrm {R}_{\varrho }^{[-B,B]^d}(\Phi _{\varepsilon }^B)(x)-x\right| \le \varepsilon \) for all \(x\in [-B,B]^d\);
- \(\mathrm {R}_{\varrho }^{[-B,B]^d}({\Phi }_{\varepsilon }^B)(0)=0\);
- \(\mathrm {R}_\varrho ^{[-B,B]^d} (\Phi _\varepsilon ^B)\) is totally differentiable at \(x = 0\) and its Jacobian matrix fulfills \( D \big ( \mathrm {R}_\varrho ^{[-B,B]^d} (\Phi _\varepsilon ^B) \big ) (0) = \mathrm {id}_{\mathbb {R}^d}; \)
- for \(j \in \{1, \dots , d\}\), \(\left( \mathrm {R}_{\varrho }^{[-B,B]^d}({\Phi }_{\varepsilon }^B)\right) _j\) is constant in all but the j-th coordinate.
Furthermore, for every \(d, L \in \mathbb {N}\), \(\varepsilon > 0\), \(B > 0\) and every \(i\in \{1,\dots ,d\}\), one can construct a neural network \({\widetilde{\Phi }}_{\varepsilon ,i}^B \in \mathcal {NN}((d,1, \dots , 1))\) with L layers such that
- \(\left| \mathrm {R}_{\varrho }^{[-B,B]^d}({\widetilde{\Phi }}_{\varepsilon ,i}^B)(x) -x_i \right| \le \varepsilon \) for all \(x\in [-B,B]^d\);
- \(\mathrm {R}_{\varrho }^{[-B,B]^d}({\widetilde{\Phi }}_{\varepsilon ,i}^B)(0)=0\);
- \(\mathrm {R}_\varrho ^{[-B,B]^d} ({\widetilde{\Phi }}_{\varepsilon ,i}^B)\) is partially differentiable at \(x = 0\), with \( \frac{\partial }{\partial x_i}\Big |_{x=0} \mathrm {R}_{\varrho }^{[-B,B]^d}({\widetilde{\Phi }}_{\varepsilon ,i}^B)(x) = 1 \); and
- \(\mathrm {R}_{\varrho }^{[-B,B]^d}({\widetilde{\Phi }}_{\varepsilon ,i}^B)\) is constant in all but the i-th coordinate.
Finally, if \(\varrho \) is increasing, then \(\big (\mathrm {R}_{\varrho }^{[-B,B]^d}({\Phi }_{\varepsilon }^B)\big )_j\) and \(\mathrm {R}_{\varrho }^{[-B,B]^d}({\widetilde{\Phi }}_{\varepsilon ,i}^B)\) are monotonically increasing in every coordinate and for all \(j\in \{1, \dots , d\}\).
Proof
We first consider the special case \(L = 1\). Here, we can take \(\Phi _\varepsilon ^B {:}{=}( (\mathrm {id}_{\mathbb {R}^d}, 0) )\) and \({{\widetilde{\Phi }}_{\varepsilon ,i}^B {:}{=}( (e_i ,0) )}\), with \(e_i \in \mathbb {R}^{1 \times d}\) denoting the i-th standard basis vector in \({\mathbb {R}^d \cong \mathbb {R}^{1 \times d}}\). In this case, \(\mathrm {R}_\varrho ^{[-B,B]^d} (\Phi _\varepsilon ^B) = \mathrm {id}_{\mathbb {R}^d}\) and \(\mathrm {R}_\varrho ^{[-B,B]^d}({\widetilde{\Phi }}_{\varepsilon ,i}^B) (x) = x_i\) for all \(x \in [-B,B]^d\), which implies that all claimed properties are satisfied. Thus, we can assume in the following that \(L \ge 2\).
Without loss of generality, we only consider the case \(\varepsilon \le 1\). Define \(\varepsilon ' {:}{=}\varepsilon / (dL)\). Let \(x_0 \in \mathbb {R}\) be such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho ' (x_0) \ne 0\).
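We set \(r_0 {:}{=}\varrho (x_0)\) and \(s_0{:}{=}\varrho '(x_0)\). Next, for \(C > 0\), we define
\[ \varrho _C : \mathbb {R}\rightarrow \mathbb {R}, \quad x \mapsto \frac{C}{s_0} \Big ( \varrho \big ( \tfrac{x}{C} + x_0 \big ) - r_0 \Big ) . \]
We claim that there is some \(C_0 > 0\) such that \(|\varrho _C(x) - x| \le \varepsilon '\) for all \(x \in [-B-L\varepsilon ,B+L\varepsilon ]\) and all \(C \ge C_0\). To see this, first note by definition of the derivative that there is some \(\delta > 0\) with
\[ \big | \varrho (x_0 + t) - r_0 - s_0 \, t \big | \le \frac{\varepsilon ' \, |s_0|}{B+L} \cdot |t| \quad \text {for all } t \in [-\delta , \delta ] . \]
Here we implicitly used that \(s_0 = \varrho '(x_0) \ne 0\) to ensure that the right-hand side is a positive multiple of |t|. Now, set \(C_0 {:}{=}(B+L)/\delta \), and let \(C \ge C_0\) be arbitrary. Note because of \(\varepsilon ' \le \varepsilon \le 1\) that every \(x \in [-B-L\varepsilon ,B+L\varepsilon ]\) satisfies \(| x | \le B+L\). Hence, if we set \(t {:}{=}x/C\), then \(| t | \le \delta \). Therefore,
\[ |\varrho _C(x) - x| = \frac{C}{|s_0|} \, \big | \varrho (x_0 + t) - r_0 - s_0 \, t \big | \le \frac{C}{|s_0|} \cdot \frac{\varepsilon ' \, |s_0|}{B+L} \cdot \frac{|x|}{C} = \varepsilon ' \cdot \frac{|x|}{B+L} \le \varepsilon ' . \]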
Note that \(\varrho _C\) is differentiable at 0 with derivative \(\varrho _C ' (0) = \frac{C}{s_0} \varrho ' (x_0) \frac{1}{C} = 1\), thanks to the chain rule.
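The behaviour of the rescaled activation can be observed numerically. The sketch below assumes the form \(\varrho _C(x) = \frac{C}{s_0} \big ( \varrho (x/C + x_0) - \varrho (x_0) \big )\), which is consistent with the derivative computation above, and uses the sigmoid with \(x_0 = 1\) as an arbitrary example; the helper name `make_rho_C` is hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def make_rho_C(rho, drho, x0, C):
    """Rescaled activation x -> (C / s0) * (rho(x / C + x0) - r0)."""
    r0, s0 = rho(x0), drho(x0)
    return lambda x: C / s0 * (rho(x / C + x0) - r0)

B = 2.0
grid = [-B + 2.0 * B * k / 200 for k in range(201)]

# Largest deviation from the identity on [-B, B] for growing C.
errors = {}
for C in (10.0, 100.0, 1000.0):
    rho_C = make_rho_C(sigmoid, dsigmoid, x0=1.0, C=C)
    errors[C] = max(abs(rho_C(x) - x) for x in grid)
```

For this choice the deviation should shrink roughly like \(1/C\), in line with the claim that \(|\varrho _C(x) - x| \le \varepsilon '\) once \(C\) is large enough.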
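Using these preliminary observations, we now construct the neural networks \(\Phi _\varepsilon ^B\) and \({\widetilde{\Phi }}_{\varepsilon ,i}^B\). Define \({\Phi _0^C {:}{=}\big ( (A_1, b_1),(A_2,b_2) \big )}\), where
\[ A_1 {:}{=}\frac{1}{C} \, \mathrm {id}_{\mathbb {R}^d} \in \mathbb {R}^{d \times d}, \quad b_1 {:}{=}x_0 \cdot (1, \dots , 1)^T \in \mathbb {R}^d \]
and
\[ A_2 {:}{=}\frac{C}{s_0} \, \mathrm {id}_{\mathbb {R}^d} \in \mathbb {R}^{d \times d}, \quad b_2 {:}{=}- \frac{C \, r_0}{s_0} \cdot (1, \dots , 1)^T \in \mathbb {R}^d . \]
Note \(\Phi _0^C \in \mathcal {NN}((d,d,d))\). To shorten the notation, let \(\Omega {:}{=}[-B, B]^d\) and \(J = [-B, B]\). It is not hard to see that \({\mathrm {R}_{\varrho }^{\Omega }(\Phi _0^C) = \varrho _C|_J \times \cdots \times \varrho _C|_J}\), where the Cartesian product has d factors. We define \(\Phi _C {:}{=}\Phi _0^C \bullet \cdots \bullet \Phi _0^C\) as an iterated concatenation in the sense of Definition B.2, where we take \(L-2\) concatenations (meaning \(L-1\) factors, so that \(\Phi _C = \Phi _0^C\) if \(L = 2\)). We obtain \(\Phi _C \in \mathcal {NN}((d,\dots ,d))\) (with L layers) and
\[ \mathrm {R}_\varrho ^\Omega (\Phi _C)(x) = \big ( (\varrho _C \circ \cdots \circ \varrho _C)(x_1), \dots , (\varrho _C \circ \cdots \circ \varrho _C)(x_d) \big ) \quad \text {for all } x = (x_1, \dots , x_d) \in \Omega , \tag{B.1} \]
where \(\varrho _C\) is applied \(L-1\) times.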
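Since \(|\varrho _C (x) -x | \le \varepsilon ' \le \varepsilon \) for all \(x \in [-B-L\varepsilon ,B+L\varepsilon ]\), it is not hard to see by induction that
\[ \big | (\varrho _C \circ \cdots \circ \varrho _C)(x) - x \big | \le t \, \varepsilon ' \le L \varepsilon \quad \text {for all } x \in [-B,B], \]
where \(\varrho _C\) is applied \(t \le L\) times; here, the bound \(t \, \varepsilon ' \le L \varepsilon \) ensures that every intermediate value stays in \([-B-L\varepsilon ,B+L\varepsilon ]\), so that the estimate for \(\varrho _C\) can be applied again. Therefore, since \(\varepsilon ' = \varepsilon / (dL)\), we conclude for \(C \ge C_0\) that
\[ \big | \mathrm {R}_\varrho ^\Omega (\Phi _C)(x) - x \big | \le \sum _{j=1}^d \big | (\varrho _C \circ \cdots \circ \varrho _C)(x_j) - x_j \big | \le d \, (L-1) \, \varepsilon ' \le \varepsilon \quad \text {for all } x \in \Omega . \]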
As we saw above, \(\varrho _C\) is differentiable at 0 with \(\varrho _{C}(0) = 0\) and \(\varrho _{C}'(0)=1\). By induction, we thus get \(\frac{\mathrm{{d}}}{\mathrm{{d}}x}\big |_{x=0} (\varrho _C \circ \cdots \circ \varrho _C)(x) = 1\), where the composition has at most L factors. Thanks to Eq. (B.1), this shows that \(\mathrm {R}_\varrho ^\Omega (\Phi _C)\) is totally differentiable at 0, with \(D (\mathrm {R}_\varrho ^\Omega (\Phi _C)) (0) = \mathrm {id}_{\mathbb {R}^d}\), as claimed.
Also by Eq. (B.1), we see that for every \(j \in \{1, \dots , d\}\), \(\big ( \mathrm {R}_{\varrho }^{\Omega }(\Phi _{C})(x) \big )_j\) is constant in all but the j-th coordinate. Additionally, if \(\varrho \) is increasing, then \(s_0 > 0\), so that \(\varrho _C\) is also increasing, and hence \(\big ( \mathrm {R}_{\varrho }^{\Omega }(\Phi _{C}) \big )_j\) is increasing in the j-th coordinate, since compositions of increasing functions are increasing. Hence, \(\Phi _\varepsilon ^B{:}{=}\Phi _{C}\) satisfies the desired properties.
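We proceed with the second part of the proposition. We first prove the statement for \(i = 1\). Let \({\widetilde{\Phi }}_1^C {:}{=}\big ( (A_1', b_1'),(A_2',b_2') \big )\), where
\[ A_1' {:}{=}\big ( \tfrac{1}{C}, 0, \dots , 0 \big ) \in \mathbb {R}^{1 \times d}, \quad b_1' {:}{=}x_0 \in \mathbb {R}^1, \quad A_2' {:}{=}\frac{C}{s_0} \in \mathbb {R}^{1 \times 1}, \quad b_2' {:}{=}- \frac{C \, r_0}{s_0} \in \mathbb {R}^1 . \]
We have \({\widetilde{\Phi }}_1^C \in \mathcal {NN}((d, 1, 1))\). Next, define \({\widetilde{\Phi }}_2^C {:}{=}\big ( (A_1'', b_1''),(A_2'',b_2'') \big )\), where
\[ A_1'' {:}{=}\frac{1}{C} \in \mathbb {R}^{1 \times 1}, \quad b_1'' {:}{=}x_0 \in \mathbb {R}^1, \quad A_2'' {:}{=}\frac{C}{s_0} \in \mathbb {R}^{1 \times 1}, \quad b_2'' {:}{=}- \frac{C \, r_0}{s_0} \in \mathbb {R}^1 . \]
We have \({\widetilde{\Phi }}_2^C \in \mathcal {NN}((1, 1, 1))\). Setting \({\widetilde{\Phi }}_C {:}{=}{\widetilde{\Phi }}_2^C \bullet \cdots \bullet {\widetilde{\Phi }}_2^C \bullet {\widetilde{\Phi }}_1^C\) as an iterated concatenation in the sense of Definition B.2, where we take \(L-2\) concatenations (meaning \(L-1\) factors), yields a neural network \({\widetilde{\Phi }}_C \in \mathcal {NN}((d,1,\dots ,1))\) (with L layers) such that
\[ \mathrm {R}_\varrho ^\Omega ({\widetilde{\Phi }}_C)(x) = (\varrho _C \circ \cdots \circ \varrho _C)(x_1) \quad \text {for all } x \in \Omega , \]
where \(\varrho _C\) is applied \(L-1\) times. Exactly as in the proof of the first part, this implies for \(C \ge C_0\) that
\[ \big | \mathrm {R}_\varrho ^\Omega ({\widetilde{\Phi }}_C)(x) - x_1 \big | \le \varepsilon \quad \text {for all } x \in \Omega . \]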
Setting \({\widetilde{\Phi }}_{\varepsilon ,1}^B {:}{=}{\widetilde{\Phi }}_{C}\) and repeating the previous arguments yields the claim for \(i = 1\). Permuting the columns of \(A_1'\) yields the result for arbitrary \(i \in \{1, \dots , d\}\).
Now, let \(\varrho \) be increasing. Then, \(s_0 > 0\), and thus \(\varrho _C\) is increasing for every \(C > 0\). Since \(\mathrm {R}_{\varrho }^\Omega ({\widetilde{\Phi }}_{C})\) is the composition of componentwise monotonically increasing functions, the claim regarding the monotonicity follows. \(\square \)
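The projection network from the second part of Proposition B.3 can be written out explicitly. The following sketch assumes the two-layer building blocks have first-layer weight \(1/C\) with bias \(x_0\) and second-layer weight \(C/s_0\) with bias \(-C r_0 / s_0\), so that consecutive blocks merge into a middle layer with weight \(1/s_0\) and bias \(x_0 - r_0/s_0\). The softplus activation with \(x_0 = 0\) and all helper names are illustrative choices; coordinates are indexed from zero in the code.

```python
import math

def softplus(t):
    return math.log1p(math.exp(t))

def realize(network, x, act):
    """Evaluate [(A_1, b_1), ..., (A_L, b_L)] at x; no activation after the last layer."""
    for idx, (A, b) in enumerate(network):
        x = [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]
        if idx < len(network) - 1:
            x = [act(v) for v in x]
    return x

def projection_network(d, i, L, C, x0, r0, s0):
    """L-layer network (L >= 2) of architecture (d, 1, ..., 1) approximating x -> x[i]."""
    first = ([[1.0 / C if j == i else 0.0 for j in range(d)]], [x0])
    middle = ([[1.0 / s0]], [x0 - r0 / s0])   # merged last/first maps of two blocks
    last = ([[C / s0]], [-C * r0 / s0])
    return [first] + [middle] * (L - 2) + [last]

x0 = 0.0
r0 = softplus(x0)   # rho(x0) = log 2
s0 = 0.5            # rho'(x0) = sigmoid(0)
net = projection_network(d=3, i=1, L=4, C=1000.0, x0=x0, r0=r0, s0=s0)

B = 1.0
grid = [-B + 2.0 * B * k / 20 for k in range(21)]
values = [realize(net, [0.3, t, -0.8], softplus)[0] for t in grid]
max_error = max(abs(v - t) for v, t in zip(values, grid))
```

With these parameters `max_error` should be well below 0.01, the values should increase along the grid because softplus is increasing, and changing the other two coordinates should leave the output unchanged.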
Appendix C: Proofs and Results Connected to Sect. 2
C.1. Proof of Theorem 2.1
We first establish the star-shapedness of the set of all realizations of neural networks, which is a direct consequence of the fact that the set is invariant under scalar multiplication. The following proposition provides the details.
Proposition C.1
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(\Omega \subset \mathbb {R}^d\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\). Then, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is closed under scalar multiplication and is star-shaped with respect to the origin.
Proof
Let \(f \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) and choose \(\Phi {:}{=}\big ( (A_1,b_1), \dots , (A_L,b_L)\big ) \in \mathcal {NN}(S)\) satisfying \({f = \mathrm {R}_\varrho ^\Omega (\Phi )}\). For \({\lambda \in \mathbb {R}}\), define \( {\widetilde{\Phi }} {:}{=}\big ( (A_1,b_1), \dots , (A_{L-1}, b_{L-1}), (\lambda A_{L}, \lambda b_L) \big ) \) and observe that \({\widetilde{\Phi }} \in \mathcal {NN}(S)\) and furthermore \({\lambda f = \mathrm {R}^\Omega _\varrho ({\widetilde{\Phi }}) \in \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)}\). This establishes the closedness of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) under scalar multiplication.
We can choose \(\lambda = 0\) in the argument above and obtain \({0 \in \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)}\). For every \(f \in \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) the line \(\{\lambda f :\lambda \in [0,1]\}\) between 0 and f is contained in \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\), since \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is closed under scalar multiplication. We conclude that \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is star-shaped with respect to the origin. \(\square \)
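The scaling argument of Proposition C.1 is easy to reproduce: multiplying the last affine map by \(\lambda \) multiplies the realization by \(\lambda \). The sketch below uses a hypothetical list-of-pairs representation of networks and the illustrative helper `scale_output`.

```python
import math

def realize(network, x, act=math.tanh):
    """Evaluate [(A_1, b_1), ..., (A_L, b_L)] at x; no activation after the last layer."""
    for idx, (A, b) in enumerate(network):
        x = [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]
        if idx < len(network) - 1:
            x = [act(v) for v in x]
    return x

def scale_output(network, lam):
    """Replace (A_L, b_L) by (lam * A_L, lam * b_L); the architecture is unchanged."""
    A, b = network[-1]
    scaled = ([[lam * a for a in row] for row in A], [lam * bi for bi in b])
    return list(network[:-1]) + [scaled]

phi = [([[1.0, -2.0], [0.5, 0.3]], [0.1, -0.2]),
       ([[2.0, -1.0]], [0.7])]
x = [0.4, -1.1]
f_x = realize(phi, x)[0]
scaled_x = realize(scale_output(phi, -2.5), x)[0]   # equals -2.5 * f(x)
zero_x = realize(scale_output(phi, 0.0), x)[0]      # lam = 0 gives the zero function
```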
Our next goal is to show that \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _{\varrho }(S)\) cannot contain infinitely many linearly independent centers.
As a preparation, we prove two related results which show that the class \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is “small”. The main assumption for guaranteeing this is that the activation function should be locally Lipschitz continuous.
Lemma C.2
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, set \(N_0 {:}{=}d\), and let \(M \in \mathbb {N}\). Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous. Let \(\Omega \subset \mathbb {R}^d\) be compact, and let \(\Lambda : C(\Omega ; \mathbb {R}^{N_L}) \rightarrow \mathbb {R}^M\) be locally Lipschitz continuous, with respect to the uniform norm on \(C(\Omega ; \mathbb {R}^{N_L})\).
If \(M > \sum _{\ell =1}^L (N_{\ell - 1} + 1) N_\ell \), then \(\Lambda (\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)) \subset \mathbb {R}^M\) is a set of Lebesgue measure zero.
Proof
Since \(\varrho \) is locally Lipschitz continuous, Proposition 4.1 (which will be proved completely independently) shows that the realization map
\[ \mathrm {R}_\varrho ^\Omega : \mathcal {NN}(S) \rightarrow C(\Omega ; \mathbb {R}^{N_L}), \quad \Phi \mapsto \mathrm {R}_\varrho ^\Omega (\Phi ) \]
is locally Lipschitz continuous. Here, the normed vector space \(\mathcal {NN}(S)\) is by definition isomorphic to \( \prod _{\ell = 1}^L \big ( \mathbb {R}^{N_{\ell - 1} \times N_\ell } \times \mathbb {R}^{N_\ell } \big ) \) and thus has dimension \(D := \sum _{\ell =1}^L (N_{\ell - 1} + 1) N_\ell \), so that there is an isomorphism \(J : \mathbb {R}^D \rightarrow \mathcal {NN}(S)\).
As a composition of locally Lipschitz continuous functions, the map
\[ \Gamma : \mathbb {R}^M \cong \mathbb {R}^D \times \mathbb {R}^{M-D} \rightarrow \mathbb {R}^M, \quad (x, y) \mapsto \Lambda \big ( \mathrm {R}_\varrho ^\Omega (J x) \big ) \]
is locally Lipschitz continuous, and satisfies \( \Lambda \big ( \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S) \big ) = \mathrm {ran} (\Gamma ) = \Gamma (\mathbb {R}^D \times \{0\}^{M-D}) \). Since \(M > D\), the set \(\mathbb {R}^D \times \{0\}^{M-D}\) has Lebesgue measure zero in \(\mathbb {R}^M\). But it is well known (see for instance [2, Theorem 5.9]) that a locally Lipschitz continuous function between Euclidean spaces of the same dimension maps sets of Lebesgue measure zero to sets of Lebesgue measure zero. Hence, \(\Lambda (\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)) \subset \mathbb {R}^M\) is a set of Lebesgue measure zero. \(\square \)
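The threshold \(D = \sum _{\ell =1}^L (N_{\ell - 1} + 1) N_\ell \) in Lemma C.2 is the number of weights and biases of the architecture. As a small sketch (the function name `num_parameters` is hypothetical), it can be computed from the tuple \((d, N_1, \dots , N_L)\) as follows.

```python
def num_parameters(architecture):
    """Sum of (N_{l-1} + 1) * N_l over all layers of (d, N_1, ..., N_L)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(architecture, architecture[1:]))

D_small = num_parameters((2, 3, 1))     # (2 + 1) * 3 + (3 + 1) * 1
D_narrow = num_parameters((1, 1, 1))    # (1 + 1) * 1 + (1 + 1) * 1
smallest_M = D_small + 1                # smallest M to which Lemma C.2 applies
```

For the architecture \((2, 3, 1)\) this gives \(D = 13\), so the lemma applies to every \(M \ge 14\).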
As a corollary, we can now show that the class of neural network realizations cannot contain a subspace of large dimension.
Corollary C.3
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, set \(N_0 {:}{=}d\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous.
Let \(\varnothing \ne \Omega \subset \mathbb {R}^d\) be arbitrary. If \(V \subset C(\Omega ; \mathbb {R}^{N_L})\) is a vector space with \(V \subset \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\), then \({\dim V \le \sum _{\ell =1}^L (N_{\ell - 1} + 1) N_\ell }\).
Proof
Let \(D {:}{=}\sum _{\ell =1}^L (N_{\ell - 1} + 1) N_\ell \). Assume toward a contradiction that the claim of the corollary does not hold; then there exists a subspace \(V \subset C(\Omega ; \mathbb {R}^{N_L})\) of dimension \(\dim V = D + 1\) with \(V \subset \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^\Omega (S)\). For \(x \in \Omega \) and \(\ell \in \underline{N_L}\), let \(\delta _x^{(\ell )} : C(\Omega ; \mathbb {R}^{N_L}) \rightarrow \mathbb {R}, f \mapsto \big ( f(x) \big )_\ell \). Define \( W {:}{=}\mathrm {span} \big \{ \delta _x^{(\ell )} |_V :x \in \Omega , \ell \in \underline{N_L} \, \big \} \), and note that W is a subspace of the finite-dimensional algebraic dual space \(V^*\) of V. In particular, \(\dim W \le \dim V^*= \dim V = D+1\), so that there are \((x_1, \ell _1), \dots , (x_{D+1}, \ell _{D+1}) \in \Omega \times \underline{N_L}\) such that \(W = \mathrm {span} \big \{ \delta _{x_k}^{(\ell _k)} :k \in \underline{D+1} \big \}\).
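We claim that the linear map
\[ \Lambda _0 : V \rightarrow \mathbb {R}^{D+1}, \quad f \mapsto \big ( [f(x_k)]_{\ell _k} \big )_{k \in \underline{D+1}} \]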
is surjective. Since \(\dim V = D+1 = \dim \mathbb {R}^{D+1}\), it suffices to show that \(\Lambda _0\) is injective. But if \(\Lambda _0 f = 0\) for some \(f \in V \subset C(\Omega ; \mathbb {R}^{N_L})\), and if \(x \in \Omega \) and \(\ell \in \underline{N_L}\) are arbitrary, then \(\delta _x^{(\ell )} = \sum _{k = 1}^{D+1} a_k \, \delta _{x_k}^{(\ell _k)}\) for certain \(a_1, \dots , a_{D+1} \in \mathbb {R}\). Hence, \([f(x)]_\ell = \sum _{k=1}^{D+1} a_k [f(x_k)]_{\ell _k} = 0\). Since \(x \in \Omega \) and \(\ell \in \underline{N_L}\) were arbitrary, this means \(f \equiv 0\). Therefore, \(\Lambda _0\) is injective and thus surjective.
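Now, let us define \(\Omega ' {:}{=}\{x_1, \dots , x_{D+1} \}\), and note that \(\Omega ' \subset \mathbb {R}^d\) is compact. Set \(M {:}{=}D+1\), and define
\[ \Lambda : C(\Omega '; \mathbb {R}^{N_L}) \rightarrow \mathbb {R}^M, \quad f \mapsto \big ( [f(x_k)]_{\ell _k} \big )_{k \in {\underline{M}}} . \]
It is straightforward to verify that \(\Lambda \) is Lipschitz continuous. Therefore, Lemma C.2 shows that the set \(\Lambda (\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega '} (S)) \subset \mathbb {R}^M\) is a null-set. However, since \(f|_{\Omega '} \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega '}(S)\) for every \(f \in V \subset \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\), we have
\[ \Lambda \big ( \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega '} (S) \big ) \supset \Lambda \big ( \{ f|_{\Omega '} :f \in V \} \big ) = \Lambda _0 (V) = \mathbb {R}^{D+1} = \mathbb {R}^M . \]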
This yields the desired contradiction. \(\square \)
Now, the announced estimate for the number of linearly independent centers of the set of all network realizations of a fixed size is a direct consequence.
Proposition C.4
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(\Omega \subset \mathbb {R}^d\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous. Then, \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) contains at most \(\sum _{\ell = 1}^L (N_{\ell -1} + 1) N_{\ell }\) linearly independent centers, where \(N_0 = d\). That is, the number of linearly independent centers is bounded by the total number of parameters of the underlying neural networks.
Proof
Let us set \(D {:}{=}\sum _{\ell = 1}^L (N_{\ell - 1} + 1) N_\ell \), and assume toward a contradiction that \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) contains \(M {:}{=}D+1\) linearly independent centers \(\mathrm {R}_\varrho ^\Omega (\Phi _1), \dots , \mathrm {R}_\varrho ^\Omega (\Phi _M)\). Since \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is closed under multiplication with scalars, this implies
\[ V {:}{=}\mathrm {span} \big \{ \mathrm {R}_\varrho ^\Omega (\Phi _k) :k \in {\underline{M}} \big \} \subset \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S) . \]
Indeed, this follows by induction on M, using the following observation: If V is a vector space contained in a set A, if A is closed under multiplication with scalars, and if \(x_0 \in A\) is a center for A, then \(V + {{\,\mathrm{span}\,}}\{x_0\} \subset A\). To see this, let \(\mu \in \mathbb {R}\) and \(v \in V\). There is some \(\varepsilon \in \{1,-1\}\) such that \(\varepsilon \mu = |\mu |\). Now set \(x {:}{=}\varepsilon v \in V \subset A\) and \(\lambda {:}{=}| \mu | / (1+| \mu |) \in [0,1]\). Then,
\[ v + \mu \, x_0 = \varepsilon \, (1 + |\mu |) \cdot \big ( \lambda \, x_0 + (1 - \lambda ) \, x \big ) \in A . \]
Here, \(\lambda x_0 + (1-\lambda ) x \in A\) because \(x_0\) is a center for A, and the scalar multiple belongs to A because A is closed under multiplication with scalars.
Since the family \(\big ( \mathrm {R}_\varrho ^\Omega (\Phi _k) \big )_{k \in {\underline{M}}}\) is linearly independent, we see
\[ \dim V = M = D + 1 > D . \]
In view of Corollary C.3, this yields the desired contradiction. \(\square \)
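The algebra behind the observation used in this proof can be checked with a few lines of arithmetic. The sketch below represents vectors as Python lists and verifies that scaling the convex combination \(\lambda x_0 + (1-\lambda ) x\) by \(\varepsilon (1+|\mu |)\) reproduces \(v + \mu x_0\); this scaling factor is what the choice of \(\lambda \) and \(x\) in the proof suggests, and the function name `via_center` is hypothetical.

```python
def via_center(mu, v, x0):
    """Rebuild v + mu * x0 from a convex combination with the center x0."""
    eps = 1.0 if mu >= 0 else -1.0          # eps * mu == |mu|
    lam = abs(mu) / (1.0 + abs(mu))         # lam lies in [0, 1)
    x = [eps * t for t in v]                # x = eps * v, again an element of V
    convex = [lam * a + (1.0 - lam) * b for a, b in zip(x0, x)]
    return [eps * (1.0 + abs(mu)) * c for c in convex]

v = [1.0, -2.0, 0.5]
x0 = [0.3, 0.7, -1.2]
results = {mu: via_center(mu, v, x0) for mu in (-3.0, 0.0, 0.5)}
expected = {mu: [a + mu * b for a, b in zip(v, x0)] for mu in (-3.0, 0.0, 0.5)}
```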
Next, we analyze the convexity of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\). As a direct consequence of Proposition C.4, we see that \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is never convex if \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) contains more than a certain number of linearly independent functions.
Corollary C.5
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture and let \(N_0 {:}{=}d\). Let \(\Omega \subset \mathbb {R}^d\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous.
If \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) contains more than \(\sum _{\ell = 1}^L (N_{\ell -1}+1)N_{\ell }\) linearly independent functions, then \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is not convex.
Proof
Every element of a convex set is a center. Thus the result follows directly from Proposition C.4. \(\square \)
Corollary C.5 claims that if a set of realizations of neural networks with fixed size contains more than a fixed number of linearly independent functions, then it cannot be convex. Since \(\mathcal {R}\mathcal {N}\mathcal {N}^{\mathbb {R}^d}_\varrho (S)\) is translation invariant, it is very likely that \(\mathcal {R}\mathcal {N}\mathcal {N}^{\mathbb {R}^d}_\varrho (S)\) (and hence also \(\mathcal {R}\mathcal {N}\mathcal {N}^{\Omega }_\varrho (S)\)) contains infinitely many linearly independent functions. In fact, our next result shows under minor regularity assumptions on \(\varrho \) that if the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) does not contain infinitely many linearly independent functions, then \(\varrho \) is necessarily a polynomial.
Proposition C.6
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture with \(L \in \mathbb {N}_{\ge 2}\). Moreover, let \(\varrho :\mathbb {R}\rightarrow \mathbb {R}\) be continuous. Assume that there exists \(x_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho '(x_0) \ne 0\).
Further assume that \(\Omega \subset \mathbb {R}^d\) has nonempty interior, and that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) does not contain infinitely many linearly independent functions. Then, \(\varrho \) is a polynomial.
Proof
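Step 1 Set \(S' {:}{=}(d, N_1, \dots , N_{L-1}, 1)\). We first show that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S')\) does not contain infinitely many linearly independent functions. To see this, first note that the map
\[ \Theta : C(\Omega ; \mathbb {R}^{N_L}) \rightarrow C(\Omega ; \mathbb {R}), \quad g \mapsto (g)_1 , \]
which maps an \(\mathbb {R}^{N_L}\)-valued function to its first component, is linear, well-defined, and surjective. Moreover, \(\Theta \) maps \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) onto \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S')\), since every network with architecture \(S'\) arises from a network with architecture S by keeping only the first row of \(A_L\) and the first entry of \(b_L\).
Hence, if there were infinitely many linearly independent functions \((f_n)_{n \in \mathbb {N}}\) in the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega } (S')\), then we could find \((g_n)_{n \in \mathbb {N}}\) in \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) such that \(f_n = \Theta \, g_n\). But then the \((g_n)_{n \in \mathbb {N}}\) are necessarily linearly independent, contradicting the hypothesis of the proposition.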
Step 2 We show that \(\mathcal {G}:= \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d} (S')\) does not contain infinitely many linearly independent functions.
To see this, first note that since \(\mathcal {F}:= \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S')\) does not contain infinitely many linearly independent functions (Step 1), elementary linear algebra shows that there is a finite-dimensional subspace \(V \subset C(\Omega ; \mathbb {R})\) satisfying \(\mathcal {F}\subset V\). Let \(D := \dim V\), and assume toward a contradiction that there are \(D+1\) linearly independent functions \(f_1, \dots , f_{D+1} \in \mathcal {G}\), and set \(W := \mathrm {span} \{f_1, \dots , f_{D+1} \} \subset C(\mathbb {R}^d; \mathbb {R})\). The space \({\Gamma := \mathrm {span} \{\delta _x |_W :x \in \mathbb {R}^d \} \subset W^*}\) spanned by the point evaluation functionals \(\delta _x : C(\mathbb {R}^d; \mathbb {R}) \rightarrow \mathbb {R}, f \mapsto f(x)\) is finite-dimensional with \(\dim \Gamma \le \dim W^*= \dim W = D+1\). Hence, there are \(x_1, \dots , x_{D+1} \in \mathbb {R}^d\) such that \(\Gamma = \mathrm {span} \{\delta _{x_1}|_W , \dots , \delta _{x_{D+1}} |_W \}\).
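We claim that the map
\[ \Theta : W \rightarrow \mathbb {R}^{D+1}, \quad f \mapsto \big ( f(x_\ell ) \big )_{\ell \in \underline{D+1}} \]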
is surjective. Since \(\dim W = D+1\), it suffices to show that \(\Theta \) is injective. If this was not true, there would be some \(f \in W \subset C(\mathbb {R}^d; \mathbb {R})\), \(f \not \equiv 0\) such that \(\Theta f = 0\). But since \(f \not \equiv 0\), there is some \(x_0 \in \mathbb {R}^d\) satisfying \(f(x_0) \ne 0\). Because of \(\delta _{x_0}|_W \in \Gamma \), we have \(\delta _{x_0}|_W = \sum _{\ell = 1}^{D+1} a_\ell \, \delta _{x_\ell }|_W\) for certain \(a_1, \dots , a_{D+1} \in \mathbb {R}\). Hence, \( 0 \ne f(x_0) = \delta _{x_0}|_W (f) = \sum _{\ell =1}^{D+1} a_\ell \, \delta _{x_\ell }|_W (f) = 0 , \) since \(f(x_\ell ) = \big ( \Theta (f) \big )_\ell = 0\) for all \(\ell \in \underline{D+1}\). This contradiction shows that \(\Theta \) is injective, and hence surjective.
Now, since \(\Omega \) has nonempty interior, there is some \(b \in \Omega \) and some \(r > 0\) such that \(y_\ell := b + r \, x_\ell \in \Omega \) for all \(\ell \in \underline{D+1}\). Define
$$\begin{aligned} g_\ell : \mathbb {R}^d \rightarrow \mathbb {R}, \quad x \mapsto f_\ell \Big ( \frac{x}{r} - \frac{b}{r} \Big ) \quad \text {for } \ell \in \underline{D+1} . \end{aligned}$$
It is not hard to see \(g_\ell \in \mathcal {G}\), and hence \(g_\ell |_\Omega \in \mathcal {F}\subset V\) for all \(\ell \in \underline{D+1}\). Now, define the linear operator \(\Lambda : V \rightarrow \mathbb {R}^{D+1}, f \mapsto \big ( f(y_\ell ) \big )_{\ell \in \underline{D+1}}\), and note that \( \Lambda (g_\ell ) = \big ( g_\ell (y_k) \big )_{k \in \underline{D+1}} = \big ( f_\ell (x_k) \big )_{k \in \underline{D+1}} = \Theta (f_\ell ) , \) because of \({y_\ell }/{r} - {b}/{r} = x_\ell \). Since the functions \(f_1, \dots , f_{D+1}\) span the space W, this implies \(\Lambda (V) \supset \Theta (W) = \mathbb {R}^{D+1}\), in contradiction to \(\Lambda \) being linear and \(\dim V = D < D+1\). This contradiction shows that \(\mathcal {G}\) does not contain infinitely many linearly independent functions.
Step 3 From the previous step, we know that \(\mathcal {G}= \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S')\) does not contain infinitely many linearly independent functions. In this step, we show that this implies that the activation function \(\varrho \) is a polynomial.
To this end, define
![](http://media.springernature.com/lw342/springer-static/image/art%3A10.1007%2Fs10208-020-09461-0/MediaObjects/10208_2020_9461_Equ214_HTML.png)
Clearly, \(\mathcal {R}\mathcal {N}\mathcal {N}_{S',\varrho }^*\) is dilation- and translation invariant; that is, if \(f \in \mathcal {R}\mathcal {N}\mathcal {N}_{S', \varrho }^*\), then also \({f(a \, \cdot ) \in \mathcal {R}\mathcal {N}\mathcal {N}_{S', \varrho }^*}\) and \({f(\cdot -x) \in \mathcal {R}\mathcal {N}\mathcal {N}_{S', \varrho }^*}\) for arbitrary \(a > 0\) and \(x \in \mathbb {R}\). Furthermore, by Step 2, we see that \(\mathcal {R}\mathcal {N}\mathcal {N}_{S', \varrho }^*\) does not contain infinitely many linearly independent functions. Therefore, \(V {:}{=}{{\,\mathrm{span}\,}}\mathcal {R}\mathcal {N}\mathcal {N}_{S', \varrho }^*\) is a finite-dimensional translation- and dilation invariant subspace of \(C(\mathbb {R})\). Thanks to the translation invariance, it follows from [3] that there exists some \(r \in \mathbb {N}\), and certain \(\lambda _j \in \mathbb {C}\), \(k_j \in \mathbb {N}_0\) for \(j = 1, \dots , r\) such that
$$\begin{aligned} V \subset {{\,\mathrm{span}\,}}_{{\mathbb {C}}} \big \{ x \mapsto x^{k_j} \, e^{\lambda _j x} \,:\, j = 1, \dots , r \big \} , \end{aligned}$$(C.1)
where \({{\,\mathrm{span}\,}}_{{\mathbb {C}}}\) denotes the linear span, with \(\mathbb {C}\) as the underlying field. Clearly, we can assume \({(k_j, \lambda _j) \ne (k_\ell , \lambda _\ell )}\) for \(j \ne \ell \).
Step 4 Let \(N := \max _{j \in \{1,\dots ,r\}} \, k_j\). We now claim that V is contained in the space \(\mathbb {C}_{\deg \le N} [X]\) of (complex) polynomials of degree at most N.
Indeed, suppose toward a contradiction that there is some \({f \in V \setminus \mathbb {C}_{\deg \le N} [X]}\). Thanks to (C.1), we can write \(f = \sum _{j = 1}^r a_j \, x^{k_j} \, e^{\lambda _j x}\) with \(a_1, \dots , a_r \in \mathbb {C}\). Because of \(f \notin \mathbb {C}_{\deg \le N} [X]\), there is some \(\ell \in \{1,\dots ,r\}\) such that \(a_\ell \ne 0\) and \(\lambda _\ell \ne 0\). Now, choose \(\beta > 0\) such that \(|\beta \lambda _\ell | > |\lambda _j|\) for all \(j \in \{1,\dots ,r\}\), and note that \(f(\beta \, \cdot ) \in V\), so that Eq. (C.1) yields coefficients \(b_1, \dots , b_r \in \mathbb {C}\) such that \(f(\beta \, x) = \sum _{j=1}^r b_j \, x^{k_j} \, e^{\lambda _j x}\). By subtracting the two different representations for \(f(\beta \, x)\), we thus see
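$$\begin{aligned} 0 \equiv \sum _{j=1}^r a_j \, \beta ^{k_j} \, x^{k_j} \, e^{\beta \lambda _j x} - \sum _{j=1}^r b_j \, x^{k_j} \, e^{\lambda _j x} , \end{aligned}$$
and hence
$$\begin{aligned} 0 \equiv a_\ell \, \beta ^{k_\ell } \, x^{k_\ell } \, e^{\beta \lambda _\ell x} + \sum _{j \in \{1,\dots ,r\} \setminus \{ \ell \}} a_j \, \beta ^{k_j} \, x^{k_j} \, e^{\beta \lambda _j x} - \sum _{j=1}^r b_j \, x^{k_j} \, e^{\lambda _j x} . \end{aligned}$$(C.2)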
Note, however, that \(|\beta \lambda _\ell | > |\lambda _j|\) and hence \((k_\ell , \beta \lambda _\ell ) \ne (k_j, \lambda _j)\) for all \(j \in \{1,\dots ,r\}\), and furthermore that \((k_\ell , \beta \lambda _\ell ) \ne (k_j, \beta \lambda _j)\) for \(j \in \{1,\dots ,r\} \setminus \{ \ell \}\). Thus, Lemma C.7 below shows that Eq. (C.2) cannot be true. This is the desired contradiction.
Step 5 In this step, we complete the proof, by first showing for arbitrary \(B > 0\) that \(\varrho |_{[-B,B]}\) is a polynomial of degree at most N.
Let \(\varepsilon , B > 0\) be arbitrary. Since \(\varrho \) is continuous, it is uniformly continuous on \([-B-1,B+1]\), that is, there is some \(\delta \in (0,1)\) such that \(| \varrho (x) - \varrho (y) | \le \varepsilon \) for all \(x,y \in [-B-1,B+1]\) with \(| x -y | \le \delta \). Since \(\varrho '(x_0) \ne 0\) and \(L \ge 2\), Proposition B.3 and Lemma B.1 imply existence of a neural network \({\widetilde{\Phi }}_{\varepsilon ,B} \in \mathcal {NN}((d, N_1, \dots , N_{L-1}))\) such that
In particular, this implies because of \(\delta \le 1\) that \(\big [ \mathrm {R}^{[-B,B]^d}_\varrho ({\widetilde{\Phi }}_{\varepsilon ,B})(x) \big ]_1 \in [-B-1,B+1]\) for all \(x \in [-B,B]^d\). We conclude that
with \(\varrho \) acting componentwise. By (C.3), it follows that there is some \({\Phi _{\varepsilon ,B} \in \mathcal {NN}(S')}\) satisfying
From (C.4) and Step 4, we thus see
where the closure is taken with respect to the sup norm, and where we implicitly used that the space on the right-hand side is a closed subspace of \(C([-B,B])\), since it is a finite dimensional subspace.
Since \(\varrho |_{[-B,B]}\) is a polynomial of degree at most N, we see that the \((N+1)\)-th derivative of \(\varrho \) satisfies \(\varrho ^{(N+1)} \equiv 0\) on \((-B, B)\), for arbitrary \(B > 0\). Thus, \(\varrho ^{(N+1)} \equiv 0\), meaning that \(\varrho \) is a polynomial. \(\square \)
In the above proof, we used the following elementary lemma, whose proof we provide for completeness.
Lemma C.7
For \(k \in \mathbb {N}_0\) and \(\lambda \in \mathbb {C}\), define \(f_{k,\lambda } : \mathbb {R}\rightarrow \mathbb {C}, x \mapsto x^{k} \, e^{\lambda x}\).
Let \(N \in \mathbb {N}\), and let \((k_1, \lambda _1), \dots , (k_N, \lambda _N) \in \mathbb {N}_0 \times \mathbb {C}\) satisfy \((k_\ell , \lambda _\ell ) \ne (k_j, \lambda _j)\) for \(\ell \ne j\). Then, the family \((f_{k_\ell , \lambda _\ell })_{\ell =1,\dots ,N}\) is linearly independent over \(\mathbb {C}\).
Proof
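Let us assume toward a contradiction that
$$\begin{aligned} 0 \equiv \sum _{\ell = 1}^N a_\ell \, x^{k_\ell } \, e^{\lambda _\ell x} \end{aligned}$$(C.5)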
for some coefficient vector \((a_1,\dots ,a_N) \in \mathbb {C}^N \setminus \{ 0 \}\). By dropping those terms for which \(a_\ell = 0\), we can assume that \(a_\ell \ne 0\) for all \(\ell \in \{1,\dots ,N\}\).
Let \(\Lambda := \{ \lambda _i :i \in \{1,\dots ,N\} \}\). In the case where \(|\Lambda | = 1\), it follows that \(k_j \ne k_\ell \) for \(j \ne \ell \). Furthermore, multiplying Eq. (C.5) by \(e^{-\lambda _1 x}\), we see that \( 0 \equiv \sum _{\ell =1}^N a_\ell \, x^{k_\ell } , \) which is impossible since the monomials \((x^k)_{k \in \mathbb {N}_0}\) are linearly independent. Thus, we only need to consider the case that \(|\Lambda | > 1\).
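Define \({M {:}{=}\max \{ k_\ell :\ell \in \{1,\dots ,N\} \}}\) and
$$\begin{aligned} I := \big \{ i \in \{1,\dots ,N\} \,:\, \lambda _i = \lambda _1 \big \} , \quad \text {and choose } j \in I \text { satisfying } k_j = \max _{\ell \in I} k_\ell . \end{aligned}$$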
Note that this implies \(k_\ell < k_j\) for all \(\ell \in I \setminus \{ j \}\), since \((k_\ell , \lambda _\ell ) \ne (k_j, \lambda _j)\) and hence \(k_\ell \ne k_j\) for \(\ell \in I \setminus \{ j \}\).
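Consider the differential operator
$$\begin{aligned} T := \prod _{\lambda \in \Lambda \setminus \{ \lambda _1 \}} \Big ( \frac{\mathrm{{d}}}{\mathrm{{d}}x} - \lambda \, \mathrm {id}\Big )^{M+1} . \end{aligned}$$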
Note that \( \big ( \frac{\mathrm{{d}}}{\mathrm{{d}}x} - \lambda \, \mathrm {id}\big ) (x^k \, e^{\mu x}) = (\mu - \lambda ) x^k \, e^{\mu x} + k \, x^{k-1} \, e^{\mu x} . \) Using this identity, it is easy to see that if \(\lambda \in \Lambda \setminus \{ \lambda _1 \}\) and \(k \in \mathbb {N}_0\) satisfies \(k \le M\), then \(T (x^k \, e^{\lambda x}) \equiv 0\). Furthermore, for each \(k \in \mathbb {N}_0\) with \(k \le M\), there exist a constant \(c_k \in \mathbb {C}\setminus \{ 0 \}\) and a polynomial \(p_k \in \mathbb {C}[X]\) with \(\deg p_k < k\) satisfying \(T(x^k \, e^{\lambda _1 x}) = e^{\lambda _1 x} \cdot (c_k x^k + p_k (x))\). Overall, Eq. (C.5) implies that
$$\begin{aligned} 0 \equiv T \Big ( \sum _{\ell = 1}^N a_\ell \, x^{k_\ell } \, e^{\lambda _\ell x} \Big ) = e^{\lambda _1 x} \cdot \big ( a_j \, c_{k_j} \, x^{k_j} + q(x) \big ) , \end{aligned}$$
where \(a_j c_{k_j} \ne 0\) and where \(\deg q < k_j\), since \(k_j > k_\ell \) for all \(\ell \in I \setminus \{ j \}\). This is the desired contradiction. \(\square \)
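The mechanism behind the operator T can be checked on a small example. The following Python sketch is only an illustration under stated assumptions: it encodes a finite combination of the functions \(x^k e^{\mu x}\) as a dictionary keyed by \((k, \mu )\) (a hypothetical encoding chosen here, not taken from the paper), uses integer exponents so that the arithmetic is exact, and takes T to be the product of the factors \((\frac{\mathrm{d}}{\mathrm{d}x} - \lambda \, \mathrm{id})^{M+1}\) over \(\lambda \in \Lambda \setminus \{\lambda _1\}\), which is one operator with the two properties used in the proof.

```python
def apply_shifted_derivative(f, lam):
    """Apply (d/dx - lam * id) to f = sum of c * x^k * exp(mu * x), stored as {(k, mu): c}."""
    out = {}
    for (k, mu), c in f.items():
        # (d/dx - lam)(x^k e^{mu x}) = (mu - lam) x^k e^{mu x} + k x^{k-1} e^{mu x}
        if mu != lam:
            out[(k, mu)] = out.get((k, mu), 0) + (mu - lam) * c
        if k > 0:
            out[(k - 1, mu)] = out.get((k - 1, mu), 0) + k * c
    # Drop terms whose coefficients cancelled to zero.
    return {key: c for key, c in out.items() if c != 0}


def apply_T(f, lam1, M):
    """Apply the product of (d/dx - lam * id)^(M+1) over all exponents lam != lam1 in f."""
    other_exponents = {mu for (_, mu) in f} - {lam1}
    for lam in other_exponents:
        for _ in range(M + 1):
            f = apply_shifted_derivative(f, lam)
    return f


# Illustrative combination: 5 x^2 e^x - e^x + 7 x e^{3x} + 4 x^2 e^{-2x}.
terms = {(2, 1): 5, (0, 1): -1, (1, 3): 7, (2, -2): 4}
M = max(k for (k, _) in terms)

result = apply_T(terms, lam1=1, M=M)
top_degree = max(k for (k, _) in result)
```

In this example only terms with exponent \(\lambda _1 = 1\) survive, and the coefficient of the highest power \(x^2 e^{x}\) is \(5 \cdot (1-3)^3 \cdot (1+2)^3 \ne 0\), which mirrors the role of the constant \(c_k\) in the proof.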
As our final ingredient for the proof of Theorem 2.1, we show that every non-constant locally Lipschitz function \(\varrho \) satisfies the technical assumptions of Proposition C.6.
Lemma C.8
Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous and not constant. Then, there exists some \(x_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho '(x_0) \ne 0\).
Proof
Since \(\varrho \) is not constant, there is some \(B > 0\) such that \(\varrho |_{[-B,B]}\) is not constant. By assumption, \(\varrho \) is Lipschitz continuous on \([-B,B]\). Thus, \(\varrho |_{[-B,B]}\) is absolutely continuous; see for instance [60, Definition 7.17]. Thanks to the fundamental theorem of calculus for the Lebesgue integral (see [60, Theorem 7.20]), this implies that \(\varrho |_{[-B,B]}\) is differentiable almost everywhere on \((-B,B)\) and satisfies \(\varrho (y) - \varrho (x) = \int _x^y \varrho '(t) \, dt\) for \(-B \le x < y \le B\), where \(\varrho '(t) := 0\) if \(\varrho \) is not differentiable at t.
Since \(\varrho |_{[-B,B]}\) is not constant, the preceding formula shows that there has to be some \(x_0 \in (-B, B)\) such that \(\varrho ' (x_0) \ne 0\); in particular, this means that \(\varrho \) is differentiable at \(x_0\). \(\square \)
Now, a combination of Corollary C.5, Proposition C.6, and Lemma C.8 proves Theorem 2.1. For the application of Lemma C.8, note that if \(\varrho \) is constant, then \(\varrho \) is a polynomial, so that the conclusion of Theorem 2.1 also holds in this case.
C.2. Proof of Theorem 2.2
We first show in the following lemma that if \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is convex, then \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is dense in \(C(\Omega )\). The proof of Theorem 2.2 is given thereafter.
Lemma C.9
Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L \ge 2\). Let \(\Omega \subset \mathbb {R}^d\) be compact and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous but not a polynomial. Finally, assume that there is some \(x_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho '(x_0) \ne 0\).
If \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is convex, then \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is dense in \(C(\Omega )\).
Proof
Since \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) is convex and closed under scalar multiplication, \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) forms a closed linear subspace of \(C(\Omega )\). Below, we will show that \( \big ( \Omega \rightarrow \mathbb {R}, x \mapsto \varrho (\langle a, x \rangle + b) \big ) \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} \) for arbitrary \(a \in \mathbb {R}^d\) and \(b \in \mathbb {R}\). Once we prove this, it follows that \( \big ( \Omega \rightarrow \mathbb {R}, x \mapsto \sum _{i=1}^N c_i \, \varrho (b_i + \langle a_i, x \rangle ) \big ) \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} \) for arbitrary \(N \in \mathbb {N}\), \(a_i \in \mathbb {R}^d\), and \(b_i, c_i \in \mathbb {R}\). As shown in [45], this then entails that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) (and hence also \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\)) is dense in \(C(\Omega )\), since \(\varrho \) is not a polynomial.
Thus, let \(a \in \mathbb {R}^d\), \(b \in \mathbb {R}\), and \(\varepsilon > 0\) be arbitrary, and define \(g : \Omega \rightarrow \mathbb {R}, x \mapsto \varrho (b + \langle a, x \rangle )\) and \(\Psi {:}{=}\big ( (a^T, b), (1, 1) \big ) \in \mathcal {NN}( (d, 1, 1) )\), noting that \(\mathrm {R}_\varrho ^{\Omega } (\Psi ) = g\). Since g is continuous on the compact set \(\Omega \), we have \(|g(x)| \le B\) for all \(x \in \Omega \) and some \(B > 0\). By Proposition B.3 and since \(\varrho '(x_0) \ne 0\) and \(L \ge 2\), there exists a neural network \(\Phi _\varepsilon \in \mathcal {NN}( (1,\dots ,1) )\) (with \(L-1\) layers) such that \(|\mathrm {R}_\varrho ^{[-B,B]} (\Phi _\varepsilon ) - x| \le \varepsilon \) for all \(x \in [-B,B]\). This easily shows , while
by Lemma B.1. Therefore, \(g \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\), which completes the proof. \(\square \)
Now we are ready to prove Theorem 2.2. By assumption, \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not dense in \(C(\Omega )\). We start by proving that there exists at least one \(\varepsilon > 0\) such that the set \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) is not \(\varepsilon \)-convex. Suppose toward a contradiction that this is not true, so that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) is \(\varepsilon \)-convex for all \(\varepsilon > 0\). This implies
$$\begin{aligned} \text {co} \Big ( \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} \Big ) \subset \bigcap _{\varepsilon > 0} \Big ( \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} + B_\varepsilon (0) \Big ) = \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} , \end{aligned}$$(C.6)
where the last identity holds true, since if \({\widetilde{f}} \not \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\), there exists \(\varepsilon '>0\) such that \(\Vert {\widetilde{f}}-f\Vert _{\sup }>\varepsilon '\) for all \(f \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\). Equation (C.6) shows that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) is convex, which by Lemma C.9 implies that \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S) \subset C(\Omega )\) is dense, in contradiction to the assumptions of Theorem 2.2. This is the desired contradiction, showing that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) is not \(\varepsilon \)-convex for some \(\varepsilon > 0\).
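Thus, there exist \(\varepsilon _0 > 0\) and \(g \in \text {co} \big ( \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} \big )\) such that \( \Vert g - f\Vert _{\sup } \ge \varepsilon _0 \) for all \(f \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\). Now, let \(\varepsilon > 0\) be arbitrary. Then, \(\frac{\varepsilon }{\varepsilon _0} g \in \text {co}(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)})\), since \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is closed under scalar multiplication. Moreover,
$$\begin{aligned} \Big \Vert \frac{\varepsilon }{\varepsilon _0} \, g - f \Big \Vert _{\sup } = \frac{\varepsilon }{\varepsilon _0} \, \Big \Vert g - \frac{\varepsilon _0}{\varepsilon } \, f \Big \Vert _{\sup } \ge \varepsilon \quad \text {for all } f \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)} , \end{aligned}$$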
again due to the closedness under scalar multiplication of \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\). This shows that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}\) is not \(\varepsilon \)-convex for any \(\varepsilon > 0\). \(\square \)
C.3. Non-dense Network Sets
In this section, we review criteria on \(\varrho \) which ensure that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \ne C(\Omega )\).
Precisely, we will show that this is true if \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) is computable by elementary operations, which means that there is some \(N \in \mathbb {N}\) and an algorithm that takes \(x \in \mathbb {R}\) as input and returns \(\varrho (x)\) after no more than N of the following operations:
- applying the exponential function \(\exp : \mathbb {R}\rightarrow \mathbb {R}\);
- applying one of the arithmetic operations \(+, -, \times \), and / on real numbers;
- jumps conditioned on comparisons of real numbers using the following operators: \(<, >, \le , \ge , = , \ne \).
Then, a combination of [4, Theorem 14.1] with [4, Theorem 8.14] shows that if \(\varrho \) is computable by elementary operations, then the pseudo-dimension of each of the function classes \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S)\) is finite. Here, the pseudo-dimension \(\mathrm {Pdim}({\mathcal {F}})\) of a function class \({\mathcal {F}} \subset \mathbb {R}^X\) is defined as follows (see [4, Sect. 11.2]):
$$\begin{aligned} \mathrm {Pdim}({\mathcal {F}}) := \sup \big \{ |K| \,:\, K \subset X \text { finite and pseudo-shattered by } {\mathcal {F}} \big \} . \end{aligned}$$
Here, a finite set \(K = \{ x_1,\dots ,x_m \} \subset X\) (with pairwise distinct \(x_i\)) is pseudo-shattered by \({\mathcal {F}}\) if there are \(r_1,\dots ,r_m \in \mathbb {R}\) such that for each \(b \in \{0, 1\}^m\) there is a function \(f_b \in {\mathcal {F}}\) with \({\mathbb {1}}_{[0,\infty )} \big ( f_b (x_i) - r_i \big ) = b_i\) for all \(i \in \{1,\dots ,m\}\).
Using this result, we can now show that the realization sets of networks with activation functions that are computable by elementary operations are never dense in \(L^p (\Omega )\) or \(C(\Omega )\).
Proposition C.10
Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous and computable by elementary operations. Moreover, let \(S=(d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture. Let \(\Omega \subset \mathbb {R}^d\) be any measurable set with nonempty interior, and let \({\mathcal {Y}}\) denote either \(L^p(\Omega )\) (for some \(p \in [1,\infty )\)), or \(C(\Omega )\). In case of \({\mathcal {Y}} = C(\Omega )\), assume additionally that \(\Omega \) is compact.
Then, we have \(\overline{{\mathcal {Y}} \cap \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \subsetneq {\mathcal {Y}}\).
Proof
The considerations from before the statement of the proposition show that
$$\begin{aligned} \mathrm {Pdim} \big ( \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S) \big ) \le \mathrm {Pdim} \big ( \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S) \big ) < \infty . \end{aligned}$$
Therefore, all we need to show is that if \({\mathcal {F}} \subset C(\Omega )\) is a function class for which \({\mathcal {F}} \cap {\mathcal {Y}}\) is dense in \({\mathcal {Y}}\), then \(\mathrm {Pdim}({\mathcal {F}}) = \infty \).
For \({\mathcal {Y}} = C(\Omega )\), this is easy: Let \(m \in \mathbb {N}\) be arbitrary, choose distinct points \(x_1, \dots , x_m \in \Omega \), and note that for each \(b \in \{0,1\}^m\), there is \(g_b \in C(\Omega )\) satisfying \(g_b (x_j) = b_j\) for all \(j \in {\underline{m}}\). By density, for each \(b \in \{0,1\}^m\), there is \(f_b \in {\mathcal {F}}\) such that \(\Vert f_b - g_b\Vert _{\sup } < \frac{1}{2}\). In particular, \(f_b (x_j) > \frac{1}{2}\) if \(b_j = 1\) and \(f_b (x_j) < \frac{1}{2}\) if \(b_j = 0\). Thus, if we set \(r_1 {:}{=}\dots {:}{=}r_m {:}{=}\frac{1}{2}\), then \({\mathbb {1}}_{[0,\infty )} (f_b(x_j) - r_j) = b_j\) for all \(j \in {\underline{m}}\). Hence, the set \(K {:}{=}\{x_1,\dots ,x_m\}\) is pseudo-shattered by \({\mathcal {F}}\), so that \(\mathrm {Pdim}({\mathcal {F}}) \ge m\). Since \(m \in \mathbb {N}\) was arbitrary, \(\mathrm {Pdim}({\mathcal {F}}) = \infty \).
For \({\mathcal {Y}} = L^p (\Omega )\), one can modify this argument as follows: Since \(\Omega \) has nonempty interior, there are \({x_0 \in \Omega }\) and \(r > 0\) such that \(x_0 + r [0,1]^d \subset \Omega \). Let \(m \in \mathbb {N}\) be arbitrary, and for \(j \in {\underline{m}}\) define \({M_j := x_0 + r \big [ (\frac{j-1}{m}, \frac{j}{m}) \times [0,1]^{d-1} \big ]}\). Furthermore, for \(b \in \{0,1\}^m\), let \(g_b := \sum _{j \in {\underline{m}} \text { with } b_j = 1} {\mathbb {1}}_{M_j}\), and note \(g_b \in L^p (\Omega )\).
Since \(L^p(\Omega ) \cap {\mathcal {F}} \subset L^p(\Omega )\) is dense, there is for each \(b \in \{0,1\}^m\) some \(f_b \in {\mathcal {F}} \cap L^p(\Omega )\) such that \(\Vert f_b - g_b \Vert _{L^p}^p \le r^d / (2^{1+p} \cdot m \cdot 2^m)\). If we set \(\Omega _b := \{x \in \Omega :|f_b (x) - g_b (x) | \ge 1/2 \}\), then \({\mathbb {1}}_{\Omega _b} \le 2^p \cdot |f_b - g_b|^p\), and hence
$$\begin{aligned} \lambda (\Omega _b) = \int _\Omega {\mathbb {1}}_{\Omega _b} \, d\lambda \le 2^p \, \Vert f_b - g_b \Vert _{L^p}^p \le \frac{r^d}{2 \cdot m \cdot 2^m} , \end{aligned}$$
and thus \(\lambda (\bigcup _{b \in \{0,1\}^m} \Omega _b) \le \frac{r^d}{2m}\), where \(\lambda \) is the Lebesgue measure. Hence,
$$\begin{aligned} \lambda \Big ( M_j \setminus \bigcup _{b \in \{0,1\}^m} \Omega _b \Big ) \ge \lambda (M_j) - \lambda \Big ( \bigcup _{b \in \{0,1\}^m} \Omega _b \Big ) \ge \frac{r^d}{m} - \frac{r^d}{2m} > 0 \quad \text {for all } j \in {\underline{m}} , \end{aligned}$$
so that we can choose for each \(j \in {\underline{m}}\) some \(x_j \in M_j \setminus \bigcup _{b \in \{0,1\}^m} \Omega _b\). We then have
$$\begin{aligned} | f_b (x_j) - b_j | = | f_b (x_j) - g_b (x_j) | < \frac{1}{2} \quad \text {for all } j \in {\underline{m}} \text { and } b \in \{0,1\}^m , \end{aligned}$$
and hence \(f_b (x_j) > 1/2\) if \(b_j = 1\) and \(f_b (x_j) < 1/2\) otherwise. Thus, if we set \(r_1 {:}{=}\dots {:}{=}r_m {:}{=}\frac{1}{2}\), then we have as above that \({\mathbb {1}}_{[0,\infty )} \big ( f_b (x_j) - r_j \big ) = b_j\) for all \(j \in {\underline{m}}\) and \(b \in \{ 0, 1 \}^m\). The remainder of the proof is as for \({\mathcal {Y}} = C(\Omega )\). \(\square \)
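The pseudo-shattering condition used in this proof is a finite check once the points, the thresholds, and a candidate family of functions are fixed. The following Python sketch illustrates it for \(d = 1\) and \(m = 4\). The helper names and the hat-function family are hypothetical stand-ins for the approximants \(f_b\) (they are not the functions from the proof); the constant shift by `error` mimics a uniform approximation error smaller than \(\frac{1}{2}\).

```python
from itertools import product


def is_pseudo_shattered(points, thresholds, functions):
    """Check whether every pattern b in {0,1}^m is realized by some f in `functions`,
    in the sense that 1_{[0,inf)}(f(x_i) - r_i) == b_i for all i."""
    realized = {
        tuple(1 if f(x) - r >= 0 else 0 for x, r in zip(points, thresholds))
        for f in functions
    }
    return all(b in realized for b in product((0, 1), repeat=len(points)))


def approximant(b, points, width=0.1, error=0.25):
    """Hypothetical stand-in for f_b: hats of height 1 at the points with b_j = 1,
    shifted down by `error` (< 1/2) to imitate an imperfect approximation of g_b."""
    def f(x):
        hats = sum(
            max(0.0, 1.0 - abs(x - p) / width)
            for p, bit in zip(points, b)
            if bit == 1
        )
        return hats - error
    return f


points = [0.0, 0.25, 0.5, 0.75]
thresholds = [0.5] * len(points)  # r_1 = ... = r_m = 1/2, as in the proof
family = [approximant(b, points) for b in product((0, 1), repeat=len(points))]

shattered_by_family = is_pseudo_shattered(points, thresholds, family)
shattered_by_single = is_pseudo_shattered(points, thresholds, family[:1])
```

With all \(2^m\) approximants available the points are pseudo-shattered, whereas a single function realizes only one sign pattern; the proof exploits that density yields such a family for every m.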
Note that the following activation functions are computable by elementary operations: any piecewise polynomial function (in particular, the ReLU and the parametric ReLU), the exponential linear unit, the softsign (since the absolute value can be computed using a case distinction), the sigmoid, and the \(\tanh \). Thus, the preceding proposition applies to each of these activation functions.
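To make the notion of being computable by elementary operations concrete, the following Python sketch evaluates three of the listed activation functions through a small counter that only offers the permitted operations. The class `OpCounter` and the function `count_ops` are hypothetical helpers introduced for this illustration; the operation counts are those of these particular implementations and are not claimed to be minimal.

```python
import math


class OpCounter:
    """Offers only exp, arithmetic, and comparisons, and counts each use."""

    def __init__(self):
        self.count = 0

    def _tick(self, value):
        self.count += 1
        return value

    def exp(self, x):
        return self._tick(math.exp(x))

    def add(self, x, y):
        return self._tick(x + y)

    def sub(self, x, y):
        return self._tick(x - y)

    def div(self, x, y):
        return self._tick(x / y)

    def less(self, x, y):
        return self._tick(x < y)


def relu(x, ops):
    # One comparison, followed by a jump.
    return 0.0 if ops.less(x, 0.0) else x


def softsign(x, ops):
    # |x| is obtained by a case distinction, then x / (1 + |x|).
    abs_x = ops.sub(0.0, x) if ops.less(x, 0.0) else x
    return ops.div(x, ops.add(1.0, abs_x))


def sigmoid(x, ops):
    # 1 / (1 + exp(-x)): one subtraction, one exp, one addition, one division.
    return ops.div(1.0, ops.add(1.0, ops.exp(ops.sub(0.0, x))))


def count_ops(activation, x):
    """Return the value of the activation at x and the number of operations used."""
    ops = OpCounter()
    value = activation(x, ops)
    return value, ops.count
```

For each of these implementations the number of operations is bounded independently of the input (at most four here), which is the property required in the definition above.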
Appendix D: Proofs of the Results in Sect. 3
D.1. Proof of Theorem 3.1
The proof of Theorem 3.1 is crucially based on the following lemma:
Lemma D.1
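Let \(\mu \) be a finite Borel measure on \([-B, B]^d\) with uncountable support \(\hbox {supp}\mu \). For \(x^*,v \in \mathbb {R}^d\) with \(v \ne 0\), define
$$\begin{aligned} H_\pm (x^*, v) := x^*+ H_\pm (v) , \end{aligned}$$
where
$$\begin{aligned} H_+ (v) := \{ x \in \mathbb {R}^d \,:\, \langle x, v \rangle > 0 \} \quad \text {and} \quad H_- (v) := \{ x \in \mathbb {R}^d \,:\, \langle x, v \rangle < 0 \} . \end{aligned}$$
Then, there are \(x^*\in [-B, B]^d\) and \(v \in S^{d-1}\) such that if \(f : [-B,B]^d \rightarrow \mathbb {R}\) satisfies
$$\begin{aligned} f(x) = c \text { for } x \in [-B,B]^d \cap H_+ (x^*, v) \quad \text {and} \quad f(x) = c' \text { for } x \in [-B,B]^d \cap H_- (x^*, v) , \quad \text {with } c \ne c' , \end{aligned}$$(D.1)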
then there is no continuous \(g : [-B,B]^d \rightarrow \mathbb {R}\) satisfying \(f = g\) \(\mu \)-almost everywhere.
Proof
Step 1 Let \(K := \hbox {supp}\mu \subset [-B,B]^d\). In this step, we show that there is some \(x^*\in K\) and some \(v \in S^{d-1}\) such that \(x^*\in \overline{K \cap H_+ (x^*, v)} \cap \overline{K \cap H_- (x^*, v)}\). This follows from a result in [68], where the following is shown: For \(x^*\in \mathbb {R}^d\) and \(v \in S^{d-1}\), as well as \(\delta , \eta > 0\), write
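$$\begin{aligned} C(v; \delta , \eta ) := \big \{ r \, \xi \,:\, 0< r< \delta , \, \xi \in S^{d-1} , \, |\xi - v| < \eta \big \} \end{aligned}$$
and
$$\begin{aligned} C(x^*, v; \delta , \eta ) := x^*+ C(v; \delta , \eta ) . \end{aligned}$$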
Then, for each uncountable set \(E \subset \mathbb {R}^d\) and for all but countably many \(x^*\in E\), there is some \(v \in S^{d-1}\) such that \(E \cap C(x^*, v; \delta , \eta )\) and \(E \cap C(x^*, -v; \delta , \eta )\) are both uncountable for all \(\delta , \eta > 0\).
Now, if \(\eta < 1\), then any \(r \xi \in C(v;\delta ,\eta )\) with \(|\xi - v| < \eta \) satisfies \( \langle v, \xi \rangle = \langle v,v \rangle + \langle v, \xi - v \rangle \ge 1 - |\xi - v| > 0 , \) so that \(C(x^*, v; \delta , \eta ) \subset B_\delta (x^*) \cap H_+ (x^*, v)\) and \(C(x^*, -v; \delta , \eta ) \subset B_\delta (x^*) \cap H_- (x^*, v)\). From this it is easy to see that if \(x^*, v\) are as provided by the result in [68] (for \(E = K\)), then indeed \(x^*\in \overline{K \cap H_+ (x^*, v)} \cap \overline{K \cap H_- (x^*, v)}\).
We remark that strictly speaking, the proof in [68] is only provided for \(E \subset \mathbb {R}^3\), but the proof extends almost verbatim to \(\mathbb {R}^d\). A direct proof of the existence of \(x^*, v\) can be found in [58].
Step 2 We show that if \(x^*, v\) are as in Step 1 and if \(f : [-B,B]^d \rightarrow \mathbb {R}\) satisfies (D.1), then there is no continuous \(g : [-B,B]^d \rightarrow \mathbb {R}\) satisfying \(f = g\) \(\mu \)-almost everywhere.
Assume toward a contradiction that such a continuous function g exists. Recall (see for instance [19, Sect. 7.4]) that the support of \(\mu \) is defined as
$$\begin{aligned} \hbox {supp}\mu = \big \{ x \in [-B,B]^d \,:\, \mu (U) > 0 \text { for every open set } U \subset [-B,B]^d \text { with } x \in U \big \} . \end{aligned}$$
In particular, if \(U \subset [-B,B]^d\) is open with \(U \cap \hbox {supp}\mu \ne \varnothing \), then \(\mu (U) > 0\).
For each \(n \in \mathbb {N}\), set \(U_{n,+} := B_{1/n}(x^*) \cap [-B,B]^d \cap H_+ (x^*, v)\) and \(U_{n,-} := B_{1/n}(x^*) \cap [-B,B]^d \cap H_- (x^*, v)\), and note that \(U_{n,\pm }\) are both open (as subsets of \([-B,B]^d\)) with \(K \cap U_{n, \pm } \ne \varnothing \), since \(x^*\in \overline{K \cap H_+ (x^*, v)}\) and \(x^*\in \overline{K \cap H_- (x^*, v)}\). Hence, \(\mu (U_{n,\pm }) > 0\). Since \(f = g\) \(\mu \)-almost everywhere, there exist \(x_{n,\pm } \in U_{n,\pm }\) with \(f(x_{n,\pm }) = g(x_{n,\pm })\). This implies \(g(x_{n,+}) = c\) and \(g(x_{n,-}) = c'\). But since \(x_{n,\pm } \in B_{1/n}(x^*)\), we have \(x_{n,\pm } \rightarrow x^*\), so that the continuity of g implies \(g(x^*) = \lim _{n} g(x_{n,+}) = c\) and \(g(x^*) = \lim _n g(x_{n,-}) = c'\), in contradiction to \(c \ne c'\). \(\square \)
We now prove Theorem 3.1. Set \(\Omega {:}{=}[-B,B]^d\) and define \({\widetilde{N}}_i := 1\) for \({i = 1,\dots ,L-2}\) and \({\widetilde{N}}_{L-1} := 2\) if \(\varrho \) is unbounded, while \({\widetilde{N}}_{L-1} := 1\) otherwise. We show that for \(\varrho \) as given in the statement of the theorem there exists a sequence of functions in the set \( \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }( (d, {\widetilde{N}}_1, \dots , {\widetilde{N}}_{L-1}, 1) ) \subset \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }( S ) \) such that the sequence converges (in \(L^p(\mu )\)) to a bounded, discontinuous limit \(f \in L^\infty (\mu )\), meaning that f does not have a continuous representative, even after possibly changing it on a \(\mu \)-null-set. Since \( \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^\Omega (S) \subset C(\Omega ) , \) this will show that \(f \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \setminus \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\).
For the construction of the sequence, let \(x^*\in \hbox {supp}\mu \) and \(v \in S^{d-1}\) as provided by Lemma D.1. Extend v to an orthonormal basis \((v, w_1, \dots , w_{d-1})\) of \(\mathbb {R}^d\), and define \(A := O^T\) for \(O := (v, w_1, \dots , w_{d-1}) \in \mathbb {R}^{d \times d}\). Note that \(x^*\in \Omega \subset {\overline{B}}_{dB} (0)\) and hence \(A (\Omega - x^*) \subset {\overline{B}}_{2dB}(0) \subset [-2dB, 2dB]^d =: \Omega '\). Define \(B' := 2 d B\).
Next, using Proposition B.3, choose a neural network \(\Psi \in \mathcal {NN}((d, 1, \dots , 1))\) with \(L-1\) layers such that
(1) \(\mathrm {R}^{\Omega '}_{\varrho }(\Psi )(0) = 0\);
(2) \(\mathrm {R}_{\varrho }^{\Omega '}(\Psi )\) is differentiable at 0 and \(\frac{\partial \mathrm {R}_{\varrho }^{\Omega '}(\Psi )}{\partial x_1}(0)=1\);
(3) \(\mathrm {R}_{\varrho }^{\Omega '}(\Psi )\) is constant in all but the \(x_1\)-direction; and
(4) \(\mathrm {R}_\varrho ^{\Omega '}(\Psi )\) is increasing with respect to each variable (with the remaining variables fixed).
Let \(J_0 {:}{=}\mathrm {R}_{\varrho }^{\Omega '}(\Psi )\). Since \(J_0(0) = 0\) and \(\frac{\partial J_0}{\partial x_1} (0) = 1\), we see directly from the definition of the partial derivative that for each \(\delta \in (0, B')\), there are \(x_\delta \in (-\delta , 0)\) and \(y_\delta \in (0, \delta )\) such that \({J_0(x_\delta , 0, \dots , 0) < J_0(0) = 0}\) and \(J_0(y_\delta , 0, \dots , 0) > J_0(0) = 0\). Furthermore, Properties (3) and (4) from above show that \(J_0(x)\) only depends on \(x_1\) and that \(t \mapsto J_0(t, 0, \dots , 0)\) is increasing. In combination, these observations imply that
$$\begin{aligned} J_0 (x) {\left\{ \begin{array}{ll} < 0 , &{} \text {if } x_1 < 0 , \\ = 0 , &{} \text {if } x_1 = 0 , \\ > 0 , &{} \text {if } x_1 > 0 \end{array}\right. } \quad \text {for all } x \in \Omega ' . \end{aligned}$$(D.2)
Finally, with \(\Psi = \big ( (A_1, b_1),\dots ,(A_{L-1}, b_{L-1}) \big )\), define
$$\begin{aligned} \Phi := \big ( (A_1 A, \, b_1 - A_1 A \, x^*), (A_2, b_2), \dots , (A_{L-1}, b_{L-1}) \big ) , \end{aligned}$$
and note that \(\Phi \in \mathcal {NN}( (d,1,\dots ,1))\) with \(L-1\) layers, and \(\mathrm {R}_\varrho ^{\mathbb {R}^d} (\Phi )(x) = \mathrm {R}_\varrho ^{\mathbb {R}^d} (\Psi )(A (x - x^*))\) for all \(x \in \mathbb {R}^d\). Combining this with the definition of A and with Eq. (D.2), and noting that \(A(x - x^*) \in \Omega '\) for \(x \in \Omega \), we see that \(J := \mathrm {R}_\varrho ^\Omega (\Phi )\) satisfies
$$\begin{aligned} J(x) {\left\{ \begin{array}{ll} > 0 , &{} \text {if } x \in \Omega \cap H_+ (x^*, v) , \\ = 0 , &{} \text {if } x \in \Omega \cap H_0 (x^*, v) , \\ < 0 , &{} \text {if } x \in \Omega \cap H_- (x^*, v) , \end{array}\right. } \end{aligned}$$(D.3)
where \(H_0(x^*, v) := \mathbb {R}^d \setminus (H_- (x^*, v) \cup H_+ (x^*, v))\).
We now distinguish the cases given in Assumption (iv)(a) and (b) of Theorem 3.1.
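Case 1 \(\varrho \) is unbounded, so that necessarily Assumption (iv)(a) of Theorem 3.1 holds, and \({\widetilde{N}}_{L-1} = 2\). For \(n \in \mathbb {N}\) let \(\Phi _n = \big ( (A_1^n,b_1^n),(A_2^n,b_2^n) \big ) \in \mathcal {NN}((1,2,1))\) be given by
$$\begin{aligned} A_1^n := \begin{pmatrix} n \\ n \end{pmatrix} \in \mathbb {R}^{2 \times 1} , \quad b_1^n := \begin{pmatrix} 0 \\ -1 \end{pmatrix} \in \mathbb {R}^{2} , \quad A_2^n := \begin{pmatrix} 1&-1 \end{pmatrix} \in \mathbb {R}^{1 \times 2} , \quad b_2^n := 0 \in \mathbb {R}^1 . \end{aligned}$$
Then, \(\mathrm {R}_\varrho ^{\mathbb {R}} (\Phi _n)(t) = \varrho (n t) - \varrho (n t - 1)\) for all \(t \in \mathbb {R}\). Now, let us define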
![](http://media.springernature.com/lw480/springer-static/image/art%3A10.1007%2Fs10208-020-09461-0/MediaObjects/10208_2020_9461_Equ215_HTML.png)
Then, since \(h_n\) is continuous and hence bounded on the compact set \(\Omega \), we see that \(h_n \in L^p(\mu )\) for every \(n \in \mathbb {N}\) and all \(p \in (0, \infty ]\).
We now show that \((h_n)_{n \in \mathbb {N}}\) converges to a discontinuous limit. To see this, first consider \(x \in \Omega \cap H_+ (x^*, v)\). Since \(J(x) > 0\) by (D.3), there exists some \(N_x \in \mathbb {N}\) such that for all \(n \ge N_x\), the estimate \(nJ(x) - 1 > r\) holds, where \(r > 0\) is as in Assumption (iii) of Theorem 3.1. Hence, by the mean value theorem, there exists some \(\xi _n^x\in [nJ(x)-1,nJ(x)]\) such that
$$\begin{aligned} h_n (x) = \varrho \big ( n J(x) \big ) - \varrho \big ( n J(x) - 1 \big ) = \varrho ' (\xi _n^x) \xrightarrow [n \rightarrow \infty ]{} \lambda , \end{aligned}$$
since \(\xi _n^x \rightarrow \infty \) as \(n \rightarrow \infty , n \ge N_x\). Analogously, it follows for \(x \in \Omega \cap H_- (x^*, v)\) that \(\lim _{n\rightarrow \infty } h_n(x) = \lambda '\). Hence, setting \(\gamma := \varrho (0) - \varrho (-1)\), we see for each \(x \in \Omega \) that
$$\begin{aligned} \lim _{n \rightarrow \infty } h_n (x) = \lambda \cdot {\mathbb {1}}_{H_+ (x^*, v)} (x) + \lambda ' \cdot {\mathbb {1}}_{H_- (x^*, v)} (x) + \gamma \cdot {\mathbb {1}}_{H_0 (x^*, v)} (x) =: h(x) . \end{aligned}$$
We now claim that there is some \(M > 0\) such that \(|\varrho (x) - \varrho (x - 1)| \le M\) for all \(x \in \mathbb {R}\). To see this, note because of \(\varrho '(x) \rightarrow \lambda \) as \(x \rightarrow \infty \) and because of \(\varrho '(x) \rightarrow \lambda '\) as \(x \rightarrow -\infty \) that there are \(M_0 > 0\) and \(R > r\) with \(|\varrho '(x)| \le M_0\) for all \(x \in \mathbb {R}\) with \(|x| \ge R\). Hence, \(\varrho \) is \(M_0\)-Lipschitz on \((-\infty ,-R]\) and on \([R, \infty )\), so that \(| \varrho (x) - \varrho (x-1) | \le M_0\) for all \(x \in \mathbb {R}\) with \(|x| \ge R+1\). But by continuity and compactness, we also have \(| \varrho (x) - \varrho (x-1) | \le M_1\) for all \(|x| \le R+1\) and some constant \(M_1 > 0\). Thus, we can simply choose \(M {:}{=}\max \{M_0, M_1\}\).
By what was shown in the preceding paragraph, we get \(|h_n| \le M\) for all \(n \in \mathbb {N}\), and hence also \(|h| \le M\). Hence, by the dominated convergence theorem, we see for any \(p \in (0,\infty )\) that \( \lim _{n\rightarrow \infty } \left\| h_n - h \right\| _{L^p(\mu )} = 0 . \) But since \(\lambda \ne \lambda '\), Lemma D.1 shows that h does not have a continuous representative, even after changing it on a \(\mu \)-null-set. This yields the required non-continuity of a limit point as discussed at the beginning of the proof.
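The pointwise behavior in Case 1 can be observed numerically. The following Python sketch is an illustration under simplifying assumptions that are not part of the proof: it replaces \(J(x)\) by a scalar \(t\), takes \(h_n\) to be \(\varrho (n t) - \varrho (n t - 1)\) as suggested by the mean value argument above, and evaluates it for one large n at a negative, a zero, and a positive value of t. The function names are hypothetical.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x):
    # Numerically stable form of ln(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def h(rho, n, t):
    """Value of h_n at a point where J equals t: rho(n t) - rho(n t - 1)."""
    return rho(n * t) - rho(n * t - 1.0)


def limit_profile(rho, n=10**6):
    """Values of h_n for t < 0, t = 0, and t > 0 (in this order)."""
    return [h(rho, n, t) for t in (-0.5, 0.0, 0.5)]


relu_profile = limit_profile(relu)
softplus_profile = limit_profile(softplus)
softplus_gamma = softplus(0.0) - softplus(-1.0)  # gamma = rho(0) - rho(-1)
```

For both activation functions the values approach \(\lambda ' = 0\) for \(t < 0\) and \(\lambda = 1\) for \(t > 0\), while the value at \(t = 0\) equals \(\gamma \) for every n, in line with the piecewise constant limit h.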
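Case 2 \(\varrho \) is bounded, so that \({\widetilde{N}}_{L-1} = 1\). Since \(\varrho \) is monotonically increasing, there exist \(c,c'\in \mathbb {R}\) such that
$$\begin{aligned} \lim _{x \rightarrow \infty } \varrho (x) = c \quad \text {and} \quad \lim _{x \rightarrow -\infty } \varrho (x) = c' . \end{aligned}$$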
By the monotonicity and since \(\varrho \) is not constant (because of \(\varrho ' (x_0) \ne 0\)), we have \(c > c'\).
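For each \(n\in \mathbb {N}\), we now consider the neural network \( {\widetilde{\Phi }}_n = \big ( ({\widetilde{A}}_1^n,{\widetilde{b}}_1^n),({\widetilde{A}}_2^n,{\widetilde{b}}_2^n) \big ) \in \mathcal {NN}((1, 1, 1)) \) given by
$$\begin{aligned} {\widetilde{A}}_1^n := n , \quad {\widetilde{b}}_1^n := 0 , \quad {\widetilde{A}}_2^n := 1 , \quad {\widetilde{b}}_2^n := 0 . \end{aligned}$$
Then, \(\mathrm {R}_\varrho ^{\mathbb {R}} ({\widetilde{\Phi }}_n)(t) = \varrho (n t)\) for all \(t \in \mathbb {R}\). Now, let us define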
![](http://media.springernature.com/lw528/springer-static/image/art%3A10.1007%2Fs10208-020-09461-0/MediaObjects/10208_2020_9461_Equ216_HTML.png)
Since each of the \({\widetilde{h}}_n\) is continuous and \(\Omega \) is compact, we have \({\widetilde{h}}_n\in L^p(\mu )\) for all \(p\in (0,\infty ]\). Equation (D.3) implies that \(J(x) > 0\) for all \(x \in \Omega \cap H_+ (x^*, v)\). This in turn yields that
$$\begin{aligned} \lim _{n \rightarrow \infty } {\widetilde{h}}_n (x) = c \quad \text {for all } x \in \Omega \cap H_+ (x^*, v) . \end{aligned}$$(D.4)
Similarly, the fact that \(J(x) < 0\) for all \(x \in \Omega \cap H_- (x^*, v)\) yields
$$\begin{aligned} \lim _{n \rightarrow \infty } {\widetilde{h}}_n (x) = c' \quad \text {for all } x \in \Omega \cap H_- (x^*, v) . \end{aligned}$$(D.5)
Combining (D.4) with (D.5) yields for all \(x \in \Omega \) that
$$\begin{aligned} \lim _{n \rightarrow \infty } {\widetilde{h}}_n (x) = c \cdot {\mathbb {1}}_{H_+ (x^*, v)} (x) + c' \cdot {\mathbb {1}}_{H_- (x^*, v)} (x) + \varrho (0) \cdot {\mathbb {1}}_{H_0 (x^*, v)} (x) =: {\widetilde{h}} (x) . \end{aligned}$$
By the boundedness of \(\varrho \), we get \(|{\widetilde{h}}_n (x)| \le C\) for all \(n \in \mathbb {N}\) and \(x \in \Omega \) and a suitable \(C > 0\), so that also \({\widetilde{h}}\) is bounded. Together with the dominated convergence theorem, this implies for any \(p \in (0,\infty )\) that \( \lim _{n \rightarrow \infty } \big \Vert {\widetilde{h}}_n - {\widetilde{h}} \big \Vert _{L^p(\mu )} =0. \) Since \(c \ne c'\), Lemma D.1 shows that \({\widetilde{h}}\) does not have a continuous representative (with respect to equality \(\mu \)-almost everywhere). This yields the required non-continuity of a limit point as discussed at the beginning of the proof. \(\square \)
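The difference between convergence in \(L^p(\mu )\) and uniform convergence in Case 2 can be illustrated numerically. The following Python sketch rests on assumptions made only for this illustration: \(d = 1\), \(\mu \) is the Lebesgue measure on \([-1,1]\), \(J(x) = x\), \(\varrho \) is the sigmoid, and \({\widetilde{h}}_n (t) = \varrho (n t)\). The function names and the midpoint quadrature are hypothetical choices.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def step_limit(t, c=1.0, c_prime=0.0, at_zero=0.5):
    """Pointwise limit of sigmoid(n t): c for t > 0, c' for t < 0, sigmoid(0) at t = 0."""
    if t > 0:
        return c
    if t < 0:
        return c_prime
    return at_zero


def l1_distance(n, num_cells=20000):
    """Midpoint-rule approximation of the L^1([-1, 1]) distance to the step limit."""
    width = 2.0 / num_cells
    total = 0.0
    for i in range(num_cells):
        t = -1.0 + (i + 0.5) * width
        total += abs(sigmoid(n * t) - step_limit(t)) * width
    return total


def gap_near_zero(n):
    """Deviation from the limit at a point t > 0 that is very close to 0."""
    t = 1e-9 / n
    return abs(sigmoid(n * t) - step_limit(t))


l1_n10 = l1_distance(10)
l1_n100 = l1_distance(100)
gap_n100 = gap_near_zero(100)
```

In this setting the \(L^1\) distance decays like \(2 \ln (2) / n\), whereas the deviation close to \(t = 0\) stays near \(\frac{1}{2}\) for every n, so the convergence is not uniform.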
D.2. Proof of Corollary 3.2
It is not hard to verify that all functions listed in Table 1 are continuous and increasing. Furthermore, each activation function \(\varrho \) listed in Table 1 is not constant and satisfies \(\varrho |_{\mathbb {R}\setminus \{0\}} \in C^\infty (\mathbb {R}\setminus \{0\})\). This shows that \(\varrho |_{(-\infty ,-r) \cup (r,\infty )}\) is differentiable for any \(r > 0\), and that there is some \(x_0 = x_0 (\varrho ) \in \mathbb {R}\) such that \(\varrho '(x_0) \ne 0\).
Next, the softsign, the inverse square root unit, the sigmoid, the \(\tanh \), and the \(\arctan \) function are all bounded, and thus satisfy condition (iv)(b) of Theorem 3.1. Thus, all that remains is to verify condition (iv)(a) of Theorem 3.1 for the remaining activation functions:
1. For the ReLU \(\varrho (x) = \max \{0, x\}\), condition (iv)(a) is satisfied with \(\lambda = 1\) and \(\lambda ' = 0 \ne \lambda \).
2. For the parametric ReLU \(\varrho (x) = \max \{a x, x\}\) (with \(a \ge 0\), \(a \ne 1\)), condition (iv)(a) is satisfied with \(\lambda = \max \{ 1, a \}\) and \(\lambda ' = \min \{ 1, a \}\), where \(\lambda \ne \lambda '\) since \(a \ne 1\).
3. For the exponential linear unit \(\varrho (x) = x {\mathbb {1}}_{[0,\infty )}(x) + (e^x - 1) {\mathbb {1}}_{(-\infty ,0)}(x)\), condition (iv)(a) is satisfied for \(\lambda = 1\) and \(\lambda ' = \lim _{x \rightarrow -\infty } e^x = 0 \ne \lambda \).
4. For the inverse square root linear unit \(\varrho (x) = x {\mathbb {1}}_{[0,\infty )}(x) + \frac{x}{\sqrt{1 + a x^2}} {\mathbb {1}}_{(-\infty ,0)}(x)\), the quotient rule shows that for \(x<0\) we have
$$\begin{aligned} \varrho '(x) &= \frac{\sqrt{1 + a x^2} - x \cdot \frac{1}{2} (1 + a x^2)^{-1/2} \, 2 a x}{1 + a x^2} \\ &= \frac{(1 + a x^2) - a x^2}{(1 + a x^2)^{3/2}} = (1 + a x^2)^{-3/2} . \end{aligned} \tag{D.6}$$
Therefore, condition (iv)(a) is satisfied for \(\lambda = 1\) and \(\lambda ' = \lim _{x \rightarrow -\infty } \varrho '(x) = 0 \ne \lambda \).
5. For the softplus function \(\varrho (x) = \ln (1 + e^x)\), condition (iv)(a) is satisfied for
$$\begin{aligned} \lambda = \lim _{x \rightarrow \infty } \frac{e^x}{1 + e^x} = 1 \quad \text {and} \quad \lambda ' = \lim _{x \rightarrow -\infty } \frac{e^x}{1 + e^x} = 0 \ne \lambda . \end{aligned}$$
\(\square \)
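The five cases above can be checked numerically. The Python sketch below assumes, as the limits written out for the inverse square root linear unit and the softplus function suggest, that \(\lambda \) and \(\lambda '\) denote the limits of \(\varrho '\) at \(+\infty \) and \(-\infty \), respectively. The function names and the default parameter values (`a=0.25`, `a=1.0`, `big=1e3`) are hypothetical choices made for illustration.

```python
import math

def sigmoid(t: float) -> float:
    """Numerically stable logistic sigmoid, the derivative of softplus."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

# Derivatives of the unbounded activation functions, valid for x != 0.
def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_prelu(x, a=0.25):
    # max{a x, x} has slope max{1, a} for x > 0 and min{1, a} for x < 0.
    return max(1.0, a) if x > 0 else min(1.0, a)

def d_elu(x):
    return 1.0 if x > 0 else math.exp(x)

def d_isrlu(x, a=1.0):
    # For x < 0 this is the closed form (1 + a x^2)^(-3/2) from Eq. (D.6).
    return 1.0 if x > 0 else (1.0 + a * x * x) ** -1.5

def d_softplus(x):
    return sigmoid(x)

def asymptotic_slopes(deriv, big=1e3):
    """Approximate (lambda, lambda') by evaluating the derivative far out."""
    return deriv(big), deriv(-big)

for name, d in [("ReLU", d_relu), ("parametric ReLU", d_prelu), ("ELU", d_elu),
                ("ISRLU", d_isrlu), ("softplus", d_softplus)]:
    print(name, asymptotic_slopes(d))
```

In each case the two approximate slopes differ, in line with \(\lambda \ne \lambda '\).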
D.3. Proof of Theorem 3.3
D.3.1. Proof of Theorem 3.3 Under Condition (i)
Let \(\Omega {:}{=}[-B,B]^d\). Let \(m \in \mathbb {N}\) be maximal with \(\varrho \in C^m (\mathbb {R})\); this is possible since \(\varrho \in C^1 (\mathbb {R}) \setminus C^\infty (\mathbb {R})\). Note that \(\varrho \in C^{m}(\mathbb {R}) \setminus C^{m+1}(\mathbb {R})\). This easily implies \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S) \subset C^{m}({\Omega })\).
We now show for the architecture \(S' := (d, {\widetilde{N}}_1, \dots , {\widetilde{N}}_{L-2}, 2, 1)\), where \({\widetilde{N}}_i := 1\) for all \(i = 1, \dots , L-2\), that the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S')\) is not closed in \(C(\Omega )\). If we had \(\varrho \in C^{m+1}([-C,C])\) for all \(C > 0\), this would imply \(\varrho \in C^{m+1}(\mathbb {R})\); thus, there is \(C > 0\) such that \(\varrho \notin C^{m+1} ([-C,C])\). Now, choose \(\lambda > C / B\), so that \(\lambda [-B,B] \supset [-C,C]\). This entails that \(\varrho (\lambda \cdot )\in C^{m}([-B,B]) \setminus C^{m+1}([-B,B])\). Next, since the continuous derivative \(\frac{\mathrm{{d}}}{\mathrm{{d}}x} \varrho (\lambda x) = \lambda \varrho ' (\lambda x)\) is bounded on the compact set \([-B,B]\), we see that \(\varrho (\lambda \cdot )\) is Lipschitz continuous on \([-B,B]\), and we set \(M_1 {:}{=}\mathrm {Lip}(\varrho (\lambda \cdot ))\). Next, by the uniform continuity of \(\lambda \cdot \varrho '(\lambda \cdot )\) on \([-(B+1),B+1]\), if we set
then \(\varepsilon _n \rightarrow 0\) as \(n \rightarrow \infty \).
For \(n\in \mathbb {N}\), let \(\Phi _n^1 = \big ((A_1^n,b_1^n),(A_2^n,b_2^n)\big )\in \mathcal {NN}((1,2,1))\) be given by
Note that there is some \(x^*\in \mathbb {R}\) such that \(\varrho ' (x^*) \ne 0\), since otherwise \(\varrho ' \equiv 0\) and hence \(\varrho \in C^\infty (\mathbb {R})\). Thus, for each \(n \in \mathbb {N}\), Proposition B.3 yields the existence of a neural network \(\Phi ^2_n \in \mathcal {NN}((d,1,\dots ,1))\) with \(L-1\) layers such that
We set and \(f_n {:}{=}\mathrm {R}_\varrho ^\Omega (\Phi _n)\). For \(x \in \Omega \), we then have
Now, by the Lipschitz continuity of \(\varrho (\lambda \cdot )\) and Eq. (D.7), we conclude that
This implies for every \(x \in \Omega \) that
Here, the last step used that \(| \xi _{n}^x - x_1 | \le n^{-1} \le 1\), so that \(x_1, \xi _{n}^x \in [-(B+1),B+1]\).
Overall, we established the existence of a sequence \((f_n)_{n \in \mathbb {N}}\) in \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S')\) which converges uniformly to the function \(\Omega \rightarrow \mathbb {R}, ~x\mapsto \varrho _\lambda (x) {:}{=}\lambda \, \varrho '(\lambda x_1)\). By our choice of \(\lambda \), we have \(\varrho _\lambda \not \in C^{m}({\Omega })\). Because of \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S') \subset C^{m}({\Omega })\), we thus see that \(\varrho _\lambda \not \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S')\), so that \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S')\) is not closed in \(C(\Omega )\).
Finally, note by Lemma B.1 that
Since \(f_n \rightarrow \varrho _\lambda \) uniformly, where \(\varrho _\lambda \notin C^m (\Omega )\), and hence \(\varrho _\lambda \notin \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\), we thus see that \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not closed in \(C(\Omega )\). \(\square \)
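The core of this argument is that a difference quotient of \(\varrho (\lambda \cdot )\), which a network with two neurons in its last hidden layer can realize, converges uniformly to \(\lambda \varrho '(\lambda \cdot )\), a function of lower smoothness. The Python sketch below illustrates this under explicit assumptions: the network is taken to realize \(x \mapsto n \big ( \varrho (\lambda (x + 1/n)) - \varrho (\lambda x) \big )\) (the displayed definition of \(\Phi _n^1\) is not reproduced here), the activation is \(\varrho = \mathrm {ReLU}_2\), and \(\lambda = 1\), \(B = 1\). All function names are hypothetical.

```python
def relu2(x: float) -> float:
    """ReLU_2(x) = max{0, x}^2, which lies in C^1 but not in C^2."""
    return max(0.0, x) ** 2

def relu2_prime(x: float) -> float:
    """Derivative 2 * max{0, x}, which is continuous but not C^1."""
    return 2.0 * max(0.0, x)

def diff_quotient(rho, n: int, lam: float = 1.0):
    """Assumed two-neuron realization x -> n * (rho(lam (x + 1/n)) - rho(lam x)).
    The outer weights +n and -n grow without bound as n increases."""
    return lambda x: n * (rho(lam * (x + 1.0 / n)) - rho(lam * x))

def sup_error(n: int, B: float = 1.0, lam: float = 1.0, grid: int = 4001) -> float:
    """Grid approximation of the uniform distance on [-B, B] between the
    difference quotient and lam * rho'(lam x)."""
    f_n = diff_quotient(relu2, n, lam)
    xs = [-B + 2.0 * B * i / (grid - 1) for i in range(grid)]
    return max(abs(f_n(x) - lam * relu2_prime(lam * x)) for x in xs)

for n in (10, 100, 1000):
    print(n, sup_error(n))
```

For this choice of \(\varrho \), the uniform error equals \(1/n\), so the realizations converge uniformly, while the limit \(2 \max \{0, x\}\) is not continuously differentiable and the outer weights \(\pm n\) diverge.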
D.3.2. Proof of Theorem 3.3 Under Condition (ii)
Let \(\Omega {:}{=}[-B,B]^d\). We first show that if we set \(S' := (d, {\widetilde{N}}_1, \dots , {\widetilde{N}}_{L-2},2,1)\), where \({\widetilde{N}}_i {:}{=}1\) for all \({i = 1, \dots , L-2}\), then there exists a limit point of \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S')\) which is the restriction \(f|_\Omega \) of an unbounded analytic function \(f : \mathbb {R}\rightarrow \mathbb {R}\).
Since \(\varrho \) is not constant, there is some \(x^*\in \mathbb {R}\) such that \(\varrho '(x^*) \ne 0\). For \(n \in \mathbb {N}\), let us define \({\Phi _n^1 {:}{=}\big ( (A_1^n,b_1^n),(A_2^n,b_2^n) \big ) \in \mathcal {NN}((1,2,1))}\) by
With this choice, we have
For any \(x \in \mathbb {R}\), the mean-value theorem yields \({\widetilde{x}}\) between \(x^*\) and \(x^*+ \frac{x}{n}\) satisfying \(\varrho (x^*+ \frac{x}{n}) - \varrho (x^*) = \frac{x}{n} \cdot \varrho ' ({\widetilde{x}})\). Therefore, if \(B > 0\) and \(x \in [-B,B]\), then
Since \(\varrho '\) is continuous, we conclude that
Moreover, note that \( \frac{\mathrm{{d}}}{\mathrm{{d}}x}\mathrm {R}_\varrho ^{\mathbb {R}}(\Phi _n^1)(x) = \varrho '(x) + \varrho '(x^*+ n^{-1} \cdot x) \) is bounded on \([-(B+1),B+1]\), uniformly with respect to \(n \in \mathbb {N}\). Hence, \(\mathrm {R}_\varrho ^{\mathbb {R}}(\Phi _n^1)\) is Lipschitz continuous on \([-(B+1), B+1]\), with Lipschitz constant \(C' > 0\) independent of \(n \in \mathbb {N}\).
Next, for each \(n \in \mathbb {N}\), Proposition B.3 yields a neural network \(\Phi ^2_n \in \mathcal {NN}((d,1,\dots ,1))\) with \(L-1\) layers such that
We set and note for all \(x \in \Omega \) that
By the Lipschitz continuity of \(\mathrm {R}_\varrho ^{\mathbb {R}}(\Phi _n^1)\) on \([-(B+1), B+1]\), and using (D.9), we conclude that
so that an application of (D.8) yields
Now, to show that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S) \subset C(\Omega )\) is not closed, due to the fact that there holds \({ \mathrm {R}_\varrho ^\Omega (\Phi _n) \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S') \subset \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S), }\) it is sufficient to show that with
\(F|_{\Omega }\) is not an element of \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\). This is accomplished, once we show that there do not exist any \({\widehat{N}}_{1}, \dots , {\widehat{N}}_{L-1} \in \mathbb {N}\) such that \(F|_{\Omega }\) is an element of \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }( (d, {\widehat{N}}_1, \dots , {\widehat{N}}_{L-1}, 1) )\).
Toward a contradiction, we assume that there exist \({\widehat{N}}_1, \dots , {\widehat{N}}_{L-1} \in \mathbb {N}\) such that \(F|_{\Omega } = \mathrm {R}_\varrho ^{\Omega }(\Phi ^3)\) for a network \(\Phi ^3 \in \mathcal {NN}( (d, {\widehat{N}}_1, \dots , {\widehat{N}}_{L-1}, 1) )\). Since F and \( \mathrm {R}_\varrho ^{\mathbb {R}^{d}}(\Phi ^3)\) are both analytic functions that coincide on \(\Omega = [-B,B]^d\), they must be equal on all of \(\mathbb {R}^d\). However, F is unbounded (since \(\varrho \) is bounded, and since \(\varrho '(x^*) \ne 0\)), while \(\mathrm {R}_\varrho ^{\mathbb {R}^{d}}(\Phi ^3)\) is bounded as a consequence of \(\varrho \) being bounded. This produces the desired contradiction. \(\square \)
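A small numerical experiment makes the contradiction tangible. The Python sketch below assumes that \(\mathrm {R}_\varrho ^{\mathbb {R}}(\Phi _n^1)(x) = \varrho (x) + n \big ( \varrho (x^*+ x/n) - \varrho (x^*) \big )\), which is inferred from the derivative formula stated in the proof (the displayed definition of \(\Phi _n^1\) is not reproduced here), and that the limit is \(F(x) = \varrho (x) + \varrho '(x^*) \, x\). It uses \(\varrho = \tanh \), which is bounded, analytic, and non-constant, together with \(x^*= 0\). The names `f_n`, `F`, and `sup_error` are hypothetical.

```python
import math

def f_n(x: float, n: int, x_star: float = 0.0) -> float:
    """Assumed realization rho(x) + n * (rho(x_star + x/n) - rho(x_star)), rho = tanh."""
    return math.tanh(x) + n * (math.tanh(x_star + x / n) - math.tanh(x_star))

def F(x: float, x_star: float = 0.0) -> float:
    """Candidate limit rho(x) + rho'(x_star) * x, using tanh'(t) = 1 - tanh(t)^2."""
    return math.tanh(x) + (1.0 - math.tanh(x_star) ** 2) * x

def sup_error(n: int, B: float = 2.0, grid: int = 2001) -> float:
    """Grid approximation of the uniform distance between f_n and F on [-B, B]."""
    xs = [-B + 2.0 * B * i / (grid - 1) for i in range(grid)]
    return max(abs(f_n(x, n) - F(x)) for x in xs)

for n in (10, 100, 1000):
    print(n, sup_error(n))
```

Each `f_n` is bounded on all of \(\mathbb {R}\) (by \(1 + 2n\) in this example), whereas `F` grows linearly. Only on the bounded set \([-B,B]\) do the two become close, and only at the price of weights of size \(n\).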
D.3.3. Proof of Theorem 3.3 Under Condition (iii)
Let \(\varrho \in C^{\max \{r,q\}}(\mathbb {R})\) be approximately homogeneous of order (r, q) with \(r \ne q\). For simplicity, let us assume that \(r > q\); we will briefly comment on the case \(q > r\) at the end of the proof.
Note that \(r \ge 1\), since \(r,q \in \mathbb {N}_0\) with \(r > q\). Let \((x)_+ {:}{=}\max \{x,0\}\) for \(x \in \mathbb {R}\). We start by showing that
To see this, let \(s > 0\) be such that \(|\varrho (x) - x^r| \le s\) for all \(x> 0\) and \(|\varrho (x) - x^q| \le s\) for all \(x < 0\). For any \(k \in \mathbb {N}\) and \(x \in [-B,0]\), we have
for a constant \(c_0 = c_0(B, s, r, q) > 0\). Moreover, for \(x \in [0,B]\), we have
Overall, we conclude that
which implies (D.10).
We observe that \(\big ( x \mapsto (x)_+^r \big ) \not \in C^{r}([-B,B])\). Additionally, since \(\varrho \in C^{\max \{r,q\}}(\mathbb {R}) = C^r (\mathbb {R})\), we have \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S) \subset C^r([-B,B]^d)\). Hence, the proof is complete if we can construct a sequence \((\Phi _n)_{n \in \mathbb {N}}\) of neural networks in \(\mathcal {NN}((d,1,\dots ,1))\) (with L layers) such that the \(\varrho \)-realizations \(\mathrm {R}_\varrho ^\Omega (\Phi _n)\) converge uniformly to the function \([-B,B]^d \rightarrow \mathbb {R}, x \mapsto (x_1)_+^r\). By the preceding considerations, this is clearly possible, as can be seen by the same arguments used in the proofs of the previous results. For invoking these arguments, note that \(\max \{r,q\} \ge 1\), so that \(\varrho \in C^1 (\mathbb {R})\). Also, since \(\varrho \) is approximately homogeneous of order (r, q) with \(r \ne q\), \(\varrho \) cannot be constant, and hence \(\varrho '(x_0) \ne 0\) for some \(x_0 \in \mathbb {R}\).
For completeness, let us briefly consider the case where \(q > r\) that was omitted at the beginning of the proof. In this case, \((-k)^{-q} \, \varrho (-k \cdot ) \rightarrow (\cdot )_+^q\) with uniform convergence on \([-B,B]\). Indeed, for \(x \in [0,B]\), we have \(|(-k)^{-q} \, \varrho (-k x) - (x)_+^q| = k^{-q} |\varrho (-k x) - (-k x)^q| \le k^{-q} \cdot s \le s \cdot k^{-1}\). Similarly, for \(x \in [-B, 0]\), we get \(| (-k)^{-q} \, \varrho (-k x) - (x)_+^q | \le k^{-q} \big ( |\varrho (-k x) - (-k x)^r| + |(-k x)^r| \big ) \le k^{-q} (s + B^r k^r) \le c_1 \cdot k^{-1}\) for some constant \(c_1 = c_1(B,s,r,q) > 0\). Here, we used that \(q - r \ge 1\), since \(r,q \in \mathbb {N}_0\) with \(q > r\). Now, the proof proceeds as before, noting that \(\big ( x \mapsto (x)_+^q \big ) \notin C^q ([-B,B])\), while \(\varrho \in C^{\max \{r,q\}} (\mathbb {R}) \subset C^q (\mathbb {R})\). \(\square \)
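The rescaling argument can be tried out on the softplus function, which is approximately homogeneous of order (1, 0). The Python sketch below assumes that (D.10), whose display is not reproduced here, asserts the uniform convergence \(k^{-r} \varrho (k \, \cdot ) \rightarrow (\cdot )_+^r\) on \([-B,B]\), in analogy with the case \(q > r\) treated above. The names `softplus` and `sup_error` and the value \(B = 3\) are hypothetical choices.

```python
import math

def softplus(t: float) -> float:
    """Numerically stable ln(1 + e^t) = max{t, 0} + ln(1 + e^{-|t|})."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def sup_error(k: int, B: float = 3.0, grid: int = 6001) -> float:
    """Grid approximation of the uniform distance on [-B, B] between
    x -> softplus(k x) / k and x -> max{x, 0} (here r = 1)."""
    xs = [-B + 2.0 * B * i / (grid - 1) for i in range(grid)]
    return max(abs(softplus(k * x) / k - max(x, 0.0)) for x in xs)

for k in (1, 10, 100):
    print(k, sup_error(k), math.log(2.0) / k)
```

For the softplus function, the error is largest at \(x = 0\), where it equals \(\ln (2) / k\). This is consistent with a bound of the form \(c \cdot k^{-1}\), and the limit \((x)_+\) is not continuously differentiable even though every rescaled function is smooth.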
D.4. Proof of Corollary 3.4
D.4.1. Proof of Corollary 3.4.(1)
Powers of ReLUs: For \(k \in \mathbb {N}\), let \(\mathrm {ReLU}_k : \mathbb {R}\rightarrow \mathbb {R}, x \mapsto \max \{0, x\}^k\), and note that this is a continuous function. On \(\mathbb {R}\setminus \{0\}\), \(\mathrm {ReLU}_k\) is differentiable with \(\mathrm {ReLU}_k ' = k \cdot \mathrm {ReLU}_{k - 1}\). Furthermore, if \(k \ge 2\), then \(| h^{-1} (\mathrm {ReLU}_k (h) - \mathrm {ReLU}_k (0))| \le |h|^{k-1} \rightarrow 0\) as \(h \rightarrow 0\). Thus, if \(k \ge 2\), then \(\mathrm {ReLU}_k\) is continuously differentiable with derivative \(\mathrm {ReLU}_k ' = k \cdot \mathrm {ReLU}_{k-1}\). Finally, \(\mathrm {ReLU}_1\) is not differentiable at \(x = 0\); iterating the derivative formula therefore shows that \(\mathrm {ReLU}_k^{(k-1)} = k! \cdot \mathrm {ReLU}_1\) is not differentiable at \(x = 0\), so that \(\mathrm {ReLU}_k \notin C^k (\mathbb {R})\). Overall, this shows \(\mathrm {ReLU}_k \in C^1 (\mathbb {R}) \setminus C^\infty (\mathbb {R})\) for all \(k \ge 2\).
The exponential linear unit: We have \(\frac{d^k}{d x^k} (e^x - 1) = e^x\) for all \(k \in \mathbb {N}\). Therefore, the exponential linear unit \(\varrho : \mathbb {R}\rightarrow \mathbb {R}, x \mapsto x {\mathbb {1}}_{[0,\infty )}(x) + (e^x - 1) {\mathbb {1}}_{(-\infty ,0)}(x)\) satisfies for \(k \in \mathbb {N}_0\) that
By standard results in real analysis (see for instance [23, Problem 2 in Chapter VIII.6]), this implies that \(\varrho \in C^1 (\mathbb {R}) \setminus C^2 (\mathbb {R})\).
The softsign function: On \((-1,\infty )\), we have \(\frac{\mathrm{{d}}}{\mathrm{{d}}x} \frac{x}{1 + x} = (1 + x)^{-2}\) and \(\frac{d^2}{d x^2} \frac{x}{1 + x} = -2 (1 + x)^{-3}\). Furthermore, if \(x < 0\), then \(\mathrm {softsign}(x) = \frac{x}{1 + |x|} = - \frac{-x}{1 + (-x)} = - \mathrm {softsign}(-x)\). Therefore, the softsign function is \(C^\infty \) on \(\mathbb {R}\setminus \{0\}\), and satisfies
By standard results in real analysis (see for instance [23, Problem 2 in Chapter VIII.6]), this implies that \(\mathrm {softsign} \in C^1 (\mathbb {R})\). However, since
we have \(\mathrm {softsign} \notin C^2 (\mathbb {R})\).
The inverse square root linear unit: Let
denote the inverse square root linear unit with parameter \(a > 0\), and note \(\varrho |_{\mathbb {R}\setminus \{0\}} \in C^\infty (\mathbb {R}\setminus \{0\})\). As we saw in Eq. (D.6), we have \(\frac{\mathrm{{d}}}{\mathrm{{d}}x} \frac{x}{(1 + a x^2)^{1/2}} = (1 + a x^2)^{-3/2}\), and thus \(\frac{d^2}{d x^2} \frac{x}{(1 + a x^2)^{1/2}} = - 3 a x \cdot (1 + a x^2)^{-5/2}\), and finally \( \frac{d^3}{d x^3} \frac{x}{(1 + a x^2)^{1/2}} = -3 a (1 + a x^2)^{-5/2} + 15 a^2 x^2 (1 + a x^2)^{-7/2} . \) These calculations imply
and
but also
By standard results in real analysis (see for instance [23, Problem 2 in Chapter VIII.6]), this implies that \(\varrho \in C^2 (\mathbb {R}) \setminus C^3 (\mathbb {R})\). \(\square \)
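The derivative formulas for the negative branch of the inverse square root linear unit can be double-checked with finite differences. The Python sketch below uses the closed forms stated above; the parameter value \(a = 0.7\), the step size `h`, and the function names are hypothetical choices made for illustration.

```python
import math

def g(x: float, a: float) -> float:
    """Negative branch x / sqrt(1 + a x^2) of the inverse square root linear unit."""
    return x / math.sqrt(1.0 + a * x * x)

def g1(x: float, a: float) -> float:
    """First derivative (1 + a x^2)^(-3/2), see Eq. (D.6)."""
    return (1.0 + a * x * x) ** -1.5

def g2(x: float, a: float) -> float:
    """Second derivative -3 a x (1 + a x^2)^(-5/2)."""
    return -3.0 * a * x * (1.0 + a * x * x) ** -2.5

def g3(x: float, a: float) -> float:
    """Third derivative -3 a (1 + a x^2)^(-5/2) + 15 a^2 x^2 (1 + a x^2)^(-7/2)."""
    return (-3.0 * a * (1.0 + a * x * x) ** -2.5
            + 15.0 * a * a * x * x * (1.0 + a * x * x) ** -3.5)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Symmetric difference quotient approximating f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

a = 0.7
for x in (-2.0, -0.5, -0.1):
    print(x,
          central_diff(lambda t: g(t, a), x) - g1(x, a),
          central_diff(lambda t: g1(t, a), x) - g2(x, a),
          central_diff(lambda t: g2(t, a), x) - g3(x, a))
```

At \(x = 0\), the closed forms give the values \(1\), \(0\), and \(-3a\). The first two agree with the derivatives \(1\) and \(0\) of the linear branch \(x \mapsto x\), while the third does not agree with \(0\), which matches the conclusion \(\varrho \in C^2 (\mathbb {R}) \setminus C^3 (\mathbb {R})\).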
D.4.2. Proof of Corollary 3.4.(3)
The softplus function: Clearly, \(\mathrm {softplus} \in C^\infty (\mathbb {R}) \subset C^{\max \{1,0\}}(\mathbb {R})\). Furthermore, the softplus function is approximately homogeneous of order (1, 0). Indeed, for \(x \ge 0\), we have
and for \(x \le 0\), we have \(|\ln (1 + e^x) - x^0| \le 1 + \ln (2)\). \(\square \)
D.5. Proof of Proposition 3.5
The set \(\Theta _C\) is closed and bounded in the finite-dimensional normed space \( \big ( \mathcal {NN}(S), \Vert \cdot \Vert _{\mathcal {NN}(S)} \big ) . \) Thus, the Heine-Borel theorem implies the compactness of \(\Theta _C\). By Proposition 4.1 (which will be proved independently), the map
is continuous. As a consequence, the set \(\mathrm {R}_{\varrho }^\Omega (\Theta _C)\) is compact in \(C(\Omega )\). Because of the compactness of \(\Omega \), \(C(\Omega )\) is continuously embedded into \(L^p(\mu )\) for every \(p \in (0,\infty )\) and any finite Borel measure \(\mu \) on \(\Omega \). This implies that the set \(\mathrm {R}_\varrho ^\Omega (\Theta _C)\) is compact in \(L^p(\mu )\) as well. \(\square \)
D.6. Proof of Proposition 3.6
With \((\Phi _N)_{N \in \mathbb {N}}\) as in the statement of Proposition 3.6, we want to show that \(\Vert \Phi _N \Vert _{\mathrm {total}} \rightarrow \infty \) in probability. By definition, this means that for each fixed \(C > 0\), and letting \(\Omega _N\) denote the event where \(\Vert \Phi _N \Vert _{\mathrm {total}} \ge C\), we want to show that \({\mathbb {P}}(\Omega _N) \rightarrow 1\) as \(N \rightarrow \infty \). For brevity, let us write \({\mathcal {R}}^Z {:}{=}\mathrm {R}_{\varrho }^{\Omega }(\Theta _{Z})\), with \(\Theta _{Z}\), \(Z > 0\), as in Proposition 3.5.
By compactness of \({\mathcal {R}}^C\), we can choose \(g \in {\mathcal {R}}^C\) satisfying
Define \(M := \inf _{h \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)} \Vert f_\sigma - h \Vert _{L^2(\sigma _\Omega )}^2\). Since by assumption the infimum defining M is not attained, we have \(\Vert f_\sigma - g \Vert _{L^2(\sigma _\Omega )}^2 > M\), so that there are \(h \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) and \(\delta > 0\) with \(\Vert f_\sigma - g \Vert _{L^2(\sigma _\Omega )}^2 \ge 2 \delta + \Vert f_\sigma - h \Vert _{L^2(\sigma _\Omega )}^2\). Let \(C' > 0\) with \(h \in {\mathcal {R}}^{C'}\); by enlarging \(C'\) if necessary, we can assume \(C' \ge C\). For \(N \in \mathbb {N}\) and \(\varepsilon > 0\), let us denote by \(\Omega _{N,\varepsilon }^{(1)}\) the event where \( \sup _{f \in {\mathcal {R}}^{C'}} |{\mathcal {E}}_\sigma (f) - E_N (f)| > \varepsilon . \) Since \({\mathcal {R}}^{C'}\) is compact, [20, Theorem B] shows that \({\mathbb {P}}(\Omega _{N,\varepsilon }^{(1)}) \xrightarrow [N \rightarrow \infty ]{} 0\) for each fixed \(\varepsilon > 0\). Similarly, denoting by \(\Omega _{N,\varepsilon }^{(2)}\) the event where \(E_N (\mathrm {R}_\varrho ^\Omega (\Phi _N)) - \inf _{f \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)} E_N (f) > \varepsilon \), we have by assumption (3.3) that \({\mathbb {P}}(\Omega _{N,\varepsilon }^{(2)}) \xrightarrow [N \rightarrow \infty ]{} 0\), for each fixed \(\varepsilon > 0\).
We now claim that \(\Omega _N^c \subset \Omega _{N, \delta /3}^{(1)} \cup \Omega _{N, \delta /3}^{(2)}\). Once we prove this, we get
and hence \({\mathbb {P}}(\Omega _N) \rightarrow 1\), as desired.
To prove \(\Omega _N^c \subset \Omega _{N, \delta /3}^{(1)} \cup \Omega _{N, \delta /3}^{(2)}\), assume toward a contradiction that there exists a training sample \({ \omega {:}{=}\big ( (x_i ,y_i) \big )_{i \in \mathbb {N}} \in \Omega _N^c \setminus \big ( \Omega _{N, \delta /3}^{(1)} \cup \Omega _{N, \delta /3}^{(2)} \big ) . }\) Thus, \(\Vert \Phi _N \Vert _{\mathrm {total}} < C\), meaning \({f_N := \mathrm {R}_\varrho ^\Omega (\Phi _N) \in {\mathcal {R}}^C \subset {\mathcal {R}}^{C'}}\). Using the decomposition of the expected loss from Eq. (3.2), we thus see
By rearranging and recalling the choice of h and \(\delta \), we finally see
which is the desired contradiction. \(\square \)
D.7. Proof of Proposition 3.7
The main ingredient of the proof will be to show that one can replace a given sequence of networks with C-bounded scaling weights by another sequence with C-bounded scaling weights that also has bounded biases. Then one can apply Proposition 3.5.
Lemma D.2
Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(C > 0\) and let \(\Omega \subset \mathbb {R}^d\) be measurable and bounded. Let \(\mu \) be a finite Borel measure on \(\Omega \) with \(\mu (\Omega ) > 0\). Finally, let \({\varrho : \mathbb {R}\rightarrow \mathbb {R}, ~x \mapsto \max \{0,x\}}\) denote the ReLU activation function.
Let \((\Phi _n)_{n \in \mathbb {N}}\) be a sequence of networks in \(\mathcal {NN}(S)\) with C-bounded scaling weights and such that there exists some \(M > 0\) with \({\Vert \mathrm {R}_\varrho ^{\Omega } (\Phi _n) \Vert _{L^1 (\mu )} \le M}\) for all \(n \in \mathbb {N}\).
Then, there is an infinite set \(I \subset \mathbb {N}\) and a family of networks \((\Psi _n)_{n \in I} \subset \mathcal {NN}(S)\) with C-bounded scaling weights which satisfies \(\mathrm {R}_\varrho ^\Omega (\Phi _n) = \mathrm {R}_\varrho ^\Omega (\Psi _n)\) for \(n \in I\) and such that \(\Vert \Psi _n\Vert _{\mathrm {total}} \le C'\) for all \(n \in I\) and a suitable constant \(C' > 0\).
Proof
Set \(N_0 := d\). Since \(\Omega \) is bounded, there is some \(R > 0\) with \(\Vert x\Vert _{\ell ^\infty } \le R\) for all \(x \in \Omega \). In the following, we will use without further comment the estimate \(\Vert A x\Vert _{\ell ^\infty } \le k \cdot \Vert A\Vert _{\max } \cdot \Vert x\Vert _{\ell ^\infty }\) which is valid for \(A \in \mathbb {R}^{n \times k}\) and \(x \in \mathbb {R}^k\).
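The estimate \(\Vert A x\Vert _{\ell ^\infty } \le k \cdot \Vert A\Vert _{\max } \cdot \Vert x\Vert _{\ell ^\infty }\) follows from bounding each of the \(k\) summands in every row of \(A x\) by \(\Vert A\Vert _{\max } \Vert x\Vert _{\ell ^\infty }\). The Python sketch below checks it on random data; the helper names, the sampling ranges, and the seed are hypothetical choices.

```python
import random

def max_norm(A):
    """Largest absolute entry of the matrix A (given as a list of rows)."""
    return max(abs(v) for row in A for v in row)

def sup_norm(x):
    """The l^infinity norm of the vector x."""
    return max(abs(v) for v in x)

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def check_estimate(n: int, k: int, trials: int = 200, seed: int = 0) -> bool:
    """Test ||A x||_inf <= k * ||A||_max * ||x||_inf for random A in R^{n x k}."""
    rng = random.Random(seed)
    for _ in range(trials):
        A = [[rng.uniform(-5.0, 5.0) for _ in range(k)] for _ in range(n)]
        x = [rng.uniform(-3.0, 3.0) for _ in range(k)]
        if sup_norm(matvec(A, x)) > k * max_norm(A) * sup_norm(x) + 1e-12:
            return False
    return True

print(check_estimate(4, 3), check_estimate(1, 7))
```

The factor \(k\) cannot be dropped in general: for the matrix with all entries equal to \(1\) and the vector with all entries equal to \(1\), both sides equal \(k\).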
Below, we will show by induction that for all \(m\in \left\{ 0,\dots ,L-1\right\} \), there is an infinite subset \(I_{m} \subset \mathbb {N}\), and a family of networks \(\big ( \Psi ^{(m)}_n \big )_{n \in I_m} \subset \mathcal {NN}(S)\) of the form
with the following properties:
(A) We have \( \mathrm {R}_\varrho ^\Omega \big ( \Psi _n^{(m)} \big ) = \mathrm {R}_\varrho ^{\Omega } (\Phi _n) \) for all \(n \in I_m\);
(B) each network \(\Psi _n^{(m)}\), \(n \in I_m\), has C-bounded scaling weights;
(C) there is a constant \(C_m > 0\) with \(\big \Vert c_\ell ^{(n,m)} \big \Vert _{\ell ^\infty } \le C_m\) for all \(n \in I_m\) and all \(\ell \in \{1,\dots ,m\}\).
Once this is shown, we set \(I {:}{=}I_{L-1}\) and \(\Psi _n {:}{=}\Psi _n^{(L-1)}\) for \(n \in I\). Clearly, \(\Psi _n\) has C-bounded scaling weights and satisfies \(\mathrm {R}_\varrho ^\Omega (\Psi _n) = \mathrm {R}_\varrho ^\Omega (\Phi _n)\), so that it remains to show \(\Vert \Psi _n \Vert _{\mathrm {total}} \le C'\), for which it suffices to show \(\Vert c_L^{(n,L-1)} \Vert _{\ell ^\infty } \le C''\) for some \(C'' > 0\) and all \(n \in I\), since we have \(\Vert c_\ell ^{(n,L-1)} \Vert _{\ell ^\infty } \le C_{L-1}\) for all \(\ell \in \{1,\dots ,L-1\}\).
Now, note for \(\ell \in \{1,\dots ,L-1\}\) and \(x \in \mathbb {R}^{N_{\ell -1}}\) that \(T_\ell ^{(n,L-1)}(x) {:}{=}B_\ell ^{(n,L-1)}x + c_\ell ^{(n,L-1)}\) satisfies
Since \(\Omega \) is bounded, and since \(| \varrho (x) | \le |x|\) for all \(x \in \mathbb {R}\), there is thus a constant \(C_{L-1} ' > 0\) such that if we set
then \(\Vert \beta ^{(n)} (x)\Vert _{\ell ^\infty } \le C_{L-1} '\) for all \(x \in \Omega \) and all \(n \in I\).
For arbitrary \(i \in \{1,\dots ,N_L\}\) and \(x \in \Omega \), this implies
Since by assumption \(\Vert \mathrm {R}^\Omega _\varrho (\Phi _n) \Vert _{L^1(\mu )} \le M\) and \(\mu (\Omega ) > 0\), we see that \(\big ( c_L^{(n,L-1)} \big )_{n \in I}\) must be a bounded sequence.
Thus, it remains to construct the networks \(\Psi _n^{(m)}\) for \(n \in I_m\) (and the sets \(I_m\)) for \(m \in \{0,\dots ,L-1\}\) with the properties (A)–(C) from above.
For the start of the induction (\(m=0\)), we can simply take \(I_0 {:}{=}\mathbb {N}\), \(\Psi _n^{(0)} {:}{=}\Phi _n\), and \(C_0 > 0\) arbitrary, since condition (C) is void in this case.
Now, assume that a family of networks \((\Psi _n^{(m)})_{n \in I_m}\) as in Eq. (D.11) with an infinite subset \(I_{m} \subset \mathbb {N}\) and satisfying conditions (A)–(C) has been constructed for some \(m \in \left\{ 0,\dots ,L-2\right\} \). In particular, \(L \ge 2\).
For brevity, set \(T_{\ell }^{(n)} : \mathbb {R}^{N_{\ell -1}} \rightarrow \mathbb {R}^{N_{\ell }}, ~x\mapsto B_{\ell }^{(n,m)}x + c^{(n,m)}_\ell \) for \(\ell \in \{1,\dots ,L\}\), and \(\varrho _{L} {:}{=}\mathrm {id}_{\mathbb {R}^{N_{L}}}\), and let \(\varrho _{\ell } {:}{=}\varrho \times \cdots \times \varrho \) denote the \(N_\ell \)-fold Cartesian product of \(\varrho \) for \(\ell \in \{1,\dots ,L-1\}\). Furthermore, let us define \(\beta _n {:}{=}\varrho _{m}\circ T_{m}^{(n)} \circ \cdots \circ \varrho _{1} \circ T_{1}^{(n)} : \mathbb {R}^{d} \rightarrow \mathbb {R}^{N_{m}}\). Note \(\Vert \varrho _{\ell } (x) \Vert _{\ell ^\infty } \le \Vert x \Vert _{\ell ^\infty }\) for all \(x\in \mathbb {R}^{N_{\ell }}\). Additionally, observe for \(n \in I_{m}\), \(\ell \in \{1,\dots ,m\}\) and \(x \in \mathbb {R}^{N_{\ell -1}}\) that
Combining these observations, and recalling that \(\Omega \) is bounded, we easily see that there is some \(R' > 0\) with \(\Vert \beta _n (x) \Vert _{\ell ^\infty } \le R'\) for all \(x \in \Omega \) and \(n \in I_{m}\).
Next, since \(\left( c_{m+1}^{(n,m)} \right) _{n \in I_{m}}\) is an infinite family in \(\mathbb {R}^{N_{m+1}} \subset [-\infty , \infty ]^{N_{m+1}}\), we can find (by compactness) an infinite subset \(I_{m}^{(0)} \subset I_{m}\) such that \(c_{m+1}^{(n,m)}\rightarrow c_{m+1} \in [-\infty , \infty ]^{N_{m+1}}\) as \(n \rightarrow \infty \) in the set \(I_{m}^{(0)}\).
Our goal is to construct vectors \(d^{(n)},e^{(n)} \in \mathbb {R}^{N_{m+1}}\), matrices \(C^{(n)} \in \mathbb {R}^{N_{m+1} \times N_{m}}\), and an infinite subset \(I_{m+1} \subset I_{m}^{(0)}\) such that \(\Vert C^{(n)}\Vert _{\max } \le C\) for all \(n \in I_{m+1}\), such that \(\left( d^{(n)}\right) _{n\in I_{m+1}}\) is a bounded family, and such that we have
for all \(n \in I_{m+1}\).
Once \(d^{(n)},e^{(n)},C^{(n)}\) are constructed, we can choose \(\Psi _n^{(m+1)}\) as in Eq. (D.11), where we define
and
for \(\ell \in \{1,\dots ,L\} \setminus \left\{ m+1,m+2 \right\} \), and finally
and
for \(n\in I_{m+1}\). Indeed, these choices clearly ensure \(\big \Vert B_\ell ^{(n,m+1)} \big \Vert _{\max } \le C\) for all \(\ell \in \{1,\dots ,L\}\), as well as \(\big \Vert c_\ell ^{(n,m+1)}\big \Vert _{\ell ^\infty } \le C_{m+1}\) for all \(\ell \in \{1,\dots ,m+1\}\) and \(n \in I_{m+1}\), for a suitable constant \(C_{m+1} > 0\).
Finally, since \(\Vert \beta _{n}(x) \Vert _{\ell ^\infty } \le R'\) for all \(x \in \Omega \) and \(n \in I_m\), Eq. (D.12) implies
for all \(x\in \Omega \) and \(n\in I_{m+1}\). By recalling the definition of \(\beta _n\), and by noting that \(B_{\ell }^{(n,m+1)},c_{\ell }^{(n,m+1)}\) are identical to \(B_{\ell }^{(n,m)},c_{\ell }^{(n,m)}\) for \(\ell \in \{1,\dots ,L\}\setminus \left\{ m+1,m+2\right\} \), this easily yields
Thus, it remains to construct \(d^{(n)}, e^{(n)}, C^{(n)}\) for \(n \in I_{m+1}\) (and the set \(I_{m+1}\) itself) as described around Eq. (D.12). To this end, for \(n \in I_{m}^{(0)}\) and \(k \in \{1,\dots ,N_{m+1}\}\), define
and
as well as
To see that these choices indeed fulfill the conditions outlined around Eq. (D.12) for a suitable choice of \(I_{m+1}\subset I_{m}^{\left( 0\right) }\), first note that \(\left( d^{(n)}\right) _{n\in I_{m}^{\left( 0\right) }}\) is indeed a bounded family. Furthermore, \(\big | C_{k,i}^{(n)} \big | \le \big | (B_{m+1}^{(n,m)})_{k,i} \big |\) for all \(k \in \{1,\dots ,N_{m+1}\}\) and \(i \in \{1,\dots ,N_m\}\), which easily implies \(\Vert C^{(n)} \Vert _{\max } \le \Vert B_{m+1}^{(n,m)} \Vert _{\max } \le C\) for all \(n \in I_m^{(0)}\). Thus, it remains to verify Eq. (D.12) itself. But the estimate \(\Vert B_{m+1}^{(n,m)} \Vert _{\max } \le C\) also implies
for all \(k \in \underline{N_{m+1}}\) and all \(x \in \mathbb {R}^{N_{m}}\) with \(\Vert x \Vert _{\ell ^\infty } \le R' .\) As a final preparation, note that \(\varrho _{m+1} = \varrho \times \cdots \times \varrho \) is a Cartesian product of ReLU functions, since \(m \le L - 2\). Now, for \(k \in \{1,\dots ,N_{m+1}\}\) there are three cases:
Case 1: We have \((c_{m+1})_{k} = \infty \). Thus, there is some \(n_{k} \in \mathbb {N}\) such that \(\big ( c_{m+1}^{(n,m)} \big )_{k} \ge R' \cdot N_m C\) for all \(n \in I_{m}^{(0)}\) with \(n \ge n_{k}\). In view of Eq. (D.13), this implies \( \big ( T_{m+1}^{(n)} (x) \big )_{k} = \big ( B_{m+1}^{(n,m)} x + c_{m+1}^{(n,m)} \big )_{k} \ge 0 , \) and hence
where the last step used our choice of \(d^{(n)},e^{(n)},C^{(n)}\), and the fact that \(\left( C^{(n)}x+d^{(n)}\right) _{k}\ge 0\) by Eq. (D.13).
Case 2: We have \((c_{m+1})_{k} = -\infty \). This implies that there is some \(n_{k} \in \mathbb {N}\) with \(\big ( c_{m+1}^{(n,m)} \big )_{k} \le -R' \cdot N_m C\) for all \(n \in I_{m}^{(0)}\) with \(n \ge n_{k}\). Because of Eq. (D.13), this yields \( \big ( T_{m+1}^{(n)} (x) \big )_{k} = \big ( B_{m+1}^{(n,m)} x + c_{m+1}^{(n,m)} \big )_{k} \le 0 \), and hence
where the last step used our choice of \(d^{(n)}, e^{(n)}, C^{(n)}\).
Case 3: We have \((c_{m+1})_{k} \in \mathbb {R}\). In this case, set \(n_{k} {:}{=}1\), and note by our choice of \(d^{(n)}, e^{(n)}, C^{(n)}\) for \(n \in I_{m}^{(0)}\) with \(n \ge 1 = n_{k}\) that
Overall, we have thus shown that Eq. (D.12) is satisfied for all \(n \in I_{m+1}\), where
is clearly an infinite set, since \(I_{m}^{\left( 0\right) }\) is. \(\square \)
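The three cases can be illustrated for a single ReLU neuron. The displayed definitions of \(d^{(n)}, e^{(n)}, C^{(n)}\) are not reproduced here, so the Python sketch below is not the construction from the proof. It is a simplified variant under stated assumptions: it treats one fixed network instead of a subsequence, it decides the case by comparing the bias with the threshold \(R' \cdot N_m C\) instead of using the limit \((c_{m+1})_k\), and it assumes that Eq. (D.12) has the form \(\varrho (\langle b, x \rangle + c) = \varrho (\langle b_{\mathrm {new}}, x \rangle + d) + e\) for \(\Vert x \Vert _{\ell ^\infty } \le R'\). The name `normalize_neuron` is hypothetical.

```python
import random

def relu(t: float) -> float:
    return max(0.0, t)

def neuron(b, c, x) -> float:
    """Single ReLU neuron x -> relu(<b, x> + c)."""
    return relu(sum(bi * xi for bi, xi in zip(b, x)) + c)

def normalize_neuron(b, c, R, C):
    """Return (b_new, d, e) with relu(<b,x> + c) == relu(<b_new,x> + d) + e
    for all x with ||x||_inf <= R, assuming |b_i| <= C for all i.
    The new bias d always satisfies |d| <= R * len(b) * C."""
    t = R * len(b) * C             # |<b, x>| <= t on the region of interest
    if c >= t:                     # neuron always active, hence affine there:
        return list(b), t, c - t   # keep bias t, move the excess into e
    if c <= -t:                    # neuron never active, hence identically zero
        return [0.0] * len(b), 0.0, 0.0
    return list(b), c, 0.0         # bias already bounded: nothing to do

rng = random.Random(1)
R, C, N = 2.0, 1.5, 3
b = [rng.uniform(-C, C) for _ in range(N)]
for c in (1e6, -1e6, 0.3):
    b_new, d, e = normalize_neuron(b, c, R, C)
    x = [rng.uniform(-R, R) for _ in range(N)]
    print(c, neuron(b, c, x), neuron(b_new, d, x) + e)
```

In a full network, the constant offset `e` would presumably be absorbed into the bias of the following layer, which is the role that the modified layer \(m+2\) plays in the proof.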
Using Lemma D.2, we can now easily show that the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) is closed in \(L^p (\mu ; \mathbb {R}^{N_L})\) and in \(C(\Omega ; \mathbb {R}^{N_L})\): Let \({\mathcal {Y}}\) denote either \(L^p (\mu ;\mathbb {R}^{N_L})\) for some \(p \in [1, \infty ]\) and some finite Borel measure \(\mu \) on \(\Omega \), or \(C(\Omega ;\mathbb {R}^{N_L})\), where we assume in the latter case that \(\Omega \) is compact and set \(\mu = \delta _{x_0}\) for a fixed \(x_0 \in \Omega \). Note that we can assume \(\mu (\Omega ) > 0\), since otherwise the claim is trivial. Let \((f_n)_{n \in \mathbb {N}}\) be a sequence in \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) which satisfies \(f_n \rightarrow f\) for some \(f \in {\mathcal {Y}}\), with convergence in \({\mathcal {Y}}\). Thus, \(f_n = \mathrm {R}_\varrho ^\Omega (\Phi _n)\) for a suitable sequence \((\Phi _n)_{n \in \mathbb {N}}\) in \(\mathcal {NN}(S)\) with C-bounded scaling weights.
Since \((f_n)_{n \in \mathbb {N}} = \big ( \mathrm {R}_\varrho ^\Omega (\Phi _n) \big )_{n \in \mathbb {N}}\) is convergent in \({\mathcal {Y}}\), it is also bounded in \({\mathcal {Y}}\). But since \(\Omega \) is bounded and \(\mu \) is a finite measure, it is not hard to see \({\mathcal {Y}} \hookrightarrow L^1 (\mu )\), so that we get \(\Vert \mathrm {R}_\varrho ^\Omega (\Phi _n) \Vert _{L^1(\mu )} \le M\) for all \(n \in \mathbb {N}\) and a suitable constant \(M > 0\).
Therefore, Lemma D.2 yields an infinite set \(I \subset \mathbb {N}\) and networks \((\Psi _n)_{n \in I} \subset \mathcal {NN}(S)\) with C-bounded scaling weights such that \(f_n = \mathrm {R}_\varrho ^\Omega (\Psi _n)\) and \(\Vert \Psi _n \Vert _{\mathrm {total}} \le C'\) for all \(n \in I\) and a suitable \(C' > 0\).
Hence, \((\Psi _n)_{n \in I}\) is a bounded, infinite family in the finite dimensional vector space \(\mathcal {NN}(S)\). Thus, there is a further infinite set \(I_1 \subset I\) such that \(\Psi _n \rightarrow \Psi \in \mathcal {NN}(S)\) as \(n \rightarrow \infty \) in \(I_1\).
But since \(\Omega \) is bounded, say \(\Omega \subset [-R,R]^d\), the realization map
is continuous (even locally Lipschitz continuous); see Proposition 4.1, which will be proved independently. Hence, \(\mathrm {R}_\varrho ^{[-R,R]^d} (\Psi _n) \rightarrow \mathrm {R}_\varrho ^{[-R,R]^d}(\Psi )\) as \(n \rightarrow \infty \) in \(I_1\), with uniform convergence. This easily implies \(f_n = \mathrm {R}_\varrho ^{\Omega }(\Psi _n) \rightarrow \mathrm {R}_\varrho ^{\Omega }(\Psi )\), with convergence in \({\mathcal {Y}}\) as \(n \rightarrow \infty \) in \(I_1\). Hence, \(f = \mathrm {R}^\Omega _\varrho (\Psi ) \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\). \(\square \)
D.8. Proof of Theorem 3.8
For the proof of Theorem 3.8, we will use a careful analysis of the singularity hyperplanes of functions of the form \(x \mapsto \varrho _a (\langle \alpha , x \rangle + \beta )\), that is, the hyperplane on which this function is not differentiable. To simplify this analysis, we first introduce a convenient terminology and collect quite a few auxiliary results.
Definition D.3
For \(\alpha , {\widetilde{\alpha }} \in S^{d-1}\) and \(\beta , {\widetilde{\beta }} \in \mathbb {R}\), we write \((\alpha ,\beta ) \sim ({\widetilde{\alpha }}, {\widetilde{\beta }})\) if there is an \(\varepsilon \in \{\pm 1\}\) such that \((\alpha , \beta ) = \varepsilon \cdot ({\widetilde{\alpha }}, {\widetilde{\beta }})\).
Furthermore, for \(a \ge 0\) and with \(\varrho _a : \mathbb {R}\rightarrow \mathbb {R}, x \mapsto \max \{x, ax \}\) denoting the parametric ReLU, we set
Moreover, we define
and finally
![](http://media.springernature.com/lw424/springer-static/image/art%3A10.1007%2Fs10208-020-09461-0/MediaObjects/10208_2020_9461_Equ217_HTML.png)
for \(\epsilon >0\).
Lemma D.3
Let \((\alpha , \beta ) \in S^{d-1} \times \mathbb {R}\) and \(x_0 \in S_{\alpha ,\beta }\). Also, let \((\alpha _1, \beta _1), \dots , (\alpha _N, \beta _N) \in S^{d-1} \times \mathbb {R}\) with \((\alpha _\ell , \beta _\ell ) \not \sim (\alpha , \beta )\) for all \(\ell \in {\underline{N}}\). Then, there exists \(z \in \mathbb {R}^d\) satisfying
Proof
By discarding those \((\alpha _j, \beta _j)\) for which \(x_0 \notin S_{\alpha _j, \beta _j}\), we can assume that \(x_0 \in S_{\alpha _j, \beta _j}\) for all \(j \in {\underline{N}}\).
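Assume toward a contradiction that the claim of the lemma is false; that is,
\[ \alpha ^\perp = \bigcup _{j=1}^N \{ z \in \alpha ^\perp \,:\, \langle z, \alpha _j \rangle = 0 \} , \qquad \text {(D.14)} \]
where \(\alpha ^\perp := \{z \in \mathbb {R}^d \,:\, \langle z,\alpha \rangle = 0\}\). Since \(\alpha ^\perp \) is a closed subset of \(\mathbb {R}^d\) and thus a complete metric space, and since the right-hand side of (D.14) is a countable (in fact, finite) union of closed sets, the Baire category theorem (see [26, Theorem 5.9]) shows that there are \(j \in {\underline{N}}\) and \(\varepsilon > 0\) such that
\[ V := \{ z \in \alpha ^\perp \,:\, \langle z, \alpha _j \rangle = 0 \} \supset B_\varepsilon (x) \cap \alpha ^\perp \quad \text {for some } x \in V . \]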
But since V is a vector space, this easily implies \(V = \alpha ^\perp \), that is, \(\langle z, \alpha _j \rangle = 0\) for all \(z \in \alpha ^\perp \). In other words, \(\alpha ^\perp \subset \alpha _j^\perp \), and then \(\alpha ^\perp = \alpha _j^\perp \) by a dimension argument, since \(\alpha , \alpha _j \ne 0\).
Hence, \({{\,\mathrm{span}\,}}\alpha = (\alpha ^\perp )^\perp = (\alpha _j^\perp )^\perp = {{\,\mathrm{span}\,}}\alpha _j\). Because of \(|\alpha | = |\alpha _j| = 1\), we thus see \(\alpha = \varepsilon \, \alpha _j\) for some \(\varepsilon \in \{\pm 1\}\). Finally, since \(x_0 \in S_{\alpha ,\beta } \cap S_{\alpha _j, \beta _j}\), we see \( \beta = - \langle \alpha , x_0 \rangle = - \varepsilon \langle \alpha _j, x_0 \rangle = \varepsilon \beta _j , \) and thus \((\alpha , \beta ) = \varepsilon (\alpha _j, \beta _j)\), in contradiction to \((\alpha , \beta ) \not \sim (\alpha _j, \beta _j)\). \(\square \)
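The Baire category argument above is non-constructive, but the direction \(z\) of Lemma D.3 is easy to find in practice: a generic element of \(\alpha ^\perp \) works. The following is a minimal sketch under our own assumptions (the helper name `find_direction`, the sample vectors, and the tolerance are illustrative and not part of the proof). It projects a random vector onto \(\alpha ^\perp \) and retries until no inner product with the remaining normals vanishes.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def find_direction(alpha, others, rng, tol=1e-9, max_tries=100):
    """Return z with <z, alpha> = 0 and <z, a> != 0 for every a in `others`.

    Assumes that alpha is a unit vector and that no a in `others`
    equals +alpha or -alpha (hypothetical helper, for illustration only).
    """
    for _ in range(max_tries):
        v = [rng.gauss(0.0, 1.0) for _ in alpha]
        s = dot(v, alpha)
        # Orthogonal projection of v onto the hyperplane alpha^perp.
        z = [vi - s * ai for vi, ai in zip(v, alpha)]
        if all(abs(dot(z, a)) > tol for a in others):
            return z
    raise RuntimeError("no suitable direction found")


alpha = [1.0, 0.0, 0.0]
others = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]
z = find_direction(alpha, others, random.Random(0))
```

The retry loop mirrors the content of the lemma: the bad directions form a finite union of proper subspaces of \(\alpha ^\perp \), so a randomly drawn direction avoids them with probability one.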
Lemma D.4
Let \((\alpha ,\beta ) \in S^{d-1} \times \mathbb {R}\) and \((\alpha _1, \beta _1), \dots , (\alpha _N, \beta _N) \in S^{d-1} \times \mathbb {R}\) with \((\alpha _i, \beta _i) \not \sim (\alpha , \beta )\) for all \(i \in {\underline{N}}\). Furthermore, let \(U \subset \mathbb {R}^d\) be open with \(S_{\alpha , \beta } \cap U \ne \varnothing \).
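Then, there is \(\varepsilon > 0\) satisfying
\[ U \cap S_{\alpha , \beta } \cap \bigcap _{j=1}^N U_{\alpha _j, \beta _j}^{(\varepsilon )} \ne \varnothing . \]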
Proof
By assumption, there exists \(x_0 \in U \cap S_{\alpha , \beta }\). Next, Lemma D.3 yields \(z \in \mathbb {R}^d\) such that \(\langle z, \alpha \rangle = 0\) and \(\langle z, \alpha _j \rangle \ne 0\) for all \(j \in {\underline{N}}\) with \(x_0 \in S_{\alpha _j, \beta _j}\). Note that this implies \(\langle \alpha , x_0 + tz \rangle + \beta = \langle \alpha , x_0 \rangle + \beta = 0\) and hence \(x_0 + tz \in S_{\alpha , \beta }\) for all \(t \in \mathbb {R}\).
Next, let \(J {:}{=}\{j \in {\underline{N}} \,:\, x_0 \notin S_{\alpha _j, \beta _j} \}\), so that \(\langle \alpha _j, x_0 \rangle + \beta _j \ne 0\) for all \(j \in J\). Thus, there are \(\varepsilon _1, \delta > 0\) with \(|\langle \alpha _j, x_0 + t z\rangle + \beta _j| \ge \varepsilon _1\) (that is, \(x_0 + t z \in U_{\alpha _j, \beta _j}^{(\varepsilon _1)}\)) for all \(t \in \mathbb {R}\) with \(|t| \le \delta \) and all \(j \in J\). Since U is open with \(x_0 \in U\), we can shrink \(\delta \) so that \(x_0 + t z \in U\) for all \(|t| \le \delta \). Let \(t {:}{=}\delta \).
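We claim that there is some \(\varepsilon > 0\) such that \(x {:}{=}x_0 + t z \in U \cap S_{\alpha , \beta } \cap \bigcap _{j=1}^N U_{\alpha _j, \beta _j}^{(\varepsilon )}\). To see this, note for \(j \in {\underline{N}} \setminus J\) that \(x_0 \in S_{\alpha _j, \beta _j}\), and hence
\[ |\langle \alpha _j, x_0 + t z \rangle + \beta _j| = |t| \cdot |\langle \alpha _j, z \rangle | \ge \delta \cdot \min _{i \in {\underline{N}} \setminus J} |\langle \alpha _i, z \rangle | =: \varepsilon _2 > 0 , \]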
since \(\langle z, \alpha _j \rangle \ne 0\) for all \(j \in {\underline{N}} \setminus J\), by choice of z. By combining all our observations, we see that \({ x_0 + t z \in U \cap S_{\alpha , \beta } \cap \bigcap _{j=1}^N U_{\alpha _j, \beta _j}^{(\varepsilon )} }\) for \(\varepsilon {:}{=}\min \{ \varepsilon _1, \varepsilon _2 \} > 0\). \(\square \)
Lemma D.5
If \(0 \le a < 1\) and \((\alpha , \beta ) \in S^{d-1} \times \mathbb {R}\), then \(h_{\alpha ,\beta }^{(a)}\) is not differentiable at any \(x_0 \in S_{\alpha ,\beta }\).
Proof
Assume toward a contradiction that \(h_{\alpha ,\beta }^{(a)}\) is differentiable at some \(x_0 \in S_{\alpha ,\beta }\). Then, the function \( f : \mathbb {R}\rightarrow \mathbb {R}, t \mapsto h_{\alpha ,\beta }^{(a)} (x_0 + t \alpha ) \) is differentiable at \(t = 0\). But since \(x_0 \in S_{\alpha ,\beta }\) and \(\Vert \alpha \Vert _{\ell ^2} = 1\), we have
\[ f(t) = \varrho _a \big ( \langle \alpha , x_0 + t \alpha \rangle + \beta \big ) = \varrho _a (t) \]
for all \(t \in \mathbb {R}\). This easily shows that f is not differentiable at \(t = 0\), since the right-sided derivative is 1, while the left-sided derivative is \(a \ne 1\). This is the desired contradiction. \(\square \)
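The kink used in this proof can be observed numerically. The sketch below (our own illustration, with an arbitrarily chosen step size `h` and slope `a = 0.25`) evaluates the one-sided difference quotients of the parametric ReLU at \(t = 0\); for \(0 \le a < 1\) they are \(1\) from the right and \(a\) from the left.

```python
def prelu(t, a):
    """Parametric ReLU: max{t, a t}."""
    return max(t, a * t)


def one_sided_quotients(a, h=1e-6):
    """Right- and left-sided difference quotients of prelu(., a) at t = 0."""
    right = (prelu(h, a) - prelu(0.0, a)) / h
    left = (prelu(-h, a) - prelu(0.0, a)) / (-h)
    return right, left


right, left = one_sided_quotients(0.25)
```

Since the parametric ReLU is piecewise linear, the quotients do not depend on `h`, up to rounding.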
Lemma D.6
Let \(0 \le a < 1\), and let \((\alpha _1, \beta _1), \dots , (\alpha _N, \beta _N) \in S^{d-1} \times \mathbb {R}\) with \((\alpha _i, \beta _i) \not \sim (\alpha _j, \beta _j)\) for \(j \ne i\). Furthermore, let \(U \subset \mathbb {R}^d\) be open with \(U \cap S_{\alpha _i, \beta _i} \ne \varnothing \) for all \(i \in {\underline{N}}\). Finally, set \(h_i {:}{=}h_{\alpha _i, \beta _i}^{(a)}|_{U}\) for \(i \in {\underline{N}}\) with \(h_{\alpha _i, \beta _i}^{(a)}\) as in Definition D.3, and let \(h_{N+1} : U \rightarrow \mathbb {R}, x \mapsto 1\).
Then, the family \((h_i)_{i = 1,\dots ,N+1}\) is linearly independent.
Proof
Assume toward a contradiction that \(0 = \sum _{i=1}^{N+1} \gamma _i \, h_i\) for certain \(\gamma _1, \dots , \gamma _{N+1} \in \mathbb {R}\) with \(\gamma _\ell \ne 0\) for some \(\ell \in \underline{N+1}\). Note that if we had \(\gamma _i = 0\) for all \(i \in {\underline{N}}\), we would get \(0 = \gamma _{N+1} \, h_{N+1} \equiv \gamma _{N+1}\), and thus \(\gamma _i = 0\) for all \(i \in \underline{N+1}\), a contradiction. Hence, there is some \(j \in {\underline{N}}\) with \(\gamma _j \ne 0\).
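By Lemma D.4 there is some \(\varepsilon > 0\) such that there exists
\[ x_0 \in U \cap S_{\alpha _j, \beta _j} \cap \bigcap _{i \in {\underline{N}} \setminus \{j\}} U_{\alpha _i, \beta _i}^{(\varepsilon )} . \]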
Therefore, \(x_0 \in U \cap S_{\alpha _j, \beta _j} \cap V\) for the open set \( V {:}{=}\bigcap _{i \in {\underline{N}} \setminus \{j\}} \big ( \mathbb {R}^d \setminus S_{\alpha _i, \beta _i} \big ) . \)
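Because of \(x_0 \in U \cap S_{\alpha _j, \beta _j}\), Lemma D.5 shows that \(h_{\alpha _j, \beta _j}^{(a)} |_U\) is not differentiable at \(x_0\). On the other hand, we have
\[ h_{\alpha _j, \beta _j}^{(a)} |_U = h_j = - \gamma _j^{-1} \cdot \Big ( \gamma _{N+1} \, h_{N+1} + \sum _{i \in {\underline{N}} \setminus \{j\}} \gamma _i \, h_i \Big ) , \]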
where the right-hand side is differentiable at \(x_0\), since each summand is easily seen to be differentiable on the open set V, with \(x_0 \in V \cap U\). \(\square \)
Lemma D.7
Let \((\alpha , \beta ) \in S^{d-1} \times \mathbb {R}\). If \(\Omega \subset \mathbb {R}^d\) is compact with \(\Omega \cap S_{\alpha ,\beta } = \varnothing \), then there is some \(\varepsilon > 0\) such that \(\Omega \subset U_{\alpha ,\beta }^{(\varepsilon )}\).
Proof
The continuous function \(\Omega \rightarrow (0,\infty ), x \mapsto |\langle \alpha , x \rangle + \beta |\), which is well-defined by assumption, attains a minimum \(\varepsilon = \min _{x \in \Omega } |\langle \alpha , x \rangle + \beta | > 0\). \(\square \)
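For the boxes \(\Omega = [-B,B]^d\) used later in the proof of Theorem 3.8, the constant \(\varepsilon \) of Lemma D.7 can be computed explicitly, because an affine-linear function attains its extrema over a box at the vertices. The following sketch rests on that observation; the function name `margin_on_box` and the sample values are our own assumptions, and the lemma itself covers arbitrary compact sets.

```python
from itertools import product


def margin_on_box(alpha, beta, B):
    """Minimum of |<alpha, x> + beta| over the box [-B, B]^d.

    Returns 0.0 if the hyperplane <alpha, x> + beta = 0 meets the box.
    """
    values = [
        sum(a * x for a, x in zip(alpha, vertex)) + beta
        for vertex in product((-B, B), repeat=len(alpha))
    ]
    if min(values) <= 0.0 <= max(values):
        # The affine map changes sign (or vanishes) on the box.
        return 0.0
    # Constant sign on the box: |.| is minimized at a vertex.
    return min(abs(v) for v in values)


eps_disjoint = margin_on_box([0.6, 0.8], 2.0, 1.0)  # hyperplane misses the box
eps_meeting = margin_on_box([0.6, 0.8], 0.5, 1.0)   # hyperplane meets the box
```

In the first call the hyperplane stays away from the box and the returned margin equals \(|\beta | - B \Vert \alpha \Vert _{\ell ^1}\); in the second call no positive margin exists.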
Lemma D.8
Let \(0 \le a < 1\), let \((\alpha , \beta ) \in S^{d-1} \times \mathbb {R}\), and let \(U \subset \mathbb {R}^d\) be open with \(U \cap S_{\alpha , \beta } \ne \varnothing \). Finally, let \(f : U \rightarrow \mathbb {R}\) be continuous, and assume that f is affine-linear on \(U \cap W_{\alpha ,\beta }^{+}\) and on \(U \cap W_{\alpha ,\beta }^{-}\).
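Then, there are \(c, \kappa \in \mathbb {R}\) and \(\zeta \in \mathbb {R}^d\) such that
\[ f(x) = c \cdot \varrho _a (\langle \alpha , x \rangle + \beta ) + \langle \zeta , x \rangle + \kappa \quad \text {for all } x \in U . \]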
Proof
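By assumption, there are \(\xi _1, \xi _2 \in \mathbb {R}^d\) and \(\omega _1, \omega _2 \in \mathbb {R}\) satisfying
\[ f(x) = \langle \xi _1, x \rangle + \omega _1 \quad \text {for } x \in U \cap W_{\alpha ,\beta }^{+} \quad \text {and} \quad f(x) = \langle \xi _2, x \rangle + \omega _2 \quad \text {for } x \in U \cap W_{\alpha ,\beta }^{-} . \]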
Step 1: We claim that \(U \cap S_{\alpha ,\beta } \subset \overline{U \cap W_{\alpha ,\beta }^{\pm }}\). Indeed, for arbitrary \(x \in U \cap S_{\alpha ,\beta }\), we have \(x + t \alpha \in U\) for \(t \in (-\varepsilon ,\varepsilon )\) for a suitable \(\varepsilon > 0\), since U is open. But since \(x \in S_{\alpha ,\beta }\) and \(\Vert \alpha \Vert _{\ell ^2} = 1\), we have \(\langle x + t \alpha , \alpha \rangle + \beta = t\). Hence, \(x + t \alpha \in U \cap W_{\alpha ,\beta }^{+}\) for \(t \in (0,\varepsilon )\) and \(x + t \alpha \in U \cap W_{\alpha ,\beta }^{-}\) for \(t \in (-\varepsilon ,0)\). This easily implies the claim of this step.
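Step 2: We claim that \(\xi _1 - \xi _2 \in {{\,\mathrm{span}\,}}\alpha \). To see this, consider the modified function
\[ {\widetilde{f}} : U \rightarrow \mathbb {R}, \quad x \mapsto f(x) - \big ( \langle \xi _2, x \rangle + \omega _2 \big ) , \]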
which is continuous and satisfies \({\widetilde{f}} \equiv 0\) on \(U \cap W_{\alpha ,\beta }^{-}\) and \({\widetilde{f}}(x) = \langle \theta , x \rangle + \omega \) on \(U \cap W_{\alpha ,\beta }^{+}\), where we defined \(\theta {:}{=}\xi _1 - \xi _2\) and \(\omega {:}{=}\omega _1 - \omega _2\).
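Since we saw in Step 1 that \(U \cap S_{\alpha ,\beta } \subset \overline{U \cap W_{\alpha ,\beta }^{\pm }}\), we thus get by continuity of \({\widetilde{f}}\) that
\[ 0 = \langle \theta , x \rangle + \omega \quad \text {for all } x \in U \cap S_{\alpha ,\beta } . \]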
But by assumption on U, there is some \(x_0 \in U \cap S_{\alpha ,\beta }\). For arbitrary \(v \in \alpha ^{\perp }\), we then have \(x_0 + t v \in U \cap S_{\alpha ,\beta }\) for all \(t \in (-\varepsilon ,\varepsilon )\) and a suitable \(\varepsilon = \varepsilon (v) > 0\), since U is open. Hence, \(0 = \langle \theta , x_0 + t v \rangle + \omega = t \cdot \langle \theta , v \rangle \) for all \(t \in (-\varepsilon ,\varepsilon )\), and thus \(v \in \theta ^{\perp }\). In other words, \(\alpha ^{\perp } \subset \theta ^{\perp }\), and thus \( {{\,\mathrm{span}\,}}\alpha = (\alpha ^{\perp })^\perp \supset (\theta ^{\perp })^{\perp } \ni \theta = \xi _1 - \xi _2 , \) as claimed in this step.
Step 3: In this step, we complete the proof. As seen in the previous step, there is some \(c \in \mathbb {R}\) satisfying \(c \alpha = (\xi _1 - \xi _2) / (1 - a)\). Now, set \(\zeta {:}{=}(\xi _2 - a \xi _1) / (1 - a)\) and \(\kappa {:}{=}f(x_0) - \langle \zeta , x_0 \rangle \), where \(x_0 \in U \cap S_{\alpha ,\beta }\) is arbitrary. Finally, define
\[ g : U \rightarrow \mathbb {R}, \quad x \mapsto c \cdot \varrho _a (\langle \alpha , x \rangle + \beta ) + \langle \zeta , x \rangle + \kappa . \]
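Because of \(x_0 \in S_{\alpha ,\beta }\), we then have \(g(x_0) = \langle \zeta , x_0 \rangle + \kappa = f(x_0)\). Furthermore, since \(\varrho _a (t) = t\) for \(t \ge 0\), and since \(\beta = - \langle \alpha , x_0 \rangle \) and \(c \alpha + \zeta = \xi _1\), we see for all \(x \in U \cap W_{\alpha ,\beta }^{+}\) that
\[ g(x) = c \cdot (\langle \alpha , x \rangle + \beta ) + \langle \zeta , x \rangle + \kappa = \langle c \alpha + \zeta , x - x_0 \rangle + f(x_0) = \langle \xi _1, x - x_0 \rangle + f(x_0) = f(x) . \qquad \text {(D.15)} \]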
Here, the last step used that \(f(x) = \langle \xi _1, x \rangle + \omega _1\) for \(x \in U \cap W_{\alpha ,\beta }^{+}\), and that \(x_0\in U\cap S_{\alpha ,\beta } \subset \overline{U \cap W_{\alpha ,\beta }^{+}}\) by Step 1, so that we get \(f(x_0) = \langle \xi _1, x_0 \rangle + \omega _1\) as well.
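Likewise, since \(\varrho _a (t) = a \, t\) for \(t < 0\), and since \(a c \alpha + \zeta = \xi _2\), we see for \(x \in U \cap W_{\alpha ,\beta }^{-}\) that
\[ g(x) = a c \cdot (\langle \alpha , x \rangle + \beta ) + \langle \zeta , x \rangle + \kappa = \langle a c \alpha + \zeta , x - x_0 \rangle + f(x_0) = \langle \xi _2, x - x_0 \rangle + f(x_0) = f(x) . \qquad \text {(D.16)} \]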
In combination, Eqs. (D.15) and (D.16) show \(f(x) = g(x)\) for all \(x \in U \cap (W_{\alpha ,\beta }^{+} \cup W_{\alpha ,\beta }^{-})\). Since this set is dense in U by Step 1, we are done. \(\square \)
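The choice of \(c\), \(\zeta \), and \(\kappa \) in Step 3 can be checked on a concrete example. The sketch below is an illustration under our own assumptions: the numbers for \(a\), \(\alpha \), \(\beta \), \(\xi _2\), \(\omega _2\) and the scalar `s` (with \(\xi _1 - \xi _2 = s \alpha \), as guaranteed by Step 2) are arbitrary, and \(U\) is replaced by a finite grid. It builds a continuous function that is affine-linear on both sides of \(S_{\alpha ,\beta }\) and compares it with the function \(g\) of the proof.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def prelu(t, a):
    return max(t, a * t)


a = 0.25
alpha, beta = [0.6, 0.8], 0.5          # unit normal and offset of the hyperplane
xi2, omega2 = [1.0, -1.0], 3.0         # affine piece on W^-
s = 2.0                                # xi1 - xi2 = s * alpha (Step 2)
xi1 = [x2 + s * al for x2, al in zip(xi2, alpha)]
omega1 = omega2 + s * beta             # forces continuity across S_{alpha,beta}


def f(x):
    t = dot(alpha, x) + beta
    return dot(xi1, x) + omega1 if t >= 0 else dot(xi2, x) + omega2


# Parameters from Step 3 of the proof.
c = s / (1.0 - a)                                   # c * alpha = (xi1 - xi2) / (1 - a)
zeta = [(x2 - a * x1) / (1.0 - a) for x1, x2 in zip(xi1, xi2)]
x0 = [-beta * al for al in alpha]                   # a point on S_{alpha,beta}
kappa = f(x0) - dot(zeta, x0)


def g(x):
    return c * prelu(dot(alpha, x) + beta, a) + dot(zeta, x) + kappa


grid = [(-2.0 + 0.5 * i, -2.0 + 0.5 * j) for i in range(9) for j in range(9)]
max_gap = max(abs(f(x) - g(x)) for x in grid)
```

Up to rounding, `max_gap` vanishes, in line with Eqs. (D.15) and (D.16).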
With all of these preparations, we can finally prove Theorem 3.8.
Proof of Theorem 3.8
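Since \(\varrho _1 = \mathrm {id}_\mathbb {R}\), the result is trivial for \(a = 1\), since the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _1}^{[-B,B]^d}((d,N_0,1))\) is just the set of all affine-linear maps \([-B,B]^d \rightarrow \mathbb {R}\). Furthermore, if \(a > 1\), then
\[ \varrho _a (x) = \max \{ x, a x \} = a \cdot \max \{ a^{-1} x, x \} = a \cdot \varrho _{a^{-1}} (x) \quad \text {for all } x \in \mathbb {R}, \]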
and hence \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1)) = \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _{a^{-1}}}^{[-B,B]^d}((d,N_0,1))\). Therefore, we can assume \(a < 1\) in the sequel. For brevity, let \(\Omega {:}{=}[-B,B]^d\). Then, each \(\Phi \in \mathcal {NN}((d,N_0,1))\) is of the form \(\Phi = \big ( (A_1, b_1), (A_2, b_2) \big )\) with \(A_1 \in \mathbb {R}^{N_0 \times d}\), \(A_2 \in \mathbb {R}^{1 \times N_0}\), and \(b_1 \in \mathbb {R}^{N_0}\), \(b_2 \in \mathbb {R}^1\).
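The reduction from \(a > 1\) to \(a^{-1} < 1\) rests on the pointwise identity \(\varrho _a = a \cdot \varrho _{a^{-1}}\): the factor \(a\) can be absorbed into the weights of the second layer. A quick numerical confirmation on a few sample points (the value \(a = 4\) and the sample grid are our own choices):

```python
def prelu(t, a):
    """Parametric ReLU: max{t, a t}."""
    return max(t, a * t)


a = 4.0
samples = [k / 4.0 for k in range(-12, 13)]
max_gap = max(abs(prelu(t, a) - a * prelu(t, 1.0 / a)) for t in samples)
```

For \(0< a < 1\) the same identity holds, so the roles of \(a\) and \(a^{-1}\) are interchangeable.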
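Let \((\Phi _n)_{n\in \mathbb {N}} \subset \mathcal {NN}((d,N_0,1))\) with
\[ \Phi _n = \big ( ({\widetilde{A}}_1^n, {\widetilde{b}}_1^n), ({\widetilde{A}}_2^n, {\widetilde{b}}_2^n) \big ) \]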
be such that \(f_n {:}{=}\mathrm {R}_{\varrho _a}^{\Omega }(\Phi _n)\) converges uniformly to some \(f \in C(\Omega )\). Our goal is to prove \(f \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{\Omega }((d,N_0,1))\). The proof of this is divided into seven steps.
Step 1 (Normalizing the rows of the first layer): Our first goal is to normalize the rows of the matrices \({\widetilde{A}}_1^n\); that is, we want to change the parametrization of the network such that \(\Vert ({\widetilde{A}}_1^n)_{i,-} \Vert _{\ell ^2} = 1\) for all \(i \in \underline{N_0}\). To see that this is possible, consider arbitrary \(A \in \mathbb {R}^{M_1 \times M_2} \setminus \{0\}\) and \(b\in \mathbb {R}^{M_1}\); then we obtain by the positive homogeneity of \({\varrho _a}\) for all \(C >0\) that
\[ \varrho _a (A x + b) = C \cdot \varrho _a \big ( C^{-1} A x + C^{-1} b \big ) \quad \text {for all } x \in \mathbb {R}^{M_2} . \]
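This identity shows that for each \(n \in \mathbb {N}\), we can find a network
\[ {\widetilde{\Phi }}_n = \big ( (A_1^n, b_1^n), (A_2^n, b_2^n) \big ) \in \mathcal {NN}((d,N_0,1)) \]
such that the rows of \(A_1^n\) are normalized, that is, \(\Vert (A_1^n)_{i,-} \Vert _{\ell ^2} = 1\) for all \(i \in \underline{N_0}\), and such that
\[ \mathrm {R}_{\varrho _a}^{\Omega } \big ( {\widetilde{\Phi }}_n \big ) = \mathrm {R}_{\varrho _a}^{\Omega } (\Phi _n) = f_n . \]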
Step 2 (Extracting a partially convergent subsequence): By the Theorem of Bolzano-Weierstraß, there is a common subsequence of \((A_1^{n})_{n\in \mathbb {N}}\) and \((b_1^{n})_{n\in \mathbb {N}}\), denoted by \((A^{n_k}_1)_{k \in \mathbb {N}}\) and \((b_1^{n_k})_{k \in \mathbb {N}}\), converging to \(A_1 \in \mathbb {R}^{N_0 \times d}\) and \(b_1 \in [-\infty , \infty ]^{N_0}\), respectively.
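The compactness argument of Step 2 is available only because Step 1 confines the rows of the first layer to the unit sphere without changing the realization. A minimal sketch of this reparametrization for a shallow network follows. It is our own illustration: the function names and the sample weights are hypothetical, and rows of norm zero are simply rejected rather than treated.

```python
import math


def prelu(t, a):
    return max(t, a * t)


def shallow(x, A1, b1, A2, b2, a):
    """Realization of a network with architecture (d, N0, 1)."""
    hidden = [
        prelu(sum(w * xi for w, xi in zip(row, x)) + b, a)
        for row, b in zip(A1, b1)
    ]
    return sum(w * h for w, h in zip(A2, hidden)) + b2


def normalize_rows(A1, b1, A2):
    """Rescale each first-layer row to unit l2-norm, moving the factor to A2."""
    nA1, nb1, nA2 = [], [], []
    for row, b, w in zip(A1, b1, A2):
        C = math.sqrt(sum(r * r for r in row))
        if C == 0.0:
            raise ValueError("zero rows are not handled in this sketch")
        nA1.append([r / C for r in row])
        nb1.append(b / C)
        nA2.append(w * C)  # positive homogeneity: prelu(C t) = C prelu(t)
    return nA1, nb1, nA2


a = 0.25
A1, b1 = [[3.0, 4.0], [-0.5, 0.0], [1.0, -2.0]], [1.0, -2.0, 0.5]
A2, b2 = [0.7, -1.3, 2.0], 0.1
nA1, nb1, nA2 = normalize_rows(A1, b1, A2)

points = [(-1.0, -1.0), (0.0, 0.0), (0.3, -0.7), (1.0, 0.5)]
max_gap = max(
    abs(shallow(x, A1, b1, A2, b2, a) - shallow(x, nA1, nb1, nA2, b2, a))
    for x in points
)
```

Note that the biases are rescaled as well, so they need not stay bounded; this is why the limit \(b_1\) is taken in \([-\infty , \infty ]^{N_0}\).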
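For \(j \in \underline{N_0}\), let \(a_{k,j} \in \mathbb {R}^{d}\) denote the j-th row of \(A_1^{n_k}\), and let \(a_{j} \in \mathbb {R}^d\) denote the j-th row of \(A_1\). Note that \(\Vert a_{k,j} \Vert _{\ell ^2} = \Vert a_j \Vert _{\ell ^2} = 1\) for all \(j \in \underline{N_0}\) and \(k \in \mathbb {N}\). Next, let
\[ J := \big \{ j \in \underline{N_0} \,:\, (b_1)_j \in \{\pm \infty \} \text { or } \big [ (b_1)_j \in \mathbb {R}\text { and } S_{a_j, (b_1)_j} \cap \Omega ^\circ = \varnothing \big ] \big \} , \]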
where \(\Omega ^\circ = (-B, B)^d\) denotes the interior of \(\Omega \). Additionally, let \(J^c {:}{=}\underline{N_0} \setminus J\), and for \(j,\ell \in J^c\) write \(j \simeq \ell \) iff \((a_j, (b_1)_j) \sim (a_\ell , (b_1)_\ell )\), with the relation \(\sim \) introduced in Definition D.3. Note that this makes sense, since \((b_1)_j \in \mathbb {R}\) if \(j \in J^c\). Clearly, the relation \(\simeq \) is an equivalence relation on \(J^c\). Let \((J_i)_{i = 1,\dots ,r}\) denote the equivalence classes of the relation \(\simeq \). For each \(i \in {\underline{r}}\), choose \(\alpha ^{(i)} \in S^{d-1}\) and \(\beta ^{(i)} \in \mathbb {R}\) such that for each \(j \in J_i\) there is a (unique) \(\sigma _j \in \{\pm 1\}\) with \((a_j, (b_1)_j) = \sigma _j \cdot (\alpha ^{(i)}, \beta ^{(i)})\).
Step 3 (Handling the case of distinct singularity hyperplanes): Note that \(r \le |J^c| \le N_0\). Before we continue with the general case, let us consider the special case where equality occurs, that is, where \(r = N_0\). This means that \(J = \varnothing \) (and hence \((b_1)_j \in \mathbb {R}\) and \(\Omega ^\circ \cap S_{a_j, (b_1)_j} \ne \varnothing \) for all \(j \in \underline{N_0}\)), and that each equivalence class \(J_i\) has precisely one element; that is, \((a_j, (b_1)_j) \not \sim (a_\ell , (b_1)_\ell )\) for \(j, \ell \in \underline{N_0}\) with \(j \ne \ell \).
Therefore, Lemma D.6 shows that the functions \((h_j |_{\Omega ^\circ })_{j = 1,\dots ,N_0 + 1},\) where we define \(h_j {:}{=}h_{a_j, (b_1)_j}^{(a)}|_{\Omega }\) for \(j \in \underline{N_0}\) and \(h_{N_0+1} : \Omega \rightarrow \mathbb {R}, x \mapsto 1,\) are linearly independent. In particular, these functions are linearly independent when considered on all of \(\Omega \). Thus, we can define a norm \(\Vert \cdot \Vert _{*}\) on \(\mathbb {R}^{N_0 + 1}\) by virtue of
\[ \Vert c \Vert _{*} := \Big \Vert \sum _{j=1}^{N_0 + 1} c_j \, h_j \Big \Vert _{L^\infty (\Omega )} \quad \text {for } c = (c_j)_{j \in \underline{N_0+1}} \in \mathbb {R}^{N_0 + 1} . \]
Since all norms on the finite dimensional vector space \(\mathbb {R}^{N_0 + 1}\) are equivalent, there is some \(\tau > 0\) with \(\Vert c\Vert _{*} \ge \tau \cdot \Vert c\Vert _{\ell ^1}\) for all \(c \in \mathbb {R}^{N_0 + 1}\).
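Now, recall that \(a_{k,j} \rightarrow a_j\) and \(b_1^{n_k} \rightarrow b_1 \in \mathbb {R}^{N_0}\) as \(k \rightarrow \infty \). Since \(\Omega \) is bounded, this implies for arbitrary \(j \in \underline{N_0}\) and \(h_j^{(k)} {:}{=}h_{a_{k,j}, (b_1^{n_k})_j}^{(a)}\) that \(h_j^{(k)} \rightarrow h_{a_j, (b_1)_j}^{(a)}\) as \(k \rightarrow \infty \), with uniform convergence on \(\Omega \). Thus, there is some \(N_1 \in \mathbb {N}\) such that \(\big \Vert h_j^{(k)} - h_{a_j, (b_1)_j}^{(a)} \big \Vert _{L^\infty (\Omega )} \le \tau / 2\) for all \(k \ge N_1\) and \(j \in \underline{N_0}\). Therefore, if \(k \ge N_1\), we have
\[ \Big \Vert c_{N_0+1} + \sum _{j=1}^{N_0} c_j \, h_j^{(k)} \Big \Vert _{L^\infty (\Omega )} \ge \Vert c \Vert _{*} - \sum _{j=1}^{N_0} |c_j| \cdot \big \Vert h_j^{(k)} - h_{a_j, (b_1)_j}^{(a)} \big \Vert _{L^\infty (\Omega )} \ge \tau \cdot \Vert c \Vert _{\ell ^1} - \frac{\tau }{2} \cdot \Vert c \Vert _{\ell ^1} = \frac{\tau }{2} \cdot \Vert c \Vert _{\ell ^1} \quad \text {for all } c \in \mathbb {R}^{N_0 + 1} . \]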
Since \( f_{n_k} = \mathrm {R}_{\varrho _a}^\Omega \big ( {\widetilde{\Phi }}_{n_k} \big ) = b_2^{n_k} + \sum _{j=1}^{N_0} \big (A_2^{n_k}\big )_{1,j} \,\, h_j^{(k)} \) converges uniformly on \(\Omega \), we thus see that the sequence consisting of \((A_2^{n_k}, b_2^{n_k}) \in \mathbb {R}^{1 \times N_0} \times \mathbb {R}\cong \mathbb {R}^{N_0 + 1}\) is bounded. Thus, there is a further subsequence \((n_{k_{\ell }})_{\ell \in \mathbb {N}}\) such that \(A_2^{n_{k_\ell }} \rightarrow A_2 \in \mathbb {R}^{1 \times N_0}\) and \(b_2^{n_{k_\ell }} \rightarrow b_2 \in \mathbb {R}\) as \(\ell \rightarrow \infty \). But this implies as desired that
\[ f = \lim _{\ell \rightarrow \infty } f_{n_{k_\ell }} = b_2 + \sum _{j=1}^{N_0} (A_2)_{1,j} \, h_{a_j, (b_1)_j}^{(a)} \big |_{\Omega } \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{\Omega }((d,N_0,1)) . \]
Step 4 (Showing that the j-th neuron is eventually affine-linear, for \(j \in J\)): Since Step 3 shows that the claim holds in case of \(r = N_0\), we will from now on consider only the case where \(r < N_0\).
For \(j \in J\), there are two cases: In case of \((b_1)_j \in [0,\infty ]\), define
If otherwise \((b_1)_j \in [-\infty ,0)\), define
Next, for arbitrary \(0< \delta < B\), we define \(\Omega _{\delta } {:}{=}[-(B - \delta ), B - \delta ]^d\). Note that since \(S_{\alpha ^{(i)}, \beta ^{(i)}} \cap \Omega ^\circ \ne \varnothing \) for all \(i \in {\underline{r}}\), there is some \(\delta _0 > 0\) such that \(S_{\alpha ^{(i)}, \beta ^{(i)}} \cap ( -(B - \delta ), B - \delta )^d \ne \varnothing \) for all \(i \in {\underline{r}}\) and all \(0 < \delta \le \delta _0\). For the remainder of this step, we will consider a fixed \(\delta \in (0, \delta _0]\), and we claim that there is some \(N_2 = N_2 (\delta ) > 0\) such that
for all \(j \in J, \, k \ge N_2\) and \(x \in \Omega _{\delta }\), where \({\text {sign}}x = 1\) if \(x > 0\), \({\text {sign}}x = -1\) if \(x < 0\), and \({\text {sign}}0 = 0\). Note that once this is shown, it is not hard to see that there is some \(N_3 = N_3 (\delta ) \in \mathbb {N}\) such that
simply because \((b_1^{n_k})_j \rightarrow (b_1)_j\) and \(\varrho _a (x) = x\) if \(x \ge 0\), and \(\varrho _a (x) = ax\) if \(x < 0\). Therefore, for the affine-linear function
To prove Eq. (D.17), we distinguish two cases for each \(j \in J\); by definition of J, these are the only two possible cases:
Case 1: We have \((b_1)_j \in \{ \pm \infty \}\). In this case, the first part of Eq. (D.17) is trivially satisfied. To prove the second part, note that because of \((b_1^{n_k})_j \rightarrow (b_1)_j \in \{-\infty ,\infty \}\), there is some \(k_j \in \mathbb {N}\) with \(|(b_1^{n_k})_j| \ge 2d \cdot B\) for all \(k \ge k_j\). Since we have \(\Vert a_{k,j} \Vert _{\ell ^2} = 1\) and \(\Vert x \Vert _{\ell ^2} \le \sqrt{d} B \le d B\) for \(x \in \Omega \), this implies
\[ |\langle a_{k,j}, x \rangle + (b_1^{n_k})_j| \ge |(b_1^{n_k})_j| - |\langle a_{k,j}, x \rangle | \ge 2 d B - d B = d B > 0 \]
for all \(x \in \Omega = [-B, B]^d\) and \(k \ge k_j\). Now, since the function \(x \mapsto \langle a_{k,j}, x \rangle + (b_1^{n_k})_j\) is continuous, since \(\Omega \) is connected (in fact convex), and since \(0 \in \Omega \), this implies \({\text {sign}}(\langle a_{k,j}, x \rangle + (b_1^{n_k})_j) = {\text {sign}}(b_1^{n_k})_j\) for all \(x \in \Omega \) and \(k \ge k_j\).
Case 2: We have \((b_1)_j \in \mathbb {R}\), but \(S_{a_j, (b_1)_j} \cap \Omega ^\circ = \varnothing \), and hence \(S_{a_j, (b_1)_j} \cap \Omega _{\delta } = \varnothing \). In view of Lemma D.7, there is thus some \(\varepsilon _{j,\delta } > 0\) satisfying \(\Omega _{\delta } \subset U_{a_j, (b_1)_j}^{(\varepsilon _{j,\delta })}\); that is, \(|\langle a_j, x \rangle + (b_1)_j| \ge \varepsilon _{j,\delta } > 0\) for all \(x \in \Omega _{\delta }\). In particular, since \(0 \in \Omega _{\delta }\), this implies \(|(b_1)_j| \ge \varepsilon _{j,\delta } > 0\) and hence \((b_1)_j \ne 0\), as claimed in the first part of Eq. (D.17).
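To prove the second part, note that because of \(a_{k,j} \rightarrow a_j\) and \((b_1^{n_k})_j \rightarrow (b_1)_j\) as \(k \rightarrow \infty \), there is some \(k_j = k_j (\varepsilon _{j,\delta }) = k_j (\delta ) \in \mathbb {N}\) such that \(\Vert a_{k,j} - a_j \Vert _{\ell ^2} \le \varepsilon _{j,\delta } / (4d B)\) and \(|(b_1^{n_k})_j - (b_1)_j| \le \varepsilon _{j,\delta } / 4\) for all \(k \ge k_j\). Therefore,
\[ |\langle a_{k,j}, x \rangle + (b_1^{n_k})_j| \ge |\langle a_j, x \rangle + (b_1)_j| - \Vert a_{k,j} - a_j \Vert _{\ell ^2} \cdot \Vert x \Vert _{\ell ^2} - |(b_1^{n_k})_j - (b_1)_j| \ge \varepsilon _{j,\delta } - \frac{\varepsilon _{j,\delta }}{4 d B} \cdot d B - \frac{\varepsilon _{j,\delta }}{4} = \frac{\varepsilon _{j,\delta }}{2} > 0 \quad \text {for all } x \in \Omega _{\delta } \text { and } k \ge k_j . \]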
With the same argument as at the end of Case 1, we thus see \({\text {sign}}(\langle a_{k,j}, x \rangle + (b_1^{n_k})_j) = {\text {sign}}(b_1^{n_k})_j\) for all \(x \in \Omega _{\delta }\) and \(k \ge k_j (\delta )\).
Together, the two cases prove that Eq. (D.17) holds for \(N_2 (\delta ) {:}{=}\max _{j \in J} k_j (\delta )\).
Step 5 (Showing that the j-th neuron is affine-linear on \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon ,+)}\) and on \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon ,-)}\) for \(j \in J_i\)): In the following, we write \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\) for one of the two sets \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon ,+)}\) or \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon ,-)}\). We claim that for each \(\varepsilon > 0\), there is some \(N_4 (\varepsilon ) \in \mathbb {N}\) such that:
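To see this, let \(\varepsilon > 0\) be arbitrary, and recall \(J^c = \bigcup _{i=1}^r J_i\). By definition of \(J_i\), and by choice of \(\alpha ^{(i)}\) and \(\beta ^{(i)}\), there is for each \(i \in {\underline{r}}\) and \(j \in J_i\) some \(\sigma _j \in \{\pm 1\}\) satisfying
\[ a_{k,j} \rightarrow a_j = \sigma _j \, \alpha ^{(i)} \quad \text {and} \quad (b_1^{n_k})_j \rightarrow (b_1)_j = \sigma _j \, \beta ^{(i)} \quad \text {as } k \rightarrow \infty . \]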
Thus, there is some \(k^{(j)}(\varepsilon ) \in \mathbb {N}\) such that \(\Vert a_{k,j} - \sigma _j \, \alpha ^{(i)} \Vert _{\ell ^2} \le \varepsilon / (4 d B)\) and \(|(b_1^{n_k})_j - \sigma _j \, \beta ^{(i)}| \le \varepsilon / 4\) for all \(k \ge k^{(j)}(\varepsilon )\).
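Define \(N_4 (\varepsilon ) {:}{=}\max _{j \in J^c} k^{(j)}(\varepsilon )\). Then, for \(k \ge N_4 (\varepsilon )\), \(i \in {\underline{r}}\), \(j \in J_i\), and arbitrary \(x \in \Omega \cap U_{\alpha ^{(i)},\beta ^{(i)}}^{(\varepsilon , \pm )}\), we have on the one hand \(|\sigma _j \cdot (\langle \alpha ^{(i)}, x \rangle + \beta ^{(i)})| \ge \varepsilon \), and on the other hand
\[ \big | \big ( \langle a_{k,j}, x \rangle + (b_1^{n_k})_j \big ) - \sigma _j \cdot \big ( \langle \alpha ^{(i)}, x \rangle + \beta ^{(i)} \big ) \big | \le \Vert a_{k,j} - \sigma _j \, \alpha ^{(i)} \Vert _{\ell ^2} \cdot \Vert x \Vert _{\ell ^2} + |(b_1^{n_k})_j - \sigma _j \, \beta ^{(i)}| \le \frac{\varepsilon }{4 d B} \cdot d B + \frac{\varepsilon }{4} = \frac{\varepsilon }{2} , \]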
since \(\Vert x \Vert _{\ell ^2} \le \sqrt{d} \cdot B \le d B\). In combination, this shows \(|\langle a_{k,j}, x \rangle + (b_1^{n_k})_j| \ge \varepsilon /2 > 0\) for all \(x \in \Omega \cap U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\). But since \(\Omega \cap U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\) is connected (in fact, convex), and since the function \(x \mapsto \langle a_{k,j} , \, x \rangle + (b_1^{n_k})_j\) is continuous, it must have a constant sign on \(\Omega \cap U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\). This easily implies that \(\nu _j^{(k)} = \varrho _a \big (\langle a_{k,j}, \cdot \rangle + (b_1^{n_k})_j\big )\) is indeed affine-linear on \(\Omega \cap U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\) for \(k \ge N_4 (\varepsilon )\).
Step 6 (Proving the “almost convergence” of the sum of all j-th neurons for \(j \in J_i\)): For \(i \in {\underline{r}}\) define
In combination with Eq. (D.18), we see
with \(g_{r+1}^{(k)} : \mathbb {R}^d \rightarrow \mathbb {R}\) being affine-linear.
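Recall from Step 4 that \(\Omega _{\delta _0}^\circ \cap S_{\alpha ^{(i)}, \beta ^{(i)}} \ne \varnothing \) for all \(i \in {\underline{r}}\), by choice of \(\delta _0\). Therefore, Lemma D.4 shows (because of \(U_{\alpha ,\beta }^{(\sigma )} \subset (U_{\alpha ,\beta }^{(\varepsilon )})^\circ \) for \(\varepsilon < \sigma \)) for each \(i \in {\underline{r}}\) that
\[ K_i := \Omega _{\delta _0}^\circ \cap S_{\alpha ^{(i)}, \beta ^{(i)}} \cap \bigcap _{\ell \in {\underline{r}} \setminus \{i\}} \big ( U_{\alpha ^{(\ell )}, \beta ^{(\ell )}}^{(\varepsilon _i)} \big )^\circ \ne \varnothing \quad \text {for a suitable } \varepsilon _i > 0 . \]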
Let us fix some \(x_i \in K_i\) and some \(r_i > 0\) such that \({\overline{B}}_{r_i} (x_i) \subset \Omega _{\delta _0}^\circ \cap \bigcap _{\ell \in {\underline{r}} \setminus \{i\}} \big ( U_{\alpha ^{(\ell )}, \beta ^{(\ell )}}^{(\varepsilon _i)} \big )^\circ \); this is possible, since the set on the right-hand side contains \(x_i\) and is open. Now, since \({\overline{B}}_{r_i}(x_i)\) is connected, we see for each \(\ell \in {\underline{r}} \setminus \{i\}\) that either \({\overline{B}}_{r_i}(x_i)\subset U_{\alpha ^{(\ell )},\beta ^{(\ell )}}^{(\varepsilon _i,+)}\) or \({\overline{B}}_{r_i}(x_i)\subset U_{\alpha ^{(\ell )},\beta ^{(\ell )}}^{(\varepsilon _i,-)}\). Therefore, as a consequence of the preceding step, we see that there is some \(N_5^{(i)} \in \mathbb {N}\) such that \(g_\ell ^{(k)}\) is affine-linear on \({\overline{B}}_{r_i} (x_i)\) for all \(\ell \in {\underline{r}} \setminus \{i\}\) and all \(k \ge N_5^{(i)}\).
Thus, setting \(N_5 {:}{=}\max \{ N_3 (\delta _0), \max _{i = 1,\dots ,r} N_5^{(i)} \}\), we see as a consequence of Eq. (D.19) and because of \({\overline{B}}_{r_i} (x_i) \subset \Omega _{\delta _0}^\circ \) that for each \(i \in {\underline{r}}\) and any \(k \ge N_5\), there is an affine-linear map \(q_i^{(k)} : \mathbb {R}^d \rightarrow \mathbb {R}\) satisfying
Next, note that Step 5 implies for arbitrary \(\varepsilon > 0\) that for all k large enough (depending on \(\varepsilon \)), \(g_i^{(k)}\) is affine-linear on \(B_{r_i} (x_i) \cap U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\). Since \(f(x) = \lim _k f_{n_k}(x) = \lim _{k} g_i^{(k)}(x) + q_i^{(k)}(x)\), we thus see that f is affine-linear on \(B_{r_i} (x_i) \cap U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\) for arbitrary \(\varepsilon > 0\). Therefore, f is affine-linear on \(B_{r_i} (x_i) \cap W_{\alpha ^{(i)}, \beta ^{(i)}}^{\pm }\) and continuous on \(\Omega \supset B_{r_i}(x_i)\), and we have \(x_i \in B_{r_i} (x_i) \cap S_{\alpha ^{(i)}, \beta ^{(i)}} \ne \varnothing \). Thus, Lemma D.8 shows that there are \(c_i \in \mathbb {R}\), \(\zeta _i \in \mathbb {R}^d\), and \(\kappa _i \in \mathbb {R}\) such that
We now intend to make use of the following elementary fact: If \((\psi _k)_{k \in \mathbb {N}}\) is a sequence of maps \(\psi _k : \mathbb {R}^d \rightarrow \mathbb {R}\), if \(\Theta \subset \mathbb {R}^d\) is such that each \(\psi _k\) is affine-linear on \(\Theta \), and if \(U \subset \Theta \) is a nonempty open subset such that \(\psi (x) {:}{=}\lim _{k \rightarrow \infty } \psi _k (x) \in \mathbb {R}\) exists for all \(x \in U\), then \(\psi \) can be uniquely extended to an affine-linear map \(\psi : \mathbb {R}^d \rightarrow \mathbb {R}\), and we have \(\psi _k (x) \rightarrow \psi (x)\) for all \(x \in \Theta \), even with locally uniform convergence. Essentially, what is used here is that the vector space of affine-linear maps \(\mathbb {R}^d \rightarrow \mathbb {R}\) is finite-dimensional, so that the (Hausdorff) topology of pointwise convergence on U coincides with that of locally uniform convergence on \(\Theta \); see [61, Theorem 1.21].
To use this observation, note that Eqs. (D.20) and (D.21) show that \(g_i^{(k)} + q_i^{(k)}\) converges pointwise to \(G_i\) on \(B_{r_i}(x_i)\). Furthermore, since \(x_i \in S_{\alpha ^{(i)}, \beta ^{(i)}}\), it is not hard to see that there is some \(\varepsilon _0 > 0\) with \(\big ( U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )} \big )^{\circ } \cap B_{r_i} (x_i) \ne \varnothing \) for all \(\varepsilon \in (0, \varepsilon _0)\); for the details, we refer to Step 1 in the proof of Lemma D.8. Finally, as a consequence of Step 5, we see for arbitrary \(\varepsilon \in (0, \varepsilon _0)\) that \(g_i^{(k)} + q_i^{(k)}\) and \(G_i\) are both affine-linear on \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\), at least for k large enough (depending on \(\varepsilon \)). Thus, the observation from above (with \(\Theta = U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\) and \(U = \Theta ^{\circ } \cap B_{r_i} (x_i)\)) implies that \(g_i^{(k)} + q_i^{(k)} \rightarrow G_i\) pointwise on \(U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon , \pm )}\), for arbitrary \(\varepsilon \in (0,\varepsilon _0)\).
Because of \( \bigcup _{\sigma \in \{\pm \}} \bigcup _{0< \varepsilon < \varepsilon _0} U_{\alpha ^{(i)}, \beta ^{(i)}}^{(\varepsilon ,\sigma )} = \mathbb {R}^d \setminus S_{\alpha ^{(i)}, \beta ^{(i)}}, \) this implies
Step 7 (Finishing the proof): For arbitrary \(\delta \in (0, \delta _0)\), let us set
Then, Eqs. (D.19) and (D.22) imply for \(k \ge N_3 (\delta )\) that
But since \(g_{r+1}^{(k)}\) and all \(q_i^{(k)}\) are affine-linear, and since \(\Lambda _\delta \) is an open set of positive measure, this implies that there is an affine-linear map \(\psi : \mathbb {R}^d \rightarrow \mathbb {R}, x \mapsto \langle \zeta , x \rangle + \kappa \) satisfying \(f - \sum _{i=1}^r G_i = \psi \) on \(\Lambda _\delta \), for arbitrary \(\delta \in (0, \delta _0)\). Note that \(\psi \) is independent of the choice of \(\delta \), and thus
But the latter set is dense in \(\Omega \) (since its complement is a null-set), and f and \(\psi + \sum _{i = 1}^r G_i\) are continuous on \(\Omega \). Hence,
\[ f(x) = \psi (x) + \sum _{i=1}^r G_i (x) \quad \text {for all } x \in \Omega . \]
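Recalling from Steps 3 and 4 that \(r < N_0 \), this implies \(f \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^\Omega ((d,r+1,1)) \subset \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^\Omega ((d,N_0,1))\), as claimed. Here, we implicitly used that
\[ \langle \alpha , x \rangle = \varrho _a \big ( \langle \alpha , x \rangle + d B \, \Vert \alpha \Vert _{\ell ^2} \big ) - d B \, \Vert \alpha \Vert _{\ell ^2} \quad \text {for all } x \in \Omega \text { and } \alpha \in \mathbb {R}^d , \]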
since \(\langle \alpha , x \rangle + d B \, \Vert \alpha \Vert _{\ell ^2} \ge 0\) for \(x \in \Omega = [-B,B]^d\), so that \(\varrho _a (\langle \alpha , x \rangle + d B \Vert \alpha \Vert _{\ell ^2}) = \langle \alpha , x \rangle + d B \Vert \alpha \Vert _{\ell ^2}\). \(\square \)
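The closing observation, namely that a single \(\varrho _a\)-neuron with a sufficiently large bias reproduces a linear map on the box \(\Omega = [-B,B]^d\), can be verified numerically. The sketch below is our own illustration; the vector `alpha`, the constants `B` and `a`, and the sample grid are arbitrary choices.

```python
import math


def prelu(t, a):
    return max(t, a * t)


def linear_via_neuron(x, alpha, B, a):
    """Evaluate <alpha, x> through one prelu neuron that stays in its linear regime."""
    # Shift d * B * ||alpha||_2 keeps the pre-activation non-negative on [-B, B]^d.
    shift = len(alpha) * B * math.sqrt(sum(c * c for c in alpha))
    pre = sum(c * xi for c, xi in zip(alpha, x)) + shift
    return prelu(pre, a) - shift


alpha, B, a = [1.5, -2.0], 1.0, 0.25
grid = [(-B + 0.5 * i, -B + 0.5 * j) for i in range(5) for j in range(5)]
max_gap = max(
    abs(linear_via_neuron(x, alpha, B, a) - sum(c * xi for c, xi in zip(alpha, x)))
    for x in grid
)
```

The constant `-shift` is then absorbed into the bias of the output layer, which is how the affine-linear part \(\psi \) is realized with one additional neuron.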
Appendix E: Proofs of the Results in Sect. 4
E.1. Proof of Proposition 4.1
Step 1: We first show that if \((f_n)_{n \in \mathbb {N}}\) and \((g_n)_{n \in \mathbb {N}}\) are sequences of continuous functions \(f_n : \mathbb {R}^d \rightarrow \mathbb {R}^N\) and \(g_n : \mathbb {R}^N \rightarrow \mathbb {R}^D\) that satisfy \(f_n \rightarrow f\) and \(g_n \rightarrow g\) with locally uniform convergence, then also \(g_n \circ f_n \rightarrow g \circ f\) locally uniformly.
To see this, let \(R, \varepsilon > 0\) be arbitrary. On \(\overline{B_R}(0) \subset \mathbb {R}^d\), we then have \(f_n \rightarrow f\) uniformly. In particular, \(C {:}{=}\sup _{n \in \mathbb {N}} \sup _{|x| \le R} | f_n (x) | < \infty \); here, we implicitly used that f and all \(f_n\) are continuous, and hence bounded on \(\overline{B_R}(0)\). But on \(\overline{B_C}(0) \subset \mathbb {R}^N\), we have \(g_n \rightarrow g\) uniformly, so that there is some \(n_1 \in \mathbb {N}\) with \(| g_n (y) - g(y) | < \varepsilon \) for all \(n \ge n_1\) and all \(y \in \mathbb {R}^N\) with \(| y | \le C\). Furthermore, g is uniformly continuous on \(\overline{B_C} (0)\), so that there is some \(\delta > 0\) with \(| g(y) - g(z) | < \varepsilon \) for all \(y,z \in \overline{B_C} (0)\) with \(| y-z | \le \delta \). Finally, by the uniform convergence of \(f_n \rightarrow f\) on \(\overline{B_R}(0)\), we get some \(n_2 \in \mathbb {N}\) with \(| f_n (x) - f(x) | \le \delta \) for all \(n \ge n_2\) and all \(x \in \mathbb {R}^d\) with \(| x | \le R\).
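Overall, these considerations show for \(n \ge \max \{n_1, n_2\}\) and \(x \in \mathbb {R}^d\) with \(| x | \le R\) that
\[ | g_n (f_n (x)) - g (f(x)) | \le | g_n (f_n (x)) - g (f_n (x)) | + | g (f_n (x)) - g (f(x)) | < \varepsilon + \varepsilon = 2 \varepsilon . \]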
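Step 2: We show that \(\mathrm {R}^{\Omega }_{\varrho }\) is continuous. Assume that some neural network sequence \((\Phi _n)_{n \in \mathbb {N}} \subset \mathcal {NN}((d,N_1,\dots ,N_L))\) given by \({\Phi _n = \big ( (A_1^{(n)}, b_1^{(n)}), \dots , (A_L^{(n)}, b_L^{(n)}) \big )}\) fulfills \(\Phi _n \rightarrow \Phi = \big ( (A_1,b_1), \dots , (A_L,b_L) \big ) \in \mathcal {NN}((d,N_1,\dots ,N_L))\). For \(\ell \in \{1,\dots ,L-1\}\) set (with \(N_0 := d\))
\[ \alpha _\ell ^{(n)} : \mathbb {R}^{N_{\ell -1}} \rightarrow \mathbb {R}^{N_\ell }, \; x \mapsto \varrho _\ell \big ( A_\ell ^{(n)} x + b_\ell ^{(n)} \big ) \quad \text {and} \quad \alpha _\ell : \mathbb {R}^{N_{\ell -1}} \rightarrow \mathbb {R}^{N_\ell }, \; x \mapsto \varrho _\ell ( A_\ell \, x + b_\ell ) , \]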
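where \(\varrho _\ell {:}{=}\varrho \times \cdots \times \varrho \) denotes the \(N_\ell \)-fold Cartesian product of \(\varrho \). Likewise, set
\[ \alpha _L^{(n)} : \mathbb {R}^{N_{L-1}} \rightarrow \mathbb {R}^{N_L}, \; x \mapsto A_L^{(n)} x + b_L^{(n)} \quad \text {and} \quad \alpha _L : \mathbb {R}^{N_{L-1}} \rightarrow \mathbb {R}^{N_L}, \; x \mapsto A_L \, x + b_L . \]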
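By what was shown in Step 1, it is not hard to see for every \(\ell \in \{1,\dots ,L\}\) that \(\alpha _\ell ^{(n)} \rightarrow \alpha _\ell \) locally uniformly as \(n \rightarrow \infty \). By another (inductive) application of Step 1, this shows
\[ \alpha _L^{(n)} \circ \cdots \circ \alpha _1^{(n)} \rightarrow \alpha _L \circ \cdots \circ \alpha _1 \quad \text {as } n \rightarrow \infty , \]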
with locally uniform convergence. Since \(\Omega \) is compact, this implies uniform convergence on \(\Omega \), and thus completes the proof of the first claim.
Step 3: Let \(\varrho _\ell {:}{=}\varrho \times \cdots \times \varrho \) be the \(N_\ell \)-fold Cartesian product of \(\varrho \) in case of \(\ell \in \{1,\dots ,L-1\}\), and set \(\varrho _L {:}{=}\mathrm {id}_{\mathbb {R}^{N_L}}\). For arbitrary \(x \in \Omega \) and \(\Phi = \big ( (A_1,b_1), \dots , (A_L,b_L) \big ) \in \mathcal {NN}(S)\), define inductively \(\alpha _x^{(0)} (\Phi ) {:}{=}x \in \mathbb {R}^d = \mathbb {R}^{N_0}\), and
\[ \alpha _x^{(\ell + 1)} (\Phi ) := \varrho _{\ell + 1} \big ( A_{\ell + 1} \, \alpha _x^{(\ell )} (\Phi ) + b_{\ell + 1} \big ) \in \mathbb {R}^{N_{\ell + 1}} \quad \text {for } \ell \in \{0, \dots , L-1\} . \]
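Let \(R > 0\) be fixed, but arbitrary. We will prove by induction on \(\ell \in \{0, \dots , L\}\) that
\[
| \alpha _x^{(\ell )} (\Phi ) | \le C_{\ell ,R} \quad \text {and} \quad | \alpha _x^{(\ell )} (\Phi ) - \alpha _x^{(\ell )} (\Psi ) | \le M_{\ell ,R} \cdot \Vert \Phi - \Psi \Vert _{\mathrm {total}}
\]
for suitable \(C_{\ell ,R},M_{\ell ,R} > 0,\) arbitrary \(x \in \Omega \) and \(\Phi ,\Psi \in \mathcal {NN}(S)\) with \(\Vert \Phi \Vert _{\mathrm {total}}, \Vert \Psi \Vert _{\mathrm {total}} \le R\).
This will imply that \(\mathrm {R}^{\Omega }_{\varrho }\) is locally Lipschitz, since clearly \(\mathrm {R}_\varrho ^\Omega (\Phi )(x) = \alpha _x^{(L)} (\Phi )\), and hence
\[
\Vert \mathrm {R}_\varrho ^\Omega (\Phi ) - \mathrm {R}_\varrho ^\Omega (\Psi ) \Vert _{\sup } = \sup _{x \in \Omega } | \alpha _x^{(L)} (\Phi ) - \alpha _x^{(L)} (\Psi ) | \le M_{L,R} \cdot \Vert \Phi - \Psi \Vert _{\mathrm {total}} .
\]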
The case \(\ell = 0\) is trivial: On the one hand, \(| \alpha _x^{(0)}(\Phi ) - \alpha _x^{(0)}(\Psi ) | = 0 \le \Vert \Phi - \Psi \Vert _{\mathrm {total}}\). On the other hand, since \(\Omega \) is bounded, we have \(| \alpha _x^{(0)} (\Phi ) | = | x | \le C_0\) for a suitable constant \(C_0 = C_0 (\Omega )\).
For the induction step, let us write \(\Psi = \big ( (B_1,c_1), \dots , (B_L,c_L)\big )\), and note that
Clearly, the same estimate holds with \(A_{\ell +1},b_{\ell +1}\) and \(\Phi \) replaced by \(B_{\ell +1}, c_{\ell +1}\) and \(\Psi \), respectively. Next, observe that with \(\varrho \) also \(\varrho _{\ell +1}\) is locally Lipschitz. Thus, there is \(\Gamma _{\ell +1,R} > 0\) with
for all \(x,y \in \mathbb {R}^{N_{\ell +1}}\) with \(\Vert x \Vert _{\ell ^\infty }, \Vert y \Vert _{\ell ^\infty } \le K_{\ell +1,R}\). On the one hand, this implies
On the other hand, we also get
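The induction step can be summarized in the following sketch. It rests on two assumptions that are not spelled out in this part of the text: all vector norms are \(\ell ^\infty \)-norms, and \(\Vert \Phi \Vert _{\mathrm {total}}\) dominates the largest absolute entry of every \(A_\ell \) and every \(b_\ell \). The specific constants are illustrative choices, not necessarily those of the original estimates.
\[
\begin{aligned}
\Vert A_{\ell +1} \, \alpha _x^{(\ell )} (\Phi ) + b_{\ell +1} \Vert _{\ell ^\infty }
&\le N_\ell \, \Vert A_{\ell +1} \Vert _{\max } \, \Vert \alpha _x^{(\ell )} (\Phi ) \Vert _{\ell ^\infty } + \Vert b_{\ell +1} \Vert _{\ell ^\infty }
\le (N_\ell \, C_{\ell ,R} + 1) \cdot R =: K_{\ell +1,R} , \\
\Vert \alpha _x^{(\ell +1)} (\Phi ) \Vert _{\ell ^\infty }
&\le \Vert \varrho _{\ell +1} (0) \Vert _{\ell ^\infty } + \Gamma _{\ell +1,R} \cdot K_{\ell +1,R} =: C_{\ell +1,R} , \\
\Vert \alpha _x^{(\ell +1)} (\Phi ) - \alpha _x^{(\ell +1)} (\Psi ) \Vert _{\ell ^\infty }
&\le \Gamma _{\ell +1,R} \cdot \Vert (A_{\ell +1} - B_{\ell +1}) \, \alpha _x^{(\ell )} (\Phi ) + B_{\ell +1} \big ( \alpha _x^{(\ell )} (\Phi ) - \alpha _x^{(\ell )} (\Psi ) \big ) + (b_{\ell +1} - c_{\ell +1}) \Vert _{\ell ^\infty } \\
&\le \Gamma _{\ell +1,R} \cdot \big ( N_\ell \, C_{\ell ,R} + N_\ell \, R \, M_{\ell ,R} + 1 \big ) \cdot \Vert \Phi - \Psi \Vert _{\mathrm {total}} =: M_{\ell +1,R} \cdot \Vert \Phi - \Psi \Vert _{\mathrm {total}} .
\end{aligned}
\]
The second and third lines use that \(\varrho _{\ell +1}\) is \(\Gamma _{\ell +1,R}\)-Lipschitz on the \(\ell ^\infty \)-ball of radius \(K_{\ell +1,R}\), which contains the origin as well as both pre-activation vectors.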
Step 4 Let \(\varrho \) be Lipschitz with Lipschitz constant M, where we assume without loss of generality that \(M \ge 1\). With the functions \(\varrho _\ell \) from the preceding step, it is not hard to see that each \(\varrho _\ell \) is M-Lipschitz, where we use the \(\Vert \cdot \Vert _{\ell ^\infty }\)-norm on \(\mathbb {R}^{N_\ell }\).
Let \(\Phi = \big ( (A_1,b_1), \dots , (A_L,b_L) \big ) \in \mathcal {NN}(S)\), and \(\alpha _\ell : \mathbb {R}^{N_{\ell -1}} \rightarrow \mathbb {R}^{N_\ell }, x \mapsto \varrho _\ell (A_\ell \, x + b_\ell )\) for \(\ell \in \{1,\dots ,L\}\). Then, \(\alpha _\ell \) is Lipschitz with \(\mathrm {Lip}(\alpha _\ell ) \le M \cdot \Vert A_\ell \Vert _{\ell ^\infty \rightarrow \ell ^\infty } \le M \cdot N_{\ell -1} \cdot \Vert A_\ell \Vert _{\max } \le M N_{\ell -1} \cdot \Vert \Phi \Vert _{\mathrm {scaling}}\). Thus, we finally see that \(\mathrm {R}^{\Omega }_{\varrho } (\Phi ) = \alpha _L \circ \cdots \circ \alpha _1\) is Lipschitz with Lipschitz constant at most \(M^L \cdot N_0 \cdots N_{L-1} \cdot \Vert \Phi \Vert _{\mathrm {scaling}}^L\). This proves the final claim of the proposition when choosing the \(\ell ^\infty \)-norm on \(\mathbb {R}^d\) and \(\mathbb {R}^{N_L}\). Of course, one can choose a norm other than the \(\ell ^\infty \)-norm, at the cost of possibly enlarging the constant C in the statement of the proposition. \(\square \)
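As a numerical sanity check of the bound \(M^L \cdot N_0 \cdots N_{L-1} \cdot \Vert \Phi \Vert _{\mathrm {scaling}}^L\), the following minimal sketch compares it with sampled difference quotients of a small ReLU network (so \(M = 1\)). It assumes that \(\Vert \Phi \Vert _{\mathrm {scaling}}\) is the largest absolute entry of the weight matrices; the helper names `realize`, `lipschitz_bound`, `sup_norm` and `random_network` are hypothetical and chosen only for this illustration.

```python
import random


def relu(t):
    return max(t, 0.0)


def realize(layers, x):
    """Evaluate the network: ReLU after every layer except the last one."""
    for idx, (A, b) in enumerate(layers):
        x = [sum(a * v for a, v in zip(row, x)) + beta for row, beta in zip(A, b)]
        if idx < len(layers) - 1:
            x = [relu(v) for v in x]
    return x


def lipschitz_bound(layers, M=1.0):
    """M^L * N_0 * ... * N_{L-1} * scaling^L, where scaling is assumed to be
    the largest absolute entry of all weight matrices."""
    L = len(layers)
    scaling = max(abs(a) for A, _ in layers for row in A for a in row)
    widths = 1
    for A, _ in layers:
        widths *= len(A[0])  # N_{l-1} is the number of columns of A_l
    return (M ** L) * widths * scaling ** L


def sup_norm(v):
    return max(abs(t) for t in v)


def random_network(dims, rng):
    """Random weights in [-1, 1] for the architecture dims = (N_0, ..., N_L)."""
    layers = []
    for n_in, n_out in zip(dims[:-1], dims[1:]):
        A = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.uniform(-1, 1) for _ in range(n_out)]
        layers.append((A, b))
    return layers


if __name__ == "__main__":
    rng = random.Random(0)
    net = random_network([2, 3, 3, 1], rng)
    x, y = [0.3, -0.2], [-0.5, 0.9]
    out_diff = sup_norm([u - v for u, v in zip(realize(net, x), realize(net, y))])
    in_diff = sup_norm([u - v for u, v in zip(x, y)])
    print("difference quotient:", out_diff / in_diff)
    print("upper bound:", lipschitz_bound(net))
```

The bound is typically far from sharp, which is in line with its role in the proof: only its dependence on \(\Vert \Phi \Vert _{\mathrm {scaling}}\) matters.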
1.2 E.2. Proof of Theorem 4.2
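Step 1 For \(a > 0\), define
\[
f_a : \mathbb {R}\rightarrow \mathbb {R}, \quad x \mapsto \varrho (x + a) - 2 \varrho (x) + \varrho (x - a) .
\]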
Our claim in this step is that there is some \(a > 0\) with \(f_a \not \equiv \mathrm {const}\).
Let us assume toward a contradiction that this fails; that is, \(f_a \equiv c_a\) for all \(a > 0\). Since \(\varrho \) is Lipschitz continuous, it is at most of linear growth, so that \(\varrho \) is a tempered distribution. We will now make use of the Fourier transform, which we define by \({\widehat{f}}(\xi ) = \int _{\mathbb {R}} f(x) \, e^{-2 \pi i x \xi } \, d x\) for \(f \in L^1(\mathbb {R})\), as in [26, 31], where it is also explained how the Fourier transform is extended to the space of tempered distributions. Elementary properties of the Fourier transform for tempered distributions (see [31, Proposition 2.3.22]) show
Next, setting \(z(\xi ) {:}{=}e^{2\pi i a \xi } \ne 0\), we observe that
as long as \(z(\xi ) \ne 1\), that is, as long as \(\xi \notin a^{-1} \mathbb {Z}\).
Let \(\varphi \in C_c^\infty (\mathbb {R})\) such that \(0 \not \in \hbox {supp}\varphi \) be fixed, but arbitrary. This implies \(\hbox {supp}\varphi \subset \mathbb {R}\setminus a^{-1} \mathbb {Z}\) for some sufficiently small \(a > 0\). Since \(g_a\) vanishes nowhere on the compact set \(\hbox {supp}\varphi \), it is not hard to see that there is some smooth, compactly supported function h with \(h \cdot g_a \equiv 1\) on the support of \(\varphi \). All in all, we thus get
Since \(\varphi \in C_c^\infty (\mathbb {R})\) with \(0 \not \in \hbox {supp}\varphi \) was arbitrary, we have shown \(\hbox {supp}{\widehat{\varrho }} \subset \{0\}\). But by [31, Corollary 2.4.2], this implies that \(\varrho \) is a polynomial. Since the only globally Lipschitz continuous polynomials are affine-linear, \(\varrho \) must be affine-linear, contradicting the prerequisites of the theorem.
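The Fourier computation behind this step can be sketched as follows, assuming that \(f_a\) is the second difference \(f_a (x) = \varrho (x+a) - 2 \varrho (x) + \varrho (x-a)\) and that \(g_a\) denotes the resulting Fourier multiplier; the precise form of both is an assumption consistent with the surrounding text. With \(\delta _0\) the Dirac distribution at the origin and \(z = z(\xi ) = e^{2 \pi i a \xi }\),
\[
c_a \cdot \delta _0 = \widehat{f_a} = g_a \cdot {\widehat{\varrho }}, \qquad g_a (\xi ) = e^{2 \pi i a \xi } - 2 + e^{-2 \pi i a \xi } = z - 2 + z^{-1} = z^{-1} \cdot (z - 1)^2 ,
\]
so that \(g_a (\xi ) \ne 0\) whenever \(z(\xi ) \ne 1\). For \(\varphi \) and \(h\) as above, this gives
\[
\langle {\widehat{\varrho }}, \varphi \rangle = \langle {\widehat{\varrho }}, \, h \cdot g_a \cdot \varphi \rangle = \langle g_a \cdot {\widehat{\varrho }}, \, h \cdot \varphi \rangle = c_a \cdot \langle \delta _0, \, h \cdot \varphi \rangle = c_a \cdot h(0) \cdot \varphi (0) = 0 ,
\]
where the last step uses \(0 \notin \hbox {supp}\varphi \).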
Step 2 In this step we construct certain continuous functions \(F_n : \mathbb {R}^d \rightarrow \mathbb {R}\) which satisfy \(\mathrm {Lip}(F_n|_\Omega ) \rightarrow \infty \) and \(F_n \rightarrow 0\) uniformly on \(\mathbb {R}^d\). We will then use these functions in the next step to construct the desired networks \(\Phi _n\).
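We first note that each function \(f_a\) from Step 1 is bounded. In fact, if \(\varrho \) is M-Lipschitz, then
\[
| f_a (x) | \le | \varrho (x + a) - \varrho (x) | + | \varrho (x - a) - \varrho (x) | \le 2 M |a| \quad \text {for all } x \in \mathbb {R}. \tag{E.1}
\]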
Next, recall that \(\varrho \) is Lipschitz continuous and not affine-linear. Therefore, Lemma C.8 shows that there is some \(t_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(t_0\) with \(\varrho '(t_0) \ne 0\). Therefore, Proposition B.3 shows that there is a neural network \(\Phi \in \mathcal {NN}((1,\dots ,1))\) with \(L-1\) layers such that \(\psi {:}{=}\mathrm {R}^{\mathbb {R}}_{\varrho } (\Phi )\) is differentiable at the origin with \(\psi (0) = 0\) and \(\psi '(0) = 1\). By definition, this means that there is a function \(\delta : \mathbb {R}\rightarrow \mathbb {R}\) such that \(\psi (x) = x + x \cdot \delta (x)\) and \(\delta (x) \rightarrow 0 = \delta (0)\) as \(x \rightarrow 0\).
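Next, since \(\Omega \) has nonempty interior, there exist \(x_0 \in \mathbb {R}^d\) and \(r > 0\) with \(x_0 + [-r,r]^d \subset \Omega \). Let us now choose \(a > 0\) with \(f_a \not \equiv \mathrm {const}\) (the existence of such an \(a > 0\) is implied by the previous step), and define
\[
F_n : \mathbb {R}^d \rightarrow \mathbb {R}, \quad x \mapsto \psi \big ( n^{-1} \cdot f_a \big ( n^2 \cdot (x - x_0)_1 \big ) \big ) \quad \text {for } n \in \mathbb {N}.
\]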
Since \(f_a\) is not constant, there are \(b,c \in \mathbb {R}\) with \(b < c\) and \(f_a (b) \ne f_a (c)\). Because of \(\delta (x) \rightarrow 0\) as \(x \rightarrow 0\), we see that there is some \(\kappa > 0\) and some \(n_1 \in \mathbb {N}\) with
Let us set \(x_n {:}{=}x_0 + n^{-2} \cdot (b,0,\dots ,0) \in \mathbb {R}^d\) and \(y_n {:}{=}x_0 + n^{-2} \cdot (c,0,\dots ,0) \in \mathbb {R}^d\), and observe \(x_n, y_n \in \Omega \) for \(n \in \mathbb {N}\) large enough. We have \(| x_n - y_n | = n^{-2} \cdot | b-c |\). Furthermore, using the expansion \(\psi (x) = x + x \cdot \delta (x)\), and noting \(f_a (n^2 (x_n - x_0)_1) = f_a(b)\) as well as \(f_a(n^2 (y_n - x_0)_1) = f_a(c)\), we get
as long as \(n \ge n_1\) is so large that \(x_n,y_n \in \Omega \). But this implies
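The blow-up of the Lipschitz constants can be traced in the following sketch. It assumes \(F_n (x) = \psi \big ( n^{-1} f_a (n^2 (x - x_0)_1) \big )\) and that \(\kappa \) and \(n_1\) are chosen such that the lower bound in the second line holds; both are assumptions consistent with the surrounding text. Such a choice is possible, for instance with \(\kappa = | f_a (b) - f_a (c) | / 2\), because the expression inside the absolute value converges to \(f_a (b) - f_a (c) \ne 0\) as \(n \rightarrow \infty \).
\[
\begin{aligned}
| F_n (x_n) - F_n (y_n) |
&= \big | \psi \big ( n^{-1} f_a (b) \big ) - \psi \big ( n^{-1} f_a (c) \big ) \big | \\
&= n^{-1} \cdot \big | f_a (b) \cdot \big ( 1 + \delta (n^{-1} f_a (b)) \big ) - f_a (c) \cdot \big ( 1 + \delta (n^{-1} f_a (c)) \big ) \big | \ge \frac{\kappa }{n} , \\
\mathrm {Lip} (F_n|_\Omega )
&\ge \frac{| F_n (x_n) - F_n (y_n) |}{| x_n - y_n |} \ge \frac{\kappa / n}{n^{-2} \cdot | b - c |} = \frac{n \cdot \kappa }{| b - c |} \xrightarrow [n \rightarrow \infty ]{} \infty .
\end{aligned}
\]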
It remains to show \(F_n \rightarrow 0\) uniformly on \(\mathbb {R}^d\). Thus, let \(\varepsilon > 0\) be arbitrary. By continuity of \(\psi \) at 0, there is some \(\delta > 0\) with \(| \psi (x) | \le \varepsilon \) for \(| x | \le \delta \). But Eq. (E.1) shows \(|n^{-1} \cdot f_a (n^{2} \cdot (x-x_0)_1)| \le n^{-1} \cdot 2M|a| \le \delta \) for all \(x \in \mathbb {R}^d\) and all \(n \ge n_0\), with \(n_0 = n_0(M,a,\delta ) \in \mathbb {N}\) suitable. Hence, \(| F_n (x) | \le \varepsilon \) for all \(n \ge n_0\) and \(x \in \mathbb {R}^d\).
Step 3 In this step, we construct the networks \(\Phi _n\). For \(n \in \mathbb {N}\) define
\[
A_1^{(n)} {:}{=}n^2 \cdot \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \end{pmatrix} \in \mathbb {R}^{3 \times d} \quad \text {and} \quad b_1^{(n)} {:}{=}\begin{pmatrix} -n^2 \cdot (x_0)_1 + a \\ -n^2 \cdot (x_0)_1 \\ -n^2 \cdot (x_0)_1 - a \end{pmatrix} \in \mathbb {R}^3 ,
\]
as well as \(A_2^{(n)} {:}{=}n^{-1} \cdot (1, -2, 1) \in \mathbb {R}^{1 \times 3}\) and \(b_2^{(n)} {:}{=}0 \in \mathbb {R}^1\). A direct calculation shows
\[
\mathrm {R}^{\mathbb {R}^d}_{\varrho } (\Phi _n^{(0)}) (x) = n^{-1} \cdot f_a \big ( n^2 \cdot (x - x_0)_1 \big )
\]
for all \(x \in \mathbb {R}^d,\) where \(\Phi _n^{(0)} {:}{=}\big ( (A_1^{(n)},b_1^{(n)}), (A_2^{(n)},b_2^{(n)}) \big ).\) Thus, with the concatenation operation introduced in Definition B.2, the network \(\Phi _n^{(1)}\) obtained by concatenating the network \(\Phi \) from Step 2 with \(\Phi _n^{(0)}\) satisfies \(\mathrm {R}^{\Omega }_{\varrho }(\Phi _n^{(1)}) = F_n|_\Omega \). Furthermore, it is not hard to see that \(\Phi _n^{(1)}\) has L layers and has the architecture \((d,3,1,\dots ,1)\). From this and because of \(N_1 \ge 3\), by Lemma B.1 there is a network \(\Phi _n\) with architecture \((d,N_1,\dots ,N_{L-1},1)\) and \(\mathrm {R}^{\Omega }_{\varrho }(\Phi _n) = F_n|_\Omega \). By Step 2, this implies \(\mathrm {R}^{\Omega }_{\varrho } (\Phi _n) = F_n|_\Omega \rightarrow 0\) uniformly on \(\Omega \), as well as \(\mathrm {Lip}(\mathrm {R}^{\Omega }_{\varrho } (\Phi _n)) \rightarrow \infty \) as \(n \rightarrow \infty \).
Step 4 In this step, we establish the final property which is stated in the theorem. For this, let us assume toward a contradiction that there is a family of networks \((\Psi _n)_{n \in \mathbb {N}}\) with architecture S and \(\mathrm {R}^{\Omega }_{\varrho }(\Psi _n) = \mathrm {R}^{\Omega }_{\varrho }(\Phi _n)\), some \(C > 0\), and a subsequence \((\Psi _{n_r})_{r \in \mathbb {N}}\) with \(\Vert \Psi _{n_r} \Vert _{\mathrm {scaling}} \le C\) for all \(r \in \mathbb {N}\). In view of the last part of Proposition 4.1, there is a constant \(C' = C'(\varrho ,S) > 0\) with
\[
\mathrm {Lip} \big (\mathrm {R}^{\Omega }_{\varrho }(\Phi _{n_r}) \big ) = \mathrm {Lip} \big (\mathrm {R}^{\Omega }_{\varrho }(\Psi _{n_r}) \big ) \le C' \cdot \Vert \Psi _{n_r} \Vert _{\mathrm {scaling}}^L \le C' \cdot C^L \quad \text {for all } r \in \mathbb {N},
\]
in contradiction to \(\mathrm {Lip} \big (\mathrm {R}^{\Omega }_{\varrho }(\Phi _n) \big ) \rightarrow \infty \). \(\square \)
1.3 E.3. Proof of Corollary 4.3
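Let us denote the range of the realization map by R. By definition (see [44, p. 65]), \(\mathrm {R}^{\Omega }_{\varrho }\) is a quotient map if and only if
\[
\forall \, M \subset R : \quad M \subset R \text { is open} \quad \Longleftrightarrow \quad \left( \mathrm {R}^{\Omega }_{\varrho }\right) ^{-1}(M) \subset \mathcal {NN}(S) \text { is open} .
\]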
Clearly, by switching to complements, we can equivalently replace “open” by “closed” everywhere.
Now, choose a sequence of neural networks \((\Phi _n)_{n \in \mathbb {N}}\) as in Theorem 4.2, and set \(F_n {:}{=}\mathrm {R}^{\mathbb {R}^d}_{\varrho } (\Phi _n)\). Since \(\mathrm {Lip}(F_n|_{\Omega }) \rightarrow \infty \), we have \(F_n |_{\Omega } \not \equiv 0\) for all \(n \ge n_0\) with \(n_0 \in \mathbb {N}\) suitable. Define \(M {:}{=}\{F_n |_{\Omega } \,:\,n \ge n_0 \} \subset R\). Note that \(M \subset R \subset C(\Omega )\) is not closed, since \(F_n |_{\Omega } \rightarrow 0\) uniformly, but \(0 \in R \setminus M\). Hence, once we show that \(\left( \mathrm {R}^{\Omega }_{\varrho }\right) ^{-1}(M)\) is closed, we will have shown that \(\mathrm {R}^{\Omega }_{\varrho }\) is not a quotient map.
Thus, let \((\Psi _n)_{n \in \mathbb {N}}\) be a sequence in \(\left( \mathrm {R}^{\Omega }_{\varrho }\right) ^{-1}(M)\) and assume \(\Psi _n \rightarrow \Psi \) as \(n\rightarrow \infty \). In particular, \(\Vert \Psi _n \Vert _{\mathrm {scaling}} \le C\) for some \(C > 0\) and all \(n \in \mathbb {N}\). We want to show \(\Psi \in \left( \mathrm {R}^{\Omega }_{\varrho }\right) ^{-1}(M)\) as well. Since \(\Psi _n \in \left( \mathrm {R}^\Omega _{\varrho }\right) ^{-1}(M)\), there is for each \(n \in \mathbb {N}\) some \(r_n \in \mathbb {N}\) with \(\mathrm {R}^{\Omega }_{\varrho } (\Psi _n) = F_{r_n}|_{\Omega }\). Now there are two cases:
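Case 1: The family \((r_n)_{n \in \mathbb {N}}\) is infinite. But in view of Proposition 4.1, we have
\[
\mathrm {Lip}(F_{r_n}|_{\Omega }) = \mathrm {Lip} \big (\mathrm {R}^{\Omega }_{\varrho }(\Psi _n) \big ) \le C' \cdot \Vert \Psi _n \Vert _{\mathrm {scaling}}^L \le C' \cdot C^L \quad \text {for all } n \in \mathbb {N}
\]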
for a suitable constant \(C' = C'(\varrho , S)\), in contradiction to the fact that \(\mathrm {Lip}(F_{r_n}|_{\Omega }) \rightarrow \infty \) as \(r_n \rightarrow \infty \). Thus, this case cannot occur.
Case 2: The family \((r_n)_{n \in \mathbb {N}}\) is finite. Thus, there is some \(N \in \mathbb {N}\) with \(r_n = N\) for infinitely many \(n \in \mathbb {N}\), that is, \(\mathrm {R}^\Omega _{\varrho }(\Psi _n) = F_{r_n}|_{\Omega } = F_N|_{\Omega }\) for infinitely many \(n \in \mathbb {N}\). But since \(\mathrm {R}^{\Omega }_{\varrho } (\Psi _{n}) \rightarrow \mathrm {R}^{\Omega }_{\varrho }(\Psi )\) as \(n \rightarrow \infty \) (by the continuity of the realization map), this implies \(\mathrm {R}^{\Omega }_{\varrho } (\Psi ) = F_N|_{\Omega } \in M\), as desired. \(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Petersen, P., Raslan, M. & Voigtlaender, F. Topological Properties of the Set of Functions Generated by Neural Networks of Fixed Size. Found Comput Math 21, 375–444 (2021). https://doi.org/10.1007/s10208-020-09461-0
DOI: https://doi.org/10.1007/s10208-020-09461-0