Introduction

Neural networks, introduced in 1943 by McCulloch and Pitts [49], are the basis of every modern machine learning algorithm based on deep learning [30, 43, 63]. The term deep learning describes a variety of methods that are based on the data-driven manipulation of the weights of a neural network. Since these methods perform spectacularly well in practice, they have become the state-of-the-art technology for a host of applications including image classification [36, 41, 65], speech recognition [22, 34, 69], game intelligence [64, 66, 70], and many more.

This success of deep learning has encouraged many scientists to pick up research in the area of neural networks after the field had gone dormant for decades. In particular, quite a few mathematicians have recently investigated the properties of different neural network architectures, hoping that this can explain the effectiveness of deep learning techniques. In this context, mathematical analysis has mainly been conducted in the context of statistical learning theory [20], where the overall success of a learning method is determined by the approximation properties of the underlying function class, the feasibility of optimizing over this class, and the generalization capabilities of the class, when only training with finitely many samples.

In the approximation theoretical part of deep learning research, one analyzes the expressiveness of deep neural network architectures. The universal approximation theorem [21, 35, 45] demonstrates that neural networks can approximate any continuous function, as long as one uses networks of increasing complexity for the approximation. If one is interested in approximating more specific function classes than the class of all continuous functions, then one can often quantify more precisely how large the networks have to be to achieve a given approximation accuracy for functions from the restricted class. Examples of such results are [7, 14, 51, 52, 57, 71]. Some articles [18, 54, 57, 62, 72] study in particular in which sense deep networks have a superior expressiveness compared to their shallow counterparts, thereby partially explaining the efficiency of networks with many layers in deep learning.

Another line of research studies the training procedures employed in deep learning. Given a set of training samples, the training process is an optimization problem over the parameters of a neural network, where a loss function is minimized. The loss function is typically a nonlinear, non-convex function of the weights of the network, rendering the optimization of this function highly challenging [8, 13, 38]. Nonetheless, in applications, neural networks are often trained successfully through a variation of stochastic gradient descent. In this regard, the energy landscape of the problem was studied and found to allow convergence to a global optimum, if the problem is sufficiently overparametrized; see [1, 16, 27, 56, 67].

The third large area of mathematical research on deep neural networks is analyzing the so-called generalization error of deep learning. In the framework of statistical learning theory [20, 53], the discrepancy between the empirical loss and the expected loss of a classifier is called the generalization error. Specific bounds for this error for the class of deep neural networks were analyzed for instance in [4, 11], and in more specific settings for instance in [9, 10].

In this work, we study neural networks from a different point of view. Specifically, we study the structure of the set of functions implemented by neural networks of fixed size. These sets are naturally (nonlinear) subsets of classical function spaces like \(L^p(\Omega )\) and \(C(\Omega )\) for compact sets \(\Omega \).

Due to the size of the networks being fixed, our analysis is inherently non-asymptotic. Therefore, our viewpoint is fundamentally different from the analysis in the framework of statistical learning theory. Indeed, in approximation theory, the expressive power of networks growing in size is analyzed. In optimization, one studies the convergence properties of iterative algorithms—usually that of some form of stochastic gradient descent. Finally, when considering the generalization capabilities of deep neural networks, one mainly studies how and with which probability the empirical loss of a classifier converges to the expected loss, for increasing numbers of random training samples and depending on the sizes of the underlying networks.

Given this strong delineation to the classical fields, we will see that our point of view yields interpretable results describing phenomena in deep learning that are not directly explained by the classical approaches. We will describe these results and their interpretations in detail in Sects. 1.1–1.3.

We will use standard notation throughout most of the paper without explicitly introducing it. We do, however, collect a list of used symbols and notions in Appendix A. To not interrupt the flow of reading, we have deferred several auxiliary results to Appendix B and all proofs and related statements to Appendices C–E.

Before we continue, we formally introduce the notion of spaces of neural networks of fixed size.

Neural Networks of Fixed Size: Basic Terminology

To state our results, it will be necessary to distinguish between a neural network as a set of weights and the associated function implemented by the network, which we call its realization. To explain this distinction, let us fix numbers \(L, N_0, N_1, \dots , N_{L} \in \mathbb {N}\). We say that a family \(\Phi = \big ( (A_\ell ,b_\ell ) \big )_{\ell = 1}^L\) of matrix-vector tuples of the form \(A_\ell \in \mathbb {R}^{N_{\ell } \times N_{\ell -1}}\) and \(b_\ell \in \mathbb {R}^{N_\ell }\) is a neural network. We call \(S{:}{=}(N_0, N_1, \dots , N_L)\) the architecture of \(\Phi \); furthermore \(N(S){:}{=}\sum _{\ell = 0}^L N_\ell \) is called the number of neurons of S and \(L = L(S)\) is the number of layers of S. We call \(d{:}{=}N_0\) the input dimension of \(\Phi \), and throughout this introduction we assume that the output dimension \(N_L\) of the networks is equal to one. For a given architecture S, we denote by \(\mathcal {NN}(S)\) the set of neural networks with architecture S.

Defining the realization of such a network \(\Phi = \big ( (A_\ell ,b_\ell ) \big )_{\ell =1}^L\) requires two additional ingredients: a so-called activation function \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\), and a domain of definition \(\Omega \subset \mathbb {R}^{N_0}\). Given these, the realization of the network \(\Phi = \big ( (A_\ell ,b_\ell ) \big )_{\ell =1}^L\) is the function

$$\begin{aligned} \mathrm {R}_\varrho ^\Omega \left( \Phi \right) : \Omega \rightarrow \mathbb {R}, \ \ x \mapsto x_L \, , \end{aligned}$$

where \(x_L\) results from the following scheme:

$$\begin{aligned} \begin{aligned} x_0&{:}{=}x, \\ x_{\ell }&{:}{=}\varrho (A_{\ell } \, x_{\ell -1} + b_\ell ), \quad \text { for } \ell = 1, \dots , L-1,\\ x_L&{:}{=}A_{L} \, x_{L-1} + b_{L}, \end{aligned} \end{aligned}$$

and where \(\varrho \) acts componentwise; that is, \(\varrho (x_1,\dots ,x_d) := (\varrho (x_1),\dots ,\varrho (x_d))\). In what follows, we study topological properties of sets of realizations of neural networks with a fixed size. Naturally, there are multiple conventions to specify the size of a network. We will study the set of realizations of networks with a given architecture S and activation function \(\varrho \); that is, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S) {:}{=}\{\mathrm {R}_\varrho ^\Omega (\Phi ) :\Phi \in \mathcal {NN}(S) \}\). In the context of machine learning, this point of view is natural, since one usually prescribes the network architecture, and during training only adapts the weights of the network.
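The computational scheme above is straightforward to sketch in code. The following minimal NumPy implementation of the realization map is purely illustrative (the ReLU activation and the random example network with architecture \(S = (2,3,1)\) are our own choices, not prescribed by the text):

```python
import numpy as np

def realization(Phi, rho, x):
    """Evaluate the realization of Phi = ((A_1,b_1),...,(A_L,b_L)) at x,
    following the scheme: x_0 = x, x_ell = rho(A_ell x_{ell-1} + b_ell)
    for ell < L, and an affine map without activation in the last layer."""
    for A, b in Phi[:-1]:
        x = rho(A @ x + b)
    A_L, b_L = Phi[-1]
    return A_L @ x + b_L

relu = lambda t: np.maximum(t, 0.0)

# An example network with architecture S = (2, 3, 1): two inputs, three hidden neurons.
rng = np.random.default_rng(0)
Phi = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
       (rng.standard_normal((1, 3)), rng.standard_normal(1))]
y = realization(Phi, relu, np.array([0.5, -1.0]))
print(y.shape)  # (1,)
```

Note that `Phi` here is exactly the family of matrix-vector tuples from the definition, while `realization(Phi, relu, .)` is the associated function; the distinction between the two objects is the starting point of the analysis.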

Before we continue, let us note that the set \(\mathcal {NN}(S)\) of all neural networks (that is, the network weights) with a fixed architecture forms a finite-dimensional vector space, which we equip with the norm

$$\begin{aligned} \Vert \Phi \Vert _{\mathcal {NN}(S)} {:}{=}\Vert \Phi \Vert _{\mathrm {scaling}} + \max _{\ell = 1,\dots ,L} \Vert b_\ell \Vert _{\max }, \text { for } \Phi = \big ( (A_\ell , b_\ell ) \big )_{\ell =1}^L \in \mathcal {NN}(S) , \end{aligned}$$

where \(\Vert \Phi \Vert _{\mathrm {scaling}} {:}{=}\max _{\ell = 1,\dots ,L } \Vert A_\ell \Vert _{\max }\). If the specific architecture of \(\Phi \) does not matter, we simply write \(\Vert \Phi \Vert _{\mathrm {total}}{:}{=}\Vert \Phi \Vert _{\mathcal {NN}(S)}\). In addition, if \(\varrho \) is continuous, we denote the realization map by

$$\begin{aligned} \mathrm {R}^{\Omega }_{\varrho } : \mathcal {NN}(S) \rightarrow C(\Omega ; \mathbb {R}^{N_L}), ~ \Phi \mapsto \mathrm {R}^{\Omega }_{\varrho } (\Phi ). \end{aligned}$$
(1.1)
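As a quick sketch, the norm \(\Vert \cdot \Vert _{\mathcal {NN}(S)}\) defined above can be computed directly from the weight list; the small example network below is made up for illustration:

```python
import numpy as np

def nn_norm(Phi):
    """||Phi||_{NN(S)} = max_ell ||A_ell||_max + max_ell ||b_ell||_max,
    where ||.||_max is the maximum absolute value of the entries."""
    scaling = max(np.max(np.abs(A)) for A, _ in Phi)   # ||Phi||_scaling
    bias    = max(np.max(np.abs(b)) for _, b in Phi)
    return scaling + bias

Phi = [(np.array([[2.0, -1.0]]), np.array([0.5])),
       (np.array([[3.0]]), np.array([-4.0]))]
print(nn_norm(Phi))  # 3.0 + 4.0 = 7.0
```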

While the activation function \(\varrho \) can in principle be chosen arbitrarily, a couple of particularly useful activation functions have been established in the literature. We proceed by listing some of the most common activation functions, a few of their properties, as well as references to articles using these functions in the context of deep learning. We note that all activation functions listed below are non-constant, monotonically increasing, globally Lipschitz continuous functions. This property is much stronger than the assumption of local Lipschitz continuity that we will require in many of our results. Furthermore, all functions listed below belong to the class \(C^\infty (\mathbb {R}\setminus \{0\}).\)

In the remainder of this introduction, we discuss our results concerning the topological properties of the sets of realizations of neural networks with fixed architecture and their interpretation in the context of deep learning. Then, we give an overview of related work. We note at this point that it is straightforward to generalize all of the results in this paper to neural networks for which one only prescribes the total number of neurons and layers and not the specific architecture.

For simplicity, we will always assume in the remainder of this introduction that \(\Omega \subset \mathbb {R}^{N_0}\) is compact with non-empty interior.

Non-convexity of the Set of Realizations

We will show in Sect. 2 (Theorem 2.1) that, for a given architecture S, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not convex, except possibly when the activation function is a polynomial, which is clearly not the case for any of the activation functions that are commonly used in practice.

In fact, for a large class of activation functions (including the ReLU and the standard sigmoid activation function), the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) turns out to be highly non-convex in the sense that for every \(r \in [0,\infty )\), the set of functions having uniform distance at most r to some function in \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not convex. We prove this result in Theorem 2.2 and Remark 2.3.

This non-convexity is undesirable, since for non-convex sets, well-defined projection operators onto them need not exist. In classical statistical learning theory [20], the property that the so-called regression function can be uniquely projected onto a convex (and compact) hypothesis space greatly simplifies the learning problem; see [20, Sect. 7]. Furthermore, in applications where the realization of a network—rather than its set of weights—is the quantity of interest (for example when a network is used as an Ansatz for the solution of a PDE, as in [24, 42]), our results show that the Ansatz space is non-convex. This non-convexity is inconvenient if one aims for a convergence proof of the underlying optimization algorithm, since one cannot apply convexity-based fixed-point theorems. Concretely, if a neural network is optimized by stochastic gradient descent so as to satisfy a certain PDE, it is natural to ask whether there even exists a network at which the iteration can stop; in other words, whether gradient descent on the set of neural networks (potentially with bounded weights) has a fixed point. If the space of neural networks were convex and compact, then Schauder's fixed-point theorem would guarantee the existence of such a fixed point.

(Non-)Closedness of the Set of Realizations

For any fixed architecture S, we show in Sect. 3 (Theorem 3.1) that the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not a closed subset of \(L^p (\mu )\) for \(0< p < \infty \), under very mild assumptions on the measure \(\mu \) and the activation function \(\varrho \). The assumptions concerning \(\varrho \) are satisfied for all activation functions used in practice.

For the case \(p = \infty \), the situation is more involved: For all activation functions that are commonly used in practice—except for the (parametric) ReLU—the associated sets \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) are non-closed also with respect to the uniform norm; see Theorem 3.3. For the (parametric) ReLU, however, the question of closedness of the sets \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) remains mostly open. Nonetheless, in two special cases, we prove in Sect. 3.4 that the sets \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) are closed. In particular, for neural network architectures with two layers only, Theorem 3.8 establishes the closedness of \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\), where \(\varrho \) is the (parametric) ReLU.

A practical consequence of the observation of non-closedness can be identified with the help of the following argument that is made precise in Sect. 3.3: We show that the set

$$\begin{aligned} \left\{ \mathrm {R}_\varrho ^\Omega (\Phi ) \,:\, \Phi = ((A_\ell ,b_\ell ))_{\ell =1}^L \text { has architecture } S \text { with } \Vert A_\ell \Vert + \Vert b_\ell \Vert \le C \right\} \end{aligned}$$

of realizations of neural networks with a fixed architecture and all affine linear maps bounded in a suitable norm, is always closed. As a consequence, we observe the following phenomenon of exploding weights: If a function f is such that it does not have a best approximation in \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\), that is, if there does not exist \(f^*\in \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) such that

$$\begin{aligned} \Vert f^* - f\Vert _{L^p(\mu )} = \tau _f {:}{=}\inf _{g \in {\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)}} \Vert f-g\Vert _{L^p(\mu )}, \end{aligned}$$

then for any sequence of networks \((\Phi _n)_{n \in \mathbb {N}}\) with architecture S which satisfies \(\Vert f - \mathrm {R}_\varrho ^\Omega (\Phi _n)\Vert _{L^p(\mu )} \rightarrow \tau _f\), the weights of the networks \(\Phi _n\) cannot remain uniformly bounded as \(n \rightarrow \infty \). In words, if f does not have a best approximation in the set of neural networks of fixed size, then every sequence of realizations approximately minimizing the distance to f will have exploding weights. Since \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) is not closed, there do exist functions f which do not have a best approximation in \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\).

Certainly, the presence of large coefficients will make the numerical optimization increasingly unstable. Thus, exploding weights in the sense described above are highly undesirable in practice.
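The exploding-weights phenomenon can be illustrated numerically. In the toy computation below (our own illustrative choice, not the construction used in Sect. 3.3), difference quotients of the sigmoid approximate its derivative uniformly on a compact interval, while the output weights of the corresponding two-hidden-neuron networks grow without bound:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
target = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # sigma'(x)

x = np.linspace(-5.0, 5.0, 2001)
for n in [1, 10, 100, 1000]:
    # f_n = n*(sigma(x + 1/n) - sigma(x)) is the realization of a network
    # with two hidden neurons and output weights +n and -n (unbounded in n).
    f_n = n * (sigmoid(x + 1.0 / n) - sigmoid(x))
    err = np.max(np.abs(f_n - target(x)))
    print(f"n = {n:5d}   sup-error = {err:.2e}   output weight size = {n}")
```

The sup-error decays roughly like 1/n, so any sequence of these realizations approaching the limit function must have weights of size n tending to infinity, mirroring the abstract argument above.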

The argument above discusses an approximation problem in an \(L^p\)-norm. In practice, one usually only minimizes “empirical norms”. We will demonstrate in Proposition 3.6 that also in this situation, for increasing numbers of samples, the weights of the neural networks that minimize the empirical norms necessarily explode under certain assumptions. Note that the setup of having a fixed architecture and a potentially unbounded number of training samples is common in applications where neural networks are trained to solve partial differential equations. There, training samples are generated during the training process [25, 42].

Failure of Inverse Stability of the Realization Map

As our final result, we study (in Sect. 4) the stability of the realization map \(\mathrm {R}_\varrho ^\Omega \) introduced in Eq. (1.1), which maps a family of weights to its realization. Even though this map will turn out to be continuous from the finite-dimensional parameter space to \(L^p (\Omega )\) for any \(p \in (0,\infty ]\), we will show that it is not inverse stable. In other words, for two realizations that are very close in the uniform norm, there do not always exist network weights associated with these realizations that have a small distance. In fact, Theorem 4.2 even shows that there exists a sequence of realizations of networks converging uniformly to 0, but such that every sequence of weights with these realizations is necessarily unbounded.

For both of these results—continuity and no inverse stability—we only need to assume that the activation function \(\varrho \) is Lipschitz continuous and not constant.

These properties of the realization map pinpoint a potential problem that can occur when training a neural network: Let us consider a regression problem, where a network is iteratively updated by a (stochastic) gradient descent algorithm trying to minimize a loss function. It is then possible that at some iterate the loss function exhibits a very small error, even though the associated network parameters have a large distance to the optimal parameters. This issue is especially severe since a small error term leads to small steps if gradient descent methods are used in the optimization. Consequently, convergence to the very distant optimal weights will be slow even if the energy landscape of the optimization problem happens to be free of spurious local minima.
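As an illustrative analogue of this instability (not the actual construction of Theorem 4.2), the one-hidden-neuron sigmoid realizations below converge uniformly to zero on \([-1,1]\), while their inner weights, and hence the slopes they must reproduce near the origin, blow up; since a network with a Lipschitz activation and bounded weights has a bounded Lipschitz constant, no uniformly bounded weights can implement such functions:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.linspace(-1.0, 1.0, 4001)
for n in [1, 10, 100]:
    # f_n = (1/n) * [sigma(n^2 x) - sigma(0)]: a one-hidden-neuron realization
    # with inner weight n^2 and outer weight 1/n.
    f_n = (sigmoid(n**2 * x) - sigmoid(0.0)) / n
    sup_norm = np.max(np.abs(f_n))
    slope_at_0 = n * 0.25                   # f_n'(0) = n * sigma'(0) = n/4
    print(f"n = {n:4d}   sup|f_n| = {sup_norm:.2e}   f_n'(0) = {slope_at_0:.1f}")
```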

Related Work

Structural properties

The aforementioned properties of non-convexity and non-closedness have, to some extent, been studied before. Classical results analyze the spaces of shallow neural networks, that is, of \(\mathcal {RNN}_\varrho ^\Omega (S)\) for \(S = (d, N_1, 1)\), so that \(L = 2\). For such sets of shallow networks, a property that has been extensively studied is to what extent \(\mathcal {RNN}_\varrho ^\Omega (S)\) has the best approximation property. Here, we say that \(\mathcal {RNN}_\varrho ^\Omega (S)\) has the best approximation property, if for every function \(f \in L^p(\Omega )\), \(1 \le p \le \infty \), there exists a function \(F(f) \in \mathcal {RNN}_\varrho ^\Omega (S)\) such that \(\Vert f - F(f) \Vert _{L^p} = \inf _{g \in \mathcal {RNN}_\varrho ^\Omega (S)} \Vert f - g \Vert _{L^p}\). In [40] it was shown that even if a minimizer always exists, the map \(f \mapsto F(f)\) is necessarily discontinuous. Furthermore, at least for the Heaviside activation function, there does exist a (non-unique) best approximation; see [39].

Additionally, [28, Proposition 4.1] demonstrates, for shallow networks as before, that for the logistic activation function \(\varrho (x) = (1 + e^{-x})^{-1}\), the set \(\mathcal {RNN}_\varrho ^\Omega (S)\) does not have the best approximation property in \(C(\Omega )\). In the proof of this statement, it was also shown that \(\mathcal {RNN}_\varrho ^\Omega (S)\) is not closed. Furthermore, it is claimed that this result should hold for every nonlinear activation function. The previously mentioned result of [39] and Theorem 3.8 below disprove this conjecture for the Heaviside and ReLU activation functions, respectively.

Other notions of (non-)convexity

In deep learning, one chooses a loss function \({{\mathcal {L}}: C(\Omega ) \rightarrow [0,\infty )}\), which is then minimized over the set of neural networks \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega }(S)\) with fixed architecture S. A typical loss function is the empirical square loss, that is,

$$\begin{aligned} E_N(f) {:}{=}\frac{1}{N} \sum _{i = 1}^N |f(x_i) - y_i|^2, \end{aligned}$$

where \((x_i,y_i)_{i=1}^N \subset \Omega \times \mathbb {R}\), \(N \in \mathbb {N}\). In practice, one solves the minimization problem over the weights of the network; that is, one attempts to minimize the function \({\mathcal {L}} \circ \mathrm {R}_\varrho ^\Omega : \mathcal {NN}(S) \rightarrow [0,\infty )\). In this context, to assess the hardness of this optimization problem, one studies whether \({\mathcal {L}} \circ \mathrm {R}_\varrho ^\Omega \) is convex, the degree to which it is non-convex, and whether one can find remedies to alleviate the problem of non-convexity; see for instance [5, 6, 27, 37, 50, 56, 59, 67, 73].
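A minimal sketch of the empirical square loss and its composition with the realization map; the tiny dataset and the single-neuron ReLU network below are made-up placeholders:

```python
import numpy as np

def empirical_square_loss(f, samples):
    """E_N(f) = (1/N) * sum_i |f(x_i) - y_i|^2."""
    return np.mean([(f(x) - y) ** 2 for x, y in samples])

# Made-up data and a one-hidden-neuron ReLU network x -> a*relu(w*x + b) + c.
samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
relu = lambda t: max(t, 0.0)

def loss_of_weights(w, b, a, c):
    # The composed map L o R: network weights -> empirical loss of the realization.
    return empirical_square_loss(lambda x: a * relu(w * x + b) + c, samples)

print(loss_of_weights(1.0, 0.0, 1.0, 0.0))  # the identity fits the data exactly: 0.0
```

It is this composed map `loss_of_weights`, not the loss functional on function space, whose convexity properties are studied in the cited works.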

It is important to emphasize that this notion of non-convexity describes properties of the loss function, in contrast to the non-convexity of the sets of functions that we analyze in this work.

Non-convexity of the Set of Realizations

In this section, we analyze the convexity of the set of all neural network realizations. In particular, we will show that this set is highly non-convex for all practically used activation functions listed in Table 1. First, we examine the convexity of the set \(\mathcal {R}\mathcal {N}\mathcal {N}^{\Omega }_{\varrho }(S)\):

Table 1 Commonly used activation functions and their properties

Theorem 2.1

Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture with \(L \in \mathbb {N}_{\ge 2}\) and let \(\Omega \subset \mathbb {R}^d\) with non-empty interior. Moreover, let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous.

If \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is convex, then \(\varrho \) is a polynomial.

Remark

(1) It is easy to see that all of the activation functions in Table 1 are locally Lipschitz continuous, and that none of them is a polynomial. Thus, the associated sets of realizations are never convex.

(2) In the case where \(\varrho \) is a polynomial, the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) might or might not be convex. Indeed, if \(S = (1, N, 1)\) and \(\varrho (x) = x^m\), then it is not hard to see that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is convex if and only if \(N \ge m\).
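The claim for \(N \ge m\) can be made plausible with a small linear-algebra check: the shifted powers \((x+j)^m\), \(j = 0, \dots , m-1\), together with the constant function, span all polynomials of degree at most m, so for \(N \ge m\) the realization set is a linear, hence convex, space. The particular integer shifts used below are an arbitrary illustrative choice:

```python
import numpy as np
from math import comb

def shifted_power_rank(m):
    """Rank of {1, (x+0)^m, ..., (x+m-1)^m} in the monomial basis x^0, ..., x^m."""
    rows = [[1.0] + [0.0] * m]                      # the constant function 1
    for j in range(m):
        # Binomial expansion: (x + j)^m = sum_k C(m, k) * j^(m-k) * x^k
        rows.append([comb(m, k) * float(j) ** (m - k) for k in range(m + 1)])
    return int(np.linalg.matrix_rank(np.array(rows)))

for m in [1, 2, 3, 4, 5]:
    print(m, shifted_power_rank(m))   # rank m+1: the m+1 functions span degree <= m
```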

Proof

The detailed proof of Theorem 2.1 is the subject of “Appendix C.1”. Let us briefly outline the proof strategy:

  1. We first show in Proposition C.1 that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is closed under scalar multiplication, hence star-shaped with respect to the origin, i.e., 0 is a center.

  2. Next, using the local Lipschitz continuity of \(\varrho \), we establish in Proposition C.4 that the maximal number of linearly independent centers of the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is finite. Precisely, it is bounded by the number of parameters of the underlying neural networks, given by \(\sum _{\ell = 1}^L (N_{\ell -1} + 1) N_{\ell }\).

  3. A direct consequence of Step 2 is that if \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is convex, then it can only contain a finite number of linearly independent functions; see Corollary C.5.

  4. Finally, using that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S)\) is a translation-invariant subset of \(C(\mathbb {R}^d)\), we show in Proposition C.6 that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\mathbb {R}^d}(S)\) (and hence also \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\)) contains infinitely many linearly independent functions, if \(\varrho \) is not a polynomial.

\(\square \)
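To make the non-convexity concrete in the simplest ReLU case, note that any realization with architecture \(S = (1, 1, 1)\) is of the form \(x \mapsto a \, \varrho (wx + b) + c\) and thus has at most one kink, whereas the midpoint of two such realizations can have two. The following toy check (our own illustration, separate from the proof above) counts kinks via discrete second differences:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def count_kinks(f, grid, tol=1e-6):
    """Count clusters of grid points where the discrete second difference of f
    is non-negligible; each kink contributes one such cluster."""
    y = f(grid)
    h = grid[1] - grid[0]
    curv = np.abs(y[2:] - 2 * y[1:-1] + y[:-2]) / h**2
    big = curv > tol
    # Count maximal runs of True (a kink straddling a grid cell spans <= 2 points).
    return int(np.sum(big[1:] & ~big[:-1]) + big[0])

x = np.linspace(-2.0, 2.0, 4001)
f1 = lambda t: relu(t)            # a single-neuron realization, kink at 0
f2 = lambda t: relu(t - 1.0)      # another single-neuron realization, kink at 1
mid = lambda t: 0.5 * (f1(t) + f2(t))

print(count_kinks(f1, x), count_kinks(f2, x), count_kinks(mid, x))  # 1 1 2
```

Since the midpoint has two kinks, it cannot lie in the realization set for \(S = (1,1,1)\), so that set is not convex, in line with Theorem 2.1.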

In applications, the non-convexity of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) might not be as problematic as it first seems. If, for instance, the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S) + B_\delta (0)\) of functions that can be approximated up to error \(\delta > 0\) by a neural network with architecture S were convex, then one could argue that the non-convexity of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is not severe. Indeed, in practice, neural networks are only trained to minimize a certain empirical loss function, with resulting bounds on the generalization error which are typically of size \(\varepsilon = {\mathcal {O}}(m^{-1/2})\), with m denoting the number of training samples. In this setting, one is not really interested in "completely minimizing" the (empirical) loss function, but would be content with finding a function for which the empirical loss is \(\varepsilon \)-close to the global minimum. Hence, one could argue that one is effectively working with a hypothesis space of the form \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S) + B_\delta (0)\), containing all functions that can be represented up to an error of \(\delta \) by neural networks of architecture S.

To quantify this potentially more relevant notion of convexity of neural networks, we define, for a subset A of a vector space \({\mathcal {Y}}\), the convex hull of A as

$$\begin{aligned} \mathrm {co}(A) {:}{=}\bigcap _{B \subset {\mathcal {Y}} \text { convex and } B \supset A} B \, . \end{aligned}$$

For \(\varepsilon > 0\), we say that a subset A of a normed vector space \({\mathcal {Y}}\) is \(\varepsilon \)-convex in \(({\mathcal {Y}},\Vert \cdot \Vert _{{\mathcal {Y}}})\), if

$$\begin{aligned} \text {co}(A) \subset A + B_\varepsilon (0) \, . \end{aligned}$$

Hence, the notion of \(\varepsilon \)-convexity asks whether the convex hull of a set is contained in an enlargement of this set. Note that if \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)\) is dense in \(C(\Omega )\), then its closure is trivially \(\varepsilon \)-convex for all \(\varepsilon > 0\). Our main result regarding the \(\varepsilon \)-convexity of neural network sets shows that this is the only case in which \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^\Omega (S)}\) is \(\varepsilon \)-convex for any \(\varepsilon > 0\).

Theorem 2.2

Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L \ge 2\), and let \(\Omega \subset \mathbb {R}^d\) be compact. Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous but not a polynomial, and such that \(\varrho '(x_0) \ne 0\) for some \(x_0 \in \mathbb {R}\).

Assume that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is not dense in \(C(\Omega )\). Then there does not exist any \(\varepsilon > 0\) such that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is \(\varepsilon \)-convex in \(\big ( C(\Omega ), \Vert \cdot \Vert _{\sup } \big )\).

Remark

All closures in the theorem are taken in \(C(\Omega )\).

Proof

The proof of this theorem is the subject of “Appendix C.2”. It is based on showing that if \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is \(\varepsilon \)-convex for some \(\varepsilon > 0\), then in fact \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) is convex, which we then use to show that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)}\) contains all realizations of two-layer neural networks with activation function \(\varrho \). As shown in [45], this implies that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is dense in \(C(\Omega )\), since \(\varrho \) is not a polynomial. \(\square \)

Remark 2.3

While it is certainly natural to expect that \(\overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \ne C(\Omega )\) should hold for most activation functions \(\varrho \), giving a reference including large classes of activation functions such that the claim holds is not straightforward. We study this problem more closely in “Appendix C.3”.

To be more precise, from Proposition C.10 it follows that the ReLU, the parametric ReLU, the exponential linear unit, the softsign, the sigmoid, and the \(\tanh \) yield realization sets \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) which are not dense in \(L^p(\Omega )\) and in \(C(\Omega )\).

The only activation functions listed in Table 1 for which we do not know whether any of the realization sets \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is dense in \(L^p(\Omega )\) or in \(C(\Omega )\) are: the inverse square root linear unit, the inverse square root unit, the softplus, and the \(\arctan \) function. Of course, we expect that also for these activation functions, the resulting sets of realizations are never dense in \(L^p(\Omega )\) or in \(C(\Omega )\).

Finally, we would like to mention that if \(\Omega \subset \mathbb {R}^d\) has non-empty interior and if the input dimension satisfies \(d \ge 2\), then it follows from the results in [48] that if \(S = (d, N_1, 1)\) is a two-layer architecture, then \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) is not dense in \(C(\Omega )\) or \(L^p(\Omega )\).

(Non-)Closedness of the Set of Realizations

Let \(\varnothing \ne \Omega \subset \mathbb {R}^d\) be compact with non-empty interior. In the present section, we analyze whether the neural network realization set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) with \(S=(d,N_1,\dots ,N_{L-1},1)\) is closed in \(C(\Omega )\), or in \(L^p (\mu )\), for \(p \in (0, \infty )\) and any measure \(\mu \) satisfying a mild "non-atomicness" condition. For the \(L^p\)-spaces, the answer is simple: Under very mild assumptions on the activation function \(\varrho \), we will see that \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is never closed in \(L^p (\mu )\). In particular, this holds for all of the activation functions listed in Table 1. Closedness in \(C(\Omega )\), however, is more subtle: For this setting, we will identify several different classes of activation functions for which the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is not closed in \(C(\Omega )\). As we will see, these classes of activation functions cover all those functions listed in Table 1, except for the ReLU and the parametric ReLU. For these two activation functions, we were unable to determine whether the set \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is closed in \(C(\Omega )\) in general, but we conjecture this to be true. Only in the case \(L = 2\) could we show that these sets are indeed closed.

Closedness of \(\mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\) is a highly desirable property as we will demonstrate in Sect. 3.3. Indeed, we establish that if \(X = L^p(\mu )\) or \(X = C(\Omega )\), then, for all functions \(f \in X\) that do not possess a best approximation within \({\mathcal {R}} = \mathcal {R}\mathcal {N}\mathcal {N}^\Omega _\varrho (S)\), the weights of approximating networks necessarily explode. In other words, if \((\mathrm {R}_{\varrho }^{\Omega }(\Phi _n))_{n \in \mathbb {N}} \subset {\mathcal {R}}\) is such that \(\Vert f - \mathrm {R}_{\varrho }^{\Omega }(\Phi _n)\Vert _{X}\) converges to \(\inf _{g\in {\mathcal {R}}}\Vert f-g \Vert _{X}\) for \(n \rightarrow \infty \), then \(\Vert \Phi _n\Vert _{\mathrm {total}} \rightarrow \infty \). Such functions without a best approximation in \({\mathcal {R}}\) necessarily exist if \({\mathcal {R}}\) is not closed. Moreover, even in practical applications, where empirical error terms instead of \(L^p(\mu )\) norms are minimized, the absence of closedness implies exploding weights as we show in Proposition 3.6.

Finally, we note that for simplicity, all “non-closedness” results in this section are formulated for compact rectangles of the form \(\Omega = [-B,B]^d\) only; but our arguments easily generalize to any compact set \(\Omega \subset \mathbb {R}^d\) with non-empty interior.

Non-closedness in \(L^p(\mu )\)

We start by examining the closedness with respect to \(L^p\)-norms for \(p \in (0, \infty )\). In fact, for all \(B > 0\) and all widely used activation functions (including all activation functions presented in Table 1), the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)\) is not closed in \(L^p(\mu )\), for any \(p \in (0, \infty )\) and any “sufficiently non-atomic” measure \(\mu \) on \([-B,B]^d\). To be more precise, the following is true:

Theorem 3.1

Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L \in \mathbb {N}_{\ge 2}\). Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be a function satisfying the following conditions:

  1. (i)

    \(\varrho \) is continuous and increasing;

  2. (ii)

    There is some \(x_0 \in \mathbb {R}\) such that \(\varrho \) is differentiable at \(x_0\) with \(\varrho '(x_0) \ne 0\);

  3. (iii)

    There is some \(r > 0\) such that \(\varrho |_{(-\infty , -r) \cup (r, \infty )}\) is differentiable;

  4. (iv)

    At least one of the following two conditions is satisfied:

    1. (a)

      There are \(\lambda , \lambda ' \ge 0\) with \(\lambda \ne \lambda '\) such that \(\varrho '(x) \rightarrow \lambda \) as \(x \rightarrow \infty \), and \(\varrho '(x) \rightarrow \lambda '\) as \(x \rightarrow -\infty \), and we have \(N_{L-1} \ge 2\).

    2. (b)

      \(\varrho \) is bounded.

Finally, let \(B > 0\) and let \(\mu \) be a finite Borel measure on \([-B, B]^d\) for which the support \(\hbox {supp}\mu \) is uncountable. Then the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)\) is not closed in \(L^p(\mu )\) for any \(p \in (0, \infty )\). More precisely, there is a function \(f \in L^\infty (\mu )\) which satisfies \(f \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)} \setminus \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{[-B,B]^d}(S)\) for all \(p \in (0,\infty )\), where the closure is taken in \(L^p(\mu )\).

Remark

If \(\hbox {supp}\mu \) is countable, then \(\mu = \sum _{x \in \hbox {supp}\mu } \mu (\{ x \}) \, \delta _x\) is a countable sum of Dirac measures, meaning that \(\mu \) is purely atomic. In particular, if \(\mu \) is non-atomic (meaning that \(\mu (\{ x \}) = 0\) for all \(x \in [-B,B]^d\)), then \(\hbox {supp}\mu \) is uncountable and the theorem is applicable.

Proof

For the proof of the theorem, we refer to “Appendix D.1”. The main proof idea consists in the approximation of a (discontinuous) step function which cannot be represented by a neural network with continuous activation function. \(\square \)
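The proof idea can be illustrated numerically. The following minimal Python sketch (our illustration, not part of the proof) uses the sigmoid activation: the one-neuron realizations \(x \mapsto \varrho (nx)\) converge in \(L^p\) to the discontinuous step function \(\mathbb {1}_{x > 0}\), which cannot be the realization of any network with continuous activation function, while the sup-norm error stays bounded away from zero, consistent with the closedness question in \(C(\Omega )\) being more subtle.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + exp(-x)), written via tanh for numerical stability
    return 0.5 * (1.0 + np.tanh(0.5 * x))

# One-neuron "networks" f_n(x) = sigmoid(n * x) on Omega = [-1, 1].
x = np.linspace(-1.0, 1.0, 20001)
step = (x > 0).astype(float)  # the discontinuous limit function

def lp_error(n, p=2):
    # Riemann-sum approximation of the L^p error on [-1, 1]
    fn = sigmoid(n * x)
    return (np.mean(np.abs(fn - step) ** p) * 2.0) ** (1.0 / p)

errors = [lp_error(n) for n in (1, 10, 100, 1000)]
print(errors)  # decreasing toward 0: L^2 convergence to the step function

# The sup-norm error does NOT tend to 0 (it stays ~1/2 near x = 0):
sup_err = np.max(np.abs(sigmoid(1000 * x) - step))
print(sup_err)
```

Note that the inner weight \(n\) tends to infinity along the approximating sequence, in line with the exploding-weights discussion in Sect. 3.3.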

Corollary 3.2

The assumptions concerning the activation function \(\varrho \) in Theorem 3.1 are satisfied for all of the activation functions listed in Table 1. In any case where \(\varrho \) is bounded, one can take \(N_{L-1} = 1\); otherwise, one can take \(N_{L-1} = 2\).

Proof

For a proof of this statement, we refer to “Appendix D.2”. \(\square \)

Non-closedness in \(C([-B,B]^d)\) for Many Widely Used Activation Functions

We have seen in Theorem 3.1 that under reasonably mild assumptions on the activation function \(\varrho \)—which are satisfied for all commonly used activation functions—the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in any \(L^p\)-space where \(p \in (0, \infty )\). However, the argument of the proof of Theorem 3.1 breaks down if one considers closedness with respect to the \(\Vert \cdot \Vert _{\sup }\)-norm. Therefore, we will analyze this setting more closely in this section. More precisely, in Theorem 3.3, we present several criteria regarding the activation function \(\varrho \) which imply that the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \(C ([-B,B]^d)\). We remark that in all these results, \(\varrho \) will be assumed to be at least \(C^1\). Developing similar criteria for non-differentiable functions is an interesting topic for future research.

Before we formulate Theorem 3.3, we need the following notion: We say that a function \(f: \mathbb {R}\rightarrow \mathbb {R}\) is approximately homogeneous of order \((r,q) \in \mathbb {N}_0^2\) if there exists \(s > 0\) such that \(|f(x) - x^r| \le s\) for all \(x \ge 0\) and \(|f(x) - x^q| \le s\) for all \(x \le 0\). Now the following theorem holds:

Theorem 3.3

Let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture with \(L\in \mathbb {N}_{\ge 2}\), let \(B > 0\), and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\). Assume that at least one of the following three conditions is satisfied:

  1. (i)

    \(N_{L-1}\ge 2\) and \(\varrho \in C^1(\mathbb {R}) \setminus C^\infty (\mathbb {R}).\)

  2. (ii)

    \(N_{L-1}\ge 2\) and \(\varrho \) is bounded, analytic, and not constant.

  3. (iii)

    \(\varrho \) is approximately homogeneous of order \((r,q)\) for certain \(r,q \in \mathbb {N}_0\) with \(r \ne q\), and \(\varrho \in C^{\max \{r,q\}}(\mathbb {R})\).

Then the set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in the space \(C([-B,B]^d)\).

Proof

For the proof of the statement, we refer to “Appendix D.3”. In particular, the proof of the statement under Condition (i) can be found in “Appendix D.3.1”. Its main idea consists of the uniform approximation of \(\varrho '\) (which cannot be represented by neural networks with activation function \(\varrho \), due to its lack of sufficient regularity) by neural networks. For the proof of the statement under Condition (ii), we refer to “Appendix D.3.2”. The main proof idea consists in the uniform approximation of an unbounded analytic function which cannot be represented by a neural network with activation function \(\varrho \), since \(\varrho \) itself is bounded. Finally, the proof of the statement under Condition (iii) can be found in “Appendix D.3.3”. Its main idea consists in the approximation of the function \(x\mapsto x^{\max \{r,q\}}_+\), which does not belong to \(C^{\max \{r,q\}}(\mathbb {R})\). \(\square \)
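The mechanism behind Condition (i) can be sketched concretely. The following Python illustration (ours; the choice of the exponential linear unit, which is \(C^1\) but not \(C^2\), is an assumption for the sketch) shows that the difference quotients \(h^{-1}(\varrho (x+h) - \varrho (x))\) are realizable with two hidden neurons and converge uniformly to \(\varrho '\), which has a corner at \(0\) and hence cannot itself be the realization of a network with \(C^1\) activation; note also that the output weights \(\pm 1/h\) blow up as \(h \rightarrow 0\).

```python
import numpy as np

def elu(x):
    # ELU with alpha = 1: C^1 at 0 (both one-sided derivatives are 1),
    # but not C^2 (the second derivative jumps from 1 to 0 at x = 0)
    return np.where(x > 0, x, np.expm1(x))

def elu_prime(x):
    return np.where(x > 0, 1.0, np.exp(x))

x = np.linspace(-2.0, 2.0, 4001)

def diff_quotient(h):
    # (elu(x + h) - elu(x)) / h: realization of a network with two hidden
    # neurons (inner biases h and 0) and output weights 1/h and -1/h
    return (elu(x + h) - elu(x)) / h

sup_errors = [np.max(np.abs(diff_quotient(h) - elu_prime(x)))
              for h in (1.0, 0.1, 0.01)]
print(sup_errors)  # uniform error tends to 0 as h -> 0
```

The limit \(\varrho '\) has a corner at \(0\) (left derivative \(1\), right derivative \(0\)), so it lies in the closure of the realization set but not in the set itself.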

Corollary 3.4

Theorem 3.3 applies to all activation functions listed in Table 1, except for the ReLU and the parametric ReLU. To be more precise,

  1. (1)

    Condition (i) is fulfilled by the function \(x\mapsto \max \{0, x \}^k\) for \(k \ge 2\), and by the exponential linear unit, the softsign function, and the inverse square root linear unit.

  2. (2)

    Condition (ii) is fulfilled by the inverse square root unit, the sigmoid function, the \(\tanh \) function, and the \(\arctan \) function.

  3. (3)

    Condition (iii) (with \(r = 1\) and \(q = 0\)) is fulfilled by the softplus function.

Proof

For the proof of this statement, we refer to “Appendix D.4”. In particular, for the proof of (1), we refer to “Appendix D.4.1”; the proof of (2) is clear; and for the proof of (3), we refer to “Appendix D.4.2”. \(\square \)
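The softplus case in (3) can be verified numerically. The following Python sketch (our check, not part of the proof) confirms that \(\varrho (x) = \ln (1 + e^x)\) is approximately homogeneous of order \((1,0)\) with \(s = 1\): indeed, \(|\varrho (x) - x| \le \ln 2\) for \(x \ge 0\), and \(|\varrho (x) - x^0| = |\varrho (x) - 1| < 1\) for \(x \le 0\).

```python
import numpy as np

def softplus(x):
    # numerically stable evaluation of log(1 + exp(x))
    return np.logaddexp(0.0, x)

xs_pos = np.linspace(0.0, 20.0, 5001)
xs_neg = np.linspace(-20.0, 0.0, 5001)

# approximate homogeneity of order (r, q) = (1, 0) with s = 1:
dev_pos = np.max(np.abs(softplus(xs_pos) - xs_pos))      # maximal deviation from x
dev_neg = np.max(np.abs(softplus(xs_neg) - xs_neg**0))   # maximal deviation from x^0 = 1
print(dev_pos, dev_neg)
```

The deviation on \([0, \infty )\) is maximized at \(x = 0\), where it equals \(\ln 2\); on \((-\infty , 0]\) it approaches, but never reaches, \(1\).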

The Phenomenon of Exploding Weights

We have just seen that the realization set \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \(L^p(\mu )\) for any \(p \in (0,\infty )\) and every practically relevant activation function. Furthermore, for a variety of activation functions, we have seen that \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \(C([-B,B]^d)\). The situation is substantially different if the weights are taken from a compact subset:

Proposition 3.5

Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(\Omega \subset \mathbb {R}^d\) be compact, and let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be continuous. For \(C > 0\), let

$$\begin{aligned} \Theta _C := \big \{ \Phi \in \mathcal {NN}(S):\Vert \Phi \Vert _{\mathrm {total}}\le C \big \}. \end{aligned}$$

Then the set \(\mathrm {R}_{\varrho }^{\Omega }(\Theta _C)\) is compact in \(C(\Omega )\) as well as in \(L^p(\mu )\), for any finite Borel measure \(\mu \) on \(\Omega \) and any \(p \in (0, \infty )\).

Proof

The proof of this statement is based on the continuity of the realization map and can be found in “Appendix D.5”. \(\square \)

Proposition 3.5 helps to explain the phenomenon of exploding network weights that is sometimes observed during the training of neural networks. Indeed, let us assume that \({\mathcal {R}} := \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{[-B,B]^d}(S)\) is not closed in \({\mathcal {Y}}\), where \({\mathcal {Y}} {:}{=}L^p (\mu )\) for a Borel measure \(\mu \) on \([-B,B]^d\), or \({\mathcal {Y}} {:}{=}C([-B,B]^d)\); as seen in Sects. 3.1 and 3.2, this is true under mild assumptions on \(\varrho \). Then, it follows that there exists a function \(f \in {\mathcal {Y}}\) which does not have a best approximation in \({\mathcal {R}}\), meaning that there does not exist any \(g \in {\mathcal {R}}\) satisfying

$$\begin{aligned} \Vert f - g \Vert _{{\mathcal {Y}}} = \inf _{h \in {\mathcal {R}}} \Vert f - h \Vert _{{\mathcal {Y}}} {=}{:}M \, ; \end{aligned}$$

in fact, one can take any \(f \in \overline{{\mathcal {R}}} \setminus {\mathcal {R}}\). Next, recall from Proposition 3.5 that the subset of \({\mathcal {R}}\) that contains only realizations of networks with uniformly bounded weights is compact.

Hence, we conclude the following: For every sequence

$$\begin{aligned} (f_n)_{n \in \mathbb {N}} = \big ( \mathrm {R}_\varrho ^{[-B,B]^d} (\Phi _n) \big )_{n \in \mathbb {N}} \subset {\mathcal {R}} \end{aligned}$$

satisfying \(\Vert f - f_n \Vert _{{\mathcal {Y}}} \rightarrow M\), we must have \(\Vert \Phi _n\Vert _{\mathrm {total}} \rightarrow \infty \). Indeed, if this were false, then after passing to a subsequence we could assume \(\Vert \Phi _n\Vert _{\mathrm {total}} \le C\) for some \(C > 0\); by compactness, \((f_n)_{n \in \mathbb {N}}\) would then have a further subsequence converging in \({\mathcal {Y}}\) to some \(h \in \mathrm {R}_{\varrho }^{[-B,B]^d}(\Theta _C) \subset {\mathcal {R}}\) with \(\Vert f - h \Vert _{{\mathcal {Y}}} = M\), so that \(h\) would be a best approximation to \(f\) in \({\mathcal {R}}\), a contradiction. In other words, the weights of the networks \(\Phi _n\) necessarily explode.
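This phenomenon is easy to observe numerically. In the following Python sketch (our illustration; the choice of softplus activation and of the ReLU as target is ours), the target \(\mathrm {ReLU}\) lies in the closure of the softplus realizations in the sup norm but, being non-smooth, not in the set itself: the uniform error of \(f_n(x) = n^{-1} \ln (1 + e^{nx})\) equals \(\ln (2)/n \rightarrow 0\), while the inner weight \(n\), and hence \(\Vert \Phi _n \Vert _{\mathrm {total}}\), explodes.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)  # stable log(1 + exp(x))

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-1.0, 1.0, 4001)

# f_n(x) = softplus(n*x) / n is realized by a one-hidden-neuron softplus
# network with inner weight n and outer weight 1/n.  The sup error on
# [-1, 1] is exactly log(2)/n (attained at x = 0), so f_n -> ReLU
# uniformly while the weight n explodes.
results = []
for n in (1, 10, 100, 1000):
    sup_err = np.max(np.abs(softplus(n * x) / n - relu(x)))
    results.append((n, sup_err))
    print(n, sup_err)
```

No softplus network of this architecture realizes the ReLU exactly, since every such realization is smooth; so the infimal error \(0\) is approached but never attained, with exploding weights, exactly as described above.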

The argument above only deals with the approximation problem in the space \(C([-B,B]^d)\) or in \(L^p(\mu )\) for \(p \in (0, \infty )\). In practice, one is often not concerned with these norms, but instead wants to minimize an empirical loss function over \({\mathcal {R}}\). For the empirical square loss, this loss function takes the form

$$\begin{aligned} E_N(f) {:}{=}\frac{1}{N} \sum _{i = 1}^N |f(x_i) - y_i|^2, \end{aligned}$$

for \(\big ( (x_i, y_i) \big )_{i=1}^N \subset \Omega \times \mathbb {R}\) drawn i.i.d. according to a probability distribution \(\sigma \) on \(\Omega \times \mathbb {R}\). By the strong law of large numbers, for each fixed measurable function f, the empirical loss function converges almost surely to the expected loss

$$\begin{aligned} {\mathcal {E}}_\sigma (f) {:}{=}\int _{\Omega \times \mathbb {R}} \left| f(x) - y\right| ^2 \mathrm{{d}} \sigma (x,y). \end{aligned}$$
(3.1)

This expected loss is related to an \(L^2\) minimization problem. Indeed, [20, Proposition 1] shows that there is a measurable function \(f_\sigma : \Omega \rightarrow \mathbb {R}\)—called the regression function—such that the expected risk from Eq. (3.1) satisfies

$$\begin{aligned} {\mathcal {E}}_\sigma (f) = {\mathcal {E}}_\sigma (f_\sigma ) + \int _{\Omega } \left| f(x) - f_\sigma (x) \right| ^2 \mathrm{{d}} \sigma _\Omega (x) \text { for each measurable } f : \Omega \rightarrow \mathbb {R}.\nonumber \\ \end{aligned}$$
(3.2)

Here, \(\sigma _\Omega \) is the marginal probability distribution of \(\sigma \) on \(\Omega \), and \({\mathcal {E}}_\sigma (f_\sigma )\) is called the Bayes risk; it is the minimal expected loss achievable by any (measurable) function.
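The decomposition in Eq. (3.2) can be observed in a small simulation. In the following Python sketch (our toy model, not from the paper: \(X\) uniform on \([-1,1]\), \(Y = \sin (\pi X) + \varepsilon \) with centered Gaussian noise of variance \(0.01\); note the Gaussian noise is unbounded, so this is only an illustration of the law-of-large-numbers step), the regression function is \(f_\sigma (x) = \sin (\pi x)\), the Bayes risk is \(0.01\), and for the candidate \(f = 0\) one has \({\mathcal {E}}_\sigma (f) = 0.01 + \Vert f - f_\sigma \Vert _{L^2(\sigma _\Omega )}^2 = 0.01 + 0.5 = 0.51\), which the empirical loss \(E_N(f)\) approaches as \(N\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X ~ Uniform[-1, 1], Y = sin(pi*X) + noise, noise ~ N(0, 0.1^2).
# Regression function: f_sigma(x) = sin(pi*x); Bayes risk: 0.01.
def empirical_loss(N):
    x = rng.uniform(-1.0, 1.0, N)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, N)
    f = np.zeros(N)                       # candidate function f = 0
    return np.mean((f - y) ** 2)          # E_N(f)

losses = [empirical_loss(N) for N in (100, 10000, 1000000)]
print(losses)  # approaches E_sigma(f) = 0.51 as N grows
```

Here \(\Vert 0 - \sin (\pi \, \cdot )\Vert _{L^2(\sigma _\Omega )}^2 = \mathbb {E}[\sin ^2(\pi X)] = 1/2\) for \(X\) uniform on \([-1,1]\).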

In this context of a statistical learning problem, we have the following result regarding exploding weights:

Proposition 3.6

Let \(d \in \mathbb {N}\) and \(B, K > 0\). Let \(\Omega {:}{=}[-B,B]^d\). Moreover, let \(\sigma \) be a Borel probability measure on \(\Omega \times [-K,K]\). Further, let \(S = (d, N_1, \dots , N_{L-1}, 1)\) be a neural network architecture and \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be locally Lipschitz continuous. Assume that the regression function \(f_\sigma \) is such that there does not exist a best approximation of \(f_\sigma \) in \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) with respect to \(\Vert \cdot \Vert _{L^2(\sigma _\Omega )}\). Let \(\big ( (x_i, y_i) \big )_{i \in \mathbb {N}} \overset{\mathrm {i.i.d.}}{\sim } \sigma \); all probabilities below will be with respect to this family of random variables.

If \((\Phi _N)_{N \in \mathbb {N}} \subset \mathcal {NN}(S)\) is a random sequence of neural networks (depending on \(\big ( (x_i, y_i) \big )_{i \in \mathbb {N}}\)) that satisfies

$$\begin{aligned} {\mathbb {P}} \left( E_N \left( \mathrm {R}_\varrho ^{\Omega }(\Phi _N)\right) - \inf _{f \in \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} E_N(f)> \varepsilon \right) \rightarrow 0 \text { as } N \rightarrow \infty \text { for all } \varepsilon > 0 ,\nonumber \\ \end{aligned}$$
(3.3)

then \(\Vert \Phi _N \Vert _{\mathrm {total}} \rightarrow \infty \) in probability as \(N \rightarrow \infty \).

Remark

A compact way of stating Proposition 3.6 is that, if \(f_\sigma \) has no best approximation in \(\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\) with respect to \(\Vert \cdot \Vert _{L^2(\sigma _\Omega )}\), then the weights of the minimizers (or approximate minimizers) of the empirical square loss explode for increasing numbers of samples.

Since \(\sigma \) is unknown in applications, it is indeed possible that \(f_\sigma \) has no best approximation in the set of neural networks. As just one example, this is true if \(\sigma _\Omega \) is any Borel probability measure on \(\Omega \) and if \(\sigma \) is the distribution of \((X, g(X))\), where \(X \sim \sigma _\Omega \) and \(g \in L^2(\sigma _\Omega )\) is bounded and satisfies \(g \in \overline{\mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)} \setminus \mathcal {R}\mathcal {N}\mathcal {N}_\varrho ^{\Omega }(S)\), with the closure taken in \(L^2(\sigma _\Omega )\). The existence of such a function g is guaranteed by Theorem 3.1 if \(\hbox {supp}\sigma _\Omega \) is uncountable.

Proof

For the proof of Proposition 3.6, we refer to “Appendix D.6”. The proof is based on classical arguments of statistical learning theory as given in [20]. \(\square \)

Closedness of ReLU Networks in \(C([-B,B]^d)\)

In this subsection, we analyze the closedness of sets of realizations of neural networks with respect to the ReLU or the parametric ReLU activation function in \(C(\Omega )\), mostly for the case \(\Omega = [-B, B]^d\). We conjecture that the set of (realizations of) ReLU networks of a fixed complexity is closed in \(C(\Omega )\), but were not able to prove such a result in full generality. In two special cases, namely when the networks have only two layers, or when at least the scaling weights are bounded, we can show that the associated set of ReLU realizations is closed in \(C(\Omega )\); see below.

We begin by analyzing the set of realizations with uniformly bounded scaling weights and possibly unbounded biases, before proceeding with the analysis of two-layer ReLU networks.

For \(\Phi = \big ( (A_1,b_1),\dots ,(A_L,b_L) \big ) \in \mathcal {NN}(S)\) satisfying \(\Vert \Phi \Vert _{\mathrm {scaling}} \le C\) for some \(C>0\), we say that the network \(\Phi \) has \(C\)-bounded scaling weights. Note that this does not require the biases \(b_\ell \) of the network to satisfy \(|b_\ell | \le C\).

Our first goal in this subsection is to show that if \(\varrho \) denotes the ReLU, if \(S = (d, N_1,\dots ,N_L)\), if \(C > 0\), and if \(\Omega \subset \mathbb {R}^d\) is measurable and bounded, then the set

$$\begin{aligned} \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S) := \left\{ \mathrm {R}_\varrho ^{\Omega } (\Phi ) \,:\,\Phi \in \mathcal {NN}(S) \text { with } \Vert \Phi \Vert _{\mathrm {scaling}} \le C \right\} \end{aligned}$$

is closed in \(L^p (\mu ; \mathbb {R}^{N_L})\) for arbitrary \(p \in [1, \infty ]\) and, if \(\Omega \) is compact, also in \(C(\Omega ;\mathbb {R}^{N_L})\). Here, and in the remainder of the paper, we use the norm \(\Vert f\Vert _{L^p(\mu ;\mathbb {R}^{N_L})} = \Vert \, |f| \,\Vert _{L^p(\mu )}\) for vector-valued \(L^p\)-spaces; the norm on \(C(\Omega ; \mathbb {R}^{N_L})\) is defined similarly. The difference between the following proposition and Proposition 3.5 is that here, the “shift weights” (the biases) of the networks can be potentially unbounded. Therefore, the resulting set is merely closed, not compact.

Proposition 3.7

Let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture, let \(C > 0\), and let \(\Omega \subset \mathbb {R}^d\) be Borel measurable and bounded. Finally, let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}, x \mapsto \max \{ 0, x \}\) denote the ReLU function.

Then the set \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) is closed in \(L^p (\mu ;\mathbb {R}^{N_L})\) for every \(p \in [1,\infty ]\) and any finite Borel measure \(\mu \) on \(\Omega \). If \(\Omega \) is compact, then \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) is also closed in \(C(\Omega ;\mathbb {R}^{N_L})\).

Remark

In fact, the proof shows that each subset of \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho }^{\Omega ,C}(S)\) which is bounded in \(L^1 (\mu ; \mathbb {R}^{N_L})\) (when \(\mu (\Omega ) > 0\)) is precompact in \(L^p(\mu ; \mathbb {R}^{N_L})\) and in \(C(\Omega ; \mathbb {R}^{N_L})\).

Proof

For the proof of the statement, we refer to “Appendix D.7”. The main idea is to show that for every sequence \((\Phi _n)_{n \in \mathbb {N}} \subset \mathcal {NN}(S)\) of neural networks with uniformly bounded scaling weights and with \(\Vert \mathrm {R}_\varrho ^\Omega (\Phi _n) \Vert _{L^1(\mu )} \le M\), there exist a subsequence \((\Phi _{n_k})_{k\in \mathbb {N}}\) of \((\Phi _n)_{n\in \mathbb {N}}\) and neural networks \(({\widetilde{\Phi }}_{n_k})_{k\in \mathbb {N}}\) with uniformly bounded scaling weights and biases such that \( \mathrm {R}^\Omega _\varrho \big ( {\widetilde{\Phi }}_{n_k} \big ) = \mathrm {R}^\Omega _\varrho \big ( {\Phi }_{n_k} \big ) \). The rest then follows from Proposition 3.5. \(\square \)
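The renormalization step in this proof sketch can be illustrated for a single hidden ReLU neuron. In the following Python sketch (the numbers \(B, C, w, b\) are our hypothetical choices, and the single-neuron setting is a simplification of the argument), a neuron \(x \mapsto \varrho (wx + b)\) with \(|w| \le C\) on \(\Omega = [-B, B]\) satisfies \(|wx| \le CB\); hence, if \(b > CB\) the neuron is affine on \(\Omega \), and if \(b < -CB\) it vanishes. Clipping \(b\) to \([-CB, CB]\) and pushing the surplus constant into the next layer's bias therefore leaves the realization on \(\Omega \) unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

B, C = 1.0, 2.0                     # our choices: Omega = [-B, B], |w| <= C
x = np.linspace(-B, B, 1001)

def renormalize(b):
    # clip the bias to [-C*B, C*B]; the surplus constant (nonzero only
    # in the affine regime b > C*B) is absorbed by the next layer's bias
    b_clipped = float(np.clip(b, -C * B, C * B))
    carry = max(b - C * B, 0.0)
    return b_clipped, carry

gaps = []
for w, b in [(1.5, 7.0), (-1.0, -9.0)]:   # huge positive / negative bias
    b_c, carry = renormalize(b)
    gap = np.max(np.abs(relu(w * x + b) - (relu(w * x + b_c) + carry)))
    gaps.append(gap)
    print(w, b, b_c, carry, gap)          # gap vanishes on Omega
```

This is the sense in which a sequence with bounded scaling weights can be replaced by one with bounded scaling weights *and* biases, after which Proposition 3.5 applies.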

As our second result in this section, we show that the set of realizations of two-layer neural networks with arbitrary scaling weights and biases is closed in \(C([-B,B]^d),\) if the activation is the parametric ReLU. It is a fascinating question for further research whether this also holds for deeper networks.

Theorem 3.8

Let \(d, N_0 \in \mathbb {N},\) let \(B > 0\), and let \(a \ge 0\). Let \(\varrho _a:\mathbb {R}\rightarrow \mathbb {R}, x \mapsto \max \{x, ax \}\) be the parametric ReLU. Then \(\mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1))\) is closed in \(C([-B,B]^d)\).

Proof

For the proof of the statement, we refer to “Appendix D.8”; here we only sketch the main idea: First, note that each \(f \in \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1))\) is of the form \(f(x) = c + \sum _{i=1}^{N_0} \varrho _a (\langle \alpha _i , x \rangle + \beta _i)\). The proof is based on a careful—and quite technical—analysis of the singularity hyperplanes of the functions \(\varrho _a (\langle \alpha _i, x \rangle + \beta _i)\), that is, the hyperplanes \(\langle \alpha _i, x \rangle + \beta _i = 0\) on which these functions are not differentiable. More precisely, given a uniformly convergent sequence \((f_n)_{n \in \mathbb {N}} \subset \mathcal {R}\mathcal {N}\mathcal {N}_{\varrho _a}^{[-B,B]^d}((d,N_0,1))\), we analyze how the singularity hyperplanes of the functions \(f_n\) behave as \(n \rightarrow \infty \), in order to show that the limit is again of the same form as the \(f_n\). For more details, we refer to the actual proof. \(\square \)

Failure of Inverse Stability of the Realization Map

In this section, we study the properties of the realization map \(\mathrm {R}^{\Omega }_{\varrho }\). First of all, we observe that the realization map is continuous.

Proposition 4.1

Let \(\Omega \subset \mathbb {R}^d\) be compact and let \(S = (d, N_1, \dots , N_L)\) be a neural network architecture. If the activation function \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) is continuous, then the realization map from Eq. (1.1) is continuous. If \(\varrho \) is locally Lipschitz continuous, then so is \(\mathrm {R}^{\Omega }_{\varrho }\).

Finally, if \(\varrho \) is globally Lipschitz continuous, then there is a constant \(C = C(\varrho , S) > 0\) such that

$$\begin{aligned} \mathrm {Lip}\big (\mathrm {R}^{\Omega }_{\varrho } (\Phi )\big ) \le C \cdot \Vert \Phi \Vert _{\mathrm {scaling}}^L \qquad \text { for all } \Phi \in \mathcal {NN}(S) \, . \end{aligned}$$

Proof

For the proof of this statement, we refer to “Appendix E.1”. \(\square \)
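For a globally Lipschitz activation such as the ReLU (which is 1-Lipschitz), the final estimate can be spot-checked numerically via the standard product-of-operator-norms bound: the Lipschitz constant of the realization is at most \(\prod _\ell \Vert A_\ell \Vert \), which is in turn controlled by \(C \cdot \Vert \Phi \Vert _{\mathrm {scaling}}^L\). The following Python sketch (our toy network; architecture and seed are arbitrary choices) compares an empirical Lipschitz estimate with this bound.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Random network with architecture S = (1, 4, 4, 1), i.e. L = 3 layers
weights = [rng.normal(size=(4, 1)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
biases = [rng.normal(size=(4, 1)), rng.normal(size=(4, 1)), rng.normal(size=(1, 1))]

def realize(x):
    z = x.reshape(1, -1)
    for A, b in zip(weights[:-1], biases[:-1]):
        z = relu(A @ z + b)             # hidden layers with ReLU
    return (weights[-1] @ z + biases[-1]).ravel()   # affine output layer

x = np.linspace(-1.0, 1.0, 10001)
y = realize(x)
lip_emp = np.max(np.abs(np.diff(y)) / np.diff(x))   # empirical Lipschitz estimate
bound = np.prod([np.linalg.norm(A, 2) for A in weights])  # product of spectral norms
print(lip_emp, bound)
```

Since the ReLU is 1-Lipschitz, the empirical estimate can never exceed the product of the layer operator norms.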

In general, the realization map is not injective; that is, there can be networks \(\Phi \ne \Psi \) such that \(\mathrm {R}^{\Omega }_{\varrho } (\Phi ) = \mathrm {R}^{\Omega }_{\varrho } (\Psi )\). For instance, if

$$\begin{aligned} \Phi = \big ( (A_1,b_1), \dots , (A_{L-1}, b_{L-1}), (0,0) \big ), \end{aligned}$$

and

$$\begin{aligned} \Psi = \big ( (B_1, c_1), \dots , (B_{L-1}, c_{L-1}), (0,0) \big ), \end{aligned}$$

then the realizations of \(\Phi ,\Psi \) are identical.

In this section, our main goal is to determine whether, up to the failure of injectivity, the realization map is a homeomorphism onto its range; mathematically, this means that we want to determine whether the realization map is a quotient map. We will see that this is not the case.

To this end, we will prove that for a fixed network \(\Phi \), even if \(\mathrm {R}^{\Omega }_{\varrho } (\Psi )\) is very close to \(\mathrm {R}^{\Omega }_{\varrho }(\Phi )\), it is in general not true that \( \mathrm {R}^{\Omega }_{\varrho } (\Psi ) = \mathrm {R}^{\Omega }_{\varrho } ({\widetilde{\Psi }}) \) for some network \({\widetilde{\Psi }}\) whose weights are close to those of \(\Phi \). Precisely, this follows from the following theorem applied with \(\Phi = 0\) and \(\Psi = \Phi _n\).

Theorem 4.2

Let \(\varrho : \mathbb {R}\rightarrow \mathbb {R}\) be Lipschitz continuous, but not affine-linear. Let \(S = (N_0,\dots ,N_{L-1},1)\) be a network architecture with \(L \ge 2\), with \(N_0 = d\), and \(N_1 \ge 3\). Let \(\Omega \subset \mathbb {R}^d\) be bounded with nonempty interior.

Then there is a sequence \((\Phi _n)_{n \in \mathbb {N}}\) of networks with architecture S and the following properties:

  1. 1.

    We have \(\mathrm {R}^{\Omega }_{\varrho } (\Phi _n) \rightarrow 0\) uniformly on \(\Omega \).

  2. 2.

    We have \(\mathrm {Lip}(\mathrm {R}^{\Omega }_{\varrho }(\Phi _n)) \rightarrow \infty \) as \(n \rightarrow \infty \).

Finally, if \((\Phi _n)_{n \in \mathbb {N}}\) is a sequence of networks with architecture S and the preceding two properties, then the following holds: For each sequence of networks \((\Psi _n)_{n \in \mathbb {N}}\) with architecture S and \(\mathrm {R}^{\Omega }_{\varrho } (\Psi _n) = \mathrm {R}^{\Omega }_{\varrho } (\Phi _n)\), we have \(\Vert \Psi _n\Vert _{\mathrm {scaling}} \rightarrow \infty \).

Proof

For the proof of the statement, we refer to “Appendix E.2”. The proof is based on the fact that the Lipschitz constant of the realization of a network essentially yields a lower bound on the \(\Vert \cdot \Vert _{\mathrm {scaling}}\) norm of every neural network with this realization. We construct neural networks \(\Phi _n\) the realizations of which have small amplitude but high Lipschitz constants. The associated realizations uniformly converge to 0, but every associated neural network must have exploding weights. \(\square \)
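For the ReLU, a possible instance of such a sequence (our construction; the proof in Appendix E.2 may use a different one) is a sequence of narrow hat functions realized with three hidden neurons, which is consistent with the assumption \(N_1 \ge 3\): the hats have height \(n^{-1/2}\) and slope \(\pm \sqrt{n}\), so their sup norms tend to \(0\) while their Lipschitz constants diverge. The following Python sketch checks both properties numerically.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hat functions realized with three hidden ReLU neurons (d = 1, N_1 = 3):
#   f_n(x) = n**-0.5 * (relu(n*x) - 2*relu(n*x - 1) + relu(n*x - 2)),
# a hat of height n**-0.5 supported on [0, 2/n] with slopes +-sqrt(n).
x = np.linspace(-1.0, 1.0, 200001)

results = []
for n in (4, 100, 10000):
    t = n * x
    fn = (relu(t) - 2.0 * relu(t - 1.0) + relu(t - 2.0)) / np.sqrt(n)
    sup_norm = np.max(np.abs(fn))                       # = n**-0.5 -> 0
    lip_est = np.max(np.abs(np.diff(fn)) / np.diff(x))  # = sqrt(n) -> infinity
    results.append((n, sup_norm, lip_est))
    print(n, sup_norm, lip_est)
```

Since the Lipschitz constant of a realization essentially lower-bounds the \(\Vert \cdot \Vert _{\mathrm {scaling}}\) norm of every network with that realization, any networks \(\Psi _n\) realizing these functions must have exploding scaling weights, as stated in the theorem.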

We finally rephrase the preceding result in more topological terms:

Corollary 4.3

Under the assumptions of Theorem 4.2, the realization map \(\mathrm {R}^{\Omega }_{\varrho }\) from Eq. (1.1) is not a quotient map when considered as a map onto its range.

Proof

For the proof of the statement, we refer to “Appendix E.3”. \(\square \)