The Universal Approximation Property

The universal approximation property of various machine learning models is currently only understood on a case-by-case basis, limiting the rapid development of new theoretically justified neural network architectures and blurring our understanding of our current models' potential. This paper works towards overcoming these challenges by presenting a characterization, a representation, a construction method, and an existence result, each of which applies to any universal approximator on most function spaces of practical interest. Our characterization result is used to describe which activation functions allow the feed-forward architecture to maintain its universal approximation capabilities when multiple constraints are imposed on its final layers and its remaining layers are only sparsely connected. These include a rescaled and shifted Leaky ReLU activation function but not the ReLU activation function. Our construction and representation results are used to exhibit a simple modification of the feed-forward architecture which can approximate any continuous function with non-pathological growth, uniformly on the entire Euclidean input space. This improves on the known capabilities of the feed-forward architecture.


Introduction
Neural networks have their organic origins in [1] and in [2], wherein the authors pioneered a method for emulating the behavior of the human brain using digital computing. Their mathematical roots trace back to Hilbert's 13th problem, which asked whether all high-dimensional continuous functions can be expressed as a combination of univariate continuous functions. Arguably, the second major wave of innovation in the theory of neural networks followed the universal approximation theorems of [3,4] and of [5], which merged these two seemingly unrelated problems by demonstrating that the feed-forward architecture is capable of approximating any continuous function between any two Euclidean spaces, uniformly on compacts. This series of papers initiated the theoretical justification of the empirically observed performance of neural networks, which had up until that point only been justified by analogy with the Kolmogorov-Arnold Representation Theorem of [6].
Since then, the universal approximation capabilities of a limited number of neural network architectures, such as feed-forward, residual, and convolutional neural networks, have been solidified as a cornerstone of their approximation success. This, coupled with numerous hardware advances, has led neural networks to find ubiquitous use in a number of areas, ranging from biology, see [7,8], to computer vision and imaging, see [9,10], and to mathematical finance, see [11][12][13][14][15]. As a result, a variety of neural network architectures have emerged, with the common thread between them being that they describe an algorithmically generated set of complicated functions built by combining elementary functions in some manner.
However, the case-by-case basis on which the universal approximation property is currently understood limits the rapid development of new theoretically justified architectures. This paper works towards overcoming these challenges by directly studying the universal approximation property itself, in the form of far-reaching characterizations, representations, construction methods, and existence results applicable to most situations encountered in practice.
The paper's contributions are organized as follows. Section 2 overviews the analytic, topological, and learning-theoretic background required to formulate the paper's results. Section 3 contains the paper's main results. These include a characterization, a representation result, a construction theorem, and an existence result applicable to any universal approximator on most function spaces of practical interest. The characterization result shows that an architecture has the universal approximation property (UAP) on a function space if and only if that architecture implicitly decomposes the function space into a collection of separable Banach subspaces, whereon the architecture contains the orbit of a topologically transitive dynamical system. Next, the representation result shows that any universal approximator can always be approximately realized as a transformation of the feed-forward architecture. This result reduces the problem of constructing new universal architectures to that of identifying the correct transformation of the feed-forward architecture for the given learning task. The construction result gives conditions on a set of transformations of the feed-forward architecture guaranteeing that the result is a universal approximator on the target function space. Lastly, we obtain a general existence and representation result for universal approximators generated by a small number of functions, applicable to many function spaces. Section 4 then specializes the main theoretical results to the feed-forward architecture. Our characterization result is used to exhibit a dynamical system representation on the space of continuous functions, obtained by composing any function with an additional deep feed-forward layer whose activation function is continuous, injective, and has no fixed points.
Using this representation, we show that the set of all deep feed-forward networks constructed through this dynamical system maintains its universal approximation property even when constraints are imposed on the network's final layers or when sparsity is imposed on the connections of the network's initial layers. In particular, we show that feed-forward networks with the ReLU activation function fail these requirements, but a simple affine transformation of the Leaky-ReLU activation function is of this type. We provide a simple and explicit method for modifying most commonly used activation functions into this form. We also show that the conditions on the activation function are sharp for this dynamical system representation to have the desired topological transitivity properties.
As an application of our construction and representation results, we build a modification of the feed-forward architecture which can uniformly approximate a large class of continuous functions which need not vanish at infinity. This architecture approximates uniformly on the entire input space, and not only on compact subsets thereof. This refines the known guarantees for feed-forward networks (see [16,17]), which only ensure uniform approximation on compact subsets of the input space and, consequently, for functions vanishing at infinity. As a final application of the results, the existence theorem is used to provide a representation of a small universal approximator on L∞(R), which provides a first concrete step towards obtaining a tractable universal approximator thereon.

Background and preliminaries
This section overviews the analytic, topological, and learning-theoretic background used in this paper.

Metric spaces
Typically, two points x, y ∈ R^m are thought of as being near to one another if y belongs to the open ball of radius δ > 0 centered at x, defined by Ball_{R^m}(x, δ) := {z ∈ R^m : ‖x − z‖ < δ}, where d(x, z) := ‖x − z‖ denotes the Euclidean distance function. The analogue can be said if we replace R^m by a set X on which there is a distance function d_X : X × X → [0, ∞) quantifying the closeness of any two members of X. Many familiar properties of the Euclidean distance function are axiomatically required of d_X in order to maintain many of the useful analytic properties of R^m; namely, d_X is required to satisfy the triangle inequality, to be symmetric in its arguments, and to vanish precisely when its arguments are identical. As before, two points x, y ∈ X are thought of as being close if they belong to the same open ball Ball_X(x, δ) := {z ∈ X : d_X(x, z) < δ}, where δ > 0. Together, the pair (X, d_X) is called a metric space, and this simple structure can be used to describe many familiar constructions prevalent throughout learning theory. We follow the convention of denoting (X, d_X) by X whenever the context is clear.
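These axioms can be spot-checked numerically. The following minimal sketch (the helper names are ours, not from the paper) verifies the metric-space axioms for the Euclidean distance on a finite sample of points and tests open-ball membership:

```python
import itertools
import random

def euclidean(x, y):
    """Euclidean distance d(x, z) = ||x - z|| on R^m (points as tuples)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def in_ball(x, delta, z, d=euclidean):
    """Membership of z in the open ball Ball_X(x, delta) = {z : d(x, z) < delta}."""
    return d(x, z) < delta

def check_metric_axioms(d, points, tol=1e-9):
    """Spot-check the three metric-space axioms on a finite sample of points."""
    for x, y in itertools.product(points, repeat=2):
        assert d(x, y) >= 0                            # non-negativity
        assert abs(d(x, y) - d(y, x)) <= tol           # symmetry
    for x in points:
        assert d(x, x) <= tol                          # d vanishes on the diagonal
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, z) <= d(x, y) + d(y, z) + tol      # triangle inequality
```

Such a check can of course only sample the axioms on finitely many points; it is an illustration of the definitions, not a proof.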
Example 1 (Spaces of Continuous Functions) For instance, the universal approximation theorems of [16][17][18][19] describe conditions under which any continuous function from R^m to R^n can be approximated by a feed-forward neural network. The distance function used to formulate their approximation results is defined, for any two continuous functions f, g : R^m → R^n, by d_ucc(f, g) := Σ_{k=1}^∞ 2^{−k} min{1, sup_{‖x‖≤k} ‖f(x) − g(x)‖}, which metrizes uniform convergence on compact subsets of R^m.
In this way, the set of continuous functions from R^m to R^n, denoted by C(R^m, R^n), is made into a metric space when paired with d_ucc. In what follows, we make the convention of denoting C(X, R) by C(X).
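A metric of this kind can be approximated numerically by truncating the series and estimating each supremum on a finite grid. The sketch below is our own illustration for functions R → R, with hypothetical truncation parameters:

```python
import numpy as np

def d_ucc(f, g, max_k=20, grid=511):
    """Truncated version of
    d_ucc(f, g) = sum_k 2^{-k} min{1, sup_{|x| <= k} |f(x) - g(x)|}
    for f, g : R -> R, with each supremum estimated on a grid over [-k, k]."""
    total = 0.0
    for k in range(1, max_k + 1):
        x = np.linspace(-k, k, grid)
        total += 2.0 ** (-k) * min(1.0, float(np.max(np.abs(f(x) - g(x)))))
    return total
```

The weights 2^{−k} and the cap at 1 ensure the series converges for any pair of continuous functions, while still detecting disagreement on every compact set.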
Example 2 (Space of Integrable Functions) Not all functions encountered in practice are continuous, and the approximation of discontinuous functions by deep feed-forward networks is studied in [20,21] for functions belonging to the space L^p_μ(R^m, R^n). Briefly, elements of L^p_μ(R^m, R^n) are equivalence classes of Borel measurable functions f : R^m → R^n, identified up to μ-null sets, for which the norm ‖f‖_{L^p_μ} := (∫_{R^m} ‖f(x)‖^p μ(dx))^{1/p} is finite; here μ is a fixed Borel measure on R^m and 1 ≤ p < ∞. We follow the convention of denoting L^p_μ(R^m, R) by L^p(R^m) when μ is the Lebesgue measure on R^m.
Spaces of this type simultaneously carry compatible metric and vector space structures. Moreover, if in such a space every sequence whose pairwise distances asymptotically tend to zero converges, then the space is called a Banach space. The prototypical Banach space is R^m.
Unlike Banach spaces or the space of Example 1, general metric spaces are non-linear. That is, there is no meaningful notion of addition, scaling, and there is no singular reference point analogous to the 0 vector. Examples of non-linear metric spaces arising in machine learning are shape spaces used in neuroimaging applications (see [22]), graphs and trees arising in structured and hierarchical learning (see [23,24]), and spaces of probability measures appearing in adversarial approaches to learning (see [25]).
The lack of a reference point may always be overcome by artificially declaring a fixed element of X, denoted by 0_X, to be the central point of reference in X. In this case, the triple (X, d_X, 0_X) is called a pointed metric space. We follow the convention of denoting the triple by X whenever the context is clear. For pointed metric spaces X and Y, the class of functions f : X → Y satisfying f(0_X) = 0_Y and d_Y(f(x_1), f(x_2)) ≤ L d_X(x_1, x_2), for some L > 0 and every x_1, x_2 ∈ X, is denoted by Lip_0(X, Y), and this class is understood as mapping the structure of X into Y without too large a distortion. In the extreme case where an f ∈ Lip_0(X, Y) perfectly respects the structure of X, i.e., when d_Y(f(x_1), f(x_2)) = d_X(x_1, x_2), we call f a pointed isometry. In this case, f(X) represents an exact copy of X within Y.
The remaining non-linear aspects of a general metric space pose no significant challenge, due to the following linearizing feature map of [26]. Since its inception, this method has found notable applications in clustering [27] and in optimal transport [28]. In particular, the latter connects this linearization procedure with the optimal transport approaches to adversarial learning of [29,30].
Example 3 (Free-Space over X) We follow the formulation described in [28]. Let X be a pointed metric space and, for any x ∈ X, let δ_x be the (Borel) probability measure assigning value 1 to any Borel set B ⊆ X if x ∈ B and 0 otherwise. The Free-space over X is the Banach space B(X) obtained by completing the vector space {Σ_{n=1}^N α_n δ_{x_n} : α_n ∈ R, x_n ∈ X, n = 1, . . . , N, N ∈ N_+} with respect to the norm ‖Σ_{n=1}^N α_n δ_{x_n}‖ := sup{Σ_{n=1}^N α_n f(x_n) : f ∈ Lip_0(X, R), Lip(f) ≤ 1}. As shown in [31, Proposition 2.1], the map δ_X : x → δ_x is a (non-linear) isometry from X to B(X). As shown in [32], the pair (B(X), δ_X) is characterized by the following linearization property: whenever f ∈ Lip_0(X, Y) and Y is a Banach space, there exists a unique continuous linear map F : B(X) → Y satisfying F ∘ δ_X = f. Thus, δ_X : X → B(X) can be interpreted as a minimal isometric linearizing feature map.
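The linearization property has a concrete finite-dimensional shadow: on finitely supported elements Σ_n α_n δ_{x_n}, the unique linear extension of f must act by Σ_n α_n f(x_n). A minimal sketch, representing such elements as (weight, point) lists (all names are ours):

```python
def delta(x):
    """The feature map delta_X; delta_x is encoded as a list of (weight, point) pairs."""
    return [(1.0, x)]

def scale(alpha, measure):
    """Scalar multiplication alpha * (sum_n w_n delta_{x_n})."""
    return [(alpha * w, x) for w, x in measure]

def add(m1, m2):
    """Addition of two finitely supported elements."""
    return m1 + m2

def linearize(f):
    """The unique linear map F satisfying F o delta_X = f, evaluated on
    finitely supported elements sum_n alpha_n delta_{x_n} of B(X)."""
    def F(measure):
        return sum(w * f(x) for w, x in measure)
    return F
```

This captures only the dense subspace of finite sums; the actual free space is its completion under the dual-Lipschitz norm above.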
Sometimes the feature map δ_X can be continuously inverted from the left. Following [31], a continuous map β : B(X) → X satisfying β(δ_x) = x for every x ∈ X is called a barycenter, and if a barycenter exists then X is called barycentric. Examples of barycentric spaces are Banach spaces [33], Cartan-Hadamard manifolds (see [34, Corollary 6.9.1]), and other structures described in [35]. Accordingly, many function spaces of potential interest contain a dense barycentric subspace. When the context is clear, we follow the convention of denoting δ_X simply by δ.

Topological Background
Rather than using open balls to quantify closeness, it is often more convenient to work with open subsets of X, where U ⊆ X is said to be open whenever every point x ∈ U belongs to some open ball Ball_X(x, δ) contained in U. This is because open sets have many desirable properties; for example, a convergent sequence contained in the complement of an open set must also have its limit in that open set's complement. Thus, the complements of open sets are often called closed sets, since the limits of sequences within them cannot escape them.
Unfortunately, many familiar situations arising in approximation theory cannot be described by a distance function. For example, there is no distance function describing the point-wise convergence of a sequence of functions {f_n}_{n∈N} on R^m to any other such function f (for details, see [36, page 362]). In these cases, it is more convenient to work directly with topologies. A topology τ is a collection of subsets of a given set X, whose members are declared as being open, provided that τ satisfies certain algebraic conditions emulating the basic properties of the typical open subsets of R^m (see [37, Chapter 2]). Explicitly, we require that τ contain the empty set ∅ as well as the entire space X, that arbitrary unions of members of τ also belong to τ, and that finite intersections of members of τ also belong to τ. A topological space is a pair of a set X and a topology τ thereon. We follow the convention of denoting topological spaces with the same symbol as their underlying set.
Most universal approximation theorems [4,16,17] guarantee that a particular subset of C(R^m, R^n) is dense therein. In general, A ⊆ X is dense if the smallest closed subset of X containing A is X itself. A topological space containing a dense subset which can be put in a one-to-one correspondence with the natural numbers N is called a separable space. Many familiar spaces are separable, such as C(R^m) and R^m.
A function f : R^m → R^n is thought of as continuously depending on its inputs if small variations in its inputs can only produce small variations in its outputs; that is, for every x ∈ R^m and every ε > 0 there exists some δ > 0 such that f(Ball_{R^m}(x, δ)) ⊆ Ball_{R^n}(f(x), ε). It can be shown, see [37], that this condition is equivalent to requiring that the pre-image f^{−1}[U] of any open subset U of R^n is open in R^m. This reformulation means that open sets are preserved under the inverse image of continuous functions, and it lends itself more readily to abstraction. Thus, a function f : X → Y between arbitrary topological spaces X and Y is called continuous if the pre-image under f of every open subset of Y is open in X. If f is a continuous bijection and its inverse function f^{−1} : Y → X is continuous, then f is called a homeomorphism, and X and Y are thought of as being topologically identical. If f is a homeomorphism onto its image, then f is called an embedding.
We illustrate the use of homeomorphisms with a learning-theoretic example. Many learning problems encountered empirically benefit from feature maps modifying the input of a learning model; for example, this is often the case with kernel methods (see [38][39][40]), in reservoir computing (see [41,42]), and in geometric deep learning (see [23,43]). Recently, in [44], it was shown that a feature map φ : X → R^m is continuous and injective if and only if the set of all functions f ∘ φ ∈ C(X), where f ∈ C(R^m) is a deep feed-forward network with ReLU activation, is dense in C(X). A key factor in this characterization is that the map Φ : C(R^m) → C(X), given by f → f ∘ φ, is an embedding whenever φ is continuous and injective.
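The induced map Φ is simple to realize concretely. Below is a sketch, under the illustrative assumption (ours, not from the paper) of the injective feature map φ(t) = (t, t³) from R into R²:

```python
def Phi(phi):
    """The transformation f |-> f o phi induced by a feature map phi : X -> R^m.
    When phi is continuous and injective, this map embeds C(R^m) into C(X)."""
    def transform(f):
        return lambda x: f(phi(x))
    return transform

# A continuous, injective feature map phi : R -> R^2 (an illustrative choice).
phi = lambda t: (t, t ** 3)
```

Any model f on the feature space then pulls back to a model Φ(f) = f ∘ φ on the original input space.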
The above example suggests that our study of an architecture's approximation capabilities is valid on any topological space which can be mapped homeomorphically onto a well-behaved topological space. For us, a space will be well-behaved if it belongs to the broad class of Fréchet spaces. Briefly, these spaces have compatible topological space and vector space structures, meaning that the basic vector space operations such as addition, inversion, and scalar multiplication are continuous; furthermore, their topology is induced by a complete distance function which is invariant under translation and satisfies an additional technical condition described in [45, Section 3.7]. The class of Fréchet spaces encompasses all Hilbert and Banach spaces, and these spaces share many familiar properties with R^m. Relevant examples of Fréchet spaces are C(R^m, R^n), the free-space B(X) over any pointed metric space X, and L^1_μ(R^m, R^n).

Universal approximation background
In the machine learning literature, universal approximation refers to a model class's ability to generically approximate any member of a large topological space whose elements are functions, or more rigorously, equivalence classes of functions. Accordingly, in this paper, we focus on a class of topological spaces which we call function spaces. In this paper, a function space X is a topological space whose elements are equivalence classes of functions between two sets X and Y. For example, when X = R = Y then X may be C(R) or L^p(R). We refer to X as a function space between X and Y, and we omit the dependence on X and Y when it is clear from the context. The elements of X are called functions, whereas functions between sets are referred to as set-functions. By a partial function f : X → Y we mean a binary relation between the sets X and Y which attributes at most one output in Y to each input in X.

Notational Conventions
The following notational conventions are maintained throughout this paper. Only non-empty outputs of any partial function f are specified. We denote the set of positive integers by N_+, and we set N := N_+ ∪ {0}. For any n ∈ N_+, the n-fold Cartesian product of a set A with itself is denoted by A^n. For n ∈ N, we denote the n-fold composition of a function φ : X → X with itself by φ^n, with the 0-fold composition φ^0 defined to be the identity map on X.
Definition 1 (Architecture) Let X be a function space. An architecture on X is a pair (F, ρ) of a set F of set-functions between (possibly different) sets and a partial function ρ : ⋃_{J∈N_+} F^J → X, satisfying the following non-triviality condition: there exist some f ∈ X, J ∈ N_+, and f_1, . . . , f_J ∈ F satisfying f = ρ(f_1, . . . , f_J). (3) The set of all functions f in X for which there exist some J ∈ N_+ and some f_1, . . . , f_J ∈ F satisfying the representation (3) is denoted by NN(F, ρ).
Many familiar structures in machine learning, such as convolutional neural networks, trees, radial basis functions, and various other structures, can be formulated as architectures.
To fix notation and to illustrate the scope of our results we express some familiar machine learning models in the language of Definition 1.
Example 4 (Deep Feed-Forward Networks) Fix an activation function σ : R → R, let F be a set of affine functions between Euclidean spaces (of compatible dimensions), and define the combining rule ρ(f_1, . . . , f_J) := f_J ∘ σ• ∘ f_{J−1} ∘ ⋯ ∘ σ• ∘ f_1, (4) where σ• denotes the componentwise application of σ, whenever the right-hand side of (4) is well-defined. Since the composition of two affine functions is again affine, NN(F, ρ) is the set of deep feed-forward networks from R^m to R^n with activation function σ.
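Reading (4) concretely, the combining rule alternates affine maps with the componentwise activation. A minimal NumPy sketch (with ReLU standing in for a generic continuous σ; the helper names are ours):

```python
import numpy as np

def sigma_bullet(x):
    """Componentwise activation sigma_bullet; ReLU is used purely for illustration."""
    return np.maximum(x, 0.0)

def rho(layers):
    """The combining rule of a feed-forward architecture: given affine maps
    f_j(x) = W_j x + b_j as (W_j, b_j) pairs, return the network
    x |-> f_J(sigma(... sigma(f_1(x)) ...))."""
    def network(x):
        for j, (W, b) in enumerate(layers):
            x = W @ x + b
            if j < len(layers) - 1:      # no activation after the output layer
                x = sigma_bullet(x)
        return x
    return network
```

Here the set F is the collection of (W, b) pairs, and ρ is the composition loop; NN(F, ρ) is the set of all functions obtainable this way.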

Remark 1
The construction of Example 4 parallels the formulation given in [46,47]. However, in [47] elements of F are referred to as neural networks and functions in N N (F , ) are called their realizations.
We are interested in architectures which can generically approximate any function in their associated function space; that is, architectures (F, ρ) for which NN(F, ρ) is dense in X. Paraphrasing [48, page 67], any such architecture is said to have the universal approximation property (UAP) and is called a universal approximator.

Main Results
Our first result provides a correspondence between the a priori algebraic structure of universal approximators on X and decompositions of X into subspaces on which NN(F, ρ) contains the orbit of a topologically transitive dynamical system; the latter structures are a priori of a topological nature. The interchangeability of algebraic and geometric structures is a common theme; notable examples include [49][50][51][52].

Theorem 1 (Characterization: Dynamical Systems Structure of Universal Approximators)
Let X be a function space which is homeomorphic to an infinite-dimensional Fréchet space and let (F, ρ) be an architecture on X. Then, the following are equivalent: (i) (F, ρ) is a universal approximator on X; (ii) there exist a family {X_i}_{i∈I} of separable Banach subspaces of X, continuous maps φ_i : X_i → X_i, and {g_i}_{i∈I} ⊆ NN(F, ρ) such that:

(b) For each i ∈ I and every pair of non-empty open subsets U and V of X_i, there exists some n ∈ N such that φ_i^n(U) ∩ V ≠ ∅;
In particular, {φ_i^n(g_i) : i ∈ I, n ∈ N} is dense in NN(F, ρ).
Theorem 1 describes the structure of universal approximators; however, it does not describe an explicit means of constructing them. Nevertheless, Theorem 1 (ii.a) and (ii.d) suggest that universal approximators on most function spaces can be built by combining multiple non-trivial transformations of universal approximators on C(R^m, R^n).
This type of transformation approach to architecture construction is common in geometric deep learning, whereby non-Euclidean data is mapped to the inputs of familiar architectures defined between R^d and R^D using specific feature maps, and the model's outputs are then returned to the manifold by inverting the feature map. Examples include the hyperbolic feed-forward architecture of [24], the shape space regressors of [53], and the matrix-valued regressors of [54,55], amongst others. This transformation procedure is a particular instance of the following general construction method, which extends [44].
Theorem 2 (Construction: Universal Approximators by Transformation) Let n, m ∈ N_+, let X be a function space, let (F, ρ) be a universal approximator on C(R^m, R^n), and let {Φ_i}_{i∈I} be a non-empty set of continuous functions from C(R^m, R^n) to X satisfying the following condition:

The alternative approach to architecture development, subscribed to by authors such as [56][57][58][59], specifies the elementary functions F and the rule ρ for combining them. Thus, this method explicitly specifies F and implicitly specifies ρ. These competing approaches are in fact equivalent, since every universal approximator is approximately a transformation of the feed-forward architecture on C(R).

Theorem 3 (Representation: Universal Approximators are Transformed Neural Networks)
Let σ be a continuous, non-polynomial activation function, and let (F_0, ρ_0) denote the architecture of Example 4. Let X be a function space which is homeomorphic to an infinite-dimensional Fréchet space. If (F, ρ) has the UAP on X, then there exists a family {Φ_i}_{i∈I} of

The previous two results describe the structure of universal approximators, but they do not imply the existence of such architectures. Indeed, the existence of a universal approximator on X can always be obtained by setting F = X and ρ(f) = f; however, this is uninteresting since F is large, ρ is trivial, and NN(F, ρ) is intractable. Instead, the next result shows that, for a broad range of function spaces, there are universal approximators for which F is a singleton and the structure of ρ is parameterized by any prespecified separable metric space. This description is possible by appealing to the free-space over X.
Theorem 4 (Existence: Small Universal Approximators) Let X be a separable pointed metric space with at least two points, let X be a function space and a pointed metric space, and let X_0 be a dense barycentric subspace of X. Then, there exists a non-empty set I with

and bounded linear maps
Furthermore, if X = X then the set I is a singleton and the corresponding map is the identity on B(X_0).
The rest of this paper is devoted to the concrete implications of these results in learning theory.

Applications
The dynamical systems described by Theorem 1 (ii) can, in general, be complicated. However, when (F, ρ) is the feed-forward architecture with certain specific activation functions, these dynamical systems explicitly describe the addition of deep layers to a shallow feed-forward network. We begin the next section by characterizing those activation functions before outlining their approximation properties.

Depth as a transitive dynamical system
The impact of different activation functions on the expressiveness of neural network architectures is an active research area. For example, [60] empirically studies the effect of different activation functions on expressiveness, and [61] obtains a characterization of the activation functions for which shallow feed-forward networks are universal. The next result characterizes the activation functions which produce feed-forward networks with the UAP even when no weight or bias is trained, the matrices {A_n}_{n=1}^N are sparse, and the final layers of the network are slightly perturbed.
Fix an activation function σ : R → R. For an m × m matrix A and b ∈ R^m, define the composition operator Φ_{A,b}(f)(x) := f(σ•(Ax + b)), with terminology rooted in [62], where σ• denotes the componentwise application of σ. The family of composition operators {Φ_{A,b}}_{A,b} creates depth within an architecture (F, ρ) by extending it to include any function of the form Φ_{A,b}(ρ(f_1, . . . , f_J)), where J ∈ N_+ and each f_j ∈ F for j = 1, . . . , J. In fact, many of the results only require the following smaller extension of (F, ρ), denoted by (F_{deep;σ}, ρ_{deep;σ}), where F_{deep;σ} := F × N and where ρ_{deep;σ} applies the n-fold iterate Φ^n_{I_m,b}, b is any fixed element of R^m with positive components, and I_m is the m × m identity matrix.
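With A = I_m, the operator reads Φ_{I_m,b}(f)(x) = f(σ•(x + b)), and iterating it n times prepends n untrained deep layers to f. A toy scalar sketch (the particular σ and b below are illustrative choices of ours, not from the paper):

```python
def depth_operator(sigma, b):
    """Phi_{I,b} : f |-> (x |-> f(sigma(x + b))), written for scalar inputs."""
    def apply(f):
        return lambda x: f(sigma(x + b))
    return apply

def iterate(op, f, n):
    """The n-fold application Phi^n_{I,b}(f); n = 0 returns f itself."""
    for _ in range(n):
        f = op(f)
    return f
```

Each application of the operator inserts one fixed (untrained) layer between the input and the existing network, which is exactly how depth enters the extended architecture above.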

Theorem 5 (Characterization of Transitivity in Deep Feed-Forward Networks) Let (F , )
be an architecture on C(R^m, R^n), let σ be a continuous activation function, and fix any b ∈ R^m with strictly positive components. Then Φ_{I_m,b} is a well-defined continuous linear map from C(R^m, R^n) to itself and the following are equivalent: (i) σ is injective and has no fixed points; (ii) the dynamical system Φ_{I_m,b} on C(R^m, R^n) is topologically transitive.

Remark 2 A characterization covering general matrices A is given in Appendix B; however, this less technical formulation is sufficient for all our applications.
We call an activation function transitive if it satisfies any of the conditions (i)-(ii) in Theorem 5.

Example 7
The following variant of the Leaky-ReLU activation of [63] does satisfy Theorem 5 (i). More generally, transitive activation functions also satisfying the conditions required by the central results of [17,61] can be built via the following.

Proposition 1 (Construction of Transitive Activation Functions) Let σ̃ : R → R be a continuous and strictly increasing function satisfying σ̃
Then, σ is continuous, injective, has no fixed points, is non-polynomial, and is continuously differentiable with non-zero derivative at infinitely many points. In particular, σ satisfies the requirements of Theorem 5.
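For illustration, one function of this kind (our own hypothetical instance, not necessarily the one in Example 7) is a rescaled and shifted Leaky-ReLU, σ(x) = 1.1·LeakyReLU(x) + 1: it is continuous, strictly increasing (hence injective), and satisfies σ(x) > x for every x, so it has no fixed points.

```python
import numpy as np

def transitive_sigma(x, slope=0.1, scale=1.1, shift=1.0):
    """A hypothetical transitive activation: sigma(x) = scale*LeakyReLU(x) + shift.
    For x >= 0: sigma(x) = 1.1x + 1 > x.  For x < 0: sigma(x) = 0.11x + 1 > x.
    Strict monotonicity gives injectivity; sigma(x) > x rules out fixed points."""
    leaky = np.where(x >= 0.0, x, slope * x)
    return scale * leaky + shift
```

By contrast, the plain ReLU fails both properties: it is not injective (it collapses all negative inputs to 0) and it fixes every non-negative point.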
Transitive activation functions allow one to automatically conclude that (F_{deep;σ}, ρ_{deep;σ}) has the UAP on C(R^m, R^n) whenever (F, ρ) is only a universal approximator on some non-empty open subset thereof.

Corollary 1 (Local-to-Global UAP) Let X be a non-empty open subset of C(R^m, R^n) and let (F, ρ) be a universal approximator on X. If any of the conditions described by Lemma 3 (i)-(iii) hold, then (F_{deep;σ}, ρ_{deep;σ}) is a universal approximator on C(R^m, R^n).
The function space affects which activation functions are transitive. Since most universal approximation results hold in the space C(R^m, R^n) or in L^p_μ(R^m), for suitable μ and p, we now describe the integrable variant of transitive activation functions.

Integrable variants
Some notation is required to express the integrable variants of Theorem 5 and its consequences. Fix a σ-finite Borel measure μ on R^m. Unlike in the continuous case, the operators Φ_{A,b} may fail to be well-defined or continuous from L^1_μ(R^m) to itself. We also require the notion of the push-forward of a measure by a measurable function. If S : R^m → R^m is Borel measurable and μ is a finite Borel measure on R^m, then the push-forward of μ by S is the measure S_#μ defined on Borel sets B ⊆ R^m by S_#μ(B) := μ(S^{−1}[B]). The case where μ is absolutely continuous with respect to the Lebesgue measure μ_M on R^m is of particular interest. Recall that, if a function is monotone on R, then it is differentiable outside a μ_M-null set; we denote the μ_M-a.e. derivative of any such function σ by σ′, and we denote essential suprema by ‖·‖_∞. In particular, when σ is monotone, Φ_{I_m,b} is well-defined on L^1_μ(R^m) if and only if a suitable integrability condition on σ′ holds. A function is called Borel bi-measurable if both the images and pre-images of Borel sets, under that map, are again Borel sets.
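The push-forward has a simple Monte Carlo interpretation: if samples are drawn from μ, then their images under S are distributed according to S_#μ, precisely because S_#μ(B) = μ(S^{−1}[B]). A quick numerical sketch (the choices of μ and S are illustrative assumptions of ours):

```python
import numpy as np

def pushforward_prob(samples, S, a, b):
    """Estimate S_# mu([a, b]) by pushing mu-samples through S and counting
    how many land in [a, b]; by definition this targets mu(S^{-1}[[a, b]])."""
    y = S(np.asarray(samples))
    return float(np.mean((y >= a) & (y <= b)))
```

For example, if μ = Uniform(0, 1) and S(x) = 2x, then S_#μ = Uniform(0, 2), so the estimated mass of [0, 1] under S_#μ should be close to 1/2.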
Corollary 2 Let b ∈ R^m have strictly positive components b_i, for i = 1, . . . , m, and suppose that σ is injective and Borel bi-measurable, that σ(x) > x except on a Borel set of μ-measure 0, and that condition (6) holds. If (F, ρ) has the UAP on Ball(g, δ) for some g ∈ L^1_μ(R^m) and some δ > 0, then for every f ∈ L^1_μ(R^m) and every ε > 0 there exists some f̂ ∈ NN(F_{deep;σ}, ρ_{deep;σ}) satisfying ‖f − f̂‖_{L^1_μ(R^m)} < ε.

We call activation functions satisfying the conditions of Corollary 2 L^p_μ-transitive. The following is a sufficient condition analogous to the construction of Proposition 1.
Different function spaces can have different transitive activation functions. By shifting the Leaky-ReLU variant of Example 7 we obtain an L p -transitive activation function which fails to be transitive.
The following variant of the Leaky-ReLU activation function is a continuous bijection on R with continuous inverse, and is therefore injective and bi-measurable. Since 0 is its only fixed point, the set on which σ(x) > x fails, namely {x : σ(x) = x} = {0}, is of Lebesgue measure 0, and thus of μ-measure 0 since μ and μ_M are equivalent. Hence, σ is injective and Borel bi-measurable, and σ(x) > x except on a Borel set of μ-measure 0, as required in Corollary 2. However, since 0 is a fixed point of σ, it does not meet the requirements of Theorem 5 (i).
Our main interest in transitive activation functions is that they allow for refinements of classical universal approximation theorems in which a network's last few layers satisfy constraints. This is interesting since constraints are common in most practical situations.

Deep networks with constrained final layers
The requirement that the final few layers of a neural network resemble a given function f̃ is, in effect, a constraint on the network's output possibilities. The next result shows that, if a transitive activation function is used, then a deep feed-forward network's output layers may always be forced to approximately behave like f̃ while maintaining the architecture's universal approximation property. Moreover, the result holds even when the network's initial layers are sparsely connected and have breadth less than the requirements of [17,19]. Note that the network's final layers must be fully connected and are still required to satisfy the width constraints of [17]. For a matrix A (resp. vector b), the quantity ‖A‖_0 (resp. ‖b‖_0) denotes the number of non-zero entries in A (resp. b).
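The sparsity count ‖·‖_0 used below is straightforward to compute; a one-line NumPy sketch (the function name is ours):

```python
import numpy as np

def l0(A):
    """||A||_0: the number of non-zero entries of a matrix or vector."""
    return int(np.count_nonzero(A))
```

A layer whose weight matrix W satisfies l0(W) ≪ (number of entries of W) is sparsely connected in the sense used here.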

Corollary 4 (Feed-Forward Networks with Approximately Prescribed Output Behavior)
Let f̃ : R^m → R^n, let ε > 0, and let σ be a transitive activation function which is non-affine, continuous, and differentiable at least at one point with non-zero derivative at that point. If there exists a continuous function f̃_0 : R^m → R^n such that condition (7) holds, then there exist f ∈ NN(F, ρ), J, J_1, J_2 ∈ N_+ with 0 ≤ J_1 < J, and sets of composable affine maps such that f = W_J ∘ σ• ∘ ⋯ ∘ σ• ∘ W_1 and the following hold:

Remark 3 Condition (7) holds, for any δ > 0, whenever f̃_0 is continuous.
We consider an application of Corollary 4 to deep transfer learning. As described in [65], deep transfer learning is the practice of transferring knowledge from a pre-trained model into a neural network architecture which is to be trained on a, possibly new, learning task. Various formalizations of this paradigm are described in [66], and the next example illustrates the commonly used approach, as outlined in [67], where one first learns a feed-forward network f̃ : R^m → R^n and then uses this map to initialize the final portion of a deep feed-forward network. Here, given a neural network f̃, typically trained on a different learning task, we seek a deep feed-forward network whose final layers are arbitrarily close to f̃ while simultaneously providing an arbitrarily precise approximation for a new learning task.
The structure imposed on the architecture's final layers can also be imposed through a set of constraints. The next result shows that, for a feed-forward network with a transitive activation function, the architecture's output can always be made to satisfy a finite number of compatible constraints. These constraints are described by a finite set of continuous functionals {F_n}_{n=1}^N on C(R^m, R^n) together with a set of thresholds {C_n}_{n=1}^N, where each C_n > 0, such that the stated condition holds for each n = 1, . . . , N and every f ∈ C(R^m, R^n). Next, we show that transitive activation functions can be used to extend the currently available approximation rates for shallow feed-forward networks to their deep counterparts.

Approximation bounds for networks with transitive activation function
In [68, 69], it is shown that feed-forward neural networks of breadth N ∈ N_+ can approximate any function lying in their closed convex hull at a rate of O(N^{-1/2}). These results do not incorporate the impact of depth into their estimates, and the next result builds on them by incorporating that effect. As in [69], we work with the convex hull co(A) of a subset A. 1. For each f ∈ L^1_μ(R^m) and every n ∈ N, there is some N ∈ N such that the stated bound holds. Remark 4 Unlike in [69], Corollary 6(i) holds even when the function f does not lie in the closure of co(A). This is entirely due to the topological transitivity of the composition operator Φ_{I_m,b} and is therefore entirely due to the depth present in the network. In particular, Corollary 6(iii) implies that universal approximation can be achieved even if a feed-forward network's output weights are all constrained to satisfy Σ_{i=1}^n α_i = 1 and α_i ∈ [0, 1], and even if all but the architecture's final two layers are sparsely connected and not trainable.
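The O(N^{-1/2}) convex-combination rate of [68, 69] can be observed numerically. The target E_θ[cos(θx)] = exp(−x²/2) for θ ~ N(0, 1) and the uniform weights α_i = 1/N below are an illustrative Maurey-style sampling choice, not the precise setting of Corollary 6.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 201)
target = np.exp(-x**2 / 2)      # = E_theta[cos(theta * x)] with theta ~ N(0,1)

def convex_combo_error(N, trials=50):
    # average L2-type error of the convex combination (1/N) * sum_i cos(theta_i * x)
    errs = []
    for _ in range(trials):
        theta = rng.normal(size=N)
        approx = np.cos(np.outer(theta, x)).mean(axis=0)  # alpha_i = 1/N
        errs.append(np.sqrt(np.mean((approx - target) ** 2)))
    return float(np.mean(errs))

e_small, e_big = convex_combo_error(25), convex_combo_error(2500)
print(e_small, e_big)   # error shrinks roughly like N ** -0.5
```

Increasing N by a factor of 100 should shrink the error by roughly a factor of 10, consistent with the O(N^{-1/2}) rate.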
To date, we have focused on the application and interpretation of Theorem 1. Next, Theorem 3 is used to modify and improve the approximation capabilities of universal approximators on C(R).

Improving the approximation capabilities of an architecture
Most currently available universal approximation results for spaces of continuous functions provide approximation guarantees for the topology of uniform convergence on compacts. Unfortunately, this is a very local form of approximation, and there is no guarantee that the approximation quality holds outside a prespecified bounded set. For example, there are sequences {f_n}_{n∈N} which converge to the constant 0 function uniformly on compacts while maintaining the constant error sup_{x∈R} |f_n(x) − 0| = 1. These approximation guarantees are strengthened by modifying any given universal approximator on C(R^m, R^n) to obtain a universal approximator in a smaller space of continuous functions for a much finer topology. We introduce this space as follows.
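A concrete instance of this phenomenon, with a drifting unit-height bump as one possible choice of sequence:

```python
import numpy as np

def f_n(x, n):
    # a unit-height bump drifting to +infinity as n grows
    return np.exp(-(x - n) ** 2)

xs_compact = np.linspace(-10, 10, 2001)   # a fixed compact set [-10, 10]
sup_on_compact = [float(np.max(f_n(xs_compact, n))) for n in (20, 40, 80)]
print(sup_on_compact)  # tends to 0: uniform convergence on compacts

# yet the global sup-error never decays:
print([float(f_n(float(n), n)) for n in (20, 40, 80)])  # each equals 1.0
```

On any fixed compact the sup-error vanishes, but the global sup-error stays pinned at 1, which is exactly the gap the Ω-modification below is designed to close.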
Let Ω be a finite set of non-negative-valued continuous functions ω from [0, ∞) to [0, ∞) for which there is some ω_0 ∈ Ω satisfying ω_0(·) = 1. Let C_Ω(R^m, R^n) be the set of all continuous functions whose asymptotic growth-rate is controlled by some ω ∈ Ω, in the sense that ‖f‖_{ω,∞} := sup_{x∈R^m} ‖f(x)‖ / (1 + ω(‖x‖)) is finite. Each C_ω(R^m, R^n) is a special case of the weighted spaces studied in [70], which are Banach spaces when equipped with the norm ‖·‖_{ω,∞}. Accordingly, C_Ω(R^m, R^n) is equipped with the finest topology making each C_ω(R^m, R^n) into a subspace. Indeed, such a topology exists by [71, Proposition 2.6].
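Assuming the weighted norm has the form ‖f‖_{ω,∞} = sup_x ‖f(x)‖ / (1 + ω(‖x‖)), as the isometry in the proof of Theorem 6 suggests, a short numerical check shows how the weight tames a function whose plain sup-norm is infinite:

```python
import numpy as np

def weighted_sup(f, omega, xs):
    # ||f||_{omega,infty} = sup_x |f(x)| / (1 + omega(|x|))   (assumed form)
    return float(np.max(np.abs(f(xs)) / (1.0 + omega(np.abs(xs)))))

f = lambda x: x ** 2        # unbounded: its plain sup-norm on R is infinite
omega = lambda t: t ** 2    # quadratic growth weight controlling f

for R in (10, 100, 1000):
    xs = np.linspace(-R, R, 10001)
    print(float(np.max(np.abs(f(xs)))), weighted_sup(f, omega, xs))
# the plain sup blows up with R, while the omega-weighted sup stays below 1
```

The choice f(x) = x² and ω(t) = t² is purely illustrative: any f dominated at infinity by some ω ∈ Ω has finite weighted norm.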
Example 10 If Ω = {t ↦ max{t, t^i}}_{i>0}, then f ∈ C_Ω(R^m, R^n) if and only if f has asymptotically sub-polynomial growth, in the sense that there is a polynomial p : R^m → R^n dominating f at infinity. Given an architecture (F, ) on C(R^m, R^n), define its Ω-modification to be the architecture (F_Ω, ) on C_Ω(R^m, R^n) given by F_Ω := F × Ω × (0, ∞)^2, with the corresponding realization map. The functions in NN(F_Ω, ) are therefore capable of adjusting to the different growth rates of functions in C_Ω(R^m, R^n), whereas those in NN(F, ) need not be.
If condition (9) holds, then (F_Ω, ) is a universal approximator on C_Ω(R^m, R^n).

Remark 5 Condition (9) is satisfied by any set of piecewise linear functions. For instance, NN(F, ) consists of piecewise linear functions if F is as in Example 4 and σ is the ReLU activation function.
The architecture (F_Ω, ) often provides a strict improvement over (F, ).

Proposition 2 Let (F, ) be a universal approximator on C(R^m, R^n), and let Ω := {t ↦ exp(−kt) : k ∈ N}. Then (F, ) is not a universal approximator on C_Ω(R^m, R^n).

Representation of approximators on L ∞
There is currently no available universal approximation theorem describing a small architecture on L^∞(R^m, R^n) with the UAP. Indeed, even trees are not dense therein, since the Lebesgue measure is σ-finite but not finite. A direct consequence of Theorem 4 is the guarantee that a minimal architecture on L^∞(R) exists and admits the following representation.

Conclusion
In this paper, we studied the universal approximation property in a scope applicable to most architectures on most function spaces of practical interest. Our results were used to characterize, construct, and establish the existence of such structures in many familiar and exotic function spaces.
Our results were used to establish the universal approximation capabilities of deep and narrow networks with constraints on their final layers and sparsely connected initial layers. We derived approximation bounds for feed-forward networks with such activation functions in terms of depth and breadth. We showed that the set of activation functions for which these results hold is broader when the underlying function space is L^p(R^m) than when it is C(R^m), which shows that the choice of activation function depends on the underlying topological criterion quantifying the UAP. We characterized the activation functions for which these results hold as being precisely the injective, continuous, non-affine activation functions which are differentiable at at-least one point with non-zero derivative at that point and have no fixed points. We provided a simple, direct way to construct such activation functions. We showed that a rescaled and shifted Leaky-ReLU activation is an example of such an activation function, while the ReLU activation is not. We used our construction result to build a universal approximator in the space of continuous functions between Euclidean spaces with controlled growth, equipped with a uniform notion of convergence. This result strengthens the currently available guarantees for feed-forward networks, which state that this architecture is universal in C(R^m, R^n) for the weaker topology of uniform convergence on compacts. Finally, we obtained a representation of a small universal approximator on L^∞(R^m).
The results, structures, and methods introduced in this paper provide a flexible and broad toolbox to the machine learning community to build, improve, and understand universal approximators. It is hoped that these tools will help others develop new, theoretically justified architectures for their learning tasks.

Funding Open access funding provided by Swiss Federal Institute of Technology Zurich.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A: Proofs of Main Results
Theorem 1 is encompassed by the following broader but more technical result.

Lemma 2 (Characterization of the Universal Approximation Property) Let X be a function space, let E be an infinite-dimensional Fréchet space for which there exists some homeomorphism Φ : X → E, and let (F, ) be an architecture on X. Then the following are equivalent:
(i) (F, ) has the UAP,
(ii) Decomposition of UAP via Subspaces: There exist subspaces {X_i}_{i∈I} of X such that:

(a) Φ(X_i) is a separable infinite-dimensional Fréchet subspace of E, and
(b) NN(F, ) ∩ X_i contains a countable, dense, and linearly-independent subset of X_i,
(c) For each i ∈ I, there exists a homeomorphism Φ_i : X_i → L^2(R).

(iii) Decomposition of UAP via Topologically Transitive Dynamics:
There exist subspaces {X i } i∈I of X and continuous functions {φ i } i∈I with φ i : X i → X i such that:

and in particular, it is a dense subset of X_i,
(iv) Parameterization of UAP on Subspaces: There are triples {(X_i, Φ_i, ψ_i)}_{i∈I} of separable topological spaces X_i, non-constant continuous functions Φ_i : X_i → X, and functions ψ_i : X_i → X_i satisfying the following:

and in particular, it is a dense subset of Φ_i(X_i).
Moreover, if X is separable, then I may be taken to be a singleton.
Proof of Lemma 2 Suppose that (ii) holds. Since ∪_{i∈I} X_i is dense in X and since ∪_{i∈I} NN(F, ) ∩ X_i ⊆ NN(F, ), it is sufficient to show that ∪_{i∈I} NN(F, ) ∩ X_i is dense in ∪_{i∈I} X_i to conclude that it is dense in X. Since each X_i is a subspace of X, then, by restriction, each NN(F, ) ∩ X_i is a subspace of X_i with its relative topology.
Let X̃ denote the set ∪_{i∈I} X_i equipped with the finest topology making each X_i into a subspace; such a topology exists by [71, Proposition 2.6]. Since each X_i is also a subspace of ∪_{i∈I} X_i with its relative topology, and since, by definition, that topology is no finer than the topology of X̃, it is sufficient to show that ∪_{i∈I} NN(F, ) ∩ X_i is dense in X̃ to conclude that it is dense in ∪_{i∈I} X_i equipped with its relative topology.
Indeed, by [71, Proposition 2.7], the space X̃ is given by the (topological) quotient of the disjoint union ⊔_{i∈I} X_i, in the sense of topological spaces. In particular, (11) implies that ∪_{i∈I} NN(F, ) ∩ X_i meets every non-empty open subset U ⊆ X̃. Therefore, ∪_{i∈I} NN(F, ) ∩ X_i is dense in X̃, and therefore it is dense in ∪_{i∈I} X_i equipped with its relative topology. Hence, (F, ) has the UAP and therefore (i) holds.
In the next portion of the proof, we denote the (linear algebraic) dimension of any vector space V by dim(V). Recall that this is the cardinality of the smallest basis for V. We follow the von Neumann convention and, whenever required by the context, we identify the natural number n with the ordinal {1, . . . , n}.
Assume that (i) holds. For the first part of this proof, we would like to show that D contains a linearly independent and dense subset D′. Since X is homeomorphic to some infinite-dimensional Fréchet space E, there exists a homeomorphism Φ : X → E mapping NN(F, ) to a dense subset D of E. We denote the metric on E by d. A consequence of [72, Theorem 3.1], discussed thereafter by the authors, implies that, since E is an infinite-dimensional Fréchet space, it has a dense Hamel basis, which we denote by {b_a}_{a∈A}. By definition of the Hamel basis of E, we may assume that the cardinality of A, denoted Card(A), is equal to dim(E). Next, we use {b_a}_{a∈A} to produce a base of open sets for the topology of E of cardinality equal to dim(E).
Since E is a metric space, its topology is generated by the open balls Ball_E(x, r) := {y ∈ E : d(x, y) < r}. For every a ∈ A and r ∈ (0, ∞), the basic open set Ball_E(b_a, r) can be expressed as Ball_E(b_a, r) = ∪_{q∈Q∩(0,r)} Ball_E(b_a, q). Hence, {Ball_E(b_a, q)}_{a∈A, q∈Q∩(0,∞)} generates the topology on E. Moreover, the cardinality of the indexing set A × (Q ∩ (0, ∞)) equals dim(E), since E is infinite-dimensional and therefore at-least countable-dimensional. Therefore, {Ball_E(b_a, q)}_{a∈A, q∈Q∩(0,∞)} is a base for the topology on E of cardinality equal to dim(E). Let ω be the smallest ordinal of cardinality Card(A × (Q ∩ (0, ∞))). In particular, there exists a bijection F : ω → A × (Q ∩ (0, ∞)) which allows us to canonically order the open sets {Ball_E(F(j)_1, F(j)_2)}_{j≤ω}. We construct D′ by transfinite induction over ω. Indeed, since 1 < ω, since D is dense in E, and since {Ball_E(F(j)_1, F(j)_2)}_{j≤ω} defines a base for the topology of E, there exists some U_1 ∈ {Ball_E(F(j)_1, F(j)_2)}_{j≤ω} containing some d_1 ∈ D. For the inductive step, suppose that, for some j < ω, we have constructed a linearly independent set {d_i}_{i<j} with d_i ∈ Ball_E(F(i)_1, F(i)_2) for every i < j. Since j < ω and {d_i}_{i<j} is a Hamel basis of span({d_i}_{i<j}), then dim(span({d_i}_{i<j})) < dim(E). Hence, the closure of span({d_i}_{i<j}) has empty interior, and therefore it cannot contain any Ball_E(F(j)_1, F(j)_2). In particular, there is a non-empty open subset V ⊆ Ball_E(F(j)_1, F(j)_2) − span({d_i}_{i<j}), and, since D was assumed to be dense in E, there must be some d_j ∈ V ⊆ Ball_E(F(j)_1, F(j)_2). This completes the inductive step, and therefore there is a linearly independent and dense subset D′ of E. Next, let I be the set of all countable sequences of distinct elements of ω. For every i ∈ I, let E_i be the closure in E of span({d_j}_{j∈i}). Then, each E_i is a linear subspace of E with countable basis {d_j}_{j∈i}.
Since any Fréchet space with a countable basis is separable, each E_i is a separable Fréchet space. Moreover, by construction, ∪_{i∈I} E_i contains D′ and is therefore dense in E, since D′ is dense in E. Since Φ is a homeomorphism, Φ^{−1} : E → X is a continuous surjection, and, since the image of a dense set under any continuous map is dense in the range of that map, Φ^{−1}(D′) is dense in X. Moreover, using the fact that inverse images commute with unions and the fact that Φ is a bijection, we compute that Φ^{−1}(∪_{i∈I} E_i) = ∪_{i∈I} Φ^{−1}(E_i). Since Φ is a bijection and D was defined as the image of NN(F, ) in E under Φ, then Φ^{−1}(D′) ⊆ NN(F, ) and Φ^{−1}(D′) is dense in X. In particular, (14) implies that ∪_{i∈I} X_i, where X_i := Φ^{−1}(E_i), is dense in X. Since Φ is a homeomorphism, it preserves dense sets; in particular, since {d_j}_{j∈i} is a countable, dense, and linearly independent subset of E_i, its preimage Φ^{−1}[{d_j}_{j∈i}] is a dense countable subset of X_i. Hence, each X_i is separable.
This gives (ii.b). Lastly, by [73], any two separable infinite-dimensional Fréchet spaces are homeomorphic. In particular, since L^2(R) is a separable Hilbert space, it is a separable Fréchet space. Therefore, for each i ∈ I, there is a homeomorphism Φ̃_i : E_i → L^2(R). In particular, Φ̃_i ∘ Φ : X_i → L^2(R) must be a homeomorphism, and therefore (ii.c) holds. Therefore, (i) implies (ii).
Suppose that (ii) holds; in particular, NN(F, ) ∩ X_i is dense in X_i. Thus, (iii.c) holds. For any i ∈ I, define the map ψ_i := Φ_i ∘ φ_i ∘ Φ_i^{−1}, and define the vector g̃_i ∈ L^2(R) by g̃_i := Φ_i(g_i). Since Φ and Φ_i are homeomorphisms and φ_i is continuous, ψ_i is well-defined and continuous. Moreover, analogously to (15), we compute that {ψ_i^n(g̃_i)}_{n∈N} is dense in L^2(R). Since L^2(R) is a complete separable metric space with no isolated points and ψ_i is a continuous self-map of L^2(R) for which there is a vector g̃_i ∈ L^2(R) such that the set of iterates {ψ_i^n(g̃_i)}_{n∈N} is dense in L^2(R), the Birkhoff Transitivity Theorem, in the formulation of [74, Theorem 1.16], implies that for every pair of non-empty open subsets Ũ, Ṽ ⊆ L^2(R) there is some n_{Ũ,Ṽ} satisfying (16). Since Φ_i ∘ Φ is a homeomorphism, [74, Proposition 1.13] and (16) imply that for every pair of non-empty open subsets U′, V′ ⊆ X_i there exists some n_{U′,V′} ∈ N satisfying (17). Since X_i is equipped with the subspace topology, every non-empty open subset U′ ⊆ X_i is of the form U ∩ X_i for some non-empty open subset U ⊆ X. Therefore, (17) implies (iii.b). Since both L^2(R) and C(R) are separable infinite-dimensional Fréchet spaces, [73] implies that there exists a homeomorphism Ξ : L^2(R) → C(R). Therefore, for each i ∈ I, Ξ ∘ Φ_i ∘ Φ : X_i → C(R) is a homeomorphism, and thus (ii.c) implies (iii.d).
Suppose that (iii) holds. For every i ∈ I, set X_i := X_i, let Φ_i := 1_{X_i} be the identity map on X_i, set ψ_i := φ_i, and set x_i := g_i. Therefore, (iv) holds.
Suppose that (iv) holds. By (iv.c), for each i ∈ I, NN(F, ) ∩ X_i is dense in X_i. By (iv.a), since ∪_{i∈I} X_i is dense in X, its closure is X, and therefore the smallest, and thus only, closed set containing ∪_{i∈I} X_i is X itself. Therefore, by (18), the smallest closed set containing ∪_{i∈I} NN(F, ) ∩ X_i must be X. Therefore, NN(F, ) is dense in X and (i) holds. This concludes the proof.

Proof of Theorem 2 By the Anderson–Kadec Theorem [73],
there is no loss of generality in assuming that m = n = 1, since C(R^m, R^n) and C(R) are homeomorphic. Let X′ := ∪_{i∈I} Φ_i(C(R)). By (5), X′ is dense in X and, since density is transitive, it is enough to show that ∪_{i∈I} Φ_i(NN(F, )) is dense in X′ to conclude that it is dense in X. Since each Φ_i is continuous, the topology on X′ is no finer than the finest topology on ∪_{i∈I} Φ_i(C(R)) making each Φ_i continuous, and by [71, Proposition 2.6] such a topology exists. Let X̃ denote ∪_{i∈I} Φ_i(C(R)) equipped with the finest topology making each Φ_i(C(R)) into a subspace. By construction, if U ⊆ X′ is open then it is open in X̃, and therefore if ∪_{i∈I} Φ_i(NN(F, )) intersects each non-empty open subset of X̃ then it must do the same for X′. Hence, it is enough to show that ∪_{i∈I} Φ_i(NN(F, )) is dense in X̃ to conclude that it is dense in X′, and therefore dense in X.
We proceed similarly to the proof of Lemma 2. Indeed, by [71, Proposition 2.7], the space X̃ is given by the (topological) quotient of the disjoint union ⊔_{i∈I} Φ_i(C(R)), in the sense of topological spaces. In particular, (19) implies that ∪_{i∈I} NN(F, ) ∩ Φ_i(C(R)) meets every non-empty open subset U ⊆ X̃. Therefore, ∪_{i∈I} NN(F, ) ∩ Φ_i(C(R)) is dense in X̃, and therefore it is dense in ∪_{i∈I} Φ_i(C(R)) equipped with its relative topology. Hence, (F, ) has the UAP on X′ and therefore it has the UAP on X itself.

Proof of Theorem 3
Let σ be a continuous and non-polynomial activation function. Then [61] implies that the architecture (F_0, 0), as defined in Example 4, is a universal approximator on C(R).
By Theorem 1, since (F, ) has the UAP on X and since X is homeomorphic to an infinite-dimensional Fréchet space, there are homeomorphisms {Φ_i}_{i∈I} from C(R) onto a family of subspaces {X_i}_{i∈I} of X such that ∪_{i∈I} X_i is dense. Fix ε > 0 and f ∈ X. Since ∪_{i∈I} X_i is dense in X, there exist some i ∈ I and some f_i ∈ X_i such that d_X(f, f_i) < ε/2. Since Φ_i is a homeomorphism, it must map dense sets to dense sets. Since (F_0, 0) has the UAP on C(R), NN(F_0, 0) is dense in C(R) and therefore, for each i ∈ I, Φ_i(NN(F_0, 0)) is dense in X_i. Hence, there exists some g̃ ∈ Φ_i(NN(F_0, 0)) such that d_X(f_i, g̃) < ε/2. Since Φ_i is a homeomorphism, it is a bijection; therefore there exists a unique g ∈ NN(F_0, 0) with Φ_i(g) = g̃. Hence, the triangle inequality and (21) yield the first inequality in the Theorem's statement. By Theorem 1, since, for each i ∈ I, NN(F, ) ∩ X_i is dense in X_i, and since Φ_i^{−1} is a homeomorphism on X_i, Φ_i^{−1}(NN(F, ) ∩ X_i) is dense in C(R). In particular, there is some f̃ ∈ Φ_i^{−1}(NN(F, ) ∩ X_i) as in (23). Since Φ_i is a bijection, there exists a unique f′ ∈ NN(F, ) such that Φ_i^{−1}(f′) = f̃. Therefore, (23) and the triangle inequality imply that the conclusion holds.

The proof of the next result relies on some aspects of inductive limits of Banach spaces. Briefly, an inductive limit of Banach spaces is a locally convex space B for which there exist a pre-ordered set I and a set of Banach subspaces {B_i}_{i∈I} with B_i ⊆ B_j if i ≤ j. The inductive limit of this direct system is the subset ∪_{i∈I} B_i equipped with the finest topology which simultaneously makes each B_i into a subspace and makes ∪_{i∈I} B_i into a locally convex space. Spaces constructed in this way are called ultrabornological spaces, and more details about them can be found in [75, Chapter 6].

Proof of Theorem 4
Since B(X_0) and B(X) are both infinite-dimensional Banach spaces, they are infinite-dimensional ultrabornological spaces, in the sense of [75, Definition 6.1.1]. Since X is separable, then, as observed in [33], B(X) is separable. Therefore, [75, Theorem 6.5.8] applies; hence, there exist a directed set I with pre-order ≤, a collection of Banach subspaces {B_i}_{i∈I} satisfying (i) and (ii), and a collection of continuous linear isomorphisms Φ_i : B(X) → B_i. Furthermore, the topology on B is coarser than the inductive limit topology lim→_{i∈I} B_i. Since B(X) and each B_i are Banach spaces, and in particular normed linear spaces, then by the results of [76, Section 2.7] the maps Φ_i are bounded linear isomorphisms.
Let i ∈ I, and fix any x_i ∈ X − {0_X}. Since δ_X : X → B(X) is base-point preserving, δ_{x_i} ≠ 0, and therefore there exists a linearly independent subset B_{x_i} of B(X) containing δ_{x_i}. Since B(X) is separable, B_{x_i} is countably infinite, and therefore, by [74, Theorem 8.24], there exists a bounded linear map φ_i : B(X) → B(X) whose iterates {φ_i^n(δ_{x_i})}_{n∈N_+} are dense in B(X). Since Φ_i is a continuous linear isomorphism, it is in particular a surjective continuous map from B(X) onto B_i. Since the image of a dense set under a continuous surjection is itself dense, {Φ_i ∘ φ_i^n(δ_{x_i})}_{n∈N_+} is a dense subset of B_i. Moreover, this holds for each i ∈ I.
By definition, the topology on lim→_{i∈I} B_i is at-least as fine as the Banach space topology on B(X_0), since each B_i is a linear subspace of B(X_0). Moreover, the topology on lim→_{i∈I} B_i is no finer than the finest topology on ∪_{i∈I} B_i making each B_i into a topological subspace (but not requiring that ∪_{i∈I} B_i be locally convex), which exists by [77, Proposition 6]. Denote this latter space by B̃. Therefore, if the set in (24) is dense in B̃, then it is dense in lim→_{i∈I} B_i and in B(X_0). Hence, we show that (24) is dense in B̃. In particular, (25) implies that the set in (24) meets every non-empty open subset U ⊆ B̃. Therefore, (24) is dense in B̃ and, in particular, it is dense in B(X_0). Since X_0 is barycentric, there exists a continuous linear map ρ : B(X_0) → X_0 which is a left-inverse of δ_{X_0}. Thus, for every f ∈ X_0, ρ ∘ δ_{X_0}(f) = f, and therefore ρ is a continuous surjection. Since the image of a dense set under a continuous surjection is dense, and since (24) is dense, the set in (27) is a dense subset of X_0. Since X_0 was assumed to be dense in X and since density is transitive, the set in (27) is dense in X. This concludes the main portion of the proof. The final remark follows from the fact that, if X = X_0, then the identity map 1_X : X → X_0 is an isometry, and therefore the universal property of B(X) described in [32, Theorem 3.6] implies that 1_X uniquely extends to a bounded linear isomorphism L between B(X) and B(X_0). Hence, L must be the identity on B(X).

Appendix B: Proof of Applications of Main Results
Lemma 3 Fix some b ∈ R^m, and let σ : R → R be a continuous activation function. Then Φ_{A,b} is a well-defined and continuous linear map from C(R^m, R^n) to itself, and the following are equivalent: (ii) σ is injective, A is of full rank, and for every compact subset K ⊆ R^m there is some N_K ∈ N_+ such that the stated condition holds. If A is the m × m identity matrix I_m and b_i > 0 for i = 1, . . . , m, then (i) and (ii) are equivalent to (iii) σ is injective and has no fixed points.
If A is the m × m identity matrix I_m and b_i > 0 for i = 1, . . . , m, then (iii) is equivalent to (iv).

Proof of Lemma 3 By [37, Theorem 46.8], the topology of uniform convergence on compacts is the compact-open topology on C(R^m, R^n), and by [37, Theorem 46.11] composition is a continuous operation in the compact-open topology. Therefore, Φ_{A,b} is a well-defined and continuous map. Its linearity follows from the fact that (αf + g) ∘ S = α(f ∘ S) + g ∘ S for any α ∈ R and f, g ∈ C(R^m, R^n). Since the topology of uniform convergence on compacts is a metric topology, with metric d_ucc, the sets U_{f,ε} := {g ∈ C(R^m, R^n) : d_ucc(f, g) < ε}, for f ∈ C(R^m, R^n) and ε > 0, define a base for this topology. Therefore, Lemma 3 (i) is equivalent to the statement: for each pair of non-empty open subsets U, V ⊆ C(R^m, R^n) there is some N_{U,V} ∈ N_+ such that Φ_{A,b}^{N_{U,V}}(U) ∩ V ≠ ∅. Without loss of generality, we prove this formulation instead.

By [79, Lemma 4.1], Φ_{A,b} satisfies Theorem 1 (ii.b) if and only if S(x) := σ(Ax + b)
is injective and for every compact subset K ⊆ R^m there exists some N_K ∈ N_+ such that S^{N_K}(K) ∩ K = ∅. Therefore, A must be injective, which is only possible if A is of full rank. This gives the equivalence between (i) and (ii). We consider the equivalence between (ii) and (iii) in the case where A is the identity matrix and b_i > 0 for i = 1, . . . , m. Since S(x) = (σ(x_1 + b_1), . . . , σ(x_m + b_m)), it is sufficient to verify condition (28) in the case where m = 1. Since b_i > 0 for i = 1, . . . , m, S is injective and has no fixed points if and only if σ is injective and has no fixed points. We show that S is injective and has no fixed points if and only if (ii) holds.
From here, we proceed analogously to the proof of [79, Lemma 4.1]. If S has a fixed point x, then for every N ∈ N_+, S^N({x}) = {x}, which is a non-empty compact subset of R. Therefore, (28) cannot hold. Conversely, suppose that S has no fixed points. The intermediate-value theorem and the fact that S has no fixed points imply that either S(x) > x for all x or S(x) < x for all x. Mutatis mutandis, we proceed with the first case. Since σ is injective and S has no fixed points, S must be a strictly increasing function; thus S([a, b]) = [S(a), S(b)] for every a < b.
Let K be a non-empty compact subset of R. By the Heine–Borel theorem, K is closed and bounded; thus it is contained in some [a, b] with a < b. Therefore, it is sufficient to show the result in the case where K = [a, b]. Since S(x) > x and S is increasing, the sequence {S^n(a)}_{n∈N} satisfies S^n(a) < S^{n+1}(a). If this sequence were bounded, there would exist some a_0 ∈ R such that a_0 = lim_{n→∞} S^n(a), and, by the continuity of S, a_0 = S(a_0); but since S has no fixed points, no such a_0 can exist. Therefore, {S^n(a)}_{n∈N} is unbounded. Hence, for every a < b there exists some N_{[a,b]} ∈ N_+ such that S^{N_{[a,b]}}([a, b]) ∩ [a, b] = ∅. Thus, (ii) and (iii) are equivalent when A = I_m.
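The escape-to-infinity mechanism of this argument is easy to observe numerically. The map σ(x) = max(x, 0.5x) + 0.2 below is one plausible rescaled-and-shifted Leaky ReLU (strictly increasing, injective, σ(x) > x everywhere), chosen purely for illustration:

```python
import numpy as np

def sigma(x):
    # strictly increasing and sigma(x) > x for all x, hence no fixed points
    return np.maximum(x, 0.5 * x) + 0.2

b = 0.1
S = lambda x: sigma(x + b)   # S = sigma(. + b), the map iterated in the proof

orbit = [-5.0]
for _ in range(200):
    orbit.append(float(S(orbit[-1])))

print(orbit[0], orbit[-1])   # the orbit is strictly increasing and unbounded
print(all(y > x for x, y in zip(orbit, orbit[1:])))  # True
```

Every orbit eventually leaves any fixed interval [a, b], which is exactly the condition S^N([a, b]) ∩ [a, b] = ∅ used above.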
Next, assume that any of (i) to (iii) holds, that X is a non-empty open subset of C(R^m, R^n), and that (F, ) has the UAP on X. Then for any other non-empty open subset U ⊆ C(R^m, R^n) there exists some N_{X,U} ∈ N such that Φ_{A,b}^{N_{X,U}}(X) ∩ U ≠ ∅. Since Φ_{A,b} is continuous, so is Φ_{A,b}^N, and therefore the preimage of an open set under Φ_{A,b}^N is open. This implies that X ∩ (Φ_{A,b}^N)^{−1}(U) is open and non-empty. Thus, for each such U there exist some N_U ∈ N_+ and some f_U ∈ NN(F, ) such that Φ_{A,b}^{N_U}(f_U) ∈ U. In particular, (31) holds.

Proof of Theorem 5 The equivalence between (i), (ii), and (iv) follows from Lemma 3. The equivalence between (iii) and (iv) follows from the formulation of Birkhoff's transitivity theorem described in [74, Theorem 2.19].
Proof of Proposition 1 Since α_1 < 1, σ(x) > x for every x < 0. Since 0 < α_2, σ(0) = α_2 > 0. Lastly, since σ̃ is monotone increasing, for every x > 0 we have that σ(x) > x. Therefore, σ cannot have a fixed point. Moreover, since σ̃ is strictly increasing, if x < y then σ(x) < σ(y), and therefore σ(x) ≠ σ(y) whenever x ≠ y. Hence, σ is injective. Moreover, since the sum of continuous functions is again continuous, σ is continuous.
Since α_1 x + α_2 is affine, it is continuously differentiable; thus σ is continuously differentiable at any x < 0. Lastly, setting α_2 not equal to σ̃(0) − 1 ensures that σ is not differentiable at 0, and therefore it cannot be polynomial. In particular, it cannot be affine.
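The contrast with ReLU can be checked numerically. The parameters α_1 = 0.5 and α_2 = 0.2 below are an illustrative choice of rescaled-and-shifted Leaky ReLU satisfying the proposition's hypotheses, not the unique one:

```python
import numpy as np

xs = np.linspace(-10, 10, 100001)

relu = lambda x: np.maximum(x, 0.0)
shifted_leaky = lambda x: np.maximum(x, 0.5 * x) + 0.2  # hypothetical alpha1=0.5, alpha2=0.2

# ReLU fixes every x >= 0 ...
print(np.min(np.abs(relu(xs) - xs)))   # 0.0: fixed points exist
# ... while the rescaled-and-shifted Leaky ReLU stays strictly above the diagonal:
print(np.min(shifted_leaky(xs) - xs))  # > 0: no fixed points
```

This is the numerical face of the dichotomy above: ReLU fails the no-fixed-point criterion for transitivity, while the shifted Leaky ReLU satisfies it.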
For convenience, we denote the collection of set-functions from R^m to R^n by [R^m, R^n].

Proof of Corollary 4
Since d_ucc is a metric on [R^m, R^n] and since C(R^m, R^n) ⊆ [R^m, R^n], the map F : C(R^m, R^n) → R defined by F(g) := d_ucc(f_0, g) is continuous. Therefore, the set F^{−1}[(−∞, δ)] is an open subset of C(R^m, R^n). In particular, (7) guarantees that it is non-empty. Since σ is non-affine and continuously differentiable at at-least one point with non-zero derivative at that point, [17, Theorem 3.2] applies, whence the set X_0 of continuous functions h : R^m → R^n with representation h = W_J ∘ σ ∘ · · · ∘ σ ∘ W_1, where the W_j : R^{d_j} → R^{d_{j+1}}, for j = 1, . . . , J − 1, are affine, n + m + 2 ≥ d_j if j ∉ {1, J}, d_1 = m, and d_J = n, is dense in C(R^m, R^n). Fix some b ∈ R^m with b_i > 0 for i = 1, . . . , m. Since σ is continuous, injective, and has no fixed points, applying Lemma 3 implies that the set X_1 := {Φ_{I_m,b}^N(g) : g ∈ F^{−1}[(−∞, δ)] ∩ X_0, N ∈ N_+} is a dense subset of C(R^m, R^n). This gives (i). Moreover, by construction, every g ∈ X_1 admits a representation satisfying (iii) and (iv). Furthermore, since W_J ∘ σ ∘ · · · ∘ σ ∘ W_1 ∈ X_2 and, by construction, there exists some g ∈ X_1 for which d_ucc(W_J ∘ σ ∘ · · · ∘ σ ∘ W_1, g) < δ, (ii) holds.
The constraint set U := ∩_{n=1}^N F_n^{−1}[(−∞, C_n)] is an open subset of C(R^m, R^n). Since there exists some f_0 ∈ C(R^m, R^n) satisfying (8), U is non-empty. Since (F, ) has the UAP on C(R^m, R^n), NN(F, ) ∩ U is dense in U. Fix b ∈ R^m with b_i > 0 for i = 1, . . . , m and set A = I_m.
Since σ is a transitive activation function, Corollary 1 applies, and therefore the set {Φ_{I_m,b}^N(g) : g ∈ NN(F, ) ∩ U, N ∈ N_+} is dense. By Lemma 1, the map Φ_{I_m,b}, and therefore the map Φ_{I_m,b}^N, is continuous; hence we obtain the conclusion.

Proof of Corollary 3 By Proposition 1, and the observation in its proof that σ(x) > x, we only need to verify that σ is Borel bi-measurable. Indeed, since σ is continuous and injective, then by [81, Proposition 2.1], σ^{−1} exists and is continuous on the image of σ. Since σ was assumed to be surjective, σ^{−1} exists on all of R and is continuous thereon. Hence, σ^{−1} and σ are measurable, since any continuous function is measurable.
Proof of Corollary 6 Fix A = I_m and b ∈ R^m with b_i > 0 for i = 1, . . . , m. Since int(co(F)) is a non-empty open set, there exist some f ∈ int(co(F)) and some δ > 0 for which the stated ball lies in co(F). Since NN(F, ) is dense in L^1_μ(R^m), its intersection with any non-empty open subset thereof is also dense; in particular, its intersection with that ball is dense therein. Since σ is L^1-transitive, (iii) follows from Corollary 2.
Since L^1_μ(R^m) is a metric space, {Ball_{L^1_μ(R^m)}(g, δ) : g ∈ L^1_μ(R^m), δ > 0} is a base for the topology thereon. Therefore, Corollary 2 implies that for any two non-empty open subsets U, V ⊆ L^1_μ(R^m) there exists some N_{U,V} ∈ N satisfying Φ_{I_m,b}^{N_{U,V}}(U) ∩ V ≠ ∅. Moreover, by Lemma 1, we know that the right-hand side of (35) is finite. Therefore, (34) implies that for every f_1, . . . , f_n ∈ F and α_1, . . . , α_n ∈ [0, 1] with Σ_{i=1}^n α_i = 1, the bound (36) holds. Combining the estimates (33)–(36), we obtain (37). Since Φ_{I_m,b}^N is linear, the right-hand side of (37) reduces, and we obtain the estimate (38). Therefore, the estimate in (i) holds.
The statement of the next lemma concerns the Banach space of functions vanishing at infinity. Denoted C_0(R^m, R^n), this is the set of continuous functions f from R^m to R^n such that, for any ε > 0, there exists some compact subset K ⊆ R^m for which sup_{x∉K} ‖f(x)‖ < ε. As discussed in [82, VII], C_0(R^m, R^n) is made into a Banach space by equipping it with the supremum norm ‖f‖_∞ := sup_{x∈R^m} ‖f(x)‖.

Lemma 4 (Uniform Approximation of Functions Vanishing at Infinity) Suppose that (F, ) is a universal approximator on C(R^m, R^n). Then for every f ∈ C_0(R^m, R^n) and every ε > 0 there exists g̃ ∈ C_0(R^m, R^n) with representation g̃ = g e^{−b/(b−‖·‖^2)} + a I_{‖·‖<b} + a e^{−|g(·)|(‖·‖−b)} I_{‖·‖≥b}, where the absolute value |·| is applied component-wise, g ∈ NN(F, ), and a, b > 0, satisfying the uniform approximation bound sup_{x∈R^m} ‖f(x) − g̃(x)‖ < ε.

Proof of Lemma 4 Let (F, ) be a universal approximator on C(R^m, R^n), let f ∈ C_0(R^m, R^n), and let ε > 0. Since f vanishes at infinity, there exists some non-empty compact K_{ε,f} ⊆ R^m for which ‖f(x)‖ ≤ 2^{−1}ε for every x ∉ K_{ε,f}. By the Heine–Borel theorem, K_{ε,f} is bounded, and therefore there exists some b_ε > 0 such that K_{ε,f} ⊆ Ball(0, b_ε). Since the bump function x ↦ e^{−1/(1−x^2)} I_{|x|<1} is continuous, affine functions are continuous, f ∈ C(R^m, R^n), and the composition and multiplication of continuous functions is again continuous, the function x ↦ f(x) e^{−b_ε/(b_ε−‖x‖^2)} I_{‖x‖<b_ε} is itself continuous. Observe also that the set Ball(0, b_ε) = {x ∈ R^m : ‖x‖ ≤ b_ε} is closed and bounded.

Proof of Theorem 6 For each ω ∈ Ω, define the map Φ_ω : C_0(R^m, R^n) → C_ω(R^m, R^n) by Φ_ω(f) := (ω(‖·‖) + 1) f. For each f, g ∈ C_0(R^m, R^n) we compute ‖Φ_ω(f) − Φ_ω(g)‖_{ω,∞} = ‖f − g‖_∞. Therefore, for each ω ∈ Ω, the map Φ_ω is an isometry. For each ω ∈ Ω, define the map Ψ_ω : C_ω(R^m, R^n) → C_0(R^m, R^n) by Ψ_ω(f̃) := f̃ / (ω(‖·‖) + 1). For each f̃ ∈ C_ω(R^m, R^n) we compute Φ_ω ∘ Ψ_ω(f̃) = f̃. Hence, Ψ_ω is a right-inverse of Φ_ω. Since every isometry is a homeomorphism onto its image, and since Φ_ω is a surjective isometry, Φ_ω defines a homeomorphism from C_0(R^m, R^n) onto C_ω(R^m, R^n). In particular, Φ_ω(C_0(R^m, R^n)) = C_ω(R^m, R^n). Therefore, ∪_{ω∈Ω} Φ_ω(C_0(R^m, R^n)) = C_Ω(R^m, R^n). Hence, condition (5) holds.
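The isometry Φ_ω and its right-inverse Ψ_ω can be verified numerically on a grid. The norm form ‖f‖_{ω,∞} = sup_x ‖f(x)‖ / (ω(‖x‖) + 1) is the assumed reconstruction used here, and the weight ω(t) = t² is an illustrative choice:

```python
import numpy as np

xs = np.linspace(-50, 50, 20001)
omega = lambda t: t ** 2                 # one admissible weight function

f = np.exp(-np.abs(xs))                  # f in C_0: vanishes at infinity
sup_norm = float(np.max(np.abs(f)))      # ||f||_inf = 1, attained at x = 0

Phi_f = (omega(np.abs(xs)) + 1.0) * f    # Phi_omega(f) = (omega(|.|)+1) f
weighted = float(np.max(np.abs(Phi_f) / (omega(np.abs(xs)) + 1.0)))

print(sup_norm, weighted)                # equal: Phi_omega is an isometry

# and Psi_omega (division by omega(|.|)+1) inverts Phi_omega:
print(np.allclose(Phi_f / (omega(np.abs(xs)) + 1.0), f))  # True
```

The weighted norm of Φ_ω(f) matches the sup-norm of f up to floating-point rounding, which is the isometry property the proof relies on.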
Since it was assumed that $\sup_{x \in \mathbb{R}^m} \|f(x)\| e^{-\|x\|} < \infty$ holds, Lemma 4 applies; whence,
$$
\Big\{ \Big( g\, e^{-\frac{b^2}{b^2 - \|\cdot\|^2}} + a \Big) I_{\|\cdot\| < b} + a\, e^{-|g(\cdot)|(\|\cdot\| - b)}\, I_{\|\cdot\| \ge b} \;:\; 0 < b, a,\; g \in NN(\mathcal{F}, \rho) \Big\}
$$
is dense in $C_0(\mathbb{R}^m, \mathbb{R}^n)$. Therefore, the conditions of Theorem 2 are met. Hence,
$$
\bigcup_{\omega \in \Omega} \Phi_\omega \Big[ \Big\{ \Big( g\, e^{-\frac{b^2}{b^2 - \|\cdot\|^2}} + a \Big) I_{\|\cdot\| < b} + a\, e^{-|g(\cdot)|(\|\cdot\| - b)}\, I_{\|\cdot\| \ge b} \;:\; 0 < b, a,\; g \in NN(\mathcal{F}, \rho) \Big\} \Big] \quad (47)
$$
is dense in $C_\Omega(\mathbb{R}^m, \mathbb{R}^n)$. By definition, (47) is a subset of $NN(\mathcal{F}, \rho)$, and therefore $NN(\mathcal{F}, \rho)$ is dense in $C_\Omega(\mathbb{R}^m, \mathbb{R}^n)$. Hence, $\mathcal{F}, \rho$ is a universal approximator on $C_\Omega(\mathbb{R}^m, \mathbb{R}^n)$.
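The representation in Lemma 4 glues a bump-damped copy of the network inside the ball of radius $b$ to an exponentially decaying tail outside it, and both pieces meet at the boundary value $a$. The following sketch (my own reconstruction of the formula above; the function name and test values are illustrative, not from the paper) checks this continuity and the decay at infinity numerically in one dimension:

```python
import math

def modified_net(g, a, b, x):
    """Reconstructed Lemma-4 representation (one-dimensional sketch):
    inside |x| < b, multiply g by a bump that vanishes on the boundary and
    shift by a; outside, decay exponentially away from the boundary value a."""
    if abs(x) < b:
        bump = math.exp(-b * b / (b * b - x * x))
        return g(x) * bump + a
    return a * math.exp(-abs(g(x)) * (abs(x) - b))

g = lambda x: 3.0 * x + 1.0   # stand-in for a network output g in NN(F, rho)
a, b = 0.25, 2.0

# Continuity across the boundary |x| = b: both one-sided values equal a,
# since the bump vanishes there and the tail starts at a * exp(0).
inside = modified_net(g, a, b, b - 1e-8)
outside = modified_net(g, a, b, b + 1e-8)
assert abs(inside - a) < 1e-3 and abs(outside - a) < 1e-3

# Vanishing at infinity: the tail a * exp(-|g(x)|(|x| - b)) decays to 0.
assert modified_net(g, a, b, 50.0) < 1e-12
```

This is exactly why the modified architecture lands in $C_0(\mathbb{R}^m, \mathbb{R}^n)$ even though $g$ itself may be unbounded.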
Proof of Proposition 2 For each $k, m \in \mathbb{N}$ with $k \le m$, we have that $\exp(-kt) > \exp(-mt)$ for every $t \in (0, \infty)$. Thus,
$$
C_{\exp(-k\cdot)}(\mathbb{R}^m, \mathbb{R}^n) \subseteq C_{\exp(-m\cdot)}(\mathbb{R}^m, \mathbb{R}^n), \quad (48)
$$
and the inclusion is strict if $k < m$. Moreover, for $k \le m$, the inclusion $i_k^m$ of $C_{\exp(-k\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ into $C_{\exp(-m\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ is continuous. Thus, $\big( C_{\exp(-k\cdot)}(\mathbb{R}^m, \mathbb{R}^n), i_k^m \big)_{k \in \mathbb{N}}$ is a strict inductive system of Banach spaces. Therefore, by [83, Proposition 4.5.1], there exists a finest topology on $\bigcup_{k \in \mathbb{N}} C_{\exp(-k\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ both making it into a locally-convex space and ensuring that each $C_{\exp(-k\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ is a subspace. Denote $\bigcup_{k \in \mathbb{N}} C_{\exp(-k\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ equipped with this topology by $C_{LCS}(\mathbb{R}^m, \mathbb{R}^n)$.
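The strictness of the inclusion (48) can be seen numerically (an illustration of mine, not from the paper): a function growing at a rate strictly between the weights $k = 1$ and $m = 2$, such as $h(x) = e^{1.5|x|}$, has bounded weighted values against $e^{-2|x|}$ but unbounded ones against $e^{-|x|}$:

```python
import math

def weighted_sup(f, k, xs):
    """Approximate sup_x |f(x)| * exp(-k|x|) over the sample points xs."""
    return max(abs(f(x)) * math.exp(-k * abs(x)) for x in xs)

h = lambda x: math.exp(1.5 * abs(x))    # growth strictly between k=1 and m=2
xs = [0.1 * t for t in range(0, 2001)]  # sample points in [0, 200]

# h lies in C_{exp(-2|x|)}: the weighted values stay bounded (they decay).
assert weighted_sup(h, 2, xs) <= 1.0
# h does not lie in C_{exp(-1|x|)}: the weighted values blow up.
assert weighted_sup(h, 1, xs) > 1e40
```

Any growth rate strictly between $k$ and $m$ produces such a separating example, which is what makes the inductive system strict.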
If $f \in C_{LCS}(\mathbb{R}^m, \mathbb{R}^n)$, then by construction there must exist some $K \in \mathbb{N}$ such that $f \in C_{\exp(-K\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$. By [84, Propositions 2 and 4], a sequence $\{f_t\}_{t \in \mathbb{N}}$ converges to some $f$ if and only if there exist some $K \in \mathbb{N}$ and some $N_K \in \mathbb{N}_+$ such that, for every $t \ge N_K$, $f_t \in C_{\exp(-K\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ and the sub-sequence $\{f_t\}_{t \ge N_K}$ converges in the Banach topology of $C_{\exp(-K\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$ to $f$. In particular, since $C_{\exp(-0\cdot)}(\mathbb{R}^m, \mathbb{R}^n) = C_0(\mathbb{R}^m, \mathbb{R}^n)$, the function $f(x) := (\exp(-|x|), \dots, \exp(-|x|))$ belongs to $C_{\exp(-0\cdot)}(\mathbb{R}^m, \mathbb{R}^n)$. Since each $g \in NN(\mathcal{F}, \rho)$ is either constant or satisfies $\sup_{x \in \mathbb{R}^m} \|g(x)\| = \infty$, for any sequence $\{f_t\}_{t \in \mathbb{N}} \subseteq NN(\mathcal{F}, \rho)$ there exists some $N_0 \in \mathbb{N}_+$ for which the sub-sequence $\{f_t\}_{t \ge N_0}$ lies in $C_{\exp(-0\cdot)}(\mathbb{R}^m, \mathbb{R}^n) = C_0(\mathbb{R}^m, \mathbb{R}^n)$ if and only if, for each $t \ge N_0$, the map $f_t$ is constant. Therefore, for each $t \ge N_0$, writing $f_t \equiv c_t$, we compute that
$$
\|f_t - f\|_\infty \ge \max\{\|c_t\|, \|c_t - (1, \dots, 1)\|\} \ge 2^{-1}\sqrt{n} > 0,
$$
since $f(0) = (1, \dots, 1)$ and $f(x) \to 0$ as $\|x\| \to \infty$. Hence, $f_t$ cannot converge to $f$ in $C_{LCS}(\mathbb{R}^m, \mathbb{R}^n)$, and therefore $\mathcal{F}, \rho$ does not have the UAP on $C_{LCS}(\mathbb{R}^m, \mathbb{R}^n)$.
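The lower bound $2^{-1}\sqrt{n}$ used above follows from the triangle inequality: the target takes the value $(1, \dots, 1)$ at the origin and tends to $(0, \dots, 0)$ at infinity, so no constant can be uniformly close to both. A small numerical check (mine; the sampling grid and candidate constants are illustrative):

```python
import math

def sup_dist_to_constant(c, n, xs):
    """Approximate sup_x || (c,...,c) - (e^{-|x|}, ..., e^{-|x|}) || over the
    sample points xs, in the Euclidean norm on R^n."""
    return max(math.sqrt(n) * abs(c - math.exp(-abs(x))) for x in xs)

n = 3
xs = [0.05 * t for t in range(0, 1001)]  # samples of [0, 50], including x = 0
lower = math.sqrt(n) / 2.0               # claimed uniform lower bound

# No constant vector gets closer than sqrt(n)/2 in the supremum norm.
for c in [0.0, 0.25, 0.5, 0.75, 1.0, -1.0, 2.0]:
    assert sup_dist_to_constant(c, n, xs) >= lower - 1e-9
```

The bound is tight at $c = 1/2$, where both extreme values of the target are equally far away.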

Proof of Corollary 7 Let $X := \mathbb{R}$ and $X_0 := L^\infty(\mathbb{R})$. Since every Banach space is a pointed metric space, with its zero vector as reference point, and since $\mathbb{R}$ is separable, Theorem 4 applies. We only need to verify the form of $\eta$ and of $\rho$. Indeed, the identification of $B(\mathbb{R})$ with $L^1(\mathbb{R})$ and the explicit description of $\eta$ are constructed in [32, Example 3.11]. The fact that $L^\infty(\mathbb{R})$ is barycentric follows from the fact that it is a Banach space, together with [31, Lemma 2.4].