High-dimensional distribution generation through deep neural networks

We show that every d-dimensional probability distribution of bounded support can be generated through deep ReLU networks out of a 1-dimensional uniform input distribution. What is more, this is possible without incurring a cost, in terms of approximation error measured in Wasserstein distance, relative to generating the d-dimensional target distribution from d independent random variables. This is enabled by a vast generalization of the space-filling approach discovered in Bailey and Telgarsky (in: Bengio (eds) Advances in neural information processing systems vol 31, pp 6489-6499. Curran Associates, Inc., Red Hook, 2018). The construction we propose elicits the importance of network depth in driving the Wasserstein distance between the target distribution and its neural network approximation to zero. Finally, we find that, for histogram target distributions, the number of bits needed to encode the corresponding generative network equals the fundamental limit for encoding probability distributions as dictated by quantization theory.


I. INTRODUCTION
Deep neural networks have been employed very successfully as generative models for complex natural data such as images [21], [14] and natural language [4], [26]. Specifically, the idea is to train deep networks so that they realize complex high-dimensional probability distributions by transforming samples taken from simple low-dimensional distributions such as uniform or Gaussian [15], [11], [1].
Generative networks with output dimension higher than the input dimension occur, for instance, in language modelling where deep networks are used to predict the next word in a text sequence. Here, the input layer size is determined by the dimension of the word embedding (typically ∼ 100) and the output layer, representing a vector of probabilities for each of the words in the vocabulary, is of the size of the vocabulary (typically ∼ 100k). Another example where the dimension of the output distribution is mandated to be higher than that of the input distribution is given by explicit [15], [24] and implicit [11], [1] density generative networks.
Notwithstanding the practical success of deep generative networks, a profound theoretical understanding of their representational capabilities is still lacking. First results along these lines appear in [16], where it was shown that generative networks can approximate distributions arising from the composition of Barron functions [3]. It remains unclear, however, which distributions can be obtained in such a manner. More recently, it was established [17] that for every given target distribution (of finite third moment) and source distribution, both defined on R^d, there exists a ReLU network whose gradient pushes forward the source distribution to an arbitrarily accurate approximation, in terms of Wasserstein distance, of the target distribution. The aspect of dimensionality increase was addressed in [2], where it is shown that a uniform univariate source distribution can be transformed, by a ReLU network, into a uniform target distribution of arbitrary dimension through a space-filling approach. In addition, [2] demonstrates how a univariate Gaussian target distribution can be obtained from a univariate uniform source distribution and vice versa. In a general context, the problem of optimal transport [25] between source and target distributions on spaces of different dimensions was studied in [18].
The approximation of distributions through generative networks is inherently related to function approximation and hence to the expressivity of neural networks. A classical result along those lines is the universal approximation theorem [7], [13], which states that single-hidden-layer neural networks with sigmoidal activation function can approximate continuous functions on compact subsets of R n arbitrarily well. More recent developments in this area are concerned with the influence of network depth on attainable approximation quality [23], [8], [10]. A theory establishing the fundamental limits of deep neural network expressivity is provided in [6], [9].
The aim of the present paper is to develop a universal approximation result for generative neural networks. Specifically, we show that every target distribution supported on a bounded subset of R^d can be approximated arbitrarily well in terms of Wasserstein distance by pushing forward a 1-dimensional uniform source distribution through a ReLU network. The result is constructive in the sense of explicitly identifying the corresponding generative network. Concretely, we proceed in two steps. Given a target distribution, we find the histogram distribution that best approximates it, for a given histogram resolution, in Wasserstein distance. This histogram distribution is then realized by a ReLU network driven by a uniform univariate input distribution. The construction of this ReLU network exploits a space-filling property, vastly generalizing the one discovered in [2], [19]. The main conceptual insight of the present paper is that generating arbitrary d-dimensional target distributions from a 1-dimensional uniform distribution through a deep ReLU network does not come at a cost, in terms of approximation error measured in Wasserstein distance, relative to generating the target distribution from d independent random variables through, e.g., (for arbitrary d) the normalizing flows method [22] or (for d = 2) the Box-Muller method [5]. We emphasize that the generating network has to be deep; in fact, its depth has to go to infinity to obtain the same Wasserstein-distance error as a construction from d independent random variables would yield. Finally, we find that, for histogram target distributions, the number of bits needed to encode the corresponding generative network equals the fundamental limit for encoding probability distributions as dictated by quantization theory [12].

II. DEFINITIONS AND NOTATION
We start by introducing general notation. log stands for the logarithm to base 2. For n_1, n_2 ∈ N, the set of integers in the range [n_1, n_2] is designated as [n_1 : n_2]. For x = (x_1, x_2, . . . , x_d) ∈ R^d, we denote the vector obtained by retaining the first t, t ≤ d, entries by x_{[1:t]} := (x_1, x_2, . . . , x_t) ∈ R^t. U(∆) stands for the uniform distribution on the interval ∆; when ∆ = [0, 1], we simply write U. Given a probability density function (pdf) p, we denote its pushforward under the function f as f#p. δ_x refers to the Dirac delta distribution. B^d stands for the Borel σ-algebra on R^d, i.e., the smallest σ-algebra on R^d that contains all open subsets of R^d. For a vector b ∈ R^d, we let ‖b‖_∞ := max_i |b_i|; similarly, we write ‖A‖_∞ := max_{i,j} |A_{i,j}| for the matrix A ∈ R^{m×n}. The Cartesian product of the intervals I_i, i ∈ [1 : d], is denoted by ×_{i=1}^d I_i = I_1 × I_2 × · · · × I_d. Finally, χ_I stands for the indicator function on the set I.
We proceed to define ReLU neural networks.
We will frequently use the concept of histogram distributions formalized as follows.
Definition II.3. A random variable X is said to have a general histogram distribution of resolution n on [0, 1], denoted as X ∼ G[0, 1]^1_n, if, for some t_0, t_1, . . . , t_n ∈ R with 0 = t_0 ≤ t_1 ≤ · · · ≤ t_n = 1 and such that t_i = t_{i+1} implies t_{i+2} ≠ t_{i+1}, ∀i ∈ [0 : (n − 2)], its pdf is a mixture of the uniform densities on the non-degenerate intervals [t_k, t_{k+1}] and of Dirac deltas δ_{t_k} located at the points where t_k = t_{k+1}, with mixture weights w_k > 0, for all k ∈ [0 : (n − 1)], summing to 1.

General histogram distributions allow for bins of arbitrary size and for point singularities, see the right-hand plot in Figure 4. We will, however, mostly be concerned with histogram distributions of uniform tile size, defined as follows.
Definition II.4. A random vector X = (X_1, X_2, . . . , X_d)^⊤ is said to have a histogram distribution of resolution n on the d-dimensional unit cube, denoted as X ∼ E[0, 1]^d_n, if its pdf is given by

p_X(x) = Σ_{k ∈ [0:(n−1)]^d} w_k χ_{c_k}(x),

where the k = (k_1, k_2, . . . , k_d) ∈ [0 : (n − 1)]^d are referred to as index vectors, c_k := ×_{i=1}^d [k_i/n, (k_i + 1)/n], and the weights w_k ≥ 0 are normalized so that p_X integrates to 1.
Example histogram distributions in the 1- and 2-dimensional cases are depicted in Figs. 1 and 2, respectively. Throughout the paper, to indicate that the random vector X with distribution p_X(x) satisfies X ∼ E[0, 1]^d_n, we shall frequently also write p_X(x) ∈ E[0, 1]^d_n.

Remark II.5. For ease of exposition, in Definitions II.3 and II.4, we let the intervals [t_k, t_{k+1}] and the cubes c_k, respectively, be closed, thus allowing the breakpoints to belong to different intervals/cubes. While this comes without loss of generality, for concreteness, it is understood that the value of the pdf at a breakpoint is given by the average across the intervals/cubes containing the breakpoint.
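To make Definition II.4 concrete, the following minimal Python sketch draws samples from a histogram distribution of resolution n on [0, 1]^d. It is our own illustration under a simplified parameterization (cells are specified directly by their probability masses rather than by density weights), and none of the identifiers are from the paper.

```python
import numpy as np

def sample_histogram(masses, size, rng=None):
    """Draw samples from a histogram distribution of resolution n on [0, 1]^d.
    `masses` is an n x ... x n array whose entry at index k is the probability mass
    of the cube c_k = prod_i [k_i / n, (k_i + 1) / n]; within each cube the
    distribution is uniform."""
    rng = rng or np.random.default_rng()
    masses = np.asarray(masses, dtype=float)
    n, d = masses.shape[0], masses.ndim
    flat = rng.choice(masses.size, size=size, p=masses.ravel())       # pick a cube
    corners = np.stack(np.unravel_index(flat, masses.shape), axis=1) / n
    return corners + rng.uniform(0.0, 1.0 / n, size=(size, d))        # uniform within the cube

masses = np.array([[0.1, 0.3], [0.4, 0.2]])   # d = 2, n = 2, masses sum to 1
print(sample_histogram(masses, size=5, rng=np.random.default_rng(0)))
```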

III. SAWTOOTH FUNCTIONS
As mentioned above, our universal generative network construction is based on a new space-filling property of ReLU networks, vastly generalizing the one discovered in [2], [19]. At the heart of this idea are higher-order sawtooth functions obtained as follows. Consider the sawtooth function g : R → [0, 1] given by g(x) = 2x for x ∈ [0, 1/2], g(x) = 2(1 − x) for x ∈ (1/2, 1], and g(x) = 0 otherwise, and define the sawtooth function of order s as the s-fold composition of g with itself, i.e., g_s := g ∘ g ∘ · · · ∘ g. Figure 3 depicts the sawtooth functions of orders 1, 2, and 3. Next, we note that g can be realized by a 2-layer ReLU network, namely g(x) = 2ρ(x) − 4ρ(x − 1/2) + 2ρ(x − 1), where ρ denotes the ReLU nonlinearity. The sth-order sawtooth function g_s can hence be realized by a ReLU network Φ^g_s obtained by composing s copies of this 2-layer network. We close this section with an important technical ingredient of the generalized space-filling idea presented in Section V.
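The following short Python sketch (our own illustration, not part of the paper's construction) implements g as a sum of three ReLUs and g_s as the s-fold composition; it is convenient for experimenting with the sawtooth functions used throughout this and the next sections.

```python
import numpy as np

def g(x):
    """Sawtooth (tent) map: 2x on [0, 1/2], 2(1 - x) on [1/2, 1], 0 elsewhere,
    written as a sum of three ReLUs, mirroring a 2-layer ReLU network."""
    relu = lambda t: np.maximum(t, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def g_s(x, s):
    """Sawtooth of order s: the s-fold composition of g with itself (2^(s-1) teeth on [0, 1])."""
    y = np.asarray(x, dtype=float)
    for _ in range(s):
        y = g(y)
    return y

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 9)
    print(g_s(x, 1))   # one tooth
    print(g_s(x, 3))   # four teeth, eight linear pieces
```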
Lemma III.1. Let f(x) be a continuous function on [0, 1] with f(0) = 0. Then, for all s ∈ N and all k ∈ [0 : (2^{s−1} − 1)],

f(g_s(x)) = Σ_{j=0}^{2^{s−1}−1} f(g(2^{s−1}x − j)), for x ∈ [0, 1],   (3)

and

f(g_s(x)) = f(g(2^{s−1}x − k)), for x ∈ [k 2^{−(s−1)}, (k + 1) 2^{−(s−1)}].   (4)

Proof. We first note that the sawtooth functions g_s(x) satisfy [23] g_s(x) = Σ_{j=0}^{2^{s−1}−1} g(2^{s−1}x − j), for x ∈ [0, 1], and that the restriction of g_s to [k 2^{−(s−1)}, (k + 1) 2^{−(s−1)}] coincides with that of g(2^{s−1}x − k), which in turn yields (4). To establish (3), we note that the supports of g(2^{s−1}x − k) are pairwise disjoint across k and hence, thanks to f(0) = 0, summing the right-hand side of (4) over k yields f(g_s(x)) on all of [0, 1].

IV. RELU NETWORKS GENERATE HISTOGRAM DISTRIBUTIONS
This section establishes a systematic connection between ReLU networks and histogram distributions. Specifically, we show that the pushforward of a uniform distribution under a piecewise linear function results in a histogram distribution. We also identify, for a given histogram distribution, the piecewise linear function generating it under pushforward of a uniform distribution. Combined with the insight that ReLU networks realize piecewise linear functions, the desired connection is established.
We start with an auxiliary result, Lemma IV.1, describing the pushforward of a pdf supported on an interval [a, b] under an affine function f(x) = mx + s.

Proof. We start with the case m ∈ R \ {0}. The pdf of the pushforward of a random variable with pdf p(x) under the function f(x) = mx + s is given by q(y) = (1/|m|) p((y − s)/m), supported on [ma + s, mb + s] for m > 0 and on [mb + s, ma + s] for m < 0. Finally, if m = 0, then the entire interval [a, b] is mapped to the point y = s and the corresponding pdf is given by q(y) = δ(y − s).
We next show that the pushforward of a uniform distribution under a piecewise linear function results in a (general) histogram distribution.
Using the law of total probability and the chain rule, the pdf of q = f#U can accordingly be represented as a weighted sum, over the intervals of linearity I_j of f, of the conditional pdfs q(y | u ∈ I_j), with weights P(U ∈ I_j) = |I_j|; this is the representation (5). As U is uniform, it is also uniform conditional on being in a given interval I_j. By Lemma IV.1 it therefore follows that q(y | u ∈ I_j) can be written as a uniform density (or, on intervals where f is constant, a Dirac delta) on the image R_j = f(I_j). Noting that, by continuity of f and the boundary conditions f(0) = 0, f(1) = 1, we have ∪_j R_j = [0, 1], it follows that q(y) in (5) corresponds to a general histogram distribution according to (1) with n = t.

We will also need the converse to the result just established, in particular a constructive version thereof explicitly identifying the piecewise linear function that leads to a given general histogram distribution under pushforward of a uniform distribution on the interval [0, 1].
Proof. The function defined in (6) is linear on each I_i, with slope given by the ratio of the lengths of R_i and I_i.

An example of a piecewise linear function and the corresponding general histogram distribution according to Theorems IV.2 and IV.3 is provided in Figure 4. Theorems IV.2 and IV.3 are of independent interest as they allow us to conclude that ReLU networks, by virtue of always realizing piecewise linear functions, produce general histogram distributions when pushing forward uniform distributions. In the remainder of the paper, we shall, however, work with histogram distributions E[0, 1]^d_n only, in particular for d = 1, in which case Theorem IV.3 takes on a simpler form spelled out next.
We shall often need the explicit form of f(x) in Corollary IV.4 on its intervals of linearity [b_ℓ, b_{ℓ+1}]. A direct calculation reveals that, for x ∈ [b_ℓ, b_{ℓ+1}], f is the affine function mapping [b_ℓ, b_{ℓ+1}] onto [ℓ/n, (ℓ + 1)/n], i.e., f(x) = ℓ/n + (x − b_ℓ)/(n(b_{ℓ+1} − b_ℓ)), where b_ℓ denotes the probability mass the target histogram assigns to [0, ℓ/n].
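As an illustration of Corollary IV.4 and of the explicit form just given, the following Python sketch builds the piecewise linear transport map from prescribed histogram weights and checks empirically that pushing U through it reproduces those weights. The function name and the parameterization by bin probabilities are our own assumptions.

```python
import numpy as np

def histogram_transport_map(w):
    """Piecewise linear f: [0,1] -> [0,1] with f#U equal to the histogram of resolution
    n = len(w) whose bin [l/n, (l+1)/n] carries probability w[l]. The breakpoints are
    b_l = w[0] + ... + w[l-1]; on [b_l, b_{l+1}] the slope is 1 / (n * w[l])."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    b = np.concatenate(([0.0], np.cumsum(w)))        # breakpoints in the domain
    return lambda x: np.interp(x, b, np.arange(n + 1) / n)

f = histogram_transport_map([0.5, 0.1, 0.4])          # example bin probabilities
u = np.random.default_rng(1).uniform(size=100_000)
print(np.histogram(f(u), bins=3, range=(0, 1))[0] / 100_000)   # ~ [0.5, 0.1, 0.4]
```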

V. INCREASING DISTRIBUTION DIMENSIONALITY
This section is devoted to the aspect of distribution dimensionality increase through deep ReLU networks. Specifically, for a given random vector X ∼ E[0, 1]^d_n with (histogram) distribution p_X(x) of resolution n, we construct a piecewise linear map M : [0, 1] → [0, 1]^d such that the pushforward M#U approximates p_X(x) arbitrarily well. The main ingredient of our construction is a vast generalization of the space-filling property of sawtooth functions discovered in [19]. Informally speaking, the novel space-filling property we describe allows us to fill the d-dimensional target space completely, according to a prescribed target histogram distribution, by transporting probability mass from the 1-dimensional uniform distribution U to d-dimensional space.
We first develop some intuition behind this construction. Specifically, we consider the 2-dimensional case and visualize the idea of approximating a 2-dimensional target distribution through pushforward of U by the sawtooth functions g_s(x) (depicted in Figure 3) as painting the curve g_s(x) with probability mass taken from U. The geodesic distance traveled by the brush distributing probability mass along g_s(x) grows as 2^s as s → ∞. This follows by noting that g_s has 2^{s−1} teeth and that, as s → ∞, the length of each of the two halves of an individual tooth approaches 1. Therefore, as s → ∞, the square [0, 1]^2 is filled with paint completely. Moreover, as x traverses from 0 to 1, the speed at which probability mass is allocated to the marginal dimensions, i.e., along the x_1- and x_2-axes, is constant. To see this, simply note that along the x_2-axis the speed of the brush is given by the derivative of g_s(x), which, by virtue of g_s(x) consisting of piecewise linear segments of equal absolute slope, is constant in magnitude. Likewise, as the inverse of a linear function is again a linear function, the brush moves with constant speed along the x_1-axis as well. This guarantees that the resulting 2-dimensional probability distribution along with its marginals and conditional distributions are all uniform. The rate at which the joint distribution approaches a 2-dimensional uniform distribution can be quantified in terms of Wasserstein distance by defining the transport map M : x → (x, g_s(x)) and noting that W(M#U, U([0, 1]^2)) ≤ √2/2^s [2]. What is noteworthy here is that the map M takes probability mass from R to R^2 in a space-filling fashion, i.e., we get a dimensionality increase as s → ∞.
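The space-filling behavior of M : x → (x, g_s(x)) can be checked numerically. The following sketch (our own, with an ad hoc uniformity measure) pushes samples of U through M and reports how evenly the unit square is covered as s grows.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def g_s(x, s):
    y = np.asarray(x, dtype=float)
    for _ in range(s):
        y = 2 * relu(y) - 4 * relu(y - 0.5) + 2 * relu(y - 1.0)
    return y

# Push U through M: x -> (x, g_s(x)) and check how evenly [0, 1]^2 gets covered.
rng = np.random.default_rng(0)
u = rng.uniform(size=500_000)
for s in (1, 2, 4):
    counts, _, _ = np.histogram2d(u, g_s(u, s), bins=8, range=[[0, 1], [0, 1]])
    # relative spread of cell occupancies; shrinks towards 0 as s grows
    print(f"s = {s}: {counts.std() / counts.mean():.3f}")
```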
By adjusting the "paint plan", this idea can now be generalized to 2-dimensional histogram target distributions that are constant with respect to one of the dimensions, here, for concreteness, the x 1 -dimension. Specifically, we replace g s (x) in the construction above by f (g s (x)), where the piecewise linear function f (x) determines the paint plan resulting in the desired weights (across the x 2 -axis) according to Corollary IV.4. We refer to Figure 5 for an illustration of the idea. While the outer function f (x) determines how much time the paint brush spends in a given interval along the x 2 -axis, the inner function g s (x) takes care of filling the unit square as s → ∞. The larger the slope of f (x) on a given interval along the x 2 -axis, the less time the brush spends in that interval and the smaller the amount of probability mass allocated to the interval. Concretely, by Corollary IV.4 the amount of probability mass deposited in a given interval is inversely proportional to the slope of f (x) on that interval.
Finally, consider the 2-dimensional histogram distribution p_{X_1,X_2}(x_1, x_2) ∈ E[0, 1]^2_n and note that, by the chain rule, it can be represented as in (7), i.e., as the combination of the marginal histogram distribution of X_1 with the conditional histogram distributions of X_2 given the x_1-bin, where w_{k_1} = (1/n) Σ_{k_2} w_{k_1,k_2}, w_{k_2|k_1} = w_{k_1,k_2}/w_{k_1}, p_{X_1}(x_1 ∈ [i/n, (i + 1)/n]) = w_i χ_{c_i}(x_1) denotes the restriction of the marginal histogram distribution p_{X_1} (see Lemma V.1 below) to the interval [i/n, (i + 1)/n], and p_{X_2|X_1}(x_2 | x_1 ∈ [i/n, (i + 1)/n]) = Σ_{k_2} w_{k_2|i} χ_{c_{k_2}}(x_2), for each i ∈ [0 : (n − 1)], can be viewed as a 2-dimensional histogram distribution that is constant with respect to x_1, and which we assume to be generated by f^{(i)}_{X_2}(g_s(x)) according to the procedure described in the previous paragraph. Now, in order to "paint" the general 2-dimensional histogram distribution in (7), we have to "squeeze" the space-filling curves f^{(i)}_{X_2}(g_s(·)) into the respective bins [i/n, (i + 1)/n] along the x_1-axis. The resulting transport map M, whose first component is f_{X_1}(x) and whose second component applies, on the i-th bin, f^{(i)}_{X_2}(g_s(·)) to a suitably rescaled version of the first component, generates p_{X_1,X_2} in (7) asymptotically in s when applied to U. To see this, we first note that the second component of M pushes forward the random variable f_{X_1}#U resulting from the pushforward of U by the first component of M. This allows us to read off the conditional distributions p_{X_2|X_1}; combining them, through the chain rule, with the marginal distribution p_{X_1} recovers (7). An example illustrating the overall construction can be found in Figure 6.
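The following Python sketch illustrates our reading of the 2-dimensional construction: the first coordinate applies the marginal shaping map f_{X_1}, and, on the i-th x_1-bin, the second coordinate applies the bin-specific map f^{(i)}_{X_2} to g_s of the rescaled first coordinate. The rescaling n f_{X_1}(x) − i and all identifiers are our assumptions, intended only to convey the mechanism.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def g_s(x, s):
    # s-fold composition of the sawtooth g, written via ReLUs
    y = np.asarray(x, dtype=float)
    for _ in range(s):
        y = 2 * relu(y) - 4 * relu(y - 0.5) + 2 * relu(y - 1.0)
    return y

def inv_cdf(w):
    # piecewise linear map whose pushforward of U is the histogram with bin probabilities w
    b = np.concatenate(([0.0], np.cumsum(w)))
    return lambda x: np.interp(x, b, np.arange(len(w) + 1) / len(w))

# Target 2-d histogram: W[i, j] = P(X1 in bin i, X2 in bin j); an arbitrary example.
W = np.array([[0.10, 0.30],
              [0.45, 0.15]])
n = W.shape[0]
w1 = W.sum(axis=1)                                      # bin probabilities of X1
f_x1 = inv_cdf(w1)
f_x2_given = [inv_cdf(W[i] / w1[i]) for i in range(n)]  # one shaping map per x1-bin

def M(x, s=8):
    x1 = f_x1(x)
    i = np.minimum((n * x1).astype(int), n - 1)         # x1-bin index
    t = g_s(n * x1 - i, s)                               # assumed rescaling, then space filling
    x2_all = np.stack([f(t) for f in f_x2_given], axis=1)
    x2 = x2_all[np.arange(len(x1)), i]                   # pick the bin-specific shaping map
    return np.stack([x1, x2], axis=1)

u = np.random.default_rng(2).uniform(size=200_000)
pts = M(u)
est = np.histogram2d(pts[:, 0], pts[:, 1], bins=n, range=[[0, 1], [0, 1]])[0] / len(u)
print(np.round(est, 3))   # close to W
```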
We are now ready to formalize the idea just described and generalize it to target distributions of arbitrary dimension. To this end, we start with a technical lemma stating that all marginal and conditional distributions of a d-dimensional histogram distribution are themselves histogram distributions, a result that was already used implicitly in the description of our main idea in the 2-dimensional case above.
Proof. The proof of the first statement follows by noting that, for all t ∈ [1 : d], the weights of the marginal distribution of X_{[1:t]} are obtained by summing the weights w_k over the remaining coordinates, where z = k_{[1:t]} = (z_1, z_2, . . . , z_t) with k according to Definition II.4. To prove the second statement, we first note that, for all t ∈ [1 : (d − 1)], the conditional weights are given by ratios of consecutive marginal weights.
Next, for a given z′ ∈ [1 : (n − 1)]^t, normalizing the corresponding weights allows us to conclude that, for all t ∈ [1 : (d − 1)], the conditional distributions are themselves histogram distributions of resolution n.

We next define the auxiliary functions F_r and Z_r needed in the construction of the d-dimensional generalization of the 2-dimensional transport map M. The F_r are defined recursively according to (9), with an initialization analogous to the 2-dimensional case, and the functions Z_r are obtained from the F_r through the per-bin shaping described above. With the quantities just defined, the space-filling transport map that takes a 1-dimensional uniform distribution into a given d-dimensional histogram distribution, or more precisely into a sufficiently accurate approximation thereof, will be seen to be given by (10). Theorem V.6 below, the central result of this section, makes this formal. The material from here on up to Theorem V.6 is preparatory and technical. We recommend skipping it at first reading and proceeding directly to Theorem V.6, in particular to the intuition behind the construction (10) provided right after the proof of Theorem V.6. (Formally, z_1, albeit not defined, would correspond to a 0-dimensional quantity; it is used throughout the paper only for notational convenience.)
We do recommend, however, first visiting Figure 7, which illustrates the F_r-functions and their role in generating the target histogram distribution.
The following lemma establishes support properties of the F r -functions and corresponding consequences for the Z r -functions.
Before proceeding, we need to introduce further notation. Specifically, we will need the function N(x, [a, b]) := b − (b − a)x, which maps [0, 1] affinely onto the interval [a, b], but does so in reverse manner, i.e., by mapping x = 0 to b and x = 1 to a. The next lemma establishes that the Z_r-functions indeed realize the per-bin histogram distributions constituting the desired target histogram distribution.
The proof of the induction step largely follows the arguments underlying the proof of the base case. Fix k ∈ N, with k ≥ 2, and assume that the claim holds for all z_k = (z_1, z_2, . . . , z_{k−1}). The explicit form of f^{z_k}_{X_k} on P^{z_k} follows from the remark after Corollary IV.4 and, thanks to (14), matches the induction assumption. Next, for x ∈ [0, 1], it follows from (9) and (19) that the claimed identity holds, where we used Lemma III.1 in (a), the fact that g(x/2 + h_k/2 − j) = 0, for all x ∈ [0, 1], for j ≠ ⌊h_k/2⌋ in (b), and h_k/2 − ⌊h_k/2⌋ = 0 for h_k ∈ 2N_0 and h_k/2 − ⌊h_k/2⌋ = 1/2 for h_k ∈ 2N_0 + 1 along with g(x) = g(1 − x), for x ∈ [0, 1], in (c). The boundary cases i) x = 0 with Σ_{i=1}^{k} h_i ∈ 2N_0 and ii) x = 1 with Σ_{i=1}^{k} h_i ∈ 2N_0 + 1 follow along the same lines as in the base case upon noting that F_k(N(0, T_k), z_{k+1}, s) = g_s(h_k/2^s) and F_k(N(1, T_k), z_{k+1}, s) = g_s((h_k + 1)/2^s).
This concludes the proof of the induction step and thereby the overall proof.
We continue with a corollary to Lemma V.4 complementing the results on |T k |, k ∈ [1 : (d − 1)], by the corresponding expression for |T d | and specifying the range of the Z r -functions on the domain T d .
Proof. We first prove the statement on |T_d|, which follows directly from Lemma V.4. This establishes the first statement.

Now, by (19), it follows, for Σ_{i=1}^{k−1} h_i ∈ 2N_0, that Z_k(x, s) ∈ [z_k/n + h_k/(2^s n), z_k/n + (h_k + 1)/(2^s n)], for all x ∈ T_k, and analogously, for Σ_{i=1}^{k−1} h_i ∈ 2N_0 + 1, by (21). We have hence shown that, for all k ∈ [1 : d], Z_k(x, s) ∈ [z_k/n + h_k/(2^s n), z_k/n + (h_k + 1)/(2^s n)], for all x ∈ T_k. The proof is completed upon noting that T_d ⊆ T_k, for all k ∈ [1 : d].
We are now ready to state the main result of this section, namely that the piecewise linear map M : x → (Z 1 (x, s), Z 2 (x, s), . . . , Z d (x, s)) transports a 1-dimensional uniform distribution in a space-filling manner to an arbitrarily close approximation of any high-dimensional histogram distribution.
Theorem V.6. For every distribution p_X(x) ∈ E[0, 1]^d_n, the corresponding transport map M : x → (Z_1(x, s), Z_2(x, s), . . . , Z_d(x, s)) satisfies W(M#U, p_X) ≤ √d/(n 2^s).

Proof. By the preceding lemmas, the map M transports probability mass (1/2^{sd}) p_X(x ∈ c_z) to the cube c^h_z of volume (1/(n 2^s))^d, for all z. As p_X is a histogram distribution, it is uniformly distributed on its constituent cubes c_z, which, in turn, implies that the amount of probability mass it exhibits on each subcube c^h_z of c_z is given by (1/2^{sd}) p_X(x ∈ c_z). The map M, when pushing forward U, therefore transports exactly the right amount of probability mass to each cube c^h_z for a coupling between p_X and M#U to exist. Combining this with ‖x − y‖ ≤ √d/(n 2^s), for all points x, y in a d-dimensional cube of side length n^{−1} 2^{−s}, it follows from Definition II.2 that W(M#U, p_X) ≤ √d/(n 2^s).

Theorem V.6 was proven in [19] for d = 2. We remark that a space-filling approach for increasing distribution dimensionality was first described by Bailey and Telgarsky in [2]. Specifically, the construction in [2] generates uniform target distributions of arbitrary dimension based on the transport map M : x → (x, g_s(x), g_{2s}(x), . . . ). The generalization introduced in this paper is capable of producing arbitrary histogram target distributions through space-filling transport maps that build on several key ideas, the first two of which are best illustrated by revisiting the 2-dimensional case and its transport map as constructed above: the inner space-filling sawtooth function g_s(·) is composed with outer piecewise linear shaping functions obtained from Corollary IV.4, and the shaping function acting on the second coordinate is localized to the individual histogram bins of the first coordinate, i.e., it is given by f^{(i)}_{X_2} on the i-th bin. This localization idea allows us to realize different marginal distributions for different horizontal histogram bins (see the rightmost subplot in Fig. 6) and is not present in the Bailey-Telgarsky construction as, owing to the target distribution being uniform, there is no concept of histogram distributions. Taken together, the two ideas just described allow us to generate arbitrary per-bin histogram distributions p_{X_2|X_1}(x_2 | x_1), which are then combined, through the chain rule, with the histogram distribution p_{X_1}(x_1) to form the overall target histogram distribution p_{X_1,X_2}(x_1, x_2).
A further idea underlying our transport map construction becomes transparent in the general d-dimensional case. Specifically, in taking the space-filling idea to higher dimensions, we note that in the Bailey-Telgarsky map M : x → (x, g_s(x), g_{2s}(x), . . . ), the third component, g_{2s}(x), can be interpreted as a composition of g_s(·) with the second component g_s(x), simply as g_s(g_s(x)) = g_{2s}(x). Likewise, as already noted in the previous paragraph, the second component, g_s(x), is a composition of g_s(·) with the first component, x. This insight informs the recursive definition of the F_r-functions according to (9), which, modulo the shaping by the localized f^{z_r}_{X_r}-functions, can be seen to exhibit this g_s-composition property as well. The Z_r(x, s)-functions constituting the components of our transport map (23) are then obtained by applying the localization idea as described above for the 2-dimensional case. There is, however, an important difference between localization in the 2-dimensional case and in the general d-dimensional case. This is best seen by inspecting the 3-dimensional case illustrated in Figure 7. Specifically, whereas in the 2-dimensional case the F_r-functions are contiguously supported (see subplot (f)), in the 3-dimensional case, as illustrated in subplot (d), the support sets are disconnected, but exhibit a periodic pattern. Going to higher dimensions yields a fractal-like support set picture. We emphasize that this support set structure is a consequence of interlacing the self-compositions of the g_s-functions with the localized per-bin histogram-distribution shaping functions f^{z_r}_{X_r}.
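A small numerical sketch (ours) of the two properties just discussed: the composition identity g_s ∘ g_s = g_{2s} and the Bailey-Telgarsky map x → (x, g_s(x), g_{2s}(x)) filling the 3-dimensional unit cube.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def g_s(x, s):
    y = np.asarray(x, dtype=float)
    for _ in range(s):
        y = 2 * relu(y) - 4 * relu(y - 0.5) + 2 * relu(y - 1.0)
    return y

x = np.linspace(0, 1, 1001)
# Composition property used above: g_s(g_s(x)) = g_{2s}(x).
print(np.allclose(g_s(g_s(x, 3), 3), g_s(x, 6)))   # True

# Bailey-Telgarsky map into 3 dimensions: x -> (x, g_s(x), g_{2s}(x)).
u = np.random.default_rng(0).uniform(size=300_000)
pts = np.stack([u, g_s(u, 4), g_s(u, 8)], axis=1)
H, _ = np.histogramdd(pts, bins=4, range=[[0, 1]] * 3)
print(H.std() / H.mean())   # small: [0, 1]^3 is covered nearly uniformly
```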
We finally note that the transport map M in Theorem V.6 can be interpreted as a transport operator in the sense of optimal transport theory [20], [25], with the source distribution being 1-dimensional and the target distribution d-dimensional. What is special here is that the transport operator acts between spaces of different dimensions and does so in a space-filling manner [18].

VI. REALIZATION OF TRANSPORT MAP THROUGH QUANTIZED NETWORKS
This section is concerned with the realization of the transport map M by ReLU networks. In particular, we shall consider networks with quantized weights, for three reasons. First, in practice, network weights cannot be stored as real numbers on a computer, but rather have to be encoded with a finite number of bits. Second, we want to convince ourselves that the space-filling property of the transport map, brittle as it seems, is, in fact, not dependent on the network weights being real numbers. Third, we will be able to develop a relationship, presented in Section IX, between the complexity of target distributions and the complexity of the ReLU networks realizing the corresponding transport maps. Specifically, complexity will be quantified through the number of bits needed to encode the distribution and the network, respectively, to within a prescribed accuracy.
We will see that ReLU networks with quantized weights generate histogram distributions with quantized weights, referred to as quantized histogram distributions in the following. In Section VII, we will then study the approximation of general distributions by quantized histogram distributions. Finally, in Section VIII, we put everything together and characterize the error incurred when approximating arbitrary target distributions by the transportation of a 1-dimensional uniform distribution through a ReLU network with quantized weights.
Before proceeding, we need to define quantized histogram distributions and quantized networks. We start with scalar distributions.
Definition VI.1. Let δ = 1/A, for some A ∈ N. A random variable X is said to have a δ-quantized histogram distribution of resolution n on [0, 1], denoted as X ∼ Ẽ_δ[0, 1]^1_n, if X ∼ E[0, 1]^1_n with all weights given by integer multiples of δ. We extend this definition to random vectors by saying that a random vector has a δ-quantized histogram distribution if all its conditional (1-dimensional) distributions p^{z_i}_{X_i} are δ-quantized histogram distributions.

Definition VI.2. Let δ = 1/A, for some A ∈ N. A random vector X = (X_1, X_2, . . . , X_d)^⊤ is said to have a δ-quantized histogram distribution of resolution n on the d-dimensional unit cube, denoted as X ∼ Ẽ_δ[0, 1]^d_n, if X ∼ E[0, 1]^d_n and all its conditional 1-dimensional distributions p^{z_i}_{X_i} are δ-quantized histogram distributions of resolution n on [0, 1].

Definition VI.3. For δ > 0, we say that a ReLU network is δ-quantized if each of its weights is of one of the following two types. A weight w is of Type 1 if w ∈ (δZ ∩ [−1/δ, 1/δ]) and of Type 2 if 1/w ∈ (δZ ∩ [−1/δ, 1/δ]).

Formally, the goal of this section is to find, for fixed p_X ∈ Ẽ_δ[0, 1]^d_n, a quantized ReLU network Φ such that Φ#U approximates p_X to within a prescribed accuracy. To this end, we start with an auxiliary lemma, which constructs the building blocks of such networks.
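As a small illustration of Definition VI.3, the following sketch (with hypothetical helper names of our choosing) maps a real weight to a nearby Type-1 or Type-2 quantized weight. The paper's construction prescribes which weights are of which type; this is only meant to make the two grids concrete.

```python
import numpy as np

def quantize_type1(w, delta):
    """Nearest Type-1 weight: an element of (delta * Z) intersected with [-1/delta, 1/delta]."""
    q = delta * np.round(np.asarray(w, dtype=float) / delta)
    return np.clip(q, -1.0 / delta, 1.0 / delta)

def quantize_type2(w, delta):
    """Nearest Type-2 weight: a weight whose reciprocal lies in (delta * Z) ∩ [-1/delta, 1/delta].
    Assumes the rounded reciprocal is nonzero."""
    return 1.0 / quantize_type1(1.0 / np.asarray(w, dtype=float), delta)

delta = 1 / 8
print(quantize_type1(0.3, delta))   # 0.25
print(quantize_type2(0.3, delta))   # 1 / 3.375 ≈ 0.296
```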
Proof. We start with auxiliary results needed in the proof and then proceed to establish the statement for the cases r = 0 and r ≥ 1 separately. According to Corollary IV.4, for every k ∈ [1 : d] and all z_k ∈ [0 : (n − 1)]^{k−1}, f^{z_k}_{X_k}(x) can be realized through a ReLU network Φ^{z_k} : R → R in N_{1,1}. For ∆ = δ/n, the network Φ^{z_k} is ∆-quantized, with the weights 1/w_0, 1/w_i, and 1/w_{i−1} of Type 2 and the remaining weights of Type 1.

We are now ready to prove the statement for r = 0. The networks Ψ^{z_1}_{i,s} realizing the components g_s(n f^{z_1}_{X_1}(·) − i) are obtained by composition with Φ^{z_1}. As we want to apply [9, Lemma II.5], we need to augment Φ^{z_1} to depth s + 3. This is effected by exploiting that Φ^{z_1}(x) ≥ 0, ∀x ∈ R, which allows us to retain the input-output relation realized by the network while amending it by multiplications by 1 (acting as affine transformations) interlaced by applications of ρ, for an overall depth of s + 3. This leads to the augmented network Φ̃^{z_1} = ρ ∘ . . .

We proceed to the proof for the case r ≥ 1. To this end, we use (9) to write the map M_r : R^{n^r + r} → R^{n^{r+1} + r + 1}, for r ∈ [1 : (d − 1)], as the map taking (y_1, y_2, . . . , y_{n^r + r}) to the collection of components of the form g_s(n f^{·}_{X_{r+1}}(·) − i), each realized by a network of depth s + 3, followed by the identity networks Φ^{s+3}_{id}(y_{n^r + 1}), . . . , Φ^{s+3}_{id}(y_{n^r + r}) and the summation network Ψ_Σ(y_1, . . . , y_{n^r}).
The next result characterizes the ReLU networks realizing the transport map and quantifies their size in terms of connectivity and depth.
Lemma VI.5. For every p_X ∈ Ẽ_δ[0, 1]^d_n with d > 1, the corresponding transport map M : x → (Z_1(x, s), Z_2(x, s), . . . , Z_d(x, s)) can be realized through a ∆-quantized ReLU network.

We are now ready to state the main result of this section, namely that for every quantized histogram distribution p_X and every ε > 0, there exists a quantized ReLU network Ψ satisfying W(Ψ#U, p_X) ≤ ε. In particular, we also quantify the dependence of ε on the resolution n and the dimension d of p_X as well as on the depth of the network Ψ.
Proof. By Lemma VI.5, for every p_X ∈ Ẽ_δ[0, 1]^d_n with d > 1, the corresponding transport map M : x → (Z_1(x, s), Z_2(x, s), . . . , Z_d(x, s)) can be realized through a ∆-quantized ReLU network Ψ, and the bound (24) then follows from Theorem V.6.

We note that, for fixed histogram resolution n, the upper bound on the approximation error (24) decays exponentially in s and hence in the network depth L(Ψ). In particular, choosing s ∼ n guarantees that the error in Theorem VI.6 decays exponentially in n while the connectivity of the network is in O(n^d); this behavior is asymptotically optimal as the number of parameters in Ẽ_δ[0, 1]^d_n is of the same order.

VII. APPROXIMATION OF ARBITRARY DISTRIBUTIONS ON [0, 1] d BY QUANTIZED HISTOGRAM DISTRIBUTIONS
This section is concerned with the approximation of arbitrary distributions ν supported on [0, 1] d by δ-quantized histogram distributions of resolution n as defined in the previous section.
Define the k-dimensional subcube c_{i_k} = [i_1/n, (i_1 + 1)/n] × [i_2/n, (i_2 + 1)/n] × · · · × [i_k/n, (i_k + 1)/n], where i_k = (i_1, i_2, . . . , i_k) ∈ [0 : (n − 1)]^k, and its corner point p_{i_k} := (i_1/n, i_2/n, . . . , i_k/n). Next, we discretize the domain [0, 1]^d into the subcubes c_{i_d} and characterize the amount of probability mass ν assigns to the individual subcubes. First, set m_{i_d} := ν(c_{i_d}). Then, for k ∈ [1 : (d − 1)], we define the projections P_k : R^d → R^k, (x_1, . . . , x_k, . . . , x_d) → (x_1, . . . , x_k), and the corresponding k-dimensional marginals ν_k := P_k#ν with weights m_{i_k} := ν_k(c_{i_k}). It will also be useful to define conditional masses according to n_{i_1} = m_{i_1} and, for k ∈ [2 : d], n_{i_k} := m_{i_k}/m_{i_{k−1}} whenever m_{i_{k−1}} > 0. (For m_{i_{k−1}} = 0, we can, in principle, set the conditional masses arbitrarily; a specific choice is fixed for concreteness. Throughout, we use the scalar index i_1 and the 1-tuple i_1 = (i_1) interchangeably.)

Now that we have defined the masses m_{i_k} and the conditional masses n_{i_k} for the distribution ν, we can proceed to derive the masses m̃_{i_k} and ñ_{i_k} of the corresponding δ-quantized histogram distribution. Denote the index of the subcube with the highest (original) mass in the first coordinate as i_1^{*(i_0)}; if there are multiple subcubes with the same maximal mass, simply pick one of them (it does not matter which one). (Formally, i_0, albeit not defined, would correspond to a 0-dimensional quantity; it is used throughout the paper only for notational consistency.) Now, for k = 1 and i_1 ≠ i_1^{*(i_0)}, we choose the quantized masses m̃_{i_1} according to (25). Note that, with this definition, the quantized masses m̃_{i_1} are always nonzero for i_1 ≠ i_1^{*(i_0)}, even in subcubes where the original masses m_{i_1} are equal to zero. We will later verify that this is also the case for i_1 = i_1^{*(i_0)} whenever δ < 1/(n(n − 1)). For k ≥ 2, we similarly borrow mass from the subcube with maximum mass, and we do so in each coordinate individually. To this end, for each k ∈ [2 : d] and all i_{k−1}, we let i_k^{*(i_{k−1})} denote the index maximizing the conditional mass n_{(i_{k−1}, i_k)} over i_k.
As in the assignment (25) for the first coordinate, if there are multiple such values, any of them will do. To define the quantized conditional masses, we set ñ_{i_k}, for each i_{k−1} ∈ [0 : (n − 1)]^{k−1} and each i_k ≠ i_k^{*(i_{k−1})}, to a δ-quantized value close to n_{i_k}, with the remaining mass assigned to i_k = i_k^{*(i_{k−1})}. We can then define the quantized weights ñ_{i_k} accordingly and, correspondingly, the quantized masses m̃_{i_k}. We now check that the quantized weights verify the following two properties: 1) correct marginals; 2) if δ < 1/(n(n − 1)), then all quantized masses are positive, the critical case being i_k = i_k^{*(i_{k−1})}, for which the definition of ñ_{i_k} yields the desired positivity.

We next formalize the procedure for going from the original masses m_{i_k} to the quantized masses m̃_{i_k} by characterizing a transport map effecting this transition.
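To make the first-coordinate quantization step concrete, here is one possible reading of it in Python for the 1-dimensional case: every bin except the heaviest is rounded up to the next positive multiple of δ (so it remains nonzero even if the original mass vanishes), and the heaviest bin absorbs the difference so that the total mass stays 1. This is our sketch of the mechanism, not a verbatim reproduction of (25).

```python
import numpy as np

def quantize_masses_1d(m, delta):
    """Quantize a probability vector m to the delta-grid (our reading of the scheme):
    every bin except the heaviest is rounded up to the next positive multiple of delta,
    and the heaviest bin absorbs the difference so that the total mass remains 1."""
    m = np.asarray(m, dtype=float)
    i_star = int(np.argmax(m))                     # ties: any maximizer will do
    q = delta * (np.floor(m / delta) + 1)          # round up, strictly positive
    q[i_star] = 1.0 - (q.sum() - q[i_star])        # borrow from the heaviest bin
    return q

m = np.array([0.62, 0.0, 0.23, 0.15])
delta = 1 / 20
q = quantize_masses_1d(m, delta)
print(q, q.sum())            # [0.5, 0.05, 0.25, 0.2], summing to 1
assert np.all(np.abs(q - m) <= (len(m) - 1) * delta)
```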
Lemma VII.1. Let k ∈ [1 : d], ν a distribution supported on [0, 1]^k with masses m_{i_k} in the subcubes c_{i_k} and conditional masses n_{i_k}, all as specified above. Let the quantized masses m̃_{i_k} and the conditional quantized masses ñ_{i_k} also be given as above. Then, for all i_k, the quantized mass m̃_{i_k} admits the representation (26) in terms of the original and quantized conditional masses. The proof of Lemma VII.1 is provided in the Appendix.
We are now ready to state the main result of this section. Specifically, we establish an upper bound on the Wasserstein distance between a given (arbitrary) distribution ν supported on [0, 1] d , for any d ∈ N, and the corresponding δ-quantized histogram distribution of resolution n obtained based on the procedure described above.
Theorem VII.2. Let d ∈ N. For every distribution ν supported on [0, 1]^d, there exists a δ-quantized histogram distribution µ of resolution n such that

W(ν, µ) ≤ 2√d/n + (d(d + 1)/2)(n − 1)δ.

Proof. The proof proceeds in three steps as follows.

Step 1. Within each subcube c_{i_d}, transport all of the mass ν assigns to c_{i_d} to the corner point p_{i_d}. As the diameter of a subcube of side length 1/n is √d/n, this comes at a Wasserstein cost of at most √d/n.
Step 2. To redistribute the masses from the original values m_{i_d} to the quantized values m̃_{i_d}, we proceed coordinate by coordinate. Specifically, in the k-th coordinate, we carry out two (sets of) transportations. The first one moves, for fixed i_1, . . . , i_{k−1}, i_{k+1}, . . . , i_d, mass from the point p_{i_1,...,i_{k−1}, i_k^{*(i_{k−1})}, i_{k+1},...,i_d} to the points p_{i_1,...,i_{k−1}, i_k, i_{k+1},...,i_d} with i_k ≠ i_k^{*(i_{k−1})}, and does this for all tuples i_1, . . . , i_{k−1}, i_{k+1}, . . . , i_d. The second set of transportations reconfigures masses in the coordinates [1 : (k − 1)] so as to obtain the correct marginals in coordinate k. These reconfigurations moreover preserve the marginals in coordinates [1 : (k − 1)]. We make all this precise through the following claim, proved below after Step 3 has been presented.
Claim: Reconfiguring the masses between the corner points such that the mass in the point p_{i_d} is given by (28) yields the correct marginal masses m̃_{i_{k′}} in all coordinates k′ ∈ [1 : k] and comes at a Wasserstein cost of at most k(n − 1)δ, i.e., the Wasserstein distance between the configuration of masses before the moves and the configuration after the moves is at most k(n − 1)δ. (There is a slight complication when m_{i_{k′}} = 0; in either case, the stated expression for the mass applies.) We note that the transport map in the Claim characterizes, at a high level, the state of the masses at an intermediate step in the transportation, while (26) describes the "final state" after all the moves have been completed in coordinate k.
If we accept the claim and apply it for k = d in combination with Lemma VII.1, it follows that the masses m_{i_d} are, indeed, redistributed to the masses m̃_{i_d}. Moreover, we get that the total cost of the transportations in Step 2 effecting this redistribution is upper-bounded by (n − 1)δ + 2(n − 1)δ + · · · + d(n − 1)δ = (d(d + 1)/2)(n − 1)δ.
Step 3. The Wasserstein cost associated with spreading out the masses m̃_{i_d} uniformly across their associated subcubes follows from (27) and is at most √d/n, the diameter of a subcube. Using the fact that the Wasserstein distance is a metric, the costs incurred in Steps 1-3 can be added up via the triangle inequality, which yields the bound in the theorem statement. It remains to prove the claim.
Proof of the Claim. We proceed by induction on k and start with the base case k = 1. The statement on the transportation cost associated with (28) does not need an induction argument; rather, it follows as a byproduct of the proof by induction. Evaluating the transport map for k = 1 shows that masses are moved in the first coordinate only and that m̃_{i_1} − m_{i_1} ≤ δ, for i_1 ≠ i_1^{*(i_0)}, so that the Wasserstein cost of the overall transportation satisfies the claimed bound. Furthermore, we obtain the desired marginal masses in i_1 as a consequence of summing (28) over i_2, . . . , i_d. This completes the proof of the base case.

We proceed to establish the induction step. Assume that transportations were conducted in coordinate k according to (28) and that all marginal masses up to and including coordinate k are as desired. We consider the transport equation (28) in coordinate k + 1, i.e., the sums in (28) range from 1 to k + 1. The first set of transportations (corresponding to the index k′ = k + 1 in the transport equation (28) evaluated for coordinate k + 1) amounts to moving, for fixed i_1, . . . , i_k, i_{k+2}, . . . , i_d, the mass concentrated in the point p_{i_1,...,i_k, i_{k+1}^{*(i_k)}, i_{k+2},...,i_d} out of that point and redistributing it across the points p_{i_1,...,i_k, i_{k+1}, i_{k+2},...,i_d}, for i_{k+1} ≠ i_{k+1}^{*(i_k)}. Note that, for i_{k+1} = i_{k+1}^{*(i_k)}, the quantity n_{i_{k+1}} − ñ_{i_{k+1}} is positive by definition of ñ. These transportations are conducted for all possible tuples i_1, . . . , i_k, i_{k+2}, . . . , i_d. The Wasserstein cost associated with the collection of these transportations satisfies the bound (n − 1)δ, where the last inequality in the corresponding estimate follows because η_{i_k}(i_k) = ñ_{i_k} exactly when i_k = i_k^{*(i_{k−1})}, in which case we have ñ_{i_k} ≤ n_{i_k}.

The second set of transportations reconfigures the masses in the coordinates k′ ≤ k in order to obtain correct marginals in the (k + 1)-th coordinate. To this end, we first note that, for each k′ ≤ k and all i_{k′}, the identities (29) and (30) hold. Specifically, noting that by the induction assumption transportation according to (28) was carried out in coordinate k, we would like to reconfigure, for each k′ ≤ k, the corresponding masses. This reconfiguration is possible as only one term in each of (29) and (30) depends on i_{k+1}. Masses to be moved in this manner appear for all i_1, . . . , i_{k′−1}, i_{k′+1}, . . . , i_d. It follows by inspection of the transport map (28) that these transportations do not alter the marginals for k′ ≤ k as, for given k′, the mass moved out of the point p_{i_1,...,i_{k′−1}, i_{k′}^{*(i_{k′−1})}, i_{k′+1},...,i_d}, accounted for by the sum with negative sign in (28), equals the total mass moved into the points p_{i_1,...,i_{k′−1}, i_{k′}, i_{k′+1},...,i_d}, for i_{k′} ≠ i_{k′}^{*(i_{k′−1})}, accounted for by the sum with positive sign. These moves hence retain the marginals for k′ ≤ k, which are correct by the induction assumption. Before establishing that the desired marginals in coordinate k + 1 are obtained, we note that the Wasserstein cost associated with the mass reconfiguration moves in a given coordinate k′ is at most (n − 1)δ. We must carry this out for all k′ ∈ [1 : k], which results in a Wasserstein cost of k(n − 1)δ for the reconfigurations. Altogether, we have a Wasserstein cost of (k + 1)(n − 1)δ incurred by the moves corresponding to coordinate k + 1.
Next, using ñ_{i_{k+1}} Υ̃(i, k, k′ + 1) = Υ̃(i, k + 1, k′ + 1), it follows that the updated mass in the point p_{i_d} is given by (28) with k replaced by k + 1. Finally, we need to check that the marginals in the (k + 1)-th coordinate are, indeed, given by m̃_{i_{k+1}}. This is accomplished by noting that, owing to Lemma VII.1, summing the updated masses over the remaining coordinates yields m̃_{i_{k+1}}. This concludes the proof of the induction step and hence establishes the claim.
Proof. The proof follows by application of the triangle inequality for the Wasserstein distance in combination with Theorems VI.6 and VII.2, according to (32), where µ denotes the δ-quantized histogram distribution of resolution n per Theorem VII.2.
The error bound in (32) illustrates the main conceptual insight of this paper, namely that generating arbitrary d-dimensional distributions from a 1-dimensional uniform distribution by pushforward through a deep ReLU network does not come at a cost, in terms of Wasserstein-distance error, relative to generating the target distribution from d independent random variables. Specifically, if we let the depth s of the generating network go to infinity, the first term in the rightmost expression of (32) will go to zero exponentially fast in s, thanks to the space-filling property of the transport map realized by the generating network, leaving us only with the second term, which reflects the error stemming from the histogram approximation of the distribution. Moreover, this second term is inversely proportional to the histogram resolution n and can thus be made arbitrarily small by letting the histogram resolution n approach infinity. The width of the corresponding generating network will grow according to O(n^d).
Theorem VIII.1 applies to distributions supported on the unit cube [0, 1] d . The extension to distributions supported on bounded subsets of R d is, however, fairly straightforward. Before stating this extension, we provide a lemma that will help us deal with the scaling and shifting of distributions.
Proof. Let π be a coupling between µ and ν and let g : R^{2d} → R^{2d}; (y_1, y_2) → (f(y_1), f(y_2)). Then g#π is a coupling between f#µ and f#ν, and the cost of this coupling is bounded by Lip(f) times the cost of π, which yields the claim upon taking the infimum over all couplings π.

We are now ready to state the extension announced above.
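Before doing so, we record the estimate just used in full, assuming that W denotes the order-1 Wasserstein distance of Definition II.2:

$$
W(f\#\mu, f\#\nu) \;\le\; \int \|f(y_1) - f(y_2)\|\, d\pi(y_1, y_2) \;\le\; \operatorname{Lip}(f) \int \|y_1 - y_2\|\, d\pi(y_1, y_2),
$$

and taking the infimum over all couplings π of (µ, ν) yields W(f#µ, f#ν) ≤ Lip(f) W(µ, ν).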
Proof. We first note that g −1 is Lipschitz with Lip(g −1 ) = α. The result then follows immediately from Lemma VIII.2 combined with (32) by taking Φ ∈ N 1,d to approximate the distribution g#ν according to Theorem VIII.1.
We finally remark that g^{−1}, by virtue of being an affine map, can easily be realized by a ReLU network.

IX. COMPLEXITY OF GENERATIVE NETWORKS
In this section, we compare the complexity of ReLU networks generating a given class of probability distributions to fundamental bounds on the complexity of encoding classes of probability distributions through discrete approximations, a process commonly referred to as quantization [12]. Specifically, complexity will be measured in terms of the number of bits needed to describe the generative networks and, respectively, distributions. We begin by reviewing a fundamental result on the approximation of (non-singular) distributions.
The quantity V_n(ν) characterizes the approximation error, in Wasserstein distance, incurred by the best discrete n-point approximation of ν.
The next result, taken from [12], states that this approximation error exhibits the same asymptotics for all (nonsingular) distributions satisfying a mild moment constraint.
Theorem IX.2 ([12, Theorem 6.2]). Let X be a random vector in R^d with X ∼ ν, where ν is non-singular and supported on [0, 1]^d, and E‖X‖^{1+δ} < ∞ for some δ > 0, where ‖·‖ is any norm on R^d. Then,

lim inf_{n→∞} n^{1/d} V_n(ν) ≥ C,

where C > 0 is a constant depending on d only.
Theorem IX.2 allows us to conclude that the best-approximating discrete distribution must have at least of the order of ε^{−d} points for V_n(ν) ≤ ε to hold. As the Wasserstein distance is a metric, we hence have a covering argument which says that the class of (non-singular) distributions ν supported on [0, 1]^d (and satisfying the moment constraint in Theorem IX.2) has metric entropy lower-bounded by d log(ε^{−1}) bits. Although this lower bound is very generous, we demonstrate next that it is achieved for quantized histogram target distributions encoded by their generating ReLU networks.

APPENDIX
Proof. Note first that, for a given i_{k′}, only one of the two χ-terms above is active. Terms with i_{k′} ≠ i_{k′}^{*(i_{k′−1})} correspond to subcubes to which we add mass to get the quantized masses in the k′-th coordinate, while terms with i_{k′} = i_{k′}^{*(i_{k′−1})} correspond to the subcube from which we take this extra mass. Correspondingly, we refer to terms with i_{k′} ≠ i_{k′}^{*(i_{k′−1})} as "+ terms", while we designate terms with i_{k′} = i_{k′}^{*(i_{k′−1})} as "− terms". By construction, ñ_{i_{k′}} − n_{i_{k′}} ≥ 0 for + terms, while n_{i_{k′}} − ñ_{i_{k′}} ≥ 0 for − terms. In evaluating the sum (26), we consider three different cases.
Case 2: All terms are − terms. In this case, the sum (26) can be evaluated directly. Case 3: There is at least one + term and one − term. Let the indices of the + terms be given by k_1^+, . . . , k_{ℓ_1}^+ and those of the − terms by k_1^−, . . . , k_{ℓ_2}^−, with both sets arranged in increasing order and ℓ_1 + ℓ_2 = k.
We first consider the sum of the − terms, Σ_{ℓ=1}^{ℓ_2} Υ(i, k, k_ℓ^− + 1) n_{i_{k_ℓ^−}} Υ_η(i, k_ℓ^− − 1, 1), and establish a cancelation property of successive terms in this sum, leaving only the border terms to be considered. Indeed, take ℓ such that 1 < ℓ < ℓ_2, with the corresponding term given by (35). Next, note that the positive part of the term corresponding to the index ℓ + 1 is given by (36), since all indices that lie strictly between k_ℓ^− and k_{ℓ+1}^−, if there are any, correspond to + terms. Comparing (36) with (35) reveals that the positive part of the term corresponding to ℓ + 1 cancels out the negative part of the term for ℓ. Similarly, the negative part of the term corresponding to ℓ − 1 cancels out the positive part of the term for index ℓ. Hence, the only contributions remaining in the sum (34) over all the − terms are the negative part of the term corresponding to the index ℓ_2 and the positive part of the term for the index 1, since indices smaller than k_1^− necessarily correspond to + terms. We proceed to the sum over the + terms. A similar cancelation property between consecutive terms in the sum can be established, so that we are again left with contributions from the first and the last term only. Indeed, take ℓ such that 1 < ℓ < ℓ_1, with the corresponding term given by Υ̃(i, k, k_ℓ^+ + 1) ñ_{i_{k_ℓ^+}} Υ_η(i, k_ℓ^+ − 1, 1) − Υ̃(i, k, k_ℓ^+ + 1) n_{i_{k_ℓ^+}} Υ_η(i, k_ℓ^+ − 1, 1).
The positive part of the term corresponding to the index ℓ + 1 is given by the analogous expression, since all indices that lie strictly between k_ℓ^+ and k_{ℓ+1}^+, if there are any, correspond to − terms. We can hence conclude, as above, that the positive part of the term corresponding to ℓ + 1 cancels out the negative part of the term for ℓ. By the same argument, the positive part of the term for ℓ is cancelled out by the negative part of the term corresponding to ℓ − 1. Overall, the only remaining contributions are the negative part of the term corresponding to ℓ_1 and the positive part of the term for the index 1, i.e.,

Σ_{ℓ=1}^{ℓ_1} Υ̃(i, k, k_ℓ^+ + 1) (ñ_{i_{k_ℓ^+}} − n_{i_{k_ℓ^+}}) Υ_η(i, k_ℓ^+ − 1, 1) = Υ̃(i, k, 1) − Υ̃(i, k, k_{ℓ_1}^+ + 1) n_{i_{k_{ℓ_1}^+}} Υ_η(i, k_{ℓ_1}^+ − 1, 1) = m̃_{i_k} − Υ̃(i, k, k_{ℓ_1}^+ + 1) n_{i_{k_{ℓ_1}^+}} Υ_η(i, k_{ℓ_1}^+ − 1, 1).