Phase Transitions in Rate Distortion Theory and Deep Learning

Rate distortion theory is concerned with optimally encoding a given signal class $\mathcal{S}$ using a budget of $R$ bits, as $R\to\infty$. We say that $\mathcal{S}$ can be compressed at rate $s$ if we can achieve an error of $\mathcal{O}(R^{-s})$ for encoding $\mathcal{S}$; the supremal compression rate is denoted $s^\ast(\mathcal{S})$. Given a fixed coding scheme, there usually are elements of $\mathcal{S}$ that are compressed at a higher rate than $s^\ast(\mathcal{S})$ by the given coding scheme; we study the size of this set of signals. We show that for certain "nice" signal classes $\mathcal{S}$, a phase transition occurs: We construct a probability measure $\mathbb{P}$ on $\mathcal{S}$ such that for every coding scheme $\mathcal{C}$ and any $s>s^\ast(\mathcal{S})$, the set of signals encoded with error $\mathcal{O}(R^{-s})$ by $\mathcal{C}$ forms a $\mathbb{P}$-null-set. In particular, our results apply to balls in Besov and Sobolev spaces that embed compactly into $L^2(\Omega)$ for a bounded Lipschitz domain $\Omega$. As an application, we show that several existing sharpness results concerning function approximation using deep neural networks are generically sharp. We also provide quantitative and non-asymptotic bounds on the probability that a random $f\in\mathcal{S}$ can be encoded to within accuracy $\varepsilon$ using $R$ bits. This result is applied to the problem of approximately representing $f\in\mathcal{S}$ to within accuracy $\varepsilon$ by a (quantized) neural network that is constrained to have at most $W$ nonzero weights and is generated by an arbitrary "learning" procedure. We show that for any $s>s^\ast(\mathcal{S})$ there are constants $c,C$ such that, no matter how we choose the "learning" procedure, the probability of success is bounded from above by $\min\big\{1,2^{C\cdot W\lceil\log_2(1+W)\rceil^2 -c\cdot\varepsilon^{-1/s}}\big\}$.


Introduction
Let $\mathcal{S}$ be a signal class, that is, a relatively compact subset of a Banach space $(X, \|\cdot\|_X)$. Rate distortion theory is concerned with the question of how well the elements of $\mathcal{S}$ can be encoded using a prescribed number $R$ of bits. In many cases of interest, the best achievable coding error scales like $R^{-s^*}$, where $s^*$ is the optimal compression rate of the signal class $\mathcal{S}$. We show that a phase transition occurs: the set of elements $x \in \mathcal{S}$ that can be encoded using a strictly larger exponent than $s^*$ is thin; precisely, it is a null-set with respect to a suitable probability measure $\mathbb{P}$. Crucially, the measure $\mathbb{P}$ is independent of the chosen coding scheme.
In order to make these results more rigorous, let us state the needed notions of rate distortion theory; see also [3, 4, 12, 14].

A crash course in rate distortion theory
To formalize the notion of encoding a signal class $S \subset X$, we define the set $\mathrm{Enc}^R_{S,X}$ of encoding/decoding pairs $(E, D)$ of code-length $R \in \mathbb{N}$ as
$$\mathrm{Enc}^R_{S,X} := \big\{ (E, D) \,:\, E : S \to \{0,1\}^R \text{ and } D : \{0,1\}^R \to X \big\}.$$
We are interested in choosing $(E, D) \in \mathrm{Enc}^R_{S,X}$ so as to minimize the (maximal) distortion $\delta_{S,X}(E, D) := \sup_{x \in S} \| x - D(E(x)) \|_X$.
The intuition behind these definitions is that the encoder E converts any signal x ∈ S into a bitstream of code-length R (i.e., consisting of R bits), while the decoder D produces from a given bitstream b ∈ {0, 1} R a signal D(b) ∈ X. The goal of rate distortion theory is to determine the minimal distortion that can be achieved by any encoder/decoder pair of code-length R ∈ N. Typical results concerning the relation between code-length and distortion are formulated in an asymptotic sense: One assumes that for every code-length R ∈ N, one is given an encoding/decoding pair (E R , D R ) ∈ Enc R S,X , and then studies the asymptotic behaviour of the corresponding distortion δ S,X (E R , D R ) as R → ∞.
We refer to a sequence $(E_R, D_R)_{R\in\mathbb{N}}$ of encoding/decoding pairs as a codec, so that the set of all codecs is $\mathrm{Codecs}_{S,X} := \prod_{R\in\mathbb{N}} \mathrm{Enc}^R_{S,X}$.
For a given signal class $S$ in a Banach space $X$, it is of great interest to find an asymptotically optimal codec; that is, a sequence $(E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$ such that the asymptotic decay of $\big(\delta_{S,X}(E_R, D_R)\big)_{R\in\mathbb{N}}$ is, in a sense, maximal. To formalize this, for each $s \in [0,\infty)$ define the class of subsets of $X$ that admit compression rate $s$ as
$$\mathrm{Comp}^s_X := \Big\{ S \subset X \,:\, \exists\, (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X} : \sup_{R\in\mathbb{N}} R^s \cdot \delta_{S,X}(E_R, D_R) < \infty \Big\}.$$
For a given (bounded) signal class $S \subset X$ we aim to determine the optimal compression rate for $S$ in $X$, that is,
$$s^*_X(S) := \sup \big\{ s \in [0,\infty) \,:\, S \in \mathrm{Comp}^s_X \big\}. \qquad (1.1)$$
Although the calculation of the quantity $s^*_X(S)$ may appear daunting for a given signal class $S$, there exists in fact a large body of literature addressing this topic. A landmark result in this area states that the JPEG2000 compression standard represents an optimal codec for the compression of piecewise smooth signals [22]. This optimality is typically stated more generally for the signal class $S = B(0, 1; B^\alpha_{p,q}(\Omega))$, the unit ball in the Besov space $B^\alpha_{p,q}(\Omega)$, considered as a subset of $X = H = L^2(\Omega)$, for "sufficiently nice" bounded domains $\Omega \subset \mathbb{R}^d$; see [9].
For a codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$, instead of considering the maximal distortion of $\mathcal{C}$ over the entire signal class $S$, one can also measure the approximation rate that the codec $\mathcal{C}$ achieves for each individual $x \in S$. Precisely, the class of elements with compression rate $s$ under $\mathcal{C}$ is
$$A^s_{S,X}(\mathcal{C}) := \Big\{ x \in S \,:\, \sup_{R\in\mathbb{N}} R^s \cdot \| x - D_R(E_R(x)) \|_X < \infty \Big\}. \qquad (1.2)$$
If the signal class $S$ is "sufficiently regular" (for instance, if $S$ is compact and convex), then one can prove (see Proposition G.1) that the following dichotomy is valid:
$$s < s^*_X(S) \implies \exists\, \mathcal{C} \in \mathrm{Codecs}_{S,X}\ \forall\, x \in S : x \in A^s_{S,X}(\mathcal{C}), \qquad s > s^*_X(S) \implies \forall\, \mathcal{C} \in \mathrm{Codecs}_{S,X}\ \exists\, x_* \in S : x_* \notin A^s_{S,X}(\mathcal{C}). \qquad (1.3)$$
Thus, all signals in S can be approximated at any compression rate lower than the optimal rate for S using a common codec. Furthermore, for any approximation rate s larger than the optimal rate for S, and for any codec C, there exists some x * = x * (s, C) ∈ S that is not compressed at rate s by C.
Remark (Encoding/decoding schemes vs. discretization maps). As the above considerations suggest, the crucial quantities for our investigations are not the encoding/decoding pairs $(E, D) \in \mathrm{Enc}^R_{S,X}$ themselves, but the distortions they cause for each $x \in S$. Therefore, we could equally well restrict our attention to the discretization map $D \circ E : S \to X$, which has the crucial property $|\mathrm{range}(D \circ E)| \leq 2^R$. Conversely, given any (discretization) map $\Delta : S \to X$ with $|\mathrm{range}(\Delta)| \leq 2^R$, one can construct an encoding/decoding pair $(E, D) \in \mathrm{Enc}^R_{S,X}$ by choosing a surjection $D : \{0,1\}^R \to \mathrm{range}(\Delta)$ and then choosing $E : S \to \{0,1\}^R$ such that $D(E(x)) = \Delta(x)$, which ensures that $\|x - D(E(x))\|_X \leq \|x - \Delta(x)\|_X$ for all $x \in S$. Thus, all our results could equally well be rephrased in terms of such discretization maps rather than in terms of encoding/decoding pairs. For more details on this connection, see also Lemma B.1.
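This correspondence can be made concrete in a toy example. The sketch below (our own illustration, not part of the paper) uses uniform scalar quantization of the signal class $S = [0,1] \subset X = \mathbb{R}$ as the discretization map $\Delta$, derives an encoding/decoding pair $(E, D)$ of code-length $R$ from it, and checks that $(E, D)$ realizes the same distortion as $\Delta$:

```python
R = 4  # code-length in bits, so |range(Delta)| <= 2^R codewords

def delta(x):
    """Discretization map: round x in [0, 1] to one of 2^R cell midpoints."""
    k = min(int(x * 2**R), 2**R - 1)
    return (k + 0.5) / 2**R

def D(bits):
    """Decoder: a surjection {0,1}^R -> range(delta)."""
    k = int("".join(map(str, bits)), 2)
    return (k + 0.5) / 2**R

def E(x):
    """Encoder: pick a bitstream b with D(b) = delta(x)."""
    k = min(int(x * 2**R), 2**R - 1)
    return [int(b) for b in format(k, f"0{R}b")]

# (E, D) reproduces Delta, so the distortion is sup |x - delta(x)| = 2^-(R+1).
for x in [0.0, 0.1234, 0.5, 0.9999]:
    assert D(E(x)) == delta(x)
    assert abs(x - D(E(x))) <= 2 ** -(R + 1)
```

Here the distortion $2^{-(R+1)}$ decays exponentially in $R$ because $S$ is one-dimensional; for the function classes considered below, only polynomial decay $R^{-s}$ is achievable.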

Our contributions

Phase Transition
We improve on the dichotomy (1.3) by measuring the size of the class $A^s_{S,X}(\mathcal{C})$ of elements with compression rate $s$ under the codec $\mathcal{C}$. Then a phase transition occurs: the class of elements that cannot be encoded at a "larger than optimal" rate is generic. We prove this when the signal class is a ball in a Besov or Sobolev space, as long as this ball forms a compact subset of $H = L^2(\Omega)$ for a bounded Lipschitz domain $\Omega \subset \mathbb{R}^d$.
More precisely, for each such signal class $S$, we construct a probability measure $\mathbb{P}$ on $S$ such that the compressibility exhibits a phase transition as in the following definition. Here $\mathbb{P}^*$ is the outer measure corresponding to $\mathbb{P}$, defined in Equation (1.6) below.

Definition 1.1. A Borel probability measure $\mathbb{P}$ on $S$ exhibits a compressibility phase transition if
$$s < s^*_H(S) \implies \exists\, \mathcal{C} \in \mathrm{Codecs}_{S,H} : A^s_{S,H}(\mathcal{C}) = S, \qquad s > s^*_H(S) \implies \forall\, \mathcal{C} \in \mathrm{Codecs}_{S,H} : \mathbb{P}^*\big( A^s_{S,H}(\mathcal{C}) \big) = 0. \qquad (1.4)$$
The first implication in (1.4) is always satisfied, as a consequence of (1.3). The second part of (1.4) states that for any $s > s^*_H(S)$ and any codec $\mathcal{C}$, almost every $x \in S$ cannot be compressed by $\mathcal{C}$ at rate $s$. In other words, whenever $\mathbb{P}$ exhibits a compressibility phase transition on $S$, the property of not being compressible at a "larger than optimal" rate is a generic property.

Remark 1.2 (Universality in Definition 1.1). Note that the measure $\mathbb{P}$ in Definition 1.1 is required to satisfy the second property in (1.4) universally, for any choice of codec $\mathcal{C}$.
In fact, if $\mathbb{P}$ were allowed to depend on $\mathcal{C}$, one could simply choose $\mathbb{P} = \delta_x$, where $x = x(\mathcal{C}, s) \in S$ is a single element that is not approximated at rate $s$ by $\mathcal{C}$; for $s > s^*_H(S)$ such an element exists under mild assumptions on $S$. In contrast, the measure $\mathbb{P}$ in Definition 1.1 satisfies $\mathbb{P}(\{x\}) = 0$ for each $x \in S$, as can be seen by taking $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}}$ with $D_R : \{0,1\}^R \to S,\ b \mapsto x$, so that $A^s_{S,H}(\mathcal{C}) = \{x\}$ for all $s > 0$. This shows, in particular, that any probability measure $\mathbb{P}$ exhibiting a compressibility phase transition is atom-free, so that $\mathbb{P}(M) = 0$ for any countable set $M$.
Our first main result (Theorem 1.3) establishes the existence of critical measures for all Sobolev and Besov balls (denoted $B(0, 1; W^{k,p}(\Omega;\mathbb{R}))$ and $B(0, 1; B^\tau_{p,q}(\Omega;\mathbb{R}))$, respectively; see Appendix C) that are compact subsets of $L^2(\Omega)$: In either case, $s^*_{L^2(\Omega)}(S) = s^*$, and there is a Borel probability measure $\mathbb{P}$ on $S$ that exhibits a compressibility phase transition as in Definition 1.1.
Since Remark 1.2 shows that the measure P from the preceding theorem satisfies P(M ) = 0 for each countable set M ⊂ S, we get the following strengthening of the dichotomy (1.3).
Corollary 1.4. Under the assumptions of Theorem 1.3, for each codec $\mathcal{C} \in \mathrm{Codecs}_{S,L^2(\Omega)}$ the set $S \setminus \bigcup_{s > s^*} A^s_{S,L^2(\Omega)}(\mathcal{C})$, which consists of all signals that cannot be encoded by $\mathcal{C}$ at compression rate $s$ for any $s > s^*$, is uncountable.
In words, Corollary 1.4 states that for every codec, the set of signals in $S$ that cannot be approximated at any compression rate larger than the optimal rate for $S$ is uncountable. In contrast, previous results (such as Proposition G.1) only yield the existence of a single such "badly approximable" signal.

Quantitative lower bounds
As a quantitative version of Theorem 1.3, we show that if one randomly chooses a function $f \sim \mathbb{P}$ according to the probability measure $\mathbb{P}$ constructed in (the proof of) Theorem 1.3, one can precisely bound the probability that a given encoding/decoding pair $(E_R, D_R)$ of code-length $R$ achieves a given error $\varepsilon$ for $f$. To underline the probabilistic interpretation, we define, for any property $\tau$ of elements $f \in S$,
$$\Pr(\tau) := \mathbb{P}^*\big( \{ f \in S \,:\, f \text{ satisfies } \tau \} \big). \qquad (1.5)$$
Theorem 1.5. Let $S$, $s^*$, and $\mathbb{P}$ be as in Theorem 1.3. Then for each $s > s^*$ there are $c, \varepsilon_0 > 0$ such that for every $R \in \mathbb{N}$ and every $(E_R, D_R) \in \mathrm{Enc}^R_{S,L^2(\Omega)}$,
$$\Pr\big( \| f - D_R(E_R(f)) \|_{L^2} \leq \varepsilon \big) \leq \min\big\{ 1,\ 2^{R - c \cdot \varepsilon^{-1/s}} \big\} \quad \forall\, \varepsilon \in (0, \varepsilon_0).$$
Proof. This follows from Theorems 4.1, 4.2, and 2.2.
Theorem 1.5 is interesting due to its nonasymptotic nature. Indeed, given a fixed budget of $R$ bits and a desired accuracy $\varepsilon$, it provides a partial answer to the question: How likely is one to succeed in describing a random $f \in S$ to within accuracy $\varepsilon$ using $R$ bits? Figure 1 provides an illustration of the phase transition behaviour in dependence of $\varepsilon$ and $R$; it graphically shows that the transition is quite sharp.
Figure 1: For $S$ a Sobolev or Besov ball, Theorem 1.5 provides bounds on the probability of being able to describe a random function $f \in S$ to within accuracy $\varepsilon$ using $R$ bits. For every $s > s^*$ and $\varepsilon \in (0, \varepsilon_0)$ (with $s^*$ denoting the optimal compression rate of $S$), this probability is upper bounded by $E_s(R, \varepsilon) := \min\{1, 2^{R - c \cdot \varepsilon^{-1/s}}\}$. In this figure we show two plots of the function $E_s$ over the $(R, 1/\varepsilon)$-plane. Both grayscale plots show $E_s$ for $s = 2.002 > s^* = 2$ and $c = 1$, while the red curve indicates the critical region where $R = (1/\varepsilon)^{1/s}$. We see that a sharp phase transition occurs in the sense that above and slightly below the critical curve $R = \varepsilon^{-1/2}$ (white area), the upper bound $E_s$ does not rule out the possibility that it is always possible to describe $f \in S$ to within accuracy $\varepsilon$ using $R$ bits; but even slightly below the critical curve (dark area), the bound $E_s$ shows that such a compression is almost impossible. The sharpness of the phase transition is more clearly shown in the zoomed part of the figure. The bottom plot further illustrates the quantitative behaviour by using a logarithmic colormap.
Note that in the bottom plot two different colormaps are used for the range [−100, 0] and the remaining range [−1000, −100).
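Since the bound $E_s(R, \varepsilon) = \min\{1, 2^{R - c\cdot\varepsilon^{-1/s}}\}$ is explicit, its phase transition can be checked directly. The sketch below (our own illustration, with the simplified choices $s = 2$ and $c = 1$; the figure uses $s = 2.002$) evaluates the bound on both sides of the critical curve $R = \varepsilon^{-1/s}$:

```python
def E_s(R, eps, s=2.0, c=1.0):
    """Upper bound min{1, 2^(R - c*eps^(-1/s))} on the probability of
    describing a random f in S to accuracy eps using R bits."""
    exponent = R - c * eps ** (-1.0 / s)
    return 1.0 if exponent >= 0 else 2.0 ** exponent

eps = 1e-4                    # target accuracy
# Critical code-length R = eps^(-1/s); here roughly 100 bits.
R_crit = eps ** (-1.0 / 2.0)

# Above the critical curve, the bound is vacuous (equal to 1) ...
assert E_s(R=110, eps=eps) == 1.0
# ... while slightly below it, success is already nearly impossible.
assert E_s(R=90, eps=eps) < 1e-3
```

Dropping from 110 to 90 bits thus moves the bound from trivial to roughly $2^{-10}$, which is the sharpness visible in the zoomed part of the figure.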

Lower Bounds for Neural Network Approximation
As an application, we draw a connection between the previously described results and function approximation using neural networks. We will use the following mathematical formalization of (fully connected, feedforward) neural networks [23].
The complexity of the network $\Phi$ is described by its size $W(\Phi)$, the number of nonzero entries of its weight matrices and bias vectors. We will also be interested in the complexity of the individual weights and biases of the network. Precisely, for $\sigma, W \in \mathbb{N}$, we say that $\Phi$ is $(\sigma, W)$-quantized if all entries of the matrices $A_\ell$ and the vectors $b_\ell$ belong to a fixed grid of quantized values determined by $\sigma$ and $W$. Note that in applications one necessarily deals with quantized NNs, due to the necessity to store and process the weights on a digital computer. Regarding function approximation by such quantized neural networks, we have the following result:
Theorem 1.7. Let $S$, $s^*$, and $\mathbb{P}$ be as in Theorem 1.3. Then the following hold:
1. There is $C = C(d, \sigma) \in \mathbb{N}$ such that for each $s > s^*$ there are $c, \varepsilon_0 > 0$ satisfying
$$\Pr\Big( \exists\, (\sigma, W)\text{-quantized network } \Phi \text{ with } W(\Phi) \leq W \text{ and } \| f - \Phi \|_{L^2} \leq \varepsilon \Big) \leq \min\big\{ 1,\ 2^{C \cdot W \lceil \log_2(1+W) \rceil^2 - c \cdot \varepsilon^{-1/s}} \big\} \quad \forall\, W \in \mathbb{N},\ \varepsilon \in (0, \varepsilon_0).$$

2. If we define $A^*_{NN}$ as the set of all $f \in S$ that are approximated by (quantized) neural networks at some rate strictly larger than $s^*$, then $\mathbb{P}^*(A^*_{NN}) = 0$.

Proof. The proof of this theorem is deferred to Appendix F.

Theorem 1.7 can be interpreted as follows: Suppose we would like to approximate a function $f \in S$ to within accuracy $\varepsilon$ using (quantized) neural networks of size $\leq W$. Theorem 1.7 provides an upper bound on the probability of success. In particular, it shows that the network size has to scale at least like $\varepsilon^{-1/s^*}$ to succeed with high probability if $S$ is a Sobolev or Besov ball; see Figure 2.

Figure 2: Suppose we want to approximately represent a signal $f$ to within accuracy $\varepsilon$ by a (quantized) neural network $\Phi_f$ constrained to be of size $W(\Phi_f) \leq W$ (for example, due to limited memory). Such a network shall be produced by any numerical "learning" procedure $\Phi_f = \mathrm{Learn}(f)$. Suppose further that the only available prior information is that $f \in S$, where $S$ has optimal compression rate $s^*$ as in Theorem 1.3 (such prior information is, for instance, available if $f$ is the solution of a linear elliptic PDE with known right-hand side). Then, no matter how we choose the "learning" algorithm $\mathrm{Learn}(f)$, Theorem 1.7 states that for any $s > s^*$ there are constants $c, C$ such that the probability of success is bounded from above by $\min\{1, 2^{C \cdot W \lceil \log_2(1+W) \rceil^2 - c \cdot \varepsilon^{-1/s}}\}$.
where $\tau \in (0, \tfrac{1}{s^*})$ is arbitrary. This follows from results in [12, 26]. Since the details are mainly technical, the proof is deferred to Appendix F. We remark that, by arguments similar to those in [12, 26], one can also prove the sharpness for activation functions other than the ReLU and for domains other than $[0,1]^d$.
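The quantitative content of Theorem 1.7 is easy to explore numerically. The sketch below (our own illustration; the constants $C = c = 1$ and the accuracy are arbitrary choices, not values from the paper) evaluates the bound $\min\{1, 2^{C\cdot W\lceil\log_2(1+W)\rceil^2 - c\cdot\varepsilon^{-1/s}}\}$ and locates the smallest network size $W$ at which the bound stops being negligible:

```python
import math

def success_bound(W, eps, s, C=1.0, c=1.0):
    """Bound min{1, 2^(C*W*ceil(log2(1+W))^2 - c*eps^(-1/s))} of Theorem 1.7."""
    exponent = C * W * math.ceil(math.log2(1 + W)) ** 2 - c * eps ** (-1.0 / s)
    return 1.0 if exponent >= 0 else 2.0 ** exponent

s, eps = 2.0, 1e-6  # accuracy eps and rate s slightly above s* (illustrative)

# Smallest W for which the bound no longer rules out success:
threshold = next(W for W in range(1, 10**6) if success_bound(W, eps, s) >= 0.5)

# For tiny networks the success probability is astronomically small ...
assert success_bound(1, eps, s) < 2.0 ** -900
# ... and the bound only becomes vacuous once W * ceil(log2(1+W))^2 reaches
# the order of eps^(-1/s) = 10^3, i.e. W must grow like eps^(-1/s)
# up to logarithmic factors.
assert threshold * math.ceil(math.log2(1 + threshold)) ** 2 >= eps ** (-1.0 / s) - 1
```

With these toy constants, the transition happens at $W = 32$, where $W \lceil \log_2(1+W) \rceil^2 = 1152$ first exceeds $\varepsilon^{-1/s} = 1000$.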

Related literature
Many (optimality) results in approximation theory are formulated in a minimax sense, meaning that one precisely characterizes the asymptotic decay of
$$d_X(S, M_n) := \sup_{f \in S}\ \inf_{g \in M_n} \| f - g \|_X \quad \text{as } n \to \infty,$$
where $S \subset X$ is the signal class to be approximated, and $M_n \subset X$ contains all functions "of complexity $n$", for example polynomials of degree $n$, or shallow neural networks with $n$ neurons, etc. As recent examples of such results related to neural networks, we mention [4, 23, 31].
A minimax lower bound of the form $d_X(S, M_n) \gtrsim n^{-s^*}$, however, only makes a claim about the possible worst case of approximating elements $f \in S$. In other words, such an estimate in general only guarantees that there is at least one "hard to approximate" function $f^* \in S$ that satisfies $\inf_{g \in M_n} \| f^* - g \|_X \gtrsim n^{-s}$ for each $s > s^*$, but nothing is known about how "massive" this set of "hard to approximate" functions is, or about the "average case".
The first paper to address this question, and one of the main sources of inspiration for the present paper, is [20]. In that paper, Maiorov, Meir, and Ratsaby consider essentially an "$L^2$-Besov-space type" signal class $S = S_r$ of functions $f \in L^2(B_d)$ (with $B_d$ the unit ball of $\mathbb{R}^d$), defined in terms of the polynomial approximation errors $\mathrm{dist}_{L^2}(f, P_K)$, where $P_K = \mathrm{span}\{ x^\alpha : \alpha \in \mathbb{N}_0^d \text{ with } |\alpha| \leq K \}$ denotes the space of $d$-variate polynomials of degree at most $K$. On this signal class, they construct a probability measure $\mathbb{P}$ such that, with $M_n$ denoting a class of linear combinations of $n$ ridge functions, one obtains the minimax asymptotic $d_{L^2}(S_r, M_n) \asymp n^{-r/(d-1)}$, but furthermore there is $c > 0$ such that
$$\mathbb{P}\Big( \Big\{ f \in S_r \,:\, \inf_{g \in M_n} \| f - g \|_{L^2} \geq c \cdot n^{-r/(d-1)} \Big\} \Big) \to 1 \quad \text{as } n \to \infty.$$
In other words, the measure of the set of functions for which the minimax asymptotic is sharp tends to $1$ as $n \to \infty$. In this context, we would also like to mention the recent article [19], in which the results of [20] are extended to cover more general signal classes and approximation in norms stronger than the $L^2$ norm. While we draw heavily on the ideas from [20] for the construction of the measure $\mathbb{P}$ in Theorem 1.3, it should be noted that we are interested in phase transitions for general encoding/decoding schemes, while [19, 20] exclusively focus on approximation using the ridge function classes $M_n$.
Finally, we would like to point out that our lower bounds for neural network approximation consider networks with quantized weights, as in [4,23]. The main reason is that without such an assumption, even two-layer networks with a fixed number of neurons can approximate any function arbitrarily well if the activation function is chosen suitably; see [21,Theorem 4]. Moreover, even if one considers the popular ReLU activation function, it was recently observed that the optimal approximation rates for networks with quantized weights can in fact be doubled using arbitrarily deep ReLU networks with highly complex weights [31].

Outline
In Section 2, we introduce and study a class of probability measures with a certain growth behaviour. More precisely, we say that $\mathbb{P}$ is of logarithmic growth order $s_0$ on $S \subset X$ if for each $s > s_0$, we have
$$\mathbb{P}\big( S \cap B(x, \varepsilon; X) \big) \leq 2^{-c \cdot \varepsilon^{-1/s}} \quad \forall\, x \in X \text{ and } \varepsilon \in (0, \varepsilon_0),$$
for suitable $c, \varepsilon_0 > 0$ depending on $s$. Here, as in the rest of the paper, $B(x, \varepsilon; X)$ is the ball around $x$ of radius $\varepsilon$ with respect to $\|\cdot\|_X$. A measure has critical growth if its logarithmic growth order equals the optimal compression rate $s^*_X(S)$. We show in particular that every critical measure exhibits a compressibility phase transition as in Definition 1.1, and we show how critical measures can be transported from one set to another.
In Section 3, we study certain sequence spaces $\ell^{p,q}_{\mathcal{P},\alpha}$; these are essentially the coefficient spaces associated to Besov spaces. By modifying the construction given in [20], we construct probability measures of critical growth on the unit balls $S^{p,q}_{\mathcal{P},\alpha}$ of these spaces. The construction of critical measures on the unit balls of Besov and Sobolev spaces is then accomplished in Section 4, essentially by using wavelet systems to transfer the critical measures from the sequence spaces to the function spaces. This makes heavy use of the transfer results established in Section 2.
A host of more technical proofs are deferred to the appendices. We assume all vector spaces to be over $\mathbb{R}$, unless explicitly stated otherwise. For a given (quasi)-normed vector space $(X, \|\cdot\|)$, we denote the closed ball of radius $r \geq 0$ around $x \in X$ by $B(x, r; X) := \{ y \in X : \| y - x \| \leq r \}$. If we want to emphasize the quasi-norm (for example, if multiple quasi-norms are considered on the same space $X$), we write $B(x, r; \|\cdot\|)$ instead.

Notation
For an index set $I$ and an integrability exponent $p \in (0, \infty]$, the sequence space $\ell^p(I)$ is defined as usual, with (quasi)-norm $\| x \|_{\ell^p} = \big( \sum_{i \in I} |x_i|^p \big)^{1/p}$ for $p < \infty$ and $\| x \|_{\ell^\infty} = \sup_{i \in I} |x_i|$. For a measure $\mu$ on a measurable space $(S, \mathscr{A})$, the outer measure $\mu^* : 2^S \to [0, \infty]$ induced by $\mu$ is given by
$$\mu^*(M) := \inf \big\{ \mu(A) \,:\, A \in \mathscr{A} \text{ and } M \subset A \big\}. \qquad (1.6)$$
It is well known (see [13, Proposition 1.10]) that $\mu^*$ is $\sigma$-subadditive, meaning that $\mu^*\big( \bigcup_{n=1}^\infty M_n \big) \leq \sum_{n=1}^\infty \mu^*(M_n)$ for arbitrary $M_n \subset S$. We will be interested in $\mu^*$-null-sets, that is, subsets $N \subset S$ satisfying $\mu^*(N) = 0$. This holds if and only if there is $N' \in \mathscr{A}$ satisfying $N \subset N'$ and $\mu(N') = 0$. Furthermore, directly from the $\sigma$-subadditivity of $\mu^*$, it follows that a countable union of $\mu^*$-null-sets is again a $\mu^*$-null-set.
A comment on measurability: Given a (not necessarily measurable) subset $M \subset X$ of a Banach space $X$, we will always equip $M$ with the trace $\sigma$-algebra $M \cap \mathcal{B}_X := \{ M \cap B : B \in \mathcal{B}_X \}$, where $\mathcal{B}_X$ denotes the Borel $\sigma$-algebra of $X$. Note that if $(\Omega, \mathscr{A})$ is an arbitrary measurable space, then $\Phi : \Omega \to M$ is measurable if and only if it is measurable when considered as a map $\Phi : \Omega \to (X, \mathcal{B}_X)$.

General results on phase transitions in Banach spaces
In this section we establish an abstract version of the phase transition considered in (1.4) for signal classes in general Banach spaces and a class of measures that satisfy a uniform growth property that we term "critical" (see Definition 2.1). We will show in Section 2.1 that such critical measures automatically induce a phase transition behavior. We furthermore show in Section 2.2 that criticality is preserved under pushforward by "nice" mappings. The existence of critical measures is by no means trivial; quite the opposite, their construction for a class of sequence spaces in Section 3-and for Besov and Sobolev spaces on domains in Section 4-constitutes an essential part of the present article.

Measures of logarithmic growth
Definition 2.1. Let S be a subset of a Banach space X, and let s 0 ∈ [0, ∞).
A Borel probability measure $\mathbb{P}$ on $S$ has (logarithmic) growth order $s_0$ (with respect to $X$) if for every $s > s_0$, there are constants $\varepsilon_0, c > 0$ (depending on $s, s_0, \mathbb{P}, S, X$) such that
$$\mathbb{P}\big( S \cap B(x, \varepsilon; X) \big) \leq 2^{-c \cdot \varepsilon^{-1/s}} \quad \forall\, x \in X \text{ and } \varepsilon \in (0, \varepsilon_0). \qquad (2.1)$$
We say that $\mathbb{P}$ is critical for $S$ (with respect to $X$) if $\mathbb{P}$ has logarithmic growth order $s^*_X(S)$, with the optimal compression rate $s^*_X(S)$ as defined in Equation (1.1).
Remark. If P has growth order s 0 , then P also has growth order σ, for arbitrary σ > s 0 .
The motivation for considering the growth order of a measure is that it leads to bounds regarding the measure of elements x ∈ S that are well-approximated by a given codec; see Equation (2.2) below. Furthermore, as we will see in Corollary 2.5, if P is a probability measure of growth order s 0 , then necessarily s 0 ≥ s * X (S), so critical measures have the minimal possible growth order.
The following theorem summarizes our main structural results, showing that critical measures always exhibit a compressibility phase transition.
Theorem 2.2. Let the signal class S be a subset of the Banach space X, let P be a Borel probability measure on S that is critical for S with respect to X, and set s * := s * X (S).
Then the following hold:
(i) Let $s > s^*$ and let $c = c(s) > 0$ and $\varepsilon_0 = \varepsilon_0(s)$ be as in Equation (2.1). Then, for any $R \in \mathbb{N}$ and $(E_R, D_R) \in \mathrm{Enc}^R_{S,X}$, we have
$$\Pr\big( \| f - D_R(E_R(f)) \|_X \leq \varepsilon \big) \leq 2^{R - c \cdot \varepsilon^{-1/s}} \quad \forall\, \varepsilon \in (0, \varepsilon_0), \qquad (2.2)$$
where we use the notation from Equation (1.5).
(ii) For every $s > s^*$ and every codec $\mathcal{C} \in \mathrm{Codecs}_{S,X}$, the set $A^s_{S,X}(\mathcal{C})$ is a $\mathbb{P}^*$-null-set: $\mathbb{P}^*\big( A^s_{S,X}(\mathcal{C}) \big) = 0$.
(iii) For every $s < s^*$ there is a codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$ satisfying $\| x - D_R(E_R(x)) \|_X \leq C \cdot R^{-s}$ for all $x \in S$ and $R \in \mathbb{N}$, for a constant $C = C(s, \mathcal{C}) > 0$. In particular, the set of $s$-compressible signals $A^s_{S,X}(\mathcal{C})$ defined in Eq. (1.2) satisfies $A^s_{S,X}(\mathcal{C}) = S$ and hence $\mathbb{P}(A^s_{S,X}(\mathcal{C})) = 1$.
Remark. 1) Note that the theorem does not make any statement about the case $s = s^*$. In this case, the behavior depends on the specific choices of $S$ and $\mathbb{P}$.
2) As noted above, the question of the existence of a critical probability measure P is nontrivial.
The proof of Theorem 2.2 is divided into several auxiliary results. Part (i) is contained in the following lemma.
Lemma 2.3. Let $S$ be a subset of a Banach space $X$, and let $\mathbb{P}$ be a Borel probability measure on $S$ that is of logarithmic growth order $s_0 \geq 0$ with respect to $X$.
Let $s > s_0$ and let $c = c(s) > 0$ and $\varepsilon_0 = \varepsilon_0(s)$ be as in Equation (2.1). Then, for any $R \in \mathbb{N}$ and $(E_R, D_R) \in \mathrm{Enc}^R_{S,X}$, we have
$$\mathbb{P}^*\big( \{ x \in S \,:\, \| x - D_R(E_R(x)) \|_X \leq \varepsilon \} \big) \leq 2^{R - c \cdot \varepsilon^{-1/s}} \quad \forall\, \varepsilon \in (0, \varepsilon_0).$$
Furthermore, for any given $s > s_0$ and $K > 0$ there exists a minimal code-length $R_0 = R_0(s, s_0, K, \mathbb{P}, S, X) \in \mathbb{N}$ such that for every codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$, we have
$$\mathbb{P}^*\big( \{ x \in S \,:\, \| x - D_R(E_R(x)) \|_X \leq K \cdot R^{-s} \} \big) \leq 2^{-R} \quad \forall\, R \geq R_0. \qquad (2.3)$$
Remark. The lemma states that the measure of the subset of points $x \in S$ with approximation error at most $K \cdot R^{-s}$ decays (at least) exponentially with the code-length $R$. In fact, the proof shows that this measure even decays super-exponentially as $R \to \infty$.
Proof. Let $s > s_0$ and let $c, \varepsilon_0$ be as in Equation (2.1). For $R \in \mathbb{N}$ and $\varepsilon \in (0, \varepsilon_0)$, define
$$A := \big\{ x \in S \,:\, \| x - D_R(E_R(x)) \|_X \leq \varepsilon \big\} \subset \bigcup_{y \in \mathrm{range}(D_R)} \big( S \cap B(y, \varepsilon; X) \big).$$
Since $\mathbb{P}$ is of growth order $s_0$ and because of $|\mathrm{range}(D_R)| \leq 2^R$, we can apply (2.1) and the subadditivity of the outer measure $\mathbb{P}^*$ to deduce
$$\mathbb{P}^*(A) \leq \sum_{y \in \mathrm{range}(D_R)} \mathbb{P}\big( S \cap B(y, \varepsilon; X) \big) \leq 2^R \cdot 2^{-c \cdot \varepsilon^{-1/s}} = 2^{R - c \cdot \varepsilon^{-1/s}}.$$
This proves the first part of the lemma.
To prove the second part, let $s > s_0$ and choose $\sigma = \frac{s + s_0}{2}$, noting that $\sigma \in (s_0, s)$. Therefore, the first part of the lemma, applied with $\sigma$ instead of $s$, yields $c, \varepsilon_0 > 0$ such that
$$\mathbb{P}^*\big( \{ x \in S \,:\, \| x - D_R(E_R(x)) \|_X \leq \varepsilon \} \big) \leq 2^{R - c \cdot \varepsilon^{-1/\sigma}} \quad \forall\, \varepsilon \in (0, \varepsilon_0).$$
Note that $\varepsilon := K \cdot R^{-s} \leq \varepsilon_0 / 2 < \varepsilon_0$ holds as soon as $R \geq (2K/\varepsilon_0)^{1/s} =: R_1$. Finally, since $s/\sigma > 1$, we can find a code-length $R_2 \in \mathbb{N}$ such that
$$2^{R - c \cdot (K \cdot R^{-s})^{-1/\sigma}} = 2^{R - c \cdot K^{-1/\sigma} \cdot R^{s/\sigma}} \leq 2^{-R} \quad \forall\, R \geq R_2.$$
Overall, we thus see that (2.3) holds with $R_0 = \max\{R_1, R_2\}$.
Proposition 2.4. Let $S$ be a subset of the Banach space $X$. If $\mathbb{P}$ is a Borel probability measure on $S$ that is of growth order $s_0 \in [0, \infty)$, then, for every $s > s_0$ and every codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$, we have $\mathbb{P}^*\big( A^s_{S,X}(\mathcal{C}) \big) = 0$.
Proof. First, note that
$$A^s_{S,X}(\mathcal{C}) = \bigcup_{K \in \mathbb{N}} \bigcap_{R \in \mathbb{N}} A^{(s)}_{K,R}, \quad \text{where } A^{(s)}_{K,R} := \big\{ x \in S \,:\, \| x - D_R(E_R(x)) \|_X \leq K \cdot R^{-s} \big\}.$$
To see that this holds, note that $\sup_{R\in\mathbb{N}} R^s \cdot \| x - D_R(E_R(x)) \|_X < \infty$ holds if and only if there is some $K \in \mathbb{N}$ with $\| x - D_R(E_R(x)) \|_X \leq K \cdot R^{-s}$ for all $R \in \mathbb{N}$. Lemma 2.3 shows $\mathbb{P}^*\big( A^{(s)}_{K,R} \big) \leq 2^{-R}$ for all $R \geq R_0(K)$. This easily implies $\mathbb{P}^*\big( \bigcap_{R\in\mathbb{N}} A^{(s)}_{K,R} \big) = 0$ for each $K \in \mathbb{N}$, so that $\mathbb{P}^*\big( A^s_{S,X}(\mathcal{C}) \big) = 0$ follows from the $\sigma$-subadditivity of $\mathbb{P}^*$.
The proof of Theorem 2.2 merely consists of combining the preceding lemmas.
Proof of Theorem 2.2. Proof of (i): This is contained in the statement of Lemma 2.3.

Proof of (ii): This follows from Proposition 2.4.
Proof of (iii): This follows from the definition of the optimal compression rate: for $s < s^*$ there exists a codec $(E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$ such that
$$\| x - D_R(E_R(x)) \|_X \leq C \cdot R^{-s} \quad \text{for a constant } C > 0 \text{ and all } x \in S,\ R \in \mathbb{N}.$$
In particular, this implies $A^s_{S,X}(\mathcal{C}) = S$, and therefore $\mathbb{P}(A^s_{S,X}(\mathcal{C})) = 1$.
We close this subsection by showing that if P is a probability measure with logarithmic growth order s 0 , then this growth order is at least as large as the optimal compression rate of the set on which P is defined. This justifies the nomenclature of "critical measures" as introduced in Definition 2.1.
Corollary 2.5. Let $S$ be a subset of $X$, and let $\mathbb{P}$ be a Borel probability measure on $S$ of growth order $s_0$. Then $s_0 \geq s^*_X(S)$.
Proof. Suppose for a contradiction that $0 \leq s_0 < s^*_X(S)$, and choose $s \in (s_0, s^*_X(S))$. By the definition of the optimal compression rate, there is a codec $\mathcal{C} \in \mathrm{Codecs}_{S,X}$ with $A^s_{S,X}(\mathcal{C}) = S$. By Proposition 2.4, we thus obtain the contradiction $1 = \mathbb{P}(S) = \mathbb{P}(A^s_{S,X}(\mathcal{C})) = 0$.

Transferring critical measures
Our main goal in this paper is to prove a phase transition as in (1.4) for Besov and Sobolev spaces. To do so, we will first prove (in Section 3) that such a phase transition occurs for a certain class of sequence spaces, and then transfer this result to the Besov and Sobolev spaces, essentially by discretizing these function spaces using suitable wavelet systems. In the present subsection, we formulate general results that allow such a transfer of a phase transition as in (1.4) from one space to another. In general, it would be most convenient if we had access to an orthonormal wavelet basis (or at least to a Riesz basis) of wavelets that is "compatible" with Besov and Sobolev spaces. For the setting of very general domains $\Omega \subset \mathbb{R}^d$ and the full range of parameters $p, q$, however, it seems to be unknown whether such orthonormal wavelet bases exist. Therefore, our transfer results allow the use of two distinct maps: Essentially, one can use a frame to transfer the optimal compression rate, and a (possibly different) Riesz sequence to transfer the critical measure. In the abstract formulation of this section, this will be expressed using a Lipschitz continuous surjection $\Phi$ (the synthesis operator of the frame) and an expansive injection $\Psi$ (the synthesis operator of the Riesz sequence).
The precise transference result reads as follows:
Theorem 2.6. Let $X, Y, Z$ be Banach spaces, and let $S_X \subset X$, $S_Y \subset Y$, and $S \subset Z$. Assume that there exists a Lipschitz continuous map $\Phi : X \to Z$ with $S \subset \Phi(S_X)$; that $s^*_Y(S_Y) = s^*_X(S_X)$; that there exists a Borel probability measure $\mathbb{P}$ on $S_Y$ that is critical for $S_Y$ with respect to $Y$; and that there exists an expansive measurable map $\Psi : S_Y \to S$, that is, $\| \Psi(x) - \Psi(y) \|_Z \geq c \cdot \| x - y \|_Y$ for all $x, y \in S_Y$ and a fixed $c > 0$. Then $s^*_Z(S) = s^*_X(S_X)$, and the push-forward measure $\mathbb{P} \circ \Psi^{-1}$ is a Borel probability measure on $S$ that is critical for $S$ with respect to $Z$.
Remark. 1) In many cases, it is natural to take $S_X = S_Y$ and $\Phi = \Psi$. As we will see in Section 4, however, the added flexibility of the formulation above is necessary to transfer critical measures from the sequence spaces $\ell^{p,q}_{\mathcal{P},\alpha,\theta}$ considered in Section 3 to Besov and Sobolev spaces.
2) As mentioned in Section 1.5, regarding the measurability of Ψ, S Y is equipped with the trace σ-algebra of the Borel σ-algebra on Y, and analogously for S.
Proof. The proof is given in Appendix A.

Proof of the phase transition in $\ell^2(I)$
In this section, we provide the proof of the phase transition for a class of sequence spaces associated to Sobolev and Besov spaces; these sequence spaces are defined in Section 3.1, where we also formulate the main result (Theorem 3.3) concerning the compressibility phase transition for these spaces. Section 3.2 establishes elementary embedding results for these spaces and provides a lower bound for their optimal compression rate; the latter essentially follows by adapting results of Leopold [18] to our setting. The construction of the critical probability measure for the sequence spaces is presented in Section 3.3, while the proof of Theorem 3.3 is given in Section 3.4.

Main Result
Definition 3.1 ($d$-regular partitions). Let $I$ be a countably infinite index set, and let $\mathcal{P} = (I_m)_{m\in\mathbb{N}}$ be a partition of $I$; that is, $I = \bigcup_{m\in\mathbb{N}} I_m$ with the sets $I_m$ pairwise disjoint. For $d \in \mathbb{N}$, the partition $\mathcal{P}$ is called $d$-regular if there are constants $c_1, c_2 > 0$ with
$$c_1 \cdot 2^{dm} \leq |I_m| \leq c_2 \cdot 2^{dm} \quad \forall\, m \in \mathbb{N}. \qquad (3.1)$$
Convention: We will always assume that $I$, $\mathcal{P}$, and $d$ have this meaning. Associated with a $d$-regular partition we now define the following family of weighted sequence spaces.
Definition 3.2. For $p, q \in (0, \infty]$ and $\alpha, \theta \in \mathbb{R}$, the mixed-norm sequence space $\ell^{p,q}_{\mathcal{P},\alpha,\theta}$ is
$$\ell^{p,q}_{\mathcal{P},\alpha,\theta} := \Big\{ x \in \mathbb{R}^I \,:\, \| x \|_{\ell^{p,q}_{\mathcal{P},\alpha,\theta}} := \Big\| \big( m^\theta \cdot 2^{\alpha m} \cdot \| (x_i)_{i \in I_m} \|_{\ell^p} \big)_{m\in\mathbb{N}} \Big\|_{\ell^q} < \infty \Big\},$$
and we write $\ell^{p,q}_{\mathcal{P},\alpha} := \ell^{p,q}_{\mathcal{P},\alpha,0}$ with unit ball $S^{p,q}_{\mathcal{P},\alpha} := B\big(0, 1; \ell^{p,q}_{\mathcal{P},\alpha}\big)$.
In the remainder of this section, we will prove the existence of a critical measure on each of the sets $S^{p,q}_{\mathcal{P},\alpha}$, provided that $\alpha > d \cdot (\frac{1}{2} - \frac{1}{p})_+$. In the proof, the (otherwise auxiliary) spaces $\ell^{p,q}_{\mathcal{P},\alpha,\theta}$ will play an essential role. Our main result is thus the following theorem, the proof of which is given in Section 3.4 below.
Theorem 3.3. Let $p, q \in (0, \infty]$ and $\alpha \in \mathbb{R}$, and assume that $\alpha > d \cdot (\frac{1}{2} - \frac{1}{p})_+$. Then $S^{p,q}_{\mathcal{P},\alpha} \subset \ell^2(I)$ is compact and hence Borel measurable, its optimal compression rate is given by $s^*$, and there exists a Borel probability measure $P^{p,q}_{\mathcal{P},\alpha}$ on $S^{p,q}_{\mathcal{P},\alpha}$ that is critical for $S^{p,q}_{\mathcal{P},\alpha}$ with respect to $\ell^2(I)$. In particular, the phase transition described in Theorem 2.2 holds.

Embedding results and a lower bound for the compression rate
Having introduced the signal classes S p,q P,α , we now collect two technical ingredients needed to construct the measures P p,q P,α on these sets: A lower bound for the optimal compression rate of S p,q P,α (Proposition 3.5) and certain elementary embeddings between the spaces p,q P,α,θ for different choices of the parameters (Lemma 3.4).
Proof. The claim follows by an elementary application of Hölder's inequality; the details can be found in Appendix H.
We continue by lower bounding the optimal compression rate of the classes S p,q P,α . As we will see in Theorem 3.3, we actually have an equality.

Construction of the measure
We now come to the technical heart of this section: the construction of the measures $P^{p,q}_{\mathcal{P},\alpha}$. We will provide different constructions for $q = \infty$ and for $q < \infty$: Since for $q = \infty$ the class $S^{p,\infty}_{\mathcal{P},\alpha,\theta}$ has a natural product structure (Lemma 3.6), we define the measure as a product measure (Definition 3.7). We then use the embedding result of Lemma 3.4 to transfer the measure on $S^{p,\infty}_{\mathcal{P},\alpha,\theta}$ to the general signal classes $S^{p,q}_{\mathcal{P},\alpha}$; see Definition 3.8. We start with the elementary observation that the balls $S^{p,\infty}_{\mathcal{P},\alpha,\theta}$ can be written as infinite products of finite-dimensional balls.
Lemma 3.6. The balls of the mixed-norm sequence spaces satisfy (up to canonical identifications) the factorization
$$S^{p,\infty}_{\mathcal{P},\alpha,\theta} = \prod_{m\in\mathbb{N}} B\big( 0,\, w_m^{-1};\, \ell^p(I_m) \big), \quad \text{where } w_m := m^\theta \cdot 2^{\alpha m}.$$
Proof. We identify $x \in \mathbb{R}^I$ with $(x_m)_{m\in\mathbb{N}} \in \prod_{m\in\mathbb{N}} \mathbb{R}^{I_m}$, as defined in Equation (3.2). Set $w_m := m^\theta \cdot 2^{\alpha m}$ for $m \in \mathbb{N}$. The statement of the lemma then follows by recalling that $\| x \|_{\ell^{p,\infty}_{\mathcal{P},\alpha,\theta}} = \sup_{m\in\mathbb{N}} w_m \cdot \| x_m \|_{\ell^p} \leq 1$ holds if and only if $\| x_m \|_{\ell^p} \leq w_m^{-1}$ for all $m \in \mathbb{N}$.
For p ∈ (0, ∞] and w m > 0 define the probability measure P p,wm m on (R Im , B m ) by .
Given $p \in (0, \infty]$ and $\alpha, \theta \in \mathbb{R}$, define $w_m := m^\theta \cdot 2^{\alpha m}$, let $\mathcal{B}_I$ denote the product $\sigma$-algebra on $\mathbb{R}^I$, and define $P^{p,\infty}_{\mathcal{P},\alpha,\theta}$ as the product measure of the family $\big( P^{p,w_m}_m \big)_{m\in\mathbb{N}}$ (see e.g. [10, Section 8.2]):
$$P^{p,\infty}_{\mathcal{P},\alpha,\theta} := \bigotimes_{m\in\mathbb{N}} P^{p,w_m}_m.$$
With the help of the preceding results, we can now describe the construction of the measure $P^{p,q}_{\mathcal{P},\alpha}$ on $S^{p,q}_{\mathcal{P},\alpha}$, also for $q < \infty$. A crucial tool will be the embedding result from Lemma 3.4.
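If, as suggested by the factorization in Lemma 3.6, each factor $P^{p,w_m}_m$ is the uniform distribution on an $\ell^p$-ball, then sampling from the product measure reduces to sampling each block independently. One classical way to draw uniformly from an $\ell^p$-ball is the Gamma-based construction of Barthe, Guédon, Mendelson, and Naor, sketched below (our own illustration; dimension, exponent, and radius are arbitrary choices):

```python
import random

def sample_lp_ball(m, p, radius=1.0, rng=random):
    """Draw one point uniformly from the radius-`radius` ball of l^p(R^m),
    using the Barthe-Guedon-Mendelson-Naor construction: for g_1, ..., g_m
    i.i.d. with density ~ exp(-|t|^p) and Y ~ Exp(1) independent, the vector
    g / (sum_i |g_i|^p + Y)^(1/p) is uniform on the unit l^p ball."""
    # |g_i| = G^(1/p) with G ~ Gamma(1/p, 1) has density ~ exp(-|t|^p).
    g = [rng.choice([-1.0, 1.0]) * rng.gammavariate(1.0 / p, 1.0) ** (1.0 / p)
         for _ in range(m)]
    y = rng.expovariate(1.0)  # independent Exp(1) "slack" variable
    norm = (sum(abs(t) ** p for t in g) + y) ** (1.0 / p)
    return [radius * t / norm for t in g]

rng = random.Random(0)
samples = [sample_lp_ball(m=8, p=1.5, radius=0.25, rng=rng) for _ in range(200)]
# Every sample lies in the ball of radius 0.25 with respect to the l^1.5 norm.
for x in samples:
    assert sum(abs(t) ** 1.5 for t in x) ** (1 / 1.5) <= 0.25 + 1e-12
```

Sampling one block per level $m$ with radius $w_m^{-1}$ then produces (finite-dimensional projections of) draws from the product measure.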
Definition 3.8 (Measures for $q < \infty$). Let the notation be as in Definition 3.7. For $q \in (0, \infty)$, fix $\theta > 1/q$; by Lemma 3.4 there is then a constant $\kappa \geq 1$ with $\| x \|_{\ell^{p,q}_{\mathcal{P},\alpha}} \leq \kappa \cdot \| x \|_{\ell^{p,\infty}_{\mathcal{P},\alpha,\theta}}$ for all $x \in \mathbb{R}^I$, and we define $P^{p,q}_{\mathcal{P},\alpha}$ as the push-forward of $P^{p,\infty}_{\mathcal{P},\alpha,\theta}$ under the map $x \mapsto x/\kappa$.
In the following, we verify that the measures defined according to Definitions 3.7 and 3.8 are indeed (Borel) probability measures on the signal classes $S^{p,\infty}_{\mathcal{P},\alpha,\theta}$ and $S^{p,q}_{\mathcal{P},\alpha}$, respectively. To do so, we first show that the signal classes are measurable with respect to the product $\sigma$-algebra $\mathcal{B}_I$, and we compare this $\sigma$-algebra to the Borel $\sigma$-algebra on $\ell^2(I)$.
Lemma 3.9. Let $\mathcal{B}_I$ denote the product σ-algebra on $\mathbb{R}^I$ and let $p, q \in (0,\infty]$ and $\alpha, \theta \in \mathbb{R}$. Then the (quasi)-norm $\|\cdot\|^{p,q}_{\mathcal{P},\alpha,\theta}$ is measurable with respect to $\mathcal{B}_I$. In particular, $\mathcal{S}^{p,q}_{\mathcal{P},\alpha,\theta} \in \mathcal{B}_I$. Further, the Borel σ-algebra $\mathcal{B}_2$ on $\ell^2(I)$ coincides with the trace σ-algebra $\ell^2(I) \cap \mathcal{B}_I$.
Proof. The (mainly technical) proof is deferred to Appendix H.
The measure $\mathbb{P}^{p,\infty}_{\mathcal{P},\alpha,\theta}$ is a probability measure on $\big(\mathcal{S}^{p,\infty}_{\mathcal{P},\alpha,\theta},\ \mathcal{S}^{p,\infty}_{\mathcal{P},\alpha,\theta} \cap \mathcal{B}_2\big)$, and the measure $\mathbb{P}^{p,q}_{\mathcal{P},\alpha}$ is a probability measure on $\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha},\ \mathcal{S}^{p,q}_{\mathcal{P},\alpha} \cap \mathcal{B}_2\big)$, where $\mathcal{B}_2$ denotes the Borel σ-algebra on $\ell^2(I)$.

Proof of Theorem 3.3
In this subsection, we prove that the measures $\mathbb{P}^{p,q}_{\mathcal{P},\alpha}$ constructed in Definition 3.8 are critical, provided that $\alpha > d \cdot (\tfrac12 - \tfrac1p)_+$. An essential ingredient for the proof is the following estimate for the volumes of balls in $\ell^p([m])$.
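Although the proofs only need two-sided estimates such as (3.5)-(3.7), the volume of $\ell^p$ balls has a classical closed form: the ball of radius $r$ in $\mathbb{R}^m$ has Lebesgue volume $(2r\,\Gamma(1+1/p))^m / \Gamma(1+m/p)$. The following sketch (our own illustration, not part of the formal development) checks this formula against elementary special cases.

```python
from math import gamma, pi, isclose

def lp_ball_volume(m: int, p: float, r: float = 1.0) -> float:
    """Lebesgue volume of {x in R^m : ||x||_p <= r} (Dirichlet's formula)."""
    return (2.0 * r * gamma(1.0 + 1.0 / p)) ** m / gamma(1.0 + m / p)

# p = 2, m = 3: the Euclidean ball, volume (4/3) * pi * r^3
assert isclose(lp_ball_volume(3, 2.0, r=2.0), (4.0 / 3.0) * pi * 8.0)
# p = 1, m = 2: the cross-polytope (diamond) with vertices (+-r, 0), (0, +-r)
assert isclose(lp_ball_volume(2, 1.0, r=3.0), 2.0 * 9.0)
# p -> infinity approaches the cube [-r, r]^m with volume (2r)^m
assert isclose(lp_ball_volume(4, 1e9), 2.0 ** 4, rel_tol=1e-6)
```

In particular, the $m$-th root behavior of these volumes is what drives the growth-order estimates for the product measure.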
Proof. A proof of (3.5) can be found, e.g., in [16, Theorem 5]. For proving (3.6), it is shown in [16, Lemma 4] that for each $p \in (0,\infty)$ there are constants $\lambda_p, \Lambda_p > 0$ satisfying (3.7). It is clear that this remains true also for $p = \infty$; in fact, since $\Gamma(1) = 1$, one can simply choose $\lambda_\infty = \Lambda_\infty = 1$ in this case. Combining (3.5) with the estimate (3.7) yields the claim for a suitable choice of the constant $C_p$.

We are finally equipped to prove Theorem 3.3.
Proof of Theorem 3.3.
Step 3 (Completing the proof): By Proposition 3.5, $\mathcal{S}^{p,q}_{\mathcal{P},\alpha} \subset \ell^2(I)$ is compact with $s^*_{\ell^2(I)}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}\big) \ge s^*$. By Step 2 and Lemma 3.10, $\mathbb{P}^{p,q}_{\mathcal{P},\alpha}$ is a Borel probability measure on $\mathcal{S}^{p,q}_{\mathcal{P},\alpha}$ of growth order $s^*$ with respect to $\ell^2(I)$. Thus, Lemma A.3 shows that $s^*_{\ell^2(I)}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}\big) = s^*$ and that $\mathbb{P}^{p,q}_{\mathcal{P},\alpha}$ is critical for $\mathcal{S}^{p,q}_{\mathcal{P},\alpha}$ with respect to $\ell^2(I)$.
Remark. The proof borrows its main idea (using the product structure of the measure $\mathbb{P}^{p,\infty}_{\mathcal{P},\alpha,\theta}$ to work on finite-dimensional projections) from [20].

Besov spaces on bounded open sets Ω ⊂ R d
For Besov spaces on bounded domains, we obtain the following consequence of Theorem 3.3 by using suitable wavelet bases to "transport" the measure $\mathbb{P}^{p,q}_{\mathcal{P},\alpha}$ to the Besov spaces.
For a review of the definition of Besov spaces (on R d and on domains), and the characterization of these spaces by wavelets, we refer to Appendices C.1 and C.2.
(ii) there is a Borel probability measure $\mathbb{P}$ on $\mathcal{S}$ that is critical for $\mathcal{S}$ with respect to $L^2(\Omega)$;

Remark. In the discussion following Theorem 2.2, we observed that the existence of a critical measure in general leaves open what happens for $s = s^*$. In the case of Besov spaces, the above theorem shows that the compression rate $s = s^*$ is actually achieved by a suitable codec.
so that α satisfies the assumptions of Theorem 3.3.
Using the wavelet characterization of Besov spaces, it is shown in Appendix C.3 that there are countably infinite index sets $J^{\mathrm{ext}}, J^{\mathrm{int}}$ with associated $d$-regular partitions $\mathcal{P}^{\mathrm{ext}} = (I^{\mathrm{ext}}_m)_{m\in\mathbb{N}}$ and $\mathcal{P}^{\mathrm{int}} = (I^{\mathrm{int}}_m)_{m\in\mathbb{N}}$, and such that there are linear maps with the following properties: 1. $\ell^{p,q}_{\mathcal{P}^{\mathrm{int}},\alpha} \hookrightarrow \ell^2(J^{\mathrm{int}})$ and $\ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha} \hookrightarrow \ell^2(J^{\mathrm{ext}})$; this follows from Proposition 3.5.
3. There is a constant $C_0 > 0$ such that $\|Q^{\mathrm{ext}} c\|_{L^2(\Omega)} \le C_0 \cdot \|c\|_{\ell^2} < \infty$ for all $c \in \ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}$, and there exists a Borel probability measure $\mathbb{P}_0$ on $B\big(0,1;\ell^{p,q}_{\mathcal{P}^{\mathrm{int}},\alpha}\big)$ that is critical for $B\big(0,1;\ell^{p,q}_{\mathcal{P}^{\mathrm{int}},\alpha}\big)$ with respect to $\ell^2(J^{\mathrm{int}})$. Therefore, we can apply Theorem 2.6 with the choices $X = \ell^2(J^{\mathrm{ext}})$, $Y = \ell^2(J^{\mathrm{int}})$, and $Z = L^2(\Omega)$, as well as $S_X = B\big(0,1;\ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}\big)$, $S_Y = B\big(0,1;\ell^{p,q}_{\mathcal{P}^{\mathrm{int}},\alpha}\big)$, and $S = B\big(0,1;B^\tau_{p,q}(\Omega;\mathbb{R})\big)$, and finally $\Phi = Q^{\mathrm{ext}}$, $\Psi = Q^{\mathrm{int}}$, and $\kappa = \gamma$. This theorem then shows $s^*_{L^2(\Omega)}(S) = \tfrac{\tau}{d} > 0$ (in particular, $S \subset L^2(\Omega)$ is totally bounded and hence compact, since $S \subset L^2(\Omega)$ is closed by Lemma E.1) and that $\mathbb{P} := \mathbb{P}_0 \circ (Q^{\mathrm{int}})^{-1}$ is a Borel probability measure on $S$ that is critical for $S$ with respect to $L^2(\Omega)$.
Finally, Proposition 3.5 yields a suitable codec. Furthermore, $Q^{\mathrm{ext}}$ is Lipschitz (with respect to $\|\cdot\|_{\ell^2}$ and $\|\cdot\|_{L^2}$) and satisfies (4.1); thus, the remark after Lemma A.2 shows that the transferred codec achieves the claimed rate on $S$.

We now turn to Sobolev spaces and prove that also for these spaces, the phase transition phenomenon holds. To be completely explicit, we endow the space $W^{k,p}(\Omega)$ with the norm $\|f\|_{W^{k,p}(\Omega)} := \max_{|\alpha| \le k} \|\partial^\alpha f\|_{L^p(\Omega)}$; see Equation (4.2).

Sobolev spaces on Lipschitz domains
Our phase-transition result reads as follows: (ii) there is a Borel probability measure $\mathbb{P}$ on $\mathcal{S}$ that is critical for $\mathcal{S}$ with respect to $L^2(\Omega)$;

Remark. 1) As for the case of Besov spaces, the theorem shows that the critical rate $s = s^* = \tfrac{k}{d}$ is actually attained by a suitable codec.
Proof of Theorem 4.2. We present here the proof for the case p ∈ (1, ∞), where we will see that the claim follows from that for the Besov spaces. For the case p ∈ {1, ∞}, the proof is more involved, and thus postponed to Appendix D.
Define $\underline{p} := \min\{p,2\}$ and $\overline{p} := \max\{p,2\}$, as well as $S_{\mathrm{s}} := B\big(0,1; B^k_{p,\underline{p}}(\Omega)\big)$ and $S_{\mathrm{b}} := B\big(0,1; B^k_{p,\overline{p}}(\Omega)\big)$. We will prove below that there are constants $C_1, C_2 > 0$ such that the inclusion (4.3) holds. Assuming this for the moment, recall from Theorem 4.1 that $s^*_{L^2(\Omega)}(S_{\mathrm{s}}) = s^*_{L^2(\Omega)}(S_{\mathrm{b}}) = \tfrac{k}{d}$ and that there exists a Borel probability measure $\mathbb{P}_0$ on $S_{\mathrm{s}}$ that is critical for $S_{\mathrm{s}}$ with respect to $L^2(\Omega)$. Define $X := Y := Z := L^2(\Omega)$ and $S_X := S_{\mathrm{b}}$, $S_Y := S_{\mathrm{s}}$, as well as the (rescaled) inclusion maps $\Phi$ and $\Psi$. Using (4.3), one easily checks that all assumptions of Theorem 2.6 are satisfied. An application of that theorem shows that $s^*_{L^2(\Omega)}(\mathcal{S}) = \tfrac{k}{d}$ and that $\mathbb{P} := \mathbb{P}_0 \circ \Psi^{-1}$ is a Borel probability measure on $\mathcal{S}$ that is critical for $\mathcal{S}$ with respect to $L^2(\Omega)$.
Finally, Part (iii) of Theorem 4.1 yields a suitable codec. Furthermore, since $\Omega$ is a Lipschitz domain, [25, Chapter VI, Theorem 5] shows that there is a bounded linear "extension operator" $E : W^{k,p}(\Omega) \to W^{k,p}(\mathbb{R}^d)$. It is now easy to prove the inclusion (4.3), with $C_1 := C_3$ and $C_2 := C_4 \cdot \|E\|$. First, if $f \in S_{\mathrm{s}}$ and $\varepsilon > 0$, then there is $g \in B^k_{p,\underline{p}}(\mathbb{R}^d)$ satisfying $f = g|_\Omega$ and $\|g\|_{B^k_{p,\underline{p}}(\mathbb{R}^d)} \le 1 + \varepsilon$, and hence $\|f\|_{W^{k,p}(\Omega)} = \|g|_\Omega\|_{W^{k,p}(\Omega)} \le \|g\|_{W^{k,p}(\mathbb{R}^d)} \le C_3 \cdot (1+\varepsilon)$. Since this holds for all $\varepsilon > 0$, we obtain the first half of the inclusion; together with the converse inclusion, this shows that $\mathcal{S} \subset L^2(\Omega)$ is compact.

Footnote: The precise definition of the spaces $F^k_{p,2}$ is immaterial for us. We merely remark that the identity $F^k_{p,2}(\mathbb{R}^d) = W^{k,p}(\mathbb{R}^d)$ is only valid for $p \in (1,\infty)$.

A. Transferring approximation rates and measures
In this appendix, we provide the proof of Theorem 2.6. Along the way we will show that expansive maps can be used to transfer measures with a certain growth order from one set to another, while Lipschitz maps can be used to transfer estimates for the optimal compression rate from one set to another.
Lemma A.1. Let $X, Y$ be Banach spaces and let $S \subset X$ and $S' \subset Y$. Let $\Phi : S \to S'$ be measurable (with respect to the trace σ-algebras of the Borel σ-algebras) and expansive, in the sense that there is $\kappa > 0$ such that $\|\Phi(x) - \Phi(x')\|_Y \ge \kappa \cdot \|x - x'\|_X$ for all $x, x' \in S$. If $s_0 \ge 0$ and if $\mathbb{P}$ is a Borel probability measure on $S$ of growth order $s_0$, then the push-forward measure $\mathbb{P} \circ \Phi^{-1}$ is a Borel probability measure on $S'$ of growth order $s_0$ as well.
Proof. The estimate is trivial if $\Phi(S) \cap B(y,\varepsilon;Y) = \emptyset$, since then $\Phi^{-1}\big(B(y,\varepsilon;Y)\big) = \emptyset$, and hence $\nu\big(S' \cap B(y,\varepsilon;Y)\big) = \mathbb{P}\big(\Phi^{-1}(S' \cap B(y,\varepsilon;Y))\big) = \mathbb{P}(\emptyset) = 0$. Therefore, let us assume that $\emptyset \ne \Phi(S) \cap B(y,\varepsilon;Y) \ni y'$; say $y' = \Phi(x')$ for some $x' \in S$. Now, for arbitrary $x \in \Phi^{-1}\big(S' \cap B(y,\varepsilon;Y)\big)$ we have $\|\Phi(x) - \Phi(x')\|_Y \le 2\varepsilon$, and hence $\|x - x'\|_X \le \tfrac{2}{\kappa}\varepsilon$ by expansivity. We have thus shown $\Phi^{-1}\big(S' \cap B(y,\varepsilon;Y)\big) \subset S \cap B\big(x', \tfrac{2}{\kappa}\varepsilon; X\big)$. Since $\tfrac{2}{\kappa}\varepsilon < \varepsilon_0$, Property (2.1) yields the claim.

As a kind of converse of the previous result, we now show that Lipschitz maps can be used to obtain bounds for the optimal compression rate $s^*_X(S)$ of a signal class $S \subset X$.

Lemma A.2. Let $X, Y$ be Banach spaces, and let $S \subset X$ and $S' \subset Y$. Assume that $\Phi : S \to Y$ is Lipschitz continuous and that $\Phi(S) \supset S'$. Then $s^*_Y(S') \ge s^*_X(S)$.

Remark. The proof shows that if there exists a codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$ satisfying $\delta_{S,X}(E_R, D_R) \lesssim R^{-s}$ for some $s \ge 0$, then one can construct a modified codec for $S'$ achieving the same rate.

Proof. The claim is clear if $s^*_X(S) = 0$. Thus, let us assume $s^*_X(S) > 0$, and let $s \in [0, s^*_X(S))$ be arbitrary. Then there is a codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$ and a constant $C > 0$ such that $\delta_{S,X}(E_R, D_R) \le C \cdot R^{-s}$ for all $R \in \mathbb{N}$. Let $L > 0$ denote a Lipschitz constant for $\Phi$. Now, for $\varepsilon > 0$ and $x \in X$, choose $\Psi_\varepsilon(x) \in S$ such that $\|x - \Psi_\varepsilon(x)\|_X \le \varepsilon + \mathrm{dist}(x, S)$, and let $D^*_R := \Phi \circ \Psi_{R^{-s}} \circ D_R$. Now, if $y \in S' \subset \Phi(S)$ is arbitrary, then $y = \Phi(x)$ for some $x \in S$, and hence $\|y - D^*_R(E_R(x))\|_Y \lesssim R^{-s}$ for all $y \in S'$ and $R \in \mathbb{N}$; hence $s^*_Y(S') \ge s$. Since $s \in [0, s^*_X(S))$ was arbitrary, this completes the proof.
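The codec-transfer construction behind Lemma A.2 (decode with the given codec, project back onto $S$, then push forward through $\Phi$) can be illustrated on a toy example; the sets, maps, and quantizer below are our own hypothetical choices, not taken from the paper.

```python
def encode(x: float, R: int) -> int:
    """Uniform R-bit quantizer on S = [0, 1]: index of the containing cell."""
    return min(int(x * 2**R), 2**R - 1)

def decode(i: int, R: int) -> float:
    """Decode to the cell midpoint; distortion on S is at most 2**-(R+1)."""
    return (i + 0.5) / 2**R

L = 2.0  # Lipschitz constant of Phi(x) = 2x, so Phi(S) = [0, 2] = S'

def encode_transferred(y: float, R: int) -> int:
    """Codec for S' = Phi(S): encode a preimage x with Phi(x) = y."""
    return encode(y / L, R)

def decode_transferred(i: int, R: int) -> float:
    """Decode in S, then push forward through Phi."""
    return L * decode(i, R)

# The transferred codec inherits the rate: distortion <= L * 2**-(R+1)
R = 8
worst = max(abs(y - decode_transferred(encode_transferred(y, R), R))
            for y in [k / 1000 * 2 for k in range(1001)])
assert worst <= L * 2 ** -(R + 1) + 1e-12
```

Here $\Phi$ is already defined on all of $S$ and maps into $S'$, so the projection step $\Psi_\varepsilon$ of the general proof is trivial; in general it is needed because $D_R$ may decode to points outside $S$.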
The following lemma shows that if a signal class S ⊂ X carries a Borel probability measure of growth order s 0 and satisfies s * X (S) ≥ s 0 , then in fact s * X (S) = s 0 . This is elementary, but will be used quite frequently, so that we prefer to state it as a lemma.
Lemma A.3. Let $s_0 \in [0,\infty)$, let $X$ be a Banach space, and let $S \subset X$. Assume that there exists a Borel probability measure $\mathbb{P}$ of growth order $s_0$ on $S$ and that $s^*_X(S) \ge s_0$. Then $s^*_X(S) = s_0$ and $\mathbb{P}$ is critical for $S$ with respect to $X$.
Proof. Corollary 2.5 shows that s 0 ≥ s * X (S). Since s 0 ≤ s * X (S) by assumption, the claim of the lemma follows.
We finally provide the proof of Theorem 2.6.
Proof of Theorem 2.6. Since $\Phi : S_X \to Z$ is Lipschitz continuous with $\Phi(S_X) \supset S$, Lemma A.2 shows that $s^*_Z(S) \ge s^*_X(S_X) =: s^*$. Furthermore, since $\Psi : S_Y \to S$ is measurable and expansive and $\mathbb{P}$ has growth order $s^*_Y(S_Y) = s^*$, Lemma A.1 shows that $\nu := \mathbb{P} \circ \Psi^{-1}$ is a Borel probability measure on $S$ of growth order $s^*$ as well. Now, Lemma A.3 shows that $s^*_Z(S) = s^*$ and that $\nu$ is critical for $S$ with respect to $Z$.

B. A lower bound for the optimal compression rate $s^*_{\ell^2(I)}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}\big)$

Our goal in this subsection is to show that the optimal compression rate for the class $\mathcal{S}^{p,q}_{\mathcal{P},\alpha}$ satisfies $s^*_{\ell^2(I)}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}\big) \ge \tfrac{\alpha}{d} - \big(\tfrac12 - \tfrac1p\big)$, assuming that $\alpha > d\cdot\big(\tfrac12 - \tfrac1p\big)_+$. Our proof of this fact relies on an equivalence between the optimal distortion for a set and the so-called entropy numbers of that set. By combining this equivalence with known estimates for the entropy numbers of certain embeddings between sequence spaces (taken from [18]), we will obtain the claim.
First, let us describe the equivalence between the optimal achievable distortion and the entropy numbers of a set. Following [5, 11], given a (quasi)-Banach space $X$, a set $M \subset X$, and $k \in \mathbb{N}$, the $k$-th entropy number $e_k(M) := e_k(M; X)$ of $M$ is defined as
$$e_k(M; X) := \inf\Big\{\varepsilon > 0 : M \subset \textstyle\bigcup_{i=1}^{2^{k-1}} B(x_i, \varepsilon; X) \text{ for suitable } x_1, \ldots, x_{2^{k-1}} \in X\Big\}.$$
Finally, if $Y$ is a further (quasi)-Banach space and $T : Y \to X$ is linear, then the entropy numbers $e_k(T)$ are defined as $e_k(T) := e_k\big(T(B(0,1;Y)); X\big)$.
For proving that $s^*_{\ell^2(I)}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}\big) \ge \tfrac{\alpha}{d} - \big(\tfrac12 - \tfrac1p\big)$, we will use the following folklore equivalence between entropy numbers and the optimal achievable distortion for a given set:

Lemma B.1. Let $X$ be a Banach space and $S \subset X$. Then the entropy numbers of $S$ are equivalent to the best achievable distortion; in particular, $e_{R+1}(S; X) \le \delta_{S,X}(E_R, D_R)$ for every $(E_R, D_R) \in \mathrm{Enc}^R_{S,X}$.

Proof. "≤": Let $(E_R, D_R) \in \mathrm{Enc}^R_{S,X}$ be arbitrary. Note that $\mathrm{range}(D_R) = D_R(\{0,1\}^R)$ is nonempty and has at most $2^R$ elements, so that $\mathrm{range}(D_R) = \{x_1, \ldots, x_{2^R}\}$, where we possibly repeat some elements. Define $\delta := \delta_{S,X}(E_R, D_R)$. If $\delta = \infty$, then trivially $e_{R+1}(S; X) \le \delta$; hence, assume that $\delta < \infty$. By definition of the distortion, this means $\|u - D_R(E_R(u))\|_X \le \delta$ for all $u \in S$, and hence $S \subset \bigcup_{i=1}^{2^R} B(x_i, \delta; X)$, which shows that $e_{R+1}(S) \le \delta = \delta_{S,X}(E_R, D_R)$.
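The "codec yields covering" direction of Lemma B.1 can be checked concretely for $S = [0,1] \subset \mathbb{R}$ (a toy example of ours, not from the text): the $2^R$ codewords of a uniform $R$-bit quantizer achieve distortion $2^{-(R+1)}$, so they form a $2^{-(R+1)}$-covering, witnessing $e_{R+1}([0,1]) \le 2^{-(R+1)}$.

```python
def codewords(R: int) -> list[float]:
    """Codewords of the uniform R-bit quantizer on [0, 1] (cell midpoints)."""
    return [(i + 0.5) / 2**R for i in range(2**R)]

def covering_radius(points: list[float], grid_n: int = 10_000) -> float:
    """Max distance from a fine grid on [0, 1] to the nearest given point."""
    return max(min(abs(x - c) for c in points)
               for x in (k / grid_n for k in range(grid_n + 1)))

R = 6
delta = 2 ** -(R + 1)  # distortion of the R-bit quantizer on [0, 1]
# Its 2^R codewords cover [0, 1] with radius delta, so e_{R+1}([0,1]) <= delta.
assert covering_radius(codewords(R)) <= delta + 1e-9
```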
In addition to this equivalence between entropy numbers and best achievable distortion, we will use two results from [18] about the asymptotic behavior of the entropy numbers of certain sequence spaces. The following definition introduces the terminology used in [18].
Using these notions, Leopold proved the following results:
Assume that either (i) $p_1 \le p_2$; or (ii) $p_2 < p_1$ and the sequence $\big(\beta_j \cdot N_j^{1/p_2 - 1/p_1}\big)_{j\in\mathbb{N}_0}$ is almost strongly increasing.
Then the embedding $\ell^{q_1}\big(\beta_j\,\ell^{p_1}_{N_j}\big) \hookrightarrow \ell^{q_2}\big(\ell^{p_2}_{N_j}\big)$ holds, and there are $C_1, C_2 > 0$ such that the stated two-sided entropy-number estimate holds for all $L \in \mathbb{N}$.

Remark. We note that the above results pertain to spaces of complex sequences. At least concerning the upper bound, however, this is no problem: if we denote by $\mathrm{Re}\,x$ the (componentwise) real part of the sequence $x$, then clearly $\|\mathrm{Re}\,x\|_{\ell^q(\beta_j \ell^p_{N_j})} \le \|x\|_{\ell^q(\beta_j \ell^p_{N_j})}$; hence, the real-valued version of the space inherits the upper entropy bound.

Proof of Proposition 3.5. Let $n_m := |I_m|$ for $m \in \mathbb{N}$ and $N_j := n_{j+1}$ for $j \in \mathbb{N}_0$. Further, set $\kappa := d^{-1}\big(1 + \log_2(A/a)\big)$, where we recall from Equation (3.1) that $a, A > 0$ satisfy $a\,2^{dm} \le n_m \le A\,2^{dm}$. Thus, $N_{j+1} = n_{j+2} \sim 2^{d(j+2)} = 2^d\,2^{d(j+1)} \sim n_{j+1} = N_j$, which shows that $N = (N_j)_{j\in\mathbb{N}_0}$ is admissible. Furthermore, if $k \ge j + \kappa$, then $N_k \ge 2N_j$, which shows that $N$ is almost strongly increasing. Next, define $\beta_j := 2^{\alpha(j+1)}$ for $j \in \mathbb{N}_0$, noting that $\beta_{j+1} = 2^\alpha \beta_j$, which implies that $(\beta_j)_{j\in\mathbb{N}_0}$ is admissible. Furthermore, if $k \ge j + \alpha^{-1}$, then $\beta_k \ge 2 \cdot 2^{\alpha(j+1)} = 2\beta_j$, so that $(\beta_j)_{j\in\mathbb{N}_0}$ is also almost strongly increasing. Here, we used that $\alpha > 0$.
Finally, for each $m \in \mathbb{N}$ pick a bijection $\iota_m : [N_{m-1}] \to I_m$ (which is possible since $N_{m-1} = n_m = |I_m|$), and define $\Psi$ accordingly. It is easy to see that $\Psi$ is a bijection. Using the resulting identities, it is straightforward to see that $\ell^{p,q}_{\mathcal{P},\alpha} \hookrightarrow \ell^2(I)$ holds if and only if $\ell^q\big(\beta_j \ell^p_{N_j}; \mathbb{R}\big) \hookrightarrow \ell^2\big(\ell^2_{N_j}; \mathbb{R}\big)$, and furthermore that $e_k\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}; \ell^2(I)\big) = e_k\big(\ell^q(\beta_j \ell^p_{N_j};\mathbb{R}) \hookrightarrow \ell^2(\ell^2_{N_j};\mathbb{R})\big)$. There are now two cases. First, if $p \le 2$, then Equation (B.1) and the first part of Theorem B.3 with $p_1 = p$, $q_1 = q$, and $p_2 = q_2 = 2$ show that $\ell^q\big(\beta_j \ell^p_{N_j};\mathbb{R}\big) \hookrightarrow \ell^2\big(\ell^2_{N_j};\mathbb{R}\big)$, and yield a constant $C_1 > 0$ such that the corresponding entropy estimate holds. If otherwise $p > 2$, then $2^{-1} - p^{-1} > 0$, so that our assumptions concerning $\alpha$ imply that $\alpha > d\cdot(2^{-1}-p^{-1})_+ = d\cdot(2^{-1}-p^{-1})$, and hence $\gamma := \tfrac{\alpha}{d} - \big(\tfrac12 - \tfrac1p\big) > 0$.

Thus, Part (ii) of Theorem B.3 and Equation (B.1) show that
Define $C_3 := \max\{C_1, C_2\}$ and note that the preceding estimates only yield bounds for the entropy numbers $e_k\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}; \ell^2(I)\big)$ in case of $k = 2N_L$ for some $L \in \mathbb{N}$, not for general $k \in \mathbb{N}$. This, however, suffices to handle the general case. Indeed, let $R \in \mathbb{N}$ with $R \ge 2N_1$ be arbitrary, and let $L \in \mathbb{N}$ be maximal with $2N_L \le R + 1$; this is possible since $N_L \to \infty$ as $L \to \infty$. Note $R \le R + 1 < 2N_{L+1} = 2n_{L+2} \le 2A\,2^{d(L+2)} = 2^{2d+1}A\,2^{dL}$ by maximality. Since the sequence of entropy numbers $\big(e_k(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}; \ell^2(I))\big)_{k\in\mathbb{N}}$ is nonincreasing, we thus obtain the claimed bound for all $R \ge 2N_1$ and suitable constants $C_4, C_5 > 0$ which are independent of $R$. Now, since $\mathcal{S}^{p,q}_{\mathcal{P},\alpha} \subset \ell^2(I)$ is bounded (otherwise, all entropy numbers would be infinite), it is easy to see that $e_{R+1}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}; \ell^2(I)\big) \le e_1\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}; \ell^2(I)\big) \lesssim R^{-(\frac{\alpha}{d}+\frac1p-\frac12)}$ for $R \in \mathbb{N}$ with $R < 2N_1$. With this, the claim $s^*_{\ell^2(I)}\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}\big) \ge \tfrac{\alpha}{d} - \big(\tfrac12 - \tfrac1p\big)$ follows from the relation between entropy numbers and optimal distortion described in Lemma B.1.
Finally, since $e_R\big(\mathcal{S}^{p,q}_{\mathcal{P},\alpha}; \ell^2(I)\big) \to 0$ as $R \to \infty$, it follows that $\mathcal{S}^{p,q}_{\mathcal{P},\alpha} \subset \ell^2(I)$ is totally bounded. Since $\mathcal{S}^{p,q}_{\mathcal{P},\alpha} \subset \ell^2(I)$ is also easily seen to be closed (this essentially follows from Fatou's lemma), we see that $\mathcal{S}^{p,q}_{\mathcal{P},\alpha} \subset \ell^2(I)$ is compact.

C. A review of Besov spaces
In this subsection, we review the relevant properties of Besov spaces on R d and on domains, including the characterization of these spaces in terms of wavelets; see Section C.2. Before we dive into the details, a word of caution is in order. In the literature, there are two common definitions of Besov spaces: A Fourier analytic definition and a definition using moduli of continuity. Here, we only consider the former definition; the reader interested in the latter is referred to [8]. It should be mentioned, however, that the two definitions do not agree in general; see for instance [15]. Nevertheless, in the regime that we are interested in, the two definitions coincide, as can be deduced from [28, Theorem in Section 2.5.12]. Since we focus on the Fourier analytic definition only, we omit the details.

C.1. The (Fourier-analytic) definition of Besov spaces
Our presentation here follows [28, Section 2.3] and [27, Section 1.3]. In this section, all functions are taken to be complex-valued, unless indicated otherwise. Let $\mathcal{S}(\mathbb{R}^d)$ denote the space of Schwartz functions (see, for instance, [13, Section 8.1]), and $\mathcal{S}'(\mathbb{R}^d)$ its topological dual space, the space of tempered distributions (see [13, Section 9.2]). We use the Fourier transform on $L^1(\mathbb{R}^d)$ with the same normalization as in [28, 30]; here, $\langle x, \xi \rangle = \sum_j x_j \xi_j$ denotes the standard inner product on $\mathbb{R}^d$. With this normalization, the Fourier transform $\mathcal{F}$ extends to the tempered distributions, with the latter extension defined by $\langle \mathcal{F}f, \varphi \rangle_{\mathcal{S}',\mathcal{S}} := \langle f, \mathcal{F}\varphi \rangle_{\mathcal{S}',\mathcal{S}}$. Here, as in the remainder of the paper, the dual pairing for distributions is taken to be bilinear. In any case, the inverse Fourier transform is given by (the extension of) the operator $\mathcal{F}^{-1}f(x) = \mathcal{F}f(-x)$. All of the facts listed here can be found in [24, Chapter 7].

C.2. The wavelet characterization of Besov spaces
Wavelets are usually constructed using a so-called multiresolution analysis of $L^2(\mathbb{R})$. A multiresolution analysis (see [30, Definition 2.2] or [7, Section 5.1]) of $L^2(\mathbb{R})$ is a sequence $(V_j)_{j\in\mathbb{Z}}$ of closed subspaces $V_j \subset L^2(\mathbb{R})$ satisfying a list of properties, of which we here only recall the last: 5. there exists a function $\psi_F \in V_0$ (called the scaling function or the father wavelet) such that $\big(\psi_F(\cdot - m)\big)_{m\in\mathbb{Z}}$ is an orthonormal basis of $V_0$.
To each multiresolution analysis, one can associate a (mother) wavelet $\psi_M \in L^2(\mathbb{R})$; see [30, Theorem 2.20]. More precisely, denote by $W_0 \subset L^2(\mathbb{R})$ the orthogonal complement of $V_0$ in $V_1$, and define $W_j := \{f(2^j\,\cdot) : f \in W_0\}$ for $j \in \mathbb{N}$, so that $W_j$ is the orthogonal complement of $V_j$ in $V_{j+1}$. We then have $L^2(\mathbb{R}) = V_0 \oplus \bigoplus_{j=0}^{\infty} W_j$, where the sum is orthogonal.
One can show (see [30, Lemma 2.19]) that there exists $\psi_M \in W_0$ such that the family $\big(\psi_M(\cdot - k)\big)_{k\in\mathbb{Z}}$ is an orthonormal basis of $W_0$. In this case, we say that $\psi_M$ is a mother wavelet associated to the given multiresolution analysis. For each such $\psi_M$, one can show (see [27, Proposition 1.51]) that if we define $\psi_{j,m}$ for $j \in \mathbb{N}_0$ and $m \in \mathbb{Z}$ accordingly, then the inhomogeneous wavelet system $(\psi_{j,m})_{j\in\mathbb{N}_0, m\in\mathbb{Z}}$ forms an orthonormal basis of $L^2(\mathbb{R})$. Furthermore, the family $\big(2^{j/2}\psi_M(2^j\cdot{} - k)\big)_{j,k\in\mathbb{Z}}$ is an orthonormal basis of $L^2(\mathbb{R})$.

For our purposes, we will need sufficiently regular wavelet systems, as provided by the following theorem:

Theorem C.1. For each $k \in \mathbb{N}$, there is a multiresolution analysis $(V_j)_{j\in\mathbb{Z}}$ of $L^2(\mathbb{R})$ with father/mother wavelets $\psi_F, \psi_M \in L^2(\mathbb{R})$ such that the following hold: 1. $\psi_F, \psi_M$ are real-valued and have compact support; 2. $\psi_F, \psi_M \in C^k(\mathbb{R})$; 3. $\int_{\mathbb{R}} x^{\ell}\,\psi_M(x)\,dx = 0$ for $\ell \in \{0, 1, \ldots, k\}$ (vanishing moment condition).

Proof. The existence of a multiresolution analysis $(V_j)_{j\in\mathbb{Z}}$ with compactly supported father/mother wavelets $\psi_F, \psi_M \in C^k(\mathbb{R})$ is shown in [30, Theorem 4.7] (while the original proof was given in [6]). It is not stated explicitly, however, that $\psi_F, \psi_M$ are real-valued; but this can be extracted from the proof: The function $\Phi := \psi_F$ is constructed as $\Phi = (2\pi)^{-1/2}\mathcal{F}^{-1}\Theta$, with $\Theta(\xi) = \prod_{j=1}^{\infty} m(2^{-j}\xi)$ (see [30, Theorem 4.1]), where $m(\xi) = \sum_{k=0}^{T} a_k e^{ik\xi}$ is obtained through [30, Lemma 4.6], so that $a_0, \ldots, a_T \in \mathbb{R}$ and $m(0) = 1$. Therefore, [30, Lemma 4.3] shows that $\Phi$ is real-valued. Finally, $\Psi := \psi_M$ is obtained from $\Phi$ as $\Psi(x) = 2\sum_{k=0}^{T} a_k(-1)^k\,\Phi(2x + k + 1)$; see [30, Equation (4.5)]. Since $a_0, \ldots, a_T \in \mathbb{R}$, this shows that $\psi_M = \Psi$ is real-valued as well.
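As a minimal numerical illustration of such wavelet orthonormal bases, the following sketch (our own; it uses the Haar system, the non-smooth prototype, rather than the regular wavelets of Theorem C.1) builds the discretized inhomogeneous Haar system on $[0,1)$ and verifies its orthonormality. For piecewise-constant Haar functions on a dyadic grid, the discrete check is exact.

```python
import numpy as np

def haar_system(J: int) -> np.ndarray:
    """Rows: father wavelet and psi_{j,m}(x) = 2^{j/2} psi(2^j x - m), sampled
    at N = 2^J points on [0, 1). Haar mother: +1 on [0, 1/2), -1 on [1/2, 1)."""
    N = 2**J
    x = (np.arange(N) + 0.5) / N  # midpoints of the N grid cells
    rows = [np.ones(N)]  # father wavelet (constant scaling function)
    for j in range(J):
        for m in range(2**j):
            t = 2**j * x - m
            rows.append(2 ** (j / 2) * (((0 <= t) & (t < 0.5)).astype(float)
                                        - ((0.5 <= t) & (t < 1)).astype(float)))
    return np.array(rows)

B = haar_system(4)            # 16 basis functions on a 16-point grid
G = (B @ B.T) / B.shape[1]    # Gram matrix of discretized L^2([0,1]) inner products
assert np.allclose(G, np.eye(B.shape[0]))  # the system is orthonormal
```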
Wavelet systems in $\mathbb{R}^d$ can be constructed by taking suitable tensor products of a one-dimensional wavelet system. To describe this, let $\psi_F, \psi_M$ be father/mother wavelets, and let $T_0 := \{F\}^d$ and $T_j := \{F, M\}^d \setminus \{F\}^d$ for $j \in \mathbb{N}$; the tensor-product functions $\Psi_{j,t,m}$ are formed accordingly. Then (see [27, Proposition 1.53]) the system $(\Psi_{j,t,m})_{(j,t,m)\in J}$ is an orthonormal basis of $L^2(\mathbb{R}^d)$.
Finally, we have the following wavelet characterization of the Besov spaces B τ p,q (R d ).
Theorem C.2 (consequence of [27, Theorem 1.64]). Let $d \in \mathbb{N}$, $p, q \in (0,\infty]$ and $\tau \in \mathbb{R}$. For a sequence $c = (c_{j,t,m})_{(j,t,m)\in J} \in \mathbb{C}^J$, define the associated sequence (quasi)-norm $\|c\|_{b^\tau_{p,q}}$. Let $k \in \mathbb{N}$, let $\psi_F, \psi_M$ be as provided by Theorem C.1, and let the $d$-dimensional wavelet system $(\Psi_{j,t,m})_{(j,t,m)\in J}$ be as defined above. Then the synthesis map $\Gamma$ is well-defined (with unconditional convergence of the series in $\mathcal{S}'(\mathbb{R}^d)$) and is an isomorphism of (quasi)-Banach spaces; its inverse map will be denoted accordingly.

We will also use the real-valued Besov space $B^\tau_{p,q}(\mathbb{R}^d;\mathbb{R})$; the spaces $B^\tau_{p,q}(\Omega;\mathbb{R})$ are defined similarly. We will also use the real-valued sequence space $b^\tau_{p,q}(\mathbb{R}^d;\mathbb{R})$.

C.3. Wavelets and Besov spaces on bounded domains
Note that Theorem C.2 only pertains to the Besov spaces B τ p,q (R d ). To describe Besov spaces on domains, we will use the sequence spaces b τ p,q (Ω int ; R) and b τ p,q (Ω ext ; R) that we now define.
Definition C.3. Let $p, q \in (0,\infty]$ and $\tau \in \mathbb{R}$, and let $k \in \mathbb{N}$ be sufficiently large (in particular, $k > \tau$). Define $b^\tau_{p,q}(\Omega^{\mathrm{ext}};\mathbb{R})$ as indicated, and define $b^\tau_{p,q}(\Omega^{\mathrm{int}};\mathbb{R})$ similarly. Both of these spaces are considered as subspaces of $b^\tau_{p,q}(\mathbb{R}^d;\mathbb{R})$; they are thus equipped with the (quasi)-norm $\|\cdot\|_{b^\tau_{p,q}}$.

Remark. Strictly speaking, the spaces $b^\tau_{p,q}(\Omega^{\mathrm{int}};\mathbb{R})$ and $b^\tau_{p,q}(\Omega^{\mathrm{ext}};\mathbb{R})$ depend on the choice of $k \in \mathbb{N}$ and on the precise choice of $\psi_F, \psi_M$. We will, however, suppress this dependence.
The next lemma describes the relation between these sequence spaces and the Besov spaces B τ p,q (Ω; R).
Then there are continuous linear maps with the following properties: • There is $\gamma > 0$ such that $\|T_{\mathrm{int}}\,c\|_{L^2(\Omega)} = \gamma \cdot \|c\|_{\ell^2}$ for all $c \in \ell^2(J) \cap b^\tau_{p,q}(\Omega^{\mathrm{int}};\mathbb{R})$, and $\|T_{\mathrm{int}}\,c\|_{B^\tau_{p,q}(\Omega)} \le \|c\|_{b^\tau_{p,q}}$ for all $c \in b^\tau_{p,q}(\Omega^{\mathrm{int}};\mathbb{R})$.

(C.4)
(ii) There is a $d$-regular partition $\mathcal{P}^{\mathrm{ext}}$ of $J^{\mathrm{ext}}$ and some constant $C_0 > 0$ such that if we define $\iota^{\mathrm{ext}} c := \tilde{c}$, where $\tilde{c} \in \mathbb{R}^J$ is obtained by extending $c \in \mathbb{R}^{J^{\mathrm{ext}}}$ by zero, then $\|\iota^{\mathrm{ext}} c\|_{\ell^2} = \|c\|_{\ell^2}$ for all $c \in \ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}$, and the corresponding (quasi)-norm estimate holds.

Proof. The proof is divided into three steps.
Step 1 (Estimating $|J^{\mathrm{int}}_j|$ and $|J^{\mathrm{ext}}_j|$): We show that there are $j_0 \in \mathbb{N}$ and $a, A > 0$ satisfying the stated two-sided bounds. First of all, we clearly have $J^{\mathrm{int}}_j \subset J^{\mathrm{ext}}_j$ and thus $|J^{\mathrm{int}}_j| \le |J^{\mathrm{ext}}_j|$. Regarding the lower bound, recall that $\Omega \ne \emptyset$ is open, so that there are $x_0 \in \mathbb{R}^d$ and $n \in \mathbb{N}$ satisfying $x_0 + [-r, r]^d \subset \Omega$, where $r := 2^{-n}$. Choose $j_0 \in \mathbb{N}_{\ge n+3}$ such that $2^{j_0-1} r \ge 2R$, and note $2^{j_0-3} r = 2^{j_0-3-n} \in \mathbb{N}$. Let $j \ge j_0$. Choose $m_0 := \lfloor 2^{j-1} x_0 \rfloor \in \mathbb{Z}^d$, with the "floor" operation applied componentwise. We have $\|2^{j-1} x_0 - m_0\|_\infty \le 1$; here, one should observe $2^{j-3} r = 2^{j-j_0}\,2^{j_0-3-n} \in \mathbb{N}$. Because of $R \le 2^{j_0-2} r \le 2^{j-2} r$, the above estimate implies the desired lower bound, for $m \in \mathbb{N}_{\ge 2}$.

Step 2: As shown in Step 1, we have for $m \in \mathbb{N}_{\ge 2}$ that $a\,2^{dm} \le a\,2^{d(j_0+m-1)} \le |I^{\mathrm{int}}_m| \le |I^{\mathrm{ext}}_m| \le A\,2^{d(j_0+m-1)} =: A'\,2^{dm}$, and also $|I^{\mathrm{ext}}_1| \ge |I^{\mathrm{int}}_1| \ge |J^{\mathrm{int}}_{j_0}| \ge a\,2^{dj_0} \ge a\,2^d$. Thus, $a\,2^{dm} \le |I^{\mathrm{int}}_m| \le |I^{\mathrm{ext}}_m| \le A''\,2^{dm}$ for all $m \in \mathbb{N}$, where $A'' := \max\{A', |I^{\mathrm{ext}}_1|\}$. Furthermore, we have $J^{\mathrm{int}} = \bigcup_{m\in\mathbb{N}} I^{\mathrm{int}}_m$ and $J^{\mathrm{ext}} = \bigcup_{m\in\mathbb{N}} I^{\mathrm{ext}}_m$, so that $\mathcal{P}^{\mathrm{int}} := (I^{\mathrm{int}}_m)_{m\in\mathbb{N}}$ and $\mathcal{P}^{\mathrm{ext}} := (I^{\mathrm{ext}}_m)_{m\in\mathbb{N}}$ are $d$-regular partitions of $J^{\mathrm{int}}$ and $J^{\mathrm{ext}}$, respectively. Now, for $J_0 \subset J$ and $c \in \mathbb{R}^{J_0}$, let $\tilde{c} \in \mathbb{R}^J$ be the sequence $c$, extended by zero. We claim that there are $C_1, C_2 > 0$ such that the stated norm equivalence holds, and similarly for $\mathcal{P}^{\mathrm{ext}}$ and $J^{\mathrm{ext}}$ instead of $\mathcal{P}^{\mathrm{int}}$ and $J^{\mathrm{int}}$. For brevity, we only prove the claim for $\mathcal{P}^{\mathrm{int}}$.
Overall, we obtain the claimed estimate, which proves Equation (C.5).
Step 3 (Completing the proof): Step 2 guarantees the existence of $\gamma > 0$ satisfying $\|T_{\mathrm{int}}\,c\|_{L^2(\Omega)} = \gamma\,\|c\|_{\ell^2}$ for all $c \in \ell^{p,q}_{\mathcal{P}^{\mathrm{int}},\alpha}$. Similarly, Step 2 shows that there is $C_0 > 0$ satisfying $\|c\|_{\ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}} \le C_0\,\|\tilde{c}\|_{b^\tau_{p,q}}$ for all $c \in \mathbb{R}^{J^{\mathrm{ext}}}$. Now, given $b \in B\big(0,1; b^\tau_{p,q}(\Omega^{\mathrm{ext}};\mathbb{R})\big)$, note that $b = \widetilde{(b|_{J^{\mathrm{ext}}})}$ and furthermore $\|b|_{J^{\mathrm{ext}}}\|_{\ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}} \le C_0\,\big\|\widetilde{(b|_{J^{\mathrm{ext}}})}\big\|_{b^\tau_{p,q}} \le C_0$, so that $c := C_0^{-1}\cdot b|_{J^{\mathrm{ext}}} \in B\big(0,1;\ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}\big)$ satisfies $b = C_0\,\iota^{\mathrm{ext}} c$. It is clear that $\|\iota^{\mathrm{ext}} c\|_{\ell^2} = \|c\|_{\ell^2}$ for all $c \in \ell^{p,q}_{\mathcal{P}^{\mathrm{ext}},\alpha}$.

D. The phase transition for Sobolev spaces with p ∈ {1, ∞}
In this subsection we provide the missing proof of Theorem 4.2 for the cases p = 1 and p = ∞. We begin with the case p = 1.

D.1. The case p = 1
The proof is crucially based on the following embedding.
Lemma D.1. For arbitrary $k, d \in \mathbb{N}$ and $1 \le p < \infty$, we have $W^{k,p}(\mathbb{R}^d) \hookrightarrow B^k_{p,\infty}(\mathbb{R}^d)$.
Proof. This follows from [1,Section 7.33]. Here, the definition of Besov spaces used in [1] coincides with our definition, as can be seen by combining [28,Theorem in Section 2.5.12] with [17,Proposition 17.21 and Theorem 17.24].
Overall, we see that if we choose $\Phi : S_X \to L^2(\Omega),\ f \mapsto C \cdot f$ and $\Psi : S_Y \to S,\ f \mapsto \kappa \cdot f$, then $\Phi, \Psi$ are well-defined and satisfy all assumptions of Theorem 2.6. This theorem then shows that $s^*_{L^2(\Omega)}(S) = \tfrac{k}{d}$ and that $\mathbb{P} := \mathbb{P}_0 \circ \Psi^{-1}$ is a Borel probability measure on $S$ that is critical for $S$ with respect to $L^2(\Omega)$.
Next, it is well-known (see for instance [29, Example 7.2]) that the relevant embedding holds. Now, for $f \in B^k_{\infty,1}(\Omega)$ and $\varepsilon > 0$, by definition of the norm on $B^k_{\infty,1}(\Omega)$ there is a suitable extension of $f$. Finally, Theorems 4.1 and 4.2 (the latter applied with $p = 2 \in (1,\infty)$) show that $s^*_{L^2(\Omega)}(S_X) = s^*_{L^2(\Omega)}(S_Y) = \tfrac{k}{d}$ and that there exists a Borel probability measure $\mathbb{P}_0$ on $S_Y$ that is critical for $S_Y$ with respect to $L^2(\Omega)$.
Combining these observations, it is not hard to see that all assumptions of Theorem 2.6 are satisfied for suitable choices of the spaces and maps. This theorem thus shows that $s^*_{L^2(\Omega)}(S) = \tfrac{k}{d}$ and that $\mathbb{P} := \mathbb{P}_0 \circ \Psi^{-1}$ is a Borel probability measure on $S$ that is critical for $S$ with respect to $L^2(\Omega)$.
Finally, Theorem 4.2 shows that there exists

E. Measurability of Besov and Sobolev balls
In this subsection, we show for the range of parameters considered in Theorems 4.1 and 4.2 that the balls B 0, R; B τ p,q (Ω) and B 0, R; W k,p (Ω) are measurable subsets of L 2 (Ω). We remark that for the case where p, q ∈ (1, ∞), easier proofs than the ones given here are possible. Yet, since the proofs for the cases where p ∈ {1, ∞} or q ∈ {1, ∞} apply verbatim for a whole range of exponents, we prefer to state and prove the more general results.
We begin with the case of Besov spaces, for which the balls are in fact closed.
As seen above, [2, Theorem 8.5] shows that there is a subsequence $(g_{n_k})_{k\in\mathbb{N}}$ and some $g \in L^{p_0}(\mathbb{R}^d)$ such that $g_{n_k} \to g$ in the weak-$*$ sense in $L^{p_0}(\mathbb{R}^d) = \big(L^{p_0'}(\mathbb{R}^d)\big)'$. In particular, $g_{n_k} \to g$ in $\mathcal{S}'(\mathbb{R}^d)$. By what we showed above, this implies $\|g\|_{B^\tau_{p,q}(\mathbb{R}^d)} \le \liminf_{k\to\infty} \|g_{n_k}\|_{B^\tau_{p,q}(\mathbb{R}^d)} \le R$. Finally, we have for any $\varphi \in C_c^\infty(\Omega)$ that $\langle g, \varphi\rangle = \lim_{k\to\infty}\langle g_{n_k}, \varphi\rangle = \lim_{k\to\infty}\langle f_{n_k}, \varphi\rangle = \langle f, \varphi\rangle$, since $f_{n_k} = g_{n_k}|_\Omega$ and $f_{n_k} \to f$ in $L^2(\Omega)$. Overall, we thus see that $f = g|_\Omega \in B^\tau_{p,q}(\Omega)$.

For the Sobolev spaces $W^{k,p}(\Omega)$ with $p = 1$, the set $B\big(0, R; W^{k,1}(\Omega)\big)$ is not closed in $L^2(\Omega)$. In order to show that this ball is nonetheless Borel measurable, we begin with the following result on $\mathbb{R}^d$.
Proof. Let $\varphi \in C_c^\infty(\mathbb{R}^d)$ with $\varphi \ge 0$ and $\int_{\mathbb{R}^d}\varphi(x)\,dx = 1$, and define $\varphi_n(x) := n^d \cdot \varphi(nx)$. The required convergence properties of the mollification $\varphi_n * f$ follow from [2, Section 4.13].

Step 1: Define $S := L^2(\mathbb{R}^d) \cap W^{k,p}(\mathbb{R}^d)$. In this step, we show the claimed identity. For "⊂", note that if $f \in S$, then the stated convergences follow from the definition of the weak derivative. Since [2, Theorem 4.15] shows that $\varphi_n * f \to f$ with convergence in $L^2$, we get $f = g_0 \in L^p(\mathbb{R}^d)$. Furthermore, as seen above, we have $\varphi_n * f \in C^\infty(\mathbb{R}^d)$ with $\partial^\alpha(\varphi_n * f) = (\partial^\alpha \varphi_n) * f$. Therefore, we see for arbitrary $\psi \in C_c^\infty(\mathbb{R}^d)$ and $\alpha \in \mathbb{N}_0^d$ with $|\alpha| \le k$ that $g_\alpha$ is the $\alpha$-th weak derivative of $f$; that is, $\partial^\alpha f = g_\alpha \in L^p(\mathbb{R}^d)$. Since this holds for all $|\alpha| \le k$, we see that $f \in W^{k,p}(\mathbb{R}^d)$ and thus $f \in S$.
Step 2: For $n, m, M \in \mathbb{N}$, define $\Gamma_{n,m,M}$ as indicated. Since $p \le 2$, it is easy to see that $\Gamma_{n,m,M}$ is well-defined and continuous.

We can now prove a similar result on bounded domains. For the convenience of the reader, we recall that $\|f\|_{W^{k,p}} = \max_{|\alpha|\le k} \|\partial^\alpha f\|_{L^p}$; see Equation (4.2).
Let $\Omega \subset \mathbb{R}^d$ be open and bounded. In case of $p = 1$, assume additionally that $\Omega$ is a Lipschitz domain.
For "⊂", we see as in Step 3 that γ(f ) ≤ R if f ∈ S. Furthermore, by the properties of the extension operator E, we also have f ∈ Θ if f ∈ S. For "⊃", let f ∈ Θ satisfy γ(f ) ≤ R. Since f ∈ Θ, we have f = (Ef )| Ω ∈ W k,1 (Ω). One can then argue as at the end of Step 3 (using [2,Corollary 6.13]) to see that f ∈ B(0, R; W k,1 (Ω)) and thus f ∈ S.

F. Proof of the lower bounds for neural network approximation
We begin by explaining the connection between rate distortion theory and approximation by neural networks. This is based on the observation from [4, 23] that one can use the existence of approximating networks to construct a codec for a function class. This in turn relies on the fact that neural networks can be encoded as bit strings, as described in the following result.
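To make the bit-string encoding concrete, here is a small self-contained sketch (our own illustration with a hypothetical serialization format; the actual encoding used in [4, 23] differs in its details): a network with $W$ nonzero weights, each quantized to $b$ bits, is serialized by storing for each nonzero weight its position and its quantized value, costing on the order of $W \cdot (\lceil\log_2 N\rceil + b)$ bits for $N$ potential weight positions.

```python
from math import ceil, log2

def encode_sparse_weights(weights, n_params, b):
    """Serialize the nonzero entries of a length-n_params weight vector into a
    bit string: for each nonzero weight, its index (ceil(log2 n_params) bits)
    and its value uniformly quantized on [-1, 1] with b bits."""
    idx_bits = ceil(log2(n_params))
    levels = 2**b
    bits = ""
    for i, w in enumerate(weights):
        if w != 0.0:
            q = min(int((w + 1.0) / 2.0 * levels), levels - 1)
            bits += format(i, f"0{idx_bits}b") + format(q, f"0{b}b")
    return bits

def decode_sparse_weights(bits, n_params, b):
    """Invert encode_sparse_weights up to the quantization error 2 / 2**b."""
    idx_bits = ceil(log2(n_params))
    step = idx_bits + b
    weights = [0.0] * n_params
    for k in range(0, len(bits), step):
        i = int(bits[k:k + idx_bits], 2)
        q = int(bits[k + idx_bits:k + step], 2)
        weights[i] = (q + 0.5) / 2**b * 2.0 - 1.0  # midpoint of the cell
    return weights

w = [0.0] * 64
w[3], w[17], w[42] = 0.5, -0.25, 0.99
bits = encode_sparse_weights(w, 64, b=8)
assert len(bits) == 3 * (6 + 8)  # W * (ceil(log2 N) + b) bits
w_hat = decode_sparse_weights(bits, 64, b=8)
assert max(abs(a - c) for a, c in zip(w, w_hat)) <= 2.0 / 2**8
```

Counting bits in such a scheme is exactly what turns a family of approximating networks into a codec, and hence makes the rate distortion lower bounds applicable.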
The precise connection to rate distortion theory is established by the following lemma.
Part 2: For $\ell, \sigma \in \mathbb{N}$, define $A_{\ell,\sigma} := \{f \in S : \exists\,C > 0\ \forall\,\varepsilon \in (0,1) : \ldots\}$. It is not hard to see that $A^*_{NN,\ell} \subset \bigcup_{\sigma\in\mathbb{N}}\bigcup_{\ell\in\mathbb{N}} A_{\ell,\sigma}$, so that it suffices to show $\mathbb{P}^*(A_{\ell,\sigma}) = 0$ for all $\sigma, \ell \in \mathbb{N}$. To see this, let $\mathcal{C} \in \mathrm{Codecs}_{L^2(\Omega),L^2(\Omega)}$ be as in Lemma F.2, and note with the notation of that lemma that $A_{\ell,\sigma} = S \cap A'$ for a suitable set $A'$ for which Theorem 1.3 shows $\mathbb{P}^*(A') = 0$.

Given a fixed coding scheme $\mathcal{C}$ and $s > s^*_X(S)$, one can find a single signal $x \in S$ on which the codec $\mathcal{C}$ does not attain the rate $s$. The following proposition shows that an even slightly stronger statement holds: Given $\mathcal{C}$, one can find a "badly encodable" $x = x(\mathcal{C}) \in S$ such that $x$ is not encoded at any rate $s > s^*_X(S)$ by the codec $\mathcal{C}$.
Proposition G.1. Let $X$ be a Banach space and let $\emptyset \ne S \subset X$. Assume that either 1. $S$ is closed, bounded, and convex; or 2. $S = \{x \in X : \|x\|_* \le r\}$ for some $r \in (0,\infty)$ and a map $\|\cdot\|_* : X \to [0,\infty]$ with the following properties: a) $\|\cdot\|_*$ is a quasi-norm; that is, there exists $\kappa \ge 1$ such that $\|\alpha x\|_* = |\alpha| \cdot \|x\|_*$ and $\|x + y\|_* \le \kappa \cdot (\|x\|_* + \|y\|_*)$ for all $\alpha \in \mathbb{R}$ and $x, y \in X$; b) there is $C \ge 1$ satisfying $\|x\|_X \le C \cdot \|x\|_*$ for all $x \in X$; c) $S \subset X$ is closed; d) $\|\cdot\|_*$ is continuous "with respect to itself". Set $s^* := s^*_X(S)$. Then, for each codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,X}$ there is some $x = x(\mathcal{C}) \in S$ such that for each $\ell \in \mathbb{N}$, we have $\|x - D_R(E_R(x))\|_X \ge R^{-(s^* + \ell^{-1})}$ for infinitely many $R \in \mathbb{N}$.
Remark. 1) In particular, we see for each $s > s^*$ (by choosing $\ell \in \mathbb{N}$ such that $s^* + \ell^{-1} < s$) that $x$ is not encoded at rate $s$ by $\mathcal{C}$. Therefore, $x \in S \setminus \bigcup_{s > s^*} A^s_{S,X}(\mathcal{C})$.
2) The assumptions on the quasi-norm $\|\cdot\|_*$ might appear quite technical, but they are usually satisfied. Indeed, the condition $\|x\|_X \le C \cdot \|x\|_*$ is equivalent to $S \subset X$ being bounded, which is necessary for having $s^*_X(S) > 0$. Next, most naturally appearing quasi-norms are $q$-norms for some $q \in (0,1]$, meaning that $\|x + y\|_*^q \le \|x\|_*^q + \|y\|_*^q$. In this case, it is not hard to see $\big|\|x\|_*^q - \|y\|_*^q\big| \le \|x - y\|_*^q$, which implies that $\|\cdot\|_*$ is "continuous with respect to itself". Finally, most natural quasi-norms satisfy a Fatou property, in the sense that if $x_n \to x$ in $X$, then $\|x\|_* \le \liminf_{n\to\infty}\|x_n\|_*$. If this is the case, then $S \subset X$ is closed.

Proof.
Step 1 (Setup for applying the Baire category theorem): Let us assume towards a contradiction that the claim does not hold. Define $M_R := \mathrm{range}(D_R) \subset X$. Then for each $x \in S$ there exist $n_x, \ell_x \in \mathbb{N}$ satisfying $\|x - D_R(E_R(x))\|_X < R^{-(s^* + \ell_x^{-1})}$ for all $R \ge n_x$. Thus, it is not hard to see that $\|x - D_R(E_R(x))\|_X \le N_x \cdot R^{-(s^* + \ell_x^{-1})}$ for all $R \in \mathbb{N}$, where we defined $N_x := 1 + \max\big\{k^{s^* + \ell_x^{-1}} \cdot \|x - D_k(E_k(x))\|_X : 1 \le k \le n_x\big\}$. Thus, if we define $G_{N,\ell} := \big\{x \in S : \forall\,R \in \mathbb{N} : \mathrm{dist}_X(x, M_R) \le N \cdot R^{-(s^* + \ell^{-1})}\big\}$ for $N, \ell \in \mathbb{N}$, then $S = \bigcup_{N,\ell\in\mathbb{N}} G_{N,\ell}$. Furthermore, since $\mathrm{dist}_X(\cdot, M_R)$ is continuous, it is not hard to see that each set $G_{N,\ell} \subset S$ is closed. Finally, $S \subset X$ is a closed set, and hence a complete metric space (equipped with the metric induced by $\|\cdot\|_X$). Thus, the Baire category theorem ([13, Theorem 5.9]) shows that there are certain $N, \ell \in \mathbb{N}$ such that the relative interior $G^\circ_{N,\ell}$ of $G_{N,\ell}$ in $S$ satisfies $G^\circ_{N,\ell} \ne \emptyset$.
Step 2 (Proving that there are $x_0 \in X$ and $t > 0$ satisfying $x_0 + tS \subset G_{N,\ell}$): We distinguish the two cases regarding the assumptions on $S$.
Step 3 (Completing the proof): For each $x \in S$, we have $x_0 + tx \in G_{N,\ell}$, and therefore $\mathrm{dist}_X(x_0 + tx, M_R) \le N \cdot R^{-(s^* + \ell^{-1})}$ for all $R \in \mathbb{N}$. Because of $M_R = \mathrm{range}(D_R)$, this implies that there is $c_{x,R} \in \{0,1\}^R$ satisfying $\|(x_0 + tx) - D_R(c_{x,R})\|_X \le N \cdot R^{-(s^* + \ell^{-1})}$. Now, we define a new codec $\widetilde{\mathcal{C}} = (\widetilde{E}_R, \widetilde{D}_R)_{R\in\mathbb{N}}$ by $\widetilde{E}_R(x) := c_{x,R}$ and $\widetilde{D}_R(c) := t^{-1}\big(D_R(c) - x_0\big)$. For arbitrary $x \in S$, we then see $\|x - \widetilde{D}_R(\widetilde{E}_R(x))\|_X = t^{-1}\,\|(x_0 + tx) - D_R(c_{x,R})\|_X \le \tfrac{N}{t} \cdot R^{-(s^* + \ell^{-1})}$ for all $R \in \mathbb{N}$. By definition of the optimal exponent, this implies $s^* = s^*_X(S) \ge s^* + \ell^{-1}$, which is the desired contradiction.
As the second result in this appendix, we show that the preceding property does not hold for general compact sets $S \subset X$, even if $X = H$ is a Hilbert space. In other words, some additional regularity assumption, such as convexity, is necessary to ensure the property stated in Proposition G.1.
We claim that $s^*_H(S) = s$, but that there is a codec $\mathcal{C} = (E_R, D_R)_{R\in\mathbb{N}} \in \mathrm{Codecs}_{S,H}$ such that $A^\sigma_{S,H}(\mathcal{C}) = S$ for every $\sigma > 0$; that is, every element $x \in S$ is compressed by $\mathcal{C}$ at arbitrary rate $\sigma > 0$.
Proof of Lemma 3.9. By definition of the product σ-algebra, each of the finite-dimensional projections $\pi_m : \mathbb{R}^I \to \mathbb{R}^{I_m},\ x \mapsto x_m$ is measurable. Since $\|\cdot\|_{\ell^p(I_m)}$ is continuous on $\mathbb{R}^{I_m}$ and hence Borel measurable, the map $q_m : \mathbb{R}^I \to [0,\infty),\ x \mapsto 2^{\alpha m}\,m^\theta\,\|x_m\|_{\ell^p(I_m)}$ is $\mathcal{B}_I$-measurable for each $m \in \mathbb{N}$.

In case of $q < \infty$, this implies that the map $x \mapsto \sum_{m\in\mathbb{N}} q_m(x)^q$ is $\mathcal{B}_I$-measurable as a countable series of measurable, non-negative functions, and hence so is $x \mapsto \|x\|^{p,q}_{\mathcal{P},\alpha,\theta}$. If $q = \infty$, the (quasi)-norm $\|\cdot\|^{p,\infty}_{\mathcal{P},\alpha,\theta} = \sup_{m\in\mathbb{N}} q_m$ is $\mathcal{B}_I$-measurable as a countable supremum of $\mathcal{B}_I$-measurable, non-negative functions.

For proving the final claim, let us write $\mathcal{T} := \ell^2(I) \cap \mathcal{B}_I$ for brevity. By the first part of the lemma, $\|\cdot\|_{\ell^2} = \|\cdot\|^{2,2}_{\mathcal{P},0,0} : \mathbb{R}^I \to [0,\infty]$ is $\mathcal{B}_I$-measurable. Furthermore, for arbitrary $x \in \mathbb{R}^I$, the translation $\mathbb{R}^I \to \mathbb{R}^I,\ y \mapsto y + x$ is $\mathcal{B}_I$-measurable. These two observations imply that the norm $\|\cdot\|_{\ell^2} : \ell^2(I) \to [0,\infty)$ and the translation operator $\ell^2(I) \to \ell^2(I),\ y \mapsto y + x$ are $\mathcal{T}$-measurable for any $x \in \ell^2(I)$. This implies that $B_r(x) = \{y \in \ell^2(I) : \|y + (-x)\|_{\ell^2} < r\}$ is $\mathcal{T}$-measurable. But $\ell^2(I)$ is separable, so that every open set is a countable union of open balls; therefore, it follows that $\mathcal{B}_2 \subset \mathcal{T}$. Conversely, $\mathcal{T}$ is generated by sets of the form $\{x \in \ell^2(I) : p_i(x) \in M\}$, where $M \subset \mathbb{R}$ is a Borel set and $p_i : \mathbb{R}^I \to \mathbb{R},\ (x_j)_{j\in I} \mapsto x_i$. Since $p_i|_{\ell^2(I)} : \ell^2(I) \to \mathbb{R}$ is continuous with respect to $\|\cdot\|_{\ell^2(I)}$, we see that each generating set of $\mathcal{T}$ also belongs to $\mathcal{B}_2$, which completes the proof.