Efficient Approximation of Solutions of Parametric Linear Transport Equations by ReLU DNNs

We demonstrate that deep neural networks with the ReLU activation function can efficiently approximate the solutions of various types of parametric linear transport equations. For non-smooth initial conditions, the solutions of these PDEs are high-dimensional and non-smooth. Therefore, approximation of these functions suffers from a curse of dimension. We demonstrate that through their inherent compositionality deep neural networks can resolve the characteristic flow underlying the transport equations and thereby allow approximation rates independent of the parameter dimension.


Introduction
Linear parametric transport equations play an essential role in engineering, modelling, and mathematical physics where they describe physical phenomena of heat and mass transfer. A typical example is the transport of pollution in air or water depending on a set of parameters such as the direction and intensity of the flow of the fluid.
In this work, we study to what extent the solutions of various types of parametric linear transport equations can be efficiently represented by deep neural networks. Concretely, we study variations of the following problem: Let $n, D, k \in \mathbb{N}$ and $T > 0$. Let $V \in C^k([0,T] \times \mathbb{R}^n \times [0,1]^D; \mathbb{R}^n)$, $f \in C^k([0,T] \times \mathbb{R}^n \times [0,1]^D)$, and let $u_0 \in C^1(\mathbb{R}^n)$. We want to find $u \in C^1([0,T] \times \mathbb{R}^n \times [0,1]^D)$ such that

$$\partial_t u(t,x,\eta) + V(t,x,\eta) \cdot \nabla_x u(t,x,\eta) = f(t,x,\eta), \qquad u(0,x,\eta) = u_0(x). \tag{1.1}$$

The PDE of (1.1) has been studied extensively, see, e.g., [1,2,9,22], and we will recall the fundamentals in Section 3. The set-up that we have in mind is one where the dimension $D$ of the parameter space is very high, $V$ is smooth, and $u_0$ is not very regular. Hence, direct approximation of $u$ amounts to approximating a high-dimensional function of low regularity. In this formulation, the task is extremely challenging for classical methods.
While the global approximation problem is almost intractable, the method of characteristics shows that, even though $u$ is not smooth, its singularities evolve along smooth curves, called characteristic curves. In this framework, the function $u$ can typically be written as a composition of two functions, where one is high-dimensional and smooth and the other is low-dimensional and (potentially) rough.
Based on this split and the inherent compositionality of neural networks, we will demonstrate that every u satisfying (1.1) can be approximated efficiently by neural networks with ReLU activation function. The approximation rate is significantly better than that of any classical regularity-based approximation of u. In particular, in the prescribed set-up, we will observe an approximation rate independent of the dimension D of the parameter space.
The material presented below was first established in a mini-project of the first author at the University of Oxford, [35].

Applications and relevance
We believe that the efficient approximation of solutions of parametric transport equations with dimension-independent approximation rates is an interesting and relevant problem in the following domains:
• Approximation theory: The class of functions that are solutions of high-dimensional parametric linear transport equations is a relevant but non-standard function class. This class, while high-dimensional, has a non-trivial but rigid structure imposed by the underlying PDE. It is, therefore, interesting to establish to what extent deep neural networks can leverage this structure. In this context, similar approximation schemes based on structured systems were developed for special types of (parametric) transport equations. For example, in [14,15] and [42] systems were introduced that approximate the solutions of linear transport equations with linear or $C^2$-regular characteristic curves. These constructions are closely tied to the type of characteristic curves. We demonstrate here that, in contrast to these systems, approximation by deep neural networks is automatically adapted to the underlying regularity of the problems and benefits from higher regularity of the characteristic curves.
• Estimation: In machine learning and especially in deep learning, deep neural networks are trained with gradient-based optimisation algorithms to minimise empirical energies based on random samples, [24,36]. These techniques have proven to be extremely successful in a variety of applications.
Consider a parametric transport problem of the form (1.1) where $u_0$, $f$, and $V$ are unknown, but samples $(u(t_i, x_i, \eta_i))_{i=1}^N$ are available through measurements. Such a scenario could be encountered in the transport of pollution in fluids under unknown circumstances, but with a control on parameters of the experiment. In this formulation, the transport problem is a standard supervised learning problem. Moreover, classical methods to solve linear transport equations cannot be used at all without knowledge of $f$ and $V$ in (1.1).
Certainly, this estimation problem can only be successfully solved with deep learning techniques if the correct solution to the problem can be represented or closely approximated by a deep neural network. In this context, our results show the feasibility of this approach.
• Numerical analysis: Deep neural networks have been employed as Ansatz spaces for PDEs in multiple settings before [53,55,4]. Of course, the efficiency of these methods depends on the capacity of the Ansatz space to capture the true solution.
Our proposed approximation of solutions of (1.1) by deep neural networks can be thought of as a higher-order method that automatically adapts to the regularity of the underlying characteristic curves.
An established method to solve (1.1) is by using (Petrov-)Galerkin-type discretisations, [13,18]. These methods are, however, typically not adaptive to singularities lying on lower-dimensional manifolds. In the model of (1.1), such structured singularities evolve precisely along the characteristic curves. While some advances to handle structured singularities have been made, e.g. [14], the adaptivity to the manifold only uses low-order information on the smoothness of the manifold. Our results demonstrate that an approximation via deep neural networks adapts to any regularity of the characteristic curves in the sense that the approximation quality improves for smoother characteristic curves.
Standard discretisation and time-stepping techniques such as finite differences and Euler, Crank-Nicolson, or higher-order variants converge with rates depending on the global regularity of the solution of the PDE which, in our set-up, is assumed to be quite low.
• Reduced-Order Models: In applications where the solution of (1.1) is requested for many different parameter values, it is desirable to employ model reduction techniques [29,48]. It is well known that parametric linear transport problems are highly challenging for linear reduced-order models because the dimension of linear approximation spaces to capture the non-linear evolution of singularities can be excessive, [43, Section 5], [16, Section 6.3]. Indeed, linear reduced-order models typically succeed only if additional assumptions are made on the parameter dependence, such as a certain separability of the parametric dependence and the spatial dependence of $V$, [25].
The neural network based approximation presented in this work requires almost no structural assumption on the parametric dependence. Indeed, a smooth dependence of $V$ and $f$ on the parameters is sufficient. In this context, a superiority of deep neural network-based approaches over linear reduced-order models to solve parametric transport equations was also empirically observed in [21].

Related work
This work describes the capacity of neural networks to approximate high-dimensional functions with asymptotic rates independent of the underlying dimension. Of course, approximation theory of deep neural networks is a well-established field. Therefore, in order to place our contribution in the context of existing literature, we provide an overview of classical and more modern developments in the field below.

Classical approximation
The first and probably most prominent result describing the approximation capabilities of neural networks is the universal approximation theorem, [12,31]. This theorem states that, on a compact domain, every continuous function can be arbitrarily well approximated by a neural network in the uniform norm. These statements, however, do not provide an estimate on the required sizes of the approximating neural networks. The typical approach to obtain a quantitative estimate on the order of approximation is to re-approximate classical methods. For example, in [38] and [37], it was shown that neural networks yield the same approximation rates as splines when approximating smooth functions. Recently, approximation by neural networks with the ReLU activation function has received the most attention since this activation function is arguably the most widely used in applications. It was demonstrated that deep neural networks with the ReLU activation function achieve the same approximation rates as linear and higher-order finite elements [28,44], wavelets [52], and local approximation by Taylor polynomials, [56].
These classical approximation results show that deep neural networks are very versatile by combining the approximation capabilities of a wide variety of classical tools. However, they do not identify a particular situation where deep neural networks outperform the best classical method. This picture changes drastically when one considers high-dimensional approximation.

High-dimensional approximation
High-dimensional approximation generally suffers from a curse of dimension, meaning that approximation rates deteriorate exponentially with increasing dimension, [6,41]. Nonetheless, if an additional structure is assumed, then the curse of dimension can be overcome. It turns out that deep neural networks can take advantage of a wide variety of complex additional structural properties. For example, it was shown in [3] that deep neural networks can approximate functions with bounded first Fourier moments without a curse of dimension. Other regularity-based assumptions were used in [39] and [54]. Further classes of functions with structural assumptions such as functions based on directed acyclic graphs [47] or functions admitting strong invariances [46, Section 5] allow similar results. Finally, if the approximation error is evaluated on a low-dimensional manifold only, then [52,49,10,8] show approximation rates independent of the ambient dimension.

Approximation of solutions of PDEs
The extraordinary efficiency in the approximation of certain high-dimensional functions has been especially interesting in connection with the numerical solution of PDEs [53,55,4]. For example, deep neural networks can efficiently approximate the solutions of high-dimensional Black-Scholes, Kolmogorov, or heat equations in a regime where any mesh-based method would fail, [19,32,5,7]. In these works, a compositional structure of the solution of a PDE is derived via the Feynman-Kac formula. The approach via the method of characteristics of our work can be interpreted as a special case of the approach via the Feynman-Kac formula.
Moreover, in the framework of parametric problems, high-dimensional problems can be efficiently represented if there exist suitable representations thereof in a general reduced basis, [34], or as a polynomial chaos expansion, [50,45].

Outline
In Section 2, we introduce all notions and fundamental results associated with neural networks. Section 3 is devoted to the introduction of various types of linear transport equations. In Section 4, we present the four main results of this work: Theorems 4.3, 4.4, 4.7, and 4.9. These results describe approximation rate bounds for the solutions of the equations of Section 3 by deep neural networks. Finally, in Section 5, we discuss natural extensions of the presented results. Some auxiliary results have been deferred to the appendix.

Notation
Below we collect some notation that is used throughout the manuscript. This notation is mostly standard and hence this section can be skipped and only be referred to when a symbol is unclear.
We denote by $\mathbb{N} = \{1, 2, \dots\}$ the set of all natural numbers and define, for $k \in \mathbb{N}$, the set $\mathbb{N}_{\geq k} := \{n \in \mathbb{N} : n \geq k\}$. For $d_1, d_2 \in \mathbb{N}$ we denote by $\mathrm{Id}_{\mathbb{R}^{d_1}}$ the identity on $\mathbb{R}^{d_1}$ and by $0_{\mathbb{R}^{d_1 \times d_2}}$ we denote the map from $\mathbb{R}^{d_1}$ to $\mathbb{R}^{d_2}$ that vanishes everywhere. We denote by $0_{\mathbb{R}^{d_1}}$ the zero vector in $\mathbb{R}^{d_1}$. On $\mathbb{R}^{d_1 \times d_2}$ we denote by $\|\cdot\|$ the Euclidean norm and by $\|\cdot\|_\infty$ the maximum norm. The number of nonzero entries of a matrix or vector $A \in \mathbb{R}^{d_1 \times d_2}$ is counted by $\|\cdot\|_0$, where $\|A\|_0 := |\{(i,j) : A_{i,j} \neq 0\}|$. If $d_1, d_2, d_3 \in \mathbb{N}$, and $A \in \mathbb{R}^{d_1 \times d_2}$, $B \in \mathbb{R}^{d_1 \times d_3}$, then we use the block matrix notation and write for the horizontal concatenation of $A$ and $B$ where the second notation is used if a stronger delineation between different blocks is appropriate. A similar notation is used for the vertical concatenation of $A \in \mathbb{R}^{d_1 \times d_2}$ and $B \in \mathbb{R}^{d_3 \times d_2}$.
For $d_1, d_2 \in \mathbb{N}$, and $\Omega \subset \mathbb{R}^{d_1}$, we denote by $L^p(\Omega, \mathbb{R}^{d_2})$, $p \in [1, \infty]$, the $\mathbb{R}^{d_2}$-valued Lebesgue spaces, where we set $L^p(\Omega) := L^p(\Omega, \mathbb{R})$. For $k \in \mathbb{N}$, we denote by $W^{k,\infty}(\Omega)$ the space of $k$-times weakly differentiable functions that have all derivatives of order at most $k$ in $L^\infty(\Omega)$. The space $W^{k,\infty}_{\mathrm{loc}}(\Omega)$ consists of functions such that their restriction to every compact $K \subset \Omega$ is in $W^{k,\infty}(K)$. By $C^k(\Omega, \mathbb{R}^{d_2})$, we denote the set of $k$-times continuously differentiable functions mapping from $\Omega$ to $\mathbb{R}^{d_2}$, where we set $C^k(\Omega) := C^k(\Omega, \mathbb{R})$. By $C^k_c(\Omega)$ we denote all functions in $C^k(\Omega)$ that have compact support.

Neural Networks
In this section, we define neural networks and then recall a couple of operations on these objects that will be used frequently in the sequel. In the definition of neural networks, we distinguish between a neural network as a set of weights and an associated function that we call the realisation of the neural network. This formal approach was introduced in [46], but we recall here a slightly different formulation of [26] for neural networks that allow so-called skip connections.
, $\varrho(x) = \max\{0, x\}$ and let $\Phi$ be a NN as above. Then we define the associated

where $x_L$ results from the following scheme:

where $A_{\ell,k}$ is an $N_\ell \times N_k$ matrix for $k = 0, \dots, \ell - 1$ and $\ell = 1, \dots, L$. Then

is called the number of weights of $\Phi$. Moreover, we refer to $N_L$ as output dimension of $\Phi$.
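To make the notion of a realisation with skip connections concrete, the following minimal sketch evaluates such a network in plain Python. It rests on assumptions that are ours and not taken from the definition above: every layer acts on the concatenation of the input and all previous layer outputs (which is how we read the block structure $A_{\ell,k}$, $k = 0, \dots, \ell - 1$), the ReLU is applied in every layer except the last, and the number of weights counts the nonzero entries of all matrices and bias vectors. The names `realisation` and `num_weights` are hypothetical.

```python
def relu(v):
    """Componentwise ReLU, rho(x) = max{0, x}."""
    return [max(0.0, x) for x in v]


def matvec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def realisation(layers, x):
    """Evaluate a skip-connection network given as a list of (A, b) pairs.

    Assumption: layer l sees the concatenation of x_0, ..., x_{l-1};
    the last layer is affine (no activation).
    """
    outputs = [list(x)]
    depth = len(layers)
    for ell, (A, b) in enumerate(layers, start=1):
        stacked = [v for out in outputs for v in out]
        z = [s + t for s, t in zip(matvec(A, stacked), b)]
        outputs.append(z if ell == depth else relu(z))
    return outputs[-1]


def num_weights(layers):
    """Count the nonzero entries of all matrices and bias vectors."""
    total = 0
    for A, b in layers:
        total += sum(1 for row in A for a in row if a != 0)
        total += sum(1 for v in b if v != 0)
    return total


# |x| = x + 2 * rho(-x): the skip connection passes x directly to the output layer.
abs_net = [
    ([[-1.0]], [0.0]),       # x_1 = rho(-x_0)
    ([[1.0, 2.0]], [0.0]),   # x_2 = x_0 + 2 * x_1
]
```

In this sketch the absolute value is realised with three nonzero weights because the input is forwarded to the output layer through the skip connection.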

Standard operations on neural networks
We collect four standard operations that can be performed with NNs below. First, we can concatenate two NNs $\Phi_1$, $\Phi_2$ in such a way that the realisation of the concatenation is a composition of the individual realisations.

An additional operation that is frequently applied to NNs in the sequel is that of parallelisation. This procedure puts NNs in parallel such that the output of the realisation is a vector containing the outputs of the original NNs.

Proposition 2.3 ([26, Remark 2.10]). Let $n, d \in \mathbb{N}$ and, for $i = 1, \dots, n$, let $\Phi_i$ be a NN with $d$-dimensional input and $L_i \in \mathbb{N}$ layers. Then there exists a NN $P(\Phi_1, \dots, \Phi_n)$ with $d$-dimensional input such that

We will occasionally need to construct NNs the realisation of which is the sum of functions that we had approximated beforehand by realisations of NNs. In this situation, the following operation that emulates a sum of NNs is convenient.
Proof. Let

Then we set
A L := 1 1 A L and b L : Per construction,

Finally, we can construct a NN that represents the multiplication of two NNs $\Phi_1$ and $\Phi_2$ in the sense that its realisation is close to the multiplication of the realisations of $\Phi_1$ and $\Phi_2$. In contrast to the previous operations, this emulation of the multiplication is not exact but requires a parameter $\varepsilon > 0$ describing how accurately the multiplication is implemented.
The result now follows from Propositions 2.2 and 2.3.
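The following sketch illustrates why multiplication can only be emulated approximately by ReLU networks, and why the cost grows only logarithmically in the accuracy. It uses a well-known construction from the ReLU approximation literature (iterated hat functions for the square, combined with the polarisation identity); this is an illustration under our own assumptions and not necessarily the construction behind the proposition above. The function names and the restriction to inputs in $[0, 1]$ are ours.

```python
def relu(x):
    return max(0.0, x)


def hat(x):
    """Hat function on [0, 1], written with three ReLUs."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)


def approx_square(x, m):
    """Piecewise linear interpolant of x^2 on [0, 1] with 2^m + 1 nodes.

    Built from m compositions of the hat function; the error is at most
    2^(-2m - 2), so the depth grows like log(1 / accuracy).
    """
    result = x
    g = x
    for s in range(1, m + 1):
        g = hat(g)
        result -= g / 4.0 ** s
    return result


def approx_product(x, y, m):
    """Approximate x * y for x, y in [0, 1] via x*y = a^2 - b^2,
    where a = (x + y) / 2 and b = |x - y| / 2."""
    a = (x + y) / 2.0
    diff = (x - y) / 2.0
    b = relu(diff) + relu(-diff)  # absolute value written with ReLUs
    return approx_square(a, m) - approx_square(b, m)
```

Since both squares are approximated up to $2^{-2m-2}$, the product is approximated up to at most $2^{-2m-1}$; each additional composition of the hat function reduces the error by a factor of four.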

Approximation of smooth functions
In addition to the operations on NNs described in the previous section, we will frequently invoke the following standard approximation result for smooth functions by realisations of NNs.
instead of the unit ball $F_{k,d}$, then the constant $c$ from Theorem 2.6 also depends on $R$. However, the asymptotic behaviour with respect to $\varepsilon$ remains unchanged. The same change holds if we consider the space

Linear Transport Equations
In this section, we introduce the Cauchy problem for the parametric linear transport equation, state the most important existence results for several types of linear transport equations and provide expressions for their solutions. Here, we mainly follow [22]. An English translation of this source can be found in the lecture notes [23]. For more information on linear transport equations see also [1,2,9].
Definition 3.1. The Cauchy problem of the parametric linear transport equation is given by

It is well known that linear transport equations can be solved via the method of characteristics [11,20,33]. The idea of this method is to consider characteristic curves that are defined so that the solution $u$ of (3.1) is constant along these curves. Then the solution at a point $(t, x, \eta)$ equals the initial data evaluated at the origin of this curve.

Definition 3.2. The characteristic curve of the transport operator

where $\gamma$ is the solution of the characteristic system of ordinary differential equations

Let us briefly show why the solution $u$ of (3.1) does not change along characteristic curves. Considering the case where $V(t, x, \eta) \equiv v$ for a $v \in \mathbb{R}^n$ and dropping the $\eta$-dependency, we have

where the last equality is due to (3.1).
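The computation behind this observation can be sketched as follows. We assume here, for illustration only, that (3.1) is the homogeneous equation $\partial_t u + v \cdot \nabla_x u = 0$ and that the characteristic curve starting at $x$ is the straight line $\gamma(s) = x + s v$; both the notation and the normalisation of $\gamma$ are our assumptions.

```latex
\frac{d}{ds}\, u\bigl(s, \gamma(s)\bigr)
  = \partial_t u\bigl(s, \gamma(s)\bigr)
    + \dot{\gamma}(s) \cdot \nabla_x u\bigl(s, \gamma(s)\bigr)
  = \partial_t u\bigl(s, \gamma(s)\bigr)
    + v \cdot \nabla_x u\bigl(s, \gamma(s)\bigr)
  = 0 .
```

Under these assumptions $u(t, x + t v) = u(0, x) = u_0(x)$, that is, $u(t, x) = u_0(x - t v)$: the initial profile is shifted along straight lines without changing its shape.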
To make the method of characteristics work, we have to ensure that the characteristic curves are diffeomorphisms and that there exists a global solution of system (3.2). Therefore, we make the following assumptions on the vector field V : Let n, D ∈ N, and T > 0.
The following theorem states that these assumptions lead to global existence of the characteristic curves and characterises their regularity.
Proof. The proof presented in [23] can directly be extended to the parametric case. Moreover, the differentiability with respect to η is a standard result, compare [27,Corollary 4.1].
For our main result, we want to approximate the map $X$ with a NN. To quantify the complexity of this NN, we need a bound on the $C^k$-norm of $X$ in terms of the given data. We establish this bound in the next proposition.
Proof. The proof is presented in Appendix A.

Solutions of the standard linear transport equation
The next theorem states the existence of a solution for the linear transport equation of (3.1). Furthermore, the theorem establishes that the solution has a compositional structure resulting from composing the initial data with the solution of the characteristic system of ODEs starting at (t, x, η) evaluated at s = 0.
For initial conditions that are not differentiable it makes sense to introduce a weak notion of a solution.
The following definition and proposition were taken from [17]. A proof for the simplified case where $\operatorname{div}_x V = 0$ can be found in [40, Theorem 3.12].
As for strong solutions of the transport equation, the solution of the weak formulation is given by a composition of the initial condition with a flow along the characteristic curves, that is, $u(t, x, \eta) = u_0(X(0, t, x, \eta))$.
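The compositional form $u(t, x, \eta) = u_0(X(0, t, x, \eta))$ can be illustrated numerically. The sketch below assumes that $X(\cdot, t, x, \eta)$ solves $\partial_s X = V(s, X, \eta)$ with $X(t, t, x, \eta) = x$, and integrates this system backwards from $s = t$ to $s = 0$ with a classical Runge-Kutta scheme. The velocity field $V(t, x, \eta) = \eta_1 + \eta_2 \cos(t)$, the initial condition $u_0(x) = |x|$, and all function names are hypothetical choices made for this example.

```python
import math


def velocity(t, x, eta):
    """Hypothetical smooth, bounded field V(t, x, eta) = eta_1 + eta_2 * cos(t)."""
    return eta[0] + eta[1] * math.cos(t)


def flow(s, t, x, eta, steps=200):
    """Approximate X(s, t, x, eta) by integrating dX/dr = V(r, X, eta)
    from r = t (where X = x) to r = s with the classical RK4 scheme."""
    h = (s - t) / steps
    r, y = t, x
    for _ in range(steps):
        k1 = velocity(r, y, eta)
        k2 = velocity(r + h / 2, y + h / 2 * k1, eta)
        k3 = velocity(r + h / 2, y + h / 2 * k2, eta)
        k4 = velocity(r + h, y + h * k3, eta)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        r += h
    return y


def u0(x):
    """Lipschitz but non-differentiable initial condition |x| = rho(x) + rho(-x)."""
    return max(0.0, x) + max(0.0, -x)


def solution(t, x, eta):
    """Compose the rough, low-dimensional u0 with the smooth flow map."""
    return u0(flow(0.0, t, x, eta))


def exact_solution(t, x, eta):
    """Closed form for this particular V: X(0, t, x, eta) = x - eta_1 t - eta_2 sin(t)."""
    return abs(x - eta[0] * t - eta[1] * math.sin(t))
```

The roughness of the solution stems entirely from $u_0$, which depends on $n$ variables only, while the dependence on $(t, x, \eta)$ enters through the smooth map $X$. This split is what the approximation results in Section 4 exploit.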

Solutions of extensions of the parametric linear transport equation
In Section 4, we extend our main result to linear transport equations that include source terms and are formulated in conservative form. The following two propositions present the corresponding existence results and the form of the solutions for problems with source terms and in conservative form.
has a unique solution $u \in C^{\min\{s, s', k\}}([0,T] \times \mathbb{R}^n \times [0,1]^D)$ which is given by

$$u(t, x, \eta) = u_0(X(0, t, x, \eta)) + \int_0^t f(s, X(s, t, x, \eta), \eta) \, ds.$$

In this case, the associated weak formulation is given by (3.3) after replacing the right-hand side by

of that problem is still given by (3.5). Here one only needs to assume that $u_0$ is continuous, [40, Remark 3.13] or [17].

DNN Approximation of Solutions of Linear Transport Equations
Theorem 3.5 and Propositions 3.8 and 3.10 suggest that the solutions of parametric linear transport equations are of a compositional form, where the initial condition is composed with a flow along characteristic curves.
Since realisations of NNs are naturally of compositional structure, it is therefore conceivable that the form of the solutions of linear transport equations can be efficiently resolved by NNs. Indeed, based on this observation we present, for each of the cases discussed in Section 3, an approximation result for the solution of the associated parametric linear transport equations by NNs.

Standard linear transport equations
We start by presenting an approximation result for the solutions of standard linear transport equations as described in Theorem 3.5. We will assume that the initial condition can be approximated reasonably well by NNs. For this, we use the following definition:

Definition 4.1. Let $n \in \mathbb{N}$ and $r > 0$. A function $f \in L^\infty(\mathbb{R}^n)$ is r-approximable by NNs if, for every compact set $K \subset \mathbb{R}^n$, there exists a constant $c = c(K, r, f) > 0$ such that for every $\varepsilon \in (0, 1)$ there exists a NN $\Phi_{f,\varepsilon}$ such that
• $L(\Phi_{f,\varepsilon}) \leq c \cdot (\ln(1/\varepsilon) + 1)$,

Remark 4.2. By Theorem 2.6, every function $f \in C^s(\mathbb{R}^n)$ is r-approximable for $r = s/n$.
We now present the main theorems of this section for the strong and weak formulation of standard linear transport equations below. Afterward, in Remark 4.5, we discuss to what extent the resulting approximation rates improve upon a direct application of Theorem 2.6 to the solution u of a linear transport equation. We present the proofs of the theorems at the end of this subsection.
Then, for every $\varepsilon \in (0, 1)$ and every compact subset $K \subset \mathbb{R}^n$, there exists a NN $\Phi_{u,\varepsilon}$ with $d$-dimensional input, where $d := 1 + n + D$, such that for the restriction $u := u|_{[0,T] \times K \times [0,1]^D}$ there holds that, for $c = c(n, r, d, k, K, T, \|V\|_{C^k}, u_0) > 0$,

In Theorem 4.3 above, the initial condition is required to be continuously differentiable. In the following result, we extend Theorem 4.3 to initial conditions that are Lipschitz continuous only. To handle initial conditions that are not continuously differentiable, we have to consider weak solutions as described in Definition 3.6.

Theorem 4.4. Let $V$ satisfy assumptions (H1) and (H2) for $k, n, D \in \mathbb{N}$, and $T > 0$. Further let, for $r > 0$, $u_0 \in W^{1,\infty}_{\mathrm{loc}}$ be r-approximable by NNs. Let $u(t, x, \eta) = u_0(X(0, t, x, \eta))$ denote the weak solution of the Cauchy problem for the parametric linear transport equation of (3.1) according to Proposition 3.7. Then, for every $\varepsilon \in (0, 1)$ and every compact subset $K \subset \mathbb{R}^n$, there exists a NN $\Phi_{u,\varepsilon}$ with $d$-dimensional input, where $d := 1 + n + D$, such that for the restriction $u := u|_{[0,T] \times K \times [0,1]^D}$ there holds that

for $c = c(n, r, d, k, |K|, T, \|V\|_{C^k}, u_0) > 0$.

Remark 4.5. The typical framework in which we expect to apply Theorems 4.3 and 4.4 above is that where $V$ is substantially smoother than the initial condition $u_0$. Concretely, in the following two situations we have that the approximation rates resulting from an application of Theorems 4.3 or 4.4 are significantly better than those resulting from a direct approximation of $u$ by Theorem 2.6.
• Assume that, in the notation of Theorems 4.3 and 4.4, $n \leq s \ll d \leq k$, and $u_0 \in C^s$. In this situation, the dimension of the parameter space is significantly larger than the dimension of the physical domain. The dependence of $V$ on the parameters is, however, very regular.
Then $u \in C^s([0,T] \times K \times [0,1]^D)$ and a direct application of Theorem 2.6 would yield an approximating network with a complexity bound for the number of weights of the form $c \cdot (\ln(1/\varepsilon) + 1)\varepsilon^{-d/s}$. On the other hand, Theorem 2.6 yields that $u_0$ is 1-approximable by NNs and hence Theorem 4.3 yields a complexity bound that is not worse than $c \cdot (\ln(1/\varepsilon) + 1)\varepsilon^{-1}$.
• Assume that $D \in \mathbb{N}$ and, in the notation of Theorems 4.3 and 4.4, $n \ll n + D$, and $k = d$. Moreover, assume that $u_0 \notin W^{2,\infty}_{\mathrm{loc}}$, but $u_0 \in W^{1,\infty}_{\mathrm{loc}}$ and $u_0$ can be very efficiently represented by the realisation of a NN. A typical example is that $u_0$ is a ramp function along a hyperplane or a piecewise affine function. In this case, $u_0$ is r-approximable for every $r \in \mathbb{R}$.
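To get a feeling for the gap between the two bounds in the first situation, the following sketch evaluates both expressions for a hypothetical configuration. The constants $c$ are unknown and differ between the two bounds; setting both to 1 is purely an assumption for illustration, so only the exponents of $\varepsilon$ carry meaning. The function names and the chosen values of $n$, $s$, $D$, and $\varepsilon$ are ours.

```python
import math


def direct_bound(eps, d, s, c=1.0):
    """c * (ln(1/eps) + 1) * eps^(-d/s): direct approximation of u in C^s
    on a d-dimensional domain (constant c = 1 is an assumption)."""
    return c * (math.log(1.0 / eps) + 1.0) * eps ** (-d / s)


def compositional_bound(eps, c=1.0):
    """c * (ln(1/eps) + 1) * eps^(-1): bound stated for the compositional
    approach when u0 is 1-approximable (constant c = 1 is an assumption)."""
    return c * (math.log(1.0 / eps) + 1.0) * eps ** (-1.0)


# Hypothetical configuration with n <= s << d: n = 2, s = 2, D = 30.
n, s, D = 2, 2, 30
d = 1 + n + D
eps = 0.1
ratio = direct_bound(eps, d, s) / compositional_bound(eps)
```

With equal constants, the ratio of the two bounds is $\varepsilon^{1 - d/s}$, which for this configuration grows like $\varepsilon^{-15.5}$ as $\varepsilon \to 0$. The direct bound deteriorates with every additional parameter, while the second bound does not depend on $D$ in its exponent.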

Proof of Theorem 4.4.
The proof is completely analogous to the proof of Theorem 4.3 since the form of the solution did not change and our estimates only required u 0 to be Lipschitz continuous, but not that u 0 was in C 1 .

Non-vanishing source term
In the following, we extend Theorem 4.3 to non-vanishing source terms. We state our results for two different types of source terms. For $V$ satisfying assumptions (H1) and (H2) with $k, n, D \in \mathbb{N}$, and $T > 0$, we assume that one of the following properties holds:

In words, we assume high regularity of $f$ if it depends on $\eta$ while much less regularity of $f$ is sufficient for the $\eta$-independent case.
Remark 4.6. In both cases, (4.1) and (4.2), Theorem 2.6 demonstrates that for every $\varepsilon \in (0, 1)$ there exists a NN $\Phi_{f,\varepsilon}$ such that

Based on the assumption on the source term, we next present an approximation result for solutions of non-homogeneous parametric linear transport equations.
Then, for every $\varepsilon \in (0, 1)$ and every compact subset $K \subset \mathbb{R}^n$, there exists a NN $\Phi_{u,\varepsilon}$ with $d$-dimensional input, where $d := 1 + n + D$, such that for the restriction $u := u|_{[0,T] \times K \times [0,1]^D}$ there holds that

The proof, therefore, proceeds as follows: First, via Proposition B.1, we construct a NN the realisation of which approximates the antiderivative of $f$ with respect to the first coordinate. Then we construct a second NN via Theorem 4.3 the realisation of which approximates $u_0 \circ X$. Finally, Proposition 2.4 yields a NN such that the associated realisation approximates (4.3).
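The three steps of this proof outline have a simple numerical analogue, sketched below under our own assumptions: the solution is taken to be of the form $u_0(X(0, t, x, \eta)) + \int_0^t f(s, X(s, t, x, \eta), \eta)\, ds$, the integral is replaced by a left Riemann sum (the quadrature named in the title of Appendix B), and the two parts are added. The constant velocity field $V \equiv \eta$, the source term $f(t, x, \eta) = x$, the initial condition $|x|$, and all function names are hypothetical choices that make a closed-form comparison possible.

```python
def flow(s, t, x, eta):
    """Closed-form characteristic for the hypothetical constant field V = eta:
    X(s, t, x, eta) = x + eta * (s - t)."""
    return x + eta * (s - t)


def source(t, x, eta):
    """Hypothetical source term f(t, x, eta) = x."""
    return x


def u0(x):
    """Lipschitz initial condition."""
    return abs(x)


def left_riemann(t, x, eta, num_nodes):
    """Left Riemann sum for the integral of f(s, X(s, t, x, eta), eta) over [0, t]."""
    h = t / num_nodes
    return h * sum(
        source(i * h, flow(i * h, t, x, eta), eta) for i in range(num_nodes)
    )


def solution(t, x, eta, num_nodes=1000):
    """Sum of the transported initial condition and the integrated source."""
    return u0(flow(0.0, t, x, eta)) + left_riemann(t, x, eta, num_nodes)


def exact_solution(t, x, eta):
    """For these choices the integral equals x * t - eta * t^2 / 2."""
    return abs(x - eta * t) + x * t - eta * t ** 2 / 2.0
```

For this linear integrand the quadrature error equals $\eta t^2 / (2N)$ for $N$ nodes, so the accuracy of the sum is controlled by the number of nodes, which in the NN construction translates into network size.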

Conservative form
Next, we extend our results to transport equations in conservative form by invoking Proposition 3.10.

Theorem 4.9. Let $V$ satisfy assumptions (H1) and (H2) with $n, D, k \in \mathbb{N}$, and $T > 0$. Further let, for $r > 0$, $u_0 \in C^1(\mathbb{R}^n)$ be r-approximable by NNs. Let $u \in C^1([0,T] \times \mathbb{R}^n \times [0,1]^D)$ denote the unique solution of the Cauchy problem for the conservative parametric linear transport equation

Then, for every $\varepsilon \in (0, 1)$ and every compact subset $K \subset \mathbb{R}^n$, there exists a NN $\Phi_{u,\varepsilon}$ with $d$-dimensional input, where $d := 1 + n + D$, such that for the restriction $u := u|_{[0,T] \times K \times [0,1]^D}$ there holds

for $c = c(n, r, d, k, |K|, T, \|V\|_{C^k}, u_0) > 0$.

Remark 4.10.
• If $u_0 \in C^s(\mathbb{R}^n)$, then Remark 4.2 demonstrates that we can replace $r$ in Theorem 4.9 by $s/n$ and the constant $c$ in Theorem 4.9 depends more specifically on $\|u_0\|_{C^s}$.
The proof proceeds by first approximating $J$ by a NN with the help of Theorem 2.6, then invoking the known approximation of $u_0 \circ X$ via Theorem 4.3, and then applying the multiplication of NNs by Proposition 2.5.

Extensions
Below, we discuss some natural extensions of our work to more general settings.
• Non-linear transport equations: An immediate question is to what extent the results carry over to the non-linear setting. We believe that it is highly unlikely that similar results hold in this regime without overly restrictive assumptions. Indeed, in the non-linear case, non-smoothness of the initial condition $u_0$ potentially implies non-smoothness of the characteristic curves described by $X$. This can already be seen in the one-dimensional case. We consider the one-dimensional non-linear transport equation

The characteristic system of ODEs is then given by [51, p. 26] as

$$\partial_s X(s, t, x, \eta) = f'(u(t, X(s, t, x, \eta), \eta)), \tag{5.1a}$$
$$X(t, t, x, \eta) = x. \tag{5.1b}$$

Hence, the regularity of the characteristic curves described by $X$ depends on the global regularity of $u$ and therefore on the regularity of $u_0$. Consequently, $X$ is not guaranteed to be smooth.
If X is non-smooth, then the fundamental backbone of the argument, which is that u can be written as the composition of a high-dimensional smooth and low-dimensional (potentially) rough function, collapses.
• Damping/amplification: The extension of our results to parametric linear transport equations that include an amplification or damping factor is straightforward. More precisely, we consider solutions of the equation

where $a$ is, similarly to (4.1) and (4.2), either given by a(t,

If $V$ satisfies assumptions (H1) and (H2) with $n, D, k \in \mathbb{N}$, $T > 0$, and $u_0 \in C^1(\mathbb{R}^n)$ being r-approximable by NNs for $r > 0$, then one can show, see [23], that there exists a unique solution of (5.2) which is given by

To get an estimate on the sizes of approximating NNs for functions of the form of (5.3), we only have to combine previous results. Section 4.2 describes how to approximate the map $(t, x, \eta) \mapsto -\int_0^t a(\tau, X(\tau, t, x, \eta), \eta) \, d\tau$ by realisations of NNs. This approximation can be concatenated with an approximation of the smooth, one-dimensional exponential function via Proposition 2.2. Finally, the result may be multiplied, via Proposition 2.5, with the already known approximation of $u_0(X(0, t, x, \eta))$ from Theorem 4.3. This yields a NN $\Phi_{u,\varepsilon}$ such that the realisation of $\Phi_{u,\varepsilon}$ approximates (5.3) up to an error of $\varepsilon > 0$. Estimating the individual sizes of the networks involved in the construction of $\Phi_{u,\varepsilon}$ yields that

with $c = c(n, r, d, k, |K|, T, \|V\|_{C^k}, u_0, \|a\|_{C^{s'}}) > 0$. As before, if $u_0 \in C^s(\mathbb{R}^n)$, then $r = s/n$.
• Parameter dependence of initial condition: We only considered the case where $u_0$ does not depend on the parameters. It is not hard to see that, in the framework of r-approximability, the same result would hold if $u_0$ depended on the parameters. However, if $u_0 \in C^s(\mathbb{R}^n \times [0,1]^D)$, then Remark 4.2 would yield an approximation rate depending on the dimension $D$ of the parameter space.
For an application of Remark 4.2 it is required that $u_0$ is a low-dimensional function. Hence, if $u_0 \in C^s$ depends on very few parameters, say the first $t \ll D$, then all main theorems can be extended directly. Instead of approximating $x \mapsto u_0(x)$ with a NN up to an error of $\varepsilon > 0$ and having to use $O(\varepsilon^{-n/s})$ many weights for $\varepsilon \to 0$, one would instead approximate $x \mapsto u_0(x, \eta_1, \dots, \eta_t)$, which requires $O(\varepsilon^{-(n+t)/s})$ many weights for $\varepsilon \to 0$. A second framework in which $u_0$ could be guaranteed to be r-approximable with large $r$ while having low spatial smoothness is that where the parameter dependence is decoupled from the dependence on the spatial coordinates. For example, if $u_0(x, \eta) = \tilde{u}_0(x) \cdot \kappa_1(\eta) + \kappa_2(\eta)$ for smooth $\kappa_1, \kappa_2$, then again low regularity of $\tilde{u}_0$ could suffice to achieve fast rates.
• Weak solutions with discontinuous initial condition: Since realisations of deep neural networks are always continuous functions, we cannot hope to obtain approximation results in the uniform norm as studied in this work. However, if one considers L p -approximation, for p ∈ [1, ∞) instead, then approximation of piecewise regular functions is possible. This situation was studied in [46].

A Proof of Proposition 3.4
Proof. For simplicity we drop the $\eta$-dependence of $X$ since derivatives with respect to $\eta$ are easy to compute and bound. Moreover, we use the norm and semi-norm , and define $d := 1 + 1 + n$. We start with the definition of $X$ given by

By the fundamental theorem of calculus we conclude

In the following, we show how to bound the zeroth, first, and second derivatives to identify the pattern by which these bounds are built. With the help of the sub-linear growth condition (H2) we conclude and by Gronwall's inequality

Hence,

Now we derive bounds for the first derivatives. In the following, we abbreviate $\|V\|_{C^j([0,T] \times G_0)}$ by $\|V\|_{C^j}$. We have by (A.1a)

Furthermore, by the Leibniz integral rule applied to (A.2) and therefore Gronwall's inequality implies then

The same procedure results for $\nabla_x X$ in

Thus, we get after assuming without loss of generality that $T \geq 1$, $\|V\|_{C^j} \geq 1$

As the next step, we derive a bound for the second derivatives. We have

$$\partial_{ss} X(s, t, x) = \partial_s V(s, X(s, t, x)) + \nabla_x V(s, X(s, t, x)) \, \partial_s X(s, t, x)$$

and consequently we conclude again by Gronwall's inequality

The derivatives $\partial_{sx} X$, $\partial_{tx} X$ and $\partial_{xx} X$ can be computed in the same way and bounded by the same term as $\partial_{tt} X$. Note that the semi-norm $|X|_{C^1}$ is bounded by $|X|_{C^2}$. Thus, we obtain

We can iterate this procedure to obtain the bound for the $k$-th derivative of the form:

where we have used that the semi-norm $|\cdot|_{C^k}$ bounds the $k - 1$ previous semi-norms. The bound in (A.5) includes a factor $2^k$ because every application of the product rule doubles the number of summands and a factor $T^{k-1}$ because with every derivative we get a new factor $T$ from bounding $\int_s^t \dots$.

Remark. The proof of Proposition 3.4 might not provide a very sharp bound for the $C^k$-norm of $X$. However, for our approximation result in Section 4 it is only important that the norm is bounded by some constant that depends on the given data, since the norm of $X$ only enters as a constant and does not influence the asymptotic behaviour in the tolerance $\varepsilon$. Whether the size of the bound influences the practicality of the approximation via deep NNs will be investigated in numerical simulations in future work.

B Construction of a NN emulating the left Riemann sum
We have that W(Φ

We have that

Since, by (B.6),

Proof. Let $N(t) := \max\{i \in \mathbb{N} \mid t_i < t\}$. Then