Error analysis for physics-informed neural networks (PINNs) approximating Kolmogorov PDEs

Physics-informed neural networks approximate solutions of PDEs by minimizing pointwise residuals. We derive rigorous bounds on the error incurred by PINNs in approximating the solutions of a large class of linear parabolic PDEs, namely Kolmogorov equations, which include the heat equation and the Black-Scholes equation of option pricing as examples. We construct neural networks whose PINN residual (generalization error) can be made as small as desired. We also prove that the total L²-error can be bounded by the generalization error, which in turn is bounded in terms of the training error, provided that a sufficiently large number of randomly chosen training (collocation) points is used. Moreover, we prove that the size of the PINNs and the number of training samples only grow polynomially with the underlying dimension, enabling PINNs to overcome the curse of dimensionality in this context. These results enable us to provide a comprehensive error analysis for PINNs in approximating Kolmogorov PDEs.


Introduction
Background and context. Partial differential equations (PDEs) are ubiquitous as mathematical models in the sciences and engineering. Explicit solution formulas for PDEs are not available except in very rare cases. Hence, numerical methods, such as finite difference, finite element and finite volume methods, are key tools in approximating solutions of PDEs. In spite of their well-documented successes, it is clear that these methods are inadequate for a variety of problems involving PDEs. In particular, these methods are not suitable for efficiently approximating PDEs with high-dimensional state or parameter spaces. Such problems arise in different contexts, ranging from PDEs such as the Boltzmann, radiative transfer, Schrödinger and Black-Scholes type equations with a very high number of spatial dimensions, to many-query problems, as in uncertainty quantification (UQ), optimal design and inverse problems, which are modelled by PDEs with very high parametric dimensions.
Given this pressing need for efficient algorithms to approximate the aforementioned problems, machine learning methods are being increasingly deployed in the context of scientific computing. In particular, deep neural networks (DNNs), i.e., multiple compositions of affine functions and scalar nonlinearities, are being widely used. Given the universality of DNNs in being able to approximate any continuous (measurable) function to desired accuracy, they can serve as ansatz spaces for solutions of PDEs, as for high-dimensional semi-linear parabolic PDEs [7], linear elliptic PDEs [36,16] and nonlinear hyperbolic PDEs [24,25] and references therein. More recently, DNN-inspired architectures such as DeepONets [4,22,19] and Fourier neural operators [21] have been shown to even learn infinite-dimensional operators, associated with underlying PDEs, efficiently.
A large part of the literature on the use of deep learning for approximating PDEs relies on the supervised learning paradigm, where the DNN has to be trained on possibly large amounts of labelled data. In practice, however, such data is acquired from either measurements or computer simulations. Such simulations might be computationally expensive [24] or even infeasible in many contexts, impeding the efficiency of supervised learning algorithms. Hence, it would be very desirable to find a class of machine learning algorithms that can approximate PDEs either without any explicit need for data or with very small amounts of data. Physics-informed neural networks (PINNs) provide exactly such a framework.
Physics-Informed Neural Networks (PINNs). PINNs were first proposed in the 90s [6,18,17] as a machine learning framework for approximating solutions of differential equations. However, they were resurrected recently in [33,34] as a practical and computationally efficient paradigm for solving both forward and inverse problems for PDEs. Since then, there has been an explosive growth in designing and applying PINNs to a variety of applications involving PDEs. A very incomplete list of references includes [35,23,26,32,40,13,14,27,28,29,1] and references therein.

(T. De Ryck and S. Mishra) Seminar for Applied Mathematics, ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland. E-mail addresses: tim.deryck@sam.math.ethz.ch, siddhartha.mishra@sam.math.ethz.ch.

We briefly illustrate the idea behind PINNs by considering the following general form of a PDE,

(1.1) D[u](x, t) = 0, Bu(y, t) = ψ(y, t), u(x, 0) = ϕ(x), for x ∈ D, y ∈ ∂D, t ∈ [0, T].

Here, D ⊂ R^d is compact and D, B are the differential and boundary operators, u : D × [0, T] → R^m is the solution of the PDE, ψ : ∂D × [0, T] → R^m specifies the (spatial) boundary condition and ϕ : D → R^m is the initial condition. We seek deep neural networks u_θ : D × [0, T] → R^m (see (2.6) for a definition), parameterized by θ ∈ Θ, constituting the weights and biases, that approximate the solution u of (1.1). To this end, the key idea behind PINNs is to consider the pointwise residuals, defined for any sufficiently smooth function f : D × [0, T] → R^m by

(1.2) R_i[f](x, t) = D[f](x, t), R_s[f](y, t) = Bf(y, t) − ψ(y, t), R_t[f](x) = f(x, 0) − ϕ(x),

for x ∈ D, y ∈ ∂D, t ∈ [0, T]. These residuals measure how well a function f satisfies, respectively, the PDE, the boundary condition and the initial condition of (1.1). Note that for the exact solution u, all residuals vanish identically. Hence, within the PINNs algorithm, one seeks a neural network u_θ for which all residuals are simultaneously minimized, e.g.
by minimizing the quantity,

(1.3) E_G(θ)² = ∫_{D×[0,T]} |R_i[u_θ](x, t)|² dx dt + ∫_{∂D×[0,T]} |R_s[u_θ](y, t)|² ds(y) dt + ∫_D |R_t[u_θ](x)|² dx.

However, the quantity E_G(θ), often referred to as the population risk or generalization error [27] of the neural network u_θ, involves integrals and can therefore not be directly minimized in practice. Instead, the integrals in (1.3) are approximated by a numerical quadrature, resulting in,

(1.4) E_T^i(θ, S_i)² = Σ_{n=1}^{N_i} w_i^n |R_i[u_θ](t_i^n, x_i^n)|², E_T^s(θ, S_s)² = Σ_{n=1}^{N_s} w_s^n |R_s[u_θ](t_s^n, x_s^n)|², E_T^t(θ, S_t)² = Σ_{n=1}^{N_t} w_t^n |R_t[u_θ](x_t^n)|².

Here, one samples quadrature points in space-time to construct the data sets S_i = {(t_i^n, x_i^n)}_{n=1}^{N_i}, S_s = {(t_s^n, x_s^n)}_{n=1}^{N_s} and S_t = {x_t^n}_{n=1}^{N_t}, and the w_q^n are suitable quadrature weights for q = i, s, t. Thus, the generalization error E_G(θ) is approximated by the so-called training loss or training error [27],

(1.5) E_T(θ, S)² = E_T^i(θ, S_i)² + E_T^s(θ, S_s)² + E_T^t(θ, S_t)²,

where S = (S_i, S_s, S_t), and a stochastic gradient descent algorithm is used to approximate the non-convex optimization problem,

(1.6) θ* = arg min_{θ∈Θ} E_T(θ, S)²,

and u* = u_{θ*} is the trained PINN that approximates the solution u of the PDE (1.1).
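To make the residual-based loss concrete, the following sketch (our own illustration, not a construction from the paper) assembles the three residual families for the one-dimensional heat equation u_t = u_xx with zero Dirichlet data and initial condition sin(πx), using finite differences in place of automatic differentiation and uniform Monte Carlo weights. The exact solution u(x, t) = e^(−π²t) sin(πx) makes all three residuals nearly vanish, while a perturbed candidate does not.

```python
import math
import random

def u_exact(x, t):
    # Exact solution of u_t = u_xx on [0,1] with u(0,t)=u(1,t)=0, u(x,0)=sin(pi x).
    return math.exp(-math.pi**2 * t) * math.sin(math.pi * x)

def interior_residual(f, x, t, h=1e-4):
    # R_i[f] = f_t - f_xx, approximated by central finite differences.
    f_t = (f(x, t + h) - f(x, t - h)) / (2 * h)
    f_xx = (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / h**2
    return f_t - f_xx

def training_loss(f, N=100, seed=0):
    # Monte Carlo version of E_T^2 = E_i^2 + E_s^2 + E_t^2 over random collocation points.
    rng = random.Random(seed)
    E_i2 = sum(interior_residual(f, rng.random(), rng.random())**2 for _ in range(N)) / N
    E_s2 = sum(f(float(rng.random() < 0.5), rng.random())**2 for _ in range(N)) / N  # psi = 0
    E_t2 = sum((f(x, 0.0) - math.sin(math.pi * x))**2
               for x in (rng.random() for _ in range(N))) / N
    return E_i2 + E_s2 + E_t2

loss_exact = training_loss(u_exact)                                     # ~ 0
loss_bad = training_loss(lambda x, t: u_exact(x, t) + 0.1 * x * (1 - x))  # clearly nonzero
```

The perturbation 0.1·x(1−x) respects the boundary and is small, yet its nonzero second spatial derivative is picked up by the interior residual, illustrating why minimizing (1.5) pushes u_θ toward the PDE solution rather than merely toward the data.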
This algorithm naturally raises the following questions. Q1. Does there exist a neural network u_θ̂, with parameter θ̂, such that the corresponding generalization error E_G(θ̂) (1.3) is as small as desired? Q2. Given a small generalization error, is the corresponding total error in approximating the solution u also proportionately small?

The above questions are of fundamental importance, as affirmative answers to them certify that, in principle, there exists a PINN, corresponding to the parameter θ̂, such that the resulting PDE residual (1.2) is small, and consequently also the overall error in approximating the solution of the PDE (1.1). Moreover, the smallness of the generalization error E_G(θ̂) can imply that the training error E_T(θ̂) (1.5), which is an approximation of the generalization error, is also small. Hence, in principle, the (global) minimization of the optimization problem (1.6) should result in a proportionately small training error.
However, the optimization problem (1.6) involves the minimization of a non-convex, very high-dimensional objective function. Hence, it is unclear if a global minimum is attained by a gradient-descent algorithm. In practice, one can evaluate the training error E_T(θ*) for the (local) minimizer θ* of (1.6). Thus, it is natural to ask: Q3. Given a small training error E_T(θ*) and a sufficiently large training set S, is the corresponding generalization error E_G(θ*) also proportionately small?
An affirmative answer to question Q3, together with question Q2, will imply that the trained PINN u_{θ*} is an accurate approximation of the solution u of the underlying PDE (1.1). Thus, answering the above three questions affirmatively will constitute a comprehensive theoretical investigation of PINNs and provide a rationale for their very successful empirical performance.
Given the very large number of papers exploring PINNs empirically, the rigorous theoretical study of PINNs is in a relative state of infancy. In [37], the authors prove a consistency result for PINNs for linear elliptic and parabolic PDEs, where they show that if E_T(θ_m) → 0 for a sequence of neural networks {u_{θ_m}}_{m∈N}, then ‖u_{θ_m} − u‖_{L^∞} → 0, under the assumption that one adds a specific C^{k,α}-regularization term to the loss function, thus partially addressing question Q3 for these PDEs. However, this result does not provide quantitative estimates on the underlying errors. A similar result, with more quantitative estimates, is provided for advection equations in [38].
In [27,28], the authors provide a strategy for answering questions Q2 and Q3 above. They leverage the stability of solutions of the underlying PDE (1.1) to bound the total error in terms of the generalization error (question Q2). Similarly, they use the accuracy of quadrature rules to bound the generalization error in terms of the training error (question Q3). This approach is implemented for the forward problem corresponding to a variety of PDEs, such as semi-linear and quasi-linear parabolic equations and the incompressible Euler (Navier-Stokes) equations [27], radiative transfer equations [29], nonlinear dispersive PDEs such as the KdV equations [1], and for the unique continuation (data assimilation) inverse problem for many linear elliptic, parabolic and hyperbolic PDEs [28]. However, these works suffer from two essential limitations: first, question Q1 on the smallness of the generalization error is not addressed; second, the assumptions on the quadrature rules in [27,28] are rather stringent and, in particular, the analysis does not cover the common choice of random sampling points in S, unless an additional validation set is chosen. Thus, the theoretical analysis presented in [27,28] is incomplete, and this sets the stage for the current paper.
Aims and scope of this paper. Given the above discussion, our main aims in this paper are to address the fundamental questions Q1, Q2 and Q3 and to establish a solid foundation and rigorous rationale for PINNs in approximating PDEs.
To this end, we choose to focus on a specific class of PDEs, the so-called Kolmogorov equations [31]. These equations are a class of linear parabolic PDEs which describe the space-time evolution of the density for a large set of stochastic processes. Prototypical examples include the heat (diffusion) equation and Black-Scholes type PDEs that arise in option pricing. A key feature of Kolmogorov PDEs is the fact that the equations are set in very high dimensions. For instance, the spatial dimension in a Black-Scholes PDE is given by the number of underlying assets (stocks) upon which the basket option is contingent, and can range up to hundreds of dimensions.
Our motivation for illustrating our analysis on Kolmogorov PDEs is two-fold. First, they offer a large class of PDEs with many applications, while still being linear. Second, it has already been shown empirically in [27,39,30] that PINNs can approximate very high-dimensional Kolmogorov PDEs efficiently.
Thus, in this paper,

• We show that there exist PINNs, approximating a class of Kolmogorov PDEs, such that the resulting generalization error (1.3), and the total error, can be made as small as desired. Moreover, under suitable hypotheses on the initial data and the underlying exact solutions, we show that the size of these PINNs does not grow exponentially with respect to the spatial dimension of the underlying PDE. This is done by explicitly constructing PINNs using a representation formula, the so-called Dynkin's formula, that relates the solutions of the Kolmogorov PDE to the generator and sample paths of the underlying stochastic process.

• We leverage the stability of Kolmogorov PDEs to bound the error, incurred by PINNs in the L²-norm in approximating solutions of Kolmogorov PDEs, by the underlying generalization error.

• We provide rigorous bounds for the generalization error of the PINN approximating Kolmogorov PDEs in terms of the underlying training error (1.5), provided that the number of randomly chosen training points is sufficiently large. Furthermore, the number of random training points does not grow exponentially with the dimension of the underlying PDE. We use a novel error decomposition and standard Hoeffding-inequality-type covering number estimates to derive these bounds.
Thus, we provide affirmative answers to questions Q1, Q2 and Q3 for this large class of PDEs. Moreover, we also show that PINNs can overcome the curse of dimensionality in approximating these PDEs. Hence, our results place PINNs for these PDEs on solid theoretical foundations. The rest of the paper is organized as follows: in section 2, we present preliminary material on linear Kolmogorov equations and describe the PINNs algorithm to approximate them. The generalization error and total error (questions Q1 and Q2) are considered in section 3, and the generalization error is bounded in terms of the training error (question Q3) in section 4.

2. PINNs for Linear Kolmogorov Equations

2.1. Linear Kolmogorov PDEs. In this paper, we consider the following general form of linear time-dependent partial differential equations,

(2.1) ∂_t u(x, t) = (1/2) Trace(σ(x)σ(x)^T H_x u(x, t)) + μ(x) · ∇_x u(x, t), u(x, 0) = ϕ(x), for (x, t) ∈ D × [0, T],

where σ : R^d → R^{d×d} and μ : R^d → R^d are affine functions, ∇_x denotes the gradient and H_x the Hessian (both with respect to the space coordinates). For definiteness, we set D = (0, 1)^d. A prototypical example is the Black-Scholes equation,

(2.3) ∂_t u(x, t) = (1/2) Σ_{i,j=1}^d ρ_{ij} β_i β_j x_i x_j ∂²u/∂x_i∂x_j (x, t) + Σ_{i=1}^d μ x_i ∂u/∂x_i (x, t).

Here, the β_i are stock volatilities, the coefficients ρ_{ij} model the correlation between the different stock prices, μ is an interest rate and the initial condition ϕ is interpreted as a payoff function. Prototypical examples of such payoff functions are ϕ(x) = max{Σ_i a_i x_i − K, 0} (basket call option), ϕ(x) = max{max_i a_i x_i − K, 0} (call on max), and analogously for put options. Our goal in this paper is to approximate the classical solution u of Kolmogorov equations with PINNs. We start with a brief recapitulation of neural networks below.
2.2. Neural Networks. Let σ : R → R be an (at least) twice continuously differentiable activation function, such as tanh or the sigmoid. For any n ∈ N, we write for z ∈ R^n that σ(z) := (σ(z_1), . . ., σ(z_n)). We formally define a neural network below.

Definition 2.1. Let R ∈ (0, ∞], L, W ∈ N and l_0, . . ., l_L ∈ N. Let σ : R → R be a twice differentiable function and let θ = ((W_1, b_1), . . ., (W_L, b_L)), with W_k ∈ R^{l_k × l_{k−1}} and b_k ∈ R^{l_k}, denote the parameters of the network. We denote by u_θ : R^{l_0} → R^{l_L} the function that satisfies for all z ∈ R^{l_0} that

(2.6) u_θ(z) = W_L σ(W_{L−1} σ(· · · σ(W_1 z + b_1) · · ·) + b_{L−1}) + b_L,

where in the setting of approximating Kolmogorov PDEs (2.1) we set l_0 = d + 1 and z = (x, t).
We refer to u_θ as the realization of the neural network associated to the parameter θ, with L layers of widths (l_0, l_1, . . ., l_L), of which the middle L − 1 layers are called hidden layers. For 1 ≤ k ≤ L, we say that layer k has width l_k and we refer to W_k and b_k as the weights and biases corresponding to layer k. If L ≥ 3, we say that u_θ is a deep neural network (DNN).
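A minimal sketch of Definition 2.1 (our own illustration, using plain Python lists rather than any particular library) realizes u_θ as alternating affine maps and the tanh activation, with no activation after the final layer:

```python
import math
import random

def init_network(widths, seed=0):
    # Parameters theta = ((W_1, b_1), ..., (W_L, b_L)) for layer widths (l_0, ..., l_L).
    rng = random.Random(seed)
    return [([[rng.uniform(-1, 1) for _ in range(m)] for _ in range(n)],
             [rng.uniform(-1, 1) for _ in range(n)])
            for m, n in zip(widths[:-1], widths[1:])]

def realize(theta, z):
    # u_theta(z): affine maps interleaved with tanh; no activation after layer L.
    for k, (W, b) in enumerate(theta):
        z = [sum(Wij * zj for Wij, zj in zip(row, z)) + bi for row, bi in zip(W, b)]
        if k < len(theta) - 1:
            z = [math.tanh(v) for v in z]
    return z

# A network with l_0 = d + 1 = 2 inputs (z = (x, t)), two hidden layers, scalar output.
theta = init_network([2, 8, 8, 1])
out = realize(theta, [0.5, 0.1])
```

With widths (2, 8, 8, 1) this is a deep neural network in the sense above (L = 3 ≥ 3), and its realization is twice continuously differentiable in (x, t) because tanh is smooth, which is what allows the pointwise PDE residuals below to be evaluated.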
2.3. PINNs. As already mentioned in the introduction, the key idea behind PINNs is to minimize pointwise residuals associated with the Kolmogorov PDE (2.1). To this end, we define the differential operator associated with (2.1),

(2.7) L[f](x, t) := (1/2) Trace(σ(x)σ(x)^T H_x f(x, t)) + μ(x) · ∇_x f(x, t).

Next, we define the following residuals associated with (2.1),

(2.8) R_i[f](x, t) = ∂_t f(x, t) − L[f](x, t), R_s[f](y, t) = f(y, t) − ψ(y, t), R_t[f](x) = f(x, 0) − ϕ(x),

for x ∈ D, y ∈ ∂D and t ∈ [0, T], where ψ denotes the Dirichlet boundary data. The generalization error for a neural network of the form (2.6), approximating the Kolmogorov PDE, is then given by the formula (1.3), but with the residuals defined in (2.8).
Given the possibly very high-dimensional domain D of (2.1), it is natural to use randomly sampled points to define the loss function θ → E_T(θ, S)² for PINNs as follows,

(2.9) E_T(θ, S)² = (1/N_i) Σ_{n=1}^{N_i} |R_i[u_θ](t_i^n, x_i^n)|² + (1/N_s) Σ_{n=1}^{N_s} |R_s[u_θ](t_s^n, x_s^n)|² + (1/N_t) Σ_{n=1}^{N_t} |R_t[u_θ](x_t^n)|²,

where the training data sets S_i = {(t_i^n, x_i^n)}_{n=1}^{N_i}, S_s = {(t_s^n, x_s^n)}_{n=1}^{N_s} and S_t = {x_t^n}_{n=1}^{N_t} are chosen randomly and independently with respect to the corresponding Lebesgue measures, and the residuals R_{i,s,t} are defined in (2.8).
A trained PINN u* = u_{θ*} is then defined as a (local) minimum of the optimization problem (1.6), with loss function (2.9) (possibly with additional data and weight regularization terms), found by a (stochastic) gradient descent algorithm such as Adam or L-BFGS.
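The full algorithm of this section can be sketched end to end. The following toy example (our own construction, not from the paper) trains a tiny one-hidden-layer tanh network on the one-dimensional heat equation u_t = u_xx with initial condition sin(πx) and zero boundary data, using finite-difference residuals at fixed random collocation points and plain gradient descent with a backtracking step size in place of Adam or L-BFGS:

```python
import math
import random

random.seed(1)
W = 5                                  # hidden width; widths (2, W, 1)
n_params = 2 * W + W + W + 1           # input weights, hidden biases, output weights, bias
theta = [random.uniform(-0.5, 0.5) for _ in range(n_params)]

def u_net(p, x, t):
    # Realization of the tanh network with flattened parameter vector p.
    hidden = [math.tanh(p[2*j] * x + p[2*j+1] * t + p[2*W + j]) for j in range(W)]
    return sum(p[3*W + j] * hidden[j] for j in range(W)) + p[4*W]

def loss(p):
    # Training loss (2.9) for u_t = u_xx, u(0,t)=u(1,t)=0, u(x,0)=sin(pi x),
    # with residuals evaluated by finite differences at fixed random collocation points.
    h, rng = 1e-3, random.Random(0)
    pts_i = [(rng.random(), rng.random()) for _ in range(20)]
    pts_t = [rng.random() for _ in range(20)]
    pts_s = [(float(k % 2), rng.random()) for k in range(10)]
    Ei = sum(((u_net(p, x, t+h) - u_net(p, x, t-h)) / (2*h)
              - (u_net(p, x+h, t) - 2*u_net(p, x, t) + u_net(p, x-h, t)) / h**2)**2
             for x, t in pts_i) / 20
    Et = sum((u_net(p, x, 0.0) - math.sin(math.pi * x))**2 for x in pts_t) / 20
    Es = sum(u_net(p, x, t)**2 for x, t in pts_s) / 10
    return Ei + Es + Et

def grad(p, eps=1e-6):
    # Forward-difference numerical gradient of the loss.
    base = loss(p)
    return [(loss(p[:k] + [p[k] + eps] + p[k+1:]) - base) / eps
            for k in range(len(p))], base

initial_loss = loss(theta)
for _ in range(60):                    # gradient descent with backtracking step size
    g, current = grad(theta)
    lr = 0.1
    while lr > 1e-8:
        cand = [pk - lr * gk for pk, gk in zip(theta, g)]
        if loss(cand) < current:
            theta = cand
            break
        lr /= 2
final_loss = loss(theta)
```

In practice one would use automatic differentiation for both the residuals and the parameter gradient; the numerical derivatives here only keep the sketch dependency-free. The resulting θ is a local minimizer in the sense of (1.6), which is exactly the object whose generalization error the questions Q1-Q3 concern.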

3. Bounds on the approximation error for PINNs
In this section, we first address question Q1 for PINNs approximating linear Kolmogorov equations (2.1), i.e., our aim is to construct a deep neural network (2.6), approximating (2.1), such that the corresponding generalization error E_G (1.3) is as small as desired.
Recalling that the Kolmogorov PDE is a linear parabolic equation with smooth coefficients, one can use standard parabolic theory to conclude that there exists a unique classical solution u of (2.1), and that it is sufficiently regular, for instance u ∈ W^{s,∞}((0, T) × D) for some s > 2. As u is a classical solution, the residuals (2.8), evaluated at u, vanish, i.e.,

(3.1) R_i[u] = R_s[u] = R_t[u] = 0.

Moreover, one can use recent results in approximation theory, such as those presented in [9,10,5] and references therein, to infer that one can find a deep neural network (2.6) that approximates the solution u in the W^{2,∞}-norm, and therefore yields an approximation for which the PINN residual is small. For instance, one can appeal to the following theorem (more details, including exact constants and bounds on the network weights, can be derived from the results in [5]).

Theorem 3.1. Let T > 0, γ, d, s ∈ N with s ≥ 2 + γ and let u ∈ W^{s,∞}([0, T] × [0, 1]^d) be the solution of a linear Kolmogorov PDE (2.1). Then for every ε > 0 there exists a tanh neural network u^ε = u_{θ^ε} with two hidden layers, whose width grows algebraically in ε^{−1} with an exponent proportional to d, such that E_G(θ^ε) ≤ ε.

Proof. It follows from [5, Theorem 5.1] that there exists a tanh neural network u^ε with two hidden layers of the stated width that approximates u in the W^{2,∞}-norm accurately enough to make R_i[u^ε] small. Using a standard trace inequality, one finds similar bounds for R_s[u^ε] and R_t[u^ε]. From this, it follows directly that E_G(θ^ε) ≤ ε.
Hence, u^ε is a neural network for which the generalization error (1.3) can be made arbitrarily small, providing an affirmative answer to Q1. However, from Theorem 3.1 we observe that the size (width) of the resulting deep neural network u^ε grows exponentially with the spatial dimension d of (2.1). Thus, this neural network construction clearly suffers from the curse of dimensionality. Hence, this construction cannot explain the robust empirical performance of PINNs in approximating Kolmogorov equations (2.1) in very high spatial dimensions [27,39,30]. Therefore, we need a different approach for obtaining bounds on the generalization error that overcome this curse of dimensionality. To this end, we rely on the specific structure of the Kolmogorov equations (2.1). In particular, we will use Dynkin's formula, which relates Kolmogorov PDEs to Itô diffusion SDEs.
In order to state Dynkin's formula, we first need to introduce some notation. Let (Ω, F, P, (F_t)_{t∈[0,T]}) be a stochastic basis, D ⊆ R^d a compact set and, for every x ∈ D, let X^x : Ω × [0, T] → R^d be the solution, in the Itô sense, of the stochastic differential equation,

(3.3) dX_t^x = μ(X_t^x) dt + σ(X_t^x) dB_t, X_0^x = x,

where B_t is a standard d-dimensional Brownian motion on (Ω, F, P, (F_t)_{t∈[0,T]}). The existence of X^x is guaranteed by Lemma A.5. Dynkin's formula relates the generator F of X_t^x, given in e.g. [31], with the initial condition ϕ ∈ C²(D) and the differential operator L (2.7) of the corresponding Kolmogorov PDE (2.1). Equipped with this notation, we state Dynkin's formula below.

Lemma 3.2 (Dynkin's formula). For every x ∈ D, let X^x be the solution to a linear Kolmogorov SDE (3.3) with affine μ : R^d → R^d and σ : R^d → R^{d×d}. Then the solution u of (2.1) satisfies

(3.5) u(x, t) = E[ϕ(X_t^x)] = ϕ(x) + E[∫_0^t (Fϕ)(X_s^x) ds].

Proof. See Corollary 6.5 and Section 6.10 in [15].
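The probabilistic representation behind Dynkin's formula can be sampled directly. The following sketch (our own illustration, with assumed Black-Scholes type coefficients μ(x) = μx and σ(x) = σx in one dimension) estimates u(x, t) = E[ϕ(X_t^x)] by simulating the SDE (3.3) with the Euler-Maruyama scheme:

```python
import math
import random

def u_dynkin(x, t, mu, sigma, phi, n_paths=10_000, n_steps=50, seed=0):
    # Monte Carlo estimate of u(x,t) = E[phi(X_t^x)], where X solves the
    # Black-Scholes type SDE dX = mu*X dt + sigma*X dB, via Euler-Maruyama.
    rng = random.Random(seed)
    dt = t / n_steps
    sqdt = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        X = x
        for _ in range(n_steps):
            X += mu * X * dt + sigma * X * sqdt * rng.gauss(0.0, 1.0)
        total += phi(X)
    return total / n_paths

# For phi(y) = y, the exact value is E[X_t^x] = x * exp(mu * t).
approx = u_dynkin(1.0, 1.0, mu=0.05, sigma=0.2, phi=lambda y: y)
exact = math.exp(0.05)
```

This is the quantity that the construction below emulates with neural networks: the expectation is replaced by a Monte Carlo average over finitely many sample paths, and ϕ (resp. Fϕ) by a network approximation.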
Our construction of a neural network with small residual (2.8) relies on emulating the right-hand side of Dynkin's formula (3.5) with neural networks. In particular, the initial data ϕ and the generator term Fϕ will be approximated by suitable tanh neural networks, whereas the expectation in (3.5) will be replaced by an accurate Monte Carlo sampling. Our construction is summarized in the following theorem.

Theorem 3.3. Assume that for every ξ, δ, c > 0 there exist tanh neural networks ϕ_{ξ,d} and (F̂ϕ)_δ approximating ϕ and Fϕ to accuracy ξ and δ, respectively, with sizes and weights growing at most algebraically in ξ^{−1}, δ^{−1} and d, with exponents p, β and ζ (assumption (3.7)). Then there exist constants C, λ > 0 such that for every ε > 0 and d ∈ N, there exist a constant ρ_d > 0 and a tanh neural network Ψ_{ε,d} with at most C(dρ_d)^λ ε^{−max{5p+3, 2+p+β}} neurons and weights that grow at most as C(dρ_d)^λ ε^{−max{ζ, 8p+6}} for ε → 0, such that the PINN residual of Ψ_{ε,d} is at most ε (3.8). Moreover, ρ_d is defined in (3.9) in terms of q-th moments of the solution X^x of the SDE (3.3), maximized over x ∈ {0, e_1, . . ., e_d}, where q > 2 is independent of d.
Proof. Based on Dynkin's formula of Lemma 3.2, we will construct a tanh neural network, denoted by u_{M,N} for some M, N ∈ N, and we will prove that the PINN residual (2.8) of u_{M,N} is small. To do so, we need to define intermediate approximations ū_N and ũ_{M,N}. In this proof, C > 0 will denote a constant that will be updated throughout and can only depend on d, D, μ, T, ϕ and L, i.e., not on M nor N. In particular, the dependence of C on the input dimension d will be of interest. We will argue that the final value of C depends polynomially on d and ρ_d (3.9). Because of the third point of Lemma A.5, the quantity within the maximum in the definition of ρ_d (3.9) is finite for every individual x ∈ D, and hence the maximum of this quantity over x ∈ {0, e_1, . . ., e_d} is finite as well. As a result of the fourth point of Lemma A.5, it then follows that ρ_d < ∞. Moreover, if ρ_d depends polynomially on d, then so will C. For notational simplicity, we will not explicitly keep track of the dependence of C on d and ρ_d.
Next, we observe that the bound (3.10) holds, so that its left-hand side also grows at most polynomially in d and ρ_d.
Finally, we will denote by u_d the solution of (2.1) on the d-dimensional domain D_d, and to simplify notation we will write u := u_d and D := D_d.
Step 1: from u to ū_N. In the first step, we approximate the temporal integral in (3.5) by a Riemann sum that can be readily approximated by neural networks. To this end, let h : R → R be defined by h(x) = max{0, min{x, 1}}. Then we define for N ∈ N,

ū_N(x, t) := ϕ(x) + (T/N) Σ_{n=0}^{N−1} h(Nt/T − n) E[(Fϕ)(X_{nT/N}^x)],

in which the clamp h renders the Riemann sum continuous and piecewise differentiable in t.
We first define n_0(t) = ⌊Nt/T⌋ and bound the error of this Riemann sum for every t. Next, we make the observation that there exist constants a_i, b_i, c_{ij} (that only depend on the coefficients of μ and σ) and functions Λ_i, Ψ_i and Φ_{ij} (that depend linearly on ϕ and its derivatives) in terms of which the relevant residuals can be expanded for any d-dimensional stochastic process Z^x. If we define x to be a random variable that is uniformly distributed on D, we can use the Lipschitz continuity of the Λ_i and the temporal regularity of X^x (property (3) of Lemma A.5 with λ ← x) to bound the corresponding error terms.
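The role of the clamp h in Step 1 can be seen in a few lines. The following toy sketch (our own illustration; the exact placement of the weights h(Nt/T − n) in the proof may differ) approximates the running integral ∫_0^t g(s) ds by an h-weighted Riemann sum whose value depends continuously on t:

```python
def h(x):
    # The clamp h(x) = max{0, min{x, 1}} from Step 1.
    return max(0.0, min(x, 1.0))

def smoothed_riemann(g, t, T=1.0, N=64):
    # Approximation of int_0^t g(s) ds by (T/N) * sum_n h(N*t/T - n) * g(n*T/N):
    # past cells get full weight 1, the current cell ramps up linearly in t, future
    # cells get weight 0, so the sum is continuous and piecewise differentiable in t.
    return (T / N) * sum(h(N * t / T - n) * g(n * T / N) for n in range(N))

val_const = smoothed_riemann(lambda s: 1.0, 0.37)  # exact for constant integrands
val_lin = smoothed_riemann(lambda s: s, 0.5)       # error O(T/N) for Lipschitz g
```

For a constant integrand the weighted sum reproduces the integral exactly for every t, and for Lipschitz integrands the error is of order T/N, which is the kind of Riemann-sum bound used in this step.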
Similarly, using Lemma A.5 and the generalized Hölder inequality with q > 0, together with the boundedness of h, one obtains analogous bounds for the remaining terms. As a result, we find the bound (3.18), and in a similar fashion one also finds (3.19). To obtain the latter, one can use the moment bound (3.20), which holds for all x ∈ R^d and t ∈ [0, T], see Lemma A.5. Using this, and writing the relevant derivative terms in the directions e_1, . . ., e_d with indices 1 ≤ k_1, . . ., k_r ≤ d (with r independent of d), where F is a linear combination of ϕ and its partial derivatives and G is a product of μ and σ and their derivatives, one can obtain (3.19). Moreover, very similar yet tedious computations yield the bound (3.21) on u − ū_N.

Step 2: from ū_N to ũ_{M,N}. We continue the proof by constructing a Monte Carlo approximation of ū_N. For this purpose, we randomly draw ω_m ∈ Ω for all m ∈ N and define for every M, N ∈ N the random variable ũ_{M,N}, in which the expectation in ū_N is replaced by an average over the M samples. Using the same arguments as in the proofs of (3.18) and (3.19), we find corresponding bounds for all (x, t) ∈ D × [0, T] and q ∈ {t, x_1, . . ., x_d}. Invoking Lemma A.2, we obtain a probabilistic bound on the Monte Carlo error, and one can prove an analogous bound using the same arguments as in the proof of (3.19). Using again Lemma A.2 and Lemma A.5, in combination with our previous results, we find that there is a constant C_0 > 0 independent of M (and with the same properties as C in terms of dependence on d) bounding these errors, and therefore, by Lemma A.3, the corresponding event has non-zero probability. The fact that this event has non-zero probability implies the existence of some fixed ω_1, . . ., ω_M ∈ Ω such that the bound (3.29) holds for the resulting deterministic function ũ_{M,N}.

Step 3: from ũ_{M,N} to u_{M,N}. For every ε > 0 and N = N(ε) ∈ N, let h_ε be a tanh neural network approximating h to accuracy at most η.
If we now in (3.28) replace ϕ and Fϕ by ϕ_ξ and (F̂ϕ)_δ as in (3.7), h by h_ε and the multiplication × by an approximate network multiplication ×_η, then we end up with the tanh neural network u_{M,N}.
A sketch of this network can be found in Figure 1. In what follows, we will write ∂_1 for the partial derivative with respect to the first component, and we introduce the shorthand notation (3.33). Using (3.31), and the fact that u_{M,N} depends affinely on its building blocks, we find the bounds (3.34) and (3.35), which hold for every m and n.
For the other term, we calculate using (3.7), (3.30) and (3.31) the bound (3.36). Thus, we obtain a bound on ũ_{M,N} − u_{M,N}. Finally, we obtain a bound on L[ũ_{M,N}] − L[u_{M,N}] in the L²-norm. We simplify the notation again and start by calculating (3.39). Explicitly working out this formula is straightforward but tedious, and we omit the calculations for the sake of brevity. From this, together with repeated use of the triangle inequality and (3.29), we find the bound (3.40). Moreover, using similar tools as above, we also obtain a corresponding bound on the remaining term.

Step 4: Total error bound.
We first determine the complexity of the network size in terms of ε. The network consists of multiple sub-networks, as illustrated in Figure 1. The first part constructs M · N copies of (F̂ϕ)_δ, leading to a subnetwork with O(MN δ^{−β}) = O(ε^{−2−p−β}) neurons. Next, we need N copies of h_ε. From Lemma A.6 it follows that each copy requires a subnetwork with two hidden layers whose width grows at most algebraically in N, for any γ > 0; the N copies together lead to a correspondingly larger total width. We assume that the subnetworks approximating the identity function have a size that is negligible compared to the network sizes of the other parts [5]. Combining these observations with the fact that C depends polynomially on d and ρ_d, we find that there exists a constant λ > 0 such that the number of neurons of the network is bounded by O((dρ_d)^λ ε^{−max{5p+3, 2+p+β}}).
Remark 3.4. For the Black-Scholes equation (2.3), the initial condition is to be interpreted as a payoff function. Note that any mollified version of the payoff functions mentioned in Section 2.1 satisfies the regularity requirements of Theorem 3.3. Moreover, because of their compositional structure, these payoff functions and their derivatives can be approximated without the curse of dimensionality. Hence, the assumption (3.7) is satisfied as well.
Theorem 3.3 reveals that the size of the constructed tanh neural network, approximating the underlying solution u of the linear Kolmogorov equation (2.1) and whose PINN residual is as small as desired (3.8), grows with increasing accuracy, but at a rate that is independent of the underlying dimension d. Thus, it appears that this neural network overcomes the curse of dimensionality in this sense.
However, Theorem 3.3 also reveals that the overall network size grows polynomially in ρ_d. If this constant grows exponentially with the dimension, the overall network size will still be subject to the curse of dimensionality. Given this issue, we prove that, at least for a subclass of Kolmogorov PDEs (2.1), ρ_d only grows polynomially in d. This is for example the case when the coefficients μ and σ are both constant functions.

Theorem 3.5. Assume the setting of Theorem 3.3 and assume that μ and σ are both constant. Then there exists a constant λ > 0 such that for every ε > 0 and d ∈ N, there exists a tanh neural network Ψ_{ε,d} with O(d^λ ε^{−max{5p+3, 2+p+β}}) neurons and weights that grow as O(d^λ ε^{−max{ζ, 8p+6}}) for small ε and large d, such that the error bound (3.43) holds.

Proof. We show that when μ and σ are both constant functions, the constant ρ_d, as defined in (3.9), grows only polynomially in d. It is well known that in this setting the solution process of the SDE (3.3) is given by X_t^x = x + μt + σB_t, where (B_t)_{t∈[0,T]} is a d-dimensional Brownian motion. The fact that ρ_d only grows polynomially in d then follows directly from Lévy's modulus of continuity (Lemma A.4). The statement is then a direct consequence of Theorem 3.3.
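In the constant-coefficient case the representation X_t^x = x + μt + σB_t can be sampled exactly, with no time discretization. The following sketch (our own illustration) does this for the one-dimensional heat equation u_t = u_xx, i.e., μ = 0 and σ = √2, so that X_t^x is Gaussian with mean x and variance 2t:

```python
import math
import random

def u_heat_mc(x, t, phi, n_samples=100_000, seed=0):
    # For constant mu = 0 and sigma = sqrt(2), the SDE solution is X_t^x = x + sigma*B_t,
    # so X_t^x ~ Normal(x, 2t) can be sampled exactly; u(x,t) = E[phi(X_t^x)] then
    # solves the heat equation u_t = u_xx with initial condition phi.
    rng = random.Random(seed)
    s = math.sqrt(2.0 * t)
    return sum(phi(x + s * rng.gauss(0.0, 1.0)) for _ in range(n_samples)) / n_samples

# For phi(y) = y^2, the exact solution is u(x, t) = x^2 + 2t.
approx = u_heat_mc(0.5, 0.3, lambda y: y * y)
exact = 0.5**2 + 2 * 0.3   # = 0.85
```

The absence of an Euler discretization is precisely what makes the constant-coefficient case benign: the only error is the Monte Carlo error, whose sample complexity is dimension-robust.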
Remark 3.6. We did not specify the boundary conditions explicitly in either Theorem 3.3 or Theorem 3.5. The reason lies in the fact that Dynkin's formula (Lemma 3.2) holds with R^d as domain. Therefore, we implicitly use the trace of the true solution u_d at the boundary of D_d as the Dirichlet boundary condition. A similar approach has been used in e.g. [8,12], where the Feynman-Kac formula is used to construct a neural network approximation of the solution to Kolmogorov PDEs. This assumption is quite reasonable, as Black-Scholes type PDEs (2.3) are specified in the whole space. In practice, one needs to impose some artificial boundary conditions, for instance by truncating the domain. To this end, one can use explicit formulas such as the Feynman-Kac or Dynkin formula to (approximately) specify the boundary condition, see [39] for examples. Another possibility is to consider periodic boundary conditions, as Dynkin's formula also holds in this case. Thus, we have been able to answer question Q1 by showing that there exists a neural network for which the PINN residual (generalization error) (1.3) is as small as desired. In this process, we have also answered Q2 for this particular tanh neural network, as the bound (3.43) clearly shows that the overall error (in the L²-norm and even the H¹-norm) of the tanh neural network Ψ_{ε,d} is arbitrarily small.
Although in this particular case an affirmative answer to question Q2 was a by-product of the proof for question Q1, it turns out that one can follow the recent paper [27] and leverage the stability of Kolmogorov PDEs to answer question Q2 in much more generality, by showing that as long as the generalization error is small, the overall error is proportionately small. We have the following precise statement of this fact.

Theorem 3.7. Let u be a (classical) solution to a linear Kolmogorov equation (2.1) with μ ∈ C¹(D; R^d) and σ ∈ C²(D; R^{d×d}), let u_θ be a PINN and let the residuals be defined by (2.8). Then the total error can be bounded as,

(3.46) ∫_0^T ∫_D |u(x, t) − u_θ(x, t)|² dx dt ≤ C_1 (‖R_t[u_θ]‖²_{L²(D)} + ‖R_i[u_θ]‖²_{L²(D×[0,T])} + C_2 ‖R_s[u_θ]‖_{L²(∂D×[0,T])}) e^{C_3 T},

where all integrals are to be interpreted as integrals with respect to the Lebesgue measure on D, resp. ∂D.

Proof. Writing û := u_θ − u and using that u is a classical solution, one obtains an evolution identity (3.47) for ∫_D û² dx with three terms on the right-hand side. For the first term of (3.47), we observe that Trace(σσ^T H_x û) can be expanded in terms of derivatives of the entries of σσ^T and of û, with indices 1 ≤ i, j, k ≤ d. Next, we define the auxiliary quantity (3.49).
From this, using integration by parts and letting n denote the unit normal on ∂D, we bound the first term of (3.47). We then bound the second and third terms of the right-hand side of (3.47) in a similar way. Integrating (3.47) over the interval [0, τ] ⊂ [0, T], and using all the previous inequalities together with Hölder's inequality, we arrive at an integral inequality for ∫_D û(x, τ)² dx. Using Grönwall's inequality and integrating over [0, T] then gives (3.54). Renaming the constants yields the statement of the theorem.
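Schematically, with all constants absorbed into C_1, C_2, C_3 (our own summary of the energy argument, not the exact intermediate inequalities of the proof), the two key steps read:

```latex
\begin{aligned}
\frac{d}{dt}\int_D \hat u^2 \,dx
  &\le C_1 \int_D \hat u^2 \,dx
   + \int_D R_i[u_\theta]^2 \,dx
   + C_2 \Big(\int_{\partial D} R_s[u_\theta]^2 \,ds(y)\Big)^{1/2},
  \qquad \hat u := u_\theta - u,\\
\int_D \hat u(x,\tau)^2 \,dx
  &\le \Big(\int_D R_t[u_\theta]^2 \,dx
   + \int_0^T\!\!\int_D R_i[u_\theta]^2 \,dx\,dt
   + C_2 \,\| R_s[u_\theta] \|_{L^2(\partial D \times [0,T])}\Big)\, e^{C_1 \tau},
\end{aligned}
```

where the first line collects the interior, boundary and residual contributions, and the second follows from Grönwall's inequality with the initial mismatch ∫_D û(x, 0)² dx = ‖R_t[u_θ]‖²_{L²(D)}; integrating the second line over τ ∈ [0, T] gives a bound of the form (3.46).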
Thus, the bound (3.46) clearly shows that controlling the generalization error (1.3) suffices to control the L²-error of the PINN approximating the Kolmogorov equation (2.1). In particular, combining Theorem 3.7 with Theorem 3.3 proves that it is possible to approximate solutions of linear Kolmogorov equations in the L²-norm at a rate that is independent of the spatial dimension d.
It is easy to observe that C_1, C_3 ~ O(1), as they only depend on the coefficients of the Kolmogorov PDE. On the other hand, the constant C_2 depends on the PINN approximation u_θ and needs to be evaluated for each individual approximation. For instance, for the PINN Ψ_{ε,d} constructed in Theorem 3.3, it is straightforward to observe from the arguments presented in the proof of Theorem 3.3 that C_2 ~ O(ε).

4. Generalization error of PINNs
Having answered questions Q1 and Q2 on the smallness of the PINN residual (generalization error (1.3)) and the total error for PINNs approximating the Kolmogorov PDEs (2.1), we turn our attention to question Q3: given a small training error (2.9) and sufficiently many training samples S_{i,s,t}, can one show that the generalization error (1.3) (and consequently, by Theorem 3.7, the total error) is proportionately small? To this end, we start with the observation that the PINN residual, as well as the training error (2.9), has three parts: two data terms, corresponding to the mismatches with the initial and boundary data, and a residual term that measures the amplitude of the PDE residual. Thus, we can embed these two types of terms in the following very general set-up: let D ⊂ R^d be compact and let f : D → R and f_θ : D → R be functions for all θ ∈ Θ. We can think of f as the ground truth for the initial or boundary data of the PDE (2.1) and f_θ as the corresponding restriction of the approximating PINN to the spatial or temporal boundaries. Similarly, we can think of f ≡ 0 as the PDE residual corresponding to the exact solution of (2.1), and f_θ as the interior PINN residual (first term in (2.8)) for a neural network with weights θ. Let M ∈ N be the training set size and let S = {z_1, . . ., z_M} ⊂ D^M be the training set, where each z_i is independently drawn according to some probability measure μ on D.
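The gap between the two error notions in this set-up can be observed numerically. The following toy sketch (our own illustration, with assumed choices f(z) = sin(2πz) and f_θ ≡ 0 on D = [0, 1] with μ uniform) compares the exact generalization error with its empirical counterpart over M random training points:

```python
import math
import random

def gap_demo(M, seed=0):
    # Ground truth f(z) = sin(2*pi*z) and a (bad) model f_theta = 0 on D = [0, 1]:
    # E_G^2 = int_0^1 sin(2*pi*z)^2 dz = 1/2 exactly, while E_T^2 is its Monte Carlo
    # estimate over M i.i.d. uniform training points z_1, ..., z_M.
    rng = random.Random(seed)
    E_T2 = sum(math.sin(2 * math.pi * rng.random())**2 for _ in range(M)) / M
    return abs(E_T2 - 0.5)   # the "generalization gap" |E_G^2 - E_T^2|

gap_small = gap_demo(40_000)   # decays like O(M^{-1/2}) by Hoeffding's inequality
```

For a single fixed θ this decay is exactly Hoeffding's inequality; the content of the results below is that it can be made uniform over the whole (compact) parameter set Θ via a covering argument, which is where the number of network parameters enters the required sample size.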
We define the (squared) training error, generalization error and empirical risk minimizer as

E_T(θ, S)² := (1/M) Σ_{i=1}^M |f(z_i) − f_θ(z_i)|², E_G(θ)² := ∫_D |f(z) − f_θ(z)|² dμ(z), θ*(S) ∈ argmin_{θ∈Θ} E_T(θ, S)², (4.1)

where we restrict ourselves to the (squared) L^2-norm only for definiteness, while noting that all the subsequent results readily extend to general L^p-norms for 1 ≤ p < ∞. It is easy to see that the above set-up encompasses all the terms in the definitions of the generalization error (1.3) and training error (2.9) for PINNs. Our first aim is to decompose this very general form of the generalization error (4.1) as

E_G(θ*(S))² ≤ E_T(θ*(S), S)² + max_{1≤i≤N} (E_G(θ_i)² − E_T(θ_i, S)²) + |E_G(θ*(S))² − E_G(θ_{i*})²| + |E_T(θ_{i*}, S)² − E_T(θ*(S), S)²|. (4.2)

Proof. Since Θ is compact, for every δ > 0 there exist a natural number N = N(δ) ∈ N and parameters θ_1, ..., θ_N ∈ Θ such that for all θ ∈ Θ there exists 1 ≤ i ≤ N with ‖θ − θ_i‖_∞ ≤ δ. This error decomposition holds in particular for i* = i*(θ*) ∈ argmin_i ‖θ* − θ_i‖_∞. Using that ‖θ* − θ_{i*}‖_∞ ≤ δ and then majorizing gives the bound from the statement.
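The covering argument in the proof can be made tangible: for Θ = [−a, a]^k, an axis-aligned grid with spacing 2δ is a δ-cover in the supremum norm, of size at most (a/δ + 1)^k. The following sketch, our own illustration rather than code from the paper, builds such a cover and checks its covering radius empirically.

```python
import math
import random

def cover_axis(a, delta):
    # 1-D grid with spacing 2*delta; every point of [-a, a] lies
    # within delta of some grid point
    n = math.ceil(a / delta)
    return [-a + delta + 2.0 * delta * j for j in range(n)]

def covering_check(a=1.0, delta=0.1, k=3, trials=1000, seed=1):
    axis = cover_axis(a, delta)
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        theta = [rng.uniform(-a, a) for _ in range(k)]
        # sup-norm distance to the nearest cover point (coordinate-wise)
        dist = max(min(abs(t - p) for p in axis) for t in theta)
        worst = max(worst, dist)
    return worst, len(axis) ** k

worst, N = covering_check()
print(worst, N)
```

The cover size N grows exponentially in k, which is why the sample-size bounds below pick up a factor k only inside a logarithm.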
Note that we have leveraged the compactness of the parameter space Θ in (4.2) to decompose the generalization error in terms of the training error E_T(θ*(S), S), the so-called generalization gap, i.e., sup_{θ∈Θ} (E_G(θ)² − E_T(θ, S)²), and error terms that measure the modulus of continuity of the generalization and training errors. From this decomposition, we can intuitively see that the latter error terms can be made suitably small by requiring that the generalization and training errors be, for instance, Lipschitz continuous. Then, we can use standard concentration inequalities to obtain the following very general bound on the generalization error in terms of the training error.

Theorem 4.2. Let Θ ⊂ [−a, a]^k and let θ ↦ E_G(θ)² and θ ↦ E_T(θ, S)², for all θ ∈ Θ and S ∈ D^M, be Lipschitz continuous with Lipschitz constant L. For every ε, η > 0, the bound (4.4) holds: the generalization error exceeds the training error by at most ε with probability at least 1 − η, provided the number of training samples M is chosen large enough.

Proof. For arbitrary ε > 0, set δ = ε²/(2L) and let {θ_i}_{i=1}^N be a δ-covering of Θ with respect to the supremum norm; N can be bounded by (2aL/ε²)^k. Next, we define a projection P : Θ → Θ that maps θ to a unique θ_{i*} with i* ∈ argmin_i ‖θ − θ_i‖_∞, and we define, for 1 ≤ i ≤ N, the events on which the generalization gap at θ_i exceeds its tolerance. A concentration inequality, a union bound over the N covering points and the decomposition (4.2) then yield the claim. Using (the last step of the proof of) Theorem 4.2 with η = ε²/(2c) then gives (4.12).

As a first example illustrating the bounds of Theorem 4.2 (and Corollary 4.3), we apply them to the estimation of the generalization errors corresponding to the spatial and temporal boundaries, in terms of the corresponding training errors (2.9). These bounds readily follow from the following general bound.
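A quick numerical illustration of why the concentration step works (a toy of our own, not the paper's construction): for a fixed finite set of parameters, playing the role of the δ-cover {θ_i}, the worst-case gap between the empirical and exact squared errors shrinks as the number of collocation points M grows, as concentration of measure predicts.

```python
import math
import random

def f(x):
    return math.sin(math.pi * x)

def f_theta(x, t):
    # hypothetical one-parameter model family
    return t * x * (1.0 - x)

def gen_sq(t, n=20_000):
    # exact squared generalization error via a fine midpoint rule
    h = 1.0 / n
    return sum((f((i + 0.5) * h) - f_theta((i + 0.5) * h, t)) ** 2
               for i in range(n)) * h

THETAS = [0.5 * i for i in range(9)]   # crude cover of Theta = [0, 4]
GEN = {t: gen_sq(t) for t in THETAS}

def sup_gap(M, seed=7):
    # sup over the cover of |training error^2 - generalization error^2|
    rng = random.Random(seed)
    pts = [rng.random() for _ in range(M)]
    return max(abs(sum((f(z) - f_theta(z, t)) ** 2 for z in pts) / M
                   - GEN[t]) for t in THETAS)

gap_small, gap_large = sup_gap(100), sup_gap(10_000)
print(gap_small, gap_large)
```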
Corollary 4.4. Let L, W ∈ N, R ≥ 1, L ≥ 2 and let f_θ : D → R, θ ∈ Θ, be tanh neural networks with at most L − 1 hidden layers, width at most W, and weights and biases bounded by R. For every 0 < ε < 1, a bound of the form (4.4) holds for the generalization and training errors (4.1).

Proof. This follows from the inverse triangle inequality and the Lipschitz continuity of tanh neural networks in their parameters.

Next, we will apply the above general results to PINNs for the Kolmogorov equation (2.1). The following corollary provides an estimate on the (cumulative) PINN generalization error and can be seen as the counterpart of Corollary 4.4. It is based on the fact that neural networks and their derivatives are Lipschitz continuous in the parameter vector, the proof of which can be found in Appendix B. Consequently, the PINN generalization error is Lipschitz as well (cf. Lemma C.3).
Corollary 4.5. Let u_θ, θ ∈ Θ, be tanh neural networks with smooth activation function σ, at most L − 1 hidden layers, width at most W, and weights and biases bounded by R. For q = i, t, s, let E_G^q and E_T^q denote the PINN generalization and training errors for linear Kolmogorov PDEs (cf. Section 2.1) and let c_q > 0 be the corresponding constants provided by Lemma C.3. Then for any ε > 0 the bound (4.17) holds.

Remark 4.6. Corollary 4.5 requires the bounds c_q related to the training errors E_T^q and the generalization errors E_G^q of the PINN. Lemma C.3 provides such bounds, given by c_i = 4αC(d + 7)L²R^{3L}W^{3L−3}β^L and c_t = c_s = 2WR. Although the values for c_t and c_s are of reasonable size, the value for c_i is likely to be a large overestimate. It might therefore make sense to consider an empirical approximation based on some randomly sampled θ_n ∈ Θ and x_m ∈ D.
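The constants of Remark 4.6 are explicit and can be tabulated directly. The snippet below is a straightforward transcription of the quoted formulas; the specific values of α, β and C are placeholders of our own. It shows the point made in the remark: c_i scales only linearly in d but is exponentially large in the depth L, whereas c_t = c_s stays modest.

```python
def c_interior(alpha, C, d, L, R, W, beta):
    # c_i = 4*alpha*C*(d+7)*L^2*R^(3L)*W^(3L-3)*beta^L  (Remark 4.6)
    return (4 * alpha * C * (d + 7) * L ** 2
            * R ** (3 * L) * W ** (3 * L - 3) * beta ** L)

def c_boundary(W, R):
    # c_t = c_s = 2*W*R  (Remark 4.6)
    return 2 * W * R

# placeholder parameter values, for illustration only
alpha, C, beta, L, R, W = 1.0, 1.0, 1.0, 3, 2.0, 20
print(c_boundary(W, R))
print(c_interior(alpha, C, 10, L, R, W, beta))
```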
Combining Corollary 4.5 with Theorem 3.7 allows us to bound the L^2-error of the PINN in terms of the (cumulative) training error and the training set size. The following corollary proves that a well-trained PINN on average has a low L^2-error, provided that the training set is large enough. It is also possible to prove a similar probabilistic statement instead of a statement that holds on average.
Proof. This is a direct consequence of Corollary 4.5 and the proof of Theorem 3.7 (in particular, one needs to take the expectation over all training sets S before applying Hölder's inequality in the proof of Theorem 3.7).
Remark 4.8. If we assume that the optimization algorithm used to minimize the training loss finds a global minimum, then one can prove that the cumulative training errors in (4.20) are small if the training set is large enough. To see this, first observe that for the network Ψ_{ε,d} that was constructed in Theorem 3.3, the terms of the training loss are all of order O(ε). Since Ψ_{ε,d} is not correlated with the training data S, one can use a Monte Carlo argument to conclude, for any ε > 0, that the training loss of Ψ_{ε,d} is itself of order O(ε). If the optimization algorithm reaches a global minimum, the training loss of u_{θ*(S)} is bounded by that of Ψ_{ε,d}; therefore it also holds that E_T^q = O(ε). Thus, in Corollaries 4.5 and 4.7, we have answered question Q3 by proving that a small training error and a sufficiently large number of samples, as chosen in (4.17), suffice to ensure a small generalization error (and total error). Moreover, the number of samples depends only polynomially on the dimension, thereby overcoming the curse of dimensionality.

Discussion
Physics-informed neural networks (PINNs) are widely used in approximating both forward and inverse problems for PDEs. However, there is a paucity of rigorous theoretical results on PINNs that can explain their excellent empirical performance. In particular, one wishes to answer the questions Q1 (on the smallness of PINN residuals), Q2 (smallness of the total error) and Q3 (smallness of the generalization error given a small training error) in order to provide rigorous guarantees for PINNs.
In this article, we aimed to address these theoretical questions rigorously. We did so within the context of the Kolmogorov equations, which are linear parabolic PDEs of the general form (2.1). The heat equation as well as the Black-Scholes equation of option pricing are prototypical examples of these PDEs. Moreover, these PDEs can be set in very high-dimensional spatial domains. Thus, in addition to providing rigorous bounds on the PINN generalization error and total error, we also aimed to investigate whether PINNs can overcome the curse of dimensionality in this context.
To this end, we answered question Q1 in Theorem 3.3, where we constructed a PINN (see Figure 1) for which the PINN residual (generalization error) can be made as small as desired. Our construction relied on emulating Dynkin's formula (3.5). Under suitable assumptions on the initial data as well as on the underlying stochastic process (cf. (3.9) and Theorem 3.5), we were also able to prove that the size of the constructed network grows only polynomially in the input spatial dimension. Thus, we showed that this PINN overcomes the curse of dimensionality while attaining as small a residual as desired.
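The probabilistic representation that the construction emulates is easy to test in a special case. For the heat equation u_t = Δu, i.e. the Kolmogorov equation (2.1) with zero drift and constant diffusion, Dynkin's formula reduces to u(x, t) = E[φ(x + √(2t) Z)] with Z standard normal; for φ(x) = |x|² this expectation is |x|² + 2td in closed form. The sketch below, our own toy check rather than the paper's network construction, verifies this by Monte Carlo and works in any dimension d.

```python
import math
import random

def heat_mc(phi, x, t, n=20_000, seed=0):
    # Monte Carlo evaluation of u(x, t) = E[phi(x + sqrt(2 t) Z)],
    # Z ~ N(0, I_d): the solution of u_t = Delta u with u(., 0) = phi
    rng = random.Random(seed)
    s = math.sqrt(2.0 * t)
    total = 0.0
    for _ in range(n):
        total += phi([xi + s * rng.gauss(0.0, 1.0) for xi in x])
    return total / n

phi = lambda y: sum(v * v for v in y)            # phi(x) = |x|^2
x, t = [1.0, -0.5, 2.0], 0.3
exact = sum(v * v for v in x) + 2.0 * t * len(x)  # closed form |x|^2 + 2td
approx = heat_mc(phi, x, t)
print(exact, approx)
```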
Next, we answered question Q2 in Theorem 3.7 by leveraging the stability of Kolmogorov PDEs to bound the total error (in L^2) for PINNs in terms of the underlying generalization error.
Finally, question Q3, which required bounding the generalization error in terms of the training error, was answered in Corollary 4.5 by using an error decomposition, the Lipschitz continuity of the underlying generalization and training error maps, and concentration inequalities: we derived a bound on the generalization error in terms of the training error, valid for sufficiently many randomly chosen training samples (4.17). Moreover, the number of training samples grows only polynomially in the dimension, alleviating the curse of dimensionality in this regard.
Although we do not present numerical experiments in this paper, we point the readers to [39] and the forthcoming paper [30], where a large number of numerical experiments for PINNs approximating both forward and inverse problems for Kolmogorov-type and related equations are presented. In particular, these experiments reveal that PINNs overcome the curse of dimensionality in this context. These findings are consistent with our theoretical results.
At this stage, it is instructive to contrast our results with related works. As mentioned in the introduction, there are very few papers in which PINNs are rigorously analyzed. Comparing with [37], we highlight the fact that the authors of [37] used a bespoke Hölder-type regularization term that penalized the gradients in their loss function. In practice, one trains PINNs in the L^2 (or L^1) setting, and it is unclear how relevant the assumptions of [37] are in this context. We, on the other hand, use the natural training paradigm for PINNs and prove rigorously that the overall errors can be made small. Comparing with [27], we observe that the authors of [27] only address questions Q2 and (partially) Q3, albeit in a very general setting. It is not proved in [27] that the total error can be made small; we do so here. Moreover, we also provide the first bounds for PINNs in which the curse of dimensionality is alleviated.
It is an appropriate juncture to compare our results with the large number of articles demonstrating the alleviation of the curse of dimensionality for neural networks approximating Kolmogorov-type PDEs; see [8, 3] and references therein. We would like to point out that these articles consider the supervised learning paradigm, where (possibly large amounts of) data need to be provided to train the neural network to approximate solutions of PDEs. This data has to be generated either by expensive numerical simulations or through representation formulas such as the Feynman-Kac formula, which requires solving the underlying SDEs. In contrast, we recall that PINNs do not require any data in the interior of the domain and are thus very different in design and conception from supervised learning frameworks.
We would also like to highlight some limitations of our analysis. We showed in Theorem 3.3 that the network size needed for approximating solutions of general Kolmogorov equations (2.1) depends on the rate of growth of the quantity ρ_d, defined in (3.9). We were also able to prove, in Theorem 3.5, that ρ_d grows only polynomially (in dimension) for a subclass of Kolmogorov PDEs; extending these results to general Kolmogorov PDEs is an open question. Moreover, it is worth repeating (see Remark 4.6) that the constants in our estimates are clearly not optimal and might be significant overestimates; see [27] for a discussion of this issue.
Finally, we point out that although we focussed on the large and important class of Kolmogorov PDEs in this paper, the methods that we developed will be very useful in the analysis of PINNs approximating other PDEs. In particular, the use of the smoothness of the underlying PDE solutions and of their approximation by tanh neural networks (as in [5]) to build PINNs with small PDE residuals can be applied to a variety of linear and non-linear PDEs. Similarly, the error decomposition (4.2) and Theorem 4.2 (Corollary 4.3) are very general and can be used in many different contexts to bound the PINN generalization error by the training error for sufficiently many random training points. We plan to apply these techniques to a comprehensive error analysis of PINNs approximating forward as well as inverse problems for PDEs in forthcoming papers.
Lemma A.4 (Lévy's modulus of continuity). For a Brownian motion (B_t)_{t∈[0,1]}, it holds almost surely that

limsup_{h→0⁺} sup_{0≤t≤1−h} |B_{t+h} − B_t| / √(2h log(1/h)) = 1.

Proof. This result is due to [20] and can be found in most probability theory textbooks.
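A discrete sanity check of the lemma (ours, purely illustrative): at lag h = 1/n, the largest increment of a simulated Brownian path should be close to √(2h log(1/h)). Convergence in the lemma is slow, so at feasible n the ratio is only roughly 1.

```python
import math
import random

def levy_ratio(n=1 << 16, seed=42):
    # ratio of the largest lag-1/n Brownian increment to Levy's
    # modulus sqrt(2 h log(1/h)); the increments are i.i.d. N(0, h)
    rng = random.Random(seed)
    h = 1.0 / n
    biggest = max(abs(rng.gauss(0.0, math.sqrt(h))) for _ in range(n))
    return biggest / math.sqrt(2.0 * h * math.log(1.0 / h))

print(levy_ratio())
```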
Lemma A.6. Let h : R → R : x ↦ min{1, max{0, x}}. For every N ≥ 2 and ε, γ > 0 there exists a tanh neural network ĥ with two hidden layers and O(N) neurons, with controlled weight growth, such that h and ĥ (and their derivatives) are close in the sense made precise below.

Proof. We first approximate h with a twice continuously differentiable function h̄. It is easy to prove that ‖h − h̄‖_∞ = O(ε²). Next, we calculate the derivative of h̄; a straightforward calculation leads to the bound ‖h′ − h̄′‖_∞ = O(ε). Finally, one can easily check that h̄′′ is continuous. An application of [5, Theorem 5.1] to h̄ gives, for every γ > 0 and N large enough, the existence of a tanh neural network ĥ_N with two hidden layers and O(N) neurons that approximates h̄. Because of the nature of the construction of ĥ_N, the monotone behaviour of the hyperbolic tangent towards infinity and the fact that h̄ is constant outside [−1, 2], the stronger approximation result over all of R holds automatically as well. Moreover, [5, Theorem 5.1] quantifies how the weights of ĥ_N grow. The statement then follows from the triangle inequality.
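To illustrate the first step of the proof, smoothing the clipping function, here is a simple C¹ regularization with quadratic joins at the two kinks. This is our own variant for illustration; the lemma's actual construction is different and achieves the better O(ε²) rate in the sup norm, whereas this one has sup-norm error exactly ε/4.

```python
def h(x):
    # the clipping function h(x) = min(1, max(0, x))
    return min(1.0, max(0.0, x))

def h_smooth(x, eps):
    # C^1 smoothing with quadratic joins of width 2*eps at the kinks;
    # the sup-norm error is eps/4, attained at x = 0 and x = 1
    if x <= -eps:
        return 0.0
    if x < eps:
        return (x + eps) ** 2 / (4.0 * eps)
    if x <= 1.0 - eps:
        return x
    if x < 1.0 + eps:
        return 1.0 - (1.0 + eps - x) ** 2 / (4.0 * eps)
    return 1.0

eps = 0.05
grid = [i / 10_000 * 4.0 - 2.0 for i in range(10_001)]   # covers [-2, 2]
sup_err = max(abs(h(x) - h_smooth(x, eps)) for x in grid)
print(sup_err)
```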

Appendix B. Lipschitz continuity in the parameter vector of a neural network and its derivatives
In this section we will prove that, for any x ∈ D, a neural network and its corresponding Jacobian and Hessian matrices are Lipschitz continuous in the parameter vector. This property is of crucial importance for finding bounds on the generalization error of physics-informed neural networks, cf. Section 4. We first introduce some notation and then state our results. The main results of this section are Lemma B.3 and Lemma B.5.
Let σ : R → R be an (at least) twice continuously differentiable activation function, such as tanh or the sigmoid. For any n ∈ N and x ∈ R^n, we write σ(x) := (σ(x_1), ..., σ(x_n)). We use the definition of a neural network as in Definition 2.1.
Recall that for a differentiable function f : R^n → R^m the Jacobian matrix J[f] is defined by (J[f])_{ij} = ∂f_i/∂x_j. For our purposes, we make the following convention: for any 1 ≤ k ≤ L, we define the corresponding layer-wise Jacobians J_{θ_k}. Similarly, for a twice differentiable function g : R^n → R the Hessian matrix is defined by (H[g])_{ij} = ∂²g/(∂x_i ∂x_j). Slightly abusing notation, we generalize this to vector-valued functions g : R^n → R^m, where we identify R^{1×n×n} with R^{n×n} to make the definitions consistent. Similarly, if v ∈ R^{1×m}, then v ∘ H[g] should be interpreted as the corresponding contraction along the first index. For any 1 ≤ k < L, we write the analogous layer-wise Hessians H_{θ_k}. Finally, we will use the notation J_θ := J[Ψ_θ] and H_θ := H[Ψ_θ]. The following lemma presents a generalized version of the chain rule.
Lemma B.1. Let f : R^n → R^m and g : R^m → R be twice differentiable. Then it holds that

H[g ∘ f](x) = J[f](x)ᵀ H[g](f(x)) J[f](x) + Σ_{i=1}^m ∂_i g(f(x)) H[f_i](x).

We now apply this formula to find an expression for H_θ in terms of J_{θ_k} and H_{θ_k}.
Lemma B.2. It holds that J_θ and H_θ admit the layer-wise representations stated below. Proof. The first statement is just the chain rule for calculating the derivative of a composite function.
We prove the second statement using induction. For the base step, let L = 1. Then Ψ_θ = f_{θ_L} and we have H[Ψ_θ] = H_{θ_L}. For the induction step, take K ∈ N, K ≥ 2, and assume that the statement holds for networks with K − 1 layers. Applying the generalized chain rule to calculate H[Φ_θ ∘ f_{θ_1}] and using the induction hypothesis on H[Φ_θ] gives the desired result.
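The generalized chain rule of Lemma B.1, H[g∘f] = J[f]ᵀ (H[g]∘f) J[f] + Σ_i (∂_i g ∘ f) H[f_i], can also be checked numerically. The sketch below is our own finite-difference verification on a hypothetical pair f, g; it compares the directly computed Hessian of the composition with the right-hand side of the formula.

```python
import math

def jac(F, x, m, h=1e-5):
    # central-difference Jacobian of F: R^n -> R^m
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        Fp, Fm = F(xp), F(xm)
        for i in range(m):
            J[i][j] = (Fp[i] - Fm[i]) / (2.0 * h)
    return J

def hess(g, x, h=1e-4):
    # central-difference Hessian of a scalar function g
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shift(si, sj):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return g(y)
            H[i][j] = (shift(1, 1) - shift(1, -1)
                       - shift(-1, 1) + shift(-1, -1)) / (4.0 * h * h)
    return H

f = lambda x: [math.sin(x[0] * x[1]), x[0] ** 2 + x[1]]   # R^2 -> R^2
g = lambda y: y[0] * y[1] + y[0] ** 3                     # R^2 -> R

x = [0.4, -0.7]
y = f(x)
Jf = jac(f, x, 2)
Hg = hess(g, y)
grad_g = jac(lambda z: [g(z)], y, 1)[0]
Hf = [hess(lambda z, i=i: f(z)[i], x) for i in range(2)]

lhs = hess(lambda z: g(f(z)), x)   # direct Hessian of the composition
rhs = [[sum(Jf[a][i] * Hg[a][b] * Jf[b][j]
            for a in range(2) for b in range(2))
        + sum(grad_g[k] * Hf[k][i][j] for k in range(2))
        for j in range(2)] for i in range(2)]
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(err)
```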
Next, we formally introduce the element-wise supremum notation. The following lemma states that the output of each layer of a neural network is Lipschitz continuous in the parameter vector for any input x ∈ [a, b]^d. The lemma is stated for neural networks with a differentiable activation function, but can easily be adapted to e.g. ReLU neural networks.
Proof. Let l_0, ..., l_L denote the widths of the neural network. First, a basic layer-wise estimate holds; a recursive application of this inequality then gives, for 1 ≤ K ≤ L, the stated bound, and analogously for J_{ϑ_k}(x)_i and H_{ϑ_k}(x)_i. The triangle inequality and the Lipschitz continuity of σ′ give the corresponding estimate for k ≥ 2; one can check that the inequality also holds for k = 1.
For the Hessian matrix, the triangle inequality and the Lipschitz continuity of σ′′ give the analogous bound.

Proof. We will prove the formulas by repeatedly using the triangle inequality and the representations proven in Lemma B.2. To do so, we need to introduce some notation. Define for 0 ≤ l ≤ L + k − 1 the object φ^l ∈ {θ, ϑ}^{2L} by (B.25) φ^l_j = ϑ for j ≤ l and φ^l_j = θ for j > l. In particular, φ^{k,0}_j = θ and φ^{k,L+k−1}_j = ϑ for all j. The triangle inequality and Lemma B.2 then give (B.27). Observe that A^{k,l−1}_j − A^{k,l}_j = 0 for j ≠ l; therefore (B.30) follows. In an entirely similar fashion we obtain the remaining bounds.

Lemma C.3. Let the generalization E^q_G and training E^q_T errors be defined as in Section 2.3 for linear Kolmogorov PDEs (cf. Section 2.1). Let α = max{1, |a|, |b|, ‖σ‖_∞} and β = max{1, ‖σ′‖_∞, ‖σ′′‖_∞, ‖σ′′′‖_∞} and assume that max{‖φ‖_∞, ‖ψ‖_∞} ≤ max_{θ∈Θ} ‖u_θ‖_∞. Let L^q_Q denote the Lipschitz constant of E^q_Q, for q = i, t, s and Q = G, T. Then the bounds stated below hold.

Proof. Without loss of generality, we only focus on E^q_G, for q = i, t, s. We let |·|_p denote the vector p-norm of the vectorized version of a general tensor (cf. (B.9)). Next, using Lemma B.5 (with ϑ = 0) and max{‖φ‖_∞, ‖ψ‖_∞} ≤ max_{θ∈Θ} ‖u_θ‖_∞ for q = t, s, we obtain the stated estimates, where C = max_{x∈D}(1 + ‖μ(x)‖_1 + ‖σ(x)σ(x)*‖_1). Combining all the previous results proves the stated bound.
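The Lipschitz property proven in this appendix can be probed numerically. For a one-hidden-layer tanh network with weights bounded by R on inputs |x_j| ≤ a, summing trivial bounds on the partial derivatives with respect to each parameter gives the crude sup-norm Lipschitz constant W·d·R·a + W·R + W + 1; this is our own back-of-the-envelope constant, much looser than the bounds of Lemma B.3 but in the same spirit. The sketch checks that random parameter perturbations never violate it.

```python
import math
import random

W, D, R, A = 4, 2, 1.0, 1.0   # width, input dim, weight bound, domain bound
P = W * D + W + W + 1         # total number of parameters

def net(theta, x):
    # one-hidden-layer tanh network; theta packs (W1, b1, W2, b2)
    w1 = [theta[i * D:(i + 1) * D] for i in range(W)]
    b1 = theta[W * D:W * D + W]
    w2 = theta[W * D + W:W * D + 2 * W]
    b2 = theta[-1]
    hid = [math.tanh(sum(w1[i][j] * x[j] for j in range(D)) + b1[i])
           for i in range(W)]
    return sum(w2[i] * hid[i] for i in range(W)) + b2

rng = random.Random(3)
theta = [rng.uniform(-R, R) for _ in range(P)]
x = [rng.uniform(-A, A) for _ in range(D)]

# crude sup-norm Lipschitz constant in the parameters, with slack for
# the perturbed weights (which may exceed R by at most 1e-3)
Reff = R + 1e-3
L_bound = W * D * Reff * A + W * Reff + W + 1

worst_ratio = 0.0
for _ in range(200):
    delta = [rng.uniform(-1e-3, 1e-3) for _ in range(P)]
    vartheta = [t + d for t, d in zip(theta, delta)]
    diff = abs(net(theta, x) - net(vartheta, x))
    dinf = max(abs(d) for d in delta)
    worst_ratio = max(worst_ratio, diff / dinf)
print(worst_ratio, L_bound)
```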

Note that (4.5) and (4.6) imply that D ⊆ A and thus P(D) ≤ P(A). Next, by the definition of P, it holds that P induces a partition on Θ, and thus the asserted bound follows.

The bound on the generalization error in terms of the training error (4.4) is a probabilistic statement. It can readily be recast in terms of averages by defining the so-called cumulative generalization and training errors

E_G = ∫_{D^M} E_G(θ*(S))² dμ^M(S), E_T = ∫_{D^M} E_T(θ*(S), S)² dμ^M(S).

Here μ^M = μ ⊗ μ ⊗ ... ⊗ μ is the induced product measure on the training set S. We have the following ensemble version of Theorem 4.2.

Corollary 4.3. Assume the setting of Theorem 4.2. Then the averaged bound (4.11) holds.

Corollary 4.7. Let u be a (classical) solution of a linear Kolmogorov equation (2.1) with μ ∈ C¹(D; R^d) and σ ∈ C¹(D; R^{d×d}), let u* = u_{θ*(S)} be a trained PINN, let E^i_T, E^s_T and E^t_T denote the interior, spatial and temporal cumulative training errors, cf. (1.3), and let C_1, C_2 and C_3 be the constants as defined in Theorem 3.7. If the training set sizes were chosen as in (4.17) of Corollary 4.5 for some ε > 0, then the bound (4.20) holds.