Numerical Solution of the Parametric Diffusion Equation by Deep Neural Networks

We perform a comprehensive numerical study of the effect of approximation-theoretical results for neural networks on practical learning problems in the context of numerical analysis. As the underlying model, we study the machine-learning-based solution of parametric partial differential equations. Here, approximation theory predicts that the performance of the model depends only very mildly on the dimension of the parameter space and is instead determined by the intrinsic dimension of the solution manifold of the parametric partial differential equation. We use various methods to establish comparability between test-cases, minimizing the effect of the choice of test-case on the optimization and sampling aspects of the learning problem. We find strong support for the hypothesis that approximation-theoretical effects heavily influence the practical behavior of learning problems in numerical analysis.


Introduction
This work studies the problem of numerically solving a specific parametric partial differential equation (PPDE) by training and applying neural networks (NNs). The central goal of the following exposition is to identify those key aspects of a parametric problem that render the problem harder or simpler to solve for methods based on NNs.
The underlying mathematical problem, the solution of PPDEs, is a standard problem in applied sciences and engineering. In this model, certain parts of a PDE such as the boundary conditions, the source terms, or the shape of the domain are controlled through a set of parameters, e.g., [28,50]. In some applications where PDEs need to be evaluated very often or in real-time, individually solving the underlying PDEs for each choice of parameters becomes computationally infeasible. In this case, it is advisable to invoke methods that leverage the joint structure of all the individual problems. A typical approach is that of constructing a reduced basis associated with the problem. With respect to this basis, the computational complexity of solving the PPDE is then significantly reduced, e.g., [28,50,56,44].
Recently, as an alternative or to augment the reduced basis method, approaches were introduced that attempt to learn the parameter-to-solution map through methods of machine learning. We will provide a comprehensive overview of related approaches in Section 1.4. One approach is to train a NN to fit the discretized parameter-to-solution map, i.e., a map taking a parameter to a finite-element discretization of the solution of the associated PDE. This approach has already been analyzed theoretically in [35], where it was shown from an approximation-theoretical point of view that the hardness of representing the parameter-to-solution map by NNs is determined by a highly problem-specific notion of complexity that depends (in some cases) only very mildly on the dimension of the parameter space.
In this work, we study the problem of learning the discretized parameter-to-solution map in practice. We hypothesize that the approximation-theoretical capacity of a NN architecture is one of the central factors in determining the difficulty level of the learning problem in practice.
The motivation for this analysis is two-fold: First, we regard this as a general analysis of the feasibility of approximation-theoretical arguments in the study of deep learning. Second, specifically for the problem of numerical solution of PPDEs, we consider it important to identify which characteristics of a parametric problem determine its practical hardness. This is especially relevant to identify in which areas the application of this model is appropriate. We outline these two points of motivation in Section 1.1. The design of the numerical experiment is presented in Section 1.2 and we give a high-level report of our findings in Section 1.3.

Motivation
As already outlined before, we describe the motivation for this paper in the following two sections.

Understanding of Deep Learning in General
A typical learning problem consists of an unknown data model, a hypothesis class, and an optimization procedure to identify the best fit in the hypothesis class to the observed (sampled) data, e.g., [15,16]. In a deep learning problem, the hypothesis class is the set of NNs with a specific architecture.
The approximation-theoretical point of view analyzes the trade-off between the capacity of the hypothesis class and the complexity of the data model. In this sense, this point of view describes only one aspect of the learning problem.
In the framework of approximation theory, there are precise ways to assess the hardness of an underlying problem. Concretely, this is done by identifying the rate by which the misfit between the hypothesis class and the data model decreases for sequences of growing hypothesis classes. For example, one common theme in the literature is the observation that for certain function classes, NNs do not suffer from a curse of dimensionality, i.e., their approximation rates do not deteriorate exponentially fast with increasing input dimension, e.g., [6,60,48]. Another theme is that classes of smooth functions can be approximated more efficiently than classes of rougher functions, e.g., [42,67,47,45].
While these results offer some interpretation of why a certain problem should be harder or simpler, it is not clear how relevant these results are in practice. Indeed, there are at least three issues that call the approximation-theoretical explanation for a practical learning problem into question:
• Tightness of the upper bounds: Approximation-theoretical bounds usually describe worst-case error estimates for whole classes of functions. For individual functions or subsets of these function classes, there is no guarantee that one could not achieve a significantly better approximation rate.
• Optimization and sampling prevent the approximation-theoretical effect from materializing: As explained at the beginning of this section, the learning problem consists of multiple aspects, one of which is the ability of the hypothesis class to describe the data. Two further aspects are how well the sampling of the data model describes the true model and how well the optimization procedure performs in finding the best fit to the sampled data. Since the underlying optimization problem of deep learning is in general non-convex, it is conceivable that, while there theoretically exists a very good approximation of a function by a NN, finding it in practice is highly unlikely. Moreover, it is certainly possible that the sampling process does not contain sufficient information to guarantee that the optimization routine will identify the theoretically best approximation.
• Asymptotic estimates: All approximation-theoretical results mentioned until here, and almost all in the literature, describe the capacity of NNs to represent functions with accuracy ε only in an asymptotic regime where ε tends to zero and the size of the architecture grows accordingly. The associated approximation rates may contain arbitrarily large implicit constants, and therefore it is entirely unclear whether changes to the trade-off between the complexity of the data model and the size of the architecture have the theoretically predicted impact for moderately-sized practical learning problems.
We believe that, to understand the effect of approximation-theoretical capacities of NNs in practical learning scenarios, the learning problem associated with the parameter-to-solution map in a PPDE occupies a special role: It is, in essence, a high-dimensional approximation problem of a function that has a very strong low-dimensional, but highly non-trivial structure. Moreover, one can, to a certain extent, control the complexity of the problem, as we have seen in [35]. In this context, we can ask ourselves the following questions: Do we observe a curse of dimensionality in the practical solution of the problem? If not, how does the difficulty in practice scale with the parameter dimension? On which characteristics of the problem does the hardness of the practical solution thereof depend?
If we study these questions numerically, then the answers can be compared with the predictions from approximation-theoretical considerations. If the predictions coincide with the observed behavior and other causes, such as artefacts from the optimization and sampling procedure, can be ruled out, then we can view these experiments as strong support for the practical relevance of approximation-theoretical arguments.
Because of this, we study the aforementioned questions in an extensive numerical experiment that will be described in Section 1.2 below.

Feasibility of the Machine-Learning-Based Solution of Parametric PDEs
The method of learning the parameter-to-solution map (as described in [35]) has at least two major advantages over classical approaches to solve PPDEs: First of all, the setup is completely independent of the underlying PPDE. This versatility of NNs could be quite desirable in an environment where many substantially different PPDEs are treated. Second, because this approach is fully data-driven, we do not require any knowledge of the underlying PDE. Indeed, as long as sufficiently many data points are supplied, for example, from a physical experiment, the approach could be feasible under high uncertainty of the model.
The main drawback of the method is the lack of theoretical understanding thereof. Moreover, for the theoretical results that do exist, we lack any evaluation of how pertinent the theoretical observations are for practical behavior. Most importantly, we do not have an a priori assessment for the practical feasibility of certain problems.
In [35], we observed that the complexity of the solution manifold, i.e., the set of all solutions of the PDE, is a central quantity involved in upper bounding the hardness of approximating the parameter-to-solution map with a NN. In practice, it is unclear to what extent this notion is appropriate and if the complexity of the solution manifold influences the performance of the method at all. In the numerical experiment described below, we explore the performance of the learning approach for various test-cases with different intrinsic complexities and observe the sensitivity of the method to the different setups.

The Experiment
To analyze the approximation-theoretical effect of the architecture on the overall performance of the learning problem in practice, we train a fully connected NN as considered in [35] on a variety of datasets stemming from different parameter choices of the parametric diffusion equation. The design of the datasets is such that we vary the relationship between the capacity of the architecture and the complexity of the data and report the effect on the overall performance.
In designing such an experiment, we face three fundamental challenges hindering the comparability between test-cases:
• Effect of the optimization procedure: The effect of the architecture on the optimization procedure is not clear, and this interplay may be a much stronger factor in the performance of the method than the capacity of the architecture to fit the data model. Similarly, the complexity of the data model could affect the optimization procedure and influence the performance of the learning method more strongly than any approximation-theoretical effect.
• Effect of the sampling procedure: We train our network based on a finite number of samples of the true solution. The number and choice of samples could have a non-negligible effect on the overall performance and, most importantly, affect some test-cases more than others.
• Quantification of the intrinsic complexity: While we have theoretically established that the complexity of the solution manifold is the main factor in upper-bounding the hardness of the problem in the approximation-theoretical framework, we cannot, in practice, quantify this complexity.
We address these issues in the following four ways:
• Keeping the architecture fixed: An approximation-theoretical result on NNs is based on three ingredients: a function class C, a worst-case accuracy ε > 0, and the size of the architecture.
Whenever one of these quantities (the function class, the accuracy, or the architecture) is fixed, one can theoretically describe how changing a second influences the third. For example, for fixed C, an approximation-theoretical statement yields an estimate of the necessary size of the architecture to achieve an accuracy of ε.
Because of the potentially strong impact of the architecture on the optimization procedure, we expect that the most sensible point of view to test numerically is the one where the architecture remains fixed while we vary the function class C and observe ε. This way, we can guarantee that the influence of the architecture on the optimization procedure is the same between test-cases.
• Analyzing the convergence behavior a posteriori: We are not aware of any method to guarantee a priori that the choice of the data model would not influence the convergence behavior. We do, however, analyze the convergence after the experiment to see if there are fundamental differences between our test-cases. This analysis reveals no significant differences between all the setups and therefore indicates that the effect of the data model on the optimization procedure is very similar between test-cases.
• Establishing independence of sample generation: We run the experiment multiple times for various numbers of training samples N, chosen in the same way (uniformly at random) in every test-case.
Across the choices of N, we observe a linear dependence of the achieved accuracy on N. This indicates that the influence of the number of samples N on the performance of the method is the same for all test-cases.
• Design of semi-ordered test-cases: While we are not able to assess the intrinsic complexity exactly, it is straightforward to construct series of test-cases with increasing complexity. In this sense, we can introduce a semi-ordering of test-cases according to their complexity and observe to what extent the performance of the method follows this ordering.
We present the construction of the test-cases in Section 4 and discuss the measures taken to remove effects caused by the optimization and sampling procedures in greater detail in Appendix A. All of our test-cases consider the following parametric diffusion equation:

−∇ · (a_y ∇u_y) = f on Ω,  u_y|_∂Ω = 0,

where f ∈ L^2(Ω) and a_y ∈ L^∞(Ω) is a diffusion coefficient depending on a parameter y ∈ Y. In our test-cases below, we learn a discretization of the map R^p ⊃ Y ∋ y ↦ u_y, where p ∈ N, for various choices of parametrizations

Y ∋ y ↦ a_y ∈ L^∞(Ω).  (1.1)

Concretely, we vary the following characteristics of the parametrizations and observe the effect on the overall performance of the learning problem:
• Type of parametrization: We choose test-cases which differ with respect to the following characteristics: First, we study parametrizations (1.1) of various degrees of smoothness. Second, we study test-cases where the parametrization (1.1) is affine-linear and where it is non-linear. Third, we consider cases where a_y = ∑_{i=1}^p a_{y_i} for a_{y_i} ∈ L^∞(Ω) and where the supports of (a_{y_i})_{i=1}^p overlap or have various degrees of separation.
• Dimension of parameter space: The discretization of our solution space is done on the finest computationally feasible grid (with respect to our workstation). We have chosen the dimensions p of the parameter spaces in such a way that the resolutions of the parametrized solutions are still meaningful with respect to the underlying discretization.
• Complexity hyper-parameters: To generate comparable test-cases with increasing complexities, we include two types of hyper-parameters in the data-generation process: one directly influences the ellipticity of the problem, and the other introduces a weighting of the parameter values.
We expect that these tests yield answers to the following questions: How versatile is the approach? Does it perform well only for special types of parametrizations or is it generally applicable? Do we observe a curse of dimensionality, and how much does the performance of the learning method depend on the dimension of the parameter space? How strongly does the performance of the learning method depend on the intrinsic complexity of the data?

Our Findings
In the numerical experiments, which we report in Section 4.3 and evaluate in Section 4.4, we find that the proposed method is very sensitive to the underlying type of test-case. Indeed, we observe qualitatively different scaling behaviors of the achieved error with the dimension p of the parameter space between different test-cases. Concretely, we observe the following asymptotic behavior of the errors in different test-cases: O(1), O(log(p)), and O(p^k) for p → ∞ and k > 0, where k depends on one of the complexity hyper-parameters. Notably, we do not observe a scaling according to the curse of dimensionality, i.e., an error scaling exponentially with p, in any of the test-cases. We also observe that the achieved errors obey the semi-ordering of complexities of the test-cases. This shows that the method is very versatile and can be applied in various settings. Moreover, the complexity of the solution manifold appears to be a sensible predictor for the efficiency of the method.
In addition, we observe that the numerical results agree with the predictions that can be made via approximation-theoretical considerations. By design, we can exclude effects associated with the optimization and sampling procedures. This supports the practical relevance of approximation-theoretical results for this particular problem and for deep learning problems in general.

Related Works
The practical application of NNs in the context of PDEs dates back to the 1990s, [36]. However, in recent years the topic again gained traction in the scientific community, driven by the ever-increasing availability of computational power. Much of this research can be condensed into three main directions: learning the solution of a single PDE, system identification, and goal-oriented approaches. The first of these directions uses NNs to directly model the solution of a (in some cases user-specified) single PDE, [65,52,66,39,57], an SDE, [7,18], or even the joint solution for multiple boundary conditions, [61]. These methods mostly rely on the differential operator of the PDE to evaluate the loss, but other approaches do exist, [26]. In system identification, one tries to discover an underlying physical law from data by reverse-engineering the PDE. This can be done by attempting to uncover a hidden parameter of a known equation, [53], or by modeling physical relations, [10,51]. Conversely, goal-oriented approaches try to infer a quantity of interest stemming from the solution of an underlying PDE. For example, NNs can be used as a surrogate model to directly learn the quantity of interest and thereby circumvent the necessity of explicitly solving the equation, [33]. A practical example for this is given by the ground state energy of a molecule, which is derived from the solution of the electronic Schrödinger equation. This task has been efficiently solved by graph NNs, [58,40,21]. Furthermore, building a surrogate model can be especially useful in uncertainty quantification, [62]. NNs can also aid classical methods in solving goal-oriented tasks, [13,41]. In addition to the aforementioned research directions, further work has been done on fusing NNs with classical numerical methods to assist, for example, in model-order reduction, [55,37].
Our work focuses on PPDEs, and more specifically we are interested in learning the mapping from the parameter to the coefficients of the high-fidelity solution. Related but different approaches were analyzed in [29] and in [17,62], where the solution of the PPDE is learned in an already precomputed reduced basis or at point evaluations in fixed spatial coordinates.
On the theoretical side, the majority of works analyzing the power of NNs for the solution of (parametric) PDEs are concerned with an approximation-theoretical approach. Notable examples of such works include [24,18,23,8,25,20,32,7,11,31], in which it is shown that NNs can overcome the curse of dimensionality in the approximative solution of some specific single PDE. In the same framework, it was shown in [11] how estimates on the approximation error imply bounds on the generalization error. Concerning the theoretical analysis of PPDEs, we mention [59,35,46,27]. We will describe the results of the first two works in more detail in Section 3.2. The work [27] is concerned with an efficient approximation of a map that takes a noisy solution of a PDE as an input and returns a quantity of interest.
Additionally, we wish to mention that there exists a multitude of approaches (which are not necessarily directly RBM- or NN-related) that study the approximation of the parameter-to-solution map of PPDEs. These include methods based on sparse polynomials (see for instance [14,30] and the references therein), tensors (see for instance [5,19] and the references therein), and compressed sensing (see for instance [54,64] and the references therein).
Parametric PDEs also appear in the context of stochastic PDEs or PDEs with random coefficients (see for instance [49]) and have been theoretically examined under the perspective of uncertainty quantification. For the sake of brevity, we only mention [14] and the references therein.
Finally, we mention that a comprehensive numerical study analyzing to what extent the approximation-theoretical findings on NNs (not in the context of PPDEs) are visible in practice has been carried out in [2]. Similarly, in [22], a numerical algorithm that reproduces certain approximation-theoretically established exponential convergence rates of NNs was studied. The approximation rates of [12] were also numerically reproduced in that paper.

Outline
We start by describing the parametric diffusion equation and how we discretize it in Section 2. Then, we provide a formal introduction to NNs and a review of the approximation-theoretical results of NNs for parameter-to-solution maps in Section 3. In Section 4, we describe our numerical experiment. We start by stating three hypotheses underlying the examples in Subsection 4.1, before describing the set-up of our experiments in Subsection 4.2. After that, we present the results of the experiments in Subsection 4.3. Finally, in Subsection 4.4, we evaluate and interpret the observations. In Appendix A, we describe the measures taken to ensure comparability between test-cases.

The Parametric Diffusion Equation
Throughout this paper, we will consider the parameter-dependent diffusion equation with homogeneous Dirichlet boundary conditions

−∇ · (a ∇u) = f in Ω,  u = 0 on ∂Ω,  (2.1)

where f ∈ L^2(Ω) is the parameter-independent right-hand side, a ∈ A ⊂ L^∞(Ω), and A constitutes some compact set of parametrized diffusion coefficients. In the following, we will examine different varieties of parametrized diffusion coefficient sets A. Following [14] (by restricting ourselves to the case of finite-dimensional parameter spaces), we will always describe the elements of A by elements in R^p for some p ∈ N.
To be more precise, we will assume that

A = {a_y : y ∈ Y},  (2.2)

where Y ⊂ R^p is the compact parameter space.
A common assumption on the set A, present in the first test-cases which we will describe below and especially convenient for the theoretical analysis of the problem, is given by affine parametrizations of the form

a_y = a_0 + ∑_{i=1}^p y_i a_i,  (2.3)

where the functions (a_i)_{i=0}^p ⊂ L^∞(Ω) are fixed. After reparametrization, we consider the following problem, given in its variational formulation: for y ∈ Y, find u_y ∈ H := H^1_0(Ω) such that

b_y(u_y, v) = ∫_Ω f v dx  for all v ∈ H,  (2.4)

where

b_y(u, v) := ∫_Ω a_y ∇u · ∇v dx  for u, v ∈ H,

and u_y ∈ H is the solution.¹ We will consider experiments in which the involved bilinear forms are uniformly continuous and uniformly coercive in the sense that there exist C_cont, C_coer > 0 with

|b_y(u, v)| ≤ C_cont ‖u‖_H ‖v‖_H  and  b_y(u, u) ≥ C_coer ‖u‖_H²  for all u, v ∈ H and all y ∈ Y.

By the Lax–Milgram lemma (see [50, Lemma 2.1]), problem (2.4) is well-posed, i.e., for every y ∈ Y there exists exactly one u_y ∈ H such that (2.4) is satisfied, and u_y depends continuously on f.
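To make the affine parametrization (2.3) and the uniform ellipticity condition concrete, the following sketch evaluates a hypothetical affine coefficient on a grid and checks numerically that it stays bounded away from zero. The bump functions a_i, the grid resolution, and the parameter vector are our own illustrative assumptions, not the test-cases used in this paper.

```python
import numpy as np

def affine_coefficient(y, grid):
    """Evaluate a_y(x) = a_0(x) + sum_i y_i * a_i(x) on the given grid points."""
    X1, X2 = grid
    a0 = np.ones_like(X1)  # constant background coefficient a_0 = 1 (an assumption)
    # hypothetical bump functions a_i with centers spread along the x1-axis
    basis = [np.exp(-50 * ((X1 - c) ** 2 + (X2 - 0.5) ** 2))
             for c in np.linspace(0.2, 0.8, len(y))]
    return a0 + sum(yi * ai for yi, ai in zip(y, basis))

grid = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
y = np.array([0.3, -0.2, 0.5])
a_y = affine_coefficient(y, grid)
# uniform ellipticity (coercivity) requires a_y to be bounded away from zero
print(a_y.min() > 0)  # True for this parameter choice
```

Since |y_i| ≤ 0.5 and the bumps are bounded by one, a_y ≥ 1 − 0.2 > 0 here, so the coercivity condition holds for this toy setup.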

High-Fidelity Discretizations
In practice, one cannot hope to solve (2.4) exactly for every y ∈ Y. Instead, if we assume for the moment that y is fixed, a common approach towards the calculation of an approximate solution of (2.4) is given by the Galerkin method, which we will describe briefly below following [28, Appendix A] and [50, Chapter 2.4].
In this framework, instead of solving (2.4), one solves a discrete scheme of the form: find u_y^h ∈ U^h such that

b_y(u_y^h, v) = ∫_Ω f v dx  for all v ∈ U^h,  (2.5)

where U^h ⊂ H is a subspace of H with dim U^h < ∞ and u_y^h ∈ U^h is the solution of (2.5). Let us now assume that U^h is given. Moreover, let D := dim U^h, and let (φ_i)_{i=1}^D be a basis of U^h. Then the stiffness matrix B_y^h := (b_y(φ_j, φ_i))_{i,j=1}^D is non-singular and positive definite. The coefficient vector 𝐮_y^h ∈ R^D of u_y^h with respect to this basis satisfies

B_y^h 𝐮_y^h = 𝐟^h,  where 𝐟^h := (∫_Ω f φ_i dx)_{i=1}^D,

and u_y^h is, up to a universal constant, a best approximation of u_y in U^h. In this framework, we can now define the central object of interest, which is the map taking an element from the parameter space Y to the discretized solution u_y^h.

¹ Throughout this paper, we denote by H the space H^1_0(Ω) := {u ∈ H^1(Ω) : u|_∂Ω = 0}, where H^1(Ω) := W^{1,2}(Ω) is the first-order Sobolev space and where ∂Ω denotes the boundary of Ω. On this space, we consider the norm ‖u‖_H := ‖∇u‖_{L^2(Ω)}.
Definition 2.1. Let U^h, D, and (φ_i)_{i=1}^D be as above, and let Y be as in (2.2). Then we define the discretized parameter-to-solution map (DPtSM) by

P: Y → R^D,  y ↦ 𝐮_y^h.

Remark 2.2. The DPtSM P is a potentially nonlinear map from a p-dimensional set to a D-dimensional space. Therefore, without using the information that P has a very specific structure described through A and the PDE (2.1), a direct approximation of P as a high-dimensional smooth function will suffer from the curse of dimensionality, [9,43].
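The Galerkin scheme above, assembling the stiffness matrix and solving the resulting linear system for the coefficient vector, can be sketched in a few lines. The following toy example is a 1D analogue (an assumption on our part; the experiments in this paper use a 2D domain) with piecewise-linear hat functions, where the computed coefficients can be compared with the known exact solution of the Poisson problem.

```python
import numpy as np

def solve_galerkin_1d(a_vals, f_vals, D):
    """Solve -(a(x) u')' = f on (0,1), u(0)=u(1)=0, with D interior P1 hat functions.

    a_vals: piecewise-constant coefficient on the D+1 elements;
    f_vals: right-hand side at the D interior nodes (midpoint-rule load).
    """
    h = 1.0 / (D + 1)
    B = np.zeros((D, D))  # stiffness matrix B[i, j] = b(phi_j, phi_i)
    for k in range(D + 1):           # element k spans [k*h, (k+1)*h]
        for i in (k - 1, k):         # hat functions overlapping element k
            for j in (k - 1, k):
                if 0 <= i < D and 0 <= j < D:
                    sign = 1.0 if i == j else -1.0
                    B[i, j] += sign * a_vals[k] / h
    rhs = f_vals * h                 # load vector, exact here for constant f
    return np.linalg.solve(B, rhs)   # coefficients of u^h in the hat basis

D = 127
a = np.ones(D + 1)                   # a = 1: plain Poisson problem
f = np.ones(D)                       # f = 1
u = solve_galerkin_1d(a, f, D)
x = np.linspace(0, 1, D + 2)[1:-1]   # interior nodes
# P1 FEM with exact load is nodally exact; the exact solution is x(1-x)/2
print(np.max(np.abs(u - 0.5 * x * (1 - x))) < 1e-8)  # True
```

For non-constant a, the same assembly loop yields the parameter-dependent stiffness matrix B_y, which is the object the DPtSM implicitly inverts for every y.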
Before we continue, let us introduce some crucial notation. Later, we need to compute the Sobolev norms of functions v ∈ H. This will be done via a vector representation 𝐯 of v with respect to the high-fidelity basis (φ_i)_{i=1}^D. We denote by G := (⟨φ_i, φ_j⟩_H)_{i,j=1}^D ∈ R^{D×D} the symmetric, positive definite Gram matrix of the basis functions (φ_i)_{i=1}^D. Then, for any v ∈ U^h with coefficient vector 𝐯 with respect to the basis, we have ‖v‖_H = (𝐯^T G 𝐯)^{1/2}.
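The norm computation via the Gram matrix can be sketched as follows; the small tridiagonal G below is the Gram matrix of three 1D hat functions under an H^1_0-type seminorm and is purely illustrative.

```python
import numpy as np

def h_norm(v_coeffs, G):
    """Sobolev norm of the function with coefficients v_coeffs: sqrt(v^T G v)."""
    return float(np.sqrt(v_coeffs @ G @ v_coeffs))

# toy Gram matrix of 1D hat functions on a grid with mesh width h = 1/4
h = 0.25
G = (1.0 / h) * (2 * np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1))
v = np.array([1.0, 2.0, 1.0])
print(h_norm(v, G))  # → 4.0
```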

Approximation of the Discretized Parameter-to-Solution Map by Realizations of Neural Networks
In this section, we describe the approximation-theoretical motivation for the numerical study performed in this paper. We present a formal definition of NNs below. In Question 3.5, we present the underlying approximation-theoretical question of the considered learning problem. Thereafter, we recall the results of [35] showing that one can upper bound the approximation rates that NNs obtain when approximating the DPtSM through an implicit notion of complexity of the DPtSM.

Neural Networks
NNs describe functions of compositional form that result from repeatedly applying affine linear maps and a so-called activation function. From an approximation-theoretical point of view, it is sensible to count the number of active parameters of a NN. To associate a meaningful and mathematically precise notion of the number of parameters to a NN, we differentiate here between neural networks, which are sequences of matrices and vectors essentially describing the parameters of the NN, and realizations of neural networks, which are the associated functions. Concretely, we make the following definition: a NN with input dimension n ∈ N and L ∈ N layers is a sequence of matrix-vector tuples

Φ = ((A_1, b_1), ..., (A_L, b_L)),  where A_ℓ ∈ R^{N_ℓ × N_{ℓ−1}} and b_ℓ ∈ R^{N_ℓ},

for N_0 := n and natural numbers N_1, ..., N_L. If Φ is a NN as above, K ⊂ R^n, and if ϱ: R → R is arbitrary, then we define the associated realization of Φ with activation function ϱ over K (in short, the ϱ-realization of Φ over K) as the map R_K^ϱ(Φ): K → R^{N_L} such that R_K^ϱ(Φ)(x) = x_L, where x_L results from the following scheme:

x_0 := x,  x_ℓ := ϱ(A_ℓ x_{ℓ−1} + b_ℓ) for ℓ = 1, ..., L − 1,  x_L := A_L x_{L−1} + b_L,

and where ϱ acts componentwise, that is, ϱ(v) := (ϱ(v_1), ..., ϱ(v_s)) for all v = (v_1, ..., v_s) ∈ R^s.
We call N(Φ) := n + ∑_{j=1}^L N_j the number of neurons of the NN Φ and L = L(Φ) the number of layers. We call M(Φ) := ∑_{ℓ=1}^L (‖A_ℓ‖_0 + ‖b_ℓ‖_0) the number of non-zero weights of Φ, where ‖·‖_0 denotes the number of non-zero entries of a matrix or vector. Moreover, we refer to N_L as the output dimension of Φ. Finally, we refer to (N_0, ..., N_L) as the architecture of Φ.
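A minimal sketch of a NN as a sequence of matrix-vector tuples, its realization, and the quantities M(Φ) and N(Φ); all concrete shapes and weights below are hypothetical.

```python
import numpy as np

def realization(weights, x, act=lambda z: np.maximum(z, 0)):
    """Realization of Phi = ((A_1,b_1),...,(A_L,b_L)): activation on all but the last layer."""
    for A, b in weights[:-1]:
        x = act(A @ x + b)
    A, b = weights[-1]
    return A @ x + b

def num_weights(weights):
    """M(Phi): total number of non-zero entries of all A_l and b_l."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in weights)

def num_neurons(weights):
    """N(Phi): input dimension plus the sum of all layer output dimensions."""
    return weights[0][0].shape[1] + sum(A.shape[0] for A, _ in weights)

# hypothetical NN with architecture (3, 4, 2), dense Gaussian weights
rng = np.random.default_rng(0)
phi = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
       (rng.standard_normal((2, 4)), rng.standard_normal(2))]
print(num_neurons(phi))                      # 3 + 4 + 2 = 9
print(num_weights(phi))                      # 12 + 4 + 8 + 2 = 26 (all entries non-zero)
print(realization(phi, np.zeros(3)).shape)   # (2,)
```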
Remark 3.3. Denote by ϱ_α(x) := max{αx, x} the α-LReLU. For every α ∈ (0, 1) it holds for all x ∈ R that

ϱ_α(x) = ϱ_0(x) − α ϱ_0(−x)  and  ϱ_0(x) = (ϱ_α(x) + α ϱ_α(−x)) / (1 − α²).

Hence, for every α ∈ (0, 1), we can represent the ReLU as the sum of two rescaled α-LReLUs and vice versa. Based on these identities, every neuron with one of the two activation functions can be emulated by a pair of rescaled neurons with the other activation function; for the resulting NNs Φ_1, Φ_2, Φ_3 arising from such replacements, it is not hard to see that M(Φ_1) ≤ M(Φ_2), M(Φ_3) and M(Φ_2), M(Φ_3) ≤ 4M(Φ_1). Therefore, for every α_1, α_2 ∈ [0, 1) and every function f: R^n → R^{N_L} of a function space X, the existence of a NN Φ whose ϱ_{α_1}-realization approximates f to a given accuracy implies that there exists another NN Φ̃ with L(Φ̃) = L(Φ) and M(Φ̃) ≤ 16M(Φ) whose ϱ_{α_2}-realization achieves the same accuracy. In other words, up to a multiplicative constant, the parameter α of the α-LReLU does not influence the approximation properties of realizations of NNs.
Remark 3.4. While Remark 3.3 shows that all α-LReLUs yield, in principle, the same approximation behavior, these activation functions still display quite different behavior during the training phase of NNs, where a non-vanishing parameter α can help avoid the occurrence of dead neurons.
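The interchangeability of ReLU and α-LReLU discussed in Remark 3.3 can be checked numerically; the identities used below are our reconstruction of the ones in that remark.

```python
import numpy as np

def lrelu(x, a):
    """alpha-LReLU: rho_a(x) = max(a*x, x); a = 0 recovers the ReLU."""
    return np.maximum(a * x, x)

x = np.linspace(-3, 3, 1001)
for a in (0.01, 0.1, 0.5):
    # rho_0(x) = (rho_a(x) + a*rho_a(-x)) / (1 - a^2)
    relu_from_lrelu = (lrelu(x, a) + a * lrelu(-x, a)) / (1 - a ** 2)
    # rho_a(x) = rho_0(x) - a*rho_0(-x)
    lrelu_from_relu = lrelu(x, 0.0) - a * lrelu(-x, 0.0)
    print(np.allclose(relu_from_lrelu, lrelu(x, 0.0)),
          np.allclose(lrelu_from_relu, lrelu(x, a)))  # each line: True True
```

Each emulation doubles the number of neurons per layer, which is the source of the constant-factor overhead in M(Φ) mentioned in Remark 3.3.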

Approximation of the Discretized Parameter-to-Solution Map by Realizations of Neural Networks
We can quantify the capability of NNs to represent the DPtSM by answering the following question:

Question 3.5. Let p, D ∈ N, α ∈ [0, 1), Ω = (0, 1)², let U^h ⊂ H be a D-dimensional space, and let A = {a_y : y ∈ Y} ⊂ L^∞(Ω) be compact with Y ⊂ R^p as in (2.2). We consider the following equivalent questions:
• For ε > 0, how large do M_ε, L_ε ∈ N need to be to guarantee that there exists a NN Φ with M(Φ) ≤ M_ε and L(Φ) ≤ L_ε that satisfies

sup_{y ∈ Y} ‖P(y) − R_Y^{ϱ_α}(Φ)(y)‖_G ≤ ε,

where ‖z‖_G := (z^T G z)^{1/2} for z ∈ R^D?
• For M, L ∈ N, how small can ε_{L,M} > 0 be chosen so that there exists a NN Φ with M(Φ) ≤ M and L(Φ) ≤ L that satisfies the estimate above with ε replaced by ε_{L,M}?

The results to follow measure the necessary sizes of the NNs in terms of the number of non-zero weights M(Φ). However, from a practical point of view, we are also interested in the number of necessary neurons N(Φ). Invoking a variation of [47, Lemma G.1] shows that similar rates to the ones below are valid for the number of neurons N(Φ).
If the regularity of P is known, then a straightforward bound on M_ε and L_ε can be found in [67]. Indeed, if P ∈ C^s(Y; R^D) with ‖P‖_{C^s} ≤ 1, then one can choose

M_ε ∈ O(ε^{−p/s} log₂(1/ε))  and  L_ε ∈ O(log₂(1/ε)).  (3.1)

In other situations, e.g., if L_ε is permitted to grow faster than log₂(1/ε), one can even replace s by 2s in (3.1), see [68,38]. The rate of (3.1) uses the smoothness of P only and does not take into account the underlying structure stemming from the PDE (2.1) and the choice of A. As a result, we find this rate to be significantly suboptimal.
In [35], it was shown that P can be approximated in the sense of Question 3.5 with

M_ε ∈ O(d(ε)^{c₁} log₂(1/ε))  for a constant c₁ ≥ 1,  (3.2)

where d(ε) is a certain intrinsic dimension of the problem (derived from bounds on the Kolmogorov N-width of S(Y)), essentially reflecting the size of a reduced basis required to sufficiently approximate S(Y). In many cases, especially those discussed in this manuscript, one can theoretically establish the scaling behavior of d(ε) to be polylogarithmic in 1/ε. Applied to (3.2), this yields that

M_ε ∈ O(log₂^c(1/ε))  (3.3)

for some c ≥ 1. We also mention a similar approximation result, not for the discretized parametric map P but for the parametrized solution (y, x) ↦ u_{a_y}(x), where u_{a_y} is the solution of (2.1) for a = a_y. In this situation, and for specific parametrizations of A, [59, Theorem 4.8] shows that this map can be approximated by the realization of a NN using the ReLU activation function up to an error of ε with a number of weights that essentially scales like ε^{−r}, where r depends on the summability of the (in this case potentially infinite) sequence (a_i)_{i=1}^∞ such that a_y = a_0 + ∑_{i=1}^∞ y_i a_i for a coefficient vector y = (y_i)_{i=1}^∞. Here, r can be very small if ‖a_i‖_{L^∞(Ω)} decays quickly as i → ∞, which leads to very efficient approximations.
While the aforementioned results all examine the approximation-theoretical properties of realizations of NNs with respect to the uniform approximation error, they trivially imply the same rates if we instead examine average errors of the form

(∫_Y ‖P(y) − R_Y^{ϱ_α}(Φ)(y)‖_G^q dµ(y))^{1/q},

which are often used in practice. Here, 1 ≤ q < ∞ and µ is an arbitrary probability measure on Y. In this paper, we examine the discrete counterpart of the mean relative error

E_{y∼µ} [ ‖P(y) − R_Y^{ϱ_α}(Φ)(y)‖_G / ‖P(y)‖_G ],

where µ denotes the uniform probability measure on Y.
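A sketch of the discrete mean relative error over a finite set of test parameters; the sample arrays and the identity Gram matrix are illustrative stand-ins for the actual NN predictions, high-fidelity solutions, and basis.

```python
import numpy as np

def mean_relative_error(pred, target, G):
    """Average of ||pred_n - target_n||_G / ||target_n||_G over samples n.

    pred, target: arrays of shape (num_samples, D); G: (D, D) Gram matrix.
    """
    def g_norm(v):
        # batched quadratic form v_n^T G v_n for every sample n
        return np.sqrt(np.einsum("nd,de,ne->n", v, G, v))
    return float(np.mean(g_norm(pred - target) / g_norm(target)))

# toy check with G = I: reduces to the relative Euclidean error, 10% per sample
G = np.eye(4)
target = np.ones((5, 4))
pred = 1.1 * target
print(mean_relative_error(pred, target, G))  # ≈ 0.1
```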
In view of the aforementioned theoretical results, it is clear that the hardness of the approximation problem of Question 3.5 is determined not by the dimension of the parameter space Y but by a problem-specific notion of complexity. To what extent this theoretical observation influences the hardness of the practical learning problem is analyzed in the numerical experiments presented in the next section.

Numerical Survey of Approximability of Discretized Parameterto-Solution Maps
As outlined in Section 3, the theoretical hardness of the approximation problem of Question 3.5 is determined by an intrinsic notion of complexity that potentially differs substantially from the dimension of the parameter space.
To test how this intrinsic complexity affects the practical machine-learning-based solution of (2.1), we perform a comprehensive study in which we train NNs to approximate the DPtSM P for various choices of A. Here, we are especially interested in the performance of the learned approximation of P for varying complexities of A. In this context, we test the hypotheses listed in the following Subsection 4.1. The remainder of this section is structured as follows: In Subsection 4.2, we introduce the concrete setup of parametrized diffusion coefficient sets, NN architecture, and optimization procedure and explain how the choice of test-cases is related to our hypotheses. Afterwards, in Subsection 4.3, we report the results of our numerical experiments. Subsection 4.4 is devoted to an evaluation and interpretation of these results in view of the hypotheses of Subsection 4.1.

Hypotheses
[H1] The performance of learning the DPtSM does not suffer from the curse of dimensionality: The theoretical results of [35] show that the dimension p of the parameter space is not the main factor in determining the hardness of the underlying approximation-theoretical problem. As already outlined in the introduction, it is by no means clear that this effect is visible in a practical learning problem.
We expect that after accounting for effects stemming from optimization and sampling to promote comparability between test-cases in a way described in Appendix A, the performance of the learning method will scale only mildly with the dimension of the parameter space.
[H2] The performance of learning the DPtSM is very sensitive to the parametrization: We expect that, within the framework of Question 3.5, there are still extreme differences in intrinsic complexity between different choices of parametrizations for the diffusion coefficient sets A ⊂ L^∞(Ω) as defined in (2.2). However, it is not clear to what extent NNs are capable of resolving the low-dimensional substructures generated by various choices of A ⊂ L^∞(Ω).
Since realizations of NNs are a very versatile function class, we expect the degree to which the performance of a trained NN depends on the number of parameters to vary strongly over the choice of (a i ) p i=1 .
[H3] Learning the DPtSM is efficient also for non-affinely parametrized problems: The analysis of PPDEs often relies on affine parametrizations as in (2.3) or smooth variations thereof.
We expect the overall theme that NNs perform according to an intrinsic complexity of the problem depending only weakly on the parameter dimension to hold in more general cases.

Setup of Experiments
To test the hypotheses [H1], [H2], and [H3] of Section 4.1, we consider the following setup.

Parameterized Diffusion Coefficient Sets
We perform training of NNs for different instances of the approximation problem of Question 3.5. Here, we always fix the right-hand side as f(x) = 20 + 10x_1 − 5x_2 for x = (x_1, x_2) ∈ Ω and vary the parametrized diffusion coefficient set A.
We consider four different parametrized diffusion coefficient sets, as described in the test-cases [T1]-[T4] (for a visualization of [T3] and [T4], see Figure 1).
[T1] Trigonometric Polynomials: In this case, the set A consists of trigonometric polynomials that are weighted according to a scaling coefficient σ. More precisely, we consider, for a fixed shift µ > 0 and a scaling coefficient σ ∈ R, diffusion coefficients whose basis functions a_i are sine functions of increasing frequency. We analyze the cases p = 2, 5, 10, 15, 20 and, for each p, the scaling coefficients σ = −1, 0, 1. As the shift, we always choose µ = 1.
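As an illustration of how samples of such an affinely parametrized diffusion coefficient might be generated, the following sketch assumes basis functions of the form a_i(x) = sin(iπx_1)sin(iπx_2) and weights i^σ; both the precise basis functions and the weighting scheme are assumptions for illustration, since the exact formula is not reproduced above.

```python
import numpy as np

def diffusion_coefficient(y, x1, x2, sigma=1.0, mu=1.0):
    """Illustrative (assumed) affine trigonometric parametrization
    a_y(x) = mu + sum_i y_i * i^sigma * sin(i*pi*x1) * sin(i*pi*x2).
    For sigma > 0, high-frequency terms receive larger weights;
    for sigma < 0, low-frequency terms dominate."""
    a = np.full_like(np.asarray(x1, dtype=float), mu)
    for i, yi in enumerate(y, start=1):
        a = a + yi * (i ** sigma) * np.sin(i * np.pi * x1) * np.sin(i * np.pi * x2)
    return a
```

The shift µ keeps the coefficient bounded away from zero so that the diffusion problem stays uniformly elliptic for admissible parameter ranges.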

Setup of Neural Networks and Training Procedure
Our experiments are implemented using Tensorflow [1] for the learning procedure and FEniCS [3] as the FEM solver. The code used for dataset generation in all considered test-cases is publicly available at www.github.com/MoGeist/diffusion_PPDE. To be able to compare different test-cases and remove all effects stemming from the optimization procedure, we train almost the same model for all parameter spaces. The only (to a certain extent inevitable) change that we allow between test-cases is that the input dimension of the NN changes to that of the parameter space. Concretely, we consider the following setup: (1) The finite element space U_h results from a triangulation of Ω = [0, 1]^2 with 101 × 101 = 10201 equidistant grid points and first-order Lagrange finite elements. This space serves as a discretized version of the space H^1(Ω). We denote by D = 10201 its dimension and by (ϕ_i)_{i=1}^D the corresponding finite element basis.
(2) The (feedforward) neural network architecture is S = (p, 300, ..., 300, 10201) with L = 11 layers, where p is test-case-dependent and the weights and biases are initialized according to a normal distribution with mean 0 and standard deviation 0.1.
(4) The loss function is the mean relative error on the finite element discretization. In our experiments, we aim at finding a NN Φ with architecture S such that the mean relative training error is minimized. We then test the accuracy of our NN by computing the mean relative test error. Here, we use the mean relative error instead of the mean absolute error in order to establish comparability of our results between different sets A, allowing us to put our results into context. The optimization is done through batch gradient descent. To ensure further comparability between the different setups, the hyper-parameters of the optimization procedure are kept fixed: training is conducted with batches of size 256 using the ADAM optimizer [34] with hyper-parameters α = 2.0 × 10^{-4}, β_1 = 0.9, β_2 = 0.999, and ε = 1.0 × 10^{-8}. Training is stopped after reaching 40000 epochs. Once the NN is trained, for a new input y ∈ Y, the approximate discretized solution R_Y(Φ)(y) is computed by a simple forward pass.
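A minimal sketch of the stated architecture follows (the actual experiments use Tensorflow; here we use plain NumPy for the forward pass). We read S = (p, 300, ..., 300, 10201) with L = 11 as eleven affine layers, i.e., ten hidden layers of width 300; the exact layer count implied by the ellipsis and the use of ReLU activations on the hidden layers are our assumptions.

```python
import numpy as np

def init_network(p, width=300, depth=11, out_dim=10201, std=0.1, seed=0):
    """Architecture S = (p, 300, ..., 300, 10201) read as `depth` affine
    layers: input dim p, depth-1 hidden layers of width `width`, output
    dim `out_dim`. Weights and biases drawn from N(0, std^2)."""
    rng = np.random.default_rng(seed)
    dims = [p] + [width] * (depth - 1) + [out_dim]
    return [(rng.normal(0.0, std, (n_out, n_in)), rng.normal(0.0, std, n_out))
            for n_in, n_out in zip(dims[:-1], dims[1:])]

def forward(params, y):
    """ReLU on hidden layers, affine output layer: y -> discretized solution."""
    z = np.asarray(y, dtype=float)
    for A, b in params[:-1]:
        z = np.maximum(A @ z + b, 0.0)
    A, b = params[-1]
    return A @ z + b
```

The output vector holds the D = 10201 coefficients of the approximate solution with respect to the finite element basis, so evaluation for a new parameter y is a single forward pass.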

Relation to Hypotheses
The test-cases [T1] - [T4] are designed to test the hypotheses [H1] - [H3] in the following way: Enabling comparability between test-cases: We implement three measures to produce a uniform influence of the optimization and sampling procedure in all test-cases: (i) we change the architecture between test-cases only in the minimally required way, so as not to alter the optimization behavior; (ii) we analyze the optimization behavior a posteriori to check for qualitative differences between test-cases; and (iii) we choose the number of training samples such that neither moderately increasing nor decreasing it affects the outcome of the experiments. We describe these measures in detail in Appendix A.

Relation to Hypothesis [H1]:
To test whether the learning method suffers from the curse of dimensionality or whether, as predicted by [35], its complexity is determined only by some intrinsic complexity of the function class, we run all test-cases [T1]-[T4] for various values of the dimension of the parameter space and study the resulting scaling behavior.

Relation to Hypothesis [H2]:
To understand the extent to which the NN model is sufficiently versatile to adapt to various types of solution sets, we study four commonly considered parametrized diffusion coefficient sets which also include multiple subproblems described via the hyper-parameters σ and µ.
The parametrized sets exhibit the following characteristics: [T1] The parameter-dependence in this case is affine (i.e., the forward map y → b_y(u, v) depends affinely on y for all u, v ∈ H), and the functions (a_i)_{i=1}^p are spatially analytic. To vary the difficulty of the problem at hand, we consider different instances of the scaling coefficient σ, which put different emphasis on the high-frequency components of the functions (a_i)_{i=1}^p. In particular, if σ > 0, a higher weight is put on the high-frequency components than on the low-frequency ones, whereas the opposite is true for σ < 0.
[T2] The parameter-dependence in this case is again affine, whereas the spatial regularity of the indicator functions (X_{Ω_i})_{i=1}^p is very low. To vary the difficulty of the problem, we consider different instances of the shift µ. The higher the shift, the more elliptic the problem becomes.
[T3] [T3-F] again exhibits affine parameter-dependence and the same regularity properties as test-case [T2]. However, this problem is considered easier than test-case [T2] since the Ω_i do not intersect each other. For test-case [T3-V], the geometric properties of the domain partition are additionally encoded via a parameter, thereby rendering the problem non-affine.
[T4] In this case, the parameter-dependence is non-affine and has low regularity due to the clipping procedure. Additionally, the spatial regularity of the functions a_y is in general comparatively low.
A visualization highlighting the versatility of our test-cases can be seen when comparing the FE solutions in Figure 2 (test-case [T2]) with the FE solutions in Figure 3 (test-case [T4]).

Numerical Results
In this subsection, we report the results of the test-cases announced in the previous subsection.

[T1] Trigonometric Polynomials
We observe the following mean relative test errors for the sets A_tp(p, σ). In Figure 2, we show samples from the test set for different values of µ; we always depict one average-performing sample and one with poor performance. These figures offer a potential explanation of why the scaling with p is qualitatively different for different values of µ: for lower µ, the effect of the individual parameters on the solution appears to be much more local than for higher µ, which seems to lead to a higher intrinsic dimensionality of the problem.

[T3] Cookies with Fixed and Variable Radii
We start with one experiment where the radii of the cookies are fixed to 0.8/(2s).

[T4] Clipped Polynomials
For the set A_cp(p, 10^{-1}), we obtain the following mean relative test errors when varying p.

Evaluation and Interpretation of Experiments
We make the following observations about the numerical results of Section 4.3.
[O1] Our test-cases show that the error rates achieved by NN approximations for varying parameter dimensions differ strongly and qualitatively between different test-cases; see Figures 4 and 5. For [T1] and σ = −1, the error appears to be almost independent of p as p → ∞. In contrast, for σ = 1 we observe a linear scaling in the log-log plot, implying a polynomial dependence of the error on p.
For test-case [T2], we observe that the error scales linearly on the log-log scale of Figure 5. We conclude that for A_cb(p, µ), the error scales polynomially with p. The semilog plot of Figure 7 shows that for test-case [T4] with the sets A_cp(p, 10^{-4}), the growth of the error is logarithmic in p.
In total, we observed scaling behaviors of O(1), O(log(p)), and O(p^k) for k > 0 as p → ∞. Notably, none of the test-cases exhibited an exponential dependence of the error on p.
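The three regimes can be distinguished empirically by comparing linear least-squares fits of the error against p on log-log axes (polynomial scaling, err ~ p^k) and on semilog axes (logarithmic scaling). A small sketch of such a diagnostic, using synthetic data rather than the measurements reported above:

```python
import numpy as np

def fit_scaling(p_values, errors):
    """Fit log(error) vs log(p) (polynomial regime, err ~ p^k) and
    error vs log(p) (logarithmic regime, err ~ k*log(p)); return the
    slope and R^2 of each fit. The fit with R^2 closer to 1 suggests
    the better-matching scaling regime."""
    lp = np.log(np.asarray(p_values, dtype=float))
    e = np.asarray(errors, dtype=float)

    def slope_r2(x, y):
        k, c = np.polyfit(x, y, 1)
        resid = y - (k * x + c)
        return float(k), float(1.0 - resid.var() / y.var())

    return {"poly": slope_r2(lp, np.log(e)),  # (exponent k, R^2)
            "log":  slope_r2(lp, e)}          # (log-slope k, R^2)
```

A flat slope in both fits would indicate the O(1) regime observed for [T1] with σ = −1.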
[O2] The choice of the hyper-parameters σ and µ in the test-cases [T1], [T2], [T3] influences the scaling behavior according to its effect on the complexity of the parameterized diffusion coefficient set.
Weighting the parameters via the scaling parameter σ should, in principle, simplify the parametric problem for smaller values of σ. This is precisely what we observe in Table 2 and Figure 4.
The influence of the shift µ is of a somewhat different type. Higher values of µ make the underlying problem more elliptic. This can be seen in Figure 2: for a small value of µ, the effects of the individual values on the chessboard pieces on the solution appear to be almost completely decoupled. In the more elliptic case, on the other hand, the solution appears more smoothed out, and each parameter value therefore also influences the solution more globally. This implies a stronger coupling of the parameters and, at least intuitively, indicates a reduced intrinsic dimensionality for higher values of µ.
Accordingly, we see in Table 4 and Figure 5 that the parameter µ influences the scaling behavior of the method with p. Indeed, the error scales as O(p^k), where the exponent k of the polynomial dependence on p depends on µ. In conclusion, the approximation of the DPtSM by NNs appears to be very sensitive to these parameters, as we observe in Table 4 and Figure 5 as well as in Table 8 and Figure 6.
[O3] We observe no fundamentally worse scaling behavior for non-affinely parametrized test-cases compared to test-cases with an affine parameterization.In test-case [T3], we do observe that the non-linearly parametrized problem appears to be more challenging overall, while the scaling behavior is the same as for the affinely parametrized problem.In test-case [T4], which is the test-case with the highest number of parameters p, we observe only a very mild (in fact logarithmic) dependence of the error on p.
From these observations we draw the following conclusions for our hypotheses:

Hypothesis [H1]
In observation [O1], we saw that, over a wide variety of test-cases, multiple types of scaling of the error with the dimension of the parameter space could be observed. None of them exhibits exponential scaling. In fact, the behavior of the errors seems to be determined by an intrinsic complexity of the problems.

Hypothesis [H2]
Comparing performance both between test-cases (observation [O1]) and within test-cases (observation [O2]) leads us to conclude that there exist strong differences in the performance of learning the DPtSM. For various test-cases, using NNs with precisely the same architecture, we observed (see [O2]) considerably different scaling behaviors across the test-cases [T1]-[T4], with the error scaling polynomially, logarithmically, or remaining constant with the parameter dimension p (as described in [O1]).
According to [O2], the overall level of the errors and the type of scaling for increasing p follow the semi-ordering of complexities of the test-cases, in the sense that more complex parametrized sets yield higher errors, whereas simpler sets, or sets with intuitively lower intrinsic dimensionality, yield smaller errors (test-cases [T1] and [T2]). We therefore conclude that the approximation-theoretical intrinsic dimension of the parametric problem is a main factor in determining the hardness of learning the DPtSM.

Hypothesis [H3]
In support of [H3], we found no fundamental difference of the performance of the NN model for non-affinely parametrized problems (see [O3]).
In conclusion, we found support for all the hypotheses [H1]-[H3].We consider this result a validation of the importance of approximation-theoretical results for practical learning problems, especially in the application of deep learning to problems of numerical analysis.
It is clear that the results presented in this work only analyze the sensitivity of the performance of the learned DPtSM with respect to the semi-ordering of complexities. For future work, it would be interesting to identify alternative, more quantitative notions of complexity and to test the sensitivity of the learned method with respect to those.

A.3 A Posteriori Analysis of Convergence Behavior
Similarly to the architecture, the hyper-parameters of the optimization method were kept fixed across all datasets and training runs. This measure, however, only eliminates the effect of the architecture on the optimization method and does not address any obfuscating effect that the choice of test-cases may have. To analyze whether such an effect is present, we check the convergence on our two hardest test-cases, [T2] and [T3-V], for the largest parameter dimension p considered. The results are depicted in Figures 8 and 9, respectively. We see that even for small shifts µ, i.e., the most difficult problem settings, the error on the training set converges smoothly. This behavior can also be witnessed on all other test-cases. Another possible pitfall of our optimization procedure would be the occurrence of overfitting. In particular, this would render the attained accuracy levels invalid, as we trained for a fixed number of epochs. However, overfitting did not occur in any of our tests. As examples, we show the convergence plots of the training and test errors for the hardest parameter choices of [T3-V] and [T4] in Figures 10 and 11, respectively. Similar behavior can also be observed on all other datasets.
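The overfitting check described above amounts to tracking the training and test errors per epoch and inspecting whether the test error stagnates while the training error keeps decreasing. A minimal sketch of such a monitor (the heuristic and its parameters are our own illustration, not part of the experimental protocol):

```python
def detect_overfitting(train_errors, test_errors, patience=5, tol=0.0):
    """Return the first epoch index at which the test error has not
    improved for `patience` consecutive epochs while the training error
    still decreased over that window -- a simple proxy for the onset of
    overfitting. Returns None if no such epoch exists."""
    best = float("inf")
    stale = 0
    for t, (tr, te) in enumerate(zip(train_errors, test_errors)):
        if te < best - tol:
            best = te  # test error improved: reset the stagnation counter
            stale = 0
        else:
            stale += 1
            # test error stagnant for `patience` epochs, training error still falling
            if stale >= patience and tr < train_errors[t - stale]:
                return t
    return None
```

Applied to the recorded error curves, a return value of None corresponds to the benign behavior shown in Figures 10 and 11, where train and test errors decrease together.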

Definition 3.1.
Let n, L ∈ N. A neural network Φ with input dimension n and L layers is a sequence of matrix-vector tuples Φ = ((A_1, b_1), (A_2, b_2), ..., (A_L, b_L)), where N_0 = n and N_1, ..., N_L ∈ N, and where each A_ℓ is an N_ℓ × N_{ℓ−1} matrix and b_ℓ ∈ R^{N_ℓ}.

Footnote 4: X_A denotes the indicator function of A.

Figure 1: Partition of Ω as in test-case [T2] (left) for p = 9 (the red and blue areas indicate the Ω_i), test-case [T3-F] (middle) for p = 4 (the red areas indicate the Ω_i), and test-case [T3-V] (right) for p = 8 (the red areas indicate the Ω_{i,y_{i+s^2}}).
(5) The training set (y^{i,tr})_{i=1}^{N_train} ⊂ Y consists of N_train := 20000 i.i.d. parameter samples, drawn with respect to the uniform probability measure on Y.

(6) The test set (y^{i,ts})_{i=1}^{N_test} ⊂ Y consists of N_test := 5000 i.i.d. parameter samples, drawn with respect to the uniform probability measure on Y.

Figure 2: Comparison of the ground-truth solution and the one predicted by the NN for an average (left) and a poorly performing (right) sample, for p = 16 with µ = 10^{-1} (top) and µ = 10^{-3} (bottom), for test-case [T2]. The percentage in brackets is the relative test error for the particular sample.

Figure 3: Comparison of the ground-truth solution and the one generated by the NN for an average (left) and a poorly performing (right) case, for µ = 10^{-1} with p = 6 (top) and p = 91 (bottom), for test-case [T4]. The percentage in brackets is the relative test error for the particular sample.

Figure 4: Mean relative test error for the sets of test-case [T1] for different values of σ. The horizontal axis shows the dimension of the parameter space p; the mean relative test error is shown on the vertical axis. Both axes use a logarithmic scale.

Figure 5: Mean relative test error for the sets of test-case [T2] for different values of µ, with p on the horizontal axis and the mean relative test error on the vertical axis. Both axes use a logarithmic scale.

Figure 6: Mean relative test error for the sets of test-case [T3], with p on the horizontal axis and the error on the vertical axis. Both axes use a logarithmic scale.

Figure 7: Mean relative test error for the sets of test-case [T4], with p on the horizontal axis and the error on the vertical axis. Only the horizontal axis is scaled logarithmically.

Figure 8: Mean relative training error for [T2] with p = 25 and different shifts µ.

Figure 9: Mean relative training error for [T3-V] with p = 50 and different shifts µ.

Figure 11: Mean relative training and test error for [T4] with p = 91 and µ = 10^{-1}.

Table 2: Mean relative test error for test-case [T1] for different parameter dimensions p, different scaling parameters σ, and shift µ = 1.
We observe the following mean relative test errors for the sets A_cb(p, µ).

Table 4: Mean relative test error for test-case [T2] and parameter dimensions p = s^2.

Table 6: Mean relative test error for test-case [T3-F] and different parameter dimensions p = s^2, with shift µ = 10^{-4} and radius 0.8/(2s).
Moreover, for the sets of cookies with variable radii A_cvr(p, µ), we find the following mean relative test errors:

Table 8: Mean relative test error for test-case [T3-V] and different parameter dimensions p = 2s^2.

Table 10: Mean relative test error for test-case [T4] with clipping value µ = 10^{-1} and different parameter dimensions p.