Learning-Informed Parameter Identification in Nonlinear Time-Dependent PDEs

We introduce and analyze a method of learning-informed parameter identification for partial differential equations (PDEs) in an all-at-once framework. The underlying PDE model is formulated in a rather general setting with three unknowns: physical parameter, state and nonlinearity. Inspired by advances in machine learning, we approximate the nonlinearity via a neural network, whose parameters are learned from measurement data. The latter is assumed to be given as noisy observations of the unknown state, and both the state and the physical parameters are identified simultaneously with the parameters of the neural network. Moreover, diverging from the classical approach, the proposed all-at-once setting avoids constructing the parameter-to-state map by explicitly handling the state as an additional variable. The practical feasibility of the proposed method is confirmed with experiments using two different algorithmic settings: a function-space algorithm based on analytic adjoints, as well as a purely discretized setting using standard machine learning algorithms.


Introduction
We study the problem of determining an unknown nonlinearity f from data in a parameter-dependent dynamical system

∂_t u = F(λ, u) + f(α, u),   u(0) = u_0.   (1)

Here, the state u is a function on a finite time interval (0, T) and a bounded Lipschitz domain Ω, and ∂_t u denotes the first-order time derivative. In (1), both F and f are nonlinear Nemytskii operators in λ, α, u; these Nemytskii operators are induced by nonlinear, time-dependent functions via [F(λ, u)](t) := F(t, λ, u(t)) and [f(α, u)](t, x) := f(α, u(t, x)), where we consistently abuse notation in this manner throughout the paper; see also Lemmas 2 and 4. We assume that F was specified beforehand from an underlying physical model, that the terms λ, u_0 are physical parameters (with λ = λ(x) depending only on space), and that α is a finite-dimensional parameter arising in the nonlinearity. Furthermore, the model (1) is equipped with Dirichlet or Neumann boundary conditions. Some examples of partial differential equations (PDEs) of the form (1) are diffusion models ∂_t u = Δu + f(α, u) with a nonlinear reaction term f(α, u), as follows [30]:
• f(α, u) = −αu(1 − u): Fisher equation in heat and mass transfer, combustion theory.
The underlying assumption of this work is that in some cases, the nonlinearity f is unknown due to simplifications or inaccuracies in the modeling process, or due to undiscovered physical laws. In such situations, our goal is to learn f from data. In order to realize this in practice, we need to use a parametric representation. For this, we choose neural networks, which have become widely used in computer science and applied mathematics due to their excellent representation properties; see for instance [19] for the classical universal approximation theorem, [27] for recent results indicating superior approximation properties of neural networks with particular activations (potentially at the cost of stability), and [9, 3] for general, recent overviews of the topic. Learning the nonlinearity f thus reduces to identifying parameters θ of a neural network N_θ such that N_θ ≈ f, turning the problem of learning a nonlinearity into a parameter identification problem of a particular form.
For the majority of this paper, the nonlinearity f will therefore not appear directly; instead, f will consistently be replaced by its neural network representation N θ , and our focus will be on showing the properties of N θ , rather than those of f .
A main point in our approach, which is motivated by feasibility for applications, is that learning the nonlinearity must be achieved only via indirect, noisy measurements y^δ ≈ M u of the state, with M a linear measurement operator. More precisely, we assume that K different measurements of different states u^k are available, where the different states correspond to solutions of the system (1) with different, unknown parameters (λ^k, α^k, u_0^k), but the same, unknown nonlinearity f, which is assumed to be part of the ground-truth model. The simplest form of M is a full observation of the states over time and space, i.e., M = Id, as in, e.g., (theoretical) population genetics. In other contexts, M could consist of discrete observations of u at time instances, i.e., M u = (u(t_i, ·))_{i=1}^{n_T}, t_i ∈ (0, T), as in materials science [31] and systems biology [4] (see also Corollary 32), or a Fourier transform as in MRI acquisition [2], etc. In most cases, M is linear, as is assumed here.
Our approach to address this problem is to use an all-at-once formulation that avoids constructing the parameter-to-state map (see for instance [21]). That is, we aim to identify all unknowns by solving a minimization problem of the form

min_{(λ, α, u_0, u, θ)} ‖G(λ, α, u_0, u, θ) − (0, 0, y^δ)‖² + R_1(λ, α, u_0, u) + R_2(θ),   (3)

where we refer to Section 2 for details on the function spaces involved. Here, G is a forward operator that incorporates the PDE model, the initial conditions and the measurement operator via G(λ, α, u_0, u, θ) = (∂_t u − F(λ, u) − N_θ(α, u), u(0) − u_0, M u), and R_1, R_2 are suitable regularization functionals.
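In a fully discretized setting, evaluating G amounts to stacking a time-discrete PDE residual, the initial-condition mismatch and the observations. The following sketch uses forward differences in time and toy placeholders for F, N_θ and M; all names and model choices here are hypothetical illustrations, not the paper's concrete setup:

```python
import numpy as np

def residual(u, u0, lam, F, N, M, dt):
    """All-at-once residual G = (u_t - F(lam, u) - N(u), u(0) - u0, M(u)).
    u is an (n_t, n_x) array of state snapshots; forward differences in time."""
    ut = (u[1:] - u[:-1]) / dt                 # discrete time derivative
    pde_res = ut - F(lam, u[:-1]) - N(u[:-1])  # PDE residual at each time step
    ic_res = u[0] - u0                         # initial-condition mismatch
    return pde_res, ic_res, M(u)

# toy placeholders (not the paper's concrete model)
F = lambda lam, u: lam * u                     # linear model part
N = lambda u: np.tanh(u)                       # stands in for the network N_theta
M = lambda u: u                                # full observation, M = Id

n_t, n_x, dt = 6, 4, 0.1
u0 = np.zeros(n_x)
u = np.zeros((n_t, n_x))                       # the zero state solves this toy model exactly
pde_res, ic_res, obs = residual(u, u0, lam=0.5, F=F, N=N, M=M, dt=dt)
print(np.abs(pde_res).max(), np.abs(ic_res).max())
```

Minimizing the squared norms of these residual blocks, plus regularization, is then a standard nonlinear least-squares problem over all unknowns simultaneously.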
Existing research towards learning PDEs and all-at-once identification. Discovering governing PDEs from data is an active topic in many areas of science and engineering. With advances in computational power and mathematical tools, there have been numerous recent studies on data-driven discovery of hidden physical laws. One novel technique is to construct a rich dictionary of candidate functions, such as polynomials, derivatives, etc., and to then use sparse regression to determine the candidates that most accurately represent the data [35, 5, 33]. This sparse identification approach yields a completely explicit form of the differential equation, but requires an abundant library of basis functions specified beforehand. In this work, we take the viewpoint that PDEs are constructed from principal physical laws. As it preserves the underlying equation and learns only some unknown components of the model, e.g., f in (1), our suggested approach is capable of refining approximate models while staying more faithful to the underlying physics.
Besides the machine learning part, the model itself may contain unknown physical parameters belonging to some function space. This means that if the nonlinearity f is successfully learned, one can insert it into the model. One thus has a learning-informed PDE, and can then proceed via classical parameter identification. The latter problem was studied in [11] for stationary PDEs, where f is learned from training pairs (u, f(u)). That paper emphasizes analysis of the error propagating from the neural-network-based approximation of f to the parameter-to-state map and the reconstructed parameter.
In reality, one does not have direct access to the true state u, but only partial or coarse observations of u under some noise contamination. This affects the creation of training data pairs (u, f(u)) with f(u) = ∂_t u − F(u) for the process of learning f, e.g., in [11]. Indeed, with a coarse measurement of u, for instance u ∈ L²((0, T) × Ω), one can evaluate neither ∂_t u nor terms such as Δu that may appear in F(u). Moreover, with discrete observations, e.g., a snapshot y = (u(t_i, ·))_{i=1}^{n_T}, t_i ∈ (0, T), one is unable to compute ∂_t u for the training data.
For this reason, we propose an all-at-once approach to identify the nonlinearity f, the state u and the physical parameter simultaneously. In comparison to [11], our approach bypasses the training process for f and accounts for discrete data measurements. The all-at-once formulation avoids constructing the parameter-to-state map, which is nonlinear and often involves restrictive conditions [16, 20, 21, 22, 28]. Additionally, we here consider time-dependent PDE models.
For discovering nonlinearities in evolutionary PDEs, the work in [7] suggests an optimal control problem for nonlinearities expressed in terms of neural networks. Note that the unknown state still needs to be determined through a control-to-state map, i.e., via the classical reduced approach, as opposed to the new all-at-once approach.
While [11, 7] are the recent publications most closely related to our work, we also mention the very recent preprint [12], an extension of [11] that appeared independently and after the original submission of our work. Furthermore, there is a wealth of literature on deep learning that has emerged over the last decade; for an authoritative review of machine learning in the context of inverse problems, we refer to [1]. For the regularization analysis, we follow the well-known theory put forth in [13, 23, 26, 37]. It is worthwhile to note that since this work, to the knowledge of the authors, is the first attempt at applying an all-at-once approach to learning-informed PDEs, our focus will be on this novel concept itself, rather than on obtaining minimal regularity assumptions on the involved functions, in particular on the activation functions. In subsequent work, we might further improve upon this by considering, e.g., existing techniques from a classical optimal control setting with non-smooth equations [6] or techniques to deal with non-smoothness in the context of training neural networks [8].
Contributions. Besides introducing the general setting of identifying nonlinearities in PDEs via indirect, parameter-dependent measurements, the main contributions of our work are as follows. Exploiting an all-at-once setting that handles both the state and the parameters explicitly as unknowns, we provide well-posedness results for the resulting learning and learning-informed parameter identification problems. This is achieved for rather general, nonlinear PDEs and under local Lipschitz assumptions on the activation function of the involved neural network. Further, for the learning-informed parameter identification setting, we establish the tangential cone condition for the neural-network part of our model. Together with suitable PDEs, this yields local uniqueness results as well as local convergence results for iterative solution methods for the parameter identification problem. We also provide a concrete application of our framework to parabolic problems, where we motivate our function-space setting by a unique existence result for the learning-informed PDE. Finally, we consider a case study in a Hilbert space setting, where we compute function-space derivatives of our objective functional to implement the Landweber method as a solution algorithm. Using this algorithm, as well as a parallel setting based on the ADAM algorithm [25], we provide numerical results that confirm the practical feasibility of our approach.
Organization of the paper. Section 2 introduces learning-informed parameter identification and the abstract setting. Section 3 examines existence, stability and solution methods for the minimization problem. Section 4 focuses on the learning-informed PDE and analyzes some problem settings. Finally, in Section 5 we present a complete case study, from setup to numerical results.

Problem setting

Notation and basic assertions
Throughout this work, Ω ⊂ R^d will always be a bounded Lipschitz domain, where additional smoothness will be required and specified as necessary. We use standard notation for spaces of continuous, integrable and Sobolev functions with values in Banach spaces; see for instance [10, 32], in particular [32, Section 7.1] for Sobolev-Bochner spaces and associated concepts such as time derivatives of Banach-space-valued functions. For an exponent p ∈ [1, ∞], we denote by p* the conjugate exponent, i.e., 1/p + 1/p* = 1. We will frequently use the continuous embedding of W^{l,p}(Ω) into L^q(Ω), which exists for q ≼ dp/(d − lp), where the notation means: if lp < d, then q ≤ dp/(d − lp); if lp = d, then q < ∞; and if lp ≥ d, then q = ∞. An example of such an embedding, which will be used frequently in Section 4, is H¹(Ω) → L⁶(Ω) for d = 3. We further denote by C_{W^{l,p}→L^q} the operator norm of the corresponding continuous embedding operator.
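As a worked instance of the embedding rule, the stated example H¹(Ω) → L⁶(Ω) for d = 3 follows by taking l = 1 and p = 2:

```latex
% W^{l,p}(\Omega) \hookrightarrow L^q(\Omega) requires q \le \tfrac{dp}{d-lp} when lp < d.
% For H^1(\Omega) = W^{1,2}(\Omega) and d = 3: l = 1, p = 2, so lp = 2 < 3 = d and
\frac{dp}{d - lp} = \frac{3 \cdot 2}{3 - 1 \cdot 2} = 6,
\qquad\text{hence}\qquad
H^1(\Omega) \hookrightarrow L^q(\Omega) \ \text{for all } q \le 6 .
```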
We also use →→ to denote compact embedding (see [32, Theorem 1.21]). The notation C indicates generic positive constants. Given Banach spaces X, Y, we denote by ‖·‖_{X→Y} the operator norm ‖·‖_{L(X,Y)}, and by ⟨·, ·⟩_{X,X*} the duality pairing between X and X*. We write C_locLip(X, Y) for the space of locally Lipschitz continuous functions between X and Y. Furthermore, A • B denotes the Frobenius inner product of matrices A, B, while AB stands for matrix multiplication and A^T for the transpose of A. The notation B^X_ρ(x†) denotes the ball of center x† and radius ρ > 0 in X. For functions mapping between Banach spaces, the term weak continuity will always refer to weak-weak continuity, i.e., continuity w.r.t. weak convergence in both the domain and the image space.

The dynamical system
For the general setting considered in this work, we use the following set of definitions and assumptions. A concrete application where these abstract assumptions are satisfied can be found in Section 4 below.

Assumption 1.
• The space X (parameter space) is a reflexive Banach space. The spaces V (state space), W (image space under the model operator), Y (observation space) and Ṽ are separable, reflexive Banach spaces. In view of initial conditions, we further require U_0 (initial data space) to be a reflexive Banach space, and H to be a separable, reflexive Banach space.
• We assume the following embeddings: Further, Ṽ will always be such that either L p(Ω) → Ṽ or Ṽ → L p(Ω).
• The function F : (0, T) × X × V → W is such that for any fixed parameter λ ∈ X, F(·, λ, ·) : (0, T) × V → W meets the Carathéodory conditions, i.e., F(·, λ, v) is measurable with respect to t for all v ∈ V and F(t, λ, ·) is continuous with respect to v for almost every t ∈ (0, T). Moreover, for almost all t ∈ (0, T) and all λ ∈ X, v ∈ V, the growth condition

‖F(t, λ, v)‖_W ≤ γ(t) B(‖λ‖_X, ‖v‖_H)   (7)

is satisfied for some B : R² → R such that b ↦ B(a, b) is increasing for each a ∈ R, and some γ ∈ L²(0, T).
• We define the overall state space and image space, including time dependence, as

V := {u ∈ L²(0, T; V) : ∂_t u ∈ W} and W := L²(0, T; W),

respectively, with the norms ‖u‖²_V := ‖u‖²_{L²(0,T;V)} + ‖∂_t u‖²_W and ‖w‖²_W := ∫₀^T ‖w(t)‖²_W dt.
• We define the overall observation space including time as Y := L²(0, T; Y), with norm ‖y‖²_Y := ∫₀^T ‖y(t)‖²_Y dt, and the corresponding measurement operator M : V → Y, (M u)(t) := M u(t).
• We further assume the following embeddings for the state space:

V → C(0, T; H) and U_0 → H.   (6)

The embeddings in (6) are very natural in the context of PDEs. The state space V usually has a certain smoothness such that its image under some spatial differential operators belongs to W. For the motivation of V → C(0, T; H), the abstract setting in [32, Lemma 7.3] (see Appendix A) is an example. Note that due to V → C(0, T; H), clearly U_0 = H is a feasible choice for the initial space; for the sake of generality, only U_0 → H is assumed in (6).
Under Assumption 1, the function F induces a Nemytskii operator on the overall spaces.
Lemma 2. Let Assumption 1 hold. Then the function F : (0, T) × X × V → W induces a well-defined Nemytskii operator F : X × V → W given as [F(λ, u)](t) := F(t, λ, u(t)).

Proof. Under the Carathéodory assumption, t ↦ F(t, λ, u(t)) is Bochner measurable for every λ ∈ X and u ∈ V. For such λ, u, we further estimate

∫₀^T ‖F(t, λ, u(t))‖²_W dt ≤ ∫₀^T γ(t)² B(‖λ‖_X, ‖u(t)‖_H)² dt ≤ ‖γ‖²_{L²(0,T)} B(‖λ‖_X, C ‖u‖_V)² < ∞,

by the growth condition (7), b ↦ B(a, b) being increasing, and the embedding V → C(0, T; H). This allows us to conclude that t ↦ F(t, λ, u(t)) is Bochner integrable (see [10, Theorem II.2.2]) and that the Nemytskii operator F : X × V → W is well-defined.
Note that we use the same notation for the function F : (0, T ) × X × V → W and the corresponding Nemytskii operators.

Basics of neural networks
As outlined in the introduction, the unknown nonlinearity f will be represented by a neural network.In this work, we use a rather standard, feed-forward form of neural networks defined as follows.
A feed-forward neural network of depth L ∈ N with architecture (n_i)_{i=0}^L is a map N_θ : R^{n_0} → R^{n_L} of the form

N_θ := L_{θ_L} ∘ σ ∘ L_{θ_{L−1}} ∘ σ ∘ · · · ∘ σ ∘ L_{θ_1},

where L_{θ_l} : R^{n_{l−1}} → R^{n_l}, for z ∈ R^{n_{l−1}}, is given as L_{θ_l}(z) := ω_l z + β_l. Here, θ_l = (ω_l, β_l) ∈ R^{n_l × n_{l−1}} × R^{n_l} summarizes all the parameters of the l-th layer, and σ is a fixed, pointwise-applied nonlinearity. Given a depth L ∈ N and architecture (n_i)_{i=0}^L, we also use Θ to denote the finite-dimensional vector space containing all possible parameters θ = (θ_1, ..., θ_L) of neural networks with this architecture.
In this work, neural networks will be used to approximate the nonlinearity f : R m+1 → R. Consequently, we always deal with neural networks N θ : R m+1 → R, i.e., n 0 = m + 1 and n L = 1.
As such, rather than showing that f induces a well-defined Nemytskii operator, we instead show that N θ does so.A sufficient condition for this to be true is the continuity of the activation function σ, as the following Lemma shows.
We again use the same notation for N θ : R m ×R → R and the corresponding Nemytskii operator.
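To make the construction concrete, the following sketch builds a small feed-forward network N_θ : R^{m+1} → R (with m = 1) and applies it pointwise to sampled values (α, u(t, x)), which is exactly the Nemytskii action described above. The architecture and parameter values are arbitrary illustrations:

```python
import numpy as np

def make_network(weights, biases, sigma=np.tanh):
    """Feed-forward network: affine maps composed with a pointwise activation
    sigma on all but the last layer (identity on the output layer)."""
    def N(z):
        for W, b in zip(weights[:-1], biases[:-1]):
            z = sigma(W @ z + b)
        return weights[-1] @ z + biases[-1]
    return N

# tiny illustrative architecture: n_0 = 2 (i.e. m = 1), n_1 = 3, n_2 = 1
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
N = make_network(weights, biases)

# Nemytskii action: apply N pointwise to sampled values (alpha, u(t_i, x_j))
alpha = 0.7
u_samples = np.linspace(-1.0, 1.0, 5)
values = [float(N(np.array([alpha, u]))[0]) for u in u_samples]
print(values)
```

In the infinite-dimensional setting, the same pointwise application is carried out for almost every (t, x), and the analysis below ensures that this yields a well-defined operator between the function spaces involved.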

The learning problem
As the nonlinearity f is represented by a neural network N_θ : R^{m+1} → R, we rewrite the partial differential equation (PDE) model (1) into the form

∂_t u = F(λ, u) + N_θ(α, u),   u(0) = u_0,

and introduce the forward operator G, which incorporates the observation operator M, as

G(λ, α, u_0, u, θ) := (∂_t u − F(λ, u) − N_θ(α, u), u(0) − u_0, M u).   (12)

Here, U_0 and H are the spaces related to the initial condition and the trace operator; that is, one has unknown initial data u_0 ∈ U_0 and trace operator (·)|_{t=0} : V ∋ u ↦ u(0) ∈ H. With U_0 → H as assumed in (6), one has u(0) − u_0 ∈ H.
The minimization problem for the learning process is then given by

min_{(λ^k, α^k, u_0^k, u^k)_k, θ}  Σ_{k=1}^K ‖G(λ^k, α^k, u_0^k, u^k, θ) − (0, 0, y^{δ,k})‖² + R_1((λ^k, α^k, u_0^k, u^k)_k) + R_2(θ).   (13)

Assume now that the particular parameter θ has been learned. As in (4), one can then solve other parameter identification problems, given a new measured datum y ≈ M u, by solving

min_{(λ, α, u_0, u)} ‖G(λ, α, u_0, u, θ) − (0, 0, y)‖² + R_1(λ, α, u_0, u).   (14)

Learning-informed parameter identification

Well-posedness of minimization problems
We start our analysis by studying existence theory for the optimization problems (13) and (14), where the unknown nonlinearity is replaced by a neural network approximation. To this aim, we first establish weak closedness of the forward operator. In what follows, the architecture of the network N is considered fixed.
Lemma 5. Let Assumption 1 hold. Then, if σ ∈ C_locLip(R, R), the operator N : R^m × V × Θ → W is weakly continuous. Further, if either

F(t, ·, ·) : X × H → W is weakly continuous for almost all t ∈ (0, T),   (15)

or

V →→ H and H → W*,   (16)

and (−F) is pseudomonotone for almost all t ∈ (0, T),   (17)

then F is weakly closed. Moreover, if N is weakly continuous and F is weakly closed, then G as in (12) is weakly closed.
Proof. We first consider weak closedness of G. To this aim, recall that G is given as in (12). First note that M ∈ L(V, Y) by (9). Weak closedness of ((·)|_{t=0}, Id) : V × U_0 → H follows from weak continuity of Id : U_0 → H as U_0 → H, and from weak-weak continuity of (·)|_{t=0} : V → H, which follows from ‖u(0)‖_H ≤ sup_{t∈[0,T]} ‖u(t)‖_H ≤ C‖u‖_V for C > 0 and V → C(0, T; H). Weak continuity of d/dt : V → W results from the choice of norms in the respective spaces. Thus, weak closedness of G follows once F is weakly closed and N is weakly continuous.
Weak continuity of N. In the case Ṽ → L^p(Ω), this follows together with [32, Lemma 7.7] (see Appendix A); in the other case, L^p(Ω) → Ṽ, it follows directly from [32, Lemma 7.7]. Based on this, we deduce u_n → u in L²(0, T; L^p(Ω)). The claim then follows as argued in the proof of Lemma 4. Above, L(M) denotes the Lipschitz constant of (α, y, θ) ↦ N_θ(α, y) on the ball with radius M, and p/(p − 1) = ∞ in case p = 1. This shows that here we even obtain weak-strong continuity of N_θ, which is stronger than the weak-weak continuity required.
Weak closedness of F. To show weak closedness of the Nemytskii operator F : X × V → W, we consider two cases. We first consider the case that F(t, ·) is weakly continuous. To this aim, take (λ_n, u_n)_n to be a sequence weakly converging to (λ, u) in X × V. As V → C(0, T; H), the sequence (u_n)_n is bounded in C(0, T; H). Now, we show u_n(t) ⇀ u(t) in H for all t ∈ (0, T) via the fact that the pointwise evaluation (·)(t) : V → H is, for any t ∈ [0, T], linear and bounded, thus weak-weak continuous. Indeed, its linearity is clear, and boundedness follows from ‖u(t)‖_H ≤ sup_{s∈[0,T]} ‖u(s)‖_H ≤ C‖u‖_V. From this, we obtain u_n(t) ⇀ u(t) in H, thus (u_n(t), λ_n) ⇀ (u(t), λ) in H × X for all t ∈ (0, T). Using the growth condition (7), we now estimate the integrand pointwise, where C(‖λ‖_X, ‖u‖_V) > 0 can be obtained independently of n due to V → C(0, T; H), B being increasing, and boundedness of ((u_n, λ_n))_n in V × X. Since F is assumed to be weakly continuous on H × X, we have ξ_n(t) → 0 pointwise in t as n → ∞. Hence, applying Lebesgue's dominated convergence theorem yields convergence of the time integral to 0, and thus weak convergence of F(λ_n, u_n) to F(λ, u) in W, as claimed. Accordingly, if condition (15) holds, we obtain weak-weak continuity of F. Now we consider the second case, i.e., (16)-(17), for weak closedness of F. Assume that V →→ H as in (16), H → W*, and that −F is pseudomonotone as in (17). Given (u_n, λ_n) ⇀ (u, λ) in V × X and moving to a subsequence indexed by (n_k)_k, we have ξ_{n_k}(t) → 0 as k → ∞ for almost every t ∈ (0, T). As lim inf_{k→∞} ξ_{n_k}(t) = 0, pseudomonotonicity (as in (17)) implies a liminf estimate (20) for any v ∈ W*. Further, from the Fatou-Lebesgue theorem, we obtain the corresponding estimate for the time-integrated quantities, where the last step follows from (20) and from weak convergence of F(λ_n, u_n) to g in W.
As this estimate is valid for any v ∈ W*, we conclude that F is weakly closed on X × V. Existence of a solution to (13) and (14) now follows from a standard application of the direct method [13, 37], using weak closedness of G and weak lower semi-continuity of the involved quantities.
Remark 7 (Stability). We note that under the assumptions of Proposition 6, stability for the minimization problems (13) and (14) also follows with standard arguments; see for instance [17, Theorem 3.2]. Here, stability means that for a sequence of data (y_n)_n converging to some y, any corresponding sequence of solutions admits a weakly convergent subsequence, and any limit of such a subsequence is a solution of the original problem with data y.
Next we deal with minimization problem (13) in the limit case where the given data converges to a noise-free ground truth, and the PDE should be fulfilled exactly.Our result in this context is a direct extension of classical results as provided for instance in [17], but since also variants of this result will be of interest, we provide a short proof.
Proposition 8 (Limit case). With the assumptions of Proposition 6 and parameters β_e, β_M > 0, consider the parametrized learning problem (21), and assume that, for data ((y†)^k)_k, there exist ((λ̂^k, α̂^k, û_0^k, û^k)_k, θ̂) such that e(λ̂^k, α̂^k, û_0^k, û^k, θ̂) = 0 and M û^k = (y†)^k. Then any sequence of solutions to (21) with parameters β_e^n, β_M^n and data y_n admits a weakly convergent subsequence, and any limit of such a subsequence is a solution to (22).

Proof. Let (λ̂^k, α̂^k, û_0^k, û^k)_k and θ̂ be arbitrary such that e(λ̂^k, α̂^k, û_0^k, û^k, θ̂) = 0 and M û^k = (y†)^k, and let (((λ_n^k, α_n^k, u_{0,n}^k, u_n^k)_k, θ_n))_n be any sequence of solutions to (21) with parameters β_e^n, β_M^n. By optimality, the estimate (23) holds. By weak precompactness of the sublevel sets of R_1 and R_2 and convergence of the data, this sequence admits a weakly convergent subsequence; let ((λ^k, α^k, u_0^k, u^k)_k, θ) be the limit of such a subsequence, which we again denote by the same indices. Closedness of G, together with lower semi-continuity of the norm ‖·‖_{W×H} and the estimate (23) (possibly moving to another non-relabeled subsequence), then yields that e(λ^k, α^k, u_0^k, u^k, θ) = 0 and M u^k = (y†)^k for all k. Again using the estimate (23), now together with weak lower semi-continuity of R_1, R_2, we further obtain optimality of the limit. Since (λ̂^k, α̂^k, û_0^k, û^k)_k and θ̂ were arbitrary solutions of e(λ̂^k, α̂^k, û_0^k, û^k, θ̂) = 0 and M û^k = (y†)^k, the limit is a solution of (22) as claimed. At last, in case the solution to (22) is unique, weak convergence of the entire sequence follows by a standard argument, using that any subsequence contains another subsequence that weakly converges to the same limit.
Remark 9 (Different limit cases). The above result considers the limit case of both fulfilling the PDE exactly and matching noise-free ground-truth measurements. Variants can easily be obtained as follows: in case only the PDE should be fulfilled exactly, one can consider β_M fixed and only β_e converging to infinity (at an arbitrary rate), such that the resulting limit solution will be a solution of the reduced setting. Likewise, one can consider the case that β_e is fixed and β_M converges to infinity appropriately in dependence on the noise level δ, in which case the limit solution solves the all-at-once setting with the hard constraint M u^k = (y†)^k; see [18] for some general results in that direction. The corresponding assumption of existence of ((λ̂^k, α̂^k, û_0^k, û^k)_k, θ̂) such that e(λ̂^k, α̂^k, û_0^k, û^k, θ̂) = 0 and M û^k = (y†)^k can be weakened accordingly in both cases.
Further, note that the convergence result, as well as its variants, can be deduced in exactly the same way for the learning-informed parameter identification problem (14).
Remark 10 (Uniqueness of minimum-norm solutions). A sufficient condition for uniqueness of a minimum-norm solution, and thus for convergence of the entire sequence of minimizers as stated in Proposition 8, is the tangential cone condition together with existence of a solution (λ̂^k, û_0^k, û^k) to the PDE such that M û^k = (y†)^k; see [23, Proposition 2.1]. In Section 3.3 below, we discuss this condition in more detail and provide a result which, together with Remark 19, ensures that this condition holds for some particular choices of F and N_θ. Regarding solvability of the PDE, we refer to Proposition 24 below, where a particular application is considered.

Differentiability of the forward operator
Solution methods for nonlinear optimization problems, like gradient descent or Newton-type methods, require uniform boundedness of the derivative of G. Differentiability of G is a question of differentiability of F and N, which is discussed in the following. Note that there, and henceforth, we denote by H′(a) : A → B the Gâteaux derivative of a function H : A → B and define Gâteaux differentiability in the sense of [37, Section 2.6], i.e., we require H′(a) to be a bounded linear operator. The basis for differentiability of the forward operator is the following lemma, which is a direct extension of [37, Lemma 4.12].

Lemma 11. Let A, B, S be Banach spaces such that A → S. For Σ ⊂ R^N open and bounded, and r ∈ [1, ∞), let 𝒜, ℬ be Banach spaces such that 𝒜 → L^r(Σ, A), 𝒜 → L^∞(Σ, S) and L^r(Σ, B) → ℬ. Further, let H : Σ × A → B be a function such that H(z, ·) is Gâteaux differentiable for every z ∈ Σ with derivative H′(z, ·), and such that H′ is locally Lipschitz continuous in the sense that, for any M > 0, there exists L(M) > 0 such that (26) holds for every a, ξ ∈ A with max{‖a‖_S, ‖ξ‖_S} ≤ M. Then the Nemytskii operators H : 𝒜 → ℬ and H′, given as H(a)(z) = H(z, a(z)) and H′(a)(z) = H′(z, a(z)), are well-defined, and H is Gâteaux differentiable with derivative H′. Further, H′ is locally bounded in the sense that, for any bounded set Ã ⊂ 𝒜, sup_{a∈Ã} ‖H′(a)‖ < ∞.
Local boundedness as claimed follows directly by choosing M := sup_{a∈Ã} sup_{t∈(0,T)} ‖a(t)‖_S + 1 and integrating the r-th power of (26) over time.
In addition, assume that F satisfies the following local Lipschitz continuity condition: for all M ≥ 0, there exists L(M) > 0 such that, for all v_i ∈ V and λ_i ∈ X, i = 1, 2, with max{‖v_i‖_H, ‖λ_i‖_X} ≤ M and for almost every t ∈ (0, T), the estimate (27) holds. Then G is Gâteaux differentiable, and G′(·) is locally bounded in the sense specified in Lemma 11.
Proof. First note that it suffices to show the corresponding differentiability and local boundedness assertions for the different components of G, given as u ↦ ∂_t u, F, N, (u, u_0) ↦ u(0) − u_0, and M. For all except F and N, the corresponding assertions are immediate, hence we focus on the latter two.
Regarding F, this is an immediate consequence of Lemma 11 with the corresponding choices of spaces and H = F. For N, this is again an immediate consequence of Lemma 11 with the corresponding choices. Remark 13. For stronger image spaces W → L^q(Ω) for all q ∈ [1, ∞), differentiability of F remains valid if (27) holds, while differentiability of N requires a smoother activation function, e.g., the one suggested in Remark 29 below.

Lipschitz continuity and the tangential cone condition
In this section, we focus on showing a rather strong Lipschitz-type result for the neural network. This property allows us to apply (finite-dimensional) gradient-based algorithms to learn the neural network, where the Lipschitz constant and its derivatives are used to determine the step size. Moreover, by this Lipschitz continuity, the tangential cone condition for (14) can be verified. This condition, together with solvability of the learning-informed PDE, answers the important question of uniqueness of a minimizer in the limit case of (14), as mentioned in Remark 10.
For ease of notation, we assume in this lemma that the outer layer of the neural network has activation σ, as in the lower layers.Adapting the proof for σ = Id in the last layer is straightforward.
Lemma 14 (Lipschitz properties of neural networks). Consider an L-layer neural network N on a bounded set B of inputs and parameters (cf. Lemma 4). Denote by N^i_{θ^i} the i lowest layers of the neural network, depending only on z and on the i lowest-index pairs of parameters θ^i. Fix now a layer l, 1 ≤ l ≤ L, as well as (z, θ), (z, θ̄), (z, θ̃) ∈ B, where θ̄ differs from θ only in that its l-th weight is replaced by some ω̄_l, and θ̃ differs from θ only in that its l-th bias is replaced by some β̃_l. Then N satisfies Lipschitz estimates in z, ω_l and β_l, while its derivatives with respect to z, ω_l and β_l, respectively, satisfy corresponding Lipschitz estimates with constants C^z_i, C^{ω_l}_i, C^{β_l}_i, where one defines C^z_{L+1} := C^{ω_l}_{L+1} := C^{β_l}_{L+1} := 0 and the remaining constants by backward recursion. Proof. See Appendix B.
Remark 15. If σ is locally Lipschitz continuous on R, the existence of C_σ, C_σ′ and the s_i is clear whenever B is a bounded set. Thus, it is a direct consequence of Lemma 14 (or follows simply from the properties of the functions N is composed of) that the mapping (z, θ) ↦ N(z, θ), restricted to any bounded set, is bounded, Lipschitz continuous and has a Lipschitz continuous derivative. This is relevant for gradient-based optimization algorithms for solving the learning problem (13), where Lipschitz continuity of the derivative of the objective function is a key ingredient for (local) convergence; see for instance [36] for a result in Hilbert spaces. In particular, Lipschitz continuity of θ ↦ N(z, θ) for fixed z is useful for the learning problem (13) when the exact (λ, u) is known. In this case, one simply learns the finite-dimensional parameter θ, and standard convergence results on gradient-based methods in finite-dimensional vector spaces apply; see, e.g., [34, Section 5.3].
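As a toy instance of the gradient-based learning alluded to in Remark 15, the following sketch fits the parameters of a one-unit "network" θ ↦ θ_1 tanh(θ_2 z) to samples of a known nonlinearity by plain gradient descent with a fixed step size; the target function, step size and iteration count are illustrative assumptions:

```python
import numpy as np

# Fit theta in N(z, theta) = theta[0] * tanh(theta[1] * z) to samples of a
# target nonlinearity f, by plain gradient descent with a fixed step size.
z = np.linspace(-2.0, 2.0, 50)
f = 0.8 * np.tanh(1.5 * z)                    # "ground truth" nonlinearity

def loss_grad(theta):
    """Mean-squared loss and its gradient with respect to theta."""
    a, b = theta
    t = np.tanh(b * z)
    r = a * t - f                             # pointwise residual
    da = 2.0 * np.mean(r * t)
    db = 2.0 * np.mean(r * a * (1.0 - t**2) * z)
    return np.mean(r**2), np.array([da, db])

theta = np.array([0.3, 0.5])
step = 0.5                                    # small relative to the local Lipschitz constant
for _ in range(5000):
    val, grad = loss_grad(theta)
    theta = theta - step * grad
print(val, theta)
```

Since the derivative of the objective is locally Lipschitz, a sufficiently small fixed step guarantees descent; in practice one would estimate the Lipschitz constant via Lemma 14 or use a line search.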
Based on these Lipschitz estimates, we can study the tangential cone condition for the problem (14), given a learned N θ .For this, we assume that N θ (α, u) = N θ (u).
Analyzed in the all-at-once setting (14), the tangential cone condition reads as in Condition 16, holding for all (λ, u_0, u), (λ̃, ũ_0, ũ) ∈ B^{X×U_0×V}_ρ(λ†, u_0†, u†), where F′ and N′ are the Gâteaux derivatives. The tangential cone condition depends strongly on the PDE model F and the architecture of N. By the triangle inequality, a sufficient condition for (33) to hold is that the tangential cone condition holds for F and for N separately. The tangential cone condition, in combination with solvability of the equation G(x) = 0, ensures uniqueness of a minimum-norm solution [23, Proposition 2.1] (see Appendix A). Solvability of the operator equation G(x) = 0 in the all-at-once formulation is the question of solvability of the learning-informed PDE with exact measurements, i.e., δ = 0. For solvability of the learning-informed PDE, we refer to Proposition 24 in Section 4. In the following, we focus on the tangential cone condition for the neural networks by studying Condition 16 for G := N_θ.
Lemma 17 (Tangential cone condition for neural networks). The tangential cone condition in Condition 16 for G = N_θ : V → W with fixed parameter θ holds in any ball B^V_ρ(u†) provided that the radius ρ, depending on the Lipschitz constants in Lemma 14, is sufficiently small.
Proof. For u, ũ ∈ B^V_ρ(u†), we have for almost all (t, x) ∈ (0, T) × Ω that u(t, x), ũ(t, x) ∈ B for some bounded set B. Thus, we can use Lemma 14 with such a B, and in particular the estimate (62) for z = u(t, x), to obtain the claimed bound. We note that having full observation, i.e., M = Id, is crucial for establishing the tangential cone condition, as it allows us to link the estimate for u − ũ to ‖M(u − ũ)‖_Y, yielding the last quantity on the right-hand side of (33). The necessity of full observation has also been mentioned in [24]. Now, using [23, Proposition 2.1] together with Lemma 17, a uniqueness result follows.
Proposition 18 (Uniqueness of the minimizer for the limit case of (14)). Assume that the conditions in Lemma 17 are satisfied. Moreover, suppose that the tangential cone condition for F holds in B^{X×U_0×V}_ρ(λ†, u_0†, u†) and that the equation G(λ, u_0, u, θ) = 0, with G as in (12) and θ fixed, is solvable in B^{X×U_0×V}_ρ(λ†, u_0†, u†). Then the limit case of the parameter identification problem (14) admits a unique minimizer in this ball.

Remark 19. We refer to Section 4 below for solvability of the learning-informed PDE in an application, and to [24] for concrete choices of F and of function-space settings such that the tangential cone condition can be verified. Note that, while the tangential cone condition for the limit case of the parameter identification problem (14) can be confirmed as above, the same question for the learning problem (13) remains open.
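The mechanism behind Lemma 17 can be observed numerically: for a smooth activation, the linearization error of N_θ at nearby points is quadratic in the distance, so the constant in the tangential cone estimate shrinks with the radius of the ball. A scalar toy network with hypothetical parameters suffices to see this:

```python
import numpy as np

def N(z, a=1.2, b=0.9):
    """Scalar toy 'network' with one hidden tanh unit (hypothetical weights)."""
    return a * np.tanh(b * z)

def dN(z, a=1.2, b=0.9):
    """Derivative of N with respect to its input z."""
    return a * b * (1.0 - np.tanh(b * z) ** 2)

u_ref = 0.3
ratios = []
for rho in [1e-1, 1e-2, 1e-3]:
    u, ut = u_ref + rho, u_ref - rho          # two points in a ball of radius rho
    lin_err = abs(N(u) - N(ut) - dN(ut) * (u - ut))
    # constant c in |N(u) - N(ut) - N'(ut)(u - ut)| <= c |u - ut|
    ratios.append(lin_err / abs(u - ut))
print(ratios)
```

The ratio decreases proportionally to the radius, so choosing the ball small enough makes the tangential cone constant smaller than one, which is exactly what the lemma exploits.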

Application
In this section, as a special case of the dynamical system (1), we examine a class of general parabolic problems, given as (34), where Ω ⊂ R^d is a bounded C²-domain, with d ∈ {1, 2, 3} being relevant in practice. The nonlinearity f, which will later be replaced by a neural network, is assumed to be given as a Nemytskii operator. We initially work with the following parameter spaces, where p_a > d, and, for existence of a solution, we will require the constraint 0 < a_min ≤ a(x) ≤ a_max for a.e. x ∈ Ω.
Thus, the overall parameter space X is given as X = X_ϕ × X_c × X_a.

Unique existence results for (34)
Our next goal is to study unique existence for (34). The main purpose of this is to inform a suitable choice of function space setting for the all-at-once formulations (13) and (14), even though unique existence is not required there. A unique existence result is also of interest for studying the reduced setting, where well-definedness of the parameter-to-state map is needed. We proceed in two steps: In the first step, we prove that (34) admits a unique solution. In the second step, we lift the regularity of u to a somewhat stronger space to achieve boundedness of the solution in time and space, which later serves our purpose of working with a neural network acting pointwise. It is worth noting that the study of unique existence is first carried out for general classes of nonlinearities f satisfying specific assumptions, such as pseudomonotonicity and a growth condition; see Lemmas 21 and 23 below. The nonlinearity f as a neural network is then considered in Proposition 24 and Remark 25. Before investigating (34), we summarize the unique existence theory as provided in [32, Theorems 8.18, 8.31] for the autonomous case.
Theorem 20. Let V be a Banach space and Ĥ a Hilbert space, and assume the following: S1 (pseudomonotonicity) and S2 (semicoercivity, with some c_0 > 0 and some seminorm) hold, and S3: F, u_0 and ϕ satisfy the regularity condition F(u_0) − ϕ ∈ Ĥ, u_0 ∈ V. Then the abstract Cauchy problem admits a unique solution. By verifying the conditions in Theorem 20, we now obtain unique existence as follows.
Lemma 21 (Unique existence). Let the nonlinearity f(α, ·): H^1_0(Ω) → H^1_0(Ω)* be given as the Nemytskii mapping of a measurable function f(α, ·): R → R satisfying the stated monotonicity and growth conditions. Then equation (34), with parameters ϕ, c, a and u_0 such that (35), (36) hold, admits a unique solution. Proof. We verify the conditions in Theorem 20. First, note that due to measurability and the growth constraint, the Nemytskii mapping f(α, ·): V → V*, where we set f(α, u)(w) := ∫_Ω f(α, u(x)) w(x) dx for w ∈ V, is indeed well-defined. Since 0 < a_min ≤ a almost everywhere on Ω and c ∈ L²(Ω), the corresponding estimate implies semicoercivity as in S2 with c_0, c_2 as above and c_1 = 0. Also, the second estimate in the regularity condition S3 now follows directly, where again we employ monotonicity of f(α, ·).
In order to verify pseudomonotonicity S1, we first note that F: V → V* is bounded, i.e., it maps bounded sets to bounded sets, and continuous, where the latter follows from continuity of F, which is immediate, and continuity of f, which holds by assumption. Using this, one can apply [14, Lemma 6.7] to conclude pseudomonotonicity, provided the following statement is true. The latter follows since, by the compact embedding V ↪↪ Ĥ, one gets u_n → u in V as n → ∞. With this, Theorem 20 implies unique existence of a solution. Note that, by embedding, u ∈ W^{1,∞,∞}(0, T; Ĥ, Ĥ) ∩ W^{1,∞,2}(0, T; V, V) implies that u ∈ L^∞(0, T; V) ∩ H^1(0, T; V). In a second step, we now aim to find suitable assumptions on the parameter spaces X_ϕ, X_c, X_a and U_0 such that the regularity of the solution u of (34), as obtained in the previous proposition, is lifted to u ∈ L^∞((0, T) × Ω).
The obtained unique existence result is now summarized in the following proposition.
Proof. Our aim is to verify the assumptions proposed in Lemma 5, which lead to the result in Proposition 6. First, we verify Assumption 1. The first embeddings are an immediate consequence of our choice of p and q together with standard Sobolev embeddings; the remaining embeddings follow from the discussion in Step 2 above, see also Proposition 24.
Well-definedness of the Nemytskii mappings as well as the growth condition (7) are consequences of the following arguments on weak continuity. We focus on weak continuity of F: V × (X_c, X_a, X_ϕ) → W, F(λ, u) := ∇·(a∇u) − cu + ϕ + φ(u), via weak continuity of the operator inducing it, as presented in Lemma 5. First, for the cu part, we see that (c, u) ↦ cu is weakly continuous on (X_c, H). Weak convergence in X_a alone is not strong enough to enable weak continuity of (a, u) ↦ ∇·(a∇u) on (X_a, H); we therefore directly evaluate weak continuity of the Nemytskii operator. So, let (a_n, u_n) ⇀ (a, u) in X_a × V. Taking w* ∈ L²(0, T; L^{q*}(Ω)), the claim follows due to the following: we have ∇u w* ∈ L²(0, T; L^{p_a*}(Ω)) and ∇a_n ⇀ ∇a in L^{p_a}(Ω) in the first estimate, and u_n → u in L²(0, T; W^{1,18}(Ω)), ‖∇a_n w*‖_{L²(0,T;L^{18/17}(Ω))} ≤ C < ∞ for all n in the second estimate. In the third estimate, one has a_n → a in L^∞(Ω). Finally, in the last estimate, aw* ∈ L²(0, T; L^{q*}(Ω)) and u_n ⇀ u in L²(0, T; W^{2,q}(Ω)) imply ∆u_n ⇀ ∆u in L²(0, T; L^q(Ω)). For the term φ, the fact that the activation function σ satisfies σ ∈ C_locLip(R, R) completes the verification that the result of Proposition 6 holds.
For the following results, we set φ = 0.
Lemma 28 (Differentiability). In accordance with Proposition 12 and the framework of Proposition 27, setting φ = 0, the model operator G is Gâteaux differentiable. Proof. With the setting in Proposition 27, we verify local Lipschitz continuity of F(λ, u) = ∇·(a∇u) − cu + ϕ with λ = (ϕ, c, a). To this aim, we estimate with q̃ := 3q/(3 − q). Gâteaux differentiability of F: X × V → W as well as the Carathéodory assumptions are clear from this estimate and bilinearity of F with respect to λ and u. Differentiability of N_θ with σ ∈ C^1(R, R) has been shown in the last paragraph of the proof of Proposition 12.
When the image space W is stronger, that is, W ↪ L^q(Ω) for all q ∈ [1, ∞) as discussed in Remark 13, we require smoother activation functions than those employed in Lemma 28 in order to ensure differentiability of N_θ.
Remark 29 (Strong image space W and smoother neural network). Consider the case where the unknown parameter is ϕ, the parameters a, c are known, and the neural network N_θ has the smoother activation σ ∈ C^1_locLip(R, R), i.e., σ′ ∈ C_locLip(R, R). The minimization problems introduced in Proposition 27 then admit minimizers in the corresponding Hilbert spaces. Proof. For fixed θ, α, let us denote N_θ(α, ·) =: N_θ. It is clear that this setting fulfills all the embeddings in Assumption 1. Weak-strong continuity of N_θ is derived from the corresponding Lipschitz estimates and Lipschitz constants. This shows continuity of N in u; continuity of N in (α, θ) can be shown similarly. For F, when c, a are known and fixed, it is just a linear operator in u; weak continuity of F can hence be deduced from its boundedness, which can be confirmed in the same fashion as for A, B above.
To conclude this section, we consider a Hilbert space setting that will be relevant for our subsequent applications.
Remark 30 (Hilbert space framework for application). Another possible Hilbert space framework in which the all-at-once setting is applicable is the following, where Y is a Hilbert space. Verification of weak continuity and the growth condition for F can be carried out similarly as in Proposition 27; moreover, weak continuity of (X_a × H) ∋ (a, u) ↦ ∇·(a∇u) ∈ W can be confirmed analogously to the part (c, u) ↦ cu, without the need to evaluate the Nemytskii operator directly. This is the setting in which we will study the application (34) in detail.
Case studies in Hilbert space framework

Setup for case studies
In this section, for the sake of simplicity of implementation, we carry out case studies for some minimization examples in a Hilbert space framework, where we drop the unknown α and use quadratic Hilbert-space regularizers. Proposition 31. Consider the minimization problem (13) (or (14)) associated with the learning-informed PDE. The following statements are true: (i) The minimization problem admits minimizers.
(ii) The corresponding model operator G is Gâteaux differentiable with locally bounded derivative G′.
(iii) The adjoint of the derivative operator is given explicitly; in particular, for g_{3,1} one has a recursive procedure with a_l, a′_l as detailed in the proof.
Proof.Assertion i) follows from Remark 30.Using Proposition 12, Assertion ii) can be shown similarly as in Lemma 28.The proof for assertion iii) is presented in Appendix B.
Corollary 32 (Discrete measurements). In the case of discrete measurements, where the pointwise time evaluation is well-defined since V ↪ C(0, T; H²(Ω)), the adjoint g_{2,3} is modified as follows. For h ∈ Y, we extend u_h as constant on [t_i, T] in order to form the integral over the full time interval (0, T) in the last line. Above, h and h̃ respectively take the roles of k_z and z in (50); moreover, u_h solves the corresponding adjoint equation. We thus arrive at (57). This shows a numerical advantage of processing discrete observations in a Kaczmarz scheme, for instance in deterministic or stochastic optimization. Specifically, for each data point in the forward propagation, thanks to the all-at-once approach, no nonlinear model needs to be solved; in the backward propagation, for the same reason and by (57), the corresponding adjoint needs to be computed only on small time intervals.
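The Kaczmarz idea invoked above, sweeping cyclically through the per-measurement equations one at a time, can be illustrated on a toy linear system (an assumed standalone example, not the paper's operator G):

```python
import numpy as np

def kaczmarz(A, b, sweeps=100):
    """Cyclic Kaczmarz: project the iterate onto one equation <a_i, x> = b_i at a time."""
    x = np.zeros(A.shape[1])
    for _ in range(sweeps):
        for i in range(A.shape[0]):
            a = A[i]
            # Orthogonal projection onto the hyperplane of equation i.
            x += (b[i] - a @ x) / (a @ a) * a
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
x_true = np.array([1.0, -2.0])
x = kaczmarz(A, A @ x_true)
```

Each sweep touches one data equation at a time, which mirrors the per-observation adjoint evaluations described above.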

Numerical results
This section is dedicated to a range of numerical experiments carried out in two parallel settings: by way of analytic adjoints in Section 5.2.1, and with PyTorch in Section 5.2.2. While, in our experiments, we evaluate and compare the proposed method in different settings, such as varying the number of time measurements or the noise level, we highlight that the main purpose of these experiments is to show numerical feasibility of the proposed approach in principle, rather than to provide highly optimized results. In particular, a tailored optimization of, e.g., the regularization parameters and initialization strategies involved in our method might still improve the results significantly.
For both settings (analytic adjoints and PyTorch), we use the following learning-informed PDE as a special case of the one considered in Proposition 31, see (58). We deal with time-discrete measurements as in Corollary 32, i.e., we use a time-discrete measurement operator M: V → L²(Ω)^{n_T}, with n_T ∈ N, given as M(u)_{t_i} = u(t_i) for t_0 = 0 and t_i ∈ (0, T) with i = 1, …, n_T − 1. We further let a noisy measurement of the initial state u_0 be given at time point t = 0. Further, we consider two situations: 1. The source ϕ in (58) is fixed; we estimate the state u and the nonlinearity N_θ only, yielding a model operator acting on (u, θ). 2. The source ϕ in (58) is unknown, and we estimate the state u, the source ϕ and the nonlinearity N_θ; this results in a model operator G acting on (u, ϕ, θ). For these two settings, the special cases of the learning problem (13) we consider here are given as (59) for state and nonlinearity identification and (60) for the additional source identification. Remark 33 (Offsets). With Ω_y := u(Ω × (0, T)) the range of u over all x ∈ Ω, t ∈ (0, T), and given that ∂u/∂t(x, t) ≠ 0, consider any solution pair f̄: Ω_y → R, ϕ̄: Ω → R of (34). Then all solutions of (34) are of the form (f̄ + c, ϕ̄ − c) with a constant c ∈ R. Indeed, assume f, φ are solutions, and define g(y) := f(y) − f̄(y), Φ(x) := φ(x) − ϕ̄(x). Since both pairs are solutions, one has 0 = g(u(x, t)) + Φ(x) for all (x, t). As ∂u/∂t(x, t) ≠ 0 on Ω × (0, T), it follows that g′(y) ≡ 0 on u(Ω × (0, T)), that is, there is some constant c with g ≡ c and consequently Φ ≡ −c. Moreover, any solution pair is obtained from (f̄, ϕ̄) by such an offset.
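The offset invariance in Remark 33 can be seen directly: writing the state equation schematically as u̇ = ∆u + f(u) + ϕ (a simplification of (34) for illustration), one has for any constant c ∈ R

```latex
\partial_t u \;=\; \Delta u + f(u) + \varphi
            \;=\; \Delta u + \bigl(f(u) + c\bigr) + \bigl(\varphi - c\bigr),
```

so (f + c, ϕ − c) is again a solution pair; for non-stationary states, Remark 33 shows that these constant offsets exhaust the non-uniqueness.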
Remark 34 (Different measurement operators). In our experiments, we use a time-discrete measurement operator, and at the times at which data is measured, we assume measurements to be available on all of the spatial domain. As will be seen in the next two subsections, reconstruction of the nonlinearity is possible in this case even with rather few time measurements. A further extension of the measurement setup would be to use partial measurements also in space. While we expect similar results for approximately uniformly distributed partial measurements in space, highly localized measurements, such as boundary measurements or measurements on subdomains, are more challenging. In this case, we expect the reconstruction quality of the nonlinearity to depend strongly on the range of values the state u attains at the observed points, but given the analytical focus of our paper, we leave this topic to future research.
Discretization. In all but one experiment (in which we test different spatial and temporal resolutions), we consider the time interval [0, 0.1], uniformly discretized with 50 time steps, and the space domain Ω = (0, 1), uniformly discretized with 51 grid points. The time derivative as well as the Laplace operator were discretized with central differences. For the neural network N_θ, we consider a fully connected network with tanh activation functions and three hidden layers of widths [2, 4, 2] for all experiments. Note that this network architecture was chosen empirically by evaluating the approximation capacity of different architectures with respect to different nonlinear functions. For the sake of simplicity, we choose a simple, rather small architecture (satisfying the assumptions of our theory) for all experiments considered in this paper. In general, the architecture (together with regularization of the network parameters) must be chosen such that a balance between expressivity and overfitting is reached (see, for instance, [3, Sections 1.2.2 and 3]), but a detailed evaluation of different architectures is not within the scope of our work.
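The discretization just described (51 spatial grid points, central differences, a [2, 4, 2] tanh network) can be sketched as follows; this is an illustrative reimplementation with hypothetical names, not the authors' code, and it assumes homogeneous Dirichlet behavior at the boundary for the Laplacian stencil:

```python
import numpy as np

# Uniform grid on Omega = (0, 1) with 51 points, as in the experiments.
n_x = 51
h = 1.0 / (n_x - 1)

def laplacian_1d(u):
    """Central second-difference approximation of u_xx on interior nodes."""
    lap = np.zeros_like(u)
    lap[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
    return lap

def init_mlp(sizes=(1, 2, 4, 2, 1), seed=0):
    """Fully connected net with hidden widths [2, 4, 2]; weights are random placeholders."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def mlp(theta, z):
    """Apply the network pointwise to state values z of shape (n_x,)."""
    a = z.reshape(1, -1)
    for l, (W, b) in enumerate(theta):
        a = W @ a + b[:, None]
        if l < len(theta) - 1:          # linear output layer, tanh elsewhere
            a = np.tanh(a)
    return a.ravel()
```

Since central differences are exact on quadratics, `laplacian_1d` reproduces u_xx = −2 exactly for u(x) = x(1 − x).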

Implementation with analytic adjoints
Setup. In what follows, we apply Landweber iteration to solve the minimization problem (13). The Landweber algorithm is implemented with the analytic adjoints computed in Proposition 31 and Corollary 32, ensuring that the backward propagation maps to the correct spaces. PDE and adjoints. We employed finite difference methods to numerically compute the derivatives in the PDE model, as well as in the adjoints outlined in Proposition 31 and Corollary 32. In particular, central difference quotients were used to approximate time and space derivatives. For numerical integration, we applied the trapezoidal rule. The inverse operator (−∆)^{−1}(−∆ + Id)^{−1} constructed in (50) is applied in each Landweber iteration.
Neural network. In the examples considered, f: u(x) ↦ f(u(x)) is a real-valued smooth function, hence the suggested simple architecture with three hidden layers of [2, 4, 2] neurons is appropriate. As the reconstruction is carried out in the all-at-once setting, the network parameters are estimated simultaneously with the state; their iterative update is done in the recursive fashion (55).
Data measurement. We work with measured data y given as a limited number of snapshots of u (see Corollary 32) and evaluate the examples in the noise-free case and with δ = 3% relative noise. The noise n is sampled from a standard Gaussian distribution N(0, 1) and scaled to the relative noise level δ, i.e., y = u + δ (‖u‖₂/‖n‖₂) n.
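One common way to realize a prescribed relative noise level δ, consistent with the description above, is to rescale a standard Gaussian sample (a minimal sketch with hypothetical names):

```python
import numpy as np

def add_relative_noise(u, delta, seed=0):
    """Perturb u so that ||y - u||_2 / ||u||_2 equals exactly delta."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(u.shape)
    return u + delta * (np.linalg.norm(u) / np.linalg.norm(n)) * n

u = np.sin(np.linspace(0.0, np.pi, 51))
y = add_relative_noise(u, 0.03)
```

By construction the achieved relative error is exactly δ, which makes noise levels comparable across experiments.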
Error. The error between the reconstruction and the ground truth was measured in the corresponding norms, i.e., the X_ϕ-norm for ϕ and the W-norm for the PDE residual and the error of f. For u, the V-norm would be the natural measure; for simplicity, we display the L²-error.
Minimization problem. The regularization parameters satisfy R_u = R_ϕ, and the measurements are weighted as M_i(u) = 10 u(t_i) (cf. Corollary 32). We implement an adaptive Landweber step size scheme: if the PDE residual decreases in the current step, the step size is accepted; otherwise it is bisected. For noisy data, the iterations are terminated once a discrepancy-principle stopping rule (cf. [23]) is satisfied.
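The adaptive step-size Landweber scheme with discrepancy-principle stopping can be sketched on a toy linear forward operator (an assumed example with hypothetical names, not the paper's operator G or its analytic adjoints):

```python
import numpy as np

def landweber(G, Gt, y, x0, delta, mu=0.1, tau=1.5, max_iter=500):
    """Landweber iteration x <- x - mu * G'(x)^* (G(x) - y) with bisected step size
    and a discrepancy-principle stop once ||G(x) - y|| <= tau * delta."""
    x = x0.copy()
    res = G(x) - y
    for _ in range(max_iter):
        if np.linalg.norm(res) <= tau * delta:
            break
        x_new = x - mu * Gt(res)
        res_new = G(x_new) - y
        if np.linalg.norm(res_new) < np.linalg.norm(res):
            x, res = x_new, res_new      # residual decreased: accept the step
        else:
            mu *= 0.5                    # otherwise bisect the step size
    return x

A = np.array([[1.0, 0.2], [0.2, 1.5]])
y = A @ np.array([1.0, 2.0])
x = landweber(lambda v: A @ v, lambda r: A.T @ r, y, np.zeros(2), delta=1e-8)
```

For a nonlinear G the adjoint `Gt` would be re-evaluated at the current iterate; the acceptance/bisection logic is unchanged.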

Implementation with Pytorch
The experiments of this section were carried out using the PyTorch [29] package to numerically solve (59) and (60). More specifically, we used the pre-implemented ADAM [25] algorithm with automatic differentiation, a learning rate of 0.01 and 10⁴ iterations for all experiments. When noise is added to the data, we use Gaussian noise with zero mean and different standard deviations, denoted by σ.
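The objective minimized by ADAM couples the data misfit at the observed times with the discrete PDE residual over all unknowns at once. A minimal NumPy sketch of such a discrete all-at-once objective is given below (the actual implementation uses PyTorch tensors with automatic differentiation; all names, the generic `net` callable, and the simple Tikhonov term are placeholders for illustration):

```python
import numpy as np

def all_at_once_loss(u, net, phi, y, obs_idx, dt, h, gamma=1e-3):
    """Data misfit at observed time indices + squared residual of
    u_t - u_xx - net(u) - phi on interior nodes + a simple Tikhonov term."""
    # Central differences in time and space on the interior nodes.
    u_t = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2.0 * dt)
    u_xx = (u[1:-1, 2:] - 2.0 * u[1:-1, 1:-1] + u[1:-1, :-2]) / h**2
    residual = u_t - u_xx - net(u[1:-1, 1:-1]) - phi[None, 1:-1]
    misfit = sum(np.sum((u[i] - y[i]) ** 2) for i in obs_idx)
    return misfit + np.sum(residual**2) + gamma * np.sum(u**2)

# Consistency check: the stationary state u(x) = x(1-x) with phi = 2 and a
# zero nonlinearity satisfies the discrete equation, so the loss vanishes.
n_t, n_x = 20, 51
dt, h = 0.1 / (n_t - 1), 1.0 / (n_x - 1)
xs = np.linspace(0.0, 1.0, n_x)
u = np.tile(xs * (1.0 - xs), (n_t, 1))
phi = np.full(n_x, 2.0)
obs = [0, n_t // 2]
y = {i: u[i] for i in obs}
loss = all_at_once_loss(u, lambda z: np.zeros_like(z), phi, y, obs, dt, h, gamma=0.0)
```

In the PyTorch version, `u`, `phi` and the network weights would simply be trainable tensors, and ADAM updates all of them simultaneously from the gradients of this single scalar.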
Solving for state and nonlinearity. In this paragraph, we provide experiments for the learning problem with a single datum, where we solve for the state and the nonlinearity and test with increasing noise levels and a decreasing number of observations. We refer to Figure 3 for a visualization of selected results, and to Table 1 (top) for error measures for all tested parameter combinations. It can be observed that reconstruction of the nonlinearity works reasonably well even down to a rather low number of measurements combined with a rather high noise level: The shape of the nonlinearity is reconstructed correctly in all cases except the one with three time measurements and a noise level of σ = 0.1.
Solving for parameter, state and nonlinearity. In this paragraph, we provide experiments for the learning problem with a single datum, where we solve for the parameter, the state and the nonlinearity and test with increasing noise levels and a decreasing number of observations. We refer to Figure 4 for a visualization of selected results and to Table 1 (bottom) for error measures for all tested parameter combinations.
It can again be observed that the reconstruction works rather well, in this case for both the nonlinearity and the parameter.Nevertheless, due to the additional degrees of freedom, the reconstruction breaks down earlier than in the case of identifying just the state and the nonlinearity.
Varying the discretization level. In this paragraph, we test the effect of different spatial and temporal resolutions of the state. To this aim, we reproduce the experiment of line 3 of Figure 4 (6 time measurements, δ = 0.03, quadratic nonlinearity, solving for nonlinearity and state) with 501 × 500 and 5001 × 5000 grid points in space × time (instead of 51 × 50 as in the original example).
The results can be found in Figure 5. As can be observed there, changing the resolution level has only a minor effect on the result, possibly slightly decreasing the reconstruction quality for the nonlinearity. We attribute this to the fact that the number of spatial grid points for the measurement was increased accordingly; see also Remark 34 for a discussion of localized measurements.
Reconstructing the nonlinearity from multiple samples. In this paragraph, we show numerically the effect of having different numbers of data points available, i.e., the effect of different K ∈ N in (60). We again consider the identification of state, parameter and nonlinearity, and use three time measurements and a noise level of 0.08, a setting in which the identification of the nonlinearity breaks down when only a single datum is available.
As can be observed in Figure 6, having multiple data samples improves the reconstruction quality, as expected. It is worth noting that, even though each single parameter is reconstructed rather imperfectly with strong oscillations, the nonlinearity is recovered reasonably well already with three data samples. This is to be expected, as the nonlinearity is shared among the different measurements, while the parameter differs.

Comparison of different approximation methods
Here we evaluate the benefit of approximating the nonlinearity with a neural network, as compared to classical approximation methods. As a test example, we consider the identification of the state and the nonlinearity only, using a noise level of 0.03 and 10 discrete time measurements, and four different ground-truth nonlinearities. As approximation methods, we use polynomials as well as trigonometric polynomials, where in both settings we allow for the same number (= 29) of degrees of freedom as with the neural network approximation. For all methods, the same algorithm (ADAM) was used, and the regularization parameters for the state and for the parameters of the nonlinearity were optimized by grid search to achieve the best performance.
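The two classical parameterizations compared against the network can be set up as simple coefficient-to-function maps; a minimal sketch (function names and the period of the trigonometric basis are assumptions for illustration):

```python
import numpy as np

def poly_basis(coeffs, z):
    """Algebraic polynomial sum_k c_k z^k with len(coeffs) degrees of freedom."""
    return sum(c * z**k for k, c in enumerate(coeffs))

def trig_basis(coeffs, z, period=4.0):
    """Trigonometric polynomial a_0 + sum_k (a_k cos + b_k sin) on an assumed period."""
    n = (len(coeffs) - 1) // 2
    w = 2.0 * np.pi / period
    out = coeffs[0] * np.ones_like(z)
    for k in range(1, n + 1):
        out += coeffs[2*k - 1] * np.cos(k * w * z) + coeffs[2*k] * np.sin(k * w * z)
    return out

z = np.linspace(-1.0, 1.0, 101)
f_poly = poly_basis(np.array([-1.0, 0.0, 1.0]), z)   # represents z^2 - 1 exactly
```

With 29 coefficients each, both maps have the same number of trainable degrees of freedom as the [2, 4, 2] network, so the comparison isolates the effect of the parameterization rather than the model size.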
The results can be seen in Figure 7. While each method yields a good approximation in some cases, it can be observed that the polynomial approximation performs poorly both for the cosine nonlinearity and for the polynomial nonlinearity (even though the degrees of freedom would suffice to represent the latter exactly). The trigonometric polynomial approximation, on the other hand, performs generally better, but produces some oscillations when approximating the square nonlinearity. The neural network approximation performs rather well for all types of nonlinearity, which might be interpreted as indicating that the neural network approximation is preferable when no structural information on the ground-truth nonlinearity is available. It should be noted, however, that due to non-convexity of the problem, this result depends on many factors, such as the choice of initialization and the numerical algorithm.

Conclusion
We have considered the problem of learning a partially unknown PDE model from data, in a situation where access to the state is possible only indirectly via incomplete, noisy observations of a parameter-dependent system with unknown physical parameters. The unknown part of the PDE model was assumed to be a nonlinearity acting pointwise, and was approximated via a neural network. Using an all-at-once formulation, the resulting minimization problem was analyzed and well-posedness was obtained for a general setting as well as for a concrete application. Furthermore, a tangential cone condition was ensured for the neural network part of the resulting learning-informed parameter identification problem, thereby providing the basis for local uniqueness and convergence results. Finally, numerical experiments using two different implementation strategies have confirmed practical feasibility of the proposed approach.

Let V_1 ↪↪ V_2 (a compact embedding) and V_2 ↪ V_3 (a continuous embedding), and fix 1 < p < +∞, 1 ≤ q ≤ +∞. Then W^{1,p,q}([0, T]; V_1, V_3) ↪↪ L^p(I; V_2). Suppose the tangential cone condition holds with some c(x, x̃) ≥ 0, where c(x, x̃) < 1 if ‖x − x̃‖ is sufficiently small. If G(x) = y is solvable in B_ρ(x_0), then a unique x_0-minimum-norm solution exists. It is characterized as the solution x† of G(x) = y in B_ρ(x_0) satisfying the stated condition. Note that in this proposition, the claim does not change if the statement is made for the ball B_ρ(x†) with x_0 ∈ B_ρ(x†).
The estimate (29) will now be shown via backwards induction, with the various constants defined in (32): = A_{i+1}(z, θ) ω_{i+1} σ′(ω_i N_{i−1}(z, θ_{i−1}) + β_i) − A_{i+1}(z̃, θ) ω_{i+1} σ′(ω_i N_{i−1}(z̃, θ_{i−1}) + β_i). We apply (61), then (62) and the bound on A_{i+1} to the first line, while we apply the induction assumption together with the definition of s_i to the second line, to obtain the claim. The adjoint for g_{2,2} = (·)*|_{t=0} can be derived in a similar manner. We now compute the last adjoint g_{3,1} = −N_θ(u, θ)* involving the neural network with weights ω, biases β and fixed activation σ. With the architecture mentioned at the beginning of this section, we denote by a_l the output of the l-th layer, a_l = σ(ω_l a_{l−1} + β_l), a_0 = input data u, l = 1, …, L, and introduce a′_l = σ′(ω_l a_{l−1} + β_l), a′_0 = input data u, l = 1, …, L, with σ = Id in the L-th (output) layer, where σ′ denotes the derivative of σ.
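The layer recursion a_l = σ(ω_l a_{l−1} + β_l) with stored derivative factors a′_l = σ′(ω_l a_{l−1} + β_l) can be sketched numerically as follows; this is an illustrative reimplementation (tanh activation, hypothetical names), with the stored σ′ factors reused to apply the derivative N′(u) in a direction v via the chain rule:

```python
import numpy as np

def forward_with_derivatives(weights, biases, u):
    """Forward pass a_l = sigma(w_l a_{l-1} + b_l), storing also
    a'_l = sigma'(w_l a_{l-1} + b_l); identity activation in the output layer."""
    a, acts, dacts = u, [], []
    L = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        if l < L - 1:
            a, da = np.tanh(z), 1.0 - np.tanh(z)**2
        else:
            a, da = z, np.ones_like(z)       # sigma = Id in the output layer
        acts.append(a)
        dacts.append(da)
    return acts, dacts

def directional_derivative(weights, dacts, v):
    """Apply N'(u) to v: multiply by w_l and the stored sigma' factor per layer."""
    for W, da in zip(weights, dacts):
        v = da * (W @ v)
    return v

rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 1)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
u, v = np.array([0.5]), np.array([1.0])
acts, dacts = forward_with_derivatives(weights, biases, u)
jv = directional_derivative(weights, dacts, v)
```

The adjoint g_{3,1} applies the same per-layer factors in reverse order with transposed weights, which is exactly the recursion detailed in the proof.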

Lemma 4.
Assume that σ ∈ C(R, R). Then, with the setting of Assumption 1, N_θ: R^m × R → R as in Definition 3 induces a well-defined Nemytskii operator N_θ. Denote by B_i the image of the i-th layer before applying the activation function. Assume that the activation function σ ∈ C^1(R, R) associated to N satisfies, for all 1 ≤ i ≤ L, the Lipschitz inequalities |σ(x) − σ(x̃)| ≤ C_σ |x − x̃|, |σ′(x) − σ′(x̃)| ≤ C_σ′ |x − x̃| for all x, x̃ ∈ B_i and some positive constants C_σ, C_σ′, and that s_i := sup_{x ∈ B_i} |σ′(x)| < ∞.

Figure 1: Numerical identification of the state u and the ground-truth nonlinearity f(u) = u² − 1 in (58) for three different values of the source term ϕ. In each case, three noise-free observations are given (n_T = 3). Plots 1-3 and 4-6 in the top line show the given data and the ground-truth states for the three equations, respectively. The content of the remaining plots is described in their titles.

Figure 2: Numerical identification of the state u and the ground-truth nonlinearity f(u) = u² − 1 in (58) for three different values of the source term ϕ. In each case, three observations (n_T = 3) with 3% noise are given. Plots 1-3 and 4-6 in the top line show the given data and the ground-truth states for the three equations, respectively. The content of the remaining plots is described in their titles.

Figure 4: Numerical identification of the state u, the ground-truth nonlinearity f(u) = u² − 1 and the parameter ϕ in (58) for decreasing numbers of discrete observations (lines 1-2, 3-4 and 5-6) and increasing noise levels (even lines versus odd lines). From left to right: given data, recovered state, recovered nonlinearity (orange) compared to ground truth (blue), recovered parameter (orange) compared to ground truth (blue) and initialization (green).