Laplace-based Bayesian inference for ordinary differential equation models using regularized artificial neural networks

Parameter estimation and associated uncertainty quantification is an important problem in dynamical systems characterised by ordinary differential equation (ODE) models that are often nonlinear. Typically, such models have analytically intractable trajectories, which result in likelihoods and posterior distributions that are similarly intractable. Bayesian inference for ODE systems via simulation methods requires numerical approximations to produce inference with high accuracy, at the cost of heavy computational power and slow convergence. At the same time, Artificial Neural Networks (ANN) offer tractability that can be utilized to construct an approximate but tractable likelihood and posterior distribution. In this paper we propose a hybrid approach in which Laplace-based Bayesian inference is combined with an ANN architecture for obtaining approximations to the ODE trajectories as a function of the unknown initial values and system parameters. Suitable choices of customized loss functions are proposed to fine-tune the approximated ODE trajectories and the subsequent Laplace approximation procedure. The effectiveness of our proposed methods is demonstrated using an epidemiological system with non-analytical solutions, the Susceptible-Infectious-Removed (SIR) model for infectious diseases, based on simulated and real-life influenza datasets. The novelty and attractiveness of our proposed approach include (i) a new development of Bayesian inference using ANN architectures for ODE-based dynamical systems, and (ii) computationally fast posterior inference that avoids the convergence issues of benchmark Markov chain Monte Carlo methods. These two features establish the developed approach as an accurate alternative to traditional Bayesian computational methods, with improved computational cost.


Introduction

Existing Works and Literature
Ordinary differential equation (ODE) models refer to statistical models whose mean functions are governed by an underlying ordinary differential equation. Dynamical systems describe changes in certain states or quantities over time, and are readily formalized by ordinary differential equations. As a result, dynamical systems elicited via ODE models find applications in numerous fields including biology, epidemiology, and physics, indicating their importance for understanding and gaining insights into the processes involved. An important aspect of these application problems is the inference of unknown parameters of ODE models and the associated uncertainty quantification. This is known to be a very challenging problem due to the intractability of the underlying ODE, which is typically nonlinear and thus prohibits any closed-form solution. Bayesian inference of unknown parameters of ODE models is also challenging due to this intractability, which results in likelihoods and posterior distributions that are similarly intractable. Thus, Bayesian computations for inference in ODE models are carried out via simulation methods that require numerical approximations to produce inference with high accuracy, at the cost of heavy computational power and slow convergence. Examples of Bayesian inference for ODE models depending on long-run convergence results of iterative procedures are reported in [1,2,3], and associated convergence diagnostic methods can be found in [4,5].
Recent advances in computational power have brought about increasingly complex models and innovative solutions to both traditional and new problems. One such example is the use of artificial neural networks (ANNs) in mathematical modelling. ANNs were first introduced in [16], and research into their theoretical capabilities [17,13] laid strong foundations for the usage of neural networks as a universal function approximation tool [19,20,21]; Li (1996) [22] further showed that multivariate functions and their derivatives can be simultaneously approximated by ANNs, with a comprehensive, practical algorithm that improves the optimization of ANN weights and architectures laid out in [15] and references therein. Crucially, the optimal or example ANNs in this literature often use fewer than 3 hidden layers and fewer than 20 neurons in each layer to achieve good results; these are relatively simple architectures for achieving effective functional and derivative approximations. Since then, ANN training algorithms that explicitly incorporate derivative information have been explored in an effort to enhance their approximation abilities [27,23,18,26].
The traditional way of approximating solutions and trajectories of nonlinear ODEs, when closed analytic-form solutions are unavailable, is based on numerical methods such as the Euler, Runge-Kutta, finite difference and finite element methods. Nevertheless, noting that realized states of the ODE are simply functions of time and intrinsic parameters (including the unknown initial values), the idea of solving differential equations, or at least approximating their dynamics, with ANNs seems natural. This is the approach taken in this paper, where we approximate the functional values and derivatives of the underlying ODE in dynamical systems by an appropriate ANN. The resulting ANN is tractable and allows derivatives of the likelihood and posterior densities to be computed in closed form. As a result, Bayesian inference is facilitated in closed form based on the Laplace approximation to the posterior distribution of the unknown parameters. The effectiveness of our proposed hybrid Bayesian-ANN methods as an alternative to traditional iterative Bayesian methods like MCMC is demonstrated using an epidemiological system with non-analytical solutions, the Susceptible-Infectious-Removed (SIR) model for infectious diseases, based on simulated and real-life influenza datasets.
The approximation of ODE models by ANNs is not new. Physics-Informed Neural Networks (PINNs), proposed by Raissi et al. (2019) [11], are unsupervised neural networks (i.e. with no training targets) trained to obey the physics of the dynamical system by using penalties based on the differential equations in the loss function. Raissi et al. [11] also develop methodology for data discovery, i.e. parameter learning, based on a given observed trajectory of the ODE system with unknown parameters. Their approach is similar to that of Ramsay (2007) [12], where, in the latter, a parameter cascading approach is implemented to estimate the parameters of the ODE, as opposed to the joint minimization approach of Raissi et al. [11]. Ramsay [12] also incorporates a penalty coefficient and selects it in a data-driven way. Raissi et al. [11] do not incorporate a penalty coefficient, and their parameter estimates do not come with associated measures of uncertainty.
Further work on approximating ODEs by extensions of ANNs has been reported in the literature. A deep neural network (DNN) approach to forward-inverse problems by Jo et al. [10] extends results from Li [22] to prove that, under certain conditions, these DNN approximations converge to the dynamical system's solution, and the parameter estimates resulting from the ANN converge to their true values. Jo et al. subsequently applied this methodology to a Susceptible-Infected-Removed (SIR) model for COVID-19 cases in South Korea [9]. Their joint minimization approach is similar to that of Raissi et al. [11] for parameter learning, and they, too, do not report uncertainty measures corresponding to their parameter estimates. Another drawback of the theoretical approach of Jo et al. [10] is that a "close-enough" ANN is shown to exist, but it is not indicated how well the learned network estimates the ODE system once the loss function has been trained to be less than ε, for some small ε > 0. The DNNs in these works use 4 or more layers, with over 100 neurons in each layer, demonstrating the high complexity of these models.

Our Contributions
Previous works using ANNs for ODE models do not incorporate uncertainty measures (either summary measures or entire distributions) associated with the estimates of unknown parameters of the ODE models. In this paper, we develop a Bayesian inferential framework for analytically intractable ODE models by developing an ANN architecture that tractably approximates the function values and derivatives of the ODE. Uncertainty measures are obtained by additionally deriving a Laplace approximation to the true posterior distribution, which is a Gaussian distribution whose variance relies on the ANNs' approximations to the ODE system's partial derivatives. The proposed hybrid method admits an approximate but tractable posterior distribution, and thereby enables us to obtain accurate parameter point estimates together with appropriate uncertainty quantification.
Our approach to the ANN approximations differs from Raissi et al. [11], Jo et al. [10] and Ramsay [12] in that we utilize supervised ANNs with the unknown ODE parameters as extrinsic inputs, trained on numerical solutions of the ODE over a grid of collocation points in the parameter and temporal domains; the inclusion of initial values as inputs further allows initial value estimation. In inference for epidemic systems this is particularly important, as it allows the estimation of crucial epidemic quantities, such as the number of infected individuals at the onset of an outbreak. Under such an architecture, obtaining the partial derivatives of the ODE system's states with respect to its parameters (and initial values) is straightforward. This facilitates the imposition of regularizations on the ANNs' loss functions to improve their accuracy in approximating the solutions and partial derivatives of the ODE, and thus provides more reliable estimates of posterior variances in the aforementioned Laplace approximation procedure.
A further important contribution of this work is that the hybrid method combining Bayesian inference with an appropriately regularized ANN architecture proves to be a faster and more lightweight alternative to typical MCMC methods, which require long runs and convergence of iterative chains. The latter are well-known methods in the Bayesian framework for obtaining inference with associated measures of uncertainty. We demonstrate the gain in computational cost in our simulation and real-data examples subsequently.

Organization of sections
The remainder of the paper is organized as follows: Section 2 explains some preliminary concepts leading up to our methods, and Section 3 details the application of said methods using an example dynamical system, the SIR model, on simulated datasets. Subsequently, Section 4 demonstrates the usage of our methods to infer epidemic parameters from a real-life influenza dataset. Section 5 discusses the results obtained in Sections 3 and 4, and reflects on the advantages and limitations of our methods. Finally, conclusions and potential future work are outlined in Section 6.

ODE Systems
Differential equations arise in many areas and are often used to mathematically describe the rates of change of a system's states with respect to some underlying variable (such as time). Specifically, let x(t) := (x_1(t), x_2(t), ..., x_D(t)) ∈ R^D denote the states of an ODE system whose values vary over time points t ∈ [T_0, T_1], where T_0 and T_1 denote the initial and final time points respectively. The general ODE system can be written as

    dx(t)/dt = f(x(t), t, φ),    x(T_0) = x_0,    (1)

where f describes the system dynamics and φ ∈ R^{D+Q} collects the D unknown initial values x_0 together with the Q system parameters. The system is called autonomous if f is further independent of t; in this paper, we will consider autonomous ODE systems.
Additionally, in our proposed methods detailed later (pertaining to the neural network design in Section 2.5 and the loss function regularizations in Section 2.6), we will also consider the derivatives of the system's states with respect to its parameters (and initial values), ∇_φ x(t, φ), whose trajectory is given by

    d∇_φ x(t, φ)/dt = ∇_φ f(x(t, φ), φ),    (2)

obtained by differentiating Equation (1) with respect to φ. Extending the system states x to include these derivatives as additional states of the dynamical system, we define

    x†(t, φ) := (x(t, φ), ∇_φ x(t, φ)) ∈ R^{D†},    (3)

where D† := D(D + Q + 1). Combining Equations (1) and (2) results in a system for x† written as

    dx†(t, φ)/dt = f†(x†(t, φ), φ).    (4)
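As a toy illustration (not from the paper), the augmented system x† can be integrated with any standard solver. The sketch below assumes the scalar ODE dx/dt = −θx, augments the state with its sensitivity ∂x/∂θ, and checks the result against the closed-form solution:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy scalar ODE dx/dt = -theta * x with x(0) = x0, augmented with its
# parameter sensitivity s := dx/dtheta, which satisfies
# ds/dt = (df/dx) * s + df/dtheta = -theta * s - x, with s(0) = 0.
def augmented_rhs(t, x_aug, theta):
    x, s = x_aug
    return [-theta * x, -theta * s - x]

theta, x0, T = 0.5, 2.0, 4.0
sol = solve_ivp(augmented_rhs, (0.0, T), [x0, 0.0], args=(theta,),
                t_eval=[T], rtol=1e-8, atol=1e-10)
x_T, s_T = sol.y[:, -1]

# Analytic check: x = x0 exp(-theta t) and dx/dtheta = -t x0 exp(-theta t)
assert abs(x_T - x0 * np.exp(-theta * T)) < 1e-6
assert abs(s_T + T * x0 * np.exp(-theta * T)) < 1e-6
```

The same augmentation pattern scales to vector states, with one sensitivity component per (state, parameter) pair, matching the dimension D† above.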

Bayesian Inference
In a dynamical system, the observed data y at different times t usually arise as a result of a combination of entries of x(t), φ and noise terms; thus we may write the observation at time t as y(t) := y(x(t), φ), and the associated probability as

    p(y(t_n) | x(t_n), φ),    n = 1, 2, ..., N,    (5)

where t_1 < t_2 < ... < t_N are the time points of the observations. Assuming that the observations are independent of each other, the complete likelihood can be obtained by multiplying together the probabilities in (5). It is often understood that the observations y arise from the underlying states x, in which case we may abbreviate the complete likelihood as

    p(y | φ) = ∏_{n=1}^{N} p(y(t_n) | x(t_n), φ).    (6)
Bayesian inference generally involves specifying a prior belief on a set of parameters, and then factoring in the likelihood of observed data to arrive at a posterior belief about said parameters. Probability distributions are often used to represent these beliefs: denote by p(φ) the prior distribution of φ, and by p(y|φ) the likelihood of observing some data y given the parameters φ. The posterior distribution of φ, p(φ|y), then satisfies

    p(φ|y) = p(φ)p(y|φ) / ∫ p(φ)p(y|φ) dφ ∝ p(φ)p(y|φ),    (7)

where in the last expression it suffices to consider the product of the likelihood and prior, viewed as a function of φ only.
Equation (7) is the usual expression of the posterior density associated with φ given observed data. The typical approach in Bayesian inference, when the posterior is intractable, is to obtain samples from this posterior and perform Monte Carlo inference in order to derive estimates such as the posterior mean, variance and credible intervals. However, in the case of posteriors from ODE models, obtaining samples directly from them is difficult. This is because many ODE models cannot be solved analytically in closed form, resulting in a likelihood expression, and hence a posterior density, that is intractable. Previous works have circumvented this intractability in a number of ways, such as Monte Carlo simulations with importance sampling based on likelihoods calculated using numerical solutions [38], Approximate Bayesian Computation, which avoids the use of likelihoods [39,8], modifications to Markov chain Monte Carlo methods [40], and many more. Most of these methods are computationally slow, and are either frequentist or do not provide exact Bayesian inference.

Challenges of Analytically Intractable Likelihoods
The solution of a dynamical system refers to the set of values of the system's states that satisfy the corresponding ODEs; some systems of differential equations have exact, analytically tractable solutions, whereas others generally require function or numerical approximations to their true solutions. Function approximations involve pre-defining a class of functions and finding an element or a subset of the class that best matches the analytical solution; numerical methods, on the other hand, often involve discretizing some parameter space and then obtaining numerical values of the system's states at those discrete points.
Numerical methods can prove to be inconvenient for computing the likelihoods and posterior distributions of ODE parameters: for each φ_m (where m ∈ {1, 2, ..., M}), the ODE needs to be solved numerically to obtain the solution (x(t_1, φ_m), x(t_2, φ_m), ..., x(t_N, φ_m)). This is repeated a large number of times M to obtain a good representation of the posterior distribution, rendering the method computationally expensive. Any change to the true values of the parameters φ also causes the previously inferred posterior to be inaccurate, and the algorithm has to be repeated for another M iterations to obtain new posteriors.
With function approximation methods, the accuracy of the solution naturally depends on the function space considered. Given selected functions x̂(t, φ) = (x̂_1(t, φ), x̂_2(t, φ), ..., x̂_D(t, φ)) that approximate each of x(t, φ) = (x_1(t, φ), x_2(t, φ), ..., x_D(t, φ)) in Equation (1) reasonably well, this method bears the advantage that the state values x̂ for different φ_m may be computed simply by changing the inputs to the approximating function. Furthermore, as the function space considered is arbitrary, we are somewhat free to choose x̂_d with desirable properties such as continuity and differentiability; this proves useful in quantifying the posterior uncertainty in φ in later sections.
Next, we will employ Artificial Neural Networks (ANNs) to approximate the solution of an ODE system parameterized by φ. Our proposed method utilizes ANNs as universal function approximators for the ODE solution, x̂(t, φ) ≈ x(t, φ), taking it explicitly as a function of t and φ; we shall henceforth refer to the approximation of x as Method I. Additionally, we also consider an ANN that approximates x†, i.e. the system states as well as their first derivatives, which we shall refer to as Method II.
For each method, we will first describe the process of obtaining training features and targets using collocation points, then describe the architectures used, and finally discuss the custom loss functions and how they serve to regularize the ANN outputs.

Artificial Neural Networks: Training Examples
Consider a dynamical system that occurs over the time range [T_0, T_1] and a range of parameter values of interest. Collocation points are placed over the (t, φ) domain in the following way: t^C_1 < t^C_2 < ... < t^C_{N_t} forms a grid on the t-domain with N_t grid points, and φ^C_{i,1} < φ^C_{i,2} < ... < φ^C_{i,N_{φ_i}} forms a grid on the φ_i-domain with N_{φ_i} points, for i = 1, 2, ..., D + Q. The collection of all points (t^C, φ^C) forms the collocation grid for training features. The choice of such points must satisfy t^C_1 = T_0 and t^C_{N_t} = T_1, with the φ_i-grids covering the parameter ranges of interest for all i = 1, 2, ..., D + Q, so that the grid is inclusive of all time points and parameter values of interest. To ensure good approximation, the collocation grid for training features should also be reasonably dense (i.e. |t^C_{j+1} − t^C_j| and |φ^C_{i,j+1} − φ^C_{i,j}| should be reasonably small), achieved by taking N_t and N_{φ_i} reasonably large, and somewhat evenly spaced. The number of collocation points (t^C, φ^C) arising from the above gridding exercise is therefore N_t × ∏_{i=1}^{D+Q} N_{φ_i}.

To obtain training targets for Method I, numerical solutions x(t^C, φ^C) are evaluated at each point (t^C, φ^C) of the training features' collocation grid, after which scaling is carried out to ensure that the targets are evenly trained. Define F_d as some appropriate, invertible transformation (explained later) and set the training targets as

    F_d(x_d(t^C, φ^C)),    d = 1, 2, ..., D.    (8)

One common choice for F_d is the identity function, but in our investigations we found that the ANN performs better function approximations on a multivariable x vector if scaling is performed on each component. To do this, the F_d evaluations are chosen to come from the z-score standardization function, i.e.

    F_d(x_d(t^C, φ^C)) = (x_d(t^C, φ^C) − m_{x_d}) / s_{x_d},    (9)

where m_{x_d} and s_{x_d} are the mean and standard deviation, respectively, of the collection of numerical solutions x_d(t^C, φ^C) arising from all points (t^C, φ^C) in the collocation grid; all targets {x_d}_{d=1,...,D} are scaled using this method. Thus the inverse F_d^{-1} can be written as

    F_d^{-1}(u) = s_{x_d} u + m_{x_d}.

In our setting, it is important for F_d to be invertible since reverting the ANN outputs F̂_d to their original scale corresponding to the ODE model x̂_d will be crucial for the calculation of likelihoods detailed in Section 2.7. This target scaling is similarly applied to the extended states x†(t^C, φ^C) in Method II.
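A minimal numpy sketch of the gridding and target-scaling steps, using a hypothetical one-state system with a closed-form "numerical solution" in place of an actual ODE solver:

```python
import numpy as np
from itertools import product

# Hypothetical one-state system whose "numerical solution" is known in closed
# form, x(t, phi) = exp(-phi * t), standing in for an ODE solver's output.
t_grid = np.linspace(0.0, 10.0, 21)      # N_t = 21 points on [T0, T1]
phi_grid = np.linspace(0.1, 1.0, 10)     # N_phi = 10 points on the phi-domain

features = np.array(list(product(t_grid, phi_grid)))    # all (t^C, phi^C)
raw_targets = np.exp(-features[:, 1] * features[:, 0])  # x(t^C, phi^C)

# z-score standardization F_d over the whole collection of targets ...
m_x, s_x = raw_targets.mean(), raw_targets.std()
targets = (raw_targets - m_x) / s_x

# ... which is invertible, so network outputs can later be mapped back to the
# original ODE scale for likelihood evaluation
recovered = s_x * targets + m_x
assert np.allclose(recovered, raw_targets)
```

The feature matrix and scaled target vector would then be fed directly to the ANN training step.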

Artificial Neural Networks: Architecture
It is known that multilayer feed-forward neural networks with continuous sigmoidal activation functions are universal approximators of any measurable function (with the degree of success affected by the depth and width of the network) [13]. For this reason, the function space for x̂ that we consider in this paper is that of multilayer feed-forward neural networks with non-linear sigmoidal activations.
Since we wish to model (i.e. approximate) each x̂_d, d = 1, ..., D, as a function of time t and the parameters φ, the ANN should take these as inputs. In light of the target scaling mentioned in Equations (8)-(9) in Section 2.5, it is natural to define the output F̂_d(t, φ) as an F_d-transformed version of x̂_d; in other words x̂_d(t, φ) can simply be obtained as F_d^{-1}(F̂_d(t, φ)). Mathematically, the neural network for Method I can be represented by

    A^{(k)}_j = ϕ( Σ_i w^{(k)}_{ij} A^{(k−1)}_i + b^{(k)}_j ),    with F̂_d(t, φ) given by the (K+1)-th layer's output,    (10)

for d = 1, 2, ..., D and a continuous sigmoidal function ϕ, where w^{(k)}_{ij} is the weight connecting the i-th node of hidden layer k − 1 to the j-th node of hidden layer k and b^{(k)}_j is the corresponding bias, for k = 1, 2, ..., K + 1 (with the 0-th layer referring to the input layer and the (K+1)-th to the output layer). For simplicity we may also rewrite the input nodes (t, φ_1, ..., φ_{D+Q}) as (A_{01}, A_{02}, ..., A_{0,D+Q+1}). Equation (10) can also be expressed vectorially:

    A^{(k)} = ϕ( W^{(k)T} A^{(k−1)} + b^{(k)} ),    (11)

where ϕ is taken component-wise. Finally, we concatenate the output F̂_d of each individual ANN into one vector F̂ := (F̂_1, F̂_2, ..., F̂_D), from which the loss function extracts the output values and backpropagates the loss into each individual ANN.
Method II aims to also train the ANN to approximate the first derivatives of the system states. To this end, we simply use more individual ANNs of the same architecture, each used to approximate one derivative component ∂x_d/∂φ_i of x† in Equation (3). Since no modifications were made to the architecture, we may directly use x̂†_d and F̂†_d in place of x̂_d and F̂_d in Equations (10) and (11), and extend the index to d = 1, 2, ..., D† in the corresponding vectorized form (Equation (12)), where ϕ is again taken component-wise. The concatenation of outputs into a single vector F̂† is also carried out to facilitate backpropagation; details are discussed in Section 2.6. The general architecture of the ANNs used in Methods I and II is illustrated in Figure 1. Note that the inputs to the ANN architecture in both methods remain the same. Figure 2 gives the general architecture of the ANNs developed in Equation (12) for each collocation point (t^C, φ^C). Using these network architectures, x̂(t, φ) and x̂†(t, φ) specify analytically tractable functions of t and φ, which we can use to replace x(t, φ) in the likelihood (Equation (6)) as an approximation. For this approximation to be accurate we will need to train the network to find optimal weights w^{(k)}_{ij} and biases b^{(k)}_j based on suitable loss functions.
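The feed-forward pass of Equations (10) and (11) can be sketched in a few lines of numpy; the layer sizes, random weights and helper names below are illustrative only, not the trained networks of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # sizes = [input_dim, L_1, ..., L_K, 1]; one (W, b) pair per layer
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, A0):
    # Hidden layers: A^(k) = phi(W^(k)T A^(k-1) + b^(k)) with phi = tanh;
    # a linear output layer produces F_hat_d(t, phi)
    A = A0
    for W, b in params[:-1]:
        A = np.tanh(A @ W + b)
    W, b = params[-1]
    return A @ W + b

# Inputs (t, phi_1, ..., phi_{D+Q}) with D+Q = 2; two hidden layers of width 10
params = init_mlp([3, 10, 10, 1])
inputs = rng.uniform(size=(5, 3))        # five collocation points
out = forward(params, inputs)
assert out.shape == (5, 1)
```

Training would adjust `params` by backpropagating a loss such as those in Section 2.6; in practice an autodiff library (the paper uses tensorflow/keras in R) handles the gradients.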

Artificial Neural Network: Loss Function
During the training of the ANNs, loss functions are minimized with respect to the network weights w^{(k)}_{ij} and biases b^{(k)}_j using an optimization algorithm such as gradient descent or its adapted versions. Naturally, different loss functions lead to different optimal network weights and biases, as well as different outputs, thereby modifying the ANN's modelling properties. A common loss function is the mean squared error (MSE) between the training targets and the corresponding network output:

    MSE = (1/M) Σ_{t,φ} ‖ F(x(t, φ)) − F̂(t, φ) ‖²,    (13)

where ‖·‖ is the L²-norm and Σ_{t,φ} denotes the sum taken over all t and φ in each batch of training examples of size M. Equation (13) constitutes only a part of our loss function; the remaining parts attempt to improve the ANN model's adherence to the ODE system dynamics. The specific choices made are motivated subsequently.
Certain ODE models in the literature, for example the SIR model by Kermack and McKendrick [14] for modelling infectious diseases, satisfy natural constraints among the components of x(t, φ). Let the natural constraints of the ODE system we wish to model be generally represented as

    B(x(t, φ), φ) = 0.    (14)

Corresponding to the extension in Equation (4), we may also define an extended set of constraints B†(x†(t, φ), φ) = 0, which collects Equation (14) together with the constraints obtained by differentiating the original constraint Equation (14) with respect to φ_i, for i = 1, 2, ..., D + Q.
For Method I, define the loss function J_1 using the L²-norm ‖·‖:

    J_1 = (1/M) Σ_{t,φ} [ ‖ F(x(t, φ)) − F̂(t, φ) ‖² + λ_11^T ( dx̂(t, φ)/dt − f(x̂(t, φ), φ) )² + λ_12^T B(x̂(t, φ), φ)² ],    (15)

and similarly, define the loss function J_2 for Method II by replacing x with x†, f with f† and B with B†:

    J_2 = (1/M) Σ_{t,φ} [ ‖ F†(x†(t, φ)) − F̂†(t, φ) ‖² + λ_21^T ( dx̂†(t, φ)/dt − f†(x̂†(t, φ), φ) )² + λ_22^T B†(x̂†(t, φ), φ)² ],    (16)

where x̂ = F^{-1}(F̂) and x̂† = F†^{-1}(F̂†) are the back-transformed versions of the corresponding ANN output vectors, the squares are taken component-wise, and λ_11, λ_12, λ_21, λ_22 ≥ 0 are vectors of regularization coefficients.
In both J_1 and J_2, the first term represents a mean squared error which penalizes the deviation of the ANN's approximation from the numerical solution (i.e. the training target). When λ_11, λ_21 > 0, the second term explicitly forces the system states and their partial derivatives to adhere to the ODE dynamics, similar to the derivative penalties applied in [11] (except that our ANNs are supervised by numerical solutions). The last term, involving λ_12, λ_22 > 0, represents the ANN approximations' adherence to the system's constraints. Note that, when λ_11, λ_12, λ_21 and λ_22 are identically 0, the defined loss functions reduce to the simple MSE between the ANN's outputs and training targets in Equation (13). Since x̂ and x̂† are functions of the weights w^{(k)}_{ij} and biases b^{(k)}_j, the losses are minimized with respect to these weights and biases over a large number of epochs. The resulting weights and biases give rise to the outputs x̂(t, φ) (resp. x̂†(t, φ)) which approximate the solution x(t, φ) (resp. x†(t, φ)), and are used in the likelihood and posterior expressions.
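A minimal sketch of a J_1-style loss for a hypothetical scalar ODE dx/dt = −φx; the time derivative of the approximation is taken by finite differences here purely for brevity, whereas the paper obtains derivatives by automatic differentiation:

```python
import numpy as np

def j1_loss(x_hat, x_num, t, phi, lam11=0.5):
    # Data-fit MSE plus an ODE-residual penalty enforcing dx/dt = -phi * x
    mse = np.mean((x_num - x_hat) ** 2)
    dxdt_hat = np.gradient(x_hat, t)     # finite-difference stand-in for autodiff
    residual = dxdt_hat - (-phi * x_hat)
    return mse + lam11 * np.mean(residual ** 2)

t = np.linspace(0.0, 5.0, 201)
phi = 0.7
x_true = np.exp(-phi * t)                # exact solution of the toy ODE

loss_good = j1_loss(x_true, x_true, t, phi)        # near-zero loss
loss_bad = j1_loss(x_true + 0.1, x_true, t, phi)   # a biased fit is penalized
assert loss_good < loss_bad
```

Note that the shifted approximation is penalized twice: once by the MSE term and once by the residual term, since a constant offset also violates the ODE dynamics.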

Laplace's Method
In general, Laplace's Method [32] is an integral approximation technique. Its application to Bayesian statistics allows us to approximate the posterior distribution by a Gaussian centered around the maximum a posteriori (MAP) estimate, the parameter value at which the posterior density is highest.
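A minimal sketch of Laplace's method on a toy one-parameter Poisson model (not the SIR posterior): the MAP is found by BFGS, and the posterior variance is taken from the curvature of the negative log-posterior at the MAP:

```python
import numpy as np
from scipy.optimize import minimize

y = np.array([4, 6, 5, 7, 3])            # toy counts with rate lambda = exp(c)

def neg_log_post(c):
    # Poisson log-likelihood (up to a constant) plus a vague N(0, 100) prior
    log_lik = np.sum(y * c[0] - np.exp(c[0]))
    log_prior = -0.5 * c[0] ** 2 / 100.0
    return -(log_lik + log_prior)

res = minimize(neg_log_post, x0=[0.0], method="BFGS")
c_map = res.x[0]                          # MAP estimate of c

# Curvature of -log posterior at the MAP via a central finite difference;
# Laplace's method then gives c | y ~ N(c_map, 1 / curvature)
h = 1e-4
curv = (neg_log_post([c_map + h]) - 2.0 * neg_log_post([c_map])
        + neg_log_post([c_map - h])) / h ** 2
post_var = 1.0 / curv
assert abs(np.exp(c_map) - y.mean()) < 0.1
assert post_var > 0.0
```

In the multiparameter case the curvature is the Hessian matrix, and the Laplace approximation is a multivariate Gaussian with covariance equal to its inverse, as used later for the SIR posterior.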
Figure 3 shows a flowchart that summarizes the general algorithm for our methods, which can be applied to ODE models arising from various dynamical systems. The next section illustrates the general methodology for a special ODE model, the SIR epidemic model.

Application: The SIR Model
This section details the application of the proposed general methodology to a specific dynamical system: the Susceptible-Infected-Removed (SIR) model for infectious epidemics, first proposed by Kermack and McKendrick [14]. We first provide a brief introduction to the model's states and parameters, and then simulate datasets upon which we will draw posterior inference. The specific details of the ANN and Laplace approximations are discussed subsequently. Before showing the results, we also describe the Metropolis-Hastings (MH) algorithm, which acts as a benchmark for our methods.

The SIR Model
The SIR model is a compartmental model in epidemiology consisting of three compartments, or states, denoted Susceptible (S), Infectious (I) and Removed (R). An individual in the population can move from the susceptible state, S, to the infectious state, I, upon infection, and after a certain infectious period, move to the removed state, R; it is assumed that those in the removed state do not return to the susceptible state. We introduce the following notation: P denotes the total population size, β the transmission rate and γ the removal rate. Given these parameters, the SIR ODE model is then specified by the following system of ODEs:

    dS/dt = −e^{c_β} S I / P,    (18)
    dI/dt = e^{c_β} S I / P − e^{c_γ} I,    (19)
    dR/dt = e^{c_γ} I,    (20)

where we have parameterized c_{I_0} := ln I(0), c_γ := ln γ and c_β := ln β to ensure that we infer positive values for β, γ and I(0). Here we assume β and γ are constant over all t ∈ [T_0, T_1], as in the original SIR model by Kermack and McKendrick [14]. Thus φ = c := (c_{I_0}, c_γ, c_β), based on the notation introduced in Section 2. Note that S(0) = P − e^{c_{I_0}} is not included in φ since it is simply a function of c_{I_0}, and R(0) is fixed at 0.
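The SIR system under this log-parameterization can be solved numerically as sketched below, assuming the standard frequency-dependent transmission term βSI/P and the simulation settings reported later (c_{I_0} = ln 7, c_γ = ln 1/7, c_β = −0.25, P = 10,000):

```python
import numpy as np
from scipy.integrate import solve_ivp

P = 10_000
c_I0, c_gamma, c_beta = np.log(7.0), np.log(1.0 / 7.0), -0.25
I0, gamma, beta = np.exp(c_I0), np.exp(c_gamma), np.exp(c_beta)

def sir_rhs(t, x):
    S, I, R = x
    infection = beta * S * I / P          # assumed frequency-dependent form
    removal = gamma * I
    return [-infection, infection - removal, removal]

sol = solve_ivp(sir_rhs, (0.0, 100.0), [P - I0, I0, 0.0],
                t_eval=np.arange(0.0, 101.0), rtol=1e-8, atol=1e-8)
S, I, R = sol.y

# Natural constraint S + I + R = P holds along the trajectory,
# and an outbreak occurs since beta / gamma > 1
assert np.allclose(S + I + R, P)
assert I.max() > I0
```

Solutions of this kind, evaluated over the collocation grid of (t, c) values, serve as the ANN training targets for Methods I and II.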
Further differentiating Equations (18), (19) and (20) with respect to c_{I_0}, c_γ and c_β extends the ODE system for our methodology's needs. Abbreviating S(t, c), I(t, c) and R(t, c) as S, I and R respectively, we derive the following equations.

Simulated Datasets
In order to assess the approximations to the true posterior over a number of experiments, we simulate 200 different datasets {y_i}_{i=1,2,...,200}, where each y_i := (y_i(t_1), ..., y_i(t_N)) is generated from a fixed set of parameters c* = (c*_{I_0}, c*_γ, c*_β) = (ln 7, ln 1/7, −0.25) (henceforth referred to as the "true" value of the ODE's parameters) and P = 10,000. As in real practice, only some of the S-I-R compartments will be observed. In this simulated experiment, we choose y_i(t_n) to represent the number of newly removed (state R) individuals within the time interval (t_{n−1}, t_n]; further setting t_n = n days means the y_i(t_n)'s represent daily numbers of newly removed individuals. Since dR/dt = e^{c_γ} I(t, c) is the corresponding rate, we may use it as the expected value of y_i(t_n). Hence, for all i = 1, 2, ..., 200, y_i(t_n) is randomly generated from a Poisson distribution with mean e^{c*_γ} I(t_n, c*), where I(t, c*) is the numerical solution of the state I at time t under the parameter setting c*.

ANN: Training Examples and Architecture
For the training features of both Methods I and II, we chose a grid of equally spaced collocation points over the time and parameter domains of interest. Numerically solving the ODE for each parameter combination and then applying the target scaling transformation (i.e. F in Method I and F† in Method II), namely the z-score standardization, gives rise to the training targets. With reference to Figure 1 in Section 2.5, the depths and widths chosen for each ANN were K = 2 and L_1 = L_2 = 10 respectively, along with tanh activations (i.e. ϕ := tanh) for each hidden layer. The batch size chosen was M = 400 and the ANNs were trained for 2,500 epochs, using the loss functions J_1 and J_2 in Equations (15) and (16), applying the notation and methodology for the SIR model discussed in Section 3.1. The optimal values for λ_11, λ_12, λ_21 and λ_22 were found by trial and error: the values that gave the least weighted absolute error in the approximation were chosen, although we found that the performance did not vary much within a small neighbourhood of these optimal values. The values chosen were λ_11 = (0.5, 0.5, 0.5), λ_12 = 0.5, λ_21 = (0.5, 0.5, 0.5, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001) and λ_22 = (0.5, 0.001, 0.001, 0.001).
All model building and training described above were done using the tensorflow and keras libraries in R. Through functions in these libraries we were able to obtain the derivatives of the outputs with respect to each of their inputs (such as ∂Ŝ(t^C, c^C)/∂t), as is required by the loss functions defined above when λ_11, λ_12, λ_21, λ_22 > 0.

Laplace Approximation
Priors for c were specified to be independently Gaussian:

    c_j ~ N(μ_j, σ_j²) independently for j ∈ {I_0, γ, β},    (36)

giving rise to the prior density

    p(c) = ∏_j (2πσ_j²)^{−1/2} exp( −(c_j − μ_j)² / (2σ_j²) ),    (37)

with the values for the means and variances chosen to elicit vague [36] priors. Corresponding to the model used to generate the datasets {y_i}_{i=1,2,...,200}, a Poisson model with mean e^{c_γ} Ĩ(t_n, c) was used for the likelihood of observing each y_i(t_n), n = 1, 2, ..., N. Noting the conditional independence of the y_i(t_n), n = 1, ..., N given c, the complete likelihood L(c; y_i) can be written as

    L(c; y_i) = ∏_{n=1}^{N} [ e^{c_γ} Ĩ(t_n, c) ]^{y_i(t_n)} e^{−e^{c_γ} Ĩ(t_n, c)} / y_i(t_n)!,

resulting in the log-likelihood

    ln L(c; y_i) = Σ_{n=1}^{N} { y_i(t_n) [ c_γ + ln Ĩ(t_n, c) ] − e^{c_γ} Ĩ(t_n, c) − ln y_i(t_n)! },    (40)

where Ĩ(t_n, c) is the approximation to the solution of state I by the ANN specified in Section 3.3. We highlight here that the same set of ANNs is used for all 200 datasets without having to repeat training; this demonstrates the advantage of using function approximation methods. This feature thus has the added advantage of not needing reruns during the inference stage, unlike in the case of Markov chain Monte Carlo (MCMC) methods.
We can now formulate the Laplace approximation that leads us to the appropriate proposal distribution/approximate posterior h(c|y_i). As before, writing e^{g(c)} = p(c)p(y_i|c), we find an expression for g using Equations (37) and (40). The main objective of the Laplace approximation is to find the approximate maximum a posteriori (MAP) estimate ĉ such that ∇g(ĉ) = 0; this was done using the BFGS optimization algorithm within the optimx package in R, with the initial point on the boundary of the collocation grid, (ln 8, ln 1/6, −0.2). Finally, we arrive at the posterior Gaussian distribution c ~ N(ĉ, (−∇²g(ĉ))^{-1}) after observing the dataset y_i. The above procedure is repeated for each of the 200 experiments based on the different sets of observations y_i, i = 1, 2, ..., 200, and can be completed very quickly. On an HP workstation with 16 GB RAM and 12 Intel Core 7.0 processors with a processing speed of 2.60 GHz, each Laplace approximation task took, on average, 27.6 seconds for Method I and 1.77 seconds for Method II.

Performance Benchmark
The performance of our method is compared with the Random-Walk Metropolis-Hastings (MH) algorithm for sampling from the posterior distribution; see [41] for an example of such an application. For consistency, the same priors as in Equation (36) were used, along with the likelihood p(y_i|c) = ∏_{n=1}^{N} Poi(y_i(t_n) | e^{c_γ} I(t_n, c)), where I(t, c) comes from the numerical solution for a given set of (proposed) parameters c, for all i = 1, ..., 200.
Let n_iter be the number of iterations of the algorithm, and let α be a positive integer that divides n_iter. The Random-Walk MH algorithm for obtaining the posterior h(c|y_i) arising from the dataset y_i is as follows.
• Set an initial value of the parameters c^(0),
• Solve the system of ODEs for the SIR model numerically (Equations (18)-(22)) to obtain I(t, c^(0)),
• For j = 1, 2, ..., n_iter:
 - Propose a new set of parameters c′ from a symmetric distribution c′ ∼ N(c^(j−1), S²), where S² := diag(s²_{c_I0}, s²_{c_γ}, s²_{c_β}) controls the variance of this proposal,
 - Solve the system of ODEs for the SIR model to obtain I(t, c′),
 - Calculate the probability of acceptance p_acc = min{1, [p(c′)p(y_i|c′)] / [p(c^(j−1))p(y_i|c^(j−1))]},
 - Accept c′ as a sample from the posterior with probability p_acc and set c^(j) = c′; otherwise reject c′ and set c^(j) = c^(j−1) (i.e. we employ a "block" MH algorithm in which all parameters in c are updated together, not separately),
• Take c^(α), c^(2α), ..., c^(n_iter−α), c^(n_iter) as the posterior sample.
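The steps above can be sketched as a generic block Random-Walk MH sampler. Here `log_post` is a stand-in for the combined log prior and log likelihood (which in the paper requires a numerical ODE solve per proposal), and the thinning parameter plays the role of α:

```python
import math, random

def rw_metropolis(log_post, c0, step, n_iter, alpha, seed=0):
    """Block Random-Walk Metropolis sketch: propose all components of c
    jointly from N(c^(j-1), diag(step^2)), accept with probability
    min(1, exp(log_post(c') - log_post(c))), and keep every alpha-th draw."""
    rng = random.Random(seed)
    c, lp = list(c0), log_post(c0)
    sample = []
    for j in range(1, n_iter + 1):
        prop = [ci + rng.gauss(0.0, s) for ci, s in zip(c, step)]
        lp_prop = log_post(prop)
        if math.log(rng.random()) < lp_prop - lp:   # symmetric proposal: ratio of posteriors
            c, lp = prop, lp_prop
        if j % alpha == 0:                          # thinning
            sample.append(list(c))
    return sample
```

Working with log densities avoids underflow in the acceptance ratio; the min with 1 is implicit because log(u) < 0 always accepts when lp_prop ≥ lp.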
Sufficiently large values n_iter = 200,000 and α = 1,000 were chosen, along with appropriate values of s²_{c_I0}, s²_{c_γ}, s²_{c_β}, to ensure good representation of the posterior. Further, c^(0) was conveniently set to c* to avoid a long burn-in period. Applying a kernel density estimate to the posterior sample gives the posterior density resulting from this Random-Walk MH algorithm. Note that the MH algorithm needs to be rerun for each y_i, i = 1, ..., 200, to obtain posterior inference on c. On the same computer detailed at the end of Section 3.4, each MH procedure took an average of 11.9 minutes, even though the starting point c^(0) was chosen very close to c*.

Results
In this subsection, we first show the accuracy of the ANN approximations x̃ and x̃† from Methods I and II compared to the numerical solutions. Next, we examine the approximate posteriors {h(c|y_i)}_{i=1,...,200} obtained from each method, compared to the benchmark from the MH algorithm.
Figure 4 shows the ANN approximations S̃(t, c*), Ĩ(t, c*), R̃(t, c*), as well as the derivatives obtained from the ANN approximations, ∂S̃/∂c_I0(t, c*), ..., ∂R̃/∂c_β(t, c*), under Method II; note that unlike in Method I, the approximations to the derivatives are directly available from the ANN outputs. The differences in approximation performance for S̃(t, c*), Ĩ(t, c*), R̃(t, c*) between Methods I and II are not obvious from the plots; however, the approximations to the derivatives under Method II are evidently closer to the numerical solutions, as expected due to the explicit training and regularization. The plots for the second derivatives are not shown but follow a similar pattern to the first derivatives. To quantify these differences, Table 1 shows the percentage reduction from the total sum of squares (TSS) achieved by the ANNs in Methods I and II under the parameter setting c*, p = 100 × (TSS − SSE)/TSS. Here, TSS represents the error resulting from the simple mean estimate (1/50) Σ_{t=1}^{50} x_d(t, c*), and SSE represents the error between the ANN approximation x̃_d and the numerical solution x_d. Thus p describes the extent of the reduction in error provided by the ANN approximation relative to the simple mean estimate; a larger value of p indicates a better approximation by the ANN.
Notably, the percentage reductions from TSS achieved for all states and derivatives under Method II are larger than those under Method I; Method II achieves over 99% reduction from TSS for all states, first derivatives and second derivatives, whereas Method I falls behind, especially in approximating the second derivatives. This is in line with the design of the ANN in Method II: explicit training of the derivatives as well as more regularization. The accuracy of these ANN approximations affects the Laplace approximation that follows, owing to the use of Ĩ(t, c) in the likelihood expression, ∇_c Ĩ(t, c) in the BFGS algorithm, and ∇²_c Ĩ(t, c) in quantifying the uncertainty of the MAP estimate. We now examine the performance of each method in estimating the posterior distribution.
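The percentage reduction from TSS reported in Table 1 amounts to the following computation (a minimal sketch; `exact` plays the role of the numerical solution x_d over the time grid and `approx` that of the ANN approximation x̃_d):

```python
def pct_reduction_from_tss(approx, exact):
    """p = 100 * (TSS - SSE) / TSS, where TSS is the squared error of the
    simple mean of the numerical solution and SSE the squared error of the
    ANN approximation; larger p means a better approximation."""
    mean = sum(exact) / len(exact)
    tss = sum((x - mean) ** 2 for x in exact)           # error of the mean estimate
    sse = sum((a - x) ** 2 for a, x in zip(approx, exact))  # error of the ANN
    return 100.0 * (tss - sse) / tss
```

A perfect approximation gives p = 100, while an approximation no better than the simple mean gives p = 0.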
Figure 6 shows the marginal posterior distributions obtained for each parameter c_I0, c_γ, c_β from Method I, Method II and the MH algorithm, for each of the 200 simulated datasets. The posterior densities for all methods generally fall in the same area with little deviation; a more informative comparison is provided in Table 2, detailing the approximate Mean Integrated Squared Errors (MISE) [37] between the posterior density estimates from each of our methods and that from the MH algorithm, calculated using ISE(h_1, h_2) ≈ Σ_{j=1}^{H} [h_1(φ^(j)) − h_2(φ^(j))]² (φ^(j+1) − φ^(j)) for φ ∈ {c_I0, c_γ, c_β}, where h_1, h_2 are the densities to be compared, and {φ^(j)}_{j=1,...,H+1} are equally spaced points with [φ^(1), φ^(H+1)] chosen to cover a large enough range of values of φ and H large enough to ensure a good approximation to the true MISE. Table 2 shows that Method II gives posteriors closer to those from the MH algorithm overall. The pointwise averages of the 200 posterior densities are also plotted in Figure 7.
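The approximate ISE between two density estimates, whose average over the 200 datasets gives the MISE, can be sketched as a Riemann sum (assuming the densities are available as callables; grid endpoints and H are user choices):

```python
def approx_ise(h1, h2, lo, hi, H=1000):
    """Approximate integrated squared error between two densities h1, h2
    via a Riemann sum over H equally spaced points on [lo, hi]."""
    d = (hi - lo) / H
    return sum((h1(lo + j * d) - h2(lo + j * d)) ** 2 for j in range(H)) * d
```

Averaging `approx_ise` over the per-dataset density pairs yields the MISE values of the kind reported in Table 2.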

Influenza Example: Real Data
To demonstrate the real-life applicability of our proposed methods, we infer the parameters c_I0, c_γ, c_β from an influenza outbreak at a boys' boarding school reported in The Lancet [34] in 1978, shown in Table 3. This epidemic dataset provides the number of individuals confined to bed over time, and so corresponds to the infectious state (I) in the SIR model. Raissi [35] also applied their physics-informed deep learning methodology to this dataset to infer the values of β and γ. As in Section 3, we outline the training and design of the relevant ANNs, as well as the priors, likelihoods and posteriors pertaining to the Laplace approximations that follow. Lastly, the results from Methods I and II are compared with those obtained from a benchmark MH algorithm.

ANN: Training Examples and Architecture
A challenge in inferring the ODE parameters from real-life datasets using our methods is the choice of hyperparameters for the ANN's collocation grid; MCMC algorithms face a somewhat similar challenge in the choice of initial points. We could specify a large range of values in the hope of including the true values of the parameters, but if the grid density is to be kept reasonably high (to ensure good approximation between collocation points), the number of training examples would become extremely large. In light of this, we use a simpler, preliminary Maximum Likelihood Estimation algorithm to narrow down the range of values to be included in our collocation grid.
The approximate Maximum Likelihood Estimation algorithm uses a Monte Carlo approach as follows. First, 20,000 sets of parameters are generated from a large range of values, each of which is used to numerically solve the SIR ODE system given the observed influenza data. The likelihood under a Poisson model is calculated for each set of parameters, and the set with the largest likelihood is chosen as the (approximate) Maximum Likelihood Estimate (MLE). This procedure yields an approximation to the true MLE whose accuracy increases as the number of parameter sets generated becomes large. For the purpose of grid construction, however, this approximation suffices for training the ANNs within bounds of reasonable parameter values, and takes less than a minute. The values we obtained from this algorithm were (−0.574, −0.704, 0.593) = (ln 0.563, ln 0.495, ln 1.810), which inspires the choice of hyperparameters for the collocation grid. Before obtaining the training targets, we carried out an additional step to exclude parameter combinations that admit a basic reproduction number R_0 := e^{c_β − c_γ} of less than 1, since R_0 < 1 leads to a decline in the number of infected individuals over the entire time frame, which is clearly not the case in this dataset; this step removes unnecessary training examples and reduces computation time.
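The Monte Carlo approximate-MLE step, including the R_0 ≥ 1 filter, can be sketched as follows. Here `loglik` stands in for the Poisson log-likelihood evaluated through a numerical ODE solve, and the sampling bounds are illustrative assumptions:

```python
import math, random

def approx_mle(loglik, bounds, n_draws=20000, seed=1):
    """Monte Carlo approximate MLE sketch: draw parameter sets
    c = (c_I0, c_gamma, c_beta) uniformly from the given bounds,
    discard those with basic reproduction number
    R0 = exp(c_beta - c_gamma) < 1, and return the draw with the
    largest log-likelihood."""
    rng = random.Random(seed)
    best, best_ll = None, -math.inf
    for _ in range(n_draws):
        c = [rng.uniform(lo, hi) for lo, hi in bounds]
        if math.exp(c[2] - c[1]) < 1.0:   # R0 < 1: implausible for this outbreak
            continue
        ll = loglik(c)
        if ll > best_ll:
            best, best_ll = c, ll
    return best
```

The returned draw is only as accurate as the sampling density allows, which is why the paper uses it merely to center the collocation grid rather than as a final estimate.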
Subsequently, training targets were obtained as in Section 3.3: numerically solving the ODE for all parameter combinations and then applying z-score standardizations. The depths and widths chosen for each ANN were K = 2 and L_1 = L_2 = 10 with tanh activations. J_1 and J_2 were used as loss functions for Methods I and II respectively; however, we found somewhat different optimal values for the regularization coefficients: λ_11 = (0.5, 0.5, 0.
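The z-score standardization applied to each training-target series can be sketched as follows (a minimal stand-alone version; the paper applies it per state and per parameter combination):

```python
def zscore(xs):
    """Standardize a series to zero mean and unit (population) standard
    deviation, returning the standardized values along with the mean and
    standard deviation needed to invert the transform at prediction time."""
    n = len(xs)
    mu = sum(xs) / n
    sd = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [(x - mu) / sd for x in xs], mu, sd
```

Keeping `mu` and `sd` is essential: the ANN outputs live on the standardized scale and must be mapped back before entering the likelihood.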

Laplace Approximation
The same vague priors as in Section 3.4 were chosen for c; one could use a more informative prior based on the approximate MLE for more reliable inference, but in this case study a vague prior was sufficient.
Modelling the observed infected cases y := (y(t_1), ..., y(t_14)) with a Poisson distribution with mean Ĩ(t, c), the likelihood, and hence the corresponding expression for the function g in this real-life example, follow as before. The BFGS algorithm with initial point at the approximate MLE (−0.574, −0.704, 0.593) was then used to find ĉ and, subsequently, −(∇²g(ĉ))⁻¹, which together determine the approximate posterior h(c|y) = N(ĉ, −(∇²g(ĉ))⁻¹).

Performance Benchmark
A Metropolis-Hastings algorithm similar to that in Section 5.2 was carried out, with the main difference being the likelihood p(y|c) = ∏_{n=1}^{14} Poi(y(t_n) | I(t_n, c)) taken as the Poisson distribution. The hyperparameters n_iter and α were chosen to be 200,000 and 1,000, with s²_{c_I0} = s²_{c_γ} = s²_{c_β} = 0.05. A kernel density estimate with an appropriate bandwidth was applied to the 2,000 posterior samples from the MH algorithm, to be compared with h(c|y) from Methods I and II.
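A minimal Gaussian kernel density estimator, of the kind applied to the thinned MH samples, might look like this (bandwidth selection is left to the user; this is an illustrative sketch, not the exact estimator used in the paper):

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a callable density: the average of Gaussian kernels of the
    given bandwidth centered at each posterior sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-((x - s) / bandwidth) ** 2 / 2) for s in samples)
    return density
```

The resulting callable can be passed directly to an ISE routine to compare the MH posterior against the Gaussian Laplace posteriors.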

Results
Table 4 shows the percentage reduction p from the total sum of squares (TSS) achieved by the ANNs in Methods I and II at the approximate MLE (−0.574, −0.704, 0.593); Method II performs comparably with Method I in approximating the states and first derivatives, but fares much better in approximating the second derivatives.
The MH algorithm resulted in the following marginal posterior samples: the MAP estimates and posterior sample variances were (−0.944, 0.0633) for c_I0, (−0.730, 5.5121 × 10⁻⁴) for c_γ and (0.630, 7.9948 × 10⁻⁴) for c_β, respectively; note that the density estimates from the MH algorithm are not necessarily Gaussian. The approximate Integrated Squared Errors (ISE) between the posterior density from each of our proposed methods (h_1) and the kernel density estimate from the MH algorithm (h_2) are displayed in Table 5.
In summary, transforming the parameters back to their original scales (instead of the logarithmic scale), the MAP estimates for (I(0), γ, β) were (0.433, 0.482, 1.858), (0.426, 0.482, 1.861) and (0.389, 0.482, 1.877) under Method I, Method II and the benchmark MH algorithm, respectively. Since I(0) takes discrete values in reality, we conclude from the posteriors found that the most likely value of I(0) is 1 for both methods.

Additional Investigations
In this subsection, we investigate the posterior inference under two additional settings: (i) when the initial point for the BFGS algorithm is further from the optimum, and (ii) when the prior for c_I0 is specified to be very informative (low variance) and centered around 0 (i.e. I(0) = 1), in light of the conclusion drawn in the previous section.
Using the same ANNs trained in Section 4.1, we repeat the posterior inference with a different initial point for the BFGS algorithm, (ln 2.9, ln 1/2.9, 0.41), near the boundaries of the collocation grid; the approximate marginal posteriors for c_I0, c_γ and c_β under Method I are essentially unchanged. Alternatively, using a very small prior variance for c_I0 in the original algorithm yields similar results.

Method I vs. Method II
The difference between Method II and Method I lies in the training of the derivatives ∇_φ x as well as the regularization terms pertaining to these derivatives. In the simulated example, the ANN from Method II did better in approximating all states and derivatives compared to Method I (see Table 1). In the real-life example, the ANN in Method II performed better in approximating the second derivatives, but was comparable to Method I in approximating S, I, R and their first derivatives (see Table 4); this is still intuitively plausible, since Method II does not perform worse than Method I overall: the extra outputs and regularizations still contributed to an improvement in the approximations, even if only marginal. The discrepancies between the former and the latter may have several explanations, among which are:
• Random initializations of the ANN weights and biases w_j: the proposed loss functions could have multiple minima, especially for the SIR ODE system, which can be rather complex. Slight differences in initialization could steer the ANN towards different minima and result in different weights and biases;
• Accuracy tradeoffs due to training more states in the same number of epochs: both ANNs in Methods I and II were trained for 2,500 epochs, but the ANN in Method II has a concatenated output of larger dimension. This makes the optimization problem in Method II more complex, and a larger number of epochs may be needed to achieve consistently better predictions for S, I, R and their first derivatives;
• Choice of regularization coefficients: λ_11, λ_12, λ_21, λ_22 were determined via trial and error in both Methods I and II. Owing to the time needed to train an ANN, only a limited number of combinations could be examined. Even though we found that the performance of the ANN does not vary greatly within a small neighbourhood of the optimal values, this could explain the small discrepancy between the simulated dataset ANNs and the influenza example ANNs. Perhaps a more thorough optimization algorithm for the regularization coefficients would lead to Method II performing consistently better than Method I, but that is beyond the scope of this paper.
In estimating the posterior distributions, we see in Figures 7 and 8, as well as in Tables 2 and 5, that Method II produces posterior distributions closer to those obtained with the benchmark MH algorithm, compared with Method I, in both the simulated dataset and influenza examples. This demonstrates the two-fold advantage that Method II provides over Method I: more accurate approximations to the first and second derivatives, leading respectively to more accurate MAPs found via the gradient-based BFGS algorithm and to better uncertainty quantification. This does, however, come with heavier computational costs; the time taken to train each ANN is shown in Table 6. The BFGS runtime for the Laplace approximation is also shown in Table 7; this is faster under Method II, since the first derivatives required by the algorithm are obtained directly from the ANN output, whereas differentiation is required to obtain them under Method I.
Lastly, we also see from Section 4.5 that the results are not sensitive to the initial point of the BFGS optimization algorithm. This is due to the good approximations to the first and second derivatives, which render the optimization algorithm and uncertainty quantification consistent, demonstrating the importance of sufficient training as well as regularization of the ANNs. Additional evidence of this is the near-identical posterior inference under Method II despite a relatively large change in the initial point. We also found that reducing the extent of training of the ANN increases the sensitivity of the MAP to the initial point, as expected.

Proposed Methods vs. Random Walk MH
The benchmark MH algorithm utilizes the numerical solution of the ODE system to calculate likelihoods and, with a large number of iterations and good mixing, produces a very representative sample from the posterior. The reliability of this algorithm lends itself naturally to being the benchmark for our proposed methods. As the number of iterations tends to infinity, the posterior distribution inferred using the MH algorithm represents the ceiling of the performance of our methods, since our ANNs are also trained with numerical solutions as targets.
The main drawback of the MH algorithm for inferring the parameters of an analytically intractable ODE system is the computational cost: the ODE system needs to be solved for every proposed move, causing a long run-time; see Table 7. For n_iter = 200,000 and α = 1,000, the MH algorithm takes more than 39 hours to generate appropriate posterior samples for the 200 simulated datasets, significantly longer than the time taken by the Laplace approximation algorithm to produce the corresponding approximate posteriors in both Methods I and II; Method II, especially, produces results comparable to the MH algorithm. This highlights ANNs as a flexible function approximation tool that allows quick evaluation of the approximate solution of the ODE system and, hence, fast and convenient posterior inference, at the sacrifice of little accuracy.
Furthermore, the MH algorithm becomes inefficient when sampling in high dimensions, especially when there are high correlations between parameters. In other general MCMC methods, this problem may be overcome by modifying the jump proposals or the acceptance criteria, potentially changing the space from which the algorithm is able to sample. Inefficiencies in high dimensions may occur with ANNs as well, but can be circumvented using deeper and wider networks, or better optimizers that adapt to different difficulties in optimization; the ANN approach makes the optimization task explicit and independent, so that specific modifications can be made to the optimizer alone, without changing the structure or the function space that it may access.

General Advantages and Limitations
The inferential capabilities of the ANN approach coupled with the Laplace approximation are comparable to those of a typical MCMC approach, such as the Random-Walk MH algorithm. Once trained, the ANN approximates the trajectory/solution surface at all points within the designed range, after which a relatively lightweight optimization task based on the Laplace approximation yields the approximate posterior. This proves convenient if the true values of the ODE parameters change to nearby values within the designed range: the prediction and gradient surfaces remain accurate and can be reused in the Laplace approximation optimization. This is in contrast to the MH algorithm, where the entire algorithm needs to be run again to infer the corresponding posterior distribution, considerably increasing the computational cost, especially if the true values change frequently or sequentially. This advantage of ANNs over MCMC algorithms becomes more evident in sequential settings, where the latter needs to be rerun every time a new observation arrives.
Another advantage of the ANN approach, shared with other function approximation methods such as basis function expansion, stems from the flexibility of its loss or objective function, which allows regularization to be applied to improve posterior inference of the ODE parameters. The extent of regularization can be further controlled by varying the values of the regularization coefficients λ, allowing a bias-variance tradeoff. Traditional MCMC methods lack this flexibility, and their hyperparameters typically concern mixing efficiency and sufficient representation of the posterior rather than controlling a bias-variance tradeoff.
The main limitation of our approach is the lack of flexibility of the approximate posterior, since the Laplace approximation restricts it to a Gaussian density. Transformations of the parameters can overcome this restriction to some extent (much like the log-transformation of our original parameters I(0), γ and β), but the posterior distribution is still restricted to certain transformations of the Gaussian distribution. This is slightly evident in Figure 8, where the posterior distribution for c_β (and, to a lesser extent, for c_I0) resulting from the MH algorithm is slightly skewed, a characteristic not captured by Method I or II. Traditional MCMC approaches are not bound by this limitation and can flexibly sample from any well-behaved posterior. A consolation for our method is that it may be extended by using the approximate posterior in conjunction with importance sampling, reweighting samples generated from the Laplace method to obtain a better representation of the posterior. In that sense, the Laplace-approximated posterior must approximate the true posterior well, since the resampling weights quickly deteriorate if the approximation is crude.
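The suggested importance-sampling extension can be sketched in one dimension: draws from the Gaussian Laplace approximation are reweighted by the ratio of the (unnormalized) posterior to the proposal density, with self-normalized weights. The names and the 1-D setting are illustrative assumptions:

```python
import math, random

def laplace_importance_sample(log_post, c_hat, sd, n=5000, seed=0):
    """Self-normalized importance sampling with the Laplace approximation
    N(c_hat, sd^2) as proposal. log_post is the unnormalized log-posterior;
    constants in the proposal density cancel under self-normalization."""
    rng = random.Random(seed)
    draws = [rng.gauss(c_hat, sd) for _ in range(n)]
    # log weight = log_post(c) - log N(c | c_hat, sd^2), up to a constant
    logw = [log_post(c) + (c - c_hat) ** 2 / (2 * sd ** 2) for c in draws]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]   # subtract max for numerical stability
    W = sum(w)
    return draws, [wi / W for wi in w]
```

If the Laplace approximation is close to the true posterior, the normalized weights are nearly uniform and the reweighted sample is efficient; a crude approximation concentrates the weight on a few draws, which is exactly the deterioration noted above.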
Methods I and II depend on a design, particularly the grid of values used to train the ANN, and therefore inherit the limitations imposed in the process of fixing this design. In our applications we used a relatively small grid of values, specified in (35), and predictions outside this range would fall apart. This issue could of course be addressed by increasing the range of the grid, but maintaining the same grid density would mean that the number of training points grows on the order of O(n^p), where p is the number of ODE parameters, heavily increasing the computational cost. A data-dependent approach such as that described in [12] is much more general.
Our methods naturally inherit all the challenges of utilizing ANNs. In our case, applying a z-score standardization was sufficient, but for certain dynamical systems the numerical solutions can have a few extremely large or small values, rendering the distribution of the training targets rather skewed. Without modification, the training of the ANN could easily neglect the fit around such values. This can, however, be circumvented by applying transformations, such as quantile transforms, to the training examples where necessary, which may be investigated in future research.

During the training stage, the loss is backpropagated through each ANN and minimized with respect to their weights and biases. The concatenation illustrated in Figure 2 facilitates the backpropagation in an algorithmic way. With the batch size, loss functions, and λ_11, λ_12, λ_21 and λ_22 specified, we can use an optimization algorithm to minimize the loss function with respect to each ANN's weights w^(k)_ij and biases b^(k)_j.
S(t) = number of susceptible individuals at time t,
I(t) = number of infectious individuals at time t,
R(t) = number of removed individuals at time t,
β = transmissibility of the disease = (average number of contacts per individual per unit time) × (probability of a successful transmission),
γ = average transition rate of an individual from infectious to removed.

Figure 6 :
Figure 6: Variability of the marginal posterior distributions for the 200 simulated datasets (Blue: Method I, Green: Method II, Red: MH algorithm, Black line: true value) corresponding to the parameters (a) c_I0, (b) c_γ and (c) c_β.

MISE by parameter:
Parameter | MH vs. Method I | MH vs. Method II
c_I0      | 0.106           | 0.0393
c_γ       | 2.32            | 2.59
c_β       | 0.706           | 0.307

Figure 7 :
Figure 7: Point-wise average of the marginal posterior distributions based on the 200 simulated datasets (Blue: Method I, Green: Method II, Red: MH algorithm, Black line: true value) corresponding to the parameters (a) c_I0, (b) c_γ and (c) c_β.

Figure 8
Figure 8 shows the resulting approximate marginal posteriors h(c|y) for c_I0, c_γ and c_β under both Methods I and II, plotted together with the kernel density estimate from the benchmark MH algorithm.

Figure 8 :
Figure 8: Marginal posterior distributions for (a) c I0 (b) c γ and (c) c β for the influenza dataset (Blue: Method I, Green: Method II, Red: MH algorithm, Black line: approximate MLE).
204,800 examples. 20% of these were randomly selected to form the test dataset, and a further 20% of the remaining examples comprised the validation set (leaving 64% × 204,800 = 131,072 examples for training). The choices of the values in (35) are based on the true values c* = (ln 7, ln 1/7, −0.25) of the parameters used to simulate the datasets in Section 3.2. Note that N_t, N_{c_I0}, N_{c_γ} and N_{c_β} were chosen so that the grid is dense enough for the ANN to produce good approximations.

Table 1 :
Numerical values for the percentage reduction from TSS for all states, first derivatives and second derivatives achieved by the respective ANNs at c * , for Methods I and II.

Table 2 :
Numerical values of MISE between the marginal posterior distributions corresponding to the parameters c I0 , c γ and c β in the simulated datasets example.

Table 4 :
Numerical values for the percentage reduction from TSS for all states, first derivatives and second derivatives achieved by the respective ANNs at the approximate MLE (−0.574, −0.704, 0.593), for Methods I and II.

Table 5 :
ISE between the posterior density estimate from the MH algorithm and the approximate posteriors from each of Methods I and II for c I0 , c γ and c β in the influenza example.

Table 6 :
Time taken to train ANNs for Methods I and II on a HP workstation with 16 GB RAM and 12 Intel Core 7.0 processors with processing speed of 2.60 GHz, in the simulated dataset and influenza examples.

Table 7 :
Time taken for posterior inference algorithms under Methods I and II, on a HP workstation with 16 GB RAM and 12 Intel Core 7.0 processors with processing speed of 2.60 GHz, in the simulated dataset and influenza examples.