Trajectory inference and parameter estimation in stochastic models with temporally aggregated data

Stochastic models are of fundamental importance in many scientific and engineering applications. For example, stochastic models provide valuable insights into the causes and consequences of intra-cellular fluctuations and inter-cellular heterogeneity in molecular biology. The chemical master equation can be used to model intra-cellular stochasticity in living cells, but analytical solutions are rare and numerical simulations are computationally expensive. Inference of system trajectories and estimation of model parameters from observed data are important tasks and are even more challenging. Here, we consider the case where the observed data are aggregated over time. Aggregation of data over time is required in studies of single cell gene expression using a luciferase reporter, where the emitted light can be very faint and is therefore collected for several minutes for each observation. We show how an existing approach to inference based on the linear noise approximation (LNA) can be generalised to the case of temporally aggregated data. We provide a Kalman filter (KF) algorithm which can be combined with the LNA to carry out inference of system variable trajectories and estimation of model parameters. We apply and evaluate our method on both synthetic and real data scenarios and show that it is able to accurately infer the posterior distribution of model parameters in these examples. We demonstrate how applying standard KF inference to aggregated data without accounting for aggregation will tend to underestimate the process noise and can lead to biased parameter estimates.


Introduction
Stochastic differential equations (SDEs) are used to model the dynamics of processes that evolve randomly over time. SDEs have found a range of applications in finance (e.g. stock markets, Hull 2009), physics (e.g. statistical physics, Gardiner 2004) and biology (e.g. biochemical processes, Wilkinson 2011). Usually, the coefficients (model parameters) of SDEs are unknown and have to be inferred using observations from the systems of interest. Observations are typically partial (e.g. collected at discrete times for a subset of variables), corrupted by measurement noise, and may also be aggregated over time and/or space. Given these observed data, our task is to infer the process trajectory and estimate the model parameters.
A motivating example of stochastic aggregated data comes from biology and more specifically from luminescence bioimaging, where a luciferase reporter gene is used for studying gene expression inside a cell (Spiller et al. 2010). The luminescence intensity emitted from the luciferase experiments is collected from single cells and is integrated over a time period (in certain cases up to 30 min, Harper et al. 2011) and then recorded as a single data point. In this paper, we consider the problem of inferring SDE model parameters given temporally aggregated data of this kind.
Imaging data from single cells are highly stochastic due to the low number of reactant molecules and the inherent stochasticity of cellular processes such as gene transcription or protein translation. The chemical master equation (CME) is widely used to describe the evolution of biochemical reactions inside cells stochastically . Exact inference with the CME is rare and, even when possible, computationally prohibitive. In Golightly and Wilkinson (2005), the authors perform inference using a diffusion approximation of the CME, resulting in a nonlinear SDE. The linear noise approximation (LNA) (Kampen 2007) has been used as an alternative approximation of the CME which is valid for a sufficiently large system (Komorowski et al. 2009;Fearnhead et al. 2014). According to the LNA, the system is decomposed into a deterministic and a stochastic part. The latter is described by a linear SDE of the following form: where X t is a d-dimensional process, a t is a (d × d)-matrixvalued function, W t is an m-dimensional Wiener process, and b t a (d × m) matrix-valued function.
Given an initial condition X 0 = c, Eq.
(1) has the following known solution (Arnold 1974): where Φ t is the fundamental matrix of the homogeneous equation dX t = a t X t dt. Note that the right integral in Eq.
(2) is a Gaussian process, as it is an integral of a nonrandom function with respect to W t (Arnold 1974). If we further assume that the initial condition c is normally distributed or constant, Eq.
(2) gives rise to a Gaussian process. Additionally, the solution of a (linear) SDE is a Markov process (Arnold 1974). These properties of linear SDEs (of the form of Eq. (1)) are highly desirable when carrying out inference. The approaches above do not treat the aggregated nature of luciferase data in a principled way but instead assume that the data are proportional to the quantity of interest at the measurement time (Harper et al. 2011;Komorowski et al. 2009). Here, we build on the work of Komorowski et al. (2009) and Fearnhead et al. (2014) and extend it to the case of aggregated data. Since we are using the LNA, the problem is equivalent to a parameter inference problem for the time integral of a linear SDE as in Eq. (1): t t 0 X (u)du. We follow a Bayesian approach, where the likelihood of our model is computed using a continuous-discrete Kalman filter (Särkkä 2006) and parameter inference is achieved using an MCMC algorithm. The paper is structured as follows: we first provide a description of the LNA as an approximation of the CME and introduce the integral of the LNA for treating temporally aggregated observations. We then describe a Kalman filter framework for performing inference with the LNA and its integral. Finally, we apply our method in three different examples. The Ornstein-Uhlenbeck process has been picked as a system where we can study its exact solutions. The Lotka-Volterra model was selected as an example of a nonlinear system with partial observations. The translation inhibition model was used to demonstrate our method with real data.

The linear noise approximation and its integral
The CME can be used to model biochemical reactions inside a cell. It is essentially a forward Kolmogorov equation for a Markov process that describes the evolution of a spatially homogeneous biochemical system over time.
Assume a biochemical reaction network consisting of N chemical species X 1 , . . . , X N in a volume Ω and v reactions R 1 , . . . , R v . The usual notation for such a network is given below: . , X N ) T represents the number of chemical species (we assume molecules) and x = X Ω is the concentration of molecules. We denote with P the v × u matrix whose elements are given by p i j and Q the v × u matrix with elements q i j . We define the stoichiometry matrix S as S = (Q − P) T . The probability of a reaction taking place in [t, t + dt) is given by the vector of reaction rates h j (x, Ω, t)Ωdt.
The probability p(X, t) that the system is in state X at time t is given by the CME: However, as mentioned before, exact inference with the CME, even when possible, is computationally prohibitive. We use the LNA as an approximation of the CME due to its successful application in Komorowski et al. (2009) andFearnhead et al. (2014). The state of the system X is expected to have a peak around the macroscopic value of order Ω and fluctuations of order Ω 1/2 such that X t = Ωφ t + Ω 1/2 ξ t . This way the system is decomposed to the deterministic part φ t and the stochastic part ξ t . The LNA arises as a Taylor expansion of the CME in powers of the volume Ω; for a detailed derivation the reader is referred to Kampen (2007) and Elf and Ehrenberg (2003). By collecting terms of order Ω 1/2 , we obtain the deterministic part of the system, namely the macroscopic rate equations φ i , where i stands for the ith species: Terms of order Ω 0 give us the stochastic part of the system: where (5) is a linear SDE of the form of Eq. (1). Its solution is a Gaussian Markov process, provided that we have an initial condition that is a constant or a Gaussian random variable. The ordinary differential equations (ODEs) that describe the mean and variance of this Gaussian process are given by Arnold (1974): Note that if we set the initial condition of m 0 = 0, then Eq. (6) will lead to m t = 0 at all times. We will make the assumption that, at each observation point, m t is reset to zero since it can be beneficial for inference as discussed in Fearnhead et al. (2014) and Giagos (2010).
In what follows we will assume, without loss of generality, that the volume Ω = 1, i.e. the number of molecules equals the concentration of molecules and thus, Equation (8) is the sum of a deterministic and a Gaussian term; consequently, it will also be normally distributed. By taking its expectation and variance, we have that X t |X 0 ∼ N (φ t +m t , V t ) which, according to the initial condition m 0 = 0, leads to X t |X 0 ∼ N (φ t , V t ).
We are now interested in the integral of Eq. (8), as this will allow us to model the aggregated data, The deterministic part of this aggregated process is given by I (t), and the stochastic part is given by Q(t). Subsequently, we have the following ODEs: Here, Q t will also follow a Gaussian process (as it is the integral of a Gaussian process) so we need to compute its mean and variance. The ODEs for the mean, variance and E[Q t ξ T t ] are given below; their proofs can be found in "Appendix A.1": Note that Q t is not Markovian since knowledge of its history is not sufficient to determine its current state. However, jointly with ξ t it forms a bivariate Gaussian Markov process, that is characterised by the following linear SDE: From Eq. (15) we have that ξ t , Q t are jointly Gaussian and, consequently, their marginals are also normally distributed. Thus, according to (

Kalman filter for the LNA and its integral
The classical filtering problem is concerned with the problem of estimating the state of a linear system given noisy, indirect or partial observations (Kalman 1960). In our case, the state is continuous and is described by Eq. (8) while the observations are collected at discrete time points with or without Gaussian noise. For this reason, we refer to it as the continuous-discrete filtering problem (Jazwinski 1970;Särkkä 2006). First, we consider the case where observations are taken from the process X t and not from its integral H t . In that case, the observation process is given by y t = P t X t + t where t ∼ N (0, R) and accounts for technical noise. The observability matrix P t is used to deal with the partial observability of the system, for example, if we have two species X 1 , X 2 and we observe only X 1 , P = [1, 0] T .
Following the Kalman filter (KF) methodology, we need to define the following quantities: -Prior: p(X 0 ). -Predictive distribution: p(X t |y 1:t−1 ), where y 1:t−1 refers to the observations at discrete points up to time t − 1. -Posterior or Update distribution: p(X t |y 1:t ).
The predictive distribution is given by X t |y 1: where μ − 1t and V − t are found by integrating forward for [t, t − 1] Eqs. (4) and (7) initialised at the posterior mean μ 1t−1 and variance V t−1 . In our case, the mean of the stochastic part is initialised at 0, so μ 1t corresponds to the deterministic part φ t . By updating the deterministic solution at each observation point, we achieve a better estimate, as the ODE solution can become a poor approximation over long periods of time. The posterior distribution p(X t |y 1:t ) = N (μ 1t , V t ) corresponds to the standard posterior distribution of a discrete KF and the updated μ 1t and V t are given in "Appendix A.3". This case has been thoroughly studied in Fearnhead et al. (2014).
We consider now the case where the state X t is being observed through the integrated process H t , such that the observation process is given by y t = P t H t + t and t ∼ N (0, R). Again, we need to define a prior distribution as well as calculate the predictive and posterior distributions for the system that we are studying.
The predictive distribution of our system is given by For this step, we need to integrate forward the ODEs (4), (10), (7), (13) and (14) with the appropriate initial conditions as seen in Algorithm 1. Note that the integrated process H t needs to be reset to 0 at each observation point in order to capture correctly the 'area under graph' of the underlying process X t .
To compute the posterior distribution p(X t |y 1:t ), we look at the joint distribution of (H t , X t , y t ) conditioned on y 1:t−1 : By using the lemma in "Appendix A.2" and using the corresponding blocks of the joint distribution (16), we can calculate the posterior mean and variance of p(X t |y 1:t ): Since we are interested in parameter inference, we will need to compute the likelihood L(θ ) of the system, where θ represents the parameter vector of the system: The individual terms of the likelihood are given by p(y t | y 1: Parameter inference is then straightforward either by using a numerical technique such as the Nelder-Mead algorithm to obtain the maximum likelihood (ML) parameters or using a Bayesian method such as a Metropolis-Hastings (MH) algorithm. The general procedure for performing inference using aggregated data is summarised in Algorithm 1.

The Ornstein-Uhlenbeck process
We first investigate the effect of integration in a onedimensional, zero-mean OU process of the following form: where α is the drift or decay rate of the process and σ is the diffusion constant. Both of these parameters are assumed to be unknown, and we will try to infer them using the KF scheme that we have developed. The OU process is a special case of a linear SDE (Eq. (1)), since its coefficients are time invariant, resulting in a stationary Gaussian-Markov process. Analytical solutions for both the OU and its integral exist (Gillespie 1996) and are presented in "Appendix A.4". The results for the mean m t and variance V t of the OU, where = t − t 0 , are given below: The integral of Eq. (19) is given by dY t = X t dt, and the mean, variance and covariance are given below, We are interested in inferring the parameters α and σ given observations from Y t at discrete times, where the interval between two observations is constant. We will compare two approaches. First, we will assume that the data come directly from X t ignoring their aggregated nature and use the standard discrete-continuous KF, referred to as KF1. To make the comparison of this scenario fairer, we will normalise the observations by dividing with , which brings the observation close to an average value of the process, in an attempt to match the observations to data generated from the process X t . In the second case, we will use the KF on the integrated process in analogy with Algorithm 1, which we will refer to as KF2. The case of inferring the parameters of an OU process using non-aggregated data with an MCMC algorithm has already been studied in Mbalawata et al. (2013). X t will reach its stationary distribution after a time of order 1 α , which is given by N (0, σ 2 2α ) . However, the integrated process Y t is non-stationary since Var[y t ] → ∞ as → ∞ . This already shows us that the two processes behave differently.
Since we are going to use the normalised observations from Y t with KF1, we will take a look at the normalised process Z t = 1 Y t : Var By taking the limit as → ∞ in Eq. (22) and using L'Hospital's rule we can show that E[z t ] → 0 and Var[z t ] → 0. So, the normalised process is again not approaching the stationary distribution of X t .
We have generated aggregated data from the integral of an OU process with α = 4 and σ = 2. To simulate data from Y t , we need to first simulate data from X t . This can be done in general by discretising the process and using the Euler-Maruyama algorithm. However, in the case of the OU process, we can also use an exact updating formula (see "Appendix A.6"). The aggregated data can then be collected using the discretised form Y t+dt = Y t + X t dt or a numerical integration method such as the trapezoidal rule over the indicated integration period. In "Appendix A.12" we have included plots of the OU process and the corresponding aggregated process. We tested inference using KF1 with normalised data and KF2 with aggregated data. Results of parameter estimation using a standard random walk MH algorithm are presented in Table 1. Improper uniform priors over infinite range have been used on the log parameters, while different time intervals have been considered. For each interval , we have sampled 100 observations from a single trajectory of an OU process with α = 4 and σ = 2 aggregated over the specified . For this example, we have assumed no observation noise. MCMC traceplots of α and σ can be found in "Appendix A.13" (Figs. 6, 7) which indicate a good mixing of the chain and fast convergence. All chains were run for 50K iterations and 30K were discarded as burn-in. To verify the validity of the results, we have run nine more datasets, separately each time. An average over the ten datasets can be found in "Appendix A.7" ( Table 5). As we can see, the estimates for KF1 deteriorate for larger . This is expected since the aggregated process diverges further from the OU process as increases. Estimates remain good for KF2 even when is large, although they become more uncertain, as can be witnessed by the increased standard deviations. Filtering results for KF1 and KF2 with aggregated data using the estimated parameter results for = 1 are given in "Appendix A.14".
It is of interest to investigate the inferred stationary variance of the OU process using KF1 and KF2. We have plotted the inferred stationary variances obtained by the MH for both KF1 and KF2 in Fig. 1. The boxplots are obtained using the average of 10 different datasets and correspond again to an OU process with α = 4 and σ = 2, thus giving rise to a stationary variance of σ 2 2α = 0.5. When using the normalised aggregated data directly with KF1, we infer the wrong sta- Fig. 1 Boxplots of inferred stationary variance of the OU process for different . The simulated OU process has α = 4 and σ = 2 corresponding to a stationary variance of 0.5, as indicated by the dotted horizontal line. The inferred stationary variance using KF1 tends to zero as grows, but the stationary variance from KF2 is inferred correctly at all . a Boxplots of inferred stationary variance for different using KF1. b Boxplots of inferred stationary variance for different using KF2 tionary variance of the underlying OU process which tends to zero as becomes larger, consistent with the theoretical results from Eq. (22). Intuitively, we can attribute this behaviour to the fact that aggregated data have relatively smaller fluctuations, so that KF1 will tend to underestimate the process variance.
In this section, we have looked at an example of inferring the parameters of an SDE using aggregated data, and we have found that to obtain accurate results we need to explicitly model the aggregated process. As the observation intervals become larger, there is a greater mismatch between KF1 and KF2. In the next two sections, we will look at examples of more complex stochastic systems that must be approximated by the LNA and compare again inference results using KF1 and KF2.

Lotka-Volterra model
We are now going to look at a system of two species that interact with each other according to three reactions The model represented by the biochemical reaction network (23) is known as the Lotka-Volterra model, with X 1 representing prey species and X 2 predator species. Although a simple model, it has been used as a reference model (Boys et al. 2008;Fearnhead et al. 2014) since it consists of two species, making it possible to observe it partially through one of the species and also provides a simple example of a nonlinear system. The LNA can be used to approximate the dynamics and the resulting ODEs can be found in "Appendix A.8". We want to compare parameter estimation results using KF1 and KF2. We collected aggregated data from a Lotka-Volterra model using the Gillespie algorithm. We assumed a known initial population of 10 prey species and 100 predator species. The parameters of the system for producing the synthetic data were set to (θ 1 , θ 2 , θ 3 ) = (0.5, 0.0025, 0.3), following (Boys et al. 2008). We have added Gaussian noise with standard deviation set to 3.0, and we assumed that the noise level was known for inference. Our goal was to infer the three parameters (θ 1 , θ 2 , θ 3 ) of the system using aggregated observations solely from the predator population.
The Gillespie algorithm was run for 20 min. Data were aggregated and collected every 2 min resulting in 10 observations per sample. To infer the parameters, we assumed that we had 40 independent samples available. Since we assumed independence between the samples, we worked with the product of their likelihoods. In the ideal case of having complete data of a stochastic kinetic model the likelihood is conjugate to an independent gamma prior for the rate constants (Wilkinson 2011). The choice of Ga(2,10) with shape = 2 and rate = 10 gives a reasonable range for all three parameters and has also been used by Fearnhead et al. (2014). However, in this case the choice of prior is not important as the data dominate the posterior. We have run the same experiment using uninformative exponential priors Exp(10 −4 ) that resulted in equivalent posterior distributions. Since we know that we want all parameters to be positive, we worked with a log transformation. MCMC convergence in this example is relatively slow and adaptive MCMC (Sherlock et al. 2010) was found to speed up convergence (see "Appendix A.9" for details). The adaptive MCMC was run for 30K iterations with 10K regarded as burn-in. The MCMC was initialised at random values sampled from uniform distributions. Parame- ter estimation results for all three parameters using adaptive MCMC are shown in Table 2, while Fig. 2 shows histograms of their posterior densities. The ground truth value for each parameter is indicated by a vertical blue line. We can see that only the posterior histograms corresponding to KF2 include the correct estimate for all three parameters in their support. In "Appendix A.15", we have included traceplots of the MCMC runs for all three parameters, where we can see that the adaptive MCMC leads to a fast convergence for both KF1 and KF2. In order to verify the validity of our results, we have run an extra 100 datasets, each consisting of 40 independent samples and obtained point estimates from KF1 and KF2 using the Nelder-Mead algorithm. The results can be found in "Appendix A.10" and agree with our previous conclusion that inference with KF1 gives inaccurate estimates. Assuming knowledge of the parameter values, we can also use the KF for trajectory inference. In Fig. 3, we demonstrate filtering results for the prey population assuming that we have aggregated data. We simulated a trajectory using θ 1 = 0.5, θ 2 = 0.0025, θ 3 = 0.3 and sampled aggregated data every 2 min. Black lines represent the true trajectory of the populations. We see that the inferred credible region with KF1 does not contain the true underlying trajectory in many places. Note that red dots correspond to normalised (aggregated) observations for KF1 and aggregated observations for KF2, so they do not have the same values. In "Appendix A.16", we include filtering results for the unobserved predator population.

Translation inhibition model
In this example, we are interested in inferring the degradation rate of a protein from a translation inhibition experiment. We model the translation inhibition experiment by the following set of reactions where R stands for mRNA and P for protein: The LNA is used, again, as an approximation of the dynamics and the resulting system of ODEs can be found in "Appendix A.11". Before applying our method to real data Fig. 2 Posterior densities of θ 1 , θ 2 , θ 3 from aggregated data using KF1 (red histogram) and KF2 (green histogram). a Posterior density of θ 1 . b Posterior density of θ 2 . c Posterior density of θ 3 from this system, we test the performance on synthetic data simulated using the Gillespie algorithm. We simulated 30 time series (corresponding to 30 different cells), assuming the following values as the ground truth for the kinetic parameters: c P = 200 and d P = 0.97. We further set the initial protein abundance of m 0 to 400 molecules. We have scaled the data by a factor k = 0.03, so that they are proportional to the original synthetic data and added Gaussian noise with a variance of s = 0.1. For this study, we have assumed that data were integrated over 30 min. Again we use an adaptive MCMC algorithm (Sherlock et al. 2010). Non-informative exponential priors with mean 10 4 were placed on all parameters. We have adopted the parametarisation used in Komorowski et al. (2009) andFinkenstädt et al. (2013) such as c P = k · c p and m 0 = k · m 0 and worked in the log parameter space. Parameter estimation results for the vector (c p , d p , s, k, m 0 ) using KF1 and KF2 are summarised in Table 3. As we can see, the degradation rates are successfully inferred by both approaches. However, using KF1 leads to an overestimation of m 0 and an underesti- mation of the noise level s, which corresponds to a smoother process than the underlying one. MCMC traces from both KF1 and KF2 are presented in Fig. 11. We then applied our model to single cell luciferase data from a subset of 11 pituitary cells (Harper et al. 2011). Parameter estimation results using the same adaptive MCMC are summarised in Table 4. The MCMC was run for 100K iterations out of which 60K were discarded as burn-in. Again, we observe that, using KF1, we get a higher m 0 and a slightly lower noise level s. Posterior histograms of the degradation rates are shown in Fig. 4. A deterministic approach for fitting the data would give a degradation rate of around 1.02 and, as we can see, this value is included in both histograms of Fig. 4. To check convergence using the Gelman-Rubin statistic, we have run 3 different chains with different initialisations. MCMC traces for both KF1 and KF2 are shown in "Appendix A.18" (Fig. 12 and 13) where we can see that the three chains are very close to each other, corresponding to a Gelman-Rubin statistic close to 1.

Discussion
We have presented a Bayesian framework for doing inference using aggregated observations from a stochastic process. Motivated by a systems biology example, we chose to use the LNA to approximate the dynamics of the stochastic system, leading to a linear SDE. We then developed a Kalman filter that can deal with integrated, partial and noisy data. We have compared our new inference procedure to the standard Kalman filter which has previously been applied in systems biology applications approximated using the LNA. Overall, we conclude that the aggregated nature of data should be considered when modelling data, as aggregation will tend to reduce fluctuations and therefore the stochastic contribution of the process may be underestimated.
In Sect. 4, we described the different properties of a stochastic process and its integral in the case of the Ornstein-Uhlenbeck process. We showed that one cannot simply treat the integrated observations as proportional to observations coming from the underlying unintegrated process when carrying out inference. As the aggregation time window increases, parameter estimates using this approach become less accurate and the inferred stationary variance of the process is underestimated. In contrast, our modified KF is able to accurately estimate the model parameters and stationary variance of the process.
In Sect. 5, we have demonstrated the ability of our method to give more accurate results in a Lotka-Volterra model given synthetic aggregated data. In Sect. 6, we looked at a real-world application with data from a translation inhibition experiment carried out in single cells. As the LNA depends on its deterministic part, and in a deterministic system integration is dealt with reasonably well using the simple proportionality constant approach, some of the system parameters, such as the degradation rate, can be inferred reasonably well by the standard non-aggregated data approach. However, neglecting the aggregated nature of the data does lead to a significantly larger estimate of the initial population of molecules even in this simple application. This is consistent with our observation that neglecting aggregation will tend to underestimate the scale of fluctuations as it is the number of molecules that determines the size of fluctuations in this example. In models where noise plays a more critical role, e.g. systems with noise-induced oscillations, the effect of parameter misspecification could have more serious consequences on model-based inferences.
Our proposed inference method can deal with the intrinsic noise inside a cell, measurement noise and temporal aggregation. However, cell populations are highly heterogeneous, and cell-to-cell variability has not been considered in our current inference scheme. It would be possible to deal with cell-to-cell variability using a hierarchical model (Finkenstädt et al. 2013) which could be combined with the integrated data Kalman Filter developed here.
All experiments were carried out on a cluster of 64bit Ubuntu machines with an i5-3470 CPU @ 3.20 GHz x 4 processor and 8 GB RAM. All scripts were run in Spyder (Anaconda 2.5.0, Python 2.7.11, Numpy 1.10.4). Code reproducing the results of the experiments can be found on GitHub https://github.com/maria-myrto/inference-aggregated.

A.2 Useful Gaussian identities
Let x and y be jointly Gaussian random vectors: Then, the marginal and conditional distributions of x (equivalently for y) are, respectively (Bishop 2007):

A.3 Update equations of a discrete Kalman Filter
Using the Gaussian Identities in A.2 we have Since we are working with Gaussians, we know that X i |y 1:i ∼ N (m * i , S * i ), and the updated m * i and S * i are given by:

A.4 Analytical solutions for the OU process and its integral
Given an OU process of the following form: we can derive its solution according to the general theory for linear SDEs. Since the solution is a Gaussian process, we will only need to define its mean and variance which are given by Eqs. (6, 7). All the ODEs in this case are first-order linear ODEs with constant coefficients, so using for example an integrating factor, we can derive the following solution for an ODE of the form dx dt + ax = g(t), x(t = 0) = x 0 : For the mean we get from Eq. (6): For the variance we have the following: For the solution of the integrated OU process dY t /dt = X t , we need to calculate its mean, covariance and variance given by Eqs. (12), (13) and (14). The initial conditions for these ODEs will be set to 0, since at each observation point the integrated process starts from 0. For clarity, we will use the results A, B, C from "Appendix A.5".
First we find the mean: For the covariance, we first calculate from Eq. (13): Now the covariance can be calculated from: For the variance, we need to calculate: Now we can derive the variance:

A.6 Exact updating formula of OU process
The OU process dX t = −α X t dt + σ dW t admits an exact update formula given by :

A.7 Average over 10 datasets-OU example
See Table 5.

A.8 LNA for Lotka-Volterra model
The Lotka-Volterra model (23) gives rise to the stoichiometry matrix, with transition rates, The following matrices need to be computed: Sdiag(h(y t ))S T = E E T = θ 1 y 1 + θ 2 y 1 y 2 −θ 2 y 1 y 2 −θ 2 y 1 y 2 θ 2 y 1 y 2 + θ 3 y 2 , The macroscopic rate equations are now given by: For the diffusion terms, we only need to compute the variance of the resulting Gaussian process since we restart the stochastic part at each observation point in accordance with (Fearnhead et al. 2014).

A.9 Adaptive MCMC
According to the specific adaptive MH (Sherlock et al. 2010), the new state θ * is sampled from a mixture of Gaussians: Σ t corresponds to the sampled variance up to iteration t and is estimated after enough samples have been accepted. The parameter δ ∈ (0, 1) and is defined by the user, we have chosen a value of 0.05. The scaling factor λ can either be fixed (Roberts and Rosenthal 2009) or be tuned (Sherlock et al. 2010). This algorithm targets an acceptance rate of ≈ 0.3.  Table 6.

A.11 LNA for translation inhibition model
The following model is being assumed where R and P stand for the (numbers of) gene mRNA and protein, respectively: The above equations result in the following stoichiometry matrix: Fig. 10 Filtering results for the predator population. Red dots correspond to aggregated observations, and the black line represents the actual process. Purple lines represent the mean estimate and green 1 s.d.
. a Filtering results for the predator population using KF1. b Filtering results for the predator population using KF2 and the transition rates are : The required matrices are calculated below: Sdiag(h(y t , θ))S T = E E T = c P + d P p , The deterministic part is now given by: The stochastic part is given by the (restarting) LNA where we have dropped the dependency of M t , V t from time: resulting in the following ODE for the stochastic variance: For the integrated process, we get the following according to Eqs. (10),(13) and (14) (Fig. 5). First the deterministic part is given by: The stochastic part is given by: resulting in the following ODEs for its integrated variance and covariance with the unintegrated process: A.12 OU and aggregated OU process See Fig. 5.

A.13 OU traceplots
Figures 6 and 7 show samples of the OU parameters during the MCMC runs.
A.14 Filtering results for OU process using aggregated data See Fig. 8.

A.16 Filtering results for the predator population
See Fig. 10.

A.17 MCMC traces for Translation inhibition example with synthetic data
See Fig. 11.