Augmented pseudo-marginal Metropolis–Hastings for partially observed diffusion processes

We consider the problem of inference for nonlinear, multivariate diffusion processes, satisfying Itô stochastic differential equations (SDEs), using data at discrete times that may be incomplete and subject to measurement error. Our starting point is a state-of-the-art correlated pseudo-marginal Metropolis–Hastings algorithm, which uses correlated particle filters to induce strong and positive correlation between successive likelihood estimates. However, unless the measurement error or the dimension of the SDE is small, this correlation can be eroded by the resampling steps in the particle filter. We therefore propose a novel augmentation scheme that allows for conditioning on values of the latent process at the observation times, completely avoiding the need for resampling steps. We integrate over the uncertainty at the observation times with an additional Gibbs step. Connections between the resulting pseudo-marginal scheme and existing inference schemes for diffusion processes are made, giving a unified inference framework that encompasses Gibbs sampling and pseudo-marginal schemes. The methodology is applied in three examples of increasing complexity. We find that our approach offers substantial increases in overall efficiency, compared to competing methods.


Introduction
Although stochastic differential equations (SDEs) have been ubiquitously applied in areas such as finance (see e.g. Kalogeropoulos et al., 2010; Stramer et al., 2017), climate modelling (see e.g. Majda et al., 2009; Chen et al., 2014) and the life sciences (see e.g. Golightly and Wilkinson, 2011; Fuchs, 2013; Wilkinson, 2018; Picchini and Forman, 2019), their widespread uptake is hindered by the significant challenge of fitting such models to partial data at discrete times. General nonlinear, multivariate SDEs rarely admit analytic solutions, necessitating the use of a numerical solution (such as that obtained from the Euler-Maruyama scheme). The resulting discretisation error can be controlled through the use of intermediate time steps between observation instants. However, integrating over the uncertainty at these intermediate times can be computationally expensive.
Upon resorting to discretisation, two approaches to Bayesian inference are apparent. If learning of both the static parameters and latent process is required, a Gibbs sampler provides a natural way of exploring the joint posterior. The well-studied dependence between the parameters and latent process can be problematic; see Roberts and Stramer (2001) for a discussion of the problem. Gibbs strategies that overcome this dependence have been proposed by Roberts and Stramer (2001) for reducible SDEs and by Golightly and Wilkinson (2008), Fuchs (2013), Papaspiliopoulos et al. (2013) and van der Meulen and Schauer (2017) (among others) for irreducible SDEs. If primary interest lies in learning the parameters, pseudo-marginal Metropolis-Hastings (PMMH) schemes (Andrieu et al., 2010; Andrieu and Roberts, 2009; Stramer and Bognar, 2011; Golightly and Wilkinson, 2011) can be constructed to directly target the marginal parameter posterior, or, with simple modification, the joint posterior over both parameters and the latent process. PMMH requires running a particle filter (conditional on a proposed parameter value) at each iteration of the sampler to obtain an estimate of the marginal likelihood (henceforth referred to simply as the likelihood). This can be computationally costly, especially in applications involving long time series, since the number of particles should be scaled in proportion to the number of data points n to maintain a desired likelihood estimator variance (Bérard et al., 2014). To reduce the variance of the acceptance ratio (for a given number of particles), successive likelihood estimates can be positively correlated (Dahlin et al., 2015; Deligiannidis et al., 2018). The resulting correlated PMMH (CPMMH) scheme has been applied in the discretised SDE setting by Golightly et al. (2019); see also Choppala et al. (2016) and Tran et al. (2016).
As a starting point for this work, we consider a CPMMH scheme. A well known problem with this approach is the use of resampling steps in the particle filter, which can destroy correlation between successive likelihood estimates. This problem can be alleviated by sorting each particle before propagation using e.g. a Hilbert sorting procedure (Deligiannidis et al., 2018) or simple Euclidean sorting (Choppala et al., 2016). However, Golightly et al. (2019) find that as the level of the noise in the observation process increases, correlation deteriorates. Moreover, the empirical findings of Deligiannidis et al. (2018) suggest that the number of particles should scale as n^{d/(d+1)} (where d is the dimension of the state space of the SDE), so that the advantage of CPMMH over PMMH degrades as the state dimension increases. Our aim is to avoid resampling altogether. A joint update of the entire latent process would avoid the need for resampling (as implemented by Stramer and Bognar (2011) in the PMMH setting) but to be computationally efficient this usually requires an extremely accurate method for sampling the latent process.
Our novel approach is to augment the parameter vector to include the latent process at the observation times, but not at the intermediate times between observation time instants. Given observations y, latent values x^o at observation instants and parameters θ, a Gibbs sampler is used to update θ conditional on (x^o, y), and x^o conditional on (θ, y). Both steps require estimating likelihoods of the form p(x^o_{t+1} | x^o_t, θ), for which we obtain unbiased estimators via importance sampling. Consequently, our approach can be cast within the pseudo-marginal framework and we further use correlation to improve computational efficiency. Crucially, we avoid the need for resampling, thus preserving induced positive correlation between likelihood estimates. Furthermore, each iteration of our proposed scheme admits steps that are embarrassingly parallel. We refer to the resulting sampler as augmented CPMMH (aCPMMH). It should be noted that aCPMMH requires careful initialisation of x^o and subsequently, a suitable proposal mechanism. We provide practical advice for initialisation and tuning of proposals for a wide class of SDEs. Special cases of aCPMMH are also considered, and a byproduct of our approach is an inferential framework that encompasses both the pseudo-marginal strategy and Gibbs samplers described in the second paragraph of this section. In particular, we make clear the connection between aCPMMH and the Gibbs sampler (with reparameterisation) described in Golightly and Wilkinson (2008). We apply aCPMMH in three examples of increasing complexity and compare against state-of-the-art CPMMH and PMMH schemes. We find that the proposed approach offers an increase in overall efficiency of over an order of magnitude in several settings.
The remainder of this paper is organised as follows. Background information on the inference problem, PMMH and CPMMH approaches is provided in Section 2. Our novel contribution is described in Section 3 and we explore connections between our proposed approach and existing samplers that use data augmentation in Section 3.3. Applications are given in Section 4, and conclusions are provided in Section 5, alongside directions for future research.

Bayesian inference via time discretisation
Consider a continuous-time d-dimensional Itô process {X_t, t ≥ t_0} satisfying an SDE of the form

dX_t = α(X_t, θ) dt + √β(X_t, θ) dW_t,   X_{t_0} = x_{t_0}.   (1)

Here, α is a d-vector of drift functions, the diffusion coefficient β is a d × d positive definite matrix with a square-root representation √β such that √β √β^T = β, and W_t is a d-vector of (uncorrelated) standard Brownian motion processes. We assume that both α and β depend on X_t = (X_{1,t}, . . ., X_{d,t})^T and denote the parameter vector by θ = (θ_1, . . ., θ_p)^T. Suppose that {X_t, t ≥ 0} cannot be observed exactly, but observations y = (y_1, . . ., y_n)^T are available on a regular time grid and these are conditionally independent (given the latent process). We link the observations to the latent process via an observation model of the form

Y_t = F^T X_t + ε_t,   ε_t ∼ N(0, Σ),   (2)

where Y_t is a d_o-vector, F is a constant d × d_o matrix and ε_t is a random d_o-vector. Note that this setup allows for observing only a subset of components (d_o < d). In settings where learning Σ is also of interest, the parameter vector θ can be augmented to include the components of Σ.
For most problems of interest the form of the SDE in (1) will not permit an analytic solution. We therefore work with the Euler-Maruyama approximation

X_{t+Δt} = X_t + α(X_t, θ)Δt + √β(X_t, θ) ΔW_t,   (3)

where ΔW_t ∼ N(0, I_d Δt). To allow arbitrary accuracy of this approximation, we adopt a partition of [t, t + 1] as

t = τ_{t,0} < τ_{t,1} < · · · < τ_{t,m} = t + 1,   τ_{t,k} = t + k Δτ,   Δτ = 1/m.

The Euler-Maruyama approximation is then applied over each interval of width Δτ, with m chosen by the practitioner to balance accuracy and computational efficiency. The transition density under the Euler-Maruyama approximation of X_{τ_{t,k+1}} | X_{τ_{t,k}} = x_{τ_{t,k}} is denoted by

p_e(x_{τ_{t,k+1}} | x_{τ_{t,k}}, θ) = N(x_{τ_{t,k+1}}; x_{τ_{t,k}} + α(x_{τ_{t,k}}, θ)Δτ, β(x_{τ_{t,k}}, θ)Δτ),   (4)

where N(·; μ, V) denotes the Gaussian density with mean vector μ and variance matrix V.
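To make the discretisation concrete, the Euler-Maruyama scheme over a unit time interval split into m sub-intervals of width Δτ = 1/m can be sketched as below. This is a minimal illustrative implementation; the function names and the example drift and diffusion used are our own, not the paper's code.

```python
import numpy as np

def euler_maruyama(alpha, beta_sqrt, x0, theta, m, rng):
    """Simulate X over [t, t+1] on a grid of m sub-intervals of width
    dtau = 1/m, using the Euler-Maruyama approximation.

    alpha(x, theta)     : drift, returns a d-vector
    beta_sqrt(x, theta) : square root of the diffusion matrix, d x d
    """
    d = len(x0)
    dtau = 1.0 / m
    path = np.empty((m + 1, d))
    path[0] = x0
    for k in range(m):
        dW = rng.normal(scale=np.sqrt(dtau), size=d)  # Brownian increment
        path[k + 1] = (path[k]
                       + alpha(path[k], theta) * dtau
                       + beta_sqrt(path[k], theta) @ dW)
    return path
```

Here `beta_sqrt` returns the matrix square root √β appearing in the SDE, so each step matches the recursion X_{k+1} = X_k + α Δτ + √β ΔW.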
In what follows, we adopt the shorthand notation

x_{[t,t+1]} = (x_{τ_{t,0}}, x_{τ_{t,1}}, . . ., x_{τ_{t,m}})

for the latent process over the time interval [t, t + 1], with an analogous notation for intervals of the form (t, t + 1] and (t, t + 1), which ignore x_{τ_{t,0}} and (x_{τ_{t,0}}, x_{τ_{t,m}}) respectively. Hence, the complete latent trajectory is given by

x = (x_{[0,1]}, x_{(1,2]}, . . ., x_{(n−1,n]}).

The joint density of the latent process over an interval of interest is then given by a product of Gaussian densities; for example,

p^{(m)}(x_{(t,t+1]} | x_t, θ) = ∏_{k=0}^{m−1} p_e(x_{τ_{t,k+1}} | x_{τ_{t,k}}, θ),

with explicit dependence on the number of intermediate sub-intervals made clear via the superscript (m).

Pseudo-marginal Metropolis-Hastings (PMMH)
Suppose that interest lies in the marginal parameter posterior

π^{(m)}(θ | y) ∝ π(θ) p^{(m)}(y | θ),   (5)

where π(θ) is the prior density ascribed to θ and p^{(m)}(y|θ) is the (marginal) likelihood under the augmented Euler-Maruyama approach. That is,

p^{(m)}(y | θ) = ∫ p^{(m)}(x | θ) p(y | x, θ) dx,

where

p^{(m)}(x | θ) = p(x_0) ∏_{t=0}^{n−1} p^{(m)}(x_{(t,t+1]} | x_t, θ)

and

p(y | x, θ) = ∏_{t=1}^{n} N(y_t; F^T x_t, Σ).   (6)

Although π^{(m)}(θ|y) is typically complicated by the intractable likelihood term p^{(m)}(y|θ), the latter can be unbiasedly estimated using a particle filter (Del Moral, 2004; Pitt et al., 2012). We write the estimator as p̂^{(m)}_U(y|θ), with explicit dependence on U ∼ p(u), that is, the collection of all random variables of which a realisation u gives the estimate p̂^{(m)}_u(y|θ). Algorithm 1 gives the necessary steps for the generation of p̂^{(m)}_u(y|θ), with the explicit role of u suppressed for simplicity.

A key requirement of the particle filter is the ability to simulate latent trajectories x_{(t,t+1]} at each time t. To yield a reasonable particle weight, such trajectories must be consistent with x_t and y_{t+1} and are typically termed bridges. In this article we generate bridges by drawing from a density of the form

g(x_{(t,t+1]} | x_t, y_{t+1}, θ) = ∏_{k=0}^{m−1} g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, y_{t+1}, θ),   (7)

where the constituent terms take the form

g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, y_{t+1}, θ) = N(x_{τ_{t,k+1}}; x_{τ_{t,k}} + μ(x_{τ_{t,k}}, y_{t+1}, θ)Δτ, Ψ(x_{τ_{t,k}}, θ)Δτ)

for suitable choices of μ(x_{τ_{t,k}}, y_{t+1}, θ) and Ψ(x_{τ_{t,k}}, θ). The form of (7) permits a wide choice of bridge construct and we refer the reader to Whitaker et al. (2017) and Schauer et al. (2017) for several options. Throughout this paper, we take

μ(x_{τ_{t,k}}, y_{t+1}, θ) = α_k + β_k F (F^T β_k F Δ_k + Σ)^{−1} {y_{t+1} − F^T (x_{τ_{t,k}} + α_k Δ_k)},   (8)

Ψ(x_{τ_{t,k}}, θ) = β_k − β_k F (F^T β_k F Δ_k + Σ)^{−1} F^T β_k Δτ,   (9)

where Δ_k = t + 1 − τ_{t,k} and we adopt the shorthand notation α_k := α(x_{τ_{t,k}}, θ) and β_k := β(x_{τ_{t,k}}, θ). We note that (8) and (9) correspond to the (extension to partial and noisy observations of the) modified diffusion bridge construct of Durham and Gallant (2002). We may write the construct generatively as

x_{τ_{t,k+1}} = x_{τ_{t,k}} + μ(x_{τ_{t,k}}, y_{t+1}, θ)Δτ + √Ψ(x_{τ_{t,k}}, θ) √Δτ u_{τ_{t,k}},   (10)

where u_{τ_{t,k}} ∼ N(0, I_d). It should then be clear that an estimate of the likelihood, p̂^{(m)}_u(y|θ), is a deterministic function of the Gaussian innovations driving the bridge construct, and additionally, of any random variates required in the resampling step of Algorithm 1. We use systematic resampling, which requires a single uniform innovation per resampling step.

Algorithm 1 Particle filter
Input: observations y = (y_1, . . ., y_n)^T, parameter θ, auxiliary variable u and the number of particles N.
1. Initialise. For i = 1, . . ., N sample particle x^i_0 from the initial state distribution and assign weight w^i_0 = 1/N.
2. For times t = 0, 1, . . ., n − 1:
(a) Resample. For i = 1, . . ., N, sample the index a^i_t ∼ M(w^{1:N}_t) of the ancestor of particle i, where M(w^{1:N}_t) denotes the discrete distribution on {1, . . ., N} with probabilities w^{1:N}_t.
(b) Propagate. For i = 1, . . ., N, draw x^i_{(t,t+1]} ∼ g(· | x^{a^i_t}_t, y_{t+1}, θ) using the bridge construct in (7).
(c) Weight. For i = 1, . . ., N, compute the unnormalised weight
w̃^i_{t+1} = p(y_{t+1} | x^i_{t+1}, θ) p^{(m)}(x^i_{(t,t+1]} | x^{a^i_t}_t, θ) / g(x^i_{(t,t+1]} | x^{a^i_t}_t, y_{t+1}, θ)
and normalise, w^i_{t+1} = w̃^i_{t+1} / ∑_{j=1}^{N} w̃^j_{t+1}.
Output: the estimate p̂^{(m)}_u(y|θ) = ∏_{t=0}^{n−1} (N^{−1} ∑_{i=1}^{N} w̃^i_{t+1}) of the observed data likelihood.
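The systematic resampling step, driven by a single uniform draw, can be sketched as follows. This is a minimal illustrative implementation (function and variable names are our own):

```python
import numpy as np

def systematic_resample(weights, u):
    """Systematic resampling driven by a single uniform u in [0, 1).

    Returns ancestor indices a[0:N] such that particle i is replaced
    by particle a[i]; expected offspring counts are proportional to
    the (normalised) weights.
    """
    N = len(weights)
    positions = (u + np.arange(N)) / N   # N evenly spaced points in [0, 1)
    cum = np.cumsum(weights)
    cum[-1] = 1.0                        # guard against round-off
    return np.searchsorted(cum, positions)
```

Because only the single uniform `u` is random, the whole step is a deterministic function of `u`, which is what allows it to be folded into the auxiliary variable vector.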
The pseudo-marginal Metropolis-Hastings (PMMH) scheme (Andrieu and Roberts, 2009; Andrieu et al., 2009) is a Metropolis-Hastings (MH) scheme targeting the joint density

π^{(m)}(θ, u | y) ∝ π(θ) p̂^{(m)}_u(y | θ) p(u),

for which it is easily checked (using ∫ p̂^{(m)}_u(y|θ) p(u) du = p^{(m)}(y|θ)) that π^{(m)}(θ|y) is a marginal density. Hence, for a proposal density that factorises as q(θ′|θ) q(u′), the MH acceptance probability is

min{1, [p̂^{(m)}_{u′}(y|θ′) π(θ′) q(θ|θ′)] / [p̂^{(m)}_u(y|θ) π(θ) q(θ′|θ)]}.   (11)

The variance of p̂^{(m)}_u(y|θ) is controlled by the number of particles N, which should be chosen to balance both mixing and computational efficiency. For example, as the variance of the likelihood estimator increases, the acceptance probability of the pseudo-marginal MH scheme decreases to 0 (Pitt et al., 2012). Increasing N results in more acceptances at increased computational cost. Practical advice for choosing N is given by Sherlock et al. (2015) and Doucet et al. (2015) under two different sets of simplifying assumptions. Given a parameter value with good support under the posterior (e.g. the marginal posterior mean, estimated from a pilot run), we select N so that the estimated log-likelihood at this parameter value has a standard deviation of roughly 1.5. Unfortunately, the value of N required to meet this condition is often found to be impractically large. Therefore, we consider a variance reduction technique which is key to our proposed approach.
Correlated PMMH (CPMMH)

The correlated PMMH (CPMMH) scheme (Dahlin et al., 2015; Deligiannidis et al., 2018) reduces the variance of the acceptance ratio by inducing positive correlation between successive estimates of the observed data likelihood in the MH scheme. This can be achieved by taking a proposal q(θ′|θ) K(u′|u) where K(u′|u) satisfies the detailed balance equation

K(u′|u) p(u) = K(u|u′) p(u′).

Recall that u consists of the collection of Gaussian random variates used to propagate the state particles (step 2(b) in Algorithm 1) and any variates required in the resampling step (step 2(a) in Algorithm 1). The uniform random variate required for systematic resampling can be obtained by applying the standard Gaussian cdf to a Gaussian draw. Hence, u consists entirely of standard Gaussian variates and it is then natural to set

K(u′|u) = N(u′; ρu, (1 − ρ²) I_{d*}),   (12)

where d* is the total number of required innovations. We note that the density in (12) corresponds to a Crank-Nicolson proposal density for which it is easily checked that p(u) = N(u; 0, I_{d*}) is invariant. The parameter ρ is chosen by the practitioner, with ρ ≈ 1 inducing strong and positive correlation between p̂^{(m)}_{u′}(y|θ′) and p̂^{(m)}_u(y|θ). The correlated pseudo-marginal Metropolis-Hastings scheme is given by Algorithm 2, which should be used in conjunction with a modified version of Algorithm 1 to induce the desired correlation. However, the resampling step has the effect of breaking down correlation between successive likelihood estimates. To alleviate this problem, the particles can be sorted immediately after propagation, e.g. using a Hilbert sorting procedure (Deligiannidis et al., 2018) or simple Euclidean sorting (Choppala et al., 2016). Given a distance metric between particles, the particles are sorted as follows: the first particle in the sorted list is the one with the smallest first component; for j = 2, . . ., N, the jth particle in the sorted list is chosen to be the one among the unsorted N − j + 1 particles that is closest to the (j − 1)th sorted particle.
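The Crank-Nicolson proposal for the auxiliary variables is straightforward to implement; below is a minimal sketch (illustrative names, assuming the auxiliary variables are stored as a flat standard Gaussian vector):

```python
import numpy as np

def crank_nicolson(u, rho, rng):
    """Propose u' ~ N(rho * u, (1 - rho^2) I), which leaves the
    standard Gaussian density p(u) = N(u; 0, I) invariant."""
    return rho * u + np.sqrt(1.0 - rho**2) * rng.normal(size=u.shape)
```

Setting `rho = 0` recovers an independent redraw of u (plain PMMH), while `rho` close to 1 perturbs u only slightly, so successive likelihood estimates share most of their driving randomness.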
Upon choosing a value of ρ (e.g. ρ = 0.99), the number of particles N can be chosen to minimise the distance between successive log estimates of marginal likelihood (Deligiannidis et al., 2018). In practice, we choose N so that the variance of the logarithm of the ratio p̂^{(m)}_{u′}(y|θ)/p̂^{(m)}_u(y|θ) is around 1, for θ set at some central posterior value.
There are a number of limitations regarding the implementation of CPMMH as described here, which motivate the approach of Section 3. Although sorting particle trajectories after propagation can in theory alleviate the effect of resampling on maintaining correlation between successive likelihood estimates, the sorting procedure can be unsatisfactory in practice. For example, the Euclidean sorting procedure described above (and implemented within the SDE context in Golightly et al. (2019)) sorts trajectories x^{1:N}_{(t,t+1]} between observation times by applying the procedure to the particle states x^{1:N}_{t+1} at the observation times. Consequently, trajectories with similar values at time t + 1 may exhibit qualitatively different behaviour between observation times, leading to potentially significant differences in likelihood contributions (via the particle weights). In turn, this may break down correlation between likelihood values at successive iterations. The procedure is therefore likely to be particularly ineffectual when the measurement error variance is large relative to the stochasticity inherent in the latent diffusion process, or as the dimension of the SDE increases. Resampling may be executed less often, although choosing a resampling schedule a priori is necessarily ad hoc. Moreover, reducing the number of resampling steps, or indeed omitting the resampling step altogether (so that an importance sampler is obtained), would necessitate a bridge construct that samples over the entire inter-observation interval from an approximation that is very close to the true (but intractable) conditioned diffusion process; otherwise the resulting importance sampler weights are likely to have high variance. In what follows, we derive a novel approach which avoids resampling altogether, without recourse to importance sampling of the entire latent process.
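The greedy Euclidean sorting procedure described above can be sketched as follows (a minimal illustrative implementation operating on particle states stored as an (N, d) array; names are our own):

```python
import numpy as np

def euclidean_sort(particles):
    """Greedy Euclidean sort of an (N, d) array of particle states:
    start from the particle with the smallest first component, then
    repeatedly append the nearest (in Euclidean distance) unsorted
    particle. Returns the permutation of particle indices."""
    N = particles.shape[0]
    remaining = list(range(N))
    order = [min(remaining, key=lambda i: particles[i, 0])]
    remaining.remove(order[0])
    while remaining:
        last = particles[order[-1]]
        nxt = min(remaining,
                  key=lambda i: np.linalg.norm(particles[i] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)
```

Note that the sort keys on the states at the observation time only, which is precisely why, as argued above, trajectories with similar end points but different intermediate behaviour can still break the correlation.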

Augmented CPMMH (aCPMMH)
It will be helpful throughout this section to denote by x^o the values of x at the observation times 1, 2, . . ., n, and by x^L the values of x at the remaining intermediate times. That is,

x^o = (x_1, x_2, . . ., x_n),   x^L = (x_{(0,1)}, x_{(1,2)}, . . ., x_{(n−1,n)}).

It is also possible to treat x_n as a latent variable. In what follows, we include x_n in x^o for ease of exposition.
Rather than target the posterior in (5), we target the joint posterior

π^{(m)}(θ, x^o | y) ∝ π(θ) p^{(m)}(x^o | θ) p(y | x^o, θ),   (13)

where

p^{(m)}(x^o | θ) = ∫ p^{(m)}(x | θ) dx^L   (14)

and p(y | x^o, θ) = p(y | x, θ) as in (6). Although the integral in (14) will be intractable, we may estimate it unbiasedly as follows.

Sequential importance sampling
We adopt the factorisation

p^{(m)}(x^o | θ) = p^{(m)}(x_1 | θ) ∏_{t=1}^{n−1} p^{(m)}(x_{t+1} | x_t, θ),

and note that the constituent terms can be written as

p^{(m)}(x_{t+1} | x_t, θ) = ∫ p^{(m)}(x_{(t,t+1]} | x_t, θ) dx_{(t,t+1)},   (15)

where p^{(m)}(x_{(t,t+1]} | x_t, θ) is a product of Euler-Maruyama transition densities p_e(·), each given by (4). Now, given some suitable importance density g(x_{(t,t+1)} | x_t, x_{t+1}, θ), we may write

p^{(m)}(x_{t+1} | x_t, θ) = ∫ [p^{(m)}(x_{(t,t+1]} | x_t, θ) / g(x_{(t,t+1)} | x_t, x_{t+1}, θ)] g(x_{(t,t+1)} | x_t, x_{t+1}, θ) dx_{(t,t+1)},

and a similar expression can be obtained for p^{(m)}(x_1 | θ). Hence, given N draws x^i_{(t,t+1)}, i = 1, . . ., N from the density g(· | x_t, x_{t+1}, θ), a realisation of an unbiased estimator of p^{(m)}(x_{t+1} | x_t, θ) is

p̂^{(m)}(x_{t+1} | x_t, θ) = (1/N) ∑_{i=1}^{N} p^{(m)}(x^i_{(t,t+1]} | x_t, θ) / g(x^i_{(t,t+1)} | x_t, x_{t+1}, θ),   (16)

with the convention that x^i_{t+1} = x_{t+1} for all i. We recognise (16) as an importance sampling estimator of (15). An unbiased importance sampling estimator of p^{(m)}(x_1 | θ) can be obtained in a similar manner, by using an importance density p(x_0) g(x_{(0,1)} | x_0, x_1, θ).
We take g(x_{(t,t+1)} | x_t, x_{t+1}, θ) as a simplification of the bridge construct used in Section 2.1, so that

g(x_{(t,t+1)} | x_t, x_{t+1}, θ) = ∏_{k=0}^{m−2} g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, x_{t+1}, θ),

where g(x_{τ_{t,k+1}} | x_{τ_{t,k}}, x_{t+1}, θ) has the form (7) but with the exact x_{t+1} taking the place of the noisy y_{t+1}. Since Σ = 0, (8) and (9) simplify to

μ(x_{τ_{t,k}}, x_{t+1}, θ) = (x_{t+1} − x_{τ_{t,k}}) / Δ_k,   Ψ(x_{τ_{t,k}}, θ) = ((Δ_k − Δτ) / Δ_k) β_k.   (17)

We make clear the role of the innovation vector u_t = (u_{t,0}, . . ., u_{t,m−2})^T in (16) by writing the bridge construct generatively as in (10), but with μ and Ψ given by (17). Now, since the x_{(t,t+1)}, t = 0, . . ., n − 1 are conditionally independent given x^o, we may unbiasedly estimate p^{(m)}(x^o | θ) by

p̂^{(m)}_u(x^o | θ) = p̂^{(m)}_{u_0}(x_1 | θ) ∏_{t=1}^{n−1} p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ),   (18)

realisations of which may be computed by running the importance sampler in Algorithm 3 for each t = 0, . . ., n − 1. Note that each t-iteration of Algorithm 3 can be performed in parallel if desired.
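The importance sampling estimator of the transition density can be illustrated for a scalar SDE as below. This is a hypothetical minimal sketch using a Durham-Gallant style bridge (the Σ = 0 simplification) as the proposal; function names are our own and, for simplicity, the Gaussian innovations are drawn inside the function rather than supplied as a fixed auxiliary vector u_t.

```python
import numpy as np

def _norm_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def is_transition_estimate(x_t, x_t1, theta, alpha, beta, m, N, rng):
    """Unbiased importance sampling estimate of the Euler-Maruyama
    transition density p^(m)(x_{t+1} | x_t, theta) for a scalar SDE,
    using a modified-diffusion-bridge proposal g. alpha(x, theta) is
    the drift and beta(x, theta) the squared diffusion coefficient."""
    dtau = 1.0 / m
    estimates = np.empty(N)
    for i in range(N):
        x, log_w = x_t, 0.0
        for k in range(m):
            delta_k = 1.0 - k * dtau              # time remaining to t+1
            a, b = alpha(x, theta), beta(x, theta)
            if k < m - 1:
                # bridge proposal: pull towards x_{t+1}, shrink variance
                mu_g = x + (x_t1 - x) * dtau / delta_k
                var_g = b * dtau * (delta_k - dtau) / delta_k
                x_new = rng.normal(mu_g, np.sqrt(var_g))
                log_w -= _norm_logpdf(x_new, mu_g, var_g)   # divide by g
            else:
                x_new = x_t1                      # end point is fixed
            # multiply by the Euler-Maruyama (target) density
            log_w += _norm_logpdf(x_new, x + a * dtau, b * dtau)
            x = x_new
        estimates[i] = np.exp(log_w)
    return estimates.mean()
```

With m = 1 there are no intermediate points and the estimator reduces exactly to the one-step Euler-Maruyama density, which provides a convenient sanity check.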

Algorithm
We adopt a pseudo-marginal approach by targeting the joint density

π^{(m)}(θ, x^o, u | y) ∝ π(θ) p̂^{(m)}_u(x^o | θ) p(y | x^o, θ) p(u),   (19)

for which it is easily checked that the posterior of interest, π^{(m)}(θ, x^o | y) given by (13), is a marginal density. The form of (19) immediately suggests a Gibbs sampler which alternates between draws from the full conditional densities (FCDs)

1. π^{(m)}(θ | u, x^o, y),
2. π^{(m)}(x^o, u | θ, y).

Algorithm 3 Importance sampling
Input: parameter θ, latent values x_t, x_{t+1}, auxiliary variable u_t and the number of importance samples N.
1. For i = 1, . . ., N, use the innovations in u_t to generate x^i_{(t,t+1)} via the construct (10), with μ and Ψ given by (17).
2. Output the estimate p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ) computed via (16).
Hence, unlike the (C)PMMH scheme, the latent process at the observation times is no longer integrated out. Nevertheless, the sampler targets a posterior for which the latent process between observation instants is marginalised over, and this is crucial for side-stepping the well known dependence problem between the parameters and latent process. Note that as the number of importance samples N → ∞, the scheme can be seen as an idealised Gibbs sampler that alternates between draws of π^{(m)}(θ | x^o, y) and π^{(m)}(x^o | θ, y). For N = 1, the scheme is an extension of the modified innovation scheme of Golightly and Wilkinson (2008), as discussed further in Section 3.3.

Metropolis-within-Gibbs steps are necessary for generating draws from the FCDs above. To sample the FCD π^{(m)}(θ | u, x^o, y) we use a proposal density q(θ′|θ), so that the acceptance probability is given by

min{1, [π(θ′) p̂^{(m)}_u(x^o | θ′) p(y | x^o, θ′) q(θ|θ′)] / [π(θ) p̂^{(m)}_u(x^o | θ) p(y | x^o, θ) q(θ′|θ)]}.   (20)

Given that π^{(m)}(x^o, u | θ, y) may be high dimensional, we propose to update (x^o, u) in separate blocks corresponding to each time component of x^o. For t = 1, . . ., n − 1 we have that

π^{(m)}(x_t, u_{t−1}, u_t | θ, x^o_{−t}, y) ∝ p̂^{(m)}_{u_{t−1}}(x_t | x_{t−1}, θ) p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ) p(y_t | x_t, θ) p(u_{t−1}) p(u_t),

where, for example, p(u_t) = N(u_t; 0, I_{N(m−1)d}) (assuming N importance samples, m − 1 intermediate time points between observation instants and a d-dimensional latent process). For t = 1, p̂^{(m)}_{u_0}(x_1 | x_0, θ) is replaced by p̂^{(m)}_{u_0}(x_1 | θ). The full conditional for the remaining end-point is given by

π^{(m)}(x_n, u_{n−1} | θ, x^o_{−n}, y) ∝ p̂^{(m)}_{u_{n−1}}(x_n | x_{n−1}, θ) p(y_n | x_n, θ) p(u_{n−1}).

We sample from each FCD using a Metropolis-within-Gibbs step. For each t = 1, . . ., n − 1 we use a proposal density of the form

q(x′_t | x_t) K(u′_{(t−1,t)} | u_{(t−1,t)}),

where K is as in (12) and we use the shorthand notation u_{(t−1,t)} = (u_{t−1}, u_t). Hence, the innovations are updated using a Crank-Nicolson kernel. The end-point proposal is defined similarly, with q(x′_n | x_n) K(u′_{n−1} | u_{n−1}). We recall that the Crank-Nicolson kernel satisfies detailed balance with respect to the innovation density, and we arrive at the acceptance probabilities

min{1, [p̂^{(m)}_{u′_{t−1}}(x′_t | x_{t−1}, θ) p̂^{(m)}_{u′_t}(x_{t+1} | x′_t, θ) p(y_t | x′_t, θ) q(x_t | x′_t)] / [p̂^{(m)}_{u_{t−1}}(x_t | x_{t−1}, θ) p̂^{(m)}_{u_t}(x_{t+1} | x_t, θ) p(y_t | x_t, θ) q(x′_t | x_t)]}   (21)

for t = 1, . . ., n − 1. For the end-point update, the acceptance probability is

min{1, [p̂^{(m)}_{u′_{n−1}}(x′_n | x_{n−1}, θ) p(y_n | x′_n, θ) q(x_n | x′_n)] / [p̂^{(m)}_{u_{n−1}}(x_n | x_{n−1}, θ) p(y_n | x_n, θ) q(x′_n | x_n)]}.   (22)

It is evident that the innovations (u_1, . . ., u_{n−1}) are updated twice per Gibbs iteration. We note that a scheme that only updates these innovations once per Gibbs iteration is also possible, but we eschew this approach in favour of the above, which promotes better exploration of the innovation variable space. Further tuning considerations are discussed in Section 3.4. We refer to the resulting inference scheme as augmented CPMMH (aCPMMH). The scheme is summarised by Algorithm 4. Note that the components of the latent process x^o at the observation times and auxiliary variables u are updated in steps 3-5; step 3 updates (x_t, u_{(t−1,t)}) for t = 1, 3, . . ., n − 1 and step 4 for t = 2, 4, . . ., n − 2 (assuming, WLOG, that n is even). Step 5 updates the final value (x_n, u_{n−1}). Updating in this way allows for embarrassingly parallel operations over t (at steps 2, 3 and 4). Note that, as presented, uncertainty in the initial value x_0 is integrated over as part of the importance sampler (Algorithm 3). If required, aCPMMH can be modified either to treat x_0 as part of the parameter vector θ or with an extra step that updates x_0 (and therefore u_0) conditional on x_1 and θ.
As presented, Algorithm 4 is appropriate for the general case of noisy and partial observation of X_t. In the special case of data consisting of noise-free observation of all SDE components (so that Σ = 0 and F = I_d in (2)), steps 3-5 are not required. Additionally, step 2 should propose the full auxiliary vector u′ from K(u′|u) in (12). Hence, this special case corresponds to the CPMMH algorithm with p̂^{(m)}_{u′}(y|θ′) obtained by importance sampling, and the acceptance probability is as in (11). In the case of noise-free observation of a subset of components of X_t, the scheme proceeds as in Algorithm 4, with the unobserved components of X_t updated in steps 3-5.

Connection with existing samplers for SDEs
Consider aCPMMH with N = 1 particle and ρ = 0. In this case aCPMMH exactly coincides with the modified innovation scheme introduced by Golightly and Wilkinson (2008) (see also Golightly and Wilkinson, 2010; Papaspiliopoulos et al., 2013; Fuchs, 2013; van der Meulen and Schauer, 2017). We note that for this choice of N there is a one-to-one correspondence between the innovations u and the latent path x. Hence, step 2 of the Gibbs sampler in Section 3.2 is equivalent to directly updating the latent path x in blocks of size 2m − 1. To make this clear, consider updating x_{(t−1,t+1)}. Upon substituting (16) into the acceptance probability in (21) we obtain

min{1, [p^{(m)}(x′_{(t−1,t]} | x_{t−1}, θ) p^{(m)}(x′_{(t,t+1]} | x′_t, θ) p(y_t | x′_t, θ) q(x_t | x′_t) g(x_{(t−1,t)} | x_{t−1}, x_t, θ) g(x_{(t,t+1)} | x_t, x_{t+1}, θ)] / [p^{(m)}(x_{(t−1,t]} | x_{t−1}, θ) p^{(m)}(x_{(t,t+1]} | x_t, θ) p(y_t | x_t, θ) q(x′_t | x_t) g(x′_{(t−1,t)} | x_{t−1}, x′_t, θ) g(x′_{(t,t+1)} | x′_t, x_{t+1}, θ)]};

combining (18) with (16) and substituting the result into the acceptance probability in (20), we obtain

min{1, [π(θ′) p^{(m)}(x′ | θ′) p(y | x^o, θ′) q(θ | θ′) ∏_{t=0}^{n−1} g(x_{(t,t+1)} | x_t, x_{t+1}, θ)] / [π(θ) p^{(m)}(x | θ) p(y | x^o, θ) q(θ′ | θ) ∏_{t=0}^{n−1} g(x′_{(t,t+1)} | x_t, x_{t+1}, θ′)]},

where x′ denotes the path obtained from the (fixed) innovations u and the proposed θ′. It is straightforward to show that the Jacobian associated with the change of variables (from x to u) is given by ∏_{t=0}^{n−1} g(x_{(t,t+1)} | x_t, x_{t+1}, θ)^{−1} and therefore the above acceptance probability coincides with that obtained for the parameter update in the modified innovation scheme (see e.g. page 14 of Golightly and Wilkinson, 2010).
For N = 1 and 0 < ρ < 1, aCPMMH can be seen as an extension of the modified innovation scheme that uses a Crank-Nicolson proposal for the innovations.A recent application can be found in Arnaudon et al. (2020).We assess the performance of aCPMMH for different values of ρ and N in Section 4.

Initialisation and tuning choices
Recall that both CPMMH and PMMH require setting the number of particles N and, if using a random walk Metropolis (RWM) proposal, a suitable innovation variance. Practical advice on choosing N for (C)PMMH is discussed at the end of Sections 2.1 and 2.2. For a RWM proposal of the form q(θ′|θ) = N(θ′; θ, Ω), a rule of thumb for the innovation variance Ω is to take Ω = (2.56²/p) var(θ|y) (Sherlock et al., 2015), which could be obtained from an initial pilot run (such as that required to find a plausible θ value for subsequently choosing N).
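The rule of thumb for the RWM innovation variance is simple to apply given pilot MCMC samples; a minimal sketch (illustrative names):

```python
import numpy as np

def rwm_innovation_variance(pilot_samples):
    """Rule-of-thumb RWM innovation variance
    Omega = (2.56^2 / p) * Var(theta | y), with the posterior
    variance estimated from pilot MCMC samples stored as an array
    with rows = iterations and columns = the p parameters."""
    p = pilot_samples.shape[1]
    return (2.56**2 / p) * np.cov(pilot_samples, rowvar=False)
```

The returned matrix can then be used directly as the covariance of the Gaussian RWM proposal in the main monitoring run.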
For (C)PMMH, both the pilot and main monitoring runs require careful initialisation of θ (Owen et al., 2015). The aCPMMH scheme additionally requires initialisation of x^o, with poor choices likely to slow initial convergence of the Gibbs sampler. One possibility is to seek an approximation to π^{(m)}(θ, x^o|y), denoted π^{(a)}(θ, x^o|y), for which samples can be obtained (e.g. via MCMC) at relatively low computational cost. These samples can then be used to compute estimates Ê(θ|y) and Ê(x_t|y), which can be used to initialise aCPMMH. Further, the proposal variances for θ and x_t can be made proportional to the estimates var(θ|y) and var(x_t|y) respectively, which can also be computed from the samples. For SDE models of the form (1), the linear noise approximation (LNA) (Stathopoulos and Girolami, 2013; Fearnhead et al., 2014) provides a tractable Gaussian approximation. We describe the LNA, its solution and sampling of π^{(a)}(θ, x^o|y) in Appendix A. In scenarios where using the LNA is not practical, we suggest initialising a pilot run of aCPMMH with x^o = y (if d_o = d so that all components are observed) or sampling x^o via recursive application of the bridge construct in (7). The pilot run can be used to obtain further quantities required for tuning the proposal densities q(θ′|θ) and q(x′_t|x_t). Hence, our initialisation and tuning advice can be summarised by the following two options: 1. Perform a short pilot run of an MCMC scheme targeting π^{(a)}(θ, x^o|y) (as described in Appendix A for the LNA) to obtain the estimates Ê(θ|y), Ê(x_t|y), var(θ|y) and var(x_t|y). These quantities are used to initialise the main monitoring run of aCPMMH and in the RWM proposals for θ and the components of x^o.
2. Perform a short pilot run of aCPMMH with x o initialised at the data y (if all SDE components are observed) or, in the case of incomplete observation of all SDE components, recursively draw from (7), retaining only those values at the observation times.Compute estimates as in option 1, for use in the main monitoring run.
The length of the pilot run can be set by choosing a fraction of the main monitoring run to fix the overall computational budget.For simplicity, we use RWM proposals in the pilot runs with diagonal innovation variances chosen to obtain (approximately) a desired acceptance rate.We note that Option 2 additionally requires specifying an initial number of particles N for the pilot run.In our experiments, we find that N = 1 is often sufficient.For either option, the number of particles can be further tuned if desired, before the main monitoring run.

Applications
We consider three applications of increasing complexity. All algorithms are coded in R and were run on a desktop computer with an Intel quad-core CPU. For all experiments, we compare the performance of competing algorithms using minimum (over each parameter chain and, for aCPMMH, the x^o chain) effective sample size per second (mESS/s). Effective sample size (ESS) is the number of independent and identically distributed samples from the target that would produce an estimator with the same variance as the auto-correlated MCMC output. We computed ESS using the R coda package, details of which can be found in Plummer et al. (2006). When running CPMMH and aCPMMH with ρ > 0 we set ρ = 0.99. This pragmatic choice promotes good mixing in the θ chain (for CPMMH and aCPMMH) and x^o chain (for aCPMMH) while allowing the auxiliary variables to mix on a scale comparable to θ and x^o. We report CPU time based on the main monitoring runs and note that the CPU cost of tuning was small relative to the cost of the main run (and typically less than 10% of the reported CPU time). For all experiments (unless stated otherwise) we used a discretisation of Δτ = 0.2, which we found gave a good balance between accuracy (in the sense of limiting discretisation bias) and computational performance.

Square-root diffusion process
Consider a univariate diffusion process satisfying an Itô SDE of the form

dX_t = (θ_1 − θ_2) X_t dt + √((θ_1 + θ_2) X_t) dW_t,   (23)

which can be seen as a degenerate case of a Feller square-root diffusion (Feller, 1952). We generated two synthetic data sets consisting of 101 observations at integer times using θ = (0.05, 0.06)^T and a known initial condition of X_0 = 25. The observation model is Y_t ∼ N(X_t, σ²) where σ ∈ {1, 5}, giving data sets designated as D_1 and D_2 respectively (and shown in Figure 1). We took independent N(0, 10²) priors for each log θ_i, i = 1, 2, and work on the logarithmic scale when using the random walk proposal mechanism. We ran aCPMMH for 50K iterations with ρ fixed at 0.99. We report results for N = 1 particle, since N > 1 gave no increase in overall performance. Both tuning and initialisation methods (options 1 and 2 of Section 3.4) were implemented and are denoted "aCPMMH (1)" and "aCPMMH (2)". We additionally include results based on the output of PMMH and CPMMH, which were tuned in line with the guidance given at the end of Sections 2.1 and 2.2.
Table 1 and Figure 2 summarise our results. The latter shows marginal posteriors obtained from the output of aCPMMH, and for comparison, from the LNA as an inferential model. We note substantial differences in posteriors obtained when using the discretised SDE in (23) as the inferential model compared to inferences made under the LNA. This is not surprising since for this example, the ground truth θ_1 and θ_2 values are similar, for which the assumption that fluctuations about the mean of X_t are small (as is necessary for an accurate LNA) is unreasonable. Nevertheless, we were still able to use the LNA to adequately initialise and tune aCPMMH. We also note that both initialisation and tuning options give comparable results.
It is evident from Table 1 that aCPMMH offers substantial improvements in overall efficiency compared to PMMH and, to a lesser extent, CPMMH. Minimum effective sample size per second for PMMH : CPMMH : aCPMMH scales as 1 : 2.7 : 6 for data set D_1 and 1 : 2.5 : 2.5 for data set D_2. We found that as the measurement error variance (σ²) is increased, the optimal number of particles N for both PMMH and CPMMH also increased. Although aCPMMH required N = 1 (that is, we observed no additional improvement in overall efficiency for N > 1), the mixing deteriorates, due to having to integrate over the additional uncertainty in the latent process at the observation times. Finally, although the Euclidean sorting algorithm used in CPMMH is likely to be effective for this simple univariate example, we anticipate its deterioration in subsequent examples with increasing state dimension.

Figure 1: Birth-Death model. Data sets (circles) and summaries (mean and 95% credible intervals obtained from the output of aCPMMH) of the within-sample predictive π(y|D_1) (left) and π(y|D_2) (right).

Lotka-Volterra
The Lotka-Volterra system describes the time-course behaviour of two interacting species: prey X_{1,t} and predators X_{2,t}. After suppressing dependence on t, the stochastic differential equation describing the dynamics of X_t = (X_{1,t}, X_{2,t})^T is given by

dX_t = α(X_t, θ) dt + √β(X_t, θ) dW_t,

where, writing the reaction hazards as h(X, θ) = (θ_1 X_1, θ_2 X_1 X_2, θ_3 X_2)^T,

α(X, θ) = (h_1 − h_2, h_2 − h_3)^T,  β(X, θ) = ( h_1 + h_2, −h_2 ; −h_2, h_2 + h_3 ).

We repeated the experiments of Golightly et al. (2019) which, for this example, involved three synthetic data sets generated with θ = (0.5, 0.0025, 0.3)^T and a known initial condition of X_0 = (100, 100)^T. The observation model is Y_t ∼ N(X_t, σ^2 I_2), where I_2 is the 2 × 2 identity matrix and σ ∈ {1, 5, 10}, giving data sets designated D_1, D_2 and D_3 respectively. Data set D_3 is shown in Figure 3, and gives dynamics typical of the parameter choice taken. The parameters correspond to the rates of prey reproduction (θ_1), prey death and predator reproduction (θ_2), and predator death (θ_3). As the parameters must be strictly positive, we work on the logarithmic scale, with independent N(0, 10^2) priors assumed for each log θ_i, i = 1, 2, 3. The main monitoring runs consisted of 10^5 iterations of aCPMMH, CPMMH (with ρ = 0.99) and PMMH. Note that aCPMMH used random walk proposals in the log θ and x^o updates, with variances obtained from the output of an MCMC pilot run based on the LNA, which was also used to initialise θ and x^o.
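A single Euler-Maruyama step of this system can be sketched as follows, under the hazard-based drift and diffusion given above; the function name, and the clamping of the state away from zero, are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def lv_euler_step(x, theta, dt, rng):
    """One Euler-Maruyama step of the Lotka-Volterra SDE.

    Drift and diffusion are built from the three reaction hazards
    (prey birth, predation, predator death); the state is clamped
    to stay strictly positive so the hazards remain valid.
    """
    x1, x2 = x
    th1, th2, th3 = theta
    h = np.array([th1 * x1, th2 * x1 * x2, th3 * x2])  # reaction hazards
    drift = np.array([h[0] - h[1], h[1] - h[2]])
    beta = np.array([[h[0] + h[1], -h[1]],
                     [-h[1], h[1] + h[2]]])            # diffusion matrix
    L = np.linalg.cholesky(beta)                       # beta is positive definite for h > 0
    x_new = x + drift * dt + np.sqrt(dt) * (L @ rng.standard_normal(2))
    return np.maximum(x_new, 1e-6)                     # keep the state positive
```

The Cholesky factor plays the role of √β(X_t, θ): driving the step with L times a standard bivariate normal gives increments with covariance β(X_t, θ) dt.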
From Figure 4 we see that aCPMMH gives parameter posterior output that is consistent with the ground truth (and also with the output of PMMH and CPMMH; not shown). In this case, the LNA gives accurate output when used as an inferential model. We compare the efficiency of PMMH, CPMMH and aCPMMH in Table 2. We found that N = 1 was sufficient for aCPMMH, but also include results for N = 2, which gave a small increase in minimum ESS but a decrease in overall efficiency, due to the doubling of CPU time. It is clear that as σ increases, PMMH and CPMMH require an increase in N to maintain a reasonable minimum ESS. Consequently, their performance degrades. Although the statistical efficiency (mESS) of aCPMMH reduces as σ increases, the reduction is gradual (compared to that of CPMMH), and we see an increase in overall efficiency of aCPMMH (with ρ = 0.99) of an order of magnitude over PMMH in all experiments, and over CPMMH for data sets D_2 and D_3. We also include the output of aCPMMH with N = 1 and ρ = 0.0, corresponding to the modified innovation scheme of Golightly and Wilkinson (2008) (as discussed in Section 3.3). Although this approach works well compared to CPMMH and PMMH, and gives well-mixing parameter chains, we see a decrease in mESS (relative to aCPMMH with ρ = 0.99) calculated from the x^o chains, and this relative difference increases as σ increases.
Finally, we compare the performance of CPMMH and aCPMMH when parallelised over two cores. For aCPMMH (as discussed at the end of Section 3.2), operations over t in steps 2, 3 and 4 of Algorithm 4 can be performed in parallel. For CPMMH, we perform the propagate step of the particle filter (step 2(b) of Algorithm 1) in parallel. Figure 5 shows the difference (2 cores vs 1 core) in log_2 CPU times (denoted ∆ log_2 CPU) against log_10 ∆τ, where the discretisation level is ∆τ ∈ {10^{−4}, 10^{−3}, 10^{−2}, 10^{−1}}; a perfect speed-up from the use of two cores would give −1. Results based on aCPMMH used N = 1 in all cases, whereas CPMMH used N = 3, N = 8 and N = 18; these values correspond to the numbers of particles required for each synthetic data set. For aCPMMH, we see that using 2 cores is beneficial for ∆τ ≤ 10^{−2}. For CPMMH and N = 1, there is almost no benefit in a multi-core approach (and CPU time using 2 cores is typically higher than with a single core). This is unsurprising, given the resampling steps (performed in serial) between the propagate steps. As N increases, the benefit of using 2 cores can be seen.

Autoregulatory gene network
A commonly used and well-studied mechanism for auto-regulation in prokaryotes is a negative feedback mechanism, whereby dimers of a protein repress its own transcription (e.g. Arkin et al., 1998). A simplified model for such prokaryotic auto-regulation, based on this mechanism, can be found in Golightly and Wilkinson (2005) (see also Golightly and Wilkinson, 2011). We consider the SDE representation of the dynamics of the key species involved in this mechanism: RNA, P, P_2 and DNA, denoted X_1, X_2, X_3 and X_4 respectively. The SDE takes the form (1), with drift and diffusion coefficients determined by the hazard function of the underlying reaction network (after dropping t to ease the notation). Further details regarding the derivation of the SDE can be found in Golightly and Wilkinson (2005).
The parameters θ = (θ_1, θ_2, θ_3, θ_4)^T correspond to the rate of protein unbinding at an operator site, the rate of transcription of a gene into mRNA, the rate at which protein dimers disassociate, and the rate at which protein molecules degrade. We generated a single synthetic data set with θ = (0.7, 0.35, 0.9, 0.3)^T and an initial condition of X_0 = (8, 8, 8, 5)^T. The observation model is Y_t ∼ N(X_t, Σ), where Σ is a diagonal matrix with elements 1, 1, 1, 0.25. The data are shown in Figure 6. Independent U(−5, 5) priors were assumed for each log θ_i, i = 1, 2, 3, 4. A short MH run was performed using the LNA, to obtain estimates of var(log θ|y) and var(x_t|y) (to be used as the innovation variances of the random walk proposal mechanisms in (a)CPMMH) and plausible values of θ and x^o (to be used to initialise the main monitoring runs of (a)CPMMH). Pilot runs of aCPMMH and CPMMH suggested taking N = 1 and N = 20 for each respective scheme. We then ran aCPMMH and CPMMH for 10^5 iterations with these tuning choices. Table 3 and Figure 6 summarise our findings.
It is clear that aCPMMH with ρ = 0.99 results in a considerable improvement in statistical efficiency over aCPMMH with ρ = 0.0 (the modified innovation scheme). In particular, minimum ESS (calculated over the x^o chains) is almost an order of magnitude higher for ρ = 0.99 (866 vs 5524 for ρ = 0.0 and ρ = 0.99 respectively). An improvement in overall efficiency of aCPMMH over CPMMH is evident, irrespective of the choice of ρ. Increasing N to 2 results in better mixing of the x^o chains, but no appreciable increase in minimum ESS over all chains.

Discussion
Given observations at discrete times, performing fully Bayesian inference for the parameters governing nonlinear, multivariate stochastic differential equations is a challenging problem. A discretisation approach allows inference for a wide class of SDE models, at the cost of introducing an additional bias (the so-called discretisation bias). The simplest such approach uses the Euler-Maruyama approximation in combination with intermediate time points between observations, allowing a time step, chosen by the practitioner, that trades off computational cost against accuracy. It is worth emphasising that although a Gaussian transition density is assumed, the mean and variance are typically nonlinear functions of the diffusion process; consequently, the data likelihood (after integrating out the process at intermediate times) remains intractable, even under the assumption of additive Gaussian noise.
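The Gaussian transition density underlying the Euler-Maruyama approximation is easy to write down for a single (scalar) step. The sketch below assumes generic drift and diffusion callables α(x, θ) and β(x, θ), with β the variance coefficient, so the one-step transition is X_{t+∆t} | X_t = x ∼ N(x + α(x, θ)∆t, β(x, θ)∆t); the function and argument names are illustrative.

```python
import numpy as np

def euler_log_density(x_new, x_old, dt, alpha, beta, theta):
    """Log-density of a scalar Euler-Maruyama transition:
    X_{t+dt} | X_t = x_old ~ N(x_old + alpha(x_old, theta)*dt, beta(x_old, theta)*dt).
    """
    mean = x_old + alpha(x_old, theta) * dt
    var = beta(x_old, theta) * dt
    return -0.5 * (np.log(2.0 * np.pi * var) + (x_new - mean) ** 2 / var)
```

Chaining these one-step densities over intermediate time points gives the complete-data likelihood; it is the integral over the intermediate states, not each factor, that is intractable.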
These remarks motivate the use of pseudo-marginal Metropolis-Hastings (PMMH) schemes, which replace an evaluation of the intractable likelihood with a realisation of an unbiased estimator, obtained from a single run of a particle filter over the dynamic states (Andrieu et al., 2010). It is crucial that the number of particles is carefully chosen to balance computational efficiency against reasonably accurate likelihood estimates. Inducing strong and positive correlation between successive likelihood estimates can reduce the variance of the acceptance ratio, permitting fewer particles (Dahlin et al., 2015; Deligiannidis et al., 2018). Essentially, the (assumed Gaussian) innovations that are used to construct the likelihood estimates are updated with a Crank-Nicolson (CN) proposal. The resampling steps in the particle filter are also modified in order to preserve correlation: the random numbers used during this step are included in the CN update, and the particle trajectories are sorted before resampling takes place at the next time point. We follow Choppala et al. (2016) and Golightly et al. (2019) in using a simple Euclidean sorting procedure based on the state of the particle trajectory at the current observation time. We find that the effectiveness of this correlated PMMH (CPMMH) approach degrades as the observation variance and state dimension increase.
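A minimal sketch of the CN update for a vector of standard Gaussian innovations, which leaves the N(0, I) distribution invariant for any ρ ∈ [−1, 1] (the function name is illustrative):

```python
import numpy as np

def cn_propose(u, rho, rng):
    """Crank-Nicolson proposal: u' = rho*u + sqrt(1 - rho^2)*xi, xi ~ N(0, I).

    Preserves the N(0, I) distribution of the innovations, while making
    the proposed innovations (and hence successive likelihood estimates)
    correlated with the current ones.
    """
    return rho * u + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(u.shape)
```

Setting ρ = 0 recovers independent innovation proposals (ordinary PMMH), while ρ close to 1 makes successive likelihood estimates strongly and positively correlated.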
Our novel approach avoids the use of resampling steps by updating parameters conditional on the values of the latent diffusion process at the observation times (and the observations themselves), whilst integrating over the state uncertainty at the intermediate times. An additional step is then used to update the latent process at the observation times, conditional on the parameters and data. The resulting algorithm can be seen as a pseudo-marginal scheme, with unbiased estimators of the likelihood terms obtained via importance sampling. We further block together the updating of the latent states and the innovations used to construct the likelihood estimates, and adopt a CN proposal mechanism for the latter. We denote the resulting sampler augmented correlated PMMH (aCPMMH). A related approach is given by Fearnhead and Meligkotsidou (2016), who use particle MCMC with additional latent variables, carefully chosen to trade off the error in the particle filter against the mixing of the MCMC steps. We emphasise that, unlike this approach, the motivation behind aCPMMH is to avoid use of a particle filter altogether, the benefits of which are twofold: positive correlation between successive likelihood estimates is preserved, and the method for obtaining these likelihood estimates can easily be parallelised over observation intervals. Section 4.2 shows that once the updating over an inter-observation interval is sufficiently costly, substantial gains can be obtained by parallelising this task over the different inter-observation intervals: this could be useful for stiff SDEs, high-dimensional SDEs, or if multiple inter-observation intervals are tackled by a single importance sampler.
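The pseudo-marginal structure described here can be sketched generically: given log importance weights for each inter-observation interval, the overall log-likelihood estimate is the sum of per-interval log averages, and the per-interval computations are independent given x^o (hence trivially parallelisable). Computing the weights themselves requires a bridge proposal and is omitted; all names below are illustrative.

```python
import numpy as np

def interval_loglike_hat(logw):
    """Log of the unbiased importance-sampling estimate for one interval:
    log( (1/N) * sum_i exp(logw_i) ), computed stably via the max trick."""
    m = np.max(logw)
    return m + np.log(np.mean(np.exp(logw - m)))

def acpmmh_loglike(logw_per_interval):
    """Sum of per-interval log estimates. Given the latent states at the
    observation times, the intervals are conditionally independent, so this
    loop is embarrassingly parallel (no resampling couples the intervals)."""
    return sum(interval_loglike_hat(w) for w in logw_per_interval)
```

The absence of resampling is what makes the per-interval loop parallelisable, in contrast to a particle filter, where the resampling step forces serial dependence across observation times.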
In addition to the tuning choices required by CPMMH (that is, the number of particles N, the correlation parameter ρ in the CN proposal and the parameter proposal mechanism), aCPMMH requires initialisation of the latent process at the observation times and a suitable proposal mechanism. If a computationally cheap approximation of the joint posterior can be found, this may be used to initialise and tune aCPMMH. To this end, we found that use of a linear noise approximation (LNA) can work well, even in settings where inferences made under the LNA are noticeably discrepant compared to those obtained under the SDE. In scenarios where use of the LNA is not practical, a pilot run of aCPMMH can be used instead.
We compared the performance of aCPMMH with both PMMH and CPMMH using three examples of increasing complexity. In terms of overall efficiency (as measured by minimum effective sample size per second), aCPMMH offered an increase of up to a factor of 28 over PMMH. We obtained comparable performance to CPMMH for a univariate SDE application, and increases of up to factors of 10 and 14 in two applications involving SDEs of dimension 2 and 4 respectively. Our experiments suggest that although the mixing efficiency of aCPMMH increases with N, the additional computational cost results in little benefit (in terms of overall efficiency) over using N = 1. A special case of aCPMMH (ρ = 0 and N = 1) is the modified innovation scheme of Golightly and Wilkinson (2008), which is typically outperformed (in terms of overall efficiency) by taking ρ > 0.

Limitations and future directions
There are some limitations of aCPMMH which form the basis of future research. For example, in each of our applications the associated latent process exhibits unimodal marginal distributions and linear dynamics between observation instants, which are well approximated by the modified diffusion bridge construct. Extension to the multimodal case would require an importance proposal that captures the multimodality of the marginal distributions of the true bridge between observation instants. We refer the reader to the guided proposals of van der Meulen and Schauer (2017) and Schauer et al. (2017) for possible candidate proposal processes.
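For reference, the modified diffusion bridge construct propagates the path between fixed endpoint values. A minimal scalar sketch, assuming the standard MDB drift (x_T − x)/(T − s) and variance β(x, θ)∆τ(T − s − ∆τ)/(T − s) (names and signature are illustrative):

```python
import numpy as np

def mdb_step(x, x_end, s, T, dt, beta, theta, rng):
    """One Euler step of the scalar modified diffusion bridge from time s,
    conditioned to hit x_end at time T. The drift pulls the path towards
    the endpoint; the variance shrinks to zero as s + dt approaches T."""
    drift = (x_end - x) / (T - s)
    var = beta(x, theta) * dt * (T - s - dt) / (T - s)
    return x + drift * dt + np.sqrt(max(var, 0.0)) * rng.standard_normal()
```

By construction, the final step is (numerically) deterministic and lands the bridge on the conditioning value x_end; because the drift is linear in x, the proposal struggles when the true bridge is multimodal, as noted above.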
Additional directions for future research include the use of methods based on adaptive proposals, which may be of benefit in both the parameter and latent state update steps. Proposal mechanisms that exploit gradient information, such as Hamiltonian Monte Carlo (HMC) (Duane et al., 1987), may also be of interest. We note that in the case of N = 1, it is possible to directly calculate the required log-likelihood gradient. For N > 1, importance samples generated from π(x^L | x^o, θ) could be used to estimate this gradient. For a general discussion on the use of particle filters for estimating log-likelihood gradients, we refer the reader to Poyiadjis et al. (2011); see also Nemeth et al. (2016). A comparison of aCPMMH with approaches that target the joint posterior of the parameters and latent process also warrants further attention; see e.g. Botha (2020) for an implementation of the latter.
Although we have focussed on updating the latent states in separate blocks (single-site updating), other blocking schemes may offer improved mixing efficiency. Alternatively, it might be possible to reduce the number of latent variables, for example by only explicitly including latent states in the joint posterior at every (say) kth observation instant. The success of such a scheme is likely to depend on the accuracy of an importance sampler that covers k observations, and on whether or not the resulting likelihood estimates can be made sufficiently correlated. This is the subject of ongoing work.

A Linear noise approximation (LNA)
The linear noise approximation (LNA) provides a tractable approximation to the SDE in (1). We provide brief, intuitive details of the LNA and its implementation, and refer the reader to Fearnhead et al. (2014) (and the references therein) for an in-depth treatment.
where p^(a)(y|θ) is the marginal likelihood under the LNA. Then, a sample x^o is drawn from p^(a)(x^o|θ, y) for each θ sample from step one. Note that p^(a)(y|θ) and p^(a)(x^o|θ, y) are tractable under the LNA.
A forward filter is used to evaluate p^(a)(y|θ). Since the parameters θ remain fixed throughout the calculation of p^(a)(y|θ), we drop them from the notation where possible.
Hence, samples of θ can be obtained from π^(a)(θ|y) by running a Metropolis-Hastings scheme with target (34). Then, for each (thinned) θ draw, x^o can be sampled from p^(a)(x^o|θ, y) using a backward sampler (see Algorithm 6).
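The Metropolis-Hastings step over θ used here is a standard random-walk sampler on the log scale. A generic, self-contained sketch (the log-posterior, step size and all names are illustrative placeholders, not the authors' tuned implementation):

```python
import numpy as np

def rw_metropolis(logpost, theta0, n_iters, step, rng):
    """Random-walk Metropolis: propose theta' = theta + step * N(0, I) and
    accept with probability min(1, exp(logpost(theta') - logpost(theta)))."""
    theta = np.asarray(theta0, dtype=float)
    lp = logpost(theta)
    chain = np.empty((n_iters, theta.size))
    for i in range(n_iters):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # MH accept/reject
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain
```

In the LNA setting, `logpost` would evaluate the tractable π^(a)(θ|y) up to a constant; the innovation variance `step` would be tuned from a pilot run, as described in the examples.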

Figure 2: Birth-Death model. Marginal posterior distributions using data set D_2 and based on the output of aCPMMH (solid lines) and the LNA (dashed lines). The true values of θ_1, θ_2 and θ_1 − θ_2 are indicated.

Figure 4: Lotka-Volterra model. Marginal posterior distributions using data set D_3 and based on the output of aCPMMH (solid lines) and the LNA (dashed lines). The true values of θ_1, θ_2 and θ_3 are indicated.

Figure 6: Autoregulatory model. Data set (circles) and summaries (mean and 95% credible intervals obtained from the output of aCPMMH) of the within-sample predictive π(y|D).

Figure 7: Autoregulatory model. Marginal posterior distributions based on the output of aCPMMH (solid lines) and the LNA (dashed lines). The true parameter values are indicated.

Table 1: Birth-Death model. Number of particles N, correlation parameter ρ, CPU time (in seconds, s), minimum ESS (over the θ and x^o chains), minimum ESS per second, and minimum ESS per second relative to PMMH. All results are based on 5 × 10^4 iterations of each scheme.

Table 2: Lotka-Volterra model. Number of particles N, correlation parameter ρ, CPU time (in seconds, s), minimum ESS (over the θ and x^o chains), minimum ESS per second, and minimum ESS per second relative to PMMH. All results are based on 10^5 iterations of each scheme.

Table 3: Autoregulatory model. Number of particles N, correlation parameter ρ, CPU time (in seconds, s), minimum ESS (over the θ and x^o chains), minimum ESS per second, and minimum ESS per second relative to PMMH. All results are based on 10^5 iterations of each scheme.
Partition X_t as

X_t = η_t + R_t, (25)

where {η_t, t ≥ 0} is a deterministic process satisfying the ODE

dη_t/dt = α(η_t, θ), (26)

and {R_t, t ≥ 0} is a residual stochastic process. The residual process (R_t) satisfies

dR_t = {α(X_t, θ) − α(η_t, θ)} dt + β(X_t, θ) dW_t, (27)

which will typically be intractable. A tractable approximation can be obtained by Taylor expanding α(X_t, θ) and β(X_t, θ) about η_t. Retaining the first two terms in the expansion of α and the first term in the expansion of β gives

dR̂_t = H_t R̂_t dt + β(η_t, θ) dW_t, (28)

where H_t denotes the Jacobian matrix of α evaluated at η_t.

Algorithm 5 LNA forward filter
1. For t = 0, . . . , n − 1,
(a) Prior at t + 1. Initialise the LNA with η_t = a_t, V_t = C_t and P_t = I_d. Integrate the ODEs (26), (33) and (30) forward to t + 1 to obtain η_{t+1}, V_{t+1} and P_{t+1}.