Scalable Inference for Markov Processes with Intractable Likelihoods

Bayesian inference for Markov processes has become increasingly relevant in recent years. Problems of this type often have intractable likelihoods and prior knowledge about model rate parameters is often poor. Markov Chain Monte Carlo (MCMC) techniques can lead to exact inference in such models but in practice can suffer performance issues including long burn-in periods and poor mixing. On the other hand approximate Bayesian computation techniques can allow rapid exploration of a large parameter space but yield only approximate posterior distributions. Here we consider the combined use of approximate Bayesian computation (ABC) and MCMC techniques for improved computational efficiency while retaining exact inference on parallel hardware.


Introduction
Stochastic kinetic models describe the probabilistic evolution of a dynamical system. Motivated by the need to incorporate intrinsic stochasticity in the underlying mechanics of the systems, stochastic models are increasingly used to model a wide range of real-world problems, particularly, but not limited to, computational systems biology, predator-prey population models, and single species population development (Kitano, 2002;Boys et al, 2008;Gillespie and Golightly, 2010). Systems are governed by a network of reactions which change the state of the system by a discrete amount, and hence are most naturally represented by a continuous time Markov process on a discrete state space. Typically, exact realisations from the model are obtained using the direct method (Gillespie, 1977). There are a number of faster, but approximate algorithms, such as the diffusion approximation, moment closure, or hybrid simulation strategies (Gillespie, 2000;Salis and Kaznessis, 2005;Gillespie, 2009).
Our goal is to perform statistical inference for the parameters that govern the dynamics, where data are partially missing and prior knowledge of reaction rates may be poor. Likelihood functions for such problems are rarely analytically tractable but it is possible to leverage particle Markov chain Monte Carlo (pMCMC) methods to perform exact inference (Golightly and Wilkinson, 2011;Andrieu et al, 2010). Typically, pMCMC algorithms exhibit a high computational cost, since at each iteration we run a sequential Monte Carlo (SMC) filter, which requires multiple forward simulations. This cost can become overly prohibitive if convergence is slow.
Approximate Bayesian computation (ABC) techniques, allow posterior inference to be made for problems where evaluation of the likelihood function is unavailable, relying heavily on the ability to simulate from the model (Tavare et al, 1997;Pritchard et al, 1999;Beaumont et al, 2002). An approximation to the true posterior distribution of interest is made via samples of parameter vectors, θ, that yield simulated data deemed "close" to the observed data. Here the definition of close is controlled by a tolerance, , and distance function ρ(·), hence retained samples are from π(θ|ρ(D, D * ) ≤ ). The simple rejection algorithm typically leads to high numbers of proposed parameter vectors being rejected for meaningful tolerance values (Pritchard et al, 1999). Further developments within this framework lead to MCMC schemes which avoid calculation of the likelihood, as well as sequential Monte Carlo algorithms, which typically exhibit better acceptance rates than the simple rejection sampler (Marjoram et al, 2003;Del Moral et al, 2006;Sisson et al, 2007;Toni et al, 2009). Such techniques have successfully been applied to stochastic kinetic models for approximate inference (Drovandi and Pettitt, 2011;Fearnhead and Prangle, 2012).
In this article, we explore a scenario where we desire exact posterior inference for model rate parameters, while exploiting parallel hardware. By coupling scalable, approximate and exact techniques, we are able to increase computational efficiency of posterior sampling via parallel particle MCMC and improve convergence properties of chains that suffer expensive burn-in periods to further increase efficiency.

Stochastic kinetic models
Consider a network of reactions involving a set of u species X 1 , . . . , X u and v reactions R 1 , . . . , R v where each reaction R i is given by Let P be the v × u matrix of pre-reaction coefficients of reactants p i,j , similarly Q, the v × u matrix of post-reaction coefficients of products. The stoichiometry matrix, defined as is a useful way to encode the structure of the reaction network. Also define X t to be the vector (X 1,t , . . . , X u,t ) denoting the number of species X present at time t. Each reaction R i is assumed to have an associated rate constant, θ i , and a hazard function h i (X t , θ i ) which gives the overall propensity for a type i reaction to occur. The form of h i (X t , θ i ) in many cases can be considered as arising from interactions between components in a well mixed population. This leads to reactions following the law of mass action kinetics where the hazard function for a reaction of type i takes the form If θ = (θ 1 , . . . , θ v ) and h(X t , θ) = (h 1 (X t , θ), . . . , h v (X t , θ v )) then values for θ and X 0 complete full specification of this Markov process.
Algorithm 1 The Direct method 1. Set t = 0. Initialise the rate constants θ and initial states X 0 .

Calculate the hazard functions
is the j th column of the stoichiometry matrix S.
6. If t < T return to 2.

The Direct method
Exact simulations of the evolution of such a system can be obtained via the Direct method (Gillespie, 1977). Algorithm 1 describes the procedure for exact forward simulation of a stochastic kinetic model, given its stoichiometry matrix, S, a vector of reaction rates, θ, an initial state vector X 0 and an associated hazard function h(X t , θ). Reactions simulated to occur via the Gillespie algorithm incorporate the discrete and stochastic nature of the system since each reaction, which increments or decrements species levels by discrete amounts, are chosen probabilistically. Less computationally expensive simulation algorithms such as the chemical Langevin equation (CLE), or Euler Maruyama scheme relax the restriction imposed by the discrete state space but retain the stochasticity of the underlying mechanics, giving approximate realisations of the progression of the species (Gillespie, 2000). For the purpose of exact inference and this article we consider only exact solutions to the stochastic differential equations governing the system. For a more in depth background into stochastic kinetic modelling see Wilkinson (2011).

Methods for Bayesian Inference in Intractable Likelihood Problems
Methods for likelihood-free Bayesian inference have become popular due to the increasing use of models in which the likelihood function is either deemed too computationally expensive to compute, or analytically unavailable. In particular, approximate Bayesian computation (ABC), and particle filtering methods have proven to be two classes of algorithms in which one can proceed with inference in a scenario where evaluation of the likelihood is unavailable, commonly termed likelihood-free inference.
3. Calculate a measure of distance between the candidate data, D * , and the observed data D, ρ(D, D * ).

Approximate Bayesian computation
The goal of approximate Bayesian computation techniques is to obtain a collection of samples from the posterior distribution π(θ|D) where the likelihood function π(D|θ) is unavailable, due to either high computational cost or analytical intractability. Typically we rely on the assumption that simulation from the model given a set of parameters, θ, is relatively straightforward, as in algorithm 1. Given a set of proposed parameter vectors we would ideally keep any such vectors which yield data simulations that are equal to the observations that we have. In reality however, we will typically match the simulated data to the observations perfectly with low probability. To avoid rejecting all proposals, we instead keep parameters that yield simulations deemed sufficiently close to the observations. We characterise a set of simulated data, D * as sufficiently close if, for a given metric, ρ(·), the distance between D * and observed data D is below a tolerance threshold, . The simple rejection sampler is described in algorithm 2.
Rather than leading to a sample from the exact posterior distribution, the samples instead have the distribution π(θ|ρ(D, D * ) < ). As → 0 this tends to the true posterior distribution if ρ(·) is a properly defined metric on sufficient statistics. Conversely as → ∞ one recovers the prior distribution on the parameters. Since it is usually the case that sufficient statistics are not available in this type of problem, further approximation can be made by choosing a set summary statistics that are not sufficient, but it is hoped that describe the data well. If a large value of is chosen, a large number of proposed candidates are accepted, but little is learnt about the posterior distribution. Choosing a small tolerance yields a much better approximation to π(θ|D) but at the expense of poorer acceptance rates and hence increased computational cost. This rejection based approach has been included in a number of sampling techniques.
Within the context of this article we consider a sequential approach to ABC inference. An example of such a scheme, based on importance sampling, is described in algorithm 3, see Toni et al (2009) for details.
Algorithm 3 as described here allows us choice of an essentially arbitrary perturbation kernel K t (.) for preventing degeneracy of particles, and the sequence of tolerances 0 , . . . , T . In practice the choice of K t (.) will have an important effect on the overall efficiency of the algorithm. A perturbation which gives small moves will typically allow acceptance rates to remain higher, whilst exploring the space slowly. Conversely making large moves allows better exploration of the parameter space, but usually at the cost of reduced acceptance rates. We shall use the 'optimal' choice of a random walk perturbation kernel for this algorithm as detailed in Filippi et al (2011). For a multivariate Gaussian random walk Algorithm 3 Sequential ABC 1. Initialise 0 > 1 > . . . > T > 0 and set the population indicator, t = 0.

Set particle indicator
Else sample θ * from the previous population {θ , if t > 0 .
If i < N set i = i + 1, go to 3 5. Normalise the weights, if t < T , set t = t + 1 and go to 2.
kernel at population t, we choose Σ (t) as where N is the number of weighted samples we have from the previous population π(θ|ρ(D, D * ) < t−1 ) with weights ω, N 0 is the number of those samples,θ, that satisfy π(θ|ρ(D, D * ) < t ) with associated normalised weightsω. This builds on the work of Beaumont et al (2009). The choice of sequence of tolerances 0 , . . . , T also has a large effect on the efficiency of the algorithm. A sequence which decreases too slowly will yield high acceptance rates but convergence to the posterior will be slow. Whereas a rapidly decreasing sequence typically yields poor acceptance rates. We will, throughout this article, use an adaptive choice of t determined by taking a quantile of the distances at (t − 1).

ABC methods in parallel
Non MCMC-based ABC techniques, such as those described in algorithm 2 and algorithm 3 are often amenable to parallelisation. Since all proposed parameters are independent samples from the same distribution, and the acceptance step for each does not depend on the previously accepted particle, the bulk of computation for a sequential ABC scheme can be run in parallel, greatly reducing overall CPU time required to obtain a sample from the target. Some thread communication is necessary as we move from one distribution in the sequence to the next, something that would be avoided if using the simple rejection sampler, but typically this is small in comparison to the work done in forward simulation.
Coding of a parallel ABC algorithm adds little complexity over a non-parallelised version. This is particularly true for the simple rejection sampler where one effectively Algorithm 4 Metropolis Hastings 1. Initialise with a random starting value θ ∼ π(θ).
3. Accept the move with probability else remain at θ.

Return to 2.
just runs a sampler on each of N processors and no communication between threads is needed.

Markov Chain Monte Carlo
Suppose interest lies in π(θ|D), and that we wish to construct a MCMC algorithm whose stationary distribution is exactly this posterior. Using an appropriate proposal function q(θ * |θ) we can construct a Metropolis Hastings algorithm to do this as described in algorithm 4. This can often be impractical due to unavailability of the likelihood term, π(D|θ).
In practice one normally chooses a proposal function q(θ * |θ) such that proposed moves from θ to θ * are distributed symmetrically about θ giving rise to random walk Metropolis Hastings. Typical choices include a uniform U(θ − σ, θ + σ) or Gaussian N (θ, Σ) distribution. In this article we shall consider Gaussian innovations for random walk proposals. It has been shown that under various assumptions about the target, the optimal scaling for a random walk Metropolis algorithm using a Gaussian proposal, Σ q , for a d dimensional distribution is where Σ is the posterior covariance (Roberts et al, 1997;Roberts and Rosenthal, 2001). It is of note however that typically the posterior covariance is unavailable.

Parallel MCMC
MCMC algorithms are somewhat less amenable to parallelisation than rejection sampling algorithms. Essentially there are two options to be considered, parallelisation of a single MCMC chain or the construction of parallel chains, for an in depth discussion see Wilkinson (2006). Parallelisation of a single chain in many scenarios is somewhat difficult due to the inherently iterative nature of a Markov chain sampler and will not be discussed in detail here. Running parallel chains, however, is straightforward and potentially useful.

Parallel chains
In practice, chains initialised with an arbitrary starting point will be, after an appropriate burn-in period, essentially sampling from the target. In the context of a serial implementation there is still some debate over whether it is better to run a single long chain, or many shorter chains. The benefits of the single chain are that any burn in period must only be suffered once, where as the argument for multiple shorter chains is that one may better diagnose convergence to the stationary distribution. The argument changes when considering parallel implementation. Indeed, if burn-in periods are relatively short, running independent chains on separate processors can be a very time efficient way of learning about the distribution of interest. Burn-in is still a potential limiter on the scaling of the performance to be had when employing parallel chains with the number of processors. The greater the burn-in period of a chain, the more time each processor has to waste computing samples that will eventually be thrown away. The theoretical speed up calculation given N processors to obtain n stored samples with a burn-in period b is which is clearly limited for any b > 0, as N → ∞ (Wilkinson, 2006). A "perfect" parallelisation of multiple chains then is one in which we have no burn-in period, i.e initialise each chain with an independent draw from the target. The performance gain when run on N processors is then of factor N . Clearly this "perfect" situation will not be possible to implement in practice since we typically are unable to sample from the target.

Particle MCMC
Re-consider the Metropolis Hastings MCMC algorithm from section 3.2. The ability to construct such a scheme relies on the evaluation of the likelihood function π(D|θ) among others. Crucially, the problems of interest discussed in this article are such that the likelihood term π(D|θ) is unavailable. However Andrieu and Roberts (2009) proposed a pseudo marginal MCMC approach to this problem. In the acceptance ratio of the MCMC scheme, replace the intractable likelihood, π(D|θ), with a Monte Carlo estimate. It can be shown that, provided E[π(D|θ)] = π(D|θ), the stationary distribution of the resulting Markov chain is exactly the distribution of interest. In the context of inference for state space models, it is natural to make use of sequential Monte Carlo techniques through a bootstrap particle filter, (Doucet et al, 2001), to obtain estimates of the likelihood π(D|θ). The bootstrap particle filter for application to Markov processes is described in algorithm 5. Substituting the estimate ofπ(D|θ) in place of the likelihood function yields a MH algorithm with acceptance probability min 1,π (D|θ * )π(θ * )q(θ|θ * ) π(D|θ)π(θ)q(θ * |θ) .
It turns out that the particle filter's estimate of marginal likelihood is unbiased, giving rise to an "exact approximate" pseudo-marginal MCMC algorithm. This use of a SMC sampler within a pseudo-marginal algorithm is a special case of the particle marginal Metropolis Hastings (PMMH) algorithm described in Andrieu et al (2010), which in general can be Algorithm 5 The boot-strap particle filter At time t we have a set of N particles X * t = {(x i t , π i t ) : i = 1, . . . , N }. The filter assumes fixed parameters and so we drop the θ from our notation. t ∈ (0, 1, . . . , T ) for observed data with T discrete time-point observations.
1. Initialise at X * 0 , a set of N independent draws from our prior distribution on the state.
2. At time t, suppose we have a sample X * t ∼ π(X t |D 1:t ).
3. Sample a set of indices for candidates for forward simulation, I i t according to the weights π t . 4. Simulate forward from the model the chosen paths, 5. Calculate weights, w i t+1 = p(d t+1 |x i t+1 ), and normalise to set used to target the full joint posterior on the state and rate π(θ, X|D). In this article we consider only the pseudo-marginal special case. As discussed in section 3.2.1 this type of scheme does not lend itself well to parallelisation unless we have the opportunity to initialise chains with a sample from the posterior distribution π(θ|D). We recognise that rather sophisticated parallel particle filter algorithms which have the potential for increasing the speed of moves within a single chain are available. However the speed up is limited by necessary communication and synchronisation between processors and typically do not scale well. The pseudo-marginal MCMC scheme requires specification of a number of particles, N , to be used in the boot-strap filter. It was noted, (Andrieu and Roberts, 2009), that the efficiency of the scheme decreases as the variance of the estimated marginal likelihood increases. This can be overcome by increasing the value of N , albeit at an increased computational cost. An optimal choice of N was the subject of Pitt et al (2012) and Doucet et al (2012). The former suggest that N should be chosen such that the variance in the estimated log-likelihood is around 1 while the latter show that the penalty is small for values between 0.25 and 2.25.

Combined use of ABC and pMCMC
Since it is typically not possible to initialise a MCMC chain with a draw from the desired target, we propose an approach to parallel MCMC by choosing initial parameter vectors according to samples from an approximate posterior distribution. The intuition is that if we have a reasonable approximation to the target of interest, samples from the approximation will closely match those from π(θ|D). Because of this we expect that after a very short burn-in we are sampling from the desired target. Hence we are approaching the scenario of perfect parallelisation of MCMC. Clearly the better the approximation the shorter the burn-in for each chain will be.
If we first run an ABC scheme targeting an approximation to this distribution, we can exploit parallel hardware to eliminate a large region of prior parameter space very quickly.
Then take a set of N independent samples from π(θ|ρ(D, D * ) < ) as initialisations to N independent, parallel, pMCMC chains each targeting the exact posterior distribution π(θ|D).
We can in some sense consider this process of obtaining a sample from an ABC posterior as an artificial burn-in period. Crucially however, since the ABC algorithm yields a large set of samples from π(θ|ρ(D, D * ) ≤ ) the computational price of performing the artificial burn-in has only to be paid once and can be parallelised. With a set of samples we can initialise an arbitrary number of MCMC chains each with an independent parameter vector θ 0 which comes from a distribution close to the target π(θ|D).
Since ABC is being used only for initialisation, it is not used as a prior for MCMC algorithm, nor does it form part of the proposal Hence, the ABC approximation does not affect the exactness of the pMCMC target.

Random Walk pMCMC using ABC
We now refer back to the optimal choice of Gaussian random walk kernel of Roberts and Rosenthal (2001) mentioned in section 3.2. Since we hope that we start with a reasonable approximation to the true posterior, we likewise consider that the covariance of a sample from π(θ|ρ(D, D * ) < ) denoted here Σ ABC , will be close to the covariance of true posterior samples. This allows us to use the approximate posterior covariance structure as an informative tool for calibrating the random walk innovations. In most MCMC applications, since the true posterior variance is unknown, Σ q requires tuning, typically from pilot runs. In the case of using Σ ABC we hope that we can also remove the necessity to tune Σ q . In practice we take In addition we use our approximation to automatically tune the number of particles required for the particle filter. In practice we take the average of the approximate distribution,θ ABC to calculate the number of particles required to satisfy Var(π(D|θ ABC )) 2, in line with Sherlock et al (2013). The hybrid ABC pMCMC algorithm is outlined in algorithm 6.
relatively simple example of a model in this context, it highlights many of the difficulties associated with more complex systems. We summarise the system by its stoichiometry matrix, S and hazard function h(X, θ): Figure 1(a) is a set of synthetic, noisy observations of the two species simulated using the Gillespie algorithm with true parameter values log(θ) = (0, −5.30, −0.51) together with a Gaussian measurement error structure, d j,t ∼ N (X j,t , 10 2 ).

pMCMC for Lotka-Volterra
Exact parameter inference is possible for this system using pseudo marginal MCMC pMCMC scheme; provided the chain is initialised near the posterior mode (Wilkinson, 2011). However under poor initialisation, a pseudo-marginal particle MCMC scheme will perform very badly. Suppose we have little prior knowledge on the rate parameters, We take a prior distribution for the initial state X 0 on each individual species as Poisson distributions with rate parameter equal to the true initial conditions, x 1,0 ∼ Pois(50), x 2,0 ∼ Pois(100).
Further we assume that the variance, σ 2 , of the Gaussian measurement error is known. Using a Gaussian random walk proposal q(θ * |θ) we construct a MH algorithm using a bootstrap particle filter, targeting π(θ|D) as in algorithm 4. Figure 1(b) shows that when initialising the chain using random draws from the weakly informative prior, the chains . Due to numerical issues, the chains wander the space randomly. (c,d,e) are heat-maps of log-likelihood estimates from the bootstrap particle filter; the white areas is where the estimated log-likelihood is negative infinity. In each of these plots the missing parameter is taken to be fixed at the true value wander the space at random failing to converge due to the numerical underflow issue. Further investigation shows why this is happening. Figure 1 (c,d,e) show log-likelihood estimatesπ(D|θ) using a bootstrap particle filter as a heat map. The plot shows that the likelihood surface has a very steep gradient around the true values, and that a large proportion of the space leads to an estimate of negative infinity. This has two consequences, the first being that when we are a reasonable distance from a region of high likelihood, to available numerical accuracy, we obtain estimates that are numerically equal to negative infinity. This means that exploration of the space depends only on the random walk moves, unguided by the likelihood structure. On top of this, small proposed moves can lead to large variation in the estimated likelihood, leading to poor exploration of the tails in the posterior distribution. Whilst we are guaranteed to eventually converge to the stationary distribution, the required computational cost, without carefully thought out initialisation, could be very high. We note that this is not a failure of the theory or algorithm, but a consequence of the computational accuracy available. We therefore apply the proposed ABC initialisation for a "perfect" parallel pMCMC scheme.

Results for Lotka-Volterra model
We use a sequence of seven distributions π(θ|ρ(D, D * ) < t ), t = 0, . . . , 6 to obtain our approximation to the posterior. For each population t we take t as the 0.3 quantile of the distribution of distances from the samples at t − 1 and propose candidates until we reach a sample of 1000 for each t. At each stage we perform the forward simulation of the model ABC q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  Figure 2 (a) shows summaries for the reaction 1 rate parameter θ 1 for each of the distributions in the series. The distributions quickly remove potential initialisation points and converge around the true value log(θ 1 ) = 0. Given the sample from π(θ|ρ(D, D * ) < 7 ) we initialise eight MCMC chains on separate processors with random draws. The results in figure 2 (b,c,d) are then, the 20,000 pooled samples from the eight independent parallel chains, each of which has been thinned by a factor of 100 to give 2,500 samples. Each chain is sampling from the same stationary distribution as seen in the trace plot for θ 1 , figure 2 (c), and mixing is good, figure 2 (b). Further the true parameter values, log(θ) = (0, −5.30, −0.51), used to simulate the data are well identified within the posterior densities, figure 2 (d).

Real-data problem -aphid model
Next we consider a model of aphid population dynamics as proposed in Matis et al (2007).
The system can be represented by the following reactions: and summarised in the same way as with the Lotka-Volterra example in terms of its stoichiometry matrix S, and hazard function h(X t , θ), N t and C t are the numbers of aphids alive at time t and the cumulative number of aphids that have lived up until time t respectively. In the first reaction, when a new aphid is introduced into the system we get an increase in both the current number of aphids and the cumulative count. When an aphid is removed the number of aphids decreases by one but the cumulative count remains the same. Aphid death is considered to be proportional not only to the number of aphids currently in the system, but also to the cumulative count, representing the idea that over time they are making their own environment less inhabitable.
Given some initial conditions X 0 = (N 0 , C 0 ) and a set rate of parameters θ, we can simulate from the model much like with any stochastic system of this type. However the difficulty in practice with the model presented here is that we will never have observations of the cumulative count. Observations then are noisy, discrete time measurements of just a single species of the system.

Aphid data
We now consider the data described in Matis et al (2008) consisting of cotton-aphid counts for twenty seven treatment-block combinations. The treatments consisted of three nitrogen levels(blanket, variable and none), three irrigation levels(low, medium and high) and three blocks. The sampling times of the data are t = 0, 1.14, 2.29, 3.57, 4.57 weeks, or every seven to eight days. We restrict our investigation to a single treatment combination, three data sets with blanket nitrogen level and low irrigation. If we denote the block by i ∈ {1, 2, 3} then the data D i is the number of aphids, N , in block i at each time t. The data are plotted in figure 3(a).
We make the assumption that the counts are observed with error such that We again use a set of weakly informative priors on the rate parameters θ We place a prior of the form We treat the three sets of observations as repeats of the same experiment. A full treatment of all twenty-seven data sets using a fixed effects model can be found in Gillespie and Golightly (2010). We consider the initial aphid counts to be the true values, as in Gillespie and Golightly (2010), on the basis that there should be no error in counting such small populations.

Results for the aphid growth model
We use the same criteria for the choice of t for the ABC section of the inference as with the Lotka-Volterra model in 4.2.1, namely the 0.3 quantile of the distribution of distances. A sequence of five distributions gives us 1000 samples from π(θ|ρ(D, D * ) < 5 ) with which we, as before, initialise eight parallel chains. We record 20,000 samples from the exact target posterior π(θ|D) after appropriate thinning. Figure 3(c and d) shows the analysis of the MCMC chains. Again we find that each chain is sampling from the same target and posterior densities are very close from all eight chains. Figure 3(b) shows posterior predictive quantiles given a sample from the pooled posterior distribution. Model fit appears to be reasonable.

Discussion
We have proposed an approach to exact inference for Markov processes that combines the relative strengths of ABC and pMCMC methodology to increase computational efficiency through use of parallel hardware. Through use of an approximation to the posterior distribution of interest, obtained via a sequential ABC algorithm which is easy to parallelise, we can set up a parallel implementation of pMCMC which has numerous desirable properties. By enabling the construction of independent parallel chains initialised close to the stationary distribution, this enables fast convergence and sampling from an exact posterior distribution that scales well with available computational resources. Algorithmic tuning parameters required for pMCMC, such as the variance of Gaussian random walk proposals and numbers of particles for the particle filter can be chosen without the need for additional pilot runs, as a consequence of having a sample from an ABC posterior. In addition, independent parallel chains allow verification of convergence and the computational saving in burn-in times extends to repeat MCMC analyses. We have demonstrated this approach by applying it to two stochastic kinetic models. With the Lotka-Volterra predator prey system, a relatively simple model in which both species can be observed, we highlighted clear issues with practical implementation of a pseudo-marginal approach in a scenario in which prior information on reaction rate parameters is poor. This issue can be alleviated by first obtaining a sample from an approximation to the posterior, then using it to guide an exact pMCMC scheme. The approach discussed performed similarly well in the application to a set of real data for a model for aphid growth dynamics in which one of the species in the system can never be observed, where we again imposed weak prior conditions on the rate parameters governing the system and had access to repeat data.