Unbiased Bayesian inference for population Markov jump processes via random truncations

We consider continuous time Markovian processes where populations of individual agents interact stochastically according to kinetic rules. Despite the increasing prominence of such models in fields ranging from biology to smart cities, Bayesian inference for such systems remains challenging, as these are continuous time, discrete state systems with potentially infinite state-space. Here we propose a novel efficient algorithm for joint state/parameter posterior sampling in population Markov Jump processes. We introduce a class of pseudo-marginal sampling algorithms based on a random truncation method which enables a principled treatment of infinite state spaces. Extensive evaluation on a number of benchmark models shows that this approach achieves considerable savings compared to state of the art methods, retaining accuracy and fast convergence. We also present results on a synthetic biology data set showing the potential for practical usefulness of our work.


Introduction
Discrete state, continuous time stochastic processes such as Markov Jump Processes (MJP) [1] are popular mathematical models used in a wide variety of scientific and technological domains, ranging from systems biology to computer networks.Of particular relevance in many applications are models where the state-space is organised according to a population structure (population Markov Jump Processes, pMJP): each state label corresponds to counts of individual entities in a number of populations (or species).These models are at the root of essentially all agent-based models, a class of models which is gaining increasing popularity in applications ranging from smart cities, to epidemiology, to systems biology.Despite their importance, solving inferential problems within the pMJP framework is challenging: the discrete nature of the system prevents the use of simple parametric distributions, and the size of the state space (which can be unbounded for open systems) effectively rules out analytical computations.At the same time, technological advances in areas as diverse as single cell biology and remote sensing are providing increasing amounts of data which can be naturally modelled as pMJPs, creating a pressing need for inferential methodologies.
In response to these developments, researchers in the statistics, machine learning and systems biology communities have been addressing inverse problems for MJPs using a variety of methods, from variational techniques [2,3] to particle-based [4,5] and auxiliary variable sampling methods [6].Markov-chain Monte Carlo (MCMC) methods, in particular, offer a promising direction: while often computationally more intensive than variational methods, they provide asymptotically exact inference.However, standard MCMC methods rely on likelihood computations, which are computationally or mathematically infeasible for pMJPs with a large or unbounded number of states.Such systems are commonplace in many applications, where one is often confronted with open systems where upper bounds on the numbers of agents are difficult to come by.As far as we are aware, current methods address this issue by arbitrarily truncating the state space according to pre-defined heuristics, offering no control over the error introduced by this procedure.
In this paper we present a novel Bayesian approach to posterior inference in pMJPs which solves these issues by adopting a pseudo-marginal approach based on random truncations, yielding both asymptotic exactness and computational improvements.We build on the auxiliary variable Gibbs sampler for finite state Markov Jump Processes (MJP) of [6], significantly increasing its efficiency by leveraging the more compact representation of the kinetic parameters provided by the pMJP framework.We then present a novel formulation of the likelihood, which enables the deployment of a Russian Roulette-like random truncation strategy as in [7,8].Based on this, we develop a pseudo-marginal sampling approach for general pMJPs, obtaining two novel algorithms: a relatively straightforward Metropolis-Hastings pseudo-marginal scheme, and an auxiliary variable pseudo-marginal Gibbs sampler.We examine the performance of these algorithms in terms of accuracy and efficiency on non-trivial case studies.We conclude the paper with a discussion of our contribution in the light of existing research and possible future directions in systems biology.

Population Markov Jump Processes
Population Markov Jump Processes are a particular type of Markov Jump Processes (also known as Population Continuous Time Markov Chains); they are continuous time stochastic processes whose discrete state vector s = (n 1 , n 2 , . . ., n M ) gives the agent counts of each of M populations (or species) interacting through R reaction channels.We will adopt here the language of chemical reactions to describe such processes, but the same considerations apply in general.Reactions between individual agents (or molecules) happen as a result of random collisions, and each reaction changes the state by a finite amount, encoded in the stoichiometry of the system, corresponding to the creation/ destruction of a certain number of molecules.Each reaction i also has an associated kinetic law giving its rate: this is generally of the form where ρ i is a fixed function of the state n, while θ i are (usually unknown) kinetic parameters.Therefore, while in a general MJP there can be a parameter associated with each possible transition, in pMJPs the dynamics are captured more succinctly by a single parameter per reaction.A schematic of a simple pMJP is given in Figure 1, where it can be seen that the same reaction can correspond to multiple transitions in the state-space of the process, all of which follow the same kinetic law and incur the same update to the state.
(2,2,0,0) The time evolution of the process marginals is given by the Chemical Master Equation (CME): where p i (t) is the probability of being in state i at time t and a ij is the rate of jumping from state i to state j, which for pMJPs is known from the kinetic law.
For finite state-spaces, one can gather the transition rates a ij in the generator matrix A, and the CME can be solved analytically as: This solution can be computationally intensive, even with the use of specialized algorithms like [9].

Uniformisation and inference
An alternative approach to solve the CME is given by uniformisation [10], a well-known technique for the transient analysis of Markovian systems, used widely in fields like performance modelling.Given a MJP with generator A, uniformisation constructs a discrete time Markov chain by imposing a common exit rate γ for all states.For this procedure to be consistent, γ must be no less than the highest exit rate among all states.The resulting uniformised system is then faster than the original, in the sense that transitions occur at a higher rate.To compensate for this and maintain the behaviour of the original MJP, virtual jumps must be added from each state to itself.This results in a discrete time system with transition probability matrix B = 1 γ A + I, in which likelihood computations are standard.In this discrete time system, the waiting time before a jump occurs now follows an exponential distribution with rate γ, regardless of the current state.The probability of jumping from state i to another state j is a ij γ , but it is now also possible to remain in i after the jump, with probability 1 − 1 γ j =i a ij .Jensen's classical result [10] then guarantees that all the time-marginals of the discrete time process match those of the continuous time chain.
Uniformisation has previously been exploited by [6] to draw posterior samples from a MJP conditioned on a set of observations.The idea is to construct a discrete time chain using uniformisation, sample a trajectory (including self-loops) and run a standard forward filteringbackward sampling (FFBS) algorithm on it.This gives a new trajectory which, when selfjumps are removed, is a sample from the posterior process.This path-sampling algorithm can be alternated with Gibbs updates to jointly sample transition probabilities; in [6] this is accomplished by choosing conjugate Dirichlet priors on each entry of the generator matrix, resulting in potentially many parameters with consequent storage/ computational issues.
3 Unbiased sampling for pMJPs 3.1 Efficient Gibbs sampling for finite state pMJPs The special structure of pMJP systems implies considerable inferential savings over the generic Gibbs sampler [6].In particular, the functional form of the kinetic law associated with the ith reaction, f i (s) = θ i ρ i (s), suggests a different conjugate prior for the parameters θ i , which greatly simplifies the parameter sampling steps within the Gibbs sampler.
Let (S, T ) be a full trajectory sampled from the uniformised conditional posterior in a Gibbs step, where S = (s 0 , s 1 , . . ., s K ) is the sequence of states at times T = (t 0 , t 1 , . . ., t K ).Let u k denote the reaction at time t k+1 , as inferred1 from inspection of s k and s k+1 .From Section 2.1, we know that the total rate of exiting state s k is r k = R i=1 θ i ρ i (s k ).Since the waiting time between jumps is exponentially distributed in a MJP, this gives The probability of the next state being s k+1 is . The total likelihood is then Let each parameter be Gamma-distributed a priori : We then have: Therefore, conditioned on the trace, the parameters are again Gamma-distributed with shape a i + N i and rate b i + K−1 k=0 ∆t k ρ i (s k ), where N i is the number of times the ith reaction type is observed in the trace.Hence, we have exact Gibbs updates for the kinetic parameters; notice that, since we have a single parameter for each reaction, the number of parameters to be sampled is often orders of magnitude lower than the number of parameters sampled in [6] (one per possible state transition), yielding computational and storage savings.

Unbounded state-spaces
Many pMJPs of practical interest describe open systems with infinite state-spaces, which are not amenable to uniformisation.A plausible solution would be to truncate the system, possibly using methods such as in [11] to quantify the error.However, any such bound would be dependent on the unknown parameters, and in order to achieve acceptable performance we may need to still retain very large state spaces.An alternative approach may be to introduce random truncations in such a way as to obtain an unbiased estimator of the likelihood, which can be used in a pseudo-marginal MCMC scheme [12,13].We describe here two algorithms based on random truncations, a simple Metropolis-Hastings (M-H) sampler directly targeting the marginal likelihood, and a Metropolized auxiliary variable Gibbs sampler.

Expanding the likelihood
We start by describing a formulation of the likelihood in the pMJP setting as an infinite series.The basic idea is to decompose the space of process trajectories into a nested sum over subspaces of trajectories which differ by at most N from the observations.We can then define a generator matrix on each of these finite state-space systems and compute transient probabilities using (3).We now explicitly define the terms in this expansion of the likelihood.For simplicity, we focus on deriving the likelihood for a single, noiseless observation (t , s ) in a one-dimensional process, assuming the state at time 0 is known to be s ∈ N. Due to the Markovian nature of the process, the actual likelihood will be given by a product of such terms.If we write s u = max(s, s ), we have: The notation s 0:t indicates all values of the process in the time interval [0, t] and is used here as follows: max (s 0:t ) = N means that the maximum value of the process in the interval [0, t] is N .Similarly, max (s 0:t ) ≤ N means that the process does not exceed the value N during [0, t].
Note that the constraint on the maximum of s 0:t − s u does not simply define a state-space, but constrains us to consider only those trajectories that actually achieve a "dispersal" of N .If we define then each term of the series can be decomposed as: These sub-terms are now the transient probability for a finite state-space pMJP, and can be computed using Equation 3. Any number of them are computable but, naturally, the whole sum cannot be computed in finite time.It can, however, be estimated in an unbiased way.

Random truncations
Assume we wish to estimate an infinite sum where each term f N is computable.One way of approximating the sum is to pick a single term f k , where k is chosen from any discrete distribution with mass p 0 , p 1 , . . . .We can immediately see that and is therefore an unbiased estimator of the infinite sum.An issue with this approach is that, depending on the choice of distribution p i , the variance of f might be very large, even infinite.
A reduced variance estimator can be obtained by approximating f with a partial sum up to order N , weighted appropriately.The number of terms is chosen randomly: at every term j, a random choice is made: there is a probability q j of stopping the sum, otherwise we continue to form iteratively the partial sum f = j N =0 f N p N , where p N = N −1 j=1 (1 − q j ).This scheme, imaginatively termed Russian Roulette sampling [7], can also be shown to yield an unbiased estimator of f .

Metropolis-Hastings sampling
Applying this random truncation strategy to the expansion in ( 5) produces an unbiased estimator.Such estimates can be obtained for every interval between successive observations; since they are independent, their product will be an unbiased estimate of the likelihood under all the observations.Note that each summand in ( 5) is a probability, and is therefore non-negative.Thus, we avoid the problems of possibly negative estimators; this positivity is important, as non-positive estimators may result in a large or infinite variance.It is worth remarking that the term for N = 0 corresponds to a space that includes the observations at both ends of the time interval, and hence will already include a significant contribution of probability mass towards the likelihood.
The same approach is easily extended to higher dimensions, where the states are vector of integers, by adapting the notation: max (s 0:t ) ≤ N means that the value in any dimension does not exceed N in the given interval, whereas max (s 0:t ) = N now means that a value of N is not exceeded in any dimension during [0, t], and that it is achieved in at least one dimension.
This procedure directly gives rise to a pseudo-marginal M-H algorithm, where the likelihood term is approximated by the unbiased estimate obtained as described above.We refer to this as Algorithm 1 and examine its performance in the next section.
For our purposes, we choose a q n sequence such that the probability of accepting a term decreases geometrically; specifically, we use q n = 1 − a(1 − q n−1 ), with q 0 = 0 and a = 0.95.We note that, since all terms in the series are non-negative and tend to 0, we can make use of a result from [7] to show that the variance of the estimator is finite.We show an empirical analysis of the variance in Section 4.1 that validates our choice of q n and indicates that performance is robust with respect to the choice of the particular stopping distribution.

Modified Gibbs sampling
An alternative approach is to incorporate the truncation in the Gibbs sampler described in Section 3.1.The difficulty is that there is no direct way to sample trajectories without a bound on the state-space, as the uniformisation sampler requires a finite number of states.To work around this limitation, we propose to sample a truncation point, then draw a trajectory and parameters for this state-space as in Section 3.1.Since we are no longer sampling from the true conditional posterior over trajectories, but rather are also conditioning on the chosen truncation, we are no longer able to accept every trajectory and parameter sample drawn.Instead, we must introduce an acceptance ratio that will ensure we are sampling from an unbiased estimate of the true conditional posterior.We refer to this as Algorithm 2 ; the following is a summary of the procedure to form a new sample (θ t+1 , S t+1 ) from the current state (θ t , S t ) of the chain, given a set of observations O: 1. Sample θ * | S t , as detailed above.
4. With probability min(α, 1), accept the new sample and set (θ t+1 , S t+1 ) = (θ * , S * ); otherwise, set (θ t+1 , S t+1 ) = (θ t , S t ) Note that the analysis from Section 3.1 giving the conditional posterior of the parameters (Equation ( 4)) still holds and is not affected by the truncation.Step 1 is therefore performed following (4).In Step 2a, we follow the Rusian Roulette methodology as in Section 3.2.2 and take m * to be the number of terms before the truncation stops.In the scheme used in our experiments, the probability of taking an additional term follows a geometric distribution, as with the previous algorithm.Based on this truncation point m * , we can define a state-space where y * = (y 1 , . . ., y M ) is a vector of the maximum values observed in each dimension.The method of Section 3.1 can then be used to sample a trajectory (Step 2b) in this finite state-space.
Steps 3a and 3b involve the computation of probabilities which can be performed via the forward-backward algorithm on the appropriate state-spaces.So far in this paper, the algorithm has been used to sample a new path from the process, but it can easily be adapted to calculate the probability of a given path, as shown in the algorithm outline below.
In the following, we assume we have N observations y i at time points t i , i = 1, . . ., N .For a finite state-space S, we denote with S k the k-th state of the space, according to some arbitrary order.The forward and backward messages are vectors of size |S|, and there is one such message for each observed time point.a (i) denotes the forward message at the i-th time point t i ; its k-th element is a ) that is, the joint probability of the observations prior to t i and the state at t i being S k .Similarly, the backward messages b (i) has elements: and so the probability of the observed time-series can be computed from the b (i) .This is a slightly different than the usual formulation of the forward-backward algorithm, and necessitates

8:
Find index k of observation y i in S 9: 10: end for 11: return p the computation of the forward messages a (i) first.The messages can be computed recursively as shown in [6].
These probabilities computed this way are then used in the acceptance ratio α (Step 4).As noted above, the acceptance step is necessary because we are not proposing trajectories from the exact conditional posterior.Instead, the truncation we impose gives an estimate of the correct proposal distribution p(S * | θ * , O), and the ratio compensates for this estimate.Note that, if we could draw trajectories from the whole state-space without truncating it, the terms in α would cancel out, giving standard Gibbs sampling with acceptance rate of 1.
It is important to observe that this auxiliary variable Gibbs sampler actually targets the joint posterior distribution of parameters and trajectories.As such, it provides richer information than the M-H sampler (which directly targets the parameter posterior), but may be less effective if one is solely interested in parameter inference.The performance can also be affected by computational factors, particularly the costs of drawing sample trajectories (which was not needed in Algorithm 1, where we compute the likelihood by matrix exponentiation).In general, such costs will be model-and data-dependent, so that some initial exploration may be advisable before deciding which algorithm to use.

Results
This section describes the experimental validation of our approach.The experiments were performed on MATLAB implementations of the algorithms described in the previous section 2 .The M-H proposals for Algorithm 1 were Gaussian, with variances tuned using trial runs.In the same algorithm, matrix exponentiation was performed using the method of [9], with the code that the authors have made available.Unless otherwise noted, the Russian Roulette truncation used in the experiments was chosen so as to yield 5.6 terms on average.

Variance of the estimator
Before showing how our algorithms perform against the state of the art, we present empirical evidence that our Russian Roulette-style truncation approach produces estimators with low a = 0.95 a = 0.75 a = 0.2 (5.6 terms) ( variance, an issue that has recently received attention in pseudo-marginal methods [14,15].In order to achieve estimators of low variance, the tails of the distribution of the number of terms taken must match those of the sequence being approximated [16] (or the estimator is likely to ignore significant terms).To our knowledge, there are no established results on the behaviour of transient probabilities in general pMJPs as the state-space grows.Our approach is to use a geometric truncation distribution, which is well known ( [17]) to arise as a steadystate distribution of simple pMJPs such as queueing systems, and might thus be a plausible candidate distribution.Our focus in this section is to provide an empirical evaluation of our method.Additionally, we show that the estimator is robust to the choice of the particular stopping distribution q n used in the truncation scheme.To verify this, we considered three different q n sequences, applied to the predator-prey model described in Section 4.2.For clarity, we write qn ≡ 1 − q n , the probability of continuing at term n.All schemes were of the form qn = aq n−1 with q0 = 1 and a ∈ {0.95, 0.75, 0.2}, respectively yielding 5.6, 2.4 and 1.2 terms on average.For each scheme, we calculated 1000 estimates of the transition probabilities between observations, obtaining estimates of the log-likelihood and computing its mean and variance.This was repeated for 10 different parameterizations of the model.It can be seen (Table 1) that the variance of the estimator (measured as the coefficient of variation of the log-likelihood, due to small values) is consistently low.This validates our approach and indicates that the stopping distribution does not critically affect performance and therefore does not require fine-tuning.An intuitive explanation for this comes from remembering that the "base" space (corresponding to the first term in the expansion) comprises all states between consecutive observations.Often, this is large enough that there is a substantial probability of the process remaining within or around it.Hence, even with a few terms, we are capturing a large part of the probability mass, and obtaining good estimates.We expect our estimator to have low variance if the process does not change radically between the observation times.It is possible, however, to find situations where the truncation strategy needs many terms in order to yield good performance.This is more likely to occur if the process is very sparsely observed, or if it is highly volatile.In both cases, the observations may not be very indicative about the behaviour of the process during the interval under consideration, therefore only taking few terms may produce inaccurate estimates of the true likelihood.This situation is also likelier when high counts are involved, in To illustrate this, we considered the example of a birth-death process involving a single species, X, with a constant birth rate of 150 and a death rate of X. From an initial value of X = 10, we simulated the system and used the values at 5 time points (Figure 2a).The three truncation schemes described above did not yield accurate estimates, even when taking 5 terms on average.With a stopping scheme qn = 0.99q n−1 (corresponding to 12.5 terms on average), we were able to get good estimates of the true probabilities.The more aggressive truncation schemes display higher variance and could cause problems when their estimates are used in Algorithm 1: when taking 5.6 terms on average, the variance causes the sampling chain to "stick", as seen in Figure 2b.
As a way of improving the behaviour of the sampler, we examined the use of the so-called Monte Carlo within Metropolis (MCWM) pseudomarginal variant [13], in which the estimate of the likelihood of the current state of the chain is recomputed at every step.This can potentially alleviate the "sticking" problem and lead to better mixing, but at the cost of making the resulting chain sample from an approximation instead of the true posterior.Experiments on the predator-prey model of Section 4.2 showed that there was no noticeable improvement in either the number of steps needed to reach convergence or the acceptance rate when using MCWM.This, in addition to the bias introduced and the additional computational burden from re-estimating the likelihood, leads us to believe that in this case there is no benefit from using MCWM.
To further study of the impact of the choice of truncation distribution, we examined how it affects convergence.We tried ten different stopping distributions q n of the form described above, chosen so that they produce 1, 2, . . ., 10 terms on average.For each of them, we measured the steps required for convergence, as described in the next section.Overall, we found that taking more terms generally leads to faster convergence (Figure 3).This indicates that the variance of the estimates decreases when taking more terms.

Benchmark data sets
We now assess the performance the two algorithms described in the previous section as well as the Gibbs sampler based on uniformisation (Section 3.1).We could not run the original Gibbs sampler of [6] as the high number of parameters (one per state) swiftly led to storage problems.We first compared the performance of the three methods on two widely used pMJP models: Lotka-Volterra (LV) model This predator-prey system involves four types of reactions, representing the birth and death of each species, and is a classic model in ecology and biochemistry.Truncated LV processes have been studied in previous work ( [2], [18]), making it an attractive candidate for evaluating our approach.
We start from an initial state of 7 predators and 20 prey.When a finite state-space is required, we impose a maximum count of 100 for each species, as in previous work.
SIR epidemic model A commonly-used model of disease spreading (see e.g.[19]), where the state comprises three kinds of individuals: S(usceptible), I(nfected) and R(ecovered).We examine two variants of the model, a finite version where the total population is constant: and an infinite state variant where new individuals can join the S population with unknown arrival rate: ∅ → S at rate θ 3 The initial state in both cases is (S, I, R) = (10, 5, 0).For the finite-state version, this gives a state-space of 121 states.For the infinite case, we chose a truncation with upper limit (28,33,33), corresponding to 18 new arrivals in the system.Experiments were performed on a 24-core Xeon E5-2680 2.5GHz, to accommodate the increased memory requirements of some cases.
of arrivals in a time interval of duration T is Poisson-distributed, with mean θ 3 T .We used the final observation time and the prior mean of θ 3 , and chose the 95-percentile of the distribution governing the new arrivals.In broad terms, this means our truncation will accommodate new arrivals with 95% probability.Table 2 summarises our evaluation results across the models considered; the metrics we use are total computational time for 5000 samples, mean relative error in parameter estimates (using the posterior mean as a point estimate), Effective Sample Size (ESS) per minute of computation, and number of iterations to convergence, defined as Potential Scale Reduction Factor (PSRF) < 1.1 [20].
Results on the LV model show that methods based on random truncations achieve very considerable improvements in performance compared to the Gibbs sampler (where the state space was truncated at a maximum number of 100 individuals per species).In particular, Algorithm 2 shows excellent behaviour in most aspects, with a high ESS suggesting it is a more efficient sampler.The running time of Algorithm 2 is comparable to that reported for a variational mean field approximation in [2], and its rapid convergence time suggests that this is a very competitive algorithm in practice.Sample results from Algorithm 2 are presented in Figure 4a for the reaction parameters, and in Figure 4b for the state of the process itself.Algorithm 1, while still computationally feasible, requires a long time to converge, reflecting potential difficulties in choosing effective proposal distributions (a problem naturally bypassed by Algorithm 2).The simple Gibbs algorithm is much slower than the other two, undoubtedly owing to its large state-space of 10000 states and very high memory requirements during the FFBS algorithm.Note that the impact of the (necessarily) large truncation is twofold.Firstly, the large state-space directly affects the running time of the FFBS algorithm, whose complexity is quadratic in the number of states.Secondly, since the rates in this model are increasing functions, having states with high counts means the generator matrix has high diagonal entries (exit rates).This, in turn, requires choosing a high exit rate for uniformisation, leading to long paths with many self-jumps, and ultimately further slowing down the FFBS step.The results for this model clearly show the usefulness of the random truncation approach compared to using Results on the SIR model show that, in the finite state space case, the Gibbs sampler of Section 3.1 is highly efficient and by some way the best algorithm.This is unsurprising, as truncations incur additional computational overheads which are not needed for such a small state space.The picture is completely different for the infinite SIR model.In this case, the M-H sampler clearly seems to be the best algorithm, achieving very fast convergence and outperforming the other two.For parameter values within the prior range, the infinite SIR model exhibits fast dynamics which lead to very long uniformised trajectories, considerably increasing the computational costs of sampling trajectories via the FFBS algorithm.The problem is further compounded for the simple Gibbs sampler algorithm of Section 3.1.Even with the truncation described above, there are 32594 states, resulting in very severe computational and storage costs.

Genetic toggle switch
As a real application of our approach, we consider a model of a synthetic biological circuit describing the interaction between two genes (G1 and G2) and the proteins they encode (P1 and P2).Each protein acts as a repressor for the other gene, inhibiting its expression.This leads to a bistable behaviour, switching between a state with high P1 and low P2, and one with low P1 and high P2 (hence the name toggle-switch).The interactions are encoded as eight chemical reactions: where r is a constant assumed known.This system was engineered in vivo in one of the pioneering studies in synthetic biology [21] and has been further studied in [22].Statistical inference is increasingly being recognised as a crucial bottleneck in synthetic biology: while genome engineering technologies enable researchers to reliably synthesise circuits with a desired structure, predicting the dynamic behaviour of a circuit requires knowledge of the kinetic parameters of the system once it is implanted in the cell, which cannot be directly measured.As synthetic biology is intrinsically at the single cell level, inference techniques for stochastic models have the potential to be of great aid in the rational design of synthetic biology circuits.
Following [22], we model the system using a binary state for each gene and discrete levels for the proteins.The genes can be active or inactive, with protein being produced only in the former case.Each gene can be modelled with a telegraph process: an inactive gene becomes active at a constant rate, and an active one becomes inactive at a rate depending on the level of its repressor.When a gene is active, the level of its product follows a birth-death process; that is, proteins are produced at a constant rate and degrade at mass-action rates.We use a single production reaction for each protein to abstract various underlying mechanisms, including transcription and translation.The model comprises eight types of reaction; note that the requirements of our method on the form of the kinetic laws (Section 3.1) are flexible enough to accommodate the deactivation dynamics used here, even though they are not mass-action.We simulated the system to produce behaviour similar to the simulated traces in [22].We kept 20 time points of measurements, which varied between 0 and 24 for each observed protein.
We used Algorithm 2 to infer the joint posterior distribution of the eight parameters and state trajectories in this system.Our results indicate that the likelihood is relatively insensitive to the parameters governing the activation and deactivation of the two genes.This is a reasonable result, since we do not observe the state of the genes but only the levels of the two protein products.Therefore, the effect of the switching parameters is seen only indirectly through the switching events, which are rare in the data.In contrast, the protein expression and degradation rates have sharp posteriors which capture interesting correlations between the parametersfor instance, we observe a strong correlation between the production and degradation rate of each protein, as perhaps expected given the similarity to a birth-death process.Figure 5 shows parameter posteriors and convergence statistics for one such experiment, showcasing the good behaviour of the algorithm.

Related work
Parameter inference in pMJPs has been the subject of previous work, with a significant body of literature focusing on continuous approximations to the process, in order to work around the complexities entailed by the stochastic dynamics.In general, such approximations are more accurate when the populations involved are high, and their accuracy degrades for lower populations as the impact of discrete stochastic behaviour becomes more pronounced.Two general classes of methods have been proposed to this end.The first involves approximating a pMJP with a diffusion process, as in [23], and using the resulting stochastic differential equations to calculate the likelihood.The second approach uses van Kampen's Linear Noise Approximation [24], which assumes that the marginal distribution of the approximating process at any time is Gaussian.Under this assumption, ordinary differential equations for the mean and covariance can be derived as in [25,26,27] and used to compute the likelihood as part of an inference scheme.In contrast to these methods, our suggested approach is expected to be more accurate for smaller populations, as it maintains the stochastic dynamics.This makes it particularly useful for a range of systems which are large enough that a direct solution is inefficient, but not as large as to be accurately represented with continuous dynamics.In addition to MCMC-based approaches, like ours, particle methods have also been proposed for use with pMJPs, either with the exact dynamics [5,4] or with continuous approximations such as the ones mentioned above.However, they do require more user choices (e.g.number of particles) and can also incur heavy computational overheads for large models or state spaces.For infinite MJPs, in particular, the transition kernel is not available explicitly, making particle methods non-trivial and intrinsically expensive.Variational methods have been developed in [2], and can offer computational savings; however, the work in [2] only performed state inference, providing point estimates for parameters.Furthermore, the error introduced by the variational approximation is often difficult to quantify.
Recent work has made use of random truncations in different contexts: Strathmann et al. [28] propose using a Russian Roulette-style approach in large data scenarios where computing the likelihood from all data points is impractical, while Filippone & Engler [8] exploit the methodology to perform efficient inference for Gaussian processes.More generally, the construction of unbiased estimators has been the subject of theoretical and practical analysis.McLeish [29] and Rhee & Glynn [16] examine the use of a method similar to Russian Roulette for obtaining un-biased estimates from biased ones.Agapiou et al. [30] consider ways of debiasing the estimates obtained by MCMC methods, particularly focusing on infinite spaces.Jacob & Thiery [31] examine the theoretical existence of estimators that are both unbiased and guaranteed to be non-negative under different generation schemes.

Conclusions
MJPs are common models in many branches of science, yet they still present fundamental statistical challenges.In this paper, we have proposed a novel MCMC framework for asymptotically exact inference for pMJPs, an important class of MJPs widely used in chemistry and systems biology.We remark that, while our focus is on biological applications, models with exactly the same structure are employed in many other fields, from epidemiology to ecology to performance modelling.Our random truncations pseudo-marginal approach enables a principled treatment of systems with potentially unbounded state-spaces.Interestingly, our results show that random truncations can also bring computational benefits over the naive alternative of bounding the state-space ab initio, as done in [6].Intuitively, this is because choosing a truncation which guarantees a certain error bound usually requires still retaining a large state-space, while our random truncation method generally samples from much smaller systems.The two truncationbased algorithms we consider here appear to perform best in different kinds of systems, and so neither can be said to be clearly superior in the general case.
The performance of our proposed methods may vary with the system in question.As the number of species grows, the state-space grows exponentially larger, leading to increased computational overheads for our method (as for many other methods).While this may be a serious limitation for large models, it is worth pointing out that many practical applications of pMJPs describe systems with a small number of species, where our method's performance should not be affected.High counts of the species involved also result in larger state-spaces, leading to heavier computations, particularly for Algorithm 1.For Algorithm 2, the rates of the reactions can also have an impact: very fast reactions lead to a fine time-discretization and slower computations in the forward-backward step.Our methods perform best when particle numbers are not exceedingly large (otherwise, a continuous approximation would be both accurate and more efficient) and when observations are relatively dense or, equivalently, the process is not too volatile (or a truncation with many terms would be required for a good result).
Pseudo-marginal methods based on random truncations are relatively new to statistics and machine learning [32,7,8]: to our knowledge, this is the first time that they are employed as a way of truncating an unbounded state space, and we think this idea may be appealing in other scenarios where unbounded state spaces are normal, such as non-parametric Bayesian methods.Compared to pseudo-marginal methods based on importance sampling [33,5], random truncations offer several advantages: there is no need to choose a proposal distribution, a notoriously difficult problem in high dimensions.The choice of the truncating distribution, which controls the variance of the estimator, can in general be aided by some initial exploratory runs with different truncation distributions with different expected numbers of retained terms.Recent work on improving the behaviour of pseudo-marginal MCMC methods [34] may also be relevant to enhancing the performance of our proposed method.

Figure 1 :
Figure 1: State-space of an example system.Arrows indicate transitions between states; the bolded transitions are "instances" of the same reaction type, which updates the state by (−1, −1, 1, 0) and occurs with rate θ 1 s 1 s 2 , where s 1 , s 2 are the first and second components of the state.

2 .
Sample S * | O, θ * : a) Choose a truncation point m * , defining a finite state-space.b) Run the FFBS algorithm to draw S * .3. Calculate the acceptance ratio α: a) Compute p (t+1) (S * | θ * , O) and p (t) (S * | θ * , O), the conditional posterior probabilities of the new trajectory under the new and old truncations.b) Compute p (t+1) (S t | θ * , O) and p (t) (S t | θ * , O), the conditional posterior probabilities of the old trajectory under the new and old truncations.

Figure 2 :
Figure 2: (a) Full trace (continuous line) and observations (dots) used in the birth-death process example; (b) Parameter samples for the birth rate using Algorithm 1, illustrating undesirable "sticking" behaviour when taking 5.6 terms on average

Figure 3 :
Figure 3: Steps until convergence for different stopping distributions (results shown for one parameter of the LV model).

Figure 4 :
Figure 4: (a) Posterior marginals and pairwise correlations for the parameters of the LV model, from 5000 samples using Algorithm 2 (true values marked by red line, prior shown in dashed line); (b) Samples of the posterior process: prey (top), predators (bottom).Dots indicate the observations.

Table 1 :
Coefficient of variation for the log-likelihood, estimated from 1000 samples for the LV model, under three truncation schemes (varying α) and ten parameter configurations (Section 4.1)

Table 2 :
To see this, note that the number Performance of the various algorithms tested.Metrics are averaged over all parameters.