Stochastic Rate Parameter Inference using the Cross-Entropy Method

. We present a new, eﬃcient algorithm for inferring, from time-series data or high-throughput data ( e.g. , ﬂow cytometry), stochastic rate parameters for chemical reaction network models. Our algorithm combines the Gillespie stochastic simulation algorithm (including approximate variants such as tau-leaping) with the cross-entropy method. Also, it can work with incomplete datasets missing some model species, and with multiple datasets originating from experiment repetitions. We evaluate our algorithm on a number of challenging case studies, including bistable systems (Schl¨ogl’s and toggle switch) and experimental data.


Introduction
In this paper we are concerned with the inference of biochemical reaction stochastic rate parameters from data. Reactions are discrete events that can occur randomly at any time with a rate dependent on the chemical kinetics [40]. It has recently become clear that stochasticity can produce dynamics profoundly different from the corresponding deterministic models. This is the case, e.g., in genetic systems where key species are present in small numbers or where key reactions occur at a low rate [23], resulting in transient, stochastic bursts of activity [4,24]. The standard model for such systems is the Markov jump process popularised by Gillespie [13,14]. Given a collection of reactions modelling a biological system and time-course data, the stochastic parameter inference problem is to find parameter values for which the Gillespie model's temporal behaviour is most consistent with the data. This is a very difficult problem, much harder, both theoretically and computationally, than the corresponding problem for deterministic kinetics -see, e.g., [41,Section 1.3]. One simple reason is because stochastic models can behave widely differently from the same initial conditions. (The related issue of parameter non-identifiability is outside the scope of this paper, but the interested reader can find more in, e.g., [37,38] and references therein.) Additionally, experimental data is usually sparse and most often involves only a limited subset of a model's species; and the system under study might exhibit multimodal behaviour. Also, data might not directly relate to a species, it might be measured in arbitrary units (e.g., fluorescence measurements), thus requiring the estimation of scaling factors, or it might be described by frequency distributions (e.g., high-throughput data such as flow cytometry). Stochastic parameter inference is thus a fundamental and challenging problem in systems biology, and it is crucial for obtaining validated and predictive models.
In this paper we propose an approach for the parameter inference problem that combines Gillespie's Stochastic Simulation Algorithm (SSA) with the crossentropy (CE) method [27]. The CE method has been successfully used in optimisation, rare-event probability estimation, and other domains [29]. For parameter inference, Daigle et al. [8] combined a stochastic Expectation-Maximisation (EM) algorithm with a modified cross-entropy method. We instead develop the cross-entropy method in its own right, discarding the costly EM algorithm steps. We also show that our approach can utilise approximate, faster SSA variants such as tau-leaping [15]. Summarising, the main contributions of this paper are: we present a new, cross entropy-based algorithm for the stochastic parameter inference problem that outperforms previous, state-of-the-art approaches; our algorithm can work with multiple, incomplete, and distribution datasets; we show that tau-leaping can be used within our technique; we provide a thorough evaluation of our algorithm on a number of challenging case studies, including bistable systems (Schlögl model and toggle switch) and experimental data.

Background
Notation Given a system with n chemical species, the state of the system at time t is represented by the vector x(t) = (x 1 (t), . . . , x n (t)), where x i represents the number of molecules of the ith species, S i , for i ∈ {1, . . . , n}. A well-mixed system within a fixed volume at a constant temperature can be modelled by a continuous-time Markov chain (CTMC) [13,14]. The CTMC state changes are triggered by the (probabilistic) occurrences of chemical reactions. Given m chemical reactions, let R j denote the jth reaction of type: where the vectors ν − j and ν + j represent the stoichiometries of the underlying chemical kinetics for the reactants and products, respectively. Let ν j ∈ Z n denote the overall (non-zero) state-change vector for the jth reaction type, specifically ν j = ν + j − ν − j , for j ∈ {1, . . . , m}. Assuming mass action kinetics (and omitting time dependency for x(t)), the reaction R j leads to the propensity [41]: where θ = (θ 1 , . . . , θ m ) is the vector of rate constants. In general, θ is unknown and must be estimated from experimental data -that is the aim of our work. Our algorithm can work with propensity functions factorisable as in (1), but it is not restricted to mass action kinetics (i.e., the functions α j 's can be arbitrary).

Cross-Entropy Method For Optimisation
The Kullback-Leibler divergence [20] or cross-entropy (CE) between two probability densities g and h is: where X is a random variable with density g, and E g is expectation w.r.t. g. Note that D(g, h) ≥ 0 with equality iff g = h (almost everywhere). (However, The CE has been successfully adopted for a wide range of hard problems, including rare event simulation for biological systems [7], discrete, and continuous optimisation [29,28]. Consider the minimisation of an objective function J over a space χ (assuming such minimum exists), γ * = min x∈χ J(x). The CE method performs a Monte Carlo search over a parametric family of densities {f (·; v), v ∈ V} on χ that contains as a limit the (degenerate) Dirac density that puts its entire mass on a value x * ∈ χ such that J(x * ) = γ * -the so called optimal density. The key idea is to use the CE to measure how far a candidate density is from the optimal density. In particular, the method solves a sequence of optimisation problems of the type below for different values of γ by minimising the CE between a putative optimal density g * ( where I is the indicator function and X has density f (·; u) for u ∈ V. The definition of density g * above essentially means that, for a given γ, we only consider densities that are positive only for arguments x for which J(x) γ. The generic CE method involves a 2-step procedure which alternates solving (2) for a candidate g * with adaptively updating γ. In practice, problem (2) is solved approximately via a Monte Carlo adaptation, i.e., by taking sample averages as estimators for E u . The output of the CE method is a sequence of putative optimal densities identified by their parametersv 0 ,v 1 , . . . ,v * , and performance scoresγ 0 ,γ 1 , . . . ,γ * , which improve with probability 1. For our problem, a key benefit of the CE method is that an analytic solution for (2) can be found when {f (·; v), v ∈ V} is the exponential family of distributions. (More details in [29].) Cross-Entropy Method for the SSA We denote by r j the number of firings of the jth reaction channel, τ i the time between the ith and (i − 1)th reaction, and τ r+1 the final time interval at the end of the simulation in which no reaction occurs. It can be shown that an exact SSA trajectory z = (x 0 , . . . , x r ), where r is the total number of reaction events r = m j=1 r j , belongs to the exponential family of distributions [41] -whose optimal CE parameter can be found analytically. Daigle et al. [8] showed that the solution of (2) for the SSA likelihood yields the following Monte Carlo estimate of the optimal CE parameter v * j , where K is the number of SSA trajectories of the Monte Carlo approximation of (2), z k is the kth trajectory, r jk and τ ik are as before but w.r.t. the kth trajectory, x i,k denotes the state after the (i − 1)th reaction in the kth trajectory, and the fraction is defined only when the denominator is nonzero (i.e., there is at least one trajectory z k for which J(z k ) ≤ γ -so-called elite samples). Note for γ = 0, the CE estimator (3) coincides with the maximum likelihood estimator (MLE) for θ j over the same trajectory. Following [7] and [26,Section 5.3.4], it is easy to show that a Monte Carlo estimator of the covariance matrix of the optimal parameter estimators (3) is given (written in operator style) by the matrix: where E is the set of elite samples, K E = |E|, the operator ∂ 2 ∂θ 2 returns a m×m matrix, ∂ ∂θ returns an m-dimensional vector (m×1 matrix), and ∂ ∂θ T denotes matrix transpose. From Eq. (4) parameter variance estimates can be readily derived. However, a more numerically stable option is to approximate the variance of the jth parameter estimator using the sample variancê

Methods
In this section, we present our stochastic rate parameter inference with crossentropy (SPICE) algorithm.
Overview To efficiently sample the parameter space, we treat each stochastic rate parameter as being log-normally distributed, i.e., θ j ∼ Lognormal(ω j , var(ω j )), where ω j = log(θ j ) is the log-transformed parameter calculated analagously to (3) and (4), respectively. For the initial iteration, we sample the parameter vector θ from the (log-transformed) desired parameter search space [θ max ] using a Sobol low-discrepancy sequence [33] to ensure adequate coverage. Subsequent iterations then generate a sequence of distribution parameters {(γ n , θ n , Σ n )} which aim to converge to the optimal parameters as follows: 1. Updating of γ n : Generate K sample trajectories using the SSA, z 1 , . . . , z K , from the model f (·; θ (n−1) ) with θ (n−1) sampled from the lognormal distribution, and sort them in order of their performances J 1 ≤ · · · ≤ J K (see Eqs. (7) and (6) for the actual definition of the performance, or score, function we adopt). For a fixed small ρ, say ρ = 10 −2 , letγ n be defined as the ρth quantile of J(z), i.e.,γ n = J ( ρK ) . 2. Updating of θ n : Using the estimated levelγ n , use the same K sample trajectories z 1 , . . . , z K to deriveθ n andσ 2 n from the solution of Eqs. (3) and (4). In case of numerical issues (or undersampling) in our implementation we switch to (5) for updating the variance.
The SPICE algorithm's pseudocode is shown in Algorithm 1. This 2-step approach provides a simple iterative scheme which converges asymptotically to the optimal density. A reasonable termination criteria to take would be to stop if γ n γ n−1 . . . for a fixed number of iterations. In general, more samples are required as the mean and variance of the estimates approach their optima.

Adaptive Sampling
We adaptively update the number of samples K n taken at each iteration. The reasoning is to ensure the parameter estimates improve with statistical significance at each step. Thus, our method allows the algorithm to make faster evaluations early on in the iterative process, and concentrate simulation time on later iterations, where it becomes increasingly hard to distinguish significant improvements of the estimated parameters. We update our parameters based on a fixed number of elite samples, K E , satisfying J(z) ≤ γ. The performance of the 'best' elite sample is denoted J * n , while the performance of the 'worst' elite sample -previously given by the ρth quantile of J(z) -isγ n . The quantile parameter ρ is adaptively updated each iteration as ρ n = K E /K n , where K E is typically taken to be 1-10% of the base number of samples K 0 . At each iteration, a check is made for improvement in either of the best or worst performing elite samples, i.e., if, J * n < J * n−1 orγ n <γ n−1 , then we can update our parameters and proceed to the next iteration. If no improvement in either values are found, the number of samples K n in the current iteration is increased in increments, up to a maximum K max . If we hit the maximum number of samples K max for c iterations (e.g., c = 3), then this suggests no further significant improvement can be made given the restriction on the number of samples.

Objective Function
The SPICE algorithm has been developed to handle an arbitrary number of datasets. Given N time series datasets, SPICE associates N objective function scores with each simulated trajectory. Each objective value corresponds to the standard sum of L 2 distances of the trajectory across all time points in the respective dataset: where x t = x(t) and y n,t is the datapoint at time t in the nth dataset. To ensure adequate coverage of the data, we choose our elite samples to be the best performing quantile of trajectories for each individual dataset (with scores J n ).
In the absence of temporal correlation within the data (e.g., when measurements between time points are independent or individual cells cannot be tracked as in flow cytometry data), we instead construct an empirical Gaussian mixture model for each time point within the data. Each mixture model at time t is comprised of N multivariate normal distributions, each with a vector of mean values y n,t corresponding to the observed species in the nth dataset, and diagonal covariance matrix σ 2 n corresponding to an error estimate or variance of the measurements on the species. In our experiments we used a 10% standard deviation, as we did not have any information about measurement noise. We then take the objective score function to be proportional to the negative log-likelihood of the simulated trajectory w.r.t. the data: Smoothed Updates We implement the parameter smoothing update formulâ , q∈N + and β∈(0, 1) are smoothing constants, andθ,σ are outputs from the solution of the cross-entropy in equation (2), approximated by (3) and (4), respectively. Parameter smoothing between iterations has three important benefits: (i) the parameter estimates converge to a more stable value, (ii) it reduces the probability of a parameter value tending towards zero within the first few iterations, and (iii) it prevents the sampling distribution from converging too quickly to a degenerate point probability mass at a local minima. Furthermore, [6] provide a proof that the CE method converges to an optimal solution with probability 1 in the case of smoothed updates.
Multiple Shooting and Particle Splitting SPICE can optionally utilise these two techniques for trajectory simulation between time intervals. For multiple shooting we construct a sample trajectory comprised of T intervals matching the time stamps within the data y. Originally [42], each segment from x t−1 to x t was simulated using an ODE model with the initial conditions set to the previous time point of the dataset, i.e., x t−1 = y t−1 . We instead treat the data as being mixture-normally distributed, thus we sample our initial conditions , where the index of the time series n is first uniformly sampled. Using the SSA, each piecewise section of a trajectory belonging to sample k is then simulated with the same parameter vector θ. For particle splitting we adopt a multilevel splitting approach as in [8], and the objective function is calculated after the simulation of each segment from x t−1 to x t . The trajectories z k satisfying J(z k ) ≤γ are then re-sampled with replacement K n times before simulation continues (recall K n is the number of samples in the nth iteration). This process aims at discarding poorly performing trajectories in favour of those 'closest' to the data. This will in turn create an enriched sample, at the cost of introducing an aspect of bias propagation.
Hyperparameters SPICE allows for the inclusion of hyperparameters φ (e.g., scaling constants, and non kinetic-rate parameters), which are sampled (logarithmically) alongside θ. These hyperparameters are updated at each iteration via the standard CE method.
Tau-Leaping With inexact, faster methods such as tau-leaping [15] a degree of accuracy is traded off in favour of computational performance. Thus, we are interested in replacing the SSA with tau-leaping in our SPICE algorithm. The next Proposition shows that with a tau-leaping trajectory we get the same form for the optimal CE estimator as in (3). Proposition 1. The CE solution for the optimal rate parameter over a tauleaping trajectory is the same as that for a standard SSA trajectory.
Proof. We shall use the same notation of Section 2 and further assume a trajectory in which state changes occur at times t l , for l∈{0, 1, . . . , L}. For each given time interval of size τ l of the tau-leaping algorithm, k jl ∈ Z + firings of each reaction channel R j are sampled from a Poisson process with mean λ jl = θ j α j (x t l )τ l . Thus, the probability of firing k jl reactions, in the interval [t l , t l + τ l ), given the initial state x t l is P (k jl |x t l , λ jl ) = exp{−λ jl }(λ jl ) k jl /k jl !, where P (0|x t l , 0) = 1. Therefore, the combined probability across all reaction channels is: Extending for the entire trajectory, the complete likelihood is given by: We can conveniently factorise the likelihood into component likelihoods associated with each reaction channel as L = m j=1 L j , where each component L j is . Expanding λ jl : where r j = L l=0 k jl , i.e., the total number of firings of reaction channel R j . From [29], the solution to (2) can be found by solving: given that the differentiation and expectation operators can be interchanged. Expanding ln L j and simplifying, we get: We can then take the derivative, ∇, with respect to θ j , It is simple to see that the previous entity holds when r j /θ j = L l=0 α j (x t l )τ l , yielding the Monte Carlo estimate, Forward simulate z i,k ←SSA(x, t0, ti, θ k ) 21 if (Method = Splitting) or (i = L) then // depending on type of data, use (6) or (7) 22 Calculate the cost function d k ← J(z k ) 23 if Method = Splitting then 24 Sample with replacement weighted trajectories satisfying d k < γn 25 γn ← ρth quantile of (d1, . . . , dK n ) 26 Computeθ (n) andσ (n) by means of Eqs. (3) and (4) (or (5)), using the elite trajectories satisfying d k < γn (and taking log appropriately in (3) and (4) ) 27 Increment n ← n + 1 28 Adaptively update Kn 29 until convergence detected in {γ1, . . . , γn−1} We utilise our SPICE algorithm on four commonly investigated systems: (i) the Lotka-Volterra predator-prey model, (ii) a Yeast Polarization model, (iii) the bistable Schlögl system, and (iv) the Genetic Toggle Switch. We present results for each system obtained using both the standard SSA and optimised tau-leaping (with an error control parameter of ε = 0.1) to drive our simulations.
For each run of the algorithm we set the sample parameters K E = 10, K min = 1, 000, K max = 20, 000, and set an upper limit on the number of iterations to 250. The smoothing parameters (λ, β, q) were set to (0.7, 0.8, 5) respectively. For our analysis, we define the mean relative error (MRE) between a parameter estimateθ and the truth θ * as MRE(% ERR ) = M −1 M j |θ j − θ * j |/θ * j × 100. All our experiments were performed on a Intel Xeon 2.9GHz Linux system without using multiple cores -all reported CPU times are single-core. SPICE has been implemented in Julia and is open source (https://github.com/pzuliani/SPICE).
For models (i)-(iii), we use synthetic data where the true solution is known, and compare the results of SPICE against some commonly used parameter estimation techniques implemented in COPASI 4.16 [17]. Specifically, we check the performance of SPICE against the genetic algorithm (GA), evolution strategy (ES), evolutionary programming (EP), and particle swarm (PS) implementations. For the ES and EP algorithms we allow 250 generations with a population of 1,000 particles. For the GA, we run 500 generations with 2,000 particles. For the PS, we allow 1,000 iterations with 1,000 particles 1 . For model (iv), the Genetic Toggle Switch, we show results for SPICE using real experimental data.
All statistics presented are based on 100 runs of each algorithm using fixed datasets. For each approach we also compared the performance of using the standard SSA versus tau-leaping, alongside multiple-shooting and particle splitting approaches. However, for the models tested, neither multiple shooting nor particle splitting helped in reducing CPU times or improving the estimates accuracy.

Lotka-Volterra Predator-Prey Model
We implement the standard Lotka-Volterra model below with real parameters (θ 1 , θ 2 , θ 3 ) = (0.5, 0.0025, 0.3), and initial population (X 1 , X 2 ) = (50, 50) We artificially generated 5 datasets each consisting of 40 timepoints using Gillespie's SSA, and performed parameter estimation based on these datasets. For the initial iteration, we placed bounds on the Sobol sequence parameter search space of θ j ∈[1e−6, 10], for j = 1, 2, 3. The minimum, maximum, and average MRE between the true parameters and their estimates across all 100 runs of each algorithm (using the standard SSA) are summarised in Table 1 In the previous Lotka-Volterra predator-prey example, SPICE was provided with the complete data for both species X 1 , X 2 . However, we are also concerned with cases where the data is not fully observed, i.e., when we have latent species. To compare the effects of latent species on the quality of parameter estimates, we ran SPICE again (averaging across 100 runs), this time supplying information about species X 1 alone. The results are presented in Table 1.

Yeast Polarization Model
We artificially generated 5 datasets each consisting of 17 timepoints using Gillespie's SSA, and performed parameter estimation based on these datasets. For the initial iteration, we placed bounds on the parameter search space of θ j ∈[1e−6, 10] for 1 j 7, and θ 8 ∈[1e−6, 100]. The average relative errors between the estimated and the real parameters across 100 runs of the algorithm are summarised in Table 1, along with the corresponding CPU run times. The variability of the estimates obtained using SPICE (and other methods) are shown in Fig 2. Schlögl System We use the Schlögl model [30] with parameters (θ 1 , θ 2 , θ 3 , θ 4 ) = (3e−7, 1e−4, 1e−3, 3.5), and initial population (X, A, B) = (250, 1e5, 2e5). This model is well known to produce bistable dynamics (see Fig. 4). We artificially generated 10 datasets (in order to partially capture a degree of the bistable dynamics) each consisting of 100 timepoints, and performed parameter estimation based on these datasets (also see Fig. 4). For the initial iteration, we placed bounds on the parameter search space of θ 1 ∈[1e−9, 1e−5], θ 2 ∈[1e−6, 0.01], θ 3 ∈[1e−5, 10], θ 4 ∈[0.01, 100]. Unlike the previous models, we explicitly ran the Schlögl System using tau-leaping for all algorithms, due to the computation time being largely infeasible under the same conditions (4.5 hours in SPICE, 48+ hours in COPASI). The MRE of all the estimated parameters, together with CPU times for each algorithm are summarised in Table 1. Box plots of the SPICE algorithm's performance are presented in Fig. 3. Note that the Schlögl system is sensitive to the initial conditions, so even slight perturbations of its parameters can cause the system to fail in producing bimodality.  is comprised of two repressors, and two promoters, often mediated in practice through IPTG 2 and aTc 3 induction. We perform parameter inference based on real high-throughput data (see Fig. 5), implemented upon a simple model (see below) based on [12]. For our model, we define the following reaction propensities: where GFP and mCherry are the two model species (reporter molecules), and the stochastic rate parameters are (θ 1 , . . . , θ 4 ). The data used for parameter inference was obtained through fluorescent flow cytometry in [21], via the GFP and mCherry reporters, and consists 40,731 measurements across 7 timepoints over 6 hours. We look specifically at the case where the switch starts in the low-GFP (high mCherry) state, and switches to the high-GFP (low-mCherry) state over the time course after aTc induction to the cells. The inclusion of real, noisy data requires a degree of additional care as the data needs to be rescaled from arbitrary units (a.u.) to discrete molecular counts. We assume a linear (multiplicative) scale, e.g., such that GFP (a.u.) = φ 5 ×GFP molecules. Furthermore, we can no longer assume all the cells begin at the same state, and we must assume the initial state belongs to a distribution. This introduces extra so-called 'hyperparameters', specifically the GFP molecule count to fluorescent (a.u.) scale factor φ 5 , and the respective mCherry scale factor φ 6 . In addition, the model now contains 4 additional parameters, φ 1 , . . . , φ 4 , which in turn are required to be estimated. Each hyperparameter is initially sampled as before using the low-discrepancy Sobol sequence, and updated using the means and variances of the generated elite samples as per the CE method.

Discussion
We can see from the presented results that our SPICE algorithm performs well on the models studied. For the Lotka-Volterra model the quality of the estimates is always good -there is no relative error larger than 2.1% in Table 1 for SPICE. The CPU times are reasonable in absolute terms (about 20 minutes, single core), and much smaller than those of the methods implemented in COPASI, and with smaller errors. Also, having one unobserved species (X 2 ) in the data does not seem to impact the results very much. In particular, from Table 1 we see that the latent model indeed has higher error than the fully observable model. However, the error is always smaller than 10%, which is acceptable.
The Yeast Polarization model is a more difficult system: we can indeed see from Table 1 that a number of parameter estimates have large relative errors. These are the same 'hard' parameters estimated by MCEM 2 [8] with similar errors. However, in CPU time terms, our SPICE algorithm does much better than MCEM 2 : SPICE can return a quite good estimate (in line with MCEM 2 's) on average in about 18 minutes using the direct method, while MCEM 2 would need about 30 days [8] -a speed-up of 2,400 times. Furthermore, for this model one could use tau-leaping instead of the direct method, gaining a 3x speedup in performance while giving up little on accuracy (the Min., Av., and Max. MRE % ERR were 31.2, 41.5, and 56.3, respectively; Av. CPU time was 303s).
The Schlögl system is another challenging case study, as clearly showed by results of Table 1, which were obtained by utilising tau-leaping (as a matter of fact, for the Schlögl model the average accuracy of SPICE increases with the use of tau-leaping). Our choice was motivated by the large CPU time of the direct method due to the fact that the upper steady state for X in the model has a large molecule number (about 600), which negatively impacts the running time of the direct method samples. The results of Table 1 show that there is no clear winner: the Evolutionary Programming method in COPASI has the smallest runtime, but twice the error achieved by SPICE, which has the best accuracy. As noted before, running the COPASI implementations with larger populations and more iterations did not significantly improve accuracy for the increased cost.
Lastly, the genetic Toggle Switch presents an interesting real-world case study with high-throughput data. The model now comprises four hyperparameters, each of which must be estimated alongside the four kinetic rate constants. In addition, the non-discrete (and noisy) data is no longer known to be generated from a convenient mathematical model. In other terms, there is no guarantee that the model reflects the true underlying biochemical reaction network. Despite these challenges, our SPICE algorithm does a very good job (in little more than an hour of CPU time) in computing parameter estimates for which the model quite closely matches the experimental data -we see in fact from Fig. 5 that the model simulations fall inside the data, with very few exceptions, and the empirical and simulated distributions closely match.
Related Work Techniques for stochastic rate parameter estimation fall into four categories. Early efforts included methods based on MLE: simulated maximum likelihood utilises Monte Carlo simulation and a genetic algorithm to maximise an approximated likelihood [34]. Efforts have been made to incorporate the Expectation-Maximisation (EM) algorithm with the SSA [18]. The stochastic gradient descent explores a Markov Chain Monte Carlo sampler with a Metropolis-Hastings update step [39]. In [25] a hidden Markov model is used for the system state, which is then solved by (approximate) likelihood maximisation. Lastly, a recent work [8] has combined an ascent-based EM algorithm with a modified cross-entropy method. Another category of methodologies include Bayesian inference. In particular, approximate Bayesian computation (ABC) gains an advantage by becoming 'likelihood free', and recent advances in sequential Monte Carlo (SMC) samplers have further improved these methods [32,35]. We note the similarities between ABC(-SMC) approaches and SPICE. Both methods can utilize 'elite' samples to produce better parameter estimates.
A key difference is that ABC(-SMC) uses accepted simulation parameters to construct a posterior distribution, while SPICE utilizes complete trajectory information to compute optimal updates of an underlying parameter distribution. The Bayesian approach presented in [5] can handle partially observed systems, including notions of experimental error. Linear noise approximation techniques have been used alongside Bayesian analysis [19]. A very recent work [36] combines Bayesian analysis with statistical emulation in an attempt at reducing the cost due to the SSA simulations. A third class of methodologies center around the numerical solution of the chemical master equation (CME), which is often intractable for all but the simplest of systems. One approach is to use dynamic state space truncation [3] or finite state projection methods [9] that truncate the CME state space by ignoring the smallest probability states. Another variation is to use a method of moments approximation [10,16] to construct ordinary differential equations (ODEs) describing the time evolution for the mean, variance, etc., of the underlying distribution. Other CME approximations are system size expansion using van Kampen's expansion [11], and solutions of the Fokker-Planck equation [22] using a form of linear noise approximation. Finally, another method [42] treats intervals between time measurements piecewise, and within each interval an ODE approximation is used for the objective function. This method has been recently extended using linear noise approximation [43]. A recent work [1], tailored for high-throughput data, proposes a stochastic parameter inference approach based on the comparison of distributions.

Conclusions
In this paper we have introduced the SPICE algorithm for rate parameter inference in stochastic reaction networks. Our algorithm is based on the cross-entropy method and Gillespie's algorithm, with a number of significant improvements. Key strengths of our algorithm are its ability to use multiple, possibly incomplete datasets (including distribution data), and its (theoretically justified) use of tau-leaping methods for model simulation. We have shown that SPICE works well in practice, in terms of both computational cost and estimate accuracy (which was often the best in the models tested), even on challenging case studies involving bistable systems and real high-throughput data. On a non-trivial case study, SPICE can be orders of magnitude faster than other approaches, while offering comparable accuracy in the estimates.