1 Introduction

In this paper we are concerned with the inference of biochemical reaction stochastic rate parameters from data. Reactions are discrete events that can occur randomly at any time with a rate dependent on the chemical kinetics [40]. It has recently become clear that stochasticity can produce dynamics profoundly different from those of the corresponding deterministic models. This is the case, e.g., in genetic systems where key species are present in small numbers or where key reactions occur at a low rate [23], resulting in transient, stochastic bursts of activity [4, 24]. The standard model for such systems is the Markov jump process popularised by Gillespie [13, 14]. Given a collection of reactions modelling a biological system and time-course data, the stochastic parameter inference problem is to find parameter values for which the Gillespie model’s temporal behaviour is most consistent with the data. This is a very difficult problem, much harder, both theoretically and computationally, than the corresponding problem for deterministic kinetics—see, e.g., [41, Sect. 1.3]. One simple reason is that stochastic models can behave very differently across runs starting from the same initial conditions. (The related issue of parameter non-identifiability is outside the scope of this paper, but the interested reader can find more in, e.g., [37, 38] and references therein.) Additionally, experimental data are usually sparse and most often involve only a limited subset of a model’s species, and the system under study might exhibit multimodal behaviour. Also, data might not directly relate to a species: it might be measured in arbitrary units (e.g., fluorescence measurements), thus requiring the estimation of scaling factors, or it might be given as frequency distributions (e.g., high-throughput data such as flow cytometry). Stochastic parameter inference is thus a fundamental and challenging problem in systems biology, and it is crucial for obtaining validated and predictive models.

In this paper we propose an approach for the parameter inference problem that combines Gillespie’s Stochastic Simulation Algorithm (SSA) with the cross-entropy (CE) method [27]. The CE method has been successfully used in optimisation, rare–event probability estimation, and other domains [29]. For parameter inference, Daigle et al. [8] combined a stochastic Expectation–Maximisation (EM) algorithm with a modified cross-entropy method. We instead develop the cross-entropy method in its own right, discarding the costly EM algorithm steps. We also show that our approach can utilise approximate, faster SSA variants such as tau-leaping [15]. Summarising, the main contributions of this paper are:

  • we present a new, cross-entropy-based algorithm for the stochastic parameter inference problem that outperforms previous, state-of-the-art approaches;

  • our algorithm can work with multiple, incomplete datasets, and with data given as distributions;

  • we show that tau-leaping can be used within our technique;

  • we provide a thorough evaluation of our algorithm on a number of challenging case studies, including bistable systems (Schlögl model and toggle switch) and experimental data.

2 Background

Notation. Given a system with n chemical species, the state of the system at time t is represented by the vector \(\varvec{x}(t) = (x_1(t), \ldots , x_n(t))\), where \(x_i\) represents the number of molecules of the ith species, \(S_i\), for \(i \in \{1,\ldots ,n\}\). A well-mixed system within a fixed volume at a constant temperature can be modelled by a continuous-time Markov chain (CTMC) [13, 14]. The CTMC state changes are triggered by the (probabilistic) occurrences of chemical reactions. Given m chemical reactions, let \(\mathcal {R}_j\) denote the jth reaction of type:

$$ \mathcal {R}_j \quad : \quad \nu _{j,1}^{-}S_1 + \ldots + \nu _{j,n}^{-}S_n \overset{\theta _j}{\rightarrow } \nu _{j,1}^{+}S_1 + \ldots + \nu _{j,n}^{+}S_n, $$

where the vectors \(\varvec{\nu }_j^-\) and \(\varvec{\nu }_j^+\) represent the stoichiometries of the underlying chemical kinetics for the reactants and products, respectively. Let \(\varvec{\nu }_j \in \mathbb {Z}^n\) denote the overall (non-zero) state-change vector for the jth reaction type, specifically \(\varvec{\nu }_j = \varvec{\nu }_j^{+} - \varvec{\nu }_j^{-}\), for \(j \in \{1,\ldots ,m\}\). Assuming mass action kinetics (and omitting time dependency for \(\varvec{x}(t)\)), the reaction \(\mathcal {R}_j\) leads to the propensity [41]:

$$\begin{aligned} h_j(\varvec{x},\varvec{\theta }) = \theta _j\alpha _j(\varvec{x})= \theta _j \prod _{i=1}^n \left( {\begin{array}{c}x_i\\ \nu ^-_{j,i}\end{array}}\right) , \end{aligned}$$
(1)

where \(\varvec{\theta } = (\theta _1,\ldots ,\theta _m)^{\intercal }\) is the vector of rate constants. In general, \(\varvec{\theta }\) is unknown and must be estimated from experimental data—that is the aim of our work. Our algorithm can work with propensity functions factorisable as in (1), but it is not restricted to mass action kinetics (i.e., the functions \(\alpha _j\) can be arbitrary).

Cross-Entropy Method for Optimisation. The Kullback-Leibler divergence [20] or cross-entropy (CE) between two probability densities g and h is:

$$ \mathcal {D}(g,h) = \mathbb {E}_g\left[ \ln \frac{g(\varvec{X})}{h(\varvec{X})}\right] = \int g(\varvec{x})\ln \frac{g(\varvec{x})}{h(\varvec{x})} d\varvec{x} $$

where \(\varvec{X}\) is a random variable with density g, and \(\mathbb {E}_g\) is expectation w.r.t. g. Note that \( \mathcal {D}(g,h) \ge 0\) with equality iff \(g=h\) (almost everywhere). (However, \(\mathcal {D}(g,h)\ne \mathcal {D}(h,g)\).) The CE method has been successfully adopted for a wide range of hard problems, including rare event simulation for biological systems [7], and discrete and continuous optimisation [28, 29]. Consider the minimisation of an objective function J over a space \(\chi \) (assuming such a minimum exists), \(\gamma ^*=\min \limits _{x \in \chi } J(x)\). The CE method performs a Monte Carlo search over a parametric family of densities \(\{f(\cdot ;\varvec{v}),\varvec{v}\in \mathcal {V}\}\) on \(\chi \) that contains as a limit the (degenerate) Dirac density that puts its entire mass on a value \(x^*\in \chi \) such that \(J(x^*) = \gamma ^*\)—the so-called optimal density. The key idea is to use the CE to measure how far a candidate density is from the optimal density. In particular, the method solves a sequence of optimisation problems of the type below for different values of \(\gamma \) by minimising the CE between a putative optimal density \(g^*(\varvec{x}) \propto I_{\{J(\varvec{x})\le \gamma \}}f(\varvec{x}, \varvec{v}^*)\) for some \(\varvec{v}^*\in \mathcal {V}\), and the density family \(\{f(\cdot ;\varvec{v}),\varvec{v}\in \mathcal {V}\}\):

$$\begin{aligned} \min _{\varvec{v}\in \mathcal {V}} \mathcal {D}(g^*, f(\cdot ;\varvec{v})) = \max _{\varvec{v}\in \mathcal {V}} \mathbb {E}_u \left[ I_{\{J(\varvec{X})\le \gamma \}} \ln f(\varvec{X};\varvec{v})\right] \end{aligned}$$
(2)

where I is the indicator function and \(\varvec{X}\) has density \(f(\cdot ;\varvec{u})\) for \(\varvec{u}\in \mathcal {V}\). The definition of density \(g^*\) above essentially means that, for a given \(\gamma \), we consider only densities that put mass exclusively on arguments \(\varvec{x}\) for which \(J(\varvec{x}) \leqslant \gamma \). The generic CE method involves a 2-step procedure which alternates solving (2) for a candidate \(g^*\) with adaptively updating \(\gamma \). In practice, problem (2) is solved approximately via a Monte Carlo adaptation, i.e., by taking sample averages as estimators for \(\mathbb {E}_u\). The output of the CE method is a sequence of putative optimal densities identified by their parameters \(\hat{\varvec{v}}_0, \hat{\varvec{v}}_1, \ldots , \hat{\varvec{v}}^*\), and performance scores \(\hat{\gamma }_0, \hat{\gamma }_1, \ldots , \hat{\gamma }^*\), which improve with probability 1. For our problem, a key benefit of the CE method is that an analytic solution for (2) can be found when \(\{f(\cdot ;\varvec{v}),\varvec{v}\in \mathcal {V}\}\) is the exponential family of distributions. (More details in [29].)
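To make the 2-step procedure concrete, here is a minimal sketch (in Julia, the language of our implementation) of the CE method on a toy one-dimensional minimisation with a normal sampling family, for which the solution of (2) is simply the mean and standard deviation of the elite samples. The objective and all constants are illustrative assumptions, not choices made elsewhere in the paper.

```julia
using Statistics

J(x) = (x - 2.0)^2          # toy objective with a known minimum at x = 2

function ce_minimise(J; K = 1000, rho = 0.01, iters = 50)
    mu, sigma = 0.0, 10.0                    # initial density f(.; v) = Normal(mu, sigma)
    for _ in 1:iters
        xs = mu .+ sigma .* randn(K)         # sample K candidates from f(.; v)
        Js = J.(xs)
        gamma = quantile(Js, rho)            # adaptively updated level gamma
        elite = xs[Js .<= gamma]             # samples achieving J(x) <= gamma
        mu, sigma = mean(elite), std(elite)  # analytic solution of (2) for this family
        sigma < 1e-8 && break                # sampling density has (nearly) degenerated
    end
    return mu
end

ce_minimise(J)    # ≈ 2.0
```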

Cross-Entropy Method for the SSA. We denote by \(r_j\) the number of firings of the jth reaction channel, by \(\tau _i\) the time between the \((i-1)\)th and ith reaction, and by \(\tau _{r+1}\) the final time interval at the end of the simulation in which no reaction occurs. It can be shown that the likelihood of an exact SSA trajectory \(\varvec{z}=(\varvec{x}_0,\ldots ,\varvec{x}_r)\), where \(r=\sum _{j=1}^mr_j\) is the total number of reaction events, belongs to the exponential family of distributions [41]—whose optimal CE parameter can therefore be found analytically. Daigle et al. [8] showed that the solution of (2) for the SSA likelihood yields the following Monte Carlo estimate of the optimal CE parameter \(v_j^*\),

$$\begin{aligned} \hat{\theta }_j = \hat{v_j^*} = \displaystyle \frac{\sum ^K_{k=1} r_{jk} I_{\{J(\varvec{z}_k)\le \gamma \}}}{\sum ^K_{k=1}I_{\{J(\varvec{z}_k)\le \gamma \}}\left( \sum ^{r_{k}+1}_{i=1}\alpha _j(\varvec{x}_{i-1,k})\tau _{ik}\right) } \end{aligned}$$
(3)

where K is the number of SSA trajectories of the Monte Carlo approximation of (2), \(\varvec{z}_k\) is the kth trajectory, \(r_{jk}\) and \(\tau _{ik}\) are as before but w.r.t. the kth trajectory, \(\varvec{x}_{i,k}\) denotes the state after the ith reaction in the kth trajectory, and the fraction is defined only when the denominator is nonzero (i.e., there is at least one trajectory \(\varvec{z}_k\) for which \(J(\varvec{z}_k) \le \gamma \)—the so-called elite samples). Note that for \(\gamma =0\), the CE estimator (3) coincides with the maximum likelihood estimator (MLE) for \(\theta _j\) over the same trajectory. Following [7] and [26, Sect. 5.3.4], it is easy to show that a Monte Carlo estimator of the covariance matrix of the optimal parameter estimators (3) is given (written in operator style) by the matrix:

$$\begin{aligned} \hat{\varSigma }^{-1} = \bigg [-\frac{1}{K_E}\sum _{k\in E}\frac{\partial ^2}{\partial \theta ^2}-\frac{1}{K_E}\sum _{k\in E} \frac{\partial }{\partial \theta }\cdot \frac{\partial }{\partial \theta }^{\text {T}} +\frac{1}{K_E^2}\biggl (\sum _{k\in E }\frac{\partial }{\partial \theta }\biggr )\cdot \biggl (\sum _{k\in E }\frac{\partial }{\partial \theta }\biggr )^{\text {T}} \biggr ] \bigl (\log f(\theta \mid \varvec{x}, \varvec{z}_{k})\bigr ) \end{aligned}$$
(4)

where E is the set of elite samples, \(K_E=|E|\), the operator \(\frac{\partial ^2}{\partial \theta ^2}\) returns an \(m\,\mathord {\times }\,m\) matrix, \(\frac{\partial }{\partial \theta }\) returns an m-dimensional vector (an \(m\,\mathord {\times }\,1\) matrix), and \(\frac{\partial }{\partial \theta }^\text {T}\) denotes matrix transpose. From Eq. (4), parameter variance estimates can be readily derived. However, a more numerically stable option is to approximate the variance of the jth parameter estimator by the sample variance

$$\begin{aligned} \hat{\sigma }^2_j = \displaystyle \frac{1}{K_E}\sum _{k\in E}\left( \frac{r_{jk}}{\sum ^{r_{k}+1}_{i=1} \alpha _j(\varvec{x}_{i-1,k})\tau _{ik}}-\hat{\theta }_j\right) ^2. \end{aligned}$$
(5)
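As an illustration, both (3) and (5) reduce to simple ratios and averages over the elite set. The sketch below computes them for one reaction channel j, assuming the per-trajectory sufficient statistics (firing counts, propensity-time sums, and scores) have already been collected; the data layout and names are assumptions of this sketch.

```julia
struct TrajSummary
    r_j::Int          # firings of channel j in this trajectory
    A_j::Float64      # sum_i alpha_j(x_{i-1}) * tau_i along the trajectory
    score::Float64    # J(z) for this trajectory
end

# Eq. (3) for theta_j and Eq. (5) for its sample variance, over the elites.
function ce_estimates(trajs::Vector{TrajSummary}, gamma::Float64)
    elite = filter(t -> t.score <= gamma, trajs)
    isempty(elite) && error("no elite samples: denominator of (3) is zero")
    theta_hat = sum(t.r_j for t in elite) / sum(t.A_j for t in elite)
    var_hat = sum((t.r_j / t.A_j - theta_hat)^2 for t in elite) / length(elite)
    return theta_hat, var_hat
end
```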

3 Methods

In this section, we present our stochastic rate parameter inference with cross-entropy (SPICE) algorithm.

Overview. To efficiently sample the parameter space, we treat each stochastic rate parameter as being log-normally distributed, i.e., \(\theta _j \sim \text {Lognormal}(\omega _j,\text {var}(\omega _j))\), where the mean \(\omega _j = \log (\theta _j)\) and the variance \(\text {var}(\omega _j)\) of the log-transformed parameter are computed analogously to (3) and (4), respectively. For the initial iteration, we sample the parameter vector \(\varvec{\theta }\) from the (log-transformed) desired parameter search space \([\varvec{\theta }_{\textsc {min}}^{(0)}, \varvec{\theta }_{\textsc {max}}^{(0)}]\) using a Sobol low-discrepancy sequence [33] to ensure adequate coverage. Subsequent iterations then generate a sequence of distribution parameters \(\{(\gamma _n,\varvec{\theta }_n,\varvec{\varSigma }_n)\}\) which aims to converge to the optimal parameters as follows:

  1. Updating of \(\gamma _n\): Generate K sample trajectories using the SSA, \(\varvec{z}_1,\ldots ,\varvec{z}_K\), from the model \(f(\cdot ;\varvec{\theta }^{(n-1)})\), with \(\varvec{\theta }^{(n-1)}\) sampled from the lognormal distribution, and sort them in order of their performances \(J_{(1)} \le \cdots \le J_{(K)}\) (see Eqs. (6) and (7) for the actual definitions of the performance, or score, function we adopt). For a fixed small \(\rho \), say \(\rho =10^{-2}\), let \(\hat{\gamma }_n\) be the \(\rho \)th quantile of \(J(\varvec{z})\), i.e., \( \hat{\gamma }_n=J_{(\lceil \rho K\rceil )}\).

  2. Updating of \(\varvec{\theta }_n\): Using the estimated level \(\hat{\gamma }_n\), use the same K sample trajectories \(\varvec{z}_1,\ldots ,\varvec{z}_K\) to derive \(\hat{\varvec{\theta }}_{n}\) and \(\hat{\varvec{\sigma }}^2_n\) from the solution of Eqs. (3) and (4). In case of numerical issues (or undersampling), our implementation switches to (5) for updating the variance.

The SPICE algorithm’s pseudocode is shown in Algorithm 1; a condensed, runnable sketch of the scheme is also given below. This 2-step approach provides a simple iterative scheme which converges asymptotically to the optimal density. A reasonable termination criterion is to stop when \(\hat{\gamma }_n\) fails to improve, i.e., \(\hat{\gamma }_n \nleq \hat{\gamma }_{n-1}\), for a fixed number of iterations. In general, more samples are required as the mean and variance of the estimates approach their optima.
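The following Julia sketch condenses the two steps above into a single loop. The SSA simulator and score function are stand-ins (here the “trajectory” is the parameter vector itself and the target rates are arbitrary), and the density update is shown in its simple moment form rather than via the analytic solutions (3) and (4) that SPICE proper uses.

```julia
using Statistics

const target = [0.5, 0.0025, 0.3]       # arbitrary "true" rates for this demo
simulate_ssa(theta) = theta             # stand-in for an SSA trajectory
score(z) = sum(abs2, (z .- target) ./ target)   # stand-in for J in (6)/(7)

function spice_sketch(omega, s2; K = 1000, rho = 0.01, iters = 100)
    gamma_prev = Inf
    for _ in 1:iters
        # Step 1: sample K rate vectors theta ~ Lognormal(omega, s2) and score them.
        thetas = [exp.(omega .+ sqrt.(s2) .* randn(length(omega))) for _ in 1:K]
        Js = [score(simulate_ssa(th)) for th in thetas]
        gamma = quantile(Js, rho)                       # rho-th quantile of J(z)
        E = reduce(hcat, [log.(thetas[k]) for k in 1:K if Js[k] <= gamma])
        # Step 2: refit the lognormal sampling density on the elite samples.
        omega, s2 = vec(mean(E, dims = 2)), vec(var(E, dims = 2))
        gamma >= gamma_prev && break                    # no improvement: stop
        gamma_prev = gamma
    end
    return exp.(omega)                                  # point estimates of the rates
end

spice_sketch(zeros(3), fill(4.0, 3))    # converges towards `target`
```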

Adaptive Sampling. We adaptively update the number of samples \(K_n\) taken at each iteration. The rationale is to ensure that the parameter estimates improve with statistical significance at each step. Thus, our method allows the algorithm to make faster evaluations early on in the iterative process, and to concentrate simulation time on later iterations, where it becomes increasingly hard to distinguish significant improvements in the estimated parameters. We update our parameters based on a fixed number of elite samples, \(K_{{E}}\), satisfying \(J(\varvec{z})\le \gamma \). The performance of the ‘best’ elite sample is denoted \(J_n^*\), while the performance of the ‘worst’ elite sample—previously given by the \(\rho \)th quantile of \(J(\varvec{z})\)—is \(\hat{\gamma }_n\). The quantile parameter \(\rho \) is adaptively updated at each iteration as \(\rho _n=K_{{E}}/K_n\), where \(K_{E}\) is typically taken to be 1–10% of the base number of samples \(K_0\). At each iteration, a check is made for improvement in either the best or the worst performing elite sample, i.e., if \( J^*_n < J^*_{n-1}\) or \( \hat{\gamma }_n < \hat{\gamma }_{n-1}\), then we update our parameters and proceed to the next iteration. If no improvement in either value is found, the number of samples \(K_n\) in the current iteration is increased in increments, up to a maximum \(K_{\text {max}}\). If we hit the maximum number of samples \(K_{\text {max}}\) for c consecutive iterations (e.g., \(c=3\)), this suggests that no further significant improvement can be made given the restriction on the number of samples.
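As an illustration, the sample-size rule just described can be captured in a few lines; the function names and the increment step below are assumptions of this sketch, not SPICE’s exact implementation.

```julia
# Grow the sample size when neither the best (J*) nor the worst (gamma-hat)
# elite score improved, capping at K_max.
function next_sample_size(K_n, K_max, improved::Bool; step = 1000)
    improved ? K_n : min(K_n + step, K_max)
end

# Suggest termination once K_max has been required for c consecutive iterations.
function hit_sample_ceiling(K_history::Vector{Int}, K_max; c = 3)
    length(K_history) >= c && all(==(K_max), K_history[end-c+1:end])
end
```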

Objective Function. The SPICE algorithm has been developed to handle an arbitrary number of datasets. Given N time series datasets, SPICE associates N objective function scores with each simulated trajectory. Each objective value is the standard sum of squared \(L^2\) distances between the trajectory and the respective dataset across all time points:

$$\begin{aligned} J_n(\varvec{z}) = \sum _{t=1}^{T}\ (\varvec{y}_{n,t}-\varvec{x}_t)^2 \quad \quad 1 \le n \le N \end{aligned}$$
(6)

where \(\varvec{x}_t=\varvec{x}(t)\) and \(\varvec{y}_{n,t}\) is the datapoint at time t in the nth dataset. To ensure adequate coverage of the data, we choose our elite samples to be the best performing quantile of trajectories for each individual dataset (with scores \(J_n\)).

In the absence of temporal correlation within the data (e.g., when measurements between time points are independent or individual cells cannot be tracked, as in flow cytometry data), we instead construct an empirical Gaussian mixture model for each time point within the data. Each mixture model at time t is composed of N multivariate normal distributions, each with a vector of mean values \(\varvec{y}_{n,t}\) corresponding to the observed species in the nth dataset, and a diagonal covariance matrix \(\varvec{\sigma }^2_n\) corresponding to an error estimate or variance of the measurements on the species. In our experiments we used a 10% standard deviation, as we did not have any information about measurement noise. We then take the objective score function to be the negative log-likelihood (up to an additive constant) of the simulated trajectory w.r.t. the data:

$$\begin{aligned} J(\varvec{z}) = - \sum _{t=1}^T \ln \left( \sum _{n=1}^N\exp \left[ -\frac{1}{2}(\varvec{y}_{n,t}-\varvec{x}_t)^{\intercal } \varvec{\sigma }_n^{-2}(\varvec{y}_{n,t}-\varvec{x}_t)\right] \right) . \end{aligned}$$
(7)
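A direct transcription of (7) is shown below; `x` holds the simulated states at the data time points, `y[n][t]` the observed means, and `s2[n]` the assumed diagonal measurement variances (e.g., the 10% standard deviation used in our experiments). The argument layout is an assumption of this sketch.

```julia
# Negative log-likelihood of a trajectory against the per-time-point
# Gaussian mixture built from the N datasets, as in Eq. (7).
function mixture_objective(x, y, s2)
    T, N = length(x), length(y)
    J = 0.0
    for t in 1:T
        acc = 0.0
        for n in 1:N
            d = y[n][t] .- x[t]
            acc += exp(-0.5 * sum(abs2.(d) ./ s2[n]))  # diagonal covariance
        end
        J -= log(acc)
    end
    return J
end
```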

Smoothed Updates. We implement the parameter smoothing update formulae

$$ \hat{\varvec{\theta }}^{(n)} = \lambda \tilde{\varvec{\theta }}^{(n)} + (1-\lambda )\hat{\varvec{\theta }}^{(n-1)}, \quad \hat{\varvec{\sigma }}^{(n)} = \beta _n\tilde{\varvec{\sigma }}^{(n)} + (1-\beta _n)\hat{\varvec{\sigma }}^{(n-1)} $$

where \( \beta _n = \beta - \beta \left( 1-\frac{1}{n}\right) ^q\), \(\lambda \,\mathord {\in }\, (0,1]\), \(q\,\mathord {\in }\,\mathbb {N}^+\) and \(\beta \,\mathord {\in }\,(0,1)\) are smoothing constants, and \(\tilde{\varvec{\theta }},\tilde{\varvec{\sigma }}\) are the outputs from the solution of the cross-entropy in Eq. (2), approximated by (3) and (4), respectively. Parameter smoothing between iterations has three important benefits: (i) the parameter estimates converge to a more stable value, (ii) it reduces the probability of a parameter value tending towards zero within the first few iterations, and (iii) it prevents the sampling distribution from converging too quickly to a degenerate point probability mass at a local minimum. Furthermore, the authors of [6] provide a proof that the CE method converges to an optimal solution with probability 1 in the case of smoothed updates.
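In code, the smoothed updates amount to two one-liners (shown here with the constants \((\lambda , \beta , q)\) left as arguments; the experiments in Sect. 4 use (0.7, 0.8, 5)):

```julia
# Smoothed mean update: theta_hat(n) = lambda*theta_tilde(n) + (1-lambda)*theta_hat(n-1).
smooth_theta(theta_new, theta_old, lambda) =
    lambda .* theta_new .+ (1 - lambda) .* theta_old

# Dynamic variance smoothing with beta_n = beta - beta*(1 - 1/n)^q.
function smooth_sigma(sigma_new, sigma_old, beta, q, n)
    beta_n = beta - beta * (1 - 1 / n)^q
    beta_n .* sigma_new .+ (1 - beta_n) .* sigma_old
end
```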

Multiple Shooting and Particle Splitting. SPICE can optionally utilise these two techniques for trajectory simulation between time intervals. For multiple shooting we construct a sample trajectory comprised of T intervals matching the time stamps within the data \(\varvec{y}\). In the original formulation [42], each segment from \(\varvec{x}_{t-1}\) to \(\varvec{x}_t\) was simulated using an ODE model with the initial conditions set to the previous time point of the dataset, i.e., \(\varvec{x}_{t-1} = \varvec{y}_{t-1}\). We instead treat the data as being mixture-normally distributed, and thus sample our initial conditions \(\varvec{x}_{t-1}\sim \mathcal {N}(\varvec{y}_{n,t-1},\varvec{\sigma }^2_{n,t-1})\), where the index n of the time series is first uniformly sampled. Using the SSA, each piecewise section of a trajectory belonging to sample k is then simulated with the same parameter vector \(\varvec{\theta }\). For particle splitting we adopt a multilevel splitting approach as in [8], and the objective function is calculated after the simulation of each segment from \(\varvec{x}_{t-1}\) to \(\varvec{x}_t\). The trajectories \(\varvec{z}_k\) satisfying \(J(\varvec{z}_k)\le \hat{\gamma }\) are then re-sampled with replacement \(K_n\) times before simulation continues (recall \(K_n\) is the number of samples in the nth iteration); a sketch of this re-sampling step is given below. This process aims at discarding poorly performing trajectories in favour of those ‘closest’ to the data. This in turn creates an enriched sample, at the cost of introducing a degree of bias propagation.
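A minimal sketch of the re-sampling step follows; uniform re-sampling with replacement from the eligible trajectories is an illustrative choice, not necessarily the exact scheme of [8].

```julia
# After simulating a segment, keep only trajectories meeting the current
# threshold and re-sample K_n of them (with replacement) to continue from.
function split_particles(trajs, scores, gamma, K_n)
    eligible = findall(<=(gamma), scores)
    isempty(eligible) && return trajs        # nothing to enrich from
    return trajs[rand(eligible, K_n)]        # enriched sample of size K_n
end
```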

Hyperparameters. SPICE allows for the inclusion of hyperparameters \(\varvec{\phi }\) (e.g., scaling constants and non-kinetic-rate parameters), which are sampled (logarithmically) alongside \(\varvec{\theta }\). These hyperparameters are updated at each iteration via the standard CE method.

Tau-Leaping. With inexact, faster methods such as tau-leaping [15] a degree of accuracy is traded off in favour of computational performance. Thus, we are interested in replacing the SSA with tau-leaping in our SPICE algorithm. The next Proposition shows that with a tau-leaping trajectory we get the same form for the optimal CE estimator as in (3).

Algorithm 1. The SPICE algorithm (pseudocode figure not reproduced).

Proposition 1

The CE solution for the optimal rate parameter over a tau-leaping trajectory is the same as that for a standard SSA trajectory.

Proof

We shall use the same notation as in Sect. 2, and further assume a trajectory in which state changes occur at times \(t_l\), for \( l\, \mathord {\in }\, \{0,1,\ldots ,L\}\). For each time interval of size \(\tau _l\) of the tau-leaping algorithm, the number of firings \(k_{jl} \in \mathbb {Z}_{\ge 0}\) of each reaction channel \(\mathcal {R}_j\) is sampled from a Poisson distribution with mean \(\lambda _{jl} = \theta _j \alpha _j(\varvec{x}_{t_l})\tau _l\). Thus, the probability of firing \(k_{jl}\) reactions in the interval \([t_l, t_l+\tau _l)\), given the initial state \(\varvec{x}_{t_l}\), is \(P(k_{jl}|\varvec{x}_{t_l}, \lambda _{jl}) = \exp \{-\lambda _{jl}\}(\lambda _{jl})^{k_{jl}}/{k_{jl}!}\), where \(P(0|\varvec{x}_{t_l}, 0) = 1\). Therefore, the combined probability across all reaction channels is:

$$ \prod _{j=1}^m P(k_{jl}|\varvec{x}_{t_l}, \lambda _{jl}) = \prod _{j=1}^m \frac{\exp \{-\lambda _{jl}\}(\lambda _{jl})^{k_{jl}}}{k_{jl}!}. $$

Extending for the entire trajectory, the complete likelihood is given by:

$$ \mathcal {L} = \prod _{l=0}^L \prod _{j=1}^m P(k_{jl}|\varvec{x}_{t_l}, \lambda _{jl}) = \prod _{l=0}^L \prod _{j=1}^m \frac{\exp \{-\lambda _{jl}\}(\lambda _{jl})^{k_{jl}}}{k_{jl}!}. $$

We can conveniently factorise the likelihood into component likelihoods associated with each reaction channel as \(\mathcal {L} = \prod _{j=1}^m \mathcal {L}_j \), where each component \(\mathcal {L}_j\) is given by \( \mathcal {L}_j = \prod _{l=0}^L \frac{\exp \{-\lambda _{jl}\}(\lambda _{jl})^{k_{jl}}}{k_{jl}!}\). Expanding \(\lambda _{jl}\):

$$\begin{aligned} \begin{aligned} \mathcal {L}_j&= \prod _{l=0}^L \frac{\exp \{-\theta _j\alpha _j(\varvec{x}_{t_l})\tau _l\}(\theta _j\alpha _j(\varvec{x}_{t_l})\tau _l)^{k_{jl}}}{k_{jl}!} \\&= \theta _j^{r_j}\exp \left\{ -\theta _j \sum _{l=0}^L \alpha _j(\varvec{x}_{t_l})\tau _l\right\} \prod _{l=0}^L \frac{(\alpha _j(\varvec{x}_{t_l})\tau _l)^{k_{jl}}}{k_{jl}!}, \end{aligned} \end{aligned}$$

where \(r_j=\sum _{l=0}^L k_{jl}\), i.e., the total number of firings of reaction channel \(\mathcal {R}_j\). From [29], the solution to (2) can be found by solving:

$$\begin{aligned} \mathbb {E}_u \left[ I_{\{J(\varvec{X})\le \gamma \}} \nabla \ln \mathcal {L}_j\right] = 0, \end{aligned}$$

given that the differentiation and expectation operators can be interchanged. Expanding \(\ln \mathcal {L}_j\) and simplifying, we get:

$$\begin{aligned} \mathbb {E}_u \left[ I_{\{J(\varvec{X})\le \gamma \}}\nabla \left( \ln \theta _j^{r_j} -\theta _j\sum _{l=0}^L \alpha _j(\varvec{x}_{t_l})\tau _l +\ln \left\{ \prod _{l=0}^L \frac{(\alpha _j(\varvec{x}_{t_l})\tau _l)^{k_{jl}}}{k_{jl}!}\right\} \right) \right] =0. \end{aligned}$$

We can then take the derivative, \(\nabla \), with respect to \(\theta _j\),

$$\begin{aligned} \mathbb {E}_u \left[ I_{\{J(\varvec{X})\le \gamma \}}\left( \frac{r_j}{\theta _j} -\sum _{l=0}^L \alpha _j(\varvec{x}_{t_l})\tau _l \right) \right] = 0. \end{aligned}$$

It is simple to see that the previous equality holds when \(r_j/\theta _j=\sum _{l=0}^L \alpha _j(\varvec{x}_{t_l})\tau _l\), yielding the Monte Carlo estimate,

$$ \hat{\theta }_j = \frac{\sum ^K_{k=1} I_{\{J(\varvec{z}_k)\le \gamma \}}r_{jk}}{\sum ^K_{k=1}I_{\{J(\varvec{z}_k)\le \gamma \}}\sum ^{L}_{l=0}\alpha _j(\varvec{x}_{t_l,k})\tau _{l,k}}. $$

   \(\square \)
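For concreteness, the sketch below runs a fixed-step tau-leap pass and accumulates exactly the sufficient statistics appearing in the estimator above: the firing counts \(r_j\) and the sums \(\sum _{l}\alpha _j(\varvec{x}_{t_l})\tau _l\). The inverse-CDF Poisson sampler, the fixed leap size, and the absence of a guard against negative populations are simplifications of this illustration, not features of optimised tau-leaping.

```julia
# Knuth-style Poisson sampler (adequate for small means).
function poisson_rand(lambda)
    L, k, p = exp(-lambda), 0, 1.0
    while true
        p *= rand()
        p <= L && return k
        k += 1
    end
end

# One tau-leaping trajectory: alpha[j] is the propensity factor alpha_j(x),
# nu[j] the state-change vector of reaction j, tau the (fixed) leap size.
function tau_leap_stats(x0, theta, alpha, nu, tau, T)
    m = length(theta)
    x = copy(x0)
    r = zeros(Int, m)          # r_j: total firings per channel
    A = zeros(m)               # sum_l alpha_j(x_{t_l}) * tau_l
    for _ in 1:ceil(Int, T / tau)
        a = [alpha[j](x) for j in 1:m]
        k = [poisson_rand(theta[j] * a[j] * tau) for j in 1:m]
        r .+= k
        A .+= a .* tau
        for j in 1:m
            x .+= k[j] .* nu[j]
        end
    end
    return r, A                # per-trajectory terms of the estimator above
end
```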

4 Experiments

We evaluate our SPICE algorithm on four commonly investigated systems: (i) the Lotka-Volterra predator–prey model, (ii) a Yeast Polarization model, (iii) the bistable Schlögl system, and (iv) the Genetic Toggle Switch. We present results for each system obtained using both the standard SSA and optimised tau-leaping (with an error control parameter of \(\varepsilon =0.1\)) to drive our simulations.

For each run of the algorithm we set the sample parameters \(K_{E}=10\), \(K_{\text {min}}=1,000\), \(K_{\text {max}}=20,000\), and set an upper limit of 250 on the number of iterations. The smoothing parameters \((\lambda , \beta , q)\) were set to (0.7, 0.8, 5), respectively. For our analysis, we define the mean relative error (MRE) between a parameter estimate \(\hat{\varvec{\theta }}\) and the truth \(\varvec{\theta }^*\) as \(\text {MRE}(\%_{\textsc {ERR}})=M^{-1}\sum _{j=1}^M|\hat{\theta }_j - \theta ^*_j| / \theta ^*_j \times 100 \). All our experiments were performed on an Intel Xeon 2.9 GHz Linux system without using multiple cores—all reported CPU times are single-core. SPICE has been implemented in Julia and is open source (https://github.com/pzuliani/SPICE).
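For reference, the MRE defined above is, in code (the three-parameter example values are hypothetical):

```julia
using Statistics

# Mean relative error, in percent, between an estimate and the true rates.
mre(theta_hat, theta_star) = 100 * mean(abs.(theta_hat .- theta_star) ./ theta_star)

mre([0.49, 0.0026, 0.31], [0.5, 0.0025, 0.3])   # ≈ 3.1
```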

For models (i)–(iii), we use synthetic data where the true solution is known, and compare the results of SPICE against some commonly used parameter estimation techniques implemented in COPASI 4.16 [17]. Specifically, we compare the performance of SPICE against the genetic algorithm (GA), evolution strategy (ES), evolutionary programming (EP), and particle swarm (PS) implementations. For the ES and EP algorithms we allow 250 generations with a population of 1,000 particles. For the GA, we run 500 generations with 2,000 particles. For the PS, we allow 1,000 iterations with 1,000 particles. For model (iv), the Genetic Toggle Switch, we show results for SPICE using real experimental data.

All statistics presented are based on 100 runs of each algorithm using fixed datasets. For each approach we also compared the performance of the standard SSA versus tau-leaping, alongside the multiple-shooting and particle-splitting approaches. However, for the models tested, neither multiple shooting nor particle splitting helped in reducing CPU times or improving the accuracy of the estimates.

Lotka-Volterra Predator–Prey Model. We implement the standard Lotka-Volterra model below, with real parameters \((\theta _1,\theta _2,\theta _3) =\) (0.5, 0.0025, 0.3) and initial population \((X_1, X_2) =\) (50, 50):

$$ X_1 \overset{\theta _1}{\longrightarrow } X_1 + X_1 \quad \quad \quad X_1 + X_2 \overset{\theta _2}{\longrightarrow } X_2 + X_2\quad \quad \quad X_2 \overset{\theta _3}{\longrightarrow } \emptyset $$

We artificially generated 5 datasets, each consisting of 40 timepoints, using Gillespie’s SSA, and performed parameter estimation based on these datasets. For the initial iteration, we placed bounds on the Sobol sequence parameter search space of \(\theta _j \,\mathord {\in }\, [1\mathrm {e}{-6},10]\), for \(j = 1,2,3\). The minimum, maximum, and average MRE between the true parameters and their estimates across all 100 runs of each algorithm (using the standard SSA) are summarised in Table 1, together with the corresponding CPU run times. Box plots summarising the obtained parameter estimates across all runs of each method are displayed in Fig. 1.

Fig. 1. Lotka-Volterra model: box plots showing the summary statistics across 100 runs of COPASI and SPICE for each of the 3 parameter estimates. We note SPICE consistently has the least variance.

In the previous Lotka-Volterra predator–prey example, SPICE was provided with the complete data for both species \(X_1,X_2\). However, we are also concerned with cases where the data is not fully observed, i.e., when we have latent species. To compare the effects of latent species on the quality of parameter estimates, we ran SPICE again (averaging across 100 runs), this time supplying information about species \(X_1\) alone. The results are presented in Table 1.

Yeast Polarization Model. We implement the Yeast Polarization model (see below) with real parameters \((\theta _1,\ldots ,\theta _8) =\) (0.38, 0.04, 0.082, 0.12, 0.021, 0.1, 0.005, 13.21), and initial population \((R,L,RL,G,G_a,G_{bg},G_d) =\) (500, 4, 110, 300, 2, 20, 90). The reactions of the model are [8]:

$$\begin{aligned} \emptyset&\overset{\theta _1}{\longrightarrow } R&RL+G&\overset{\theta _5}{\longrightarrow } G_a + G_{bg}\\ R&\overset{\theta _2}{\longrightarrow } \emptyset&G_a&\overset{\theta _6}{\longrightarrow } G_d\\ L + R&\overset{\theta _3}{\longrightarrow } RL+L&G_d + G_{bg}&\overset{\theta _7}{\longrightarrow } G\\ RL&\overset{\theta _4}{\longrightarrow } R&\emptyset&\overset{\theta _8}{\longrightarrow } RL \end{aligned}$$

We artificially generated 5 datasets, each consisting of 17 timepoints, using Gillespie’s SSA, and performed parameter estimation based on these datasets. For the initial iteration, we placed bounds on the parameter search space of \(\theta _j \,\mathord {\in }\, [1\mathrm {e}{-6},10]\) for \(1\leqslant j\leqslant 7\), and \(\theta _8 \,\mathord {\in }\, [1\mathrm {e}{-6},100]\). The average relative errors between the estimated and the real parameters across 100 runs of the algorithm are summarised in Table 1, along with the corresponding CPU run times. The variability of the estimates obtained using SPICE (and other methods) is shown in Fig. 2.

Fig. 2. Yeast Polarization parameter estimates: box plots showing the summary statistics of all 8 parameter estimates across 100 runs of COPASI’s methods and SPICE. We note once again SPICE produces the least variation of obtained estimates.

Schlögl System. We use the Schlögl model [30] with parameters \((\theta _1,\theta _2,\theta _3,\theta _4) = (3\mathrm {e}{-7}, 1\mathrm {e}{-4}, 1\mathrm {e}{-3}, 3.5)\), and initial population \((X, A, B) = (250, 1\mathrm {e}{5}, 2\mathrm {e}{5})\). This model is well known to produce bistable dynamics (see Fig. 4).

$$\begin{aligned} 2 X + A&\overset{\theta _1}{\longrightarrow } 3 X&B&\overset{\theta _3}{\longrightarrow } X\\ 3 X&\overset{\theta _2}{\longrightarrow } 2 X + A&X&\overset{\theta _4}{\longrightarrow } B \end{aligned}$$

We artificially generated 10 datasets (in order to partially capture a degree of the bistable dynamics), each consisting of 100 timepoints, and performed parameter estimation based on these datasets (also see Fig. 4). For the initial iteration, we placed bounds on the parameter search space of \(\theta _1\, \mathord {\in }\, [1\mathrm {e}{-9},1\mathrm {e}{-5}]\), \(\theta _2\, \mathord {\in }\, [1\mathrm {e}{-6},0.01]\), \(\theta _3\, \mathord {\in }\, [1\mathrm {e}{-5},10]\), \(\theta _4 \,\mathord {\in }\, [0.01,100]\). Unlike for the previous models, we explicitly ran the Schlögl system using tau-leaping for all algorithms, because exact simulation was computationally infeasible under the same conditions (4.5 h in SPICE, 48+ h in COPASI). The MRE of all the estimated parameters, together with CPU times for each algorithm, are summarised in Table 1. Box plots of the SPICE algorithm’s performance are presented in Fig. 3. Note that the Schlögl system is sensitive to the initial conditions, so even slight perturbations of its parameters can cause the system to fail in producing bimodality.

Fig. 3. Schlögl system parameter estimates: box plots comparing the parameter estimates across 100 runs of COPASI’s methods and SPICE (all simulated using tau-leaping, \(\varepsilon =0.1\)). Again, SPICE shows the smallest variance, with mean estimates quite close to the real values of \(\theta _1\) and \(\theta _3\). For \(\theta _2\) and \(\theta _4\), all the best mean estimates have variance much larger than SPICE estimates.

Fig. 4. Schlögl: from the left: solid black lines: the 10 datasets generated using the SSA direct method and the real parameters, and used as input for SPICE. Blue lines: 100 model runs with estimated parameters sampled from the final parameter distributions obtained by SPICE with the direct method (means = \((2.14\mathrm {e}{-7}, 7.63\mathrm {e}{-5}, 4.54\mathrm {e}{-4}, 2.18)\); variances = \((7.81\mathrm {e}{-16}, 2.81\mathrm {e}{-10}, 4.05\mathrm {e}{-8}, 0.13)\)). Fitted: empirical distribution of 1,000 model simulations with sampled parameters from SPICE output. Real distribution: empirical distribution of 1,000 model simulations with the real parameters. (Color figure online)

Fig. 5. Toggle switch model: blue circles: the experimental data with the \(\log _{10}(\text {GFP})\) fluorescence plotted against the \(\log _{10}(\text {mCherry})\) fluorescence, across all timepoints up to 6 h. Orange circles: 1,000 model simulations using the direct method, with parameters sampled from the final distribution obtained by SPICE using tau-leaping (\(\varepsilon =0.1\)). (Color figure online)

Toggle Switch Model. The genetic toggle switch is a well-studied bistable system, of particular importance to synthetic biology. The toggle switch is comprised of two repressors and two promoters, often mediated in practice through IPTG (isopropyl β-D-1-thiogalactopyranoside) and aTc (anhydrotetracycline) induction. We perform parameter inference based on real high-throughput data (see Fig. 5), using a simple model (see below) based on [12]. For our model, we define the following reaction propensities:

$$\begin{aligned} h_1&= \theta _1\times \textsc {GFP}&h_3&= \theta _3\times \textsc {mCherry}\\ h_2&= \frac{\theta _2 \times \phi _1}{1+\phi _1 + \phi _2 \times \textsc {mCherry}^{2}}&h_4&= \frac{\theta _4 \times \phi _3}{1+\phi _3 + \phi _4 \times \textsc {GFP}^{2}} \end{aligned}$$

where GFP and mCherry are the two model species (reporter molecules), and the stochastic rate parameters are (\(\theta _1,\ldots ,\theta _4\)). The data used for parameter inference was obtained through fluorescent flow cytometry in [21], via the GFP and mCherry reporters, and consists of 40,731 measurements across 7 timepoints over 6 h. We look specifically at the case where the switch starts in the low-GFP (high-mCherry) state, and switches to the high-GFP (low-mCherry) state over the time course after aTc induction. The inclusion of real, noisy data requires a degree of additional care, as the data needs to be rescaled from arbitrary units (a.u.) to discrete molecular counts. We assume a linear (multiplicative) scale, e.g., such that GFP (a.u.) \(= \phi _5\,\times \) GFP molecules. Furthermore, we can no longer assume all the cells begin in the same state, and we must treat the initial state as being drawn from a distribution. This introduces extra so-called ‘hyperparameters’, specifically the GFP molecule count to fluorescence (a.u.) scale factor \(\phi _5\), and the respective mCherry scale factor \(\phi _6\). In addition, the model now contains 4 additional parameters, \(\phi _1,\ldots ,\phi _4\), which in turn are required to be estimated. Each hyperparameter is initially sampled as before using the low-discrepancy Sobol sequence, and updated using the means and variances of the generated elite samples as per the CE method; a sketch of the propensities and rescaling is given below.
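The sketch below transcribes the four propensities and the assumed linear rescaling from arbitrary units to molecule counts; the function and variable names are ours, not part of the model specification.

```julia
# Toggle switch propensities h1..h4, as defined above, for species counts
# `gfp` and `mcherry`, rate constants theta[1..4] and parameters phi[1..4].
function toggle_propensities(gfp, mcherry, theta, phi)
    h1 = theta[1] * gfp
    h2 = theta[2] * phi[1] / (1 + phi[1] + phi[2] * mcherry^2)
    h3 = theta[3] * mcherry
    h4 = theta[4] * phi[3] / (1 + phi[3] + phi[4] * gfp^2)
    return (h1, h2, h3, h4)
end

# Linear scaling assumption: GFP (a.u.) = phi5 × GFP molecules (and similarly
# for mCherry with phi6), so observed fluorescence is divided by the factor.
gfp_molecules(gfp_au, phi5) = gfp_au / phi5
mcherry_molecules(mcherry_au, phi6) = mcherry_au / phi6
```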

Table 1. The relative errors for each stochastic rate parameter, averaged across 100 runs, using COPASI’s Evolutionary Programming (EP), Evolution Strategy (ES), Genetic Algorithm (GA), and Particle Swarm (PS) algorithms, and our SPICE algorithm. The minimum, maximum, and average mean relative error (MRE) for all parameter estimates across all runs are also given, alongside the averaged CPU time.

The bounds placed on the initial kinetic parameter search space, based upon reported half-lives for the variants of GFP [2] and mCherry [31], were \(\theta _{1,3}\, \mathord {\in }\, [1\mathrm {e}{-3},1]\), and \(\theta _{2,4}\, \mathord {\in } \,[1,50]\). The respective bounds on the search space for the hyperparameters were \(\phi _{1,2,3,4}\, \mathord {\in }\, [1\mathrm {e}{-3},10]\), and \(\phi _{5,6}\, \mathord {\in }\, [50,500]\). To generate the parameter estimates, we used SPICE with tau-leaping (\(\varepsilon =0.1\), CPU time = 4,293 s). The estimated parameters and the resulting fit of the model against the data can be seen in Fig. 5.

5 Discussion

We can see from the presented results that our SPICE algorithm performs well on the models studied. For the Lotka-Volterra model the quality of the estimates is always good—there is no relative error larger than 2.1% in Table 1 for SPICE. The CPU times are reasonable in absolute terms (about 20 min, single core) and much smaller than those of the methods implemented in COPASI, while yielding smaller errors. Also, having one unobserved species (\(X_2\)) in the data does not seem to impact the results very much. In particular, from Table 1 we see that the latent model indeed has a higher error than the fully observable model; however, the error is always smaller than 10%, which is acceptable.

The Yeast Polarization model is a more difficult system: we can indeed see from Table 1 that a number of parameter estimates have large relative errors. These are the same ‘hard’ parameters estimated by MCEM\(^2\) [8], with similar errors. However, in CPU time terms, our SPICE algorithm does much better than MCEM\(^2\): SPICE can return a quite good estimate (in line with MCEM\(^2\)’s) on average in about 18 min using the direct method, while MCEM\(^2\) would need about 30 days [8]—a speed-up of 2,400 times. Furthermore, for this model one could use tau-leaping instead of the direct method, gaining a 3x speedup in performance while giving up little accuracy (the Min., Av., and Max. MRE \(\%_\text {ERR}\) were 31.2, 41.5, and 56.3, respectively; Av. CPU time was 303 s).

The Schlögl system is another challenging case study, as clearly shown by the results of Table 1, which were obtained by utilising tau-leaping (as a matter of fact, for the Schlögl model the average accuracy of SPICE increases with the use of tau-leaping). Our choice was motivated by the large CPU time of the direct method, due to the fact that the upper steady state for X in the model has a large molecule number (about 600), which negatively impacts the running time of direct-method simulations. The results of Table 1 show that there is no clear winner: the Evolutionary Programming method in COPASI has the smallest runtime, but twice the error achieved by SPICE, which has the best accuracy. As noted before, running the COPASI implementations with larger populations and more iterations did not significantly improve accuracy for the increased cost.

Lastly, the genetic Toggle Switch presents an interesting real-world case study with high-throughput data. The model now comprises six extra parameters (\(\phi _1,\ldots ,\phi _6\)), each of which must be estimated alongside the four kinetic rate constants. In addition, the non-discrete (and noisy) data is no longer known to be generated from a convenient mathematical model. In other terms, there is no guarantee that the model reflects the true underlying biochemical reaction network. Despite these challenges, our SPICE algorithm does a very good job (in little more than an hour of CPU time) of computing parameter estimates for which the model quite closely matches the experimental data—we see in fact from Fig. 5 that the model simulations fall inside the data, with very few exceptions, and that the empirical and simulated distributions closely match.

Related Work. Techniques for stochastic rate parameter estimation fall into four categories. The first comprises methods based on MLE: simulated maximum likelihood utilises Monte Carlo simulation and a genetic algorithm to maximise an approximated likelihood [34]. Efforts have been made to incorporate the Expectation-Maximisation (EM) algorithm with the SSA [18]. Stochastic gradient descent explores a Markov chain Monte Carlo sampler with a Metropolis-Hastings update step [39]. In [25] a hidden Markov model is used for the system state, which is then solved by (approximate) likelihood maximisation. Lastly, a recent work [8] has combined an ascent-based EM algorithm with a modified cross-entropy method. A second category of methodologies includes Bayesian inference. In particular, approximate Bayesian computation (ABC) gains an advantage by being ‘likelihood free’, and recent advances in sequential Monte Carlo (SMC) samplers have further improved these methods [32, 35]. We note the similarities between ABC(-SMC) approaches and SPICE. Both methods can utilise ‘elite’ samples to produce better parameter estimates. A key difference is that ABC(-SMC) uses accepted simulation parameters to construct a posterior distribution, while SPICE utilises complete trajectory information to compute optimal updates of an underlying parameter distribution. The Bayesian approach presented in [5] can handle partially observed systems, including notions of experimental error. Linear noise approximation techniques have been used alongside Bayesian analysis [19]. A very recent work [36] combines Bayesian analysis with statistical emulation in an attempt at reducing the cost of the SSA simulations. A third class of methodologies centres on the numerical solution of the chemical master equation (CME), which is often intractable for all but the simplest of systems. One approach is to use dynamic state space truncation [3] or finite state projection methods [9], which truncate the CME state space by ignoring the smallest-probability states. Another variation is to use a method of moments approximation [10, 16] to construct ordinary differential equations (ODEs) describing the time evolution of the mean, variance, etc., of the underlying distribution. Other CME approximations are system size expansion using van Kampen’s expansion [11], and solutions of the Fokker-Planck equation [22] using a form of linear noise approximation. Finally, a fourth method [42] treats intervals between time measurements piecewise, and within each interval an ODE approximation is used for the objective function. This method has been recently extended using linear noise approximation [43]. A recent work [1], tailored for high-throughput data, proposes a stochastic parameter inference approach based on the comparison of distributions.

6 Conclusions

In this paper we have introduced the SPICE algorithm for rate parameter inference in stochastic reaction networks. Our algorithm is based on the cross-entropy method and Gillespie’s algorithm, with a number of significant improvements. Key strengths of our algorithm are its ability to use multiple, possibly incomplete datasets (including distribution data), and its (theoretically justified) use of tau-leaping methods for model simulation. We have shown that SPICE works well in practice, in terms of both computational cost and estimate accuracy (which was often the best in the models tested), even on challenging case studies involving bistable systems and real high-throughput data. On a non-trivial case study, SPICE can be orders of magnitude faster than other approaches, while offering comparable accuracy in the estimates.