Abstract
We explore efficient estimation of statistical quantities, particularly rare event probabilities, for stochastic reaction networks. To this end, we propose an importance sampling (IS) approach to improve the Monte Carlo (MC) estimator efficiency based on an approximate tau-leap scheme. The crucial step in the IS framework is choosing an appropriate change of probability measure to achieve substantial variance reduction. This task is typically challenging and often requires insights into the underlying problem. Therefore, we propose an automated approach to obtain a highly efficient path-dependent measure change based on an original connection in the stochastic reaction network context between finding optimal IS parameters within a class of probability measures and a stochastic optimal control formulation. Optimal IS parameters are obtained by solving a variance minimization problem. First, we derive an associated dynamic programming equation. Analytically solving this backward equation is challenging; hence, we propose an approximate dynamic programming formulation to find near-optimal control parameters. To mitigate the curse of dimensionality, we propose a learning-based method to approximate the value function using a neural network, where the parameters are determined via a stochastic optimization algorithm. Our analysis and numerical experiments verify that the proposed learning-based IS approach substantially reduces MC estimator variance, resulting in a lower computational complexity in the rare event regime, compared with standard tau-leap MC estimators.
1 Introduction
We propose an approach to efficiently estimate statistical quantities, particularly rare event probabilities, for a particular class of continuous-time Markov chains known as stochastic reaction networks (SRNs). To this end, we develop a learning-based importance sampling (IS) algorithm to improve the Monte Carlo (MC) estimator efficiency based on an approximate tau-leap (TL) scheme. The automated approach is based on an original connection between optimal IS parameter determination within a class of probability measures and a stochastic optimal control (SOC) formulation.
SRNs (see Sect. 1.1 for a short introduction and [9] for more details) describe the time evolution of biochemical reactions, epidemic processes [5, 13], and transcription and translation in genomics and virus kinetics [32, 48], among other important applications. For the current study, let \({\textbf{X}}\) be an SRN that takes values in \({\mathbb {N}}^d\) and is defined in the time interval [0, T], where \(T>0\) is a user-selected final time. We aim to provide accurate and computationally efficient MC estimations for the expected value \({\mathbb {E}}[g({\textbf{X}}(T))]\), where \(g:{\mathbb {N}}^d\rightarrow {\mathbb {R}}\) is a scalar observable for \({\textbf{X}}\). In particular, we study estimating rare event probabilities with \(g({\textbf{x}})=\varvec{1}_{\{{\textbf{x}} \in {\mathcal {B}}\}}\) (i.e., the indicator function for a set \({\mathcal {B}} \subset {\mathbb {R}}^d\)).
The quantity of interest, \({\mathbb {E}}[g({\textbf{X}}(T))]\), can be computed by solving the corresponding Kolmogorov backward equations [8]. For most SRNs, deriving a closed-form solution for these ordinary differential equations is infeasible, and numerical approximations based on discretized schemes are commonly used. However, the computational cost scales exponentially with the number of species d. Therefore, we are particularly interested in estimating \({\mathbb {E}}[g({\textbf{X}}(T))]\) using MC methods, an attractive alternative to avoid the curse of dimensionality.
Many schemes have been developed to simulate exact sample paths for SRNs, such as the stochastic simulation algorithm [25] and modified next reaction method [4]. Pathwise exact SRN realizations can incur high computational costs if any reaction channels have high reaction rates. Gillespie [26] and Aparicio and Solari [6] independently proposed the explicit TL method (see Sect. 1.2) to overcome this issue by simulating approximate paths of \({\textbf{X}}\), evolving the process with fixed time steps and keeping reaction rates fixed within each time step. Various simulation schemes have been subsequently proposed to deal with situations incorporating well-separated fast and slow time scales [1, 2, 11, 14, 40, 45].
Various variance reduction techniques have been proposed in the SRN context to reduce the computational work to estimate \({\mathbb {E}}[g({\textbf{X}}(T))]\). Several multilevel Monte Carlo (MLMC) [21, 22] based methods have been proposed to address specific challenges in this context [3, 10, 11, 38, 40]. Furthermore, as naive MC and MLMC estimators fail to efficiently and accurately estimate rare event probabilities, different IS approaches [15, 16, 23, 24, 36, 46, 47] have been proposed.
The current paper proposes a path-dependent IS approach based on an approximate TL scheme to improve the MC estimator efficiency, and hence efficiently estimate various statistical quantities for SRNs (particularly rare event probabilities). Our class of probability measure change is based on modifying the Poisson random variable rates used to construct the TL paths. In particular, optimal IS parameters are obtained by minimizing the second moment of the IS estimator (equivalently, the variance), which represents the cost function for the associated SOC problem. We show that the corresponding value function solves a dynamic programming relation that is challenging to solve analytically (see Sect. 2.1). We approximate the dynamic programming equation to derive a closed-form solution and near-optimal control parameters. The cost to solve the associated backward equation numerically in multi-dimensional settings increases exponentially with respect to the dimension (i.e., the curse of dimensionality). Thus, we propose approximating the resulting value function using a neural network to overcome this issue. Utilizing the optimality criterion for the SOC problem, we obtain a relationship between optimal IS parameters and the value function. Finally, we employ a stochastic optimization algorithm to learn the corresponding neural network parameters. Our analysis and numerical results for different dimensions confirm that the proposed estimator considerably reduces the variance compared with the standard TL-MC method with a negligible additional cost. This allows rare event probabilities to be efficiently computed in a regime where standard TL-MC estimators commonly fail.
The proposed approach is more computationally efficient than previously proposed IS schemes in this context ([15, 16, 23, 24, 36, 46, 47]) because it is based on an approximate TL scheme rather than the exact scheme. In contrast to previous approaches, the change of measure is systematically derived to ensure convergence to the optimal measure within the chosen class of probability measures, minimizing MC estimator variance. The novelty of this work is establishing a connection between IS and SOC in the context of pure jump processes, particularly for SRNs, with an emphasis on related practical and numerical aspects. Note that some previous studies [7, 17, 20, 28,29,30,31, 33, 41, 49] have established a similar connection, mainly in the diffusion dynamics context, with less focus on pure jump dynamics. In this work, the proposed methodology is based on an approximate explicit TL scheme, which could subsequently be extended in future work to continuous-time formulations (exact schemes) and to implicit TL schemes, which are relevant for systems with fast and slow time scales.
The remainder of this paper is organized as follows. Sections 1.1, 1.2, 1.3 and 1.4 define relevant SRN, TL, MC and IS concepts, respectively. Section 2 establishes the connection between IS and SOC, formulating the SOC problem and defining its main ingredients: controls, cost function, and value function; it then presents the dynamic programming relation satisfied by the optimal controls. Section 2.3 develops the proposed learning-based IS approach appropriate for multi-dimensional SRNs. Section 3 provides selected numerical experiments for different dimensions to illustrate the proposed approach's efficiency compared with standard MC approaches. Finally, Sect. 4 summarizes and concludes the work, and discusses possible future research directions.
1.1 Stochastic reaction networks (SRNs)
We are interested in the time evolution for a homogeneously mixed chemical reacting system described by the Markovian pure jump process, \({\textbf{X}}:[0,T]\times \Omega \rightarrow {\mathbb {N}}^d\), where (\(\Omega \), \({\mathcal {F}}\), \({\mathbb {P}}\)) is a probability space. In this framework, we assume that d different species interact through J reaction channels. The i-th component, \(X_i(t)\), describes the abundance of the i-th species present in the chemical system at time t. This work studies the time evolution of the state vector \({\textbf{X}}(t)=\left( X_1(t),\dots ,X_d(t)\right) ^\top \in {\mathbb {N}}^d\).
Each reaction channel \({\mathcal {R}}_j\) is a pair \((a_j, \varvec{\nu }_{j})\) defined by its propensity function \(a_{j}:{\mathbb {R}}^{d} \rightarrow {\mathbb {R}}_{+}\) and stoichiometric vector \( \varvec{\nu }_{j}=( \nu _{j,1},\nu _{j,2},\ldots , \nu _{j,d})^\top \) satisfying
Thus, the probability of observing a jump in the process \({\textbf{X}}\) from state \({\textbf{x}}\) to state \({\textbf{x}} + \varvec{\nu }_{j}\), a consequence of reaction \({\mathcal {R}}_{j}\) firing during the small time interval \((t, t + \Delta t]\), is proportional to the time interval length \(\Delta t\),
$${\mathbb {P}}\left( {\textbf{X}}(t+\Delta t)={\textbf{x}}+\varvec{\nu }_{j}\,\big \vert \,{\textbf{X}}(t)={\textbf{x}}\right) =a_{j}({\textbf{x}})\,\Delta t+o\left( \Delta t\right) ,$$(1.2)
where \(a_{j}({\textbf{x}})\) is the proportionality constant. We set \(a_j({\textbf{x}})=0\) for \({\textbf{x}}\) such that \({\textbf{x}}+\varvec{\nu }_j\notin {\mathbb {N}}^d\) (i.e., the non-negativity assumption: the system can never produce negative population values).
Hence, from (1.2), process \({\textbf{X}}\) is a continuous-time, discrete-space Markov chain that can be characterized by Kurtz's random time change representation [19],
$${\textbf{X}}(t)={\textbf{x}}_{0}+\sum _{j=1}^{J} Y_{j}\left( \int _{0}^{t} a_{j}({\textbf{X}}(s))\,\textrm{d}s\right) \varvec{\nu }_{j},$$(1.3)
where \(Y_j:{\mathbb {R}}_+{\times } \Omega \rightarrow {\mathbb {N}}\) are independent unit-rate Poisson processes. Conditions on the reaction channels can be imposed to ensure uniqueness [5] and avoid explosions in finite time [18, 27, 44].
Applying the stochastic mass-action kinetics principle, we can assume that the propensity function \(a_j(\cdot )\) for reaction channel \({\mathcal {R}}_j\), represented as
$$\sum _{i=1}^{d}\alpha _{j,i}\, S_i \xrightarrow {\theta _j} \sum _{i=1}^{d}\beta _{j,i}\, S_i \quad \text {(see Footnote 1)},$$
obeys
$$a_{j}({\textbf{x}})=\theta _{j}\prod _{i=1}^{d}\frac{x_{i}!}{(x_{i}-\alpha _{j,i})!}\,\varvec{1}_{\{x_i\ge \alpha _{j,i}\}},$$(1.5)
where \(\{\theta _j\}_{j=1}^J\) represents positive constant reaction rates, and \(x_i\) is the counting number for species \(S_i\).
1.2 Explicit tau-leap approximation
The explicit-TL scheme is a pathwise approximate method [6, 26] to overcome computational drawbacks for exact methods (i.e., when many reactions fire during a short time interval). This scheme can be derived from the random time change representation (1.3) by approximating the integral \(\int _{t_i}^{t_{i+1}} a_{j}({\textbf{X}}(s)) \textrm{d}s \) as \(a_j({\textbf{X}}(t_i))\,(t_{i+1}-t_i)\), i.e., using the forward-Euler method with time mesh \(\{t_{0}=0, t_{1},\ldots ,t_{N}= T\}\) and size \(\Delta t=\frac{T}{N}\). Thus, the explicit-TL approximation for \({\textbf{X}}\) should satisfy, for \(k\in \{1,2,\ldots ,N\}\),
$$\widehat{{\textbf{X}}}^{\Delta t}_{k}={\textbf{x}}_{0}+\sum _{j=1}^{J} Y_{j}\left( \sum _{i=0}^{k-1} a_{j}\left( \widehat{{\textbf{X}}}^{\Delta t}_{i}\right) \Delta t\right) \varvec{\nu }_{j},$$
and given \(\widehat{{\textbf{X}}}_0:= {\textbf{x}}_{0}\), we iteratively simulate a path for \(\widehat{{\textbf{X}}}^{\Delta t}\) as
$$\widehat{{\textbf{X}}}^{\Delta t}_{k+1}=\widehat{{\textbf{X}}}^{\Delta t}_{k}+\sum _{j=1}^{J} {\mathcal {P}}_{k,j}\left( r_{k,j}\right) \varvec{\nu }_{j},\quad k=0,1,\dots ,N-1,$$(1.7)
where, conditioned on the current state \(\widehat{{\textbf{X}}}^{\Delta t}_{k}\), \(\{{\mathcal {P}}_{k,j}(r_{k,j})\}_{\{1\le j\le J \}}\) are independent Poisson random variables with respective rates \(r_{k,j}:=a_{j}(\widehat{{\textbf{X}}}^{\Delta t}_{k})\Delta t\).
The explicit-TL path \(\widehat{{\textbf{X}}}^{\Delta t}\) is defined only at time mesh points, but can be naturally extended to [0, T] as a piecewise constant path. We apply the projection to zero to prevent the process from exiting the lattice (i.e., producing negative values); hence, (1.7) becomes
$$\widehat{{\textbf{X}}}^{\Delta t}_{k+1}=\max \left( \varvec{0},\,\widehat{{\textbf{X}}}^{\Delta t}_{k}+\sum _{j=1}^{J} {\mathcal {P}}_{k,j}\left( r_{k,j}\right) \varvec{\nu }_{j}\right) ,$$(1.8)
where the maximum is applied entry-wise. In this work, we use uniform time steps with length \(\Delta t\), but the explicit-TL scheme and the proposed IS scheme (see Sect. 2) can also be applied to non-uniform time meshes.
1.3 Biased Monte Carlo estimator
Let \({\textbf{X}}\) be a stochastic process and \(g: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) a scalar observable. We want to approximate \({\mathbb {E}} \left[ g({\textbf{X}}(T))\right] \), but rather than sampling directly from \({\textbf{X}}(T)\), we sample from \(\overline{{\textbf{X}}}^{\Delta t}(T)\), which are random variables generated by a numerical scheme with step size \(\Delta t\). We assume that variates \(\overline{{\textbf{X}}}^{\Delta t}(T)\) are generated with an algorithm with weak order \({{\mathcal {O}}}\left( \Delta t\right) \), i.e., for sufficiently small \(\Delta t\),
$$\left| {\mathbb {E}}\left[ g({\textbf{X}}(T))\right] -{\mathbb {E}}\left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] \right| \le C\,\Delta t,$$(1.9)
where \(C>0\) (see Footnote 2).
Let \(\mu _{M}\) be the standard MC estimator for \({\mathbb {E}} \left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] \),
$$\mu _{M}:=\frac{1}{M}\sum _{m=1}^{M} g\left( \overline{{\textbf{X}}}^{\Delta t}_{[m]}(T)\right) ,$$(1.10)
where \(\{\overline{{\textbf{X}}}^{\Delta t}_{[m]}(T)\}_{m=1}^M\) are independent and distributed as \(\overline{{\textbf{X}}}^{\Delta t}(T)\).
The global error for the proposed MC estimator has error decomposition
$$\left| {\mathbb {E}}\left[ g({\textbf{X}}(T))\right] -\mu _M\right| \le \underbrace{\left| {\mathbb {E}}\left[ g({\textbf{X}}(T))-g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] \right| }_{\text {bias}}+\underbrace{\left| {\mathbb {E}}\left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] -\mu _M\right| }_{\text {statistical error}}.$$(1.11)
To achieve the desired accuracy, \(\text {TOL}\), it is sufficient to bound the bias and the statistical error by \(\frac{\text {TOL}}{2}\) each. From (1.9), choosing step size
$$\Delta t=\frac{\text {TOL}}{2C}$$(1.12)
ensures a bias of at most \(\frac{\text {TOL}}{2}\).
Thus, considering the central limit theorem, the statistical error can be approximated as
$$\left| {\mathbb {E}}\left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] -\mu _M\right| \approx C_{\alpha }\sqrt{\frac{\text {Var}\left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] }{M}},$$(1.13)
where constant \(C_{\alpha }\) is the \((1-\frac{\alpha }{2})\)-quantile for the standard normal distribution. We choose \(C_{\alpha }=1.96\) for a \(95\%\) confidence level corresponding to \(\alpha =0.05\). Choosing
$$M=\frac{4\,C_{\alpha }^2\,\text {Var}\left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] }{\text {TOL}^2}$$(1.14)
sample paths ensures that the statistical error is approximately bounded by \(\frac{\text {TOL}}{2}\).
Given that the computational cost to simulate a single path is \({{\mathcal {O}}}\left( {\Delta t}^{-1}\right) \), the expected total computational complexity is \({{\mathcal {O}}}\left( \text {TOL}^{-3}\right) \), and the complexity scales linearly with \(\text {Var}\left[ g(\overline{{{\textbf {X}}}}^{\Delta t}(T))\right] \) (see (1.14)).
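As a small worked illustration of this error splitting, the following sketch computes the step size from (1.12) and the sample size from (1.14) for a prescribed tolerance; the weak-error constant C and the observable's variance are assumed to be known or estimated in advance.

```python
import math

def mc_parameters(tol, C, var_g, c_alpha=1.96):
    """Split the accuracy budget TOL: bias <= TOL/2 via the step size (1.12),
    statistical error <= TOL/2 via the sample size (1.14)."""
    dt = tol / (2.0 * C)                                  # from the weak-error bound (1.9)
    M = math.ceil((2.0 * c_alpha / tol) ** 2 * var_g)     # M = 4 C_alpha^2 Var / TOL^2
    return dt, M

# e.g., TOL = 1e-2, weak-error constant C = 1, Var[g] ~ q(1-q) for an indicator
print(mc_parameters(1e-2, C=1.0, var_g=1e-3))             # -> (0.005, 154)
```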
1.4 Importance sampling
Importance sampling (IS) techniques, when used appropriately, reduce the computational cost of the crude MC estimator through variance reduction. To motivate these techniques, consider estimating rare event probabilities, for which the crude MC method is prohibitively expensive. In particular, consider estimating \(q={\mathbb {P}}(Y>\gamma )={\mathbb {E}}[{\textbf {1}}_{\{Y>\gamma \}}]\), where Y is a random variable taking values in \({\mathbb {R}}\) with probability density function \(\rho _{Y}\). Let \(\gamma \) be sufficiently large that q becomes sufficiently small. We can approximate q using the MC estimator
$$\widehat{q}_{MC}=\frac{1}{M}\sum _{i=1}^{M}{\textbf {1}}_{\{Y^{(i)}>\gamma \}},$$(1.15)
where \(\{Y^{(i)}\}_{i=1}^{M}\) are independent and identically distributed (i.i.d.) realizations sampled according to \(\rho _Y\). The MC estimator variance is
$$\text {Var}\left[ \widehat{q}_{MC}\right] =\frac{q(1-q)}{M}.$$(1.16)
For a sufficiently small q, we can use (1.16) and the central limit theorem to approximate the relative error as
$$\frac{C_{\alpha }\sqrt{\text {Var}\left[ \widehat{q}_{MC}\right] }}{q}=C_{\alpha }\sqrt{\frac{1-q}{q\,M}}\approx \frac{C_{\alpha }}{\sqrt{q\,M}},$$(1.17)
where \(C_{\alpha }\) is chosen as in (1.13).
The number of required samples to attain a relative error tolerance \(TOL_{rel}\) is \(M\approx \frac{C_{\alpha }^2}{q\cdot TOL_{rel}^2}\). Thus, for q of the order of \(10^{-8}\), the number of required samples such that \(TOL_{rel} = 5\%\) is approximately equal to \(1.5\cdot 10^{11}\).
To demonstrate the IS concept, consider the general problem of estimating \({\mathbb {E}}[g(Y)]\), where g is a given observable. In the previous example, g was chosen as \(g(y)=\textbf{1}_{\{y>\gamma \}}\). Let \({\widehat{\rho }}_Z\) be the probability density function for a new real random variable Z, such that \(g\cdot \rho _Y\) is dominated by \({\widehat{\rho }}_Z\), i.e.,
$${\widehat{\rho }}_Z(x)=0 \implies g(x)\,\rho _Y(x)=0,$$(1.18)
for all \(x\in {\mathbb {R}}\). This permits the quantity of interest to be expressed as
$${\mathbb {E}}\left[ g(Y)\right] =\int _{{\mathbb {R}}} g(x)\,\frac{\rho _Y(x)}{{\widehat{\rho }}_Z(x)}\,{\widehat{\rho }}_Z(x)\,\textrm{d}x={\mathbb {E}}\left[ g(Z)\cdot L(Z)\right] ,$$(1.19)
where \(L(x):=\frac{\rho _Y(x)}{{\widehat{\rho }}_Z(x)}\) denotes the likelihood ratio. Hence, the expected value under the new measure remains unchanged, but the variance can be reduced owing to a different second moment, \({\mathbb {E}}\left[ \left( g(Z)\cdot L(Z)\right) ^2\right] \).
The MC estimator under the IS measure is
$$\mu _{M}^{IS}:=\frac{1}{M}\sum _{j=1}^{M} g\left( Z_{[j]}\right) \cdot L\left( Z_{[j]}\right) ,$$(1.20)
where \(Z_{[j]}\) are i.i.d samples from \({\widehat{\rho }}_Z\) for \(j=1,\dots ,M\).
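The following self-contained sketch illustrates the density change on a Gaussian toy problem of our choosing (not from the paper): the IS density \({\widehat{\rho }}_Z\) is a unit normal shifted into the rare set, and each sample is reweighted by the likelihood ratio \(L(z)=\rho _Y(z)/{\widehat{\rho }}_Z(z)\).

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, M = 5.0, 10**5

# Crude MC for q = P(Y > gamma) with Y ~ N(0,1): almost every sample contributes 0
y = rng.standard_normal(M)
q_mc = np.mean(y > gamma)                        # typically exactly 0 for q ~ 3e-7

# IS: sample Z ~ N(gamma, 1) and reweight by L(z) = rho_Y(z) / rho_Z(z)
z = gamma + rng.standard_normal(M)
L = np.exp(-0.5 * z**2 + 0.5 * (z - gamma)**2)   # ratio of the two normal densities
q_is = np.mean((z > gamma) * L)                  # unbiased, far smaller variance
```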
The main challenge when using IS is choosing a new probability measure that substantially reduces the variance compared with the original measure. This step strongly depends on the structure of the problem under consideration. Further, the new measure should be obtained with negligible computational cost to ensure a computationally efficient IS scheme. This is particularly challenging in the present problem, since we are considering path-dependent probability measures. In particular, the aim is to introduce a path-dependent change of probability measure that corresponds to changing the Poisson random variable rates used to construct the TL paths. Section 2.1 shows how the optimal IS parameters can be obtained using a novel connection with SOC.
2 Importance sampling (IS) via stochastic optimal control (SOC)
2.1 Dynamic programming for the importance sample parameters
This section establishes the connection between optimal IS measure determination within a class of probability measures and SOC. Let \({\textbf{X}}\) be an SRN as defined in Sect. 1.1 and let \(\widehat{{\textbf{X}}}^{\Delta t}\) denote its TL approximation as given by (1.8). We aim to find a near-optimal IS measure to improve the MC estimator computational performance to estimate \({\mathbb {E}} \left[ g({\textbf{X}}(T))\right] \). Since finding the optimal path-dependent change of measure within the class of all measures presents a challenging problem, we limit ourselves to a parameterized class obtained by modifying the Poisson random variable rates of the TL paths. This class of measure change was previously used in [10] to improve the MLMC estimator robustness and performance in this context; we focus on a single-level MC setting, and seek to automate the task of finding a near-optimal IS measure within this class.
We introduce the change of measure resulting from changing the Poisson random variable rates in the TL scheme,
$$\overline{{\textbf{X}}}^{\Delta t}_{n+1}=\overline{{\textbf{X}}}^{\Delta t}_{n}+\sum _{j=1}^{J} {\mathcal {P}}_{n,j}\left( r_{n,j}\right) \varvec{\nu }_{j},\quad n=0,\dots ,N-1,$$(2.1)
where \(\delta _{n,j}^{\Delta t}({\textbf{x}})\in {\mathcal {A}}_{{\textbf{x}},j}\) is the control parameter at time step n, for reaction j, and in state \({\textbf{x}}\in {\mathbb {N}}^d\); and conditioned on \(\overline{{\textbf{X}}}^{\Delta t}_{n}\), \({\mathcal {P}}_{n,j}(r_{n,j})\) are independent Poisson random variables with respective rates \(r_{n,j}:=\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\Delta t\). The admissible set,
$${\mathcal {A}}_{{\textbf{x}},j}:={\left\{ \begin{array}{ll} (0,\infty ), &{}\quad a_j({\textbf{x}})>0,\\ \{0\}, &{}\quad a_j({\textbf{x}})=0, \end{array}\right. }$$(2.2)
is chosen such that (1.18) is fulfilled and to avoid infinite variance for the IS estimator. The control \(\delta _{n,j}^{\Delta t}({\textbf{x}})\in {\mathcal {A}}_{{\textbf{x}},j}\) depends deterministically on the current time step n, reaction channel j, and current state \({\textbf{x}}=\overline{{\textbf{X}}}^{\Delta t}_n\) for the TL-IS approximation in (2.3).
Therefore, the resulting scheme under the new measure is
$$\overline{{\textbf{X}}}^{\Delta t}_{n+1}=\max \left( \varvec{0},\,\overline{{\textbf{X}}}^{\Delta t}_{n}+\sum _{j=1}^{J} {\mathcal {P}}_{n,j}\left( r_{n,j}\right) \varvec{\nu }_{j}\right) ,\quad n=0,\dots ,N-1,$$(2.3)
and the likelihood ratio (see Footnote 3) at step n associated with the new IS measure is
$$L_n\left( \overline{{\textbf{P}}}_n,\varvec{\delta }_n^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\right) =\exp \left( -\sum _{j=1}^{J}\left( a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})-\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\right) \Delta t\right) \cdot \prod _{j=1}^{J}\left( \frac{a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})}{\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)}\right) ^{{\overline{P}}_{n,j}},$$(2.4)
where \(\varvec{\delta }_n^{\Delta t}({\textbf{x}}) \in \times _{j=1}^J {\mathcal {A}}_{{\textbf{x}},j}\) are the IS parameters with \(\left( \varvec{\delta }_n^{\Delta t}({\textbf{x}})\right) _j=\delta _{n,j}^{\Delta t}({\textbf{x}}) \) and the Poisson realizations are denoted by \(\overline{{\textbf{P}}}_n\) with \(\left( \overline{{\textbf{P}}}_n\right) _j:={\overline{P}}_{n,j}\) for \(j=1,\dots ,J\). Equation (2.4) uses the convention that \(\frac{a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})}{\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)}=1\), whenever \(a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})=0\) and \(\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)=0\). From (2.2), this results in a factor of one in the likelihood ratio for reactions with \(a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})=0\).
Therefore, the likelihood ratio for \(\{\overline{{\textbf{X}}}^{\Delta t}_n: n=0,\dots ,N\}\) across one path is
$$L\left( \overline{{\textbf{P}}}_0,\dots ,\overline{{\textbf{P}}}_{N-1};\varvec{\delta }^{\Delta t}_0,\dots ,\varvec{\delta }^{\Delta t}_{N-1}\right) =\prod _{n=0}^{N-1} L_n\left( \overline{{\textbf{P}}}_n,\varvec{\delta }_n^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\right) .$$(2.5)
This likelihood ratio completes the characterization for the proposed IS approach, and allows the quantity of interest with respect to the new measure to be expressed as
$${\mathbb {E}}\left[ g\left( \widehat{{\textbf{X}}}^{\Delta t}_N\right) \right] ={\mathbb {E}}\left[ L\cdot g\left( \overline{{\textbf{X}}}^{\Delta t}_N\right) \right] ,$$(2.6)
with the expectation in the right-hand side of (2.6) taken with respect to the dynamics in (2.3).
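A minimal sketch of the resulting IS-TL dynamics (2.3) with its running likelihood (2.4)–(2.5) could look as follows; the `controls` signature and the handling of zero-propensity channels are our assumptions.

```python
import numpy as np

def is_tau_leap_path(x0, propensities, controls, nu, T, N, rng):
    """One tau-leap path under the IS measure (2.3) and its likelihood factor (2.5).

    controls : function (n, x) -> array of the J tilted rates delta_{n,j}(x)
    Returns the terminal state and the accumulated likelihood L.
    """
    dt = T / N
    x = np.asarray(x0, dtype=float)
    lik = 1.0
    for n in range(N):
        a = propensities(x)
        delta = controls(n, x)
        p = rng.poisson(delta * dt)               # firings under the tilted rates
        # per-step likelihood (2.4); a_j = delta_j = 0 contributes a factor of one
        active = delta > 0
        lik *= np.exp(-np.sum((a - delta) * dt)) * np.prod(
            (a[active] / delta[active]) ** p[active])
        x = np.maximum(x + nu @ p, 0.0)           # projected step as in (2.3)
    return x, lik
```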
Hereinafter, we aim to determine optimal parameters \(\{\varvec{\delta }_n^{\Delta t}({\textbf{x}})\}_{n=0,\dots ,N-1; {\textbf{x}}\in {\mathbb {N}}^d}\) that minimize the second moment (and hence the variance) for the IS estimator, given that \(\overline{{\textbf{X}}}_0^{\Delta t}={\textbf{x}}_0\). To that end, we derive an associated SOC formulation. First we introduce the cost function for the proposed SOC problem in Definition 2.1, then derive a dynamic programming equation in Theorem 2.4 that is satisfied by the value function \(u_{\Delta t}(\cdot ,\cdot )\) in Definition 2.3. The proof for Theorem 2.4 is given in “Appendix A”.
Definition 2.1
(Second moment for the proposed importance sampling estimator) Let \(0 \le n \le N\). Given that \(\overline{{\textbf{X}}}_{n}^{\Delta t}= {\textbf{x}}\), the second moment for the proposed IS estimator can be expressed as
$$C_{n,{\textbf{x}}}\left( \varvec{\delta }^{\Delta t}_n,\dots ,\varvec{\delta }^{\Delta t}_{N-1}\right) :={\mathbb {E}}\left[ \prod _{k=n}^{N-1} L_k^2\left( \overline{{\textbf{P}}}_k,\varvec{\delta }_k^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_k)\right) \cdot g^2\left( \overline{{\textbf{X}}}_N^{\Delta t}\right) \,\Big \vert \,\overline{{\textbf{X}}}_{n}^{\Delta t}={\textbf{x}}\right] ,$$(2.7)
with terminal cost \(C_{N,{\textbf{x}}}={\mathbb {E}}\left[ g^2\left( \overline{{\textbf{X}}}_N^{\Delta t}\right) | \overline{{\textbf{X}}}_N^{\Delta t}={\textbf{x}}\right] =g^2({\textbf{x}})\), for any \({\textbf{x}} \in {\mathbb {N}}^d\).
Compared with the classical SOC formulation, (2.7) can be interpreted as the expected total cost, where the main difference is that (2.7) uses a multiplicative cost structure rather than the standard additive one. Therefore, we derive a dynamic programming relation in Theorem 2.4, associated with this cost structure, that is fulfilled by the corresponding value function (see Definition 2.3) in the SRN context.
Remark 2.2
(Structure of the cost function) One can derive an optimal control formulation with an additive structure (similar to [30] in the stochastic differential equation setting) by applying a logarithmic transformation together with Jensen's inequality to (2.7). This reduces the control problem to a Kullback–Leibler minimization. In [41, 42], this Kullback–Leibler minimization leads to the same optimal change of measure as the variance minimization approach. However, whether this conclusion carries over to the SRN setting requires further investigation, which we leave for future work.
Definition 2.3
(Value function) The value function \(u_{\Delta t}(\cdot ,\cdot )\) is defined as the optimal (infimum) second moment for the proposed IS estimator. For time step \(0 \le n \le N\) and state \({\textbf{x}} \in {\mathbb {N}}^d\),
$$u_{\Delta t}(n,{\textbf{x}}):=\inf _{\{\varvec{\delta }^{\Delta t}_k\}_{k=n}^{N-1}\in {\mathcal {A}}} C_{n,{\textbf{x}}}\left( \varvec{\delta }^{\Delta t}_n,\dots ,\varvec{\delta }^{\Delta t}_{N-1}\right) ,$$(2.8)
where \({\mathcal {A}}\) is the admissible set for the IS parameters; and \(u_{\Delta t}(N,{\textbf{x}})=g^2({\textbf{x}})\), for any \({\textbf{x}} \in {\mathbb {N}}^d\).
Theorem 2.4
(Dynamic programming for importance sampling parameters) For \({\textbf{x}}\in {\mathbb {N}}^d\) and \(n=N-1,\dots ,0\), the value function \(u_{\Delta t}(n,{\textbf{x}})\) fulfills the dynamic programming relation
$$u_{\Delta t}(n,{\textbf{x}})=\inf _{\varvec{\delta }\in \times _{j=1}^{J}{\mathcal {A}}_{{\textbf{x}},j}}\; e^{-\sum _{j=1}^{J}\left( 2a_j({\textbf{x}})-\delta _j\right) \Delta t}\sum _{{\textbf{p}}\in {\mathbb {N}}^{J}}\left( \prod _{j=1}^{J}\frac{1}{p_j!}\left( \frac{a_j^2({\textbf{x}})\,\Delta t}{\delta _j}\right) ^{p_j}\right) u_{\Delta t}\left( n+1,\max \left( {\textbf{x}}+\varvec{\nu }\,{\textbf{p}},\varvec{0}\right) \right) ,$$(2.9)
where \(\varvec{\nu }=\left( \varvec{\nu }_1, \dots ,\varvec{\nu }_J\right) \in {\mathbb {Z}}^{d\times J}\).
Theorem 2.4 breaks down the minimization problem into a simpler optimization that can be solved stepwise backward in time, starting from the final time T. Solving the minimization problem (2.9) analytically is difficult due to the infinite sum. Section 2.2 shows how to overcome this issue by approximating (2.9) to derive near-optimal parameters \(\{\varvec{\delta }_n^{\Delta t}({\textbf{x}})\}_{n=0,\dots ,N-1; {\textbf{x}}\in {\mathbb {N}}^d}\) for the proposed IS approach.
2.2 Approximate dynamic programming
Theorem 2.4 gives an exact solution for the optimal IS parameters resulting from modifying the Poisson random variable rates in the TL paths. However, the infinite sum must be evaluated in closed form to solve (2.9) analytically, which is generally difficult. Therefore, we propose approximating the value function \(u_{\Delta t}(n,{\textbf{x}})\) in (2.9) by \({\overline{u}}_{\Delta t}(n,{\textbf{x}})\) for all time steps \(n=0,\dots ,N\), reaction channels \(j=1,\dots ,J\), and states \({\textbf{x}}\in {\mathbb {N}}^d\). First, both \(u_{\Delta t}(n,{\textbf{x}})\) and \({\overline{u}}_{\Delta t}(n,{\textbf{x}})\) satisfy the same final condition,
$$u_{\Delta t}(N,{\textbf{x}})={\overline{u}}_{\Delta t}(N,{\textbf{x}})=g^2({\textbf{x}}),\quad {\textbf{x}}\in {\mathbb {N}}^d.$$(2.10)
Next, to derive the approximate dynamic programming relation for \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\), we presume Assumption 2.5 to hold. This assumption is motivated by the behavior of the original propensities, which are of \({{\mathcal {O}}}\left( 1\right) \) due to the mass-action kinetics principle (refer to (1.5)).
Assumption 2.5
The controls \(\{\varvec{\delta }_{n}^{\Delta t}\}_{n=0,\dots ,N-1}\) are asymptotically constant (i.e., \(\delta _{n,j}^{\Delta t}({\textbf{x}}) \rightarrow c_{n,j,{\textbf{x}}}\), as \(\Delta t \rightarrow 0\), where \(c_{n,j,{\textbf{x}}}\) are constants for \(1\le j \le J\), \(0\le n \le N-1\), and \({\textbf{x}}\in {\mathbb {N}}^d\)).
Given Assumption 2.5 and that \(\{a_j(\cdot )\}_{j=1}^J\) are of \({{\mathcal {O}}}\left( 1\right) \), we apply a Taylor expansion around \(\Delta t=0\) to the exponential term in (2.9), then truncate the expression within the infimum such that the remaining terms are \({{\mathcal {O}}}\left( \Delta t\right) \). This truncates the infinite sum and linearizes the exponential term. Thus, for \({\textbf{x}}\in {\mathbb {N}}^d\) and \(n=N-1,\dots ,0\),
$${\overline{u}}_{\Delta t}(n,{\textbf{x}})=\inf _{\delta _1,\dots ,\delta _J}\left( 1-\sum _{j=1}^{J}\left( 2a_j({\textbf{x}})-\delta _j\right) \Delta t\right) {\overline{u}}_{\Delta t}(n+1,{\textbf{x}})+\sum _{j=1}^{J}\frac{a_j^2({\textbf{x}})\,\Delta t}{\delta _j}\,{\overline{u}}_{\Delta t}\left( n+1,\max \left( {\textbf{x}}+\varvec{\nu }_j,\varvec{0}\right) \right) ,$$(2.11)
where \(\delta _j \in {\mathcal {A}}_{{\textbf{x}},j}\), \(j=1,\dots ,J\), are the SOC parameters at state \({\textbf{x}}\) for reaction j, and the admissible set \({\mathcal {A}}_{{\textbf{x}},j}\) is defined in (2.2). Assumption 2.5 ensures that (i) the Taylor expansion of the exponential term is valid as \(\Delta t\) decreases, and (ii) all truncated terms are of order \(\Delta t^2\), so (2.11) retains every contribution of order \(\Delta t\).
The infimum in (2.11) is attained when
$${\overline{u}}_{\Delta t}(n+1,{\textbf{x}})>0\quad \text {and}\quad {\overline{u}}_{\Delta t}\left( n+1,\max \left( {\textbf{x}}+\varvec{\nu }_j,\varvec{0}\right) \right) >0,\quad j=1,\dots ,J.$$(2.12)
In this case, the approximate optimal SOC parameter \({\overline{\delta }}^{\Delta t}_{n,j}({\textbf{x}})\) can be analytically determined as
$${\overline{\delta }}^{\Delta t}_{n,j}({\textbf{x}})=a_j({\textbf{x}})\sqrt{\frac{{\overline{u}}_{\Delta t}\left( n+1,\max \left( {\textbf{x}}+\varvec{\nu }_j,\varvec{0}\right) \right) }{{\overline{u}}_{\Delta t}(n+1,{\textbf{x}})}}.$$(2.13)
Note that (2.13) includes the particular case when \(a_j({\textbf{x}})=0\) for some \(j\in \{1,\dots ,J\}\). In such a case, \({\overline{\delta }}_{n,j}^{\Delta t}({\textbf{x}})=0\), which agrees with (2.2).
An important advantage of this numerical approximation, \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\), is that it reduces the original optimization problem at each step in (2.9) from a simultaneous optimization over J variables to J independent one-dimensional problems that can be solved in parallel using (2.13).
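For intuition, the backward recursion is easy to write down in one dimension with a single reaction; the following sketch is ours (the names, the truncation at `S_max`, and the small floor enforcing (2.12) are assumptions), and \(\Delta t\) must be small enough that the bracket in (2.11) stays positive.

```python
import numpy as np

def approximate_dp_1d(a, nu, g, S_max, T, N):
    """Backward recursion for the approximate value function (2.11) and the
    near-optimal controls (2.13), for d = 1 and J = 1 on {0, ..., S_max}.

    a : propensity function a(x) (vectorized); nu : integer stoichiometric coefficient
    g : observable (vectorized); returns u(0, .) and the controls delta[n, x].
    """
    dt = T / N
    states = np.arange(S_max + 1)
    u = np.maximum(g(states).astype(float) ** 2, 1e-12)  # u(N, x) = g(x)^2, kept > 0 for (2.12)
    delta = np.zeros((N, S_max + 1))
    for n in range(N - 1, -1, -1):
        nxt = np.clip(states + nu, 0, S_max)             # max(x + nu, 0), capped at S_max
        ax = a(states)
        delta[n] = ax * np.sqrt(u[nxt] / u)              # near-optimal control (2.13)
        # one backward step of (2.11), evaluated at the minimizer
        u = (1.0 - 2.0 * ax * dt) * u + 2.0 * ax * dt * np.sqrt(u * u[nxt])
    return u, delta
```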
Remark 2.6
(Assumption (2.12)) Whether the assumption in (2.12) is generally fulfilled depends on the method employed to solve the dynamic programming principle in (2.11). For example, in a direct numerical implementation, either special numerical treatment is required for the cases where (2.12) is violated, or some regularization is required to ensure well-posedness. The proposed approach from Sect. 2.3 avoids this issue because we model \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\) with a strictly positive ansatz function, which guarantees that condition (2.12) holds for any state \({\textbf{x}}\) and all time steps n.
Remark 2.7
(Computational cost for dynamic programming) To derive a practical numerical algorithm for a finite number of states, we truncate the infinite state space \({\mathbb {N}}^d\) to
$${\mathcal {S}}:=\{0,\dots ,{\overline{S}}_1\}\times \dots \times \{0,\dots ,{\overline{S}}_d\},$$
where \({\overline{S}}_1,\dots ,{\overline{S}}_d\) are sufficiently large upper bounds. The computational cost to numerically solve the dynamic programming equation (2.11) for step size \(\Delta t\) and state space \({\mathcal {S}}\) can be expressed as
$$W_{DP}\left( \Delta t,{\mathcal {S}}\right) ={{\mathcal {O}}}\left( \Delta t^{-1}\cdot J\cdot \prod _{i=1}^{d}\left( {\overline{S}}_i+1\right) \right) \le {{\mathcal {O}}}\left( \Delta t^{-1}\cdot J\cdot \left( {\overline{S}}^*+1\right) ^{d}\right) ,$$(2.14)
where \({\overline{S}}^*=\max _{i=1,\dots ,d}{\overline{S}}_i\).
The cost in (2.14) scales exponentially with dimension d. Section 2.3 proposes an alternative approach to address this curse of dimensionality. However, in future work, we aim to combine dimension reduction techniques for SRNs with a direct numerical implementation of dynamic programming.
2.3 Learning-based approach
Using the SOC formulation derived in Sect. 2.2, we propose approximating the value function \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\) with a parameterized ansatz function, \({\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\).
Remark 2.8
(Choosing the ansatz function) The parameterized ansatz function \({\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) should respect the final condition of the value function (2.9), and its choice depends on the given SRN and observable \(g({\textbf{x}})\). For linear observables, such as \(g({\textbf{x}})=x_i\), we can consider polynomial basis functions as an ansatz. For more complex problems, the ansatz function can be a small neural network.
For rare event applications with observable \(g(\mathbf {{\textbf {x}}})={\textbf{1}}_{\{x_i>\gamma \}}\), we consider a sigmoid with learning parameters \(\varvec{\beta }=\left( \varvec{\beta }^{space},\beta ^{time}\right) \in {\mathbb {R}}^{d+1}\) as the ansatz function
$${\widehat{u}}(t,{\textbf{x}};\varvec{\beta })=\left( 1+\exp \left( -\left( \beta _0\,x_i+b_0+\langle \varvec{\beta }^{space},{\textbf{x}}\rangle +\beta ^{time}\,t\right) \right) \right) ^{-1},$$(2.15)
where \(\langle \cdot ,\cdot \rangle \) denotes the inner product, and the time is scaled to one using \(t\in [0,1]\).
Parameters \(b_0\) and \(\beta _0\) are not learned through optimization but determined by fitting the final condition for Theorem 2.4, which imposes \({\widehat{u}}(1,{\textbf{x}};\varvec{\beta })\approx g^2({\textbf{x}})={\textbf{1}}_{\{x_i>\gamma \}}\). Therefore, the discontinuous indicator function is approximated by a sigmoid, and the fit is characterized by the position of the sigmoid’s inflection point and the sharpness of the slope. The position and value of local and global minima with respect to the learned parameters \(\varvec{\beta }^{space}\) and \({\beta }^{time}\) depend on the choices for \(b_0\) and \({\beta }_0\).
To derive the IS parameters from the ansatz function, we use the previous SOC result (2.13), i.e.,
$${\widehat{\delta }}^{\Delta t}_{j}(n,{\textbf{x}};\varvec{\beta })=a_j({\textbf{x}})\sqrt{\frac{{\widehat{u}}\left( \frac{(n+1)\Delta t}{T},\max \left( {\textbf{x}}+\varvec{\nu }_j,\varvec{0}\right) ;\varvec{\beta }\right) }{{\widehat{u}}\left( \frac{(n+1)\Delta t}{T},{\textbf{x}};\varvec{\beta }\right) }},\quad n=0,\dots ,N-1,\; j=1,\dots ,J.$$(2.16)
We define \({\widehat{u}}(t,\cdot ;\cdot )\) in (2.15) as a time-continuous function for \(t\in [0,1]\), whereas the IS controls \({\widehat{\delta }}^{\Delta t}_{j}(n,\cdot ;\cdot )\) are discrete in time for \(n=0,\dots ,N-1\) and depend on the time step size \(\Delta t\). Therefore, \({\widehat{u}}(\cdot ,\cdot ;\varvec{\beta })\) can be used to derive control parameters in (2.16) for arbitrary \(\Delta t\).
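A small sketch of the ansatz-to-control mapping could read as follows; the square-root mapping follows (2.16), whereas the precise way \(\beta _0\) and \(b_0\) enter the sigmoid argument is our assumption.

```python
import numpy as np

def u_hat(t, x, beta_space, beta_time, beta0, b0, i):
    """Sigmoid ansatz in the spirit of (2.15); beta0, b0 are fixed by the final
    condition (the exact form of the argument is an assumption)."""
    arg = beta0 * x[i] + b0 + np.dot(beta_space, x) + beta_time * t
    return 1.0 / (1.0 + np.exp(-arg))

def delta_hat(n, x, a, nu, N, beta_space, beta_time, beta0, b0, i):
    """IS rates via (2.16): delta_j = a_j(x) * sqrt(u_hat(next state) / u_hat(x))."""
    t = (n + 1) / N                                   # time scaled to [0, 1]
    u_here = u_hat(t, x, beta_space, beta_time, beta0, b0, i)
    d = np.zeros(len(a))
    for j in range(len(a)):
        xj = np.maximum(x + nu[:, j], 0.0)            # projected neighbor state
        d[j] = a[j] * np.sqrt(u_hat(t, xj, beta_space, beta_time, beta0, b0, i) / u_here)
    return d
```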
The parameters \(\varvec{\beta }\) for the ansatz function are then chosen to minimize the second moment,
$$\min _{\varvec{\beta }\in {\mathbb {R}}^{d+1}} C_{0,{\textbf{x}}_0}\left( \widehat{\varvec{\delta }}^{\Delta t}_0,\dots ,\widehat{\varvec{\delta }}^{\Delta t}_{N-1};\varvec{\beta }\right) ,$$(2.17)
where \(\{\overline{{\textbf{X}}}_n^{\Delta t,\varvec{\beta }}\}_{n=1,\dots ,N}\) is the IS path generated using IS parameters from (2.16) and \(\left( \widehat{\varvec{\delta }}^{\Delta t}(n,{\textbf{x}};\varvec{\beta })\right) _j={\widehat{\delta }}^{\Delta t}_{j}(n,{\textbf{x}};\varvec{\beta })\) for \(1\le j\le J\).
We use a gradient-based stochastic optimization method to solve (2.17), and derive Lemma 2.9 (proof in "Appendix B") for the gradient of the second moment with respect to the parameters \(\varvec{\beta }\).
Lemma 2.9
The partial derivatives for the second moment \(C_{0,{\textbf{x}}}\left( \widehat{\varvec{\delta }}^{\Delta t}_0,\dots ,\widehat{\varvec{\delta }}^{\Delta t}_{N-1}; \varvec{\beta }\right) \) in (2.17) with respect to \(\beta _{l}\), \(l=1,\dots , (d+1)\), are given by
$$\frac{\partial }{\partial \beta _l} C_{0,{\textbf{x}}}\left( \widehat{\varvec{\delta }}^{\Delta t}_0,\dots ,\widehat{\varvec{\delta }}^{\Delta t}_{N-1};\varvec{\beta }\right) ={\mathbb {E}}\left[ g^2\left( \overline{{\textbf{X}}}_N^{\Delta t,\varvec{\beta }}\right) \prod _{k=0}^{N-1} L_k^2\;\sum _{k=0}^{N-1}\sum _{j=1}^{J}\left( \Delta t-\frac{{\overline{P}}_{k,j}}{{\widehat{\delta }}_j^{\Delta t}\left( k,\overline{{\textbf{X}}}_k^{\Delta t,\varvec{\beta }};\varvec{\beta }\right) }\right) \frac{\partial {\widehat{\delta }}_j^{\Delta t}}{\partial \beta _l}\left( k,\overline{{\textbf{X}}}_k^{\Delta t,\varvec{\beta }};\varvec{\beta }\right) \right] ,$$(2.18)
where \(\{\overline{{\textbf{X}}}_n^{\Delta t,\varvec{\beta }}\}_{n=1,\dots ,N}\) is the IS path generated using the IS parameters from (2.16), and
$$L_k:=L_k\left( \overline{{\textbf{P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}\left( k,\overline{{\textbf{X}}}_k^{\Delta t,\varvec{\beta }};\varvec{\beta }\right) \right) $$(2.19)
denotes the per-step likelihood factor (2.4).
Thus, the partial derivatives for \({\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) for the ansatz (2.15) are
$$\frac{\partial {\widehat{u}}}{\partial \left( \varvec{\beta }^{space}\right) _{i}}(t,{\textbf{x}};\varvec{\beta })={\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\left( 1-{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\right) x_i,\qquad \frac{\partial {\widehat{u}}}{\partial \beta ^{time}}(t,{\textbf{x}};\varvec{\beta })={\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\left( 1-{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\right) t,$$(2.20)
where \(\left( \varvec{\beta }^{space}\right) _{i}\) denotes the i-th entry for \(\varvec{\beta }^{space}\).
For an ansatz function different from (2.15), the gradient is still given by Lemma 2.9; only the derivation of \(\frac{\partial }{\partial \beta _l}{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) in (2.20) changes accordingly.
By estimating the gradient in (2.18) using an MC estimator, we iteratively optimize the parameters \(\varvec{\beta }\) to reduce the variance. For this optimization, we use the Adam optimizer with the parameter values suggested in [34], except that the step size is tuned to our problem setting.
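For concreteness, a generic Adam loop of the kind used here can be sketched as follows; `grad_estimator` is a placeholder for an MC estimate of the gradient (2.18), and the hyperparameters follow the defaults of [34] except for the step size \(\alpha \).

```python
import numpy as np

def adam_optimize(grad_estimator, beta_init, steps=100, alpha=0.1,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Adam [34] applied to the variance-minimization problem (2.17).
    grad_estimator(beta) must return an MC estimate of the gradient (2.18)."""
    beta = np.array(beta_init, dtype=float)
    m = np.zeros_like(beta)                 # first-moment accumulator
    v = np.zeros_like(beta)                 # second-moment accumulator
    for k in range(1, steps + 1):
        g = grad_estimator(beta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**k)             # bias correction
        v_hat = v / (1 - b2**k)
        beta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return beta
```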
In Sect. 3, we numerically illustrate the variance reduction achieved by the proposed learning-based IS method. Further theoretical and numerical analysis of this approach is left for future work, particularly the initialization of the learned parameters \(\beta ^{time}\) and \(\varvec{\beta }^{space}\) in (2.15) and the investigation of a stopping rule.
To derive an estimator for \({\mathbb {E}}[g({\textbf{X}}(T))]\) using the proposed IS change of measure, we first solve the related SOC problem using the approach from this section; then we simulate M paths under the new IS measure. Thus, the MC estimator using the proposed IS change of measure over M paths becomes
$$\mu _{M}^{IS}:=\frac{1}{M}\sum _{i=1}^{M} L_{[i]}\cdot g\left( \overline{{\textbf{X}}}_{[i],N}^{\Delta t,\varvec{\beta }}\right) ,$$(2.21)
where \(\overline{{\textbf{X}}}_{[i],N}^{\Delta t,\varvec{\beta }}\) is the i-th IS sample path and the corresponding likelihood factor from (2.5) is
$$L_{[i]}=\prod _{n=0}^{N-1} L_n\left( \overline{{\textbf{P}}}_{[i],n},\widehat{\varvec{\delta }}^{\Delta t}\left( n,\overline{{\textbf{X}}}_{[i],n}^{\Delta t,\varvec{\beta }};\varvec{\beta }\right) \right) .$$(2.22)
Remark 2.10
The explicit pathwise derivatives in Lemma 2.9 have the following advantages compared with a finite difference approach: (i) the pathwise derivatives are unbiased with respect to the TL scheme, so the only error in evaluating the expectation is the MC error (i.e., there is no additional finite difference error); and (ii) the gradient computation in (2.18) requires estimating an expected value with a high relative error because g is fitted to an indicator function; using the IS-TL paths, we better control the related statistical error.
2.4 Computational cost for the learning-based approach
This section discusses the computational complexity of the learning-based approach to achieve a prescribed tolerance \(\text {TOL}\). Recall that the proposed approach comprises two steps; hence, two types of costs occur: (i) the offline cost of learning the ansatz function parameters \(\varvec{\beta }\), and (ii) the online cost of obtaining the MC estimator (2.21) based on M simulated paths under the derived IS measure (see (2.16)).
The offline cost for (i) can be expressed as
$$W_{pl}\left( I,M_0,\Delta t_{pl}\right) ={{\mathcal {O}}}\left( I\cdot M_0\cdot \Delta t_{pl}^{-1}\cdot \left( J\cdot C_{Poi}+C_{grad}\right) \right) ,$$(2.23)
where I is the number of optimizer steps, \(M_0\) is the number of paths used to estimate the gradient per optimizer step, \(C_{Poi}\) is the cost to generate one Poisson random variable, \(C_{grad}\) is the cost of the algebraic evaluation of (2.18) per update, and \(\Delta t_{pl}\) is the step size. In contrast to (2.14), this offline cost does not scale exponentially with dimension d.
Fig. 1 Example 3.1 with step size \(\Delta t_{pl}=\Delta t_f=1/2^4\) for the proposed IS-MC estimator: a sample mean; b squared coefficient of variation; c parameters; d kurtosis for each optimizer step. The Adam optimizer gradient, sample variance, and kurtosis were estimated using \(M_0=10^4\) samples. The reference value for the standard MC-TL approach was derived from a single run with \(M=10^6\) samples and step size \(\Delta t=1/2^4\)

Fig. 2 Example 3.2 with step size \(\Delta t_{pl}=\Delta t_f=1/2^4\) for the proposed IS-MC estimator: a sample mean; b squared coefficient of variation; c parameters; d kurtosis for each optimizer step. The gradient for the Adam optimization, the sample variance, and the kurtosis were estimated using \(M_0=10^5\) samples. Standard MC-TL with step size \(\Delta t=1/2^4\) and \(M=10^7\) samples was used for comparison

Fig. 3 Example 3.3 with step size \(\Delta t_{pl}=\Delta t_f=1/2^4\) for the proposed IS-MC estimators: a sample mean; b squared coefficient of variation; c parameters; d kurtosis for each optimizer step. The gradient for Adam optimization, the sample variance, and the kurtosis were estimated using \(M_0=10^5\) samples. Standard MC-TL with \(M=10^6\) samples and step size \(\Delta t=1/2^4\) was used for comparison
The cost for one IS-TL path based on \({\widehat{u}}(\cdot ,\cdot ;\varvec{\beta })\) is the same as for a standard TL path up to negligible additional factors, \(C_{{\widehat{\delta }}}\) for evaluating (2.16) and \(C_{lik}\) for the likelihood update (2.4),
$$W_{path}\left( \Delta t_f\right) ={{\mathcal {O}}}\left( \Delta t_f^{-1}\cdot \left( J\cdot C_{Poi}+C_{{\widehat{\delta }}}+C_{lik}\right) \right) ,$$
where \(\Delta t_f\) is the step size. Thus, the total cost is
$$W_{IS-TL}=W_{pl}\left( I,M_0,\Delta t_{pl}\right) +M\cdot W_{path}\left( \Delta t_f\right) .$$
Following the same derivation as for (1.11)–(1.14), we choose \(\Delta t_{f}=\frac{\text {TOL}}{2 \cdot C}\), where C is the constant from (1.9), to obtain the total computational complexity to achieve a prescribed tolerance \(\text {TOL}\),
$$W_{IS-TL}(\text {TOL})=W_{pl}\left( I,M_0,\Delta t_{pl}\right) +{{\mathcal {O}}}\left( \text {TOL}^{-3}\cdot \text {Var}\left[ g\left( \overline{{\textbf{X}}}^{\Delta t,\varvec{\beta }}_N\right) \cdot L\right] \right) ,$$
where L is the likelihood factor corresponding to the IS path \(\overline{{\textbf{X}}}^{\Delta t,\varvec{\beta }}\) (refer to (2.22)).
Our numerical simulations suggest that the variance reduction achieved with the proposed approach does not depend on \(\Delta t_{pl}\) (see Fig. 4). Therefore, we can keep the offline parameter learning cost, \(W_{pl} (I,M_0,\Delta t_{pl})\), low by using \(\Delta t_{pl}\gg \Delta t_{f}\).
For comparison, Sect. 1.3 shows that the standard MC-TL approach has total computational complexity
$$W_{MC-TL}(\text {TOL})={{\mathcal {O}}}\left( \text {TOL}^{-3}\cdot \text {Var}\left[ g\left( \widehat{{\textbf{X}}}^{\Delta t}_N\right) \right] \right) .$$
The proposed IS approach reduces this cost through variance reduction \(\left( \text {Var}[g(\overline{{\textbf{X}}}^{\Delta t}_N)\cdot L]\ll \text {Var}[g(\widehat{{\textbf{X}}}^{\Delta t}_N)]\right) \) (refer to Figs. 1, 2, 3). The TL variance becomes increasingly large in the asymptotic regime for very rare event probabilities, so the additional cost \(W_{pl} (I,M_0,\Delta t_{pl})\) for learning \(\varvec{\beta }\) in (2.23) becomes negligible. Therefore, we obtain \(W_{IS-TL}(\text {TOL})\ll W_{MC-TL}(\text {TOL})\) in the rare event regime.
3 Numerical experiments and results
Through Examples 3.1, 3.2, and 3.3, we demonstrate the advantages of the proposed IS approach compared with the standard MC approach. We numerically show that the proposed approach achieves substantial variance reduction relative to standard MC estimators when applied to SRNs of different dimensions.
Example 3.1
(Pure decay) This example considers one species and a single reaction,
$$X \xrightarrow {\theta _1} \emptyset ,$$
where \(\theta _1=1\) and the final time is \(T = 1\). Thus, the propensity is \(a(x)=\theta _1x\), the stoichiometric vector is \(\nu =-1\), and the observable is \(g(x)=\textbf{1}_{\{x>50\}}\) with \(X_0=100\).
Example 3.2
(Michaelis–Menten enzyme kinetics) The Michaelis–Menten enzyme kinetics [43] describe the catalytic conversion of a substrate S into a product P through three reactions,
$$E+S \xrightarrow {\theta _1} C,\qquad C \xrightarrow {\theta _2} E+S,\qquad C \xrightarrow {\theta _3} E+P,$$
where E denotes the enzyme, C the enzyme–substrate complex, and \(\theta = (0.001,0.005,0.01)^\top \). We consider the initial state \({\textbf{X}}_0=(E(0),S(0),C(0),P(0))^\top =(100, 100, 0, 0)^\top \) and the final time \(T=1\). The corresponding propensities and stoichiometric matrix are
$$a({\textbf{x}})=\left( \theta _1 x_1 x_2,\; \theta _2 x_3,\; \theta _3 x_3\right) ^\top ,\qquad \varvec{\nu }=\begin{pmatrix} -1 &{} 1 &{} 1\\ -1 &{} 1 &{} 0\\ 1 &{} -1 &{} -1\\ 0 &{} 0 &{} 1 \end{pmatrix}.$$
The observable of interest is \(g({\textbf{x}})={\textbf{1}}_{\{x_3>22\}}\).
Fig. 4 Example 3.2: parameters \(\varvec{\beta }^{space}\) and \(\beta ^{time}\) learned with \(\Delta t_{pl} =1/2^4\) (see final optimizer step in Fig. 2) and applied to forward runs with different \(\Delta t_f\) values. The squared coefficient of variation was estimated with \(M=10^6\) sample paths. The standard MC-TL approach is used as reference (dashed red line)
Example 3.3
(Enzymatic futile cycle model) The enzymatic futile cycle [36] describes two instances of the elementary single-substrate enzymatic reaction scheme and can be described by six reactions,
$$S_1+S_2 \xrightarrow {\theta _1} S_3,\qquad S_3 \xrightarrow {\theta _2} S_1+S_2,\qquad S_3 \xrightarrow {\theta _3} S_1+S_5,$$
$$S_4+S_5 \xrightarrow {\theta _4} S_6,\qquad S_6 \xrightarrow {\theta _5} S_4+S_5,\qquad S_6 \xrightarrow {\theta _6} S_4+S_2.$$
Initial states are \({\textbf{X}}(0)=\left( S_1(0),\dots ,S_6(0)\right) =\left( 1, 50, 0, 1, 50, 0 \right) \), and we take the rates as \(\theta _{1}=\theta _{2}=\theta _{4}=\theta _{5}=1\), and \(\theta _{3}=\theta _{6}=0.1\). The propensity \(a({\textbf{x}})\) follows the stochastic mass-action kinetics in (1.5) and the final time is \(T=2\). We consider \(g({\textbf{x}})={\textbf{1}}_{\{x_5>60\}}\) as the observable.
Since all three are rare event examples with observable \(g(\mathbf {{\textbf {x}}})={\textbf{1}}_{\{x_i>\gamma \}}\), we use the ansatz function (2.15) with initial parameters \(\varvec{\beta }^{space}=\varvec{0}\) and \(\beta ^{time}=0\). The relative error is more relevant for rare event occurrences than the absolute error; hence, we use a relative version of the variance, i.e., the squared coefficient of variation [12, 35], which, for a random variable X, is given by
$$\frac{{\text {Var}}\left[ X\right] }{\left( {\mathbb {E}}\left[ X\right] \right) ^2}.$$
To judge the robustness of our variance estimators, we estimate the kurtosis, \(\kappa :=\frac{{\mathbb {E}}\left[ \left( X-{\mathbb {E}}\left[ X\right] \right) ^4\right] }{\left( {\text {Var}}\left[ X\right] \right) ^2}\), because the standard deviation of the sample variance [10] is given by
$$\sigma _{{\mathcal {S}}^2}=\frac{{\text {Var}}\left[ X\right] }{\sqrt{M}}\sqrt{\left( \kappa -1\right) +\frac{2}{M-1}},$$
where M is the number of samples.
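The sample-based versions of these diagnostics are straightforward; a possible sketch (helper names are ours) is:

```python
import numpy as np

def squared_cv(samples):
    """Squared coefficient of variation Var[X] / E[X]^2."""
    return np.var(samples, ddof=1) / np.mean(samples) ** 2

def kurtosis(samples):
    """(Non-excess) kurtosis E[(X - E[X])^4] / Var[X]^2."""
    c = samples - np.mean(samples)
    return np.mean(c**4) / np.mean(c**2) ** 2

def std_of_sample_variance(samples):
    """Standard deviation of the sample variance via the kurtosis (cf. [10])."""
    M = len(samples)
    return np.var(samples, ddof=1) / np.sqrt(M) * np.sqrt(kurtosis(samples) - 1 + 2 / (M - 1))
```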
We set the Adam optimizer step size \(\alpha =0.1\) for the simulations.
Figure 1 shows 100 Adam optimization steps for the decay example (Example 3.1) with step size \(\Delta t_{pl}=\Delta t_f=1/2^4\). The quantity of interest is a rare event probability of magnitude \(10^{-3}\). To estimate the gradient, we use \(M_0=10^4\) samples per Adam iteration. The squared coefficient of variation is reduced by a factor of \(10^{2}\) compared with the standard MC-TL variance after 13 Adam iterations. After reaching this minimum, the squared coefficient of variation increases again over the subsequent iterations; this behavior might be avoided by employing a smaller step size in the Adam algorithm. Figure 1d confirms that the kurtosis remains bounded below the standard TL kurtosis, indicating a robust variance estimator.
For the 4-dimensional stochastic reaction network (Example 3.2), the rare event probability for the event \(\{X_3(T)>22\}\) is of magnitude \(10^{-5}\). Figure 2b confirms that the proposed learning-based approach reduces the variance by a factor \( 4\times 10^3\) compared with standard TL for step size \(\Delta t_{pl}=\Delta t_f=1/2^4\). Although Fig. 2c seems to show that parameters \(\beta ^{time}\) and \(\beta ^{space}_4\) overlap, this is an artifact of the scale of the y-axis; in fact, the final values are \(\beta ^{time}=-3.2\times 10^{-4}\) and \(\beta ^{space}_4=-3.0\times 10^{-3}\). The intrinsic structure of Example 3.2 results in similar molecule counts for E(t) and S(t) and hence similar values for \(\beta ^{space}_1\) and \(\beta ^{space}_2\). Figure 2d confirms that the kurtosis for the proposed approach is substantially reduced compared with the kurtosis for the standard TL approach.
The 6-dimensional example (Example 3.3) has a rare event probability with magnitude \(10^{-6}\). Figure 3 shows the Adam optimization results for step size \(\Delta t_{pl}=\Delta t_f=1/2^4\). The TL mean differs from the mean for the proposed approach (Fig. 3a) because the standard MC-TL estimator requires more than \(10^6\) runs to accurately estimate a probability of order \(10^{-6}\). The proposed learning-based approach reduces the variance by a factor of more than 50 after 43 iterations. The kurtosis is bounded and lower than the kurtosis for the TL approach, confirming that the proposed approach results in a robust variance estimator.
Examples 3.2 and 3.3 show that a good choice of ansatz function combined with reasonable initial parameters provides substantial variance reduction from the first optimization step. However, we do not expect this behavior in general, particularly in high dimensions; therefore, we performed several optimization iterations.
The examples used step size \(\Delta t_{pl}=1/2^4\) and showed the squared coefficient of variation with respect to the same step size. To demonstrate that the learned parameters \(\varvec{\beta }\) can be used for forward runs with smaller step sizes (i.e., \(\Delta t_{f}\ll \Delta t_{pl}\)), as claimed in Sect. 2.4, we consider Example 3.2 and apply the final parameters from Fig. 2 to forward runs with different \(\Delta t_{f}\) (see Fig. 4). The results show that the variance reduction is constant with respect to \(\Delta t_f\), suggesting that a coarse \(\Delta t_{pl}\) is sufficient for parameter learning. We observed the same behavior for the other tested examples.
Remark 3.4
We used the ansatz (2.15), based on a single sigmoid, for the numerical experiments to demonstrate the potential of the proposed learning-based IS. Further variance reduction may be achieved either by using a sum of several sigmoid functions as the ansatz or by selecting a different shape of basis function. These analyses will be pursued in future work.
4 Conclusions and future work
This work developed an efficient path-dependent IS scheme to estimate statistical quantities for SRN processes, particularly rare event probabilities. Optimal IS parameters were obtained within a pre-selected class of change of measure using the proposed connection to an associated SOC problem, which could be solved via dynamic programming. To mitigate the curse of dimensionality encountered by the dynamic programming relation, we proposed a method for multi-dimensional SRNs based on approximating the value function via an ansatz function (e.g., a neural network), where the parameters were learned using a stochastic optimization algorithm. Numerical examples and subsequent analyses verified that the proposed estimator achieved substantial variance reduction compared with the standard MC method, providing lower computational complexity in the rare event regime.
Future work will further analyze the proposed learning-based approach and expand it to derive a multilevel MC estimator. We also plan to combine an implementation of the dynamic programming principle as derived in Sect. 2.2 with dimension reduction methods for SRNs.
Notes
\(\alpha _{j,i}\) molecules for species \(S_i\) are consumed and \(\beta _{j,i}\) are produced. Thus, \((\alpha _{j,i},\beta _{j,i}) \in {\mathbb {N}}^2\) but \(\beta _{j,i}-\alpha _{j,i}\) can be a negative integer, constituting the vector \(\varvec{\nu }_j=\left( \beta _{j,1}-\alpha _{j,1},\dots ,\beta _{j,d}-\alpha _{j,d}\right) \in {\mathbb {Z}}^d\).
Refer to [39] for the underlying assumptions and proofs for this statement in the TL scheme context.
We refer to [10] (Sect. 4.1) for the likelihood factor derivation of a similar IS scheme.
References
Abdulle, A., Yucheng, H., Li, T.: Chebyshev methods with discrete noise: the \(\tau \)-ROCK methods. J. Comput. Math. 28, 195–217 (2010)
Ahn, T.-H., Sandu, A., Han, X.: Implicit simulation methods for stochastic chemical kinetics (2013). arXiv:1303.3614
Anderson, D., Higham, D.: Multilevel Monte Carlo for continuous Markov chains, with applications in biochemical kinetics. SIAM Multiscale Model. Simul. 10(1), 146–179 (2012)
Anderson, D.F.: A modified next reaction method for simulating chemical systems with time dependent propensities and delays. J. Chem. Phys. 127(21), 214107 (2007)
Anderson, D.F., Kurtz, T.G.: Stochastic Analysis of Biochemical Systems, vol. 1. Springer, Berlin (2015)
Aparicio, J.P., Solari, H.G.: Population dynamics: Poisson approximation and its relation to the Langevin process. Phys. Rev. Lett. 86(18), 4183 (2001)
Banisch, R., Hartmann, C.: Meshless discretization of LQ-type stochastic control problems (2013). arXiv:1309.7497
Bayer, C., Moraes, A., Tempone, R., Vilanova, P.: An efficient forward-reverse expectation–maximization algorithm for statistical inference in stochastic reaction networks. Stoch. Anal. Appl. 34(2), 193–231 (2016)
Ben Hammouda, C.: Hierarchical approximation methods for option pricing and stochastic reaction networks. Ph.D. Thesis (2020)
Ben Hammouda, C., Rached, N.B., Tempone, R.: Importance sampling for a robust and efficient multilevel Monte Carlo estimator for stochastic reaction networks. Stat. Comput. 30(6), 1665–1689 (2020)
Ben Hammouda, C., Moraes, A., Tempone, R.: Multilevel hybrid split-step implicit tau-leap. Numer. Algorithms 74(2), 527–560 (2017)
Ben Rached, N., Haji-Ali, A.-L., Rubino, G., Tempone, R.: Efficient importance sampling for large sums of independent and identically distributed random variables. Stat. Comput. 31(6), 1–13 (2021)
Brauer, F., Castillo-Chavez, C.: Mathematical Models in Population Biology and Epidemiology, vol. 40. Springer, Berlin (2012)
Cao, Y., Petzold, L.: Trapezoidal tau-leaping formula for the stochastic simulation of biochemical systems. In: Proceedings of Foundations of Systems Biology in Engineering (FOSBE 2005), pp. 149–152 (2005)
Cao, Y., Liang, J.: Adaptively biased sequential importance sampling for rare events in reaction networks with comparison to exact solutions from finite buffer dCME method. J. Chem. Phys. 139(2), 07B605_1 (2013)
Daigle Jr, B.J., Roh, M.K., Gillespie, D.T., Petzold, L.R.: Automated estimation of rare event probabilities in biochemical systems. J. Chem. Phys. 134(4), 01B628 (2011)
Dupuis, P., Spiliopoulos, K., Wang, H.: Importance sampling for multiscale diffusions. Multiscale Model. Simul. 10(1), 1–27 (2012)
Engblom, S.: On the stability of stochastic jump kinetics (2012). arXiv:1202.3892
Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics, Wiley, New York (1986)
Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, Berlin (2006)
Giles, M.B.: Multilevel Monte Carlo path simulation. Oper. Res. 56(3), 607–617 (2008)
Giles, M.B.: Multilevel Monte Carlo methods. Acta Numer. 24, 259–328 (2015)
Gillespie, C.S., Golightly, A.: Guided proposals for efficient weighted stochastic simulation. J. Chem. Phys. 150(22), 224103 (2019)
Gillespie, D.T., Roh, M., Petzold, L.R.: Refining the weighted stochastic simulation algorithm. J. Chem. Phys. 130(17), 174103 (2009)
Gillespie, D.T.: A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22(4), 403–434 (1976)
Gillespie, D.T.: Approximate accelerated stochastic simulation of chemically reacting systems. J. Chem. Phys. 115(4), 1716–1733 (2001)
Gupta, A., Briat, C., Khammash, M.: A scalable computational framework for establishing long-term behavior of stochastic reaction networks. PLoS Comput. Biol. 10(6), e1003669 (2014)
Hartmann, C., Banisch, R., Sarich, M., Badowski, T., Schütte, C.: Characterization of rare events in molecular dynamics. Entropy 16(1), 350–376 (2014)
Hartmann, C., Kebiri, O., Neureither, L., Richter, L.: Variational approach to rare event simulation using least-squares regression. Chaos Interdisc. J. Nonlinear Sci. 29(6), 063107 (2019)
Hartmann, C., Richter, L., Schütte, C., Zhang, W.: Variational characterization of free energy: theory and algorithms. Entropy 19(11), 626 (2017)
Hartmann, C., Schütte, C., Weber, M., Zhang, W.: Importance sampling in path space for diffusion processes with slow-fast variables. Probab. Theory Relat. Fields 170(1), 177–228 (2018)
Hensel, S.C., Rawlings, J.B., Yin, J.: Stochastic kinetic modeling of vesicular stomatitis virus intracellular growth. Bull. Math. Biol. 71(7), 1671–1692 (2009)
Kebiri, O., Neureither, L., Hartmann, C.: Adaptive importance sampling with forward-backward stochastic differential equations. In: International Workshop on Stochastic Dynamics Out of Equilibrium, pp. 265–281. Springer, Berlin (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980
Kroese, D.P., Taimre, T., Botev, Z.I.: Handbook of Monte Carlo Methods. Wiley, New York (2013)
Kuwahara, H., Mura, I.: An efficient and exact stochastic simulation method to analyze rare events in biochemical systems. J. Chem. Phys. 129(16), 10B619 (2008)
L’Ecuyer, P.: Note: On the interchange of derivative and expectation for likelihood ratio derivative estimators. Manag. Sci. 41(4), 738–747 (1995)
Lester, C., Yates, C.A., Giles, M.B., Baker, R.E.: An adaptive multi-level simulation algorithm for stochastic biological systems. J. Chem. Phys. 142(2), 01B612_1 (2015)
Li, T.: Analysis of explicit tau-leaping schemes for simulating chemically reacting systems. Multiscale Model. Simul. 6(2), 417–436 (2007)
Moraes, A., Tempone, R., Vilanova, P.: A multilevel adaptive reaction-splitting simulation method for stochastic reaction networks. SIAM J. Sci. Comput. 38(4), A2091–A2117 (2016)
Nüsken, N., Richter, L.: Solving high-dimensional Hamilton–Jacobi–Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. Partial Differ. Equ. Appl. 2(4), 1–48 (2021)
Rached, N.B., Haji-Ali, A.L., Mohan, S., Tempone, R.: Double Loop Monte Carlo Estimator with Importance Sampling for Mckean–Vlasov Stochastic Differential Equation (2022). arXiv:2207.06926
Rao, C.V., Arkin, A.P.: Stochastic chemical kinetics and the quasi-steady-state assumption: application to the Gillespie algorithm. J. Chem. Phys. 118(11), 4999–5010 (2003)
Rathinam, M.: Moment Growth Bounds on Continuous Time Markov Processes on Non-negative Integer Lattices (2013). arXiv:1304.5169
Rathinam, M., El Samad, H.: Reversible-equivalent-monomolecular tau: a leaping method for small number and stiff stochastic chemical systems. J. Comput. Phys. 224(2), 897–923 (2007)
Roh, M.K.: Data-driven method for efficient characterization of rare event probabilities in biochemical systems. Bull. Math. Biol. 81(8), 3097–3120 (2019)
Roh, M.K., Gillespie, D.T., Petzold, L.R.: State-dependent biasing method for importance sampling in the weighted stochastic simulation algorithm. J. Chem. Phys. 133(17), 174106 (2010)
Srivastava, R., You, L., Summers, J., Yin, J.: Stochastic vs. deterministic modeling of intracellular viral kinetics. J. Theor. Biol. 218(3), 309–321 (2002)
Zhang, W., Wang, H., Hartmann, C., Weber, M., Schütte, C.: Applications of the cross-entropy method to importance sampling and optimal control of diffusions. SIAM J. Sci. Comput. 36(6), A2654–A2672 (2014)
Acknowledgements
This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2019-CRG8-4033. This work was partially performed as part of the Helmholtz School for Data Science in Life, Earth and Energy (HDS-LEE) and received funding from the Helmholtz Association of German Research Centres and the Alexander von Humboldt Foundation.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Appendices
Appendix A: Proof for Theorem 2.4
Proof for Theorem 2.4
To show (2.9), we first reformulate \(C_{n,x}(\varvec{\delta }^{\Delta t}_n,\dots ,\varvec{\delta }^{\Delta t}_{N-1})\) using the definition for the likelihood and the notion of conditional expectation,
Setting
we can reformulate (A.1) and derive
We can prove Theorem 2.4 using the above results. We split the proof into two parts, where the first inequality is obtained by
To prove the second inequality, we choose the control at the n-th time step to be an arbitrary \(\varvec{\delta }_n^{\Delta t,+}>0\), and for the remaining controls, we choose the elements of a minimizing sequence of controls such that
Therefore,
This inequality holds for any arbitrary \(\varvec{\delta }_n^{\Delta t,+}>0\), and hence
This completes the proof. \(\square \)
Appendix B: Proof for Lemma 2.9
The partial derivatives of the second moment \(C_{0,{\textbf{x}}}\left( \varvec{\delta }^{\Delta t}_0,\dots ,\varvec{\delta }^{\Delta t}_{N-1}; \varvec{\beta }\right) \) in (2.7) with respect to \(\beta _{l}\), \(l=1,\dots ,(d+1)\), can be expressed as
where the Poisson increments with respect to the TL measure are given in \({\textbf{P}}_n\) with \(({\textbf{P}}_n)_j:=P_{n,j}={\mathcal {P}}_{n,j}\left( a_j(\widehat{{\textbf{X}}}_k^{\Delta t})\Delta t\right) \) for \(j=1,\dots ,J\). In \(\overset{(1)}{=}\), we assume that the expected value and the derivative commute (see [37] Assumption A1(1) for sufficient conditions). In \(\overset{(2)}{=}\), we consider that \(g^2\left( \widehat{{\textbf{X}}}_N^{\Delta t}\right) \) is based on the original TL measure and hence is not dependent on \(\beta _l\).
In (B.1), the term
$$\prod _{k=0}^{N-1} L_k\left( {\textbf{P}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{\textbf{X}}}_k^{\Delta t};\varvec{\beta })\right) $$
depends on \(\varvec{\beta }\) only deterministically, since \(\widehat{{\textbf{X}}}_k^{\Delta t}\) is independent of \(\varvec{\beta }\), and \(P_{k,j}\sim Poi(a_j(\widehat{{\textbf{X}}}_k^{\Delta t})\Delta t)\). Thus, the derivative can be computed in closed form using the identity
$$\frac{\partial }{\partial \beta _l} h(\varvec{\beta })=h(\varvec{\beta })\,\frac{\partial }{\partial \beta _l}\ln h(\varvec{\beta }),\quad h(\varvec{\beta })>0.$$(B.2)
We compute the derivative using the following steps.

1. Apply (B.2),
$$\begin{aligned}&\frac{\partial }{\partial \beta _l} \left( \prod _{k=0}^{N-1} L_k\left( {{\textbf {P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{{\textbf {X}}}}_k^{\Delta t};\varvec{\beta })\right) \right) \\ {}&\quad =\left( \prod _{k=0}^{N-1} L_k\left( {{\textbf {P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{{\textbf {X}}}}_k^{\Delta t};\varvec{\beta })\right) \right) \\ {}&\qquad \frac{\partial }{\partial \beta _l} \ln \left( \prod _{k=0}^{N-1} L_k\left( {{\textbf {P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{{\textbf {X}}}}_k^{\Delta t};\varvec{\beta })\right) \right) \\ {}&\quad =\left( \prod _{k=0}^{N-1} L_k\left( {{\textbf {P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{{\textbf {X}}}}_k^{\Delta t};\varvec{\beta })\right) \right) \\ {}&\qquad \sum _{k=0}^{N-1} \frac{\partial }{\partial \beta _l} \ln \left( L_k\left( {{\textbf {P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{{\textbf {X}}}}_k^{\Delta t};\varvec{\beta })\right) \right) . \\ \end{aligned}$$

2. The remaining derivative can be derived by the chain rule,
$$\begin{aligned}&\frac{\partial }{\partial \beta _l} \ln \left( L_k\left( {\textbf{P}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{\textbf{X}}}_k^{\Delta t};\varvec{\beta })\right) \right) \\&\quad =\frac{1}{L_k\left( {\textbf{P}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{\textbf{X}}}_k^{\Delta t};\varvec{\beta })\right) } \frac{\partial }{\partial \beta _l} L_k\left( {\textbf{P}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{\textbf{X}}}_k^{\Delta t};\varvec{\beta })\right) . \end{aligned}$$

3. Apply a second chain rule,
$$\begin{aligned}&\frac{\partial }{\partial \beta _l} L_k\left( {\textbf{P}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{\textbf{X}}}_k^{\Delta t};\varvec{\beta })\right) \nonumber \\&\quad =\frac{\partial }{\partial \beta _l} \widehat{\varvec{\delta }}^{\Delta t}(k,\widehat{{\textbf{X}}}_k^{\Delta t};\varvec{\beta }) \cdot \nabla _{\varvec{\delta }} L_k({\textbf{P}}_k,\varvec{\delta }). \end{aligned}$$(B.3)

4. The per-step likelihood factor (2.4) reads
$$\begin{aligned} L_k\left( {{\textbf {P}}}_k,\varvec{\delta }\right)&=\exp \left( -\left( \sum _{j=1}^J a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})- \delta _j\right) \Delta t\right) \nonumber \\ {}&\quad \cdot \prod _{j=1}^J\left( \frac{a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})}{\delta _j}\right) ^{P_{k,j}}, \end{aligned}$$(B.4)
hence
$$\begin{aligned}&\frac{\partial }{\partial \delta _i} L_k\left( {{\textbf {P}}}_k,\varvec{\delta }\right) \\ {}&\quad =\Delta t \exp \left( -\left( \sum _{j=1}^J a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})- \delta _j\right) \Delta t\right) \cdot \prod _{j=1}^J\left( \frac{a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})}{\delta _j}\right) ^{P_{k,j}}\\ {}&\qquad +\exp \left( -\left( \sum _{j=1}^J a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})- \delta _j\right) \Delta t\right) \\ {}&\qquad \cdot (-P_{k,i}) \frac{a_i(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})^{P_{k,i}}}{\delta _i^{P_{k,i}+1}}\prod _{j=1,j\ne i}^J\left( \frac{a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})}{\delta _j}\right) ^{P_{k,j}}\\ {}&\quad =\exp \left( -\left( \sum _{j=1}^J a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})- \delta _j\right) \Delta t\right) \\ {}&\qquad \prod _{j=1}^J\left( \frac{a_j(\widehat{{{\textbf {X}}}}_{k}^{\Delta t})}{\delta _j}\right) ^{P_{k,j}}\cdot \left( \Delta t - \frac{P_{k,i}}{\delta _i} \right) \\ {}&\quad =L_k({{\textbf {P}}}_k,\varvec{\delta })\cdot \left( \Delta t - \frac{P_{k,i}}{\delta _i} \right) . \end{aligned}$$

5. In (B.3), from (2.16), we obtain
$$\begin{aligned}&\frac{\partial }{\partial \beta _l} {\widehat{\delta }}_j^{\Delta t}(k,{\textbf{x}};\varvec{\beta }) =a_j({\textbf{x}})\frac{1}{2}\sqrt{\frac{{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })}{{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({\textbf{x}}+\nu _j,0);\varvec{\beta })}} \nonumber \\&\qquad \cdot \left( \frac{\frac{\partial }{\partial \beta _l}{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({\textbf{x}}+\nu _j,0);\varvec{\beta })}{{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })}\right. \nonumber \\&\qquad \left. -\frac{{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({\textbf{x}}+\nu _j,0);\varvec{\beta })\frac{\partial }{\partial \beta _l}{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })}{{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })^2}\right) \nonumber \\&\quad =\frac{a_j({\textbf{x}})^2}{2{\widehat{\delta }}_j^{\Delta t}(k,{\textbf{x}};\varvec{\beta })}\cdot \left( \frac{\frac{\partial }{\partial \beta _l}{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({\textbf{x}}+\nu _j,0);\varvec{\beta })}{{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })}\right. \nonumber \\&\qquad \left. -\frac{{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({\textbf{x}}+\nu _j,0);\varvec{\beta })\frac{\partial }{\partial \beta _l}{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })}{{\widehat{u}}(\frac{(k+1)\Delta t}{T},{\textbf{x}};\varvec{\beta })^2}\right) \end{aligned}$$(B.5)
where \(\frac{\partial }{\partial \beta _l}{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) depends on the chosen ansatz.
Combining the previous steps, the gradient can be expressed as
where the gradient of \( {\widehat{\delta }}_j^{\Delta t}\) depends on the ansatz used and is given by (B.5).
Since the MC estimator (B.1) may have a large variance, we again apply IS,
Keywords
- Stochastic reaction networks
- Tau-leap
- Importance sampling
- Stochastic optimal control
- Dynamic programming
- Rare event
Mathematics Subject Classification
- 60H35
- 60J75
- 65C05
- 93E20