1 Introduction

We propose an approach to efficiently estimate statistical quantities, particularly rare event probabilities, for a class of continuous-time Markov chains known as stochastic reaction networks (SRNs). To this end, we develop a learning-based importance sampling (IS) algorithm that improves the efficiency of the Monte Carlo (MC) estimator based on an approximate tau-leap (TL) scheme. The automated approach rests on an original connection between the determination of optimal IS parameters within a class of probability measures and a stochastic optimal control (SOC) formulation.

SRNs (see Sect. 1.1 for a short introduction and [9] for more details) describe the time evolution of biochemical reactions, epidemic processes [5, 13], and transcription and translation in genomics and virus kinetics [32, 48], among other important applications. For the current study, let \({\textbf{X}}\) be an SRN that takes values in \({\mathbb {N}}^d\) and is defined in the time interval [0, T], where \(T>0\) is a user-selected final time. We aim to provide accurate and computationally efficient MC estimations for the expected value \({\mathbb {E}}[g({\textbf{X}}(T))]\), where \(g:{\mathbb {N}}^d\rightarrow {\mathbb {R}}\) is a scalar observable for \({\textbf{X}}\). In particular, we study estimating rare event probabilities with \(g({\textbf{x}})=\varvec{1}_{\{{\textbf{x}} \in {\mathcal {B}}\}}\) (i.e., the indicator function for a set \({\mathcal {B}} \subset {\mathbb {R}}^d\)).

The quantity of interest, \({\mathbb {E}}[g({\textbf{X}}(T))]\), can be computed by solving the corresponding Kolmogorov backward equations [8]. For most SRNs, deriving a closed-form solution for these ordinary differential equations is infeasible, and numerical approximations based on discretized schemes are commonly used. However, the computational cost scales exponentially with the number of species d. Therefore, we are particularly interested in estimating \({\mathbb {E}}[g({\textbf{X}}(T))]\) using MC methods, an attractive alternative to avoid the curse of dimensionality.

Many schemes have been developed to simulate exact sample paths for SRNs, such as the stochastic simulation algorithm [25] and modified next reaction method [4]. Pathwise exact SRN realizations can incur high computational costs if any reaction channels have high reaction rates. Gillespie [26] and Aparicio and Solari [6] independently proposed the explicit TL method (see Sect. 1.2) to overcome this issue by simulating approximate paths of \({\textbf{X}}\), evolving the process with fixed time steps and keeping reaction rates fixed within each time step. Various simulation schemes have been subsequently proposed to deal with situations incorporating well-separated fast and slow time scales [1, 2, 11, 14, 40, 45].

Various variance reduction techniques have been proposed in the SRN context to reduce the computational work to estimate \({\mathbb {E}}[g({\textbf{X}}(T))]\). Several multilevel Monte Carlo (MLMC) [21, 22] based methods have been proposed to address specific challenges in this context [3, 10, 11, 38, 40]. Furthermore, as naive MC and MLMC estimators fail to efficiently and accurately estimate rare event probabilities, different IS approaches [15, 16, 23, 24, 36, 46, 47] have been proposed.

The current paper proposes a path-dependent IS approach based on an approximate TL scheme to improve the MC estimator efficiency, and hence efficiently estimate various statistical quantities for SRNs (particularly rare event probabilities). Our class of probability measure change is based on modifying the Poisson random variable rates used to construct the TL paths. In particular, optimal IS parameters are obtained by minimizing the second moment of the IS estimator (equivalently the variance) which represents the cost function for the associated SOC problem. We show that the corresponding value function solves a dynamic programming relation that is challenging to solve analytically (see Sect. 2.1). We approximate the dynamic programming equation to derive a closed form solution and near-optimal control parameters. The cost to solve the associated backward equation numerically in multi-dimensional settings increases exponentially with respect to the dimension (i.e., the curse of dimensionality). Thus, we propose approximating the resulting value function using a neural network to overcome this issue. Utilizing the optimality criterion for the SOC problem, we obtain a relationship between optimal IS parameters and the value function. Finally, we employ a stochastic optimization algorithm to learn the corresponding neural network parameters. Our analysis and numerical results for different dimensions confirm that the proposed estimator considerably reduces the variance compared with the standard TL-MC method with a negligible additional cost. This allows rare event probabilities to be efficiently computed in a regime where standard TL-MC estimators commonly fail.

The proposed approach is more computationally efficient than previously proposed IS schemes in this context ([15, 16, 23, 24, 36, 46, 47]) because it is based on an approximate TL scheme rather than the exact scheme. In contrast to previous approaches, the change of measure is derived systematically to ensure convergence to the optimal measure within the chosen class of probability measures, minimizing the MC estimator variance. The novelty of this work lies in establishing a connection between IS and SOC in the context of pure jump processes, particularly SRNs, with an emphasis on the related practical and numerical aspects. Note that some previous studies [7, 17, 20, 28,29,30,31, 33, 41, 49] have established a similar connection, mainly in the diffusion dynamics context, with less focus on pure jump dynamics. The proposed methodology is based on an approximate explicit TL scheme, and could subsequently be extended in future work to the continuous-time formulation (exact schemes) and to implicit TL schemes, which are relevant for systems with fast and slow time scales.

The remainder of this paper is organized as follows. Sections 1.1, 1.2, 1.3, and 1.4 introduce the relevant SRN, TL, MC, and IS concepts, respectively. Section 2 establishes the connection between IS and SOC, formulating the SOC problem and defining its main ingredients (controls, cost function, and value function), and then presents the dynamic programming relation solved by the optimal controls. Section 2.3 develops the proposed learning-based IS approach appropriate for multi-dimensional SRNs. Section 3 provides selected numerical experiments in different dimensions to illustrate the efficiency of the proposed approach compared with standard MC approaches. Finally, Sect. 4 summarizes and concludes the work, and discusses possible future research directions.

1.1 Stochastic reaction networks (SRNs)

We are interested in the time evolution of a homogeneously mixed chemically reacting system described by the Markovian pure jump process \({\textbf{X}}:[0,T]\times \Omega \rightarrow {\mathbb {N}}^d\), where (\(\Omega \), \({\mathcal {F}}\), \({\mathbb {P}}\)) is a probability space. In this framework, we assume that d different species interact through J reaction channels. The i-th component, \(X_i(t)\), describes the abundance of the i-th species present in the chemical system at time t. This work studies the time evolution of the state vector,

$$\begin{aligned} {\textbf{X}}(t) = \left( X_1(t), \ldots , X_d(t)\right) \in {\mathbb {N}}^d .\end{aligned}$$

Each reaction channel \({\mathcal {R}}_j\) is a pair \((a_j, \varvec{\nu }_{j})\) defined by its propensity function \(a_{j}:{\mathbb {R}}^{d} \rightarrow {\mathbb {R}}_{+}\) and stoichiometric vector \( \varvec{\nu }_{j}=( \nu _{j,1},\nu _{j,2},\ldots , \nu _{j,d})^\top \) satisfying

$$\begin{aligned} {\mathbb {P}}\left( {{\textbf {X}}}(t\!+\!\Delta t)={{\textbf {x}}}\!+\!\varvec{\nu }_{j} \mid {{\textbf {X}}}(t) ={{\textbf {x}}}\right) =a_{j}({{\textbf {x}}})\Delta t\!+\!{o}\left( \Delta t\right) , \, j=1,2,\ldots ,J . \end{aligned}$$

Thus, the probability of observing a jump in the process \({\textbf{X}}\) from state \({\textbf{x}}\) to state \({\textbf{x}} + \varvec{\nu }_{j}\), a consequence of reaction \({\mathcal {R}}_{j}\) firing during the small time interval \((t, t + \Delta t]\), is proportional to the time interval length, \(\Delta t\), where \(a_{j}({\textbf{x}})\) is the proportionality constant. We set \(a_j({\textbf{x}}){=}0\) for \({\textbf{x}}\) such that \({\textbf{x}}{+}\varvec{\nu }_j\notin {\mathbb {N}}^d\) (i.e., the non-negativity assumption: the system can never produce negative population values).

Hence, from (1.2), process \({\textbf{X}}\) is a continuous-time, discrete-space Markov chain that can be characterized by Kurtz’s random time change representation [19],

$$\begin{aligned} {\textbf{X}}(t)= {\textbf{x}}_{0}+\sum _{j=1}^{J} Y_j \left( \int _0^t a_{j}({\textbf{X}}(s)) \, \textrm{d}s \right) \varvec{\nu }_j ,\end{aligned}$$

where \(Y_j:{\mathbb {R}}_+{\times } \Omega \rightarrow {\mathbb {N}}\) are independent unit-rate Poisson processes. Conditions on the reaction channels can be imposed to ensure uniqueness [5] and avoid explosions in finite time [18, 27, 44].

Applying the stochastic mass-action kinetics principle, we assume that the propensity function \(a_j(\cdot )\) for reaction channel \({\mathcal {R}}_j\), represented as

$$\begin{aligned} \alpha _{j,1} S_1+\dots +\alpha _{j,d} S_d \overset{\theta _j}{\rightarrow }\beta _{j,1} S_1+\dots +\beta _{j,d} S_d, \end{aligned}$$

is given by

$$\begin{aligned} a_j({\textbf{x}}):=\theta _j \prod _{i=1}^d \frac{x_i!}{(x_i-\alpha _{j,i})!} {\textbf{1}}_{\{x_i\ge \alpha _{j,i}\}},\end{aligned}$$

where \(\{\theta _j\}_{j=1}^J\) represents positive constant reaction rates, and \(x_i\) is the counting number for species \(S_i\).
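For concreteness, the mass-action propensity (1.5) can be evaluated in a few lines; the following sketch (the function name and the example reaction are our own illustrative choices) computes \(a_j({\textbf{x}})\) for a single reaction channel:

```python
from math import factorial

def propensity(theta, alpha, x):
    """Mass-action propensity a_j(x) = theta_j * prod_i x_i!/(x_i - alpha_{j,i})!,
    with a_j(x) = 0 whenever some species count x_i falls below alpha_{j,i}."""
    rate = float(theta)
    for xi, ai in zip(x, alpha):
        if xi < ai:
            return 0.0  # indicator 1_{x_i >= alpha_{j,i}} switches the reaction off
        rate *= factorial(xi) // factorial(xi - ai)  # falling factorial x_i!/(x_i - a_i)!
    return rate

# Dimerization 2 S1 -> S2 with theta = 0.5 at x = (4, 0): a(x) = 0.5 * (4!/2!) = 6.0
```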

1.2 Explicit tau-leap approximation

The explicit-TL scheme is a pathwise approximate method [6, 26] that overcomes the computational drawbacks of exact methods (i.e., when many reactions fire during a short time interval). This scheme can be derived from the random time change representation (1.3) by approximating the integral \(\int _{t_i}^{t_{i+1}} a_{j}({\textbf{X}}(s)) \textrm{d}s \) as \(a_j({\textbf{X}}(t_i))\,(t_{i+1}-t_i)\), i.e., using the forward-Euler method with time mesh \(\{t_{0}=0, t_{1},\ldots ,t_{N}= T\}\) and uniform step size \(\Delta t=\frac{T}{N}\). Thus, the explicit-TL approximation for \({\textbf{X}}\) satisfies, for \(k\in \{1,2,\ldots ,N\}\),

$$\begin{aligned} \widehat{{\textbf{X}}}^{\Delta t}_k = {\textbf{x}}_{0}+\sum _{j=1}^{J} Y_{j} \left( \sum _{i=0}^{k-1} a_{j}(\widehat{{\textbf{X}}}^{\Delta t}_i) \Delta t \right) \varvec{\nu }_{j} ,\end{aligned}$$

and given \(\widehat{{\textbf{X}}}^{\Delta t}_0:= {\textbf{x}}_{0}\), we iteratively simulate a path of \(\widehat{{\textbf{X}}}^{\Delta t}\) as

$$\begin{aligned} \widehat{{\textbf{X}}}^{\Delta t}_k:=\widehat{{\textbf{X}}}^{\Delta t}_{k-1}+\sum _{j=1}^{J} {\mathcal {P}}_{k-1,j}\left( a_{j}(\widehat{{\textbf{X}}}^{\Delta t}_{k-1}) \Delta t\right) \varvec{\nu }_{j} ,\, 1 \le k \le N,\nonumber \\ \end{aligned}$$

where, conditioned on the current state \(\widehat{{\textbf{X}}}^{\Delta t}_{k}\), \(\{{\mathcal {P}}_{k,j}(r_{k,j})\}_{\{1\le j\le J \}}\) are independent Poisson random variables with respective rates \(r_{k,j}:=a_{j}(\widehat{{\textbf{X}}}^{\Delta t}_{k})\Delta t\).

The explicit-TL path \(\widehat{{\textbf{X}}}^{\Delta t}\) is defined only at time mesh points, but can be naturally extended to [0, T] as a piecewise constant path. We apply the projection to zero to prevent the process from exiting the lattice (i.e., producing negative values), hence (1.7) becomes

$$\begin{aligned} \widehat{{{\textbf {X}}}}^{\Delta t}_k\!\!:=\!\max \left( {{\textbf {0}}},\widehat{{{\textbf {X}}}}^{\Delta t}_{k-1}\!+\!\sum _{j=1}^{J} {\mathcal {P}}_{k-1,j}\!\!\left( a_{j}(\widehat{{{\textbf {X}}}}^{\Delta t}_{k-1}) \Delta t\right) \varvec{\nu }_{j} \!\!\right) \!, 1 \!\le \! k \!\le \! N, \end{aligned}$$

where the maximum is applied entry-wise. In this work, we use uniform time steps with length \(\Delta t\), but the explicit-TL scheme and the proposed IS scheme (see Sect. 2) can also be applied to non-uniform time meshes.
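As an illustration, one explicit-TL path per (1.8), including the entry-wise projection to zero, can be simulated as in the sketch below (the birth-death network and all names are our own illustrative choices, not from the text):

```python
import numpy as np

def tau_leap_path(x0, nu, propensities, T, N, rng):
    """Simulate one explicit tau-leap path on a uniform mesh with N steps,
    projecting entry-wise to zero so the state stays in N^d (cf. (1.8))."""
    dt = T / N
    x = np.array(x0, dtype=np.int64)
    for _ in range(N):
        rates = np.array([a(x) for a in propensities]) * dt
        jumps = rng.poisson(rates)           # P_{k-1,j}(a_j(x) * dt)
        x = np.maximum(0, x + nu.T @ jumps)  # entry-wise projection to zero
    return x

# Birth-death example (d = 1, J = 2): {} -> S at rate 10, S -> {} at rate X(t).
nu = np.array([[1], [-1]])                   # row j holds the stoichiometric vector nu_j
props = [lambda x: 10.0, lambda x: float(x[0])]
rng = np.random.default_rng(0)
xT = tau_leap_path([0], nu, props, T=1.0, N=100, rng=rng)
```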

1.3 Biased Monte Carlo estimator

Let \({\textbf{X}}\) be a stochastic process and \(g: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) a scalar observable. We want to approximate \({\mathbb {E}} \left[ g({\textbf{X}}(T))\right] \), but rather than sampling directly from \({\textbf{X}}(T)\), we sample from \(\overline{{\textbf{X}}}^{\Delta t}(T)\), a random variable generated by a numerical scheme with step size \(\Delta t\). We assume that the variates \(\overline{{\textbf{X}}}^{\Delta t}(T)\) are generated by an algorithm of weak order \({{\mathcal {O}}}\left( \Delta t\right) \), i.e., for sufficiently small \(\Delta t\),

$$\begin{aligned} \left| {\mathbb {E}} \left[ g({\textbf{X}}(T))- g(\overline{{\textbf{X}}}^{\Delta t}(T) )\right] \right| \le C\Delta t \end{aligned}$$

where \(C>0\).

Let \(\mu _{M}\) be the standard MC estimator for \({\mathbb {E}} \left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] \),

$$\begin{aligned} \mu _{M}:=\frac{1}{M}\sum _{m=1}^{M} g(\overline{{\textbf{X}}}^{\Delta t}_{[m]}(T)),\end{aligned}$$

where \(\{\overline{{\textbf{X}}}^{\Delta t}_{[m]}(T)\}_{m=1}^M\) are independent and distributed as \(\overline{{\textbf{X}}}^{\Delta t}(T)\).

The global error of the proposed MC estimator admits the decomposition

$$\begin{aligned}&\left| {\mathbb {E}}\big [g({{\textbf {X}}}(T))\big ]-\mu _M\right| \nonumber \\&\quad \le \underbrace{\left| {\mathbb {E}}[g({{\textbf {X}}}(T))]-{\mathbb {E}}\big [g(\overline{{{\textbf {X}}}}^{\Delta t}(T))\big ]\right| }_{\text{ Bias }}+\underbrace{\left| {\mathbb {E}}\big [g(\overline{{{\textbf {X}}}}^{\Delta t}(T))\big ]-\mu _M\right| }_{\text{ Statistical } \text{ Error }}. \end{aligned}$$

To achieve the desired accuracy, \(\text {TOL}\), it is sufficient to bound the bias and the statistical error by \(\frac{\text {TOL}}{2}\) each. From (1.9), choosing step size

$$\begin{aligned} \Delta t(\text {TOL})= \frac{\text {TOL}}{2\cdot C} \end{aligned}$$

ensures a bias of \(\frac{\text {TOL}}{2}\).

Thus, considering the central limit theorem, the statistical error can be approximated as

$$\begin{aligned} \left| {\mathbb {E}}\big [g(\overline{{{\textbf {X}}}}^{\Delta t}(T))\big ]-\mu _{M}\right| \approx C_{\alpha }\cdot \sqrt{\frac{\text{ Var }[g(\overline{{{\textbf {X}}}}^{\Delta t}(T))]}{M}}, \end{aligned}$$

where constant \(C_{\alpha }\) is the \((1-\frac{\alpha }{2})-\)quantile for the standard normal distribution. We choose \(C_{\alpha }=1.96\) for a \(95\%\) confidence level corresponding to \(\alpha =0.05\). Choosing

$$\begin{aligned} M^*(\text {TOL})=C_{\alpha }^2\frac{4\cdot \text {Var}\big [g(\overline{{\textbf{X}}}^{\Delta t}(T))\big ]}{\text {TOL}^2} \end{aligned}$$

sample paths ensures that the statistical error is approximately bounded by \(\frac{\text {TOL}}{2}\).

Given that the computational cost to simulate a single path is \({{\mathcal {O}}}\left( {\Delta t}^{-1}\right) \), the expected total computational complexity is \({{\mathcal {O}}}\left( \text {TOL}^{-3}\right) \), with a constant that scales with \(\text {Var}\left[ g(\overline{{\textbf{X}}}^{\Delta t}(T))\right] \) (see (1.14)).
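The tolerance splitting above translates directly into simulation parameters. A minimal sketch (assuming the bias constant C and the variance are known or estimated; the function name is ours):

```python
import math

def tl_mc_parameters(tol, bias_const, variance, c_alpha=1.96):
    """Split TOL equally between bias and statistical error:
    dt = TOL / (2C) per (1.11) and M = 4 C_alpha^2 Var / TOL^2 per (1.14)."""
    dt = tol / (2.0 * bias_const)
    M = math.ceil(4.0 * c_alpha**2 * variance / tol**2)
    return dt, M

# Total work scales like M / dt = O(TOL^-3), as stated above.
dt, M = tl_mc_parameters(tol=0.01, bias_const=1.0, variance=1.0)
```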

1.4 Importance sampling

Importance sampling (IS) techniques, when used appropriately, reduce the computational cost of the crude MC estimator through variance reduction. To motivate these techniques, consider estimating rare event probabilities, for which the crude MC method is prohibitively expensive. In particular, consider estimating \(q={\mathbb {P}}(Y>\gamma )={\mathbb {E}}[{\textbf {1}}_{\{Y>\gamma \}}]\), where Y is a random variable taking values in \({\mathbb {R}}\) with probability density function \(\rho _{Y}\). Let \(\gamma \) be sufficiently large that q is small. We can approximate q using the MC estimator

$$\begin{aligned} {\widehat{q}}=\frac{1}{M}\sum _{i=1}^M {\textbf {1}}_{\{Y^{(i)}>\gamma \}}, \end{aligned}$$

where \(\{Y^{(i)}\}_{i=1}^{M}\) are independent and identically distributed (i.i.d.) realizations sampled according to \(\rho _Y\). The MC estimator variance is

$$\begin{aligned} \text {Var}\left[ \textbf{1}_{\{Y^{(i)}>\gamma \}}\right] =q-q^2. \end{aligned}$$

For a sufficiently small q, we can use (1.16) and the central limit theorem to approximate the relative error as

$$\begin{aligned} \frac{|q-{\widehat{q}}|}{q}\approx C_{\alpha }\sqrt{\frac{1}{qM}}, \end{aligned}$$

where \(C_{\alpha }\) is chosen as in (1.13).

The number of required samples to attain a relative error tolerance \(TOL_{rel}\) is \(M\approx \frac{C_{\alpha }^2}{q\cdot TOL_{rel}^2}\). Thus, for q of the order of \(10^{-8}\), the number of required samples such that \(TOL_{rel} = 5\%\) is approximately equal to \(1.5\cdot 10^{11}\).
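This sample-size arithmetic is easy to verify numerically (constants as in the text):

```python
c_alpha = 1.96            # 95% confidence quantile, as in (1.13)
q = 1e-8                  # rare event probability
tol_rel = 0.05            # 5% relative error tolerance
M = c_alpha**2 / (q * tol_rel**2)
# M is about 1.5e11, matching the count quoted above.
```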

To demonstrate the IS concept, consider the general problem of estimating \({\mathbb {E}}[g(Y)]\), where g is a given observable. In the previous example, g was chosen as \(g(y)=\textbf{1}_{\{y>\gamma \}}\). Let \({\widehat{\rho }}_Z\) be the probability density function for a new real random variable Z, such that \(g\cdot \rho _Y\) is dominated by \({\widehat{\rho }}_Z\), i.e.,

$$\begin{aligned} {\widehat{\rho }}_Z(x)=0 \implies g(x)\cdot \rho _Y(x)=0 \end{aligned}$$

for all \(x\in {\mathbb {R}}\). This permits the quantity of interest to be expressed as

$$\begin{aligned} {\mathbb {E}}[g(Y)]&=\int _{{\mathbb {R}}}g(x)\rho _Y(x)dx\nonumber \\&=\int _{{\mathbb {R}}}g(x)\underbrace{\frac{\rho _Y(x)}{{\widehat{\rho }}_Z(x)}}_{L(x)}\cdot {\widehat{\rho }}_Z(x) dx={\mathbb {E}}[L(Z)\cdot g(Z)], \end{aligned}$$

where \(L(\cdot )\) is the likelihood ratio. Hence the expected value under the new measure remains unchanged, but the variance could be reduced due to a different second moment \({\mathbb {E}}\left[ \left( g(Z)\cdot L(Z)\right) ^2\right] \).

The MC estimator under the IS measure is

$$\begin{aligned} \mu _{M}^{IS}\!=\!\frac{1}{M} \sum _{j=1}^M L(Z_{[j]})\cdot g(Z_{[j]})\!=\!\frac{1}{M} \sum _{j=1}^M \frac{\rho _Y(Z_{[j]})}{{\widehat{\rho }}_Z(Z_{[j]})}\cdot g(Z_{[j]}), \end{aligned}$$

where \(Z_{[j]}\) are i.i.d. samples from \({\widehat{\rho }}_Z\) for \(j=1,\dots ,M\).

The main challenge when using IS is choosing a new probability measure that substantially reduces the variance compared with the original measure. This step strongly depends on the structure of the problem under consideration. Furthermore, the new measure should be obtained at negligible computational cost to ensure a computationally efficient IS scheme. This is particularly challenging in the present problem, since we consider path-dependent probability measures. In particular, the aim is to introduce a path-dependent change of probability measure that corresponds to changing the Poisson random variable rates used to construct the TL paths. Section 2.1 shows how the optimal IS parameters can be obtained using a novel connection with SOC.
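To make the measure-change mechanism concrete before moving to the path-dependent setting, the sketch below estimates a Gaussian tail probability \({\mathbb {P}}(Y>\gamma )\) by sampling from a shifted density and reweighting with the likelihood ratio (the standard normal model and the mean-shift choice are our own illustration, not part of the proposed method):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, M = 4.0, 100_000

# Sample Z ~ N(gamma, 1) instead of Y ~ N(0, 1); the likelihood ratio is
# L(z) = rho_Y(z) / rho_Z(z) = exp(-gamma * z + gamma**2 / 2).
z = rng.normal(gamma, 1.0, size=M)
weights = np.exp(-gamma * z + 0.5 * gamma**2)
q_is = np.mean(weights * (z > gamma))
# q_is approximates P(Y > 4) ~ 3.2e-5; crude MC with the same M would
# observe only about 3 such events on average.
```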

2 Importance sampling (IS) via stochastic optimal control (SOC)

2.1 Dynamic programming for the importance sample parameters

This section establishes the connection between the determination of an optimal IS measure within a class of probability measures and SOC. Let \({\textbf{X}}\) be an SRN as defined in Sect. 1.1 and let \(\widehat{{\textbf{X}}}^{\Delta t}\) denote its TL approximation as given by (1.8). We aim to find a near-optimal IS measure to improve the computational performance of the MC estimator for \({\mathbb {E}} \left[ g({\textbf{X}}(T))\right] \). Since finding the optimal path-dependent change of measure over all classes of measures is a challenging problem, we limit ourselves to a parameterized class obtained by modifying the Poisson random variable rates of the TL paths. This class of measure change was previously used in [10] to improve the MLMC estimator robustness and performance in this context; we focus on a single-level MC setting, and seek to automate the task of finding a near-optimal IS measure within this class.

We introduce the change of measure resulting from changing the Poisson random variable rates in the TL scheme,

$$\begin{aligned} {\overline{P}}_{n,j}={\mathcal {P}}_{n,j}\left( \delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\Delta t\right) ,\quad n=0,\dots ,N-1,\; j=1,\dots ,J; \end{aligned}$$

where \(\delta _{n,j}^{\Delta t}({\textbf{x}})\in {\mathcal {A}}_{{\textbf{x}},j}\) is the control parameter at time step n, under reaction j, and in state \({\textbf{x}}\in {\mathbb {N}}^d\); and conditioned on \(\overline{{\textbf{X}}}^{\Delta t}_{n}\), \({\mathcal {P}}_{n,j}(r_{n,j})\) are independent Poisson random variables with respective rates \(r_{n,j}:=\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\Delta t\). The admissible set,

$$\begin{aligned} {\mathcal {A}}_{{\textbf{x}},j}={\left\{ \begin{array}{ll} \{0\},&{}\quad \text {if }a_j({\textbf{x}})=0\\ \{y\in {\mathbb {R}}: y>0\},&{}\quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

is chosen such that (1.18) is fulfilled and to avoid infinite variance for the IS estimator. The control \(\delta _{n,j}^{\Delta t}({\textbf{x}})\in {\mathcal {A}}_{{\textbf{x}},j}\) depends deterministically on the current time step n, reaction channel j, and current state \({\textbf{x}}=\overline{{\textbf{X}}}^{\Delta t}_n\) for the TL-IS approximation in (2.3).

Therefore, the resulting scheme under the new measure is

$$\begin{aligned} \overline{{\textbf{X}}}_{n+1}^{\Delta t}&=\max \left( {\textbf {0}},\overline{{\textbf{X}}}_{n}^{\Delta t}+\sum _{j=1}^J{\overline{P}}_{n,j}\varvec{\nu }_j\right) ,~~~ n=0,\dots ,N-1,\nonumber \\ \overline{{\textbf{X}}}_{0}^{\Delta t}&={\textbf{x}}_0; \end{aligned}$$

and the likelihood ratio at step n associated with the new IS measure is

$$\begin{aligned} L_n(\overline{{\textbf{P}}}_n,\varvec{\delta }_n^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n))&=\prod _{j=1}^J\exp \left( -(a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})-\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n))\Delta t\right) \left( \frac{a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})}{\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)}\right) ^{{\overline{P}}_{n,j}} \nonumber \\&=\exp \left( -\sum _{j=1}^J \left( a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})-\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)\right) \Delta t\right) \cdot \prod _{j=1}^J\left( \frac{a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})}{\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)}\right) ^{{\overline{P}}_{n,j}}; \end{aligned}$$

where \(\varvec{\delta }_n^{\Delta t}({\textbf{x}}) \in \times _{j=1}^J {\mathcal {A}}_{{\textbf{x}},j}\) are the IS parameters with \(\left( \varvec{\delta }_n^{\Delta t}({\textbf{x}})\right) _j=\delta _{n,j}^{\Delta t}({\textbf{x}}) \) and the Poisson realizations are denoted by \(\overline{{\textbf{P}}}_n\) with \(\left( \overline{{\textbf{P}}}_n\right) _j:={\overline{P}}_{n,j}\) for \(j=1,\dots ,J\). Equation (2.4) uses the convention that \(\frac{a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})}{\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)}=1\), whenever \(a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})=0\) and \(\delta _{n,j}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_n)=0\). From (2.2), this results in a factor of one in the likelihood ratio for reactions with \(a_j(\overline{{\textbf{X}}}_{n}^{\Delta t})=0\).

Therefore, the likelihood ratio for \(\{\overline{{\textbf{X}}}^{\Delta t}_n: n=0,\dots ,N\}\) across one path is

$$\begin{aligned} L\left( \left( \overline{{\textbf{P}}}_0,\dots ,\overline{{\textbf{P}}}_{N-1}\right) ,\left( \varvec{\delta }_0^{\Delta t}\big (\overline{{\textbf{X}}}^{\Delta t}_0\big ),\dots ,\varvec{\delta }_{N-1}^{\Delta t}\big (\overline{{\textbf{X}}}^{\Delta t}_{N-1}\big )\right) \right) =\prod _{n=0}^{N-1} L_n\left( \overline{{\textbf{P}}}_n,\varvec{\delta }_n^{\Delta t}\big (\overline{{\textbf{X}}}^{\Delta t}_n\big )\right) . \end{aligned}$$

This likelihood ratio completes the characterization for the proposed IS approach, and allows the quantity of interest with respect to the new measure to be expressed as

$$\begin{aligned} {\mathbb {E}}[g(\widehat{{\textbf{X}}}^{\Delta t}_N)]={\mathbb {E}}\left[ L\left( \left( \overline{{\textbf{P}}}_0,\dots ,\overline{{\textbf{P}}}_{N-1}\right) ,\left( \varvec{\delta }_0^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_0),\dots ,\varvec{\delta }_{N-1}^{\Delta t}(\overline{{\textbf{X}}}^{\Delta t}_{N-1})\right) \right) \cdot g(\overline{{\textbf{X}}}^{\Delta t}_N)\right] , \end{aligned}$$

with the expectation on the right-hand side of (2.6) taken with respect to the dynamics in (2.3).
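A sketch of simulating one controlled TL path (2.3) while accumulating the likelihood ratio (2.4)–(2.5) is given below for d = 1 (the network and the control interface are our own illustrative choices; in the proposed method the controls \(\delta _{n,j}^{\Delta t}\) come from the SOC problem):

```python
import numpy as np

def is_tau_leap_path(x0, nu, props, delta, T, N, rng):
    """Simulate a tau-leap path under IS rates delta(n, j, x) and return
    (X_N, L), where L is the path likelihood ratio (2.5)."""
    dt = T / N
    x = np.array(x0, dtype=np.int64)
    L = 1.0
    for n in range(N):
        a = np.array([aj(x) for aj in props])
        d = np.array([delta(n, j, x) for j in range(len(props))])
        P = rng.poisson(d * dt)              # Poisson variates under the new rates
        L *= np.exp(-np.sum(a - d) * dt)     # exponential factor of (2.4)
        for j in range(len(props)):
            if P[j] > 0:                     # convention: factor 1 when a_j = delta_j = 0
                L *= (a[j] / d[j]) ** P[j]
        x = np.maximum(0, x + nu.T @ P)
    return x, L

# Sanity check: with delta(n, j, x) = a_j(x) (no tilting), L must equal 1.
nu = np.array([[1], [-1]])
props = [lambda x: 2.0, lambda x: float(x[0])]
delta = lambda n, j, x: props[j](x)
rng = np.random.default_rng(2)
xT, L = is_tau_leap_path([5], nu, props, delta, T=1.0, N=50, rng=rng)
```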

Hereafter, we aim to determine optimal parameters \(\{\varvec{\delta }_n^{\Delta t}({\textbf{x}})\}_{n=0,\dots ,N-1; {\textbf{x}}\in {\mathbb {N}}^d}\) that minimize the second moment (and hence the variance) of the IS estimator, given that \(\overline{{\textbf{X}}}_0^{\Delta t}={\textbf{x}}_0\). To that end, we derive an associated SOC formulation. First, we introduce the cost function for the proposed SOC problem in Definition 2.1; we then derive a dynamic programming equation in Theorem 2.4 that is satisfied by the value function \(u_{\Delta t}(\cdot ,\cdot )\) in Definition 2.3. The proof of Theorem 2.4 is given in “Appendix A”.

Definition 2.1

(Second moment for the proposed importance sampling estimator) Let \(0 \le n \le N\). Given that \(\overline{{\textbf{X}}}_{n}^{\Delta t}= {\textbf{x}}\), the second moment for the proposed IS estimator can be expressed as

$$\begin{aligned} C_{n,{\textbf{x}}}\left( \varvec{\delta }^{\Delta t}_n,\dots ,\varvec{\delta }^{\Delta t}_{N-1}\right) ={\mathbb {E}}\left[ g^2(\overline{{\textbf{X}}}_N^{\Delta t})\prod _{k=n}^{N-1} L_k^2 \left( \overline{{\textbf{P}}}_k,\varvec{\delta }_k^{\Delta t}(\overline{{\textbf{X}}}_k^{\Delta t})\right) \Big | \, \overline{{\textbf{X}}}_n^{\Delta t}={\textbf{x}}\right] , \quad 0 \le n \le N-1, \end{aligned}$$

with terminal cost \(C_{N,{\textbf{x}}}={\mathbb {E}}\left[ g^2\left( \overline{{\textbf{X}}}_N^{\Delta t}\right) | \overline{{\textbf{X}}}_N^{\Delta t}={\textbf{x}}\right] =g^2({\textbf{x}})\), for any \({\textbf{x}} \in {\mathbb {N}}^d\).

Compared with the classical SOC formulation, (2.7) can be interpreted as an expected total cost; the main difference is that (2.7) has a multiplicative cost structure rather than the standard additive one. Therefore, in Theorem 2.4, we derive a dynamic programming relation associated with this cost structure that is fulfilled by the corresponding value function (see Definition 2.3) in the SRN context.

Remark 2.2

(Structure of the cost function) One can derive an optimal control formulation with an additive structure (similar to [30] in the stochastic differential equation setting) by applying a logarithmic transformation together with Jensen’s inequality to (2.7). This reduces the control problem to a Kullback–Leibler minimization. In [41, 42], this Kullback–Leibler minimization leads to the same optimal change of measure as the variance minimization approach. However, this conclusion requires further investigation in the SRN setting, which we leave for potential future work.

Definition 2.3

(Value function) The value function \(u_{\Delta t}(\cdot ,\cdot )\) is defined as the optimal (infimum) second moment for the proposed IS estimator. For time step \(0 \le n \le N\) and state \({\textbf{x}} \in {\mathbb {N}}^d\),

$$\begin{aligned} u_{\Delta t}(n,{\textbf{x}})&:=\inf _{\{\varvec{\delta }^{\Delta t}_k\}_{k=n,\dots ,N-1} \in {\mathcal {A}}^{N-n}}C_{n,{\textbf{x}}}\left( \varvec{\delta }^{\Delta t}_n,\dots ,\varvec{\delta }^{\Delta t}_{N-1}\right) \nonumber \\&=\inf _{\{\varvec{\delta }^{\Delta t}_k\}_{k=n,\dots ,N-1} \in {\mathcal {A}}^{N-n}}\nonumber \\&\quad {\mathbb {E}}\left[ g^2\left( \overline{{\textbf{X}}}_N^{\Delta t}\right) \prod _{k=n}^{N-1} L_k^2\left( \overline{{\textbf{P}}}_k,\varvec{\delta }_k^{\Delta t}(\overline{{\textbf{X}}}_k^{\Delta t})\right) | \overline{{\textbf{X}}}_n^{\Delta t}={\textbf{x}}\right] , \end{aligned}$$

where \({\mathcal {A}}:=\times _{j=1}^J {\mathcal {A}}_{{\textbf{x}},j}\) is the admissible set for the IS parameters; and \(u_{\Delta t}(N,{\textbf{x}})=g^2({\textbf{x}})\), for any \({\textbf{x}} \in {\mathbb {N}}^d\).

Theorem 2.4

(Dynamic programming for importance sampling parameters) For \({\textbf{x}}\in {\mathbb {N}}^d\) and \(n=N-1,\dots ,0\), the value function \(u_{\Delta t}(n,{\textbf{x}})\) fulfills the dynamic programming relation

$$\begin{aligned} u_{\Delta t}(n,{\textbf{x}})&=\inf _{(\delta _1,\dots ,\delta _J)\in {\mathcal {A}}_{{\textbf{x}}}} \exp \left( \Delta t \sum _{j=1}^J \left( \delta _j-2a_j({\textbf{x}})\right) \right) \nonumber \\&\quad \times \sum _{{\textbf{p}}\in {\mathbb {N}}^J}\left( \prod _{j=1}^J \frac{1}{p_j!}\left( \frac{a_j^2({\textbf{x}})}{\delta _j}\Delta t\right) ^{p_j}\right) u_{\Delta t}\left( n+1,\max \left( {\textbf {0}},{\textbf{x}}+\varvec{\nu }{\textbf{p}}\right) \right) , \end{aligned}$$

with \(u_{\Delta t}(N,{\textbf{x}})=g^2({\textbf{x}})\), where \(\varvec{\nu }=\left( \varvec{\nu }_1, \dots ,\varvec{\nu }_J\right) \in {\mathbb {Z}}^{d\times J}\) and \({\textbf{p}}=(p_1,\dots ,p_J)^\top \).

Theorem 2.4 reduces the minimization problem to simpler optimization problems that can be solved stepwise, backward in time, starting from the final time T. Solving the minimization problem (2.9) analytically is difficult due to the infinite sum. Section 2.2 shows how to overcome this issue by approximating (2.9) to derive near-optimal IS parameters \(\{\varvec{\delta }_n^{\Delta t}({\textbf{x}})\}_{n=0,\dots ,N-1; {\textbf{x}}\in {\mathbb {N}}^d}\) for the proposed IS approach.

2.2 Approximate dynamic programming

Theorem 2.4 characterizes the optimal IS parameters resulting from modifying the Poisson random variable rates in the TL paths. However, solving (2.9) analytically requires evaluating the infinite sum in closed form, which is generally difficult. Therefore, we propose approximating the value function \(u_{\Delta t}(n,{\textbf{x}})\) in (2.9) by \({\overline{u}}_{\Delta t}(n,{\textbf{x}})\) for all time steps \(n=0,\dots ,N\) and states \({\textbf{x}}\in {\mathbb {N}}^d\). First, both \(u_{\Delta t}(n,{\textbf{x}})\) and \({\overline{u}}_{\Delta t}(n,{\textbf{x}})\) satisfy the same final condition,

$$\begin{aligned} {\overline{u}}_{\Delta t}(N,{\textbf{x}})=u_{\Delta t}(N,{\textbf{x}})&=g^2({\textbf{x}}). \end{aligned}$$

Next, to derive the approximate dynamic programming relation for \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\), we require Assumption 2.5 to hold. This assumption is motivated by the behavior of the original propensities, which are \({{\mathcal {O}}}\left( 1\right) \) under the mass-action kinetics principle (see (1.5)).

Assumption 2.5

The controls \(\{\varvec{\delta }_{n}^{\Delta t}\}_{n=0,\dots ,N-1}\) are asymptotically constant (i.e., \(\delta _{n,j}^{\Delta t}({\textbf{x}}) \rightarrow c_{n,j,{\textbf{x}}}\), as \(\Delta t \rightarrow 0\), where \(c_{n,j,{\textbf{x}}}\) are constants for \(1\le j \le J\), \(0\le n \le N-1\), and \({\textbf{x}}\in {\mathbb {N}}^d\)).

Given Assumption 2.5 and that \(\{a_j(\cdot )\}_{j=1}^J\) are of \({{\mathcal {O}}}\left( 1\right) \), we apply a Taylor expansion around \(\Delta t=0\) to the exponential term in (2.9), then truncate the expression within the infimum such that the remaining terms are \({{\mathcal {O}}}\left( \Delta t\right) \). This truncates the infinite sum and linearizes the exponential term. Thus, for \({\textbf{x}}\in {\mathbb {N}}^d\) and \(n=N-1,\dots ,0\)

$$\begin{aligned}&{\overline{u}}_{\Delta t}(n,{\textbf{x}}) = \Delta t \inf _{(\delta _1,\dots ,\delta _J)\in {\mathcal {A}}_{\textbf{x}}}\big [ \sum _{j=1}^J \frac{a_j^2({\textbf{x}})}{\delta _j} {\overline{u}}_{\Delta t}(n+1,\max (0, {\textbf{x}}+\nu _j)) \nonumber \\&\quad + {\overline{u}}_{\Delta t}(n+1,{\textbf{x}}) \sum _{j=1}^J \delta _j\big ]\nonumber \\&\quad +{\overline{u}}_{\Delta t}(n+1,{\textbf{x}})-2\Delta t\cdot {\overline{u}}_{\Delta t}(n+1,{\textbf{x}}) \cdot \sum _{j=1}^J a_j({\textbf{x}})\nonumber \\&= \Delta t \cdot \sum _{j=1}^J\underbrace{ \inf _{\delta _j\in {\mathcal {A}}_{{\textbf{x}},j}}\big [ \frac{a_j^2({\textbf{x}})}{\delta _j}\cdot {\overline{u}}_{\Delta t}(n+1,\max (0, {\textbf{x}}+\nu _j))+ \delta _j\cdot {\overline{u}}_{\Delta t}(n+1,{\textbf{x}})\big ]}_{=:Q^{\Delta t}(n,j,{\textbf{x}})}\nonumber \\&\quad +\left( 1-2\Delta t \sum _{j=1}^J a_j({\textbf{x}})\right) {\overline{u}}_{\Delta t}(n+1,{\textbf{x}}), \end{aligned}$$

where \(\delta _j \in {\mathcal {A}}_{{\textbf{x}},j}\), \(j=1,\dots ,J\), are the SOC parameters at state \({\textbf{x}}\) for reaction j. The admissible set \({\mathcal {A}}_{{\textbf{x}},j}\) is defined in (2.2). Assumption 2.5 ensures that (i) we can apply the Taylor expansion to the exponential term as \(\Delta t\) decreases, and (ii) the truncation in (2.11) is exact up to terms of order \(\Delta t^2\), i.e., no neglected terms of order lower than \(\Delta t^2\) remain.

The infimum in (2.11) is attained when

$$\begin{aligned}{} & {} (i) \, {\overline{u}}_{\Delta t}(n+1,{\textbf{x}}) \ne 0, \quad \text {and} \nonumber \\{} & {} (ii) \, {\overline{u}}_{\Delta t}(n+1, \max (0,{\textbf{x}}+\nu _j)) \ne 0, \, \forall \, 1 \le j \le J.\nonumber \\ \end{aligned}$$

In this case, the approximate optimal SOC parameter \({\overline{\delta }}^{\Delta t}_{n,j}({\textbf{x}})\) can be analytically determined as

$$\begin{aligned} {\overline{\delta }}^{\Delta t}_{n,j}({\textbf{x}})&= \frac{a_j({\textbf{x}})\sqrt{{\overline{u}}_{\Delta t}(n+1, \max (0,{\textbf{x}}+\nu _j))}}{\sqrt{{\overline{u}}_{\Delta t}(n+1,{\textbf{x}})}},\, 1 \le j \le J. \end{aligned}$$

Note (2.13) includes the particular case when \(a_j({\textbf{x}})=0\) for some \(j\in \{1,\dots ,J\}\). In such a case, \({\overline{\delta }}_{n,j}^{\Delta t}({\textbf{x}})=0\), which agrees with (2.2).

An important advantage of this numerical approximation, \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\), is that it reduces the complexity of the original optimization problem at each step in (2.9) from a simultaneous optimization over J variables to J independent one-dimensional optimization problems that can be solved in parallel using (2.13).
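To make the backward sweep concrete, the following is a minimal Python sketch of the dynamic programming recursion (2.11) with the closed-form minimizer (2.13), for a hypothetical one-dimensional pure-decay network (\(X \rightarrow \emptyset \) with \(a(x)=\theta x\)). The truncation bound, number of steps, rate, and observable threshold are illustrative assumptions, not values prescribed in the text.

```python
import numpy as np

# Sketch of the backward dynamic-programming sweep (2.11) with the
# closed-form minimizer (2.13), for a single-reaction decay network
# X -> X - 1 with mass-action propensity a(x) = theta * x.
# S_bar, N, theta, and the threshold in g are illustrative assumptions.
theta, T, N, S_bar = 1.0, 1.0, 256, 100
dt = T / N  # chosen so that 2 * dt * max_x a(x) < 1; keeps u in [0, 1]

def a(x):
    return theta * x

def g(x):
    return 1.0 if x > 50 else 0.0  # rare-event indicator observable

# u[n, x] approximates u_bar_{dt}(n, x) on the truncated state space.
u = np.zeros((N + 1, S_bar + 1))
u[N, :] = [g(x) ** 2 for x in range(S_bar + 1)]  # final condition u = g^2

for n in range(N - 1, -1, -1):
    for x in range(S_bar + 1):
        xj = max(0, x - 1)  # state after the single reaction (nu = -1)
        # inf_delta [a^2 u' / delta + delta u] = 2 a sqrt(u' u), attained at (2.13)
        q = 2.0 * a(x) * np.sqrt(u[n + 1, xj] * u[n + 1, x])
        u[n, x] = dt * q + (1.0 - 2.0 * dt * a(x)) * u[n + 1, x]

def delta_opt(n, x):
    """Approximate optimal control (2.13); defined when u(n+1, x) > 0."""
    xj = max(0, x - 1)
    if u[n + 1, x] == 0.0:
        return 0.0  # infimum not attained; cf. Remark 2.6
    return a(x) * np.sqrt(u[n + 1, xj]) / np.sqrt(u[n + 1, x])
```

Note that the minimizing value \(2 a_j \sqrt{u' u}\) is exactly the \(Q^{\Delta t}\) term evaluated at (2.13), so each backward step costs one evaluation per state and reaction.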

Remark 2.6

(Assumption (2.12)) Whether the assumption in (2.12) is generally fulfilled depends on the method employed to solve the dynamic programming principle in (2.11). For example, with a direct numerical implementation, either special numerical treatment is required for the cases where (2.12) is violated, or some regularization is needed to ensure well-posedness. The proposed approach from Sect. 2.3 avoids this issue because we model \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\) with a strictly positive ansatz function, which guarantees that condition (2.12) holds for any state \({\textbf{x}}\) and all time steps n.

Remark 2.7

(Computational cost for dynamic programming) To derive a practical numerical algorithm for a finite number of states, we truncate the infinite state space \({\mathbb {N}}^d\) to \(\{0,\dots ,{\overline{S}}_1\}\times \cdots \times \{0,\dots ,{\overline{S}}_d\}\), where \({\overline{S}}_1,\dots ,{\overline{S}}_d\) are sufficiently large upper bounds. The computational cost to numerically solve the dynamic programming equation (2.11) for step size \(\Delta t\) on this truncated state space can be expressed as

$$\begin{aligned} W_{\text {dp}}(\overline{{\textbf{S}}},\Delta t)\approx \left( {\overline{S}}^*\right) ^d \cdot \frac{T}{\Delta t}\cdot J, \end{aligned}$$

where \({\overline{S}}^*=\max _{i=1,\dots ,d}{\overline{S}}_i\).

The cost in (2.14) scales exponentially with dimension d. Section 2.3 proposes an alternative approach to address this curse of dimensionality. However, in future work, we aim to combine dimension reduction techniques for SRNs with a direct numerical implementation of dynamic programming.
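To illustrate the scaling, the cost model (2.14) can be evaluated numerically; in this small Python sketch, the bound \({\overline{S}}^*=100\), the step size \(\Delta t = 1/2^4\), and the choice \(J=d\) are assumptions made for illustration only.

```python
# Illustrative evaluation of the dynamic-programming cost model (2.14):
# W_dp ~ (S_bar*)**d * (T / dt) * J.  The values S_star = 100,
# dt = 1/2**4, and J = d are assumptions for this sketch.
def w_dp(d, S_star=100, T=1.0, dt=1.0 / 2**4, J=None):
    J = d if J is None else J
    return S_star**d * (T / dt) * J

# The cost grows geometrically in d (roughly a factor S_star per dimension):
costs = {d: w_dp(d) for d in (1, 2, 4, 6)}
```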

2.3 Learning-based approach

Using the SOC formulation derived in Sect. 2.2, we propose approximating the value function \({\overline{u}}_{\Delta t}(\cdot ,\cdot )\) with a parameterized ansatz function, \({\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\).

Remark 2.8

(Choosing the ansatz function) The parameterized ansatz function \({\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) should consider the final condition of the value function (2.9), and its choice depends on the given SRN and observable \(g({\textbf{x}})\). For linear observables, such as \(g({\textbf{x}})=x_i\), we can consider polynomial basis functions as an ansatz. For more complex problems, the ansatz function is a small neural network.

For rare event applications with observable \(g(\mathbf {{\textbf {x}}})={\textbf{1}}_{\{x_i>\gamma \}}\), we consider a sigmoid with learning parameters \(\varvec{\beta }=\left( \varvec{\beta }^{space},\beta ^{time}\right) \in {\mathbb {R}}^{d+1}\) as the ansatz function

$$\begin{aligned} {\widehat{u}}(t,{\textbf{x}};\varvec{\beta })= \frac{1}{1+e^{-(1-t) \cdot \left( \langle \varvec{\beta }^{space},{\textbf{x}}\rangle +\beta ^{time}\right) -b_0-\beta _0x_i}}, \end{aligned}$$

where \(\langle \cdot ,\cdot \rangle \) denotes the inner product, and the time is scaled to one using \(t\in [0,1]\).

Parameters \(b_0\) and \(\beta _0\) are not learned through optimization but are determined by fitting the final condition from Theorem 2.4, which imposes \({\widehat{u}}(1,{\textbf{x}};\varvec{\beta })\approx g^2({\textbf{x}})={\textbf{1}}_{\{x_i>\gamma \}}\). Therefore, the discontinuous indicator function is approximated by a sigmoid, and the fit is characterized by the position of the sigmoid’s inflection point and the sharpness of its slope. The position and value of local and global minima with respect to the learned parameters \(\varvec{\beta }^{space}\) and \({\beta }^{time}\) depend on the choices of \(b_0\) and \({\beta }_0\).
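As an illustration, a small Python sketch of the ansatz (2.15) follows, with \(b_0\) and \(\beta _0\) fitted to the final condition; the species index, threshold \(\gamma =50\), and sharpness are illustrative assumptions.

```python
import numpy as np

# Sketch of the sigmoid ansatz (2.15).  The species index i, threshold
# gamma, and sharpness s are illustrative assumptions; b0 and beta0 are
# fitted so that u_hat(1, x) approximates the final condition 1_{x_i > gamma}.
i, gamma, s = 0, 50.0, 4.0
beta0 = s                    # sharpness of the sigmoid's slope
b0 = -s * (gamma + 0.5)      # places the inflection point at x_i = gamma + 0.5

def u_hat(t, x, beta_space, beta_time):
    """Ansatz (2.15); strictly positive, so condition (2.12) always holds."""
    z = (1.0 - t) * (np.dot(beta_space, x) + beta_time) + b0 + beta0 * x[i]
    return 1.0 / (1.0 + np.exp(-z))
```

At \(t=1\) the learned spatial term vanishes and \({\widehat{u}}\) reduces to a sharp sigmoid in \(x_i\) centered between \(\gamma \) and \(\gamma +1\), approximating the indicator.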

To derive IS parameters from the ansatz function, we use the previous SOC result from (2.13), i.e.,

$$\begin{aligned} {\widehat{\delta }}^{\Delta t}_{j}(n,{{\textbf {x}}};\varvec{\beta })&= \frac{a_j({{\textbf {x}}})\sqrt{{\widehat{u}}\left( \frac{(n+1)\Delta t}{T}, \max (0,{{\textbf {x}}}+\nu _j);\varvec{\beta }\right) }}{\sqrt{{\widehat{u}}(\frac{(n+1)\Delta t}{T},{{\textbf {x}}};\varvec{\beta })}},\,\nonumber \\ {}&1 \le j \le J, \, 0 \le n\le N-1, \, {{\textbf {x}}}\in {\mathbb {N}}^d. \end{aligned}$$

We define \({\widehat{u}}(t,\cdot ;\cdot )\) in (2.15) as a time-continuous function for \(t\in [0,1]\), whereas the IS controls \({\widehat{\delta }}^{\Delta t}_{j}(n,\cdot ;\cdot )\) are discrete in time for \(n=0,\dots ,N-1\) and depend on the time step size \(\Delta t\). Therefore, \({\widehat{u}}(\cdot ,\cdot ;\varvec{\beta })\) can be used to derive control parameters in (2.16) for arbitrary \(\Delta t\).

The parameters \(\varvec{\beta }\) for the ansatz function are then chosen to minimize the second moment,

$$\begin{aligned} \inf _{\varvec{\beta \in {\mathbb {R}}^{d+1}}} {\mathbb {E}}\underbrace{\left[ g^2\left( \overline{{\textbf{X}}}_N^{\Delta t,\varvec{\beta }}\right) \prod _{k=0}^{N-1} L_k^2\left( \overline{{\textbf{P}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\overline{{\textbf{X}}}_k^{\Delta t,\varvec{\beta }};\varvec{\beta })\right) \right] }_{=:C_{0,{\textbf{x}}}\left( \widehat{\varvec{\delta }}^{\Delta t}_0,\dots ,\widehat{\varvec{\delta }}^{\Delta t}_{N-1}; \varvec{\beta }\right) }, \end{aligned}$$

where \(\{\overline{{\textbf{X}}}_n^{\Delta t,\varvec{\beta }}\}_{n=1,\dots ,N}\) is the IS path generated using IS parameters from (2.16) and \(\left( \widehat{\varvec{\delta }}^{\Delta t}(n,{\textbf{x}};\varvec{\beta })\right) _j={\widehat{\delta }}^{\Delta t}_{j}(n,{\textbf{x}};\varvec{\beta })\) for \(1\le j\le J\).

We use a gradient-based stochastic optimization method to solve (2.17), and derive Lemma 2.9 (proof in “Appendix B”) for the gradient of the second moment with respect to the parameters \(\varvec{\beta }\).

Lemma 2.9

The partial derivatives for the second moment \(C_{0,{\textbf{x}}}\left( \widehat{\varvec{\delta }}^{\Delta t}_0,\dots ,\widehat{\varvec{\delta }}^{\Delta t}_{N-1}; \varvec{\beta }\right) \) in (2.17) with respect to \(\beta _{l}\), \(l=1,\dots , (d+1)\), are given by

$$\begin{aligned}&\frac{\partial }{\partial \beta _l}{\mathbb {E}}\left[ \underset{=:R({{\textbf {x}}}_0;\varvec{\beta })}{\underbrace{g^2\left( \overline{{{\textbf {X}}}}_N^{\Delta t,\varvec{\beta }}\right) \prod _{k=0}^{N-1} L_k^2\left( \overline{{{\textbf {P}}}}_k,\widehat{\varvec{\delta }}^{\Delta t}(k,\overline{{{\textbf {X}}}}_k^{\Delta t,\varvec{\beta }};\varvec{\beta })\right) }}\right] \nonumber \\ {}&\quad \!=\!{\mathbb {E}}\!\left[ R({{\textbf {x}}}_0;\varvec{\beta }) \!\left( \!\sum _{k=1}^{N-1}\!\sum _{j=1}^J \!\left( \Delta t - \frac{{\overline{P}}_{k,j}}{{\widehat{\delta }}_j^{\Delta t}(k,\overline{{{\textbf {X}}}}^{\Delta t,\varvec{\beta }}_k;\varvec{\beta })}\!\right) \cdot \frac{\partial }{\partial \beta _l} {\widehat{\delta }}_j^{\Delta t}(k,\overline{{{\textbf {X}}}}^{\Delta t,\varvec{\beta }}_k;\varvec{\beta })\!\right) \!\right] , \end{aligned}$$

where \(\{\overline{{\textbf{X}}}_n^{\Delta t,\varvec{\beta }}\}_{n=1,\dots ,N}\) is the IS path generated using the IS parameters from (2.16) and

$$\begin{aligned}&\frac{\partial }{\partial \beta _l} {\widehat{\delta }}_j^{\Delta t}(k,{{\textbf {x}}};\varvec{\beta })\nonumber \\ {}&=\frac{a_j^2({{\textbf {x}}})}{2{\widehat{\delta }}_j^{\Delta t}(k,{{\textbf {x}}};\varvec{\beta })}\cdot \left( \frac{\frac{\partial }{\partial \beta _l}{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({{\textbf {x}}}+\nu _j,0);\varvec{\beta })}{{\widehat{u}}(\frac{(k+1)\Delta t}{T},{{\textbf {x}}};\varvec{\beta })} \nonumber \right. \\ {}&\left. \quad -\frac{{\widehat{u}}(\frac{(k+1)\Delta t}{T},\max ({{\textbf {x}}}+\nu _j,0);\varvec{\beta })\frac{\partial }{\partial \beta _l}{\widehat{u}}(\frac{(k+1)\Delta t}{T},{{\textbf {x}}};\varvec{\beta })}{{\widehat{u}}^2(\frac{(k+1)\Delta t}{T},{{\textbf {x}}};\varvec{\beta })}\right) . \end{aligned}$$

Thus, partial derivatives for \({\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) for the ansatz (2.15) are

$$\begin{aligned}&\frac{\partial }{\partial \beta _l}{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\nonumber \\&\quad ={\left\{ \begin{array}{ll} (1-t)x_i{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })(1-{\widehat{u}}(t,{\textbf{x}};\varvec{\beta }))&{}, \text {if } \beta _l=\left( \varvec{\beta }^{space}\right) _{i}\\ (1-t){\widehat{u}}(t,{\textbf{x}};\varvec{\beta })(1-{\widehat{u}}(t,{\textbf{x}};\varvec{\beta }))&{}, \text {if } \beta _l=\beta ^{time}, \end{array}\right. }\nonumber \\ \end{aligned}$$

where \(\left( \varvec{\beta }^{space}\right) _{i}\) denotes the i-th entry for \(\varvec{\beta }^{space}\).
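As a sanity check, the analytic derivatives (2.20) can be verified against central finite differences; the following Python sketch does so at an illustrative parameter point (all numerical values, including \(b_0\) and \(\beta _0\), are assumptions).

```python
import numpy as np

# Sketch verifying the analytic derivatives (2.20) of the ansatz (2.15)
# against central finite differences.  b0, beta0, and the evaluation
# point are illustrative assumptions.
b0, beta0, i = -202.0, 4.0, 0

def u_hat(t, x, beta_space, beta_time):
    z = (1.0 - t) * (np.dot(beta_space, x) + beta_time) + b0 + beta0 * x[i]
    return 1.0 / (1.0 + np.exp(-z))

def grad_u_hat(t, x, beta_space, beta_time):
    """Analytic gradient (2.20) w.r.t. (beta_space, beta_time)."""
    u = u_hat(t, x, beta_space, beta_time)
    g_space = (1.0 - t) * x * u * (1.0 - u)   # entry (beta_space)_k, for each k
    g_time = (1.0 - t) * u * (1.0 - u)        # entry beta_time
    return g_space, g_time
```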

For an ansatz function different from (2.15), the gradient is still given by Lemma 2.9; only the derivation of \(\frac{\partial }{\partial \beta _l}{\widehat{u}}(t,{\textbf{x}};\varvec{\beta })\) in (2.20) changes accordingly.

By estimating the gradient in (2.18) using an MC estimator, we iteratively optimize the parameters \(\varvec{\beta }\) to reduce the variance. For this optimization, we use the Adam optimizer with the parameter values suggested in [34], except that the step size is tuned to our problem setting.

In Sect. 3, we illustrate the potential of our new IS method based on the learning approach numerically in terms of variance reduction. Further theoretical and numerical analysis of this approach is left for future work, particularly the initialization for the learned parameters \(\beta ^{time}\) and \(\varvec{\beta }^{space}\) in (2.15) and investigations of a stopping rule.

To derive an estimator for \({\mathbb {E}}[g({\textbf{X}}(T))]\) using the proposed IS change of measure, we first solve the related SOC problem using the approach from this section; then we simulate M paths under the new IS sampling measure. Thus, the MC estimator using the proposed IS change of measure over M paths becomes

$$\begin{aligned} \mu ^{IS}_{M,\Delta t}=\frac{1}{M} \sum _{i=1}^M L_i\cdot g(\overline{{\textbf{X}}}_{[i],N}^{\Delta t,\varvec{\beta }}), \end{aligned}$$

where \(\overline{{\textbf{X}}}_{[i],N}^{\Delta t,\varvec{\beta }}\) is the i-th IS sample path and the corresponding likelihood factor from (2.5) is

$$\begin{aligned} L_i=L\left( \left( \overline{{{\textbf {P}}}}_0,\dots ,\overline{{{\textbf {P}}}}_{N-1}\right) ,\left( \widehat{\varvec{\delta }}^{\Delta t}(0,\overline{{{\textbf {X}}}}_{[i],0}^{\Delta t,\varvec{\beta }};\varvec{\beta }),\dots ,\widehat{\varvec{\delta }}^{\Delta t}(N-1,\overline{{{\textbf {X}}}}_{[i],N-1}^{\Delta t,\varvec{\beta }};\varvec{\beta })\right) \right) . \end{aligned}$$
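For a hypothetical one-dimensional pure-decay network (\(X\rightarrow \emptyset \) with \(a(x)=x\), similar in spirit to Example 3.1), the estimator (2.21)–(2.22) can be sketched as below. The ansatz parameters are hand-chosen rather than learned, and the per-step likelihood factor is written as the standard Poisson ratio \(e^{(\delta _j-a_j)\Delta t}(a_j/\delta _j)^{P_j}\), which is consistent with the weights \(\big (\Delta t - P_{k,j}/{\widehat{\delta }}_j^{\Delta t}\big )\) appearing in Lemma 2.9; all numerical values are illustrative assumptions.

```python
import numpy as np

# Sketch of the IS-TL estimator (2.21)-(2.22) for a pure-decay network
# X -> X - 1 with a(x) = x, X_0 = 100, and g(x) = 1_{x > 50}.  The
# ansatz parameters below are hand-chosen for illustration, not learned.
rng = np.random.default_rng(0)
T, N, x0, gamma = 1.0, 16, 100.0, 50.0
dt = T / N
b0, beta0 = -2.0 * (gamma + 0.5), 2.0   # final-condition fit (assumed)
beta_s, beta_t = -1.5, 32.5             # hand-chosen spatial/time parameters

def u_hat(t, x):
    """Sigmoid ansatz (2.15) for a single species."""
    z = (1.0 - t) * (beta_s * x + beta_t) + b0 + beta0 * x
    return 1.0 / (1.0 + np.exp(-z))

def is_tl_estimator(M):
    total = 0.0
    for _ in range(M):
        x, L = x0, 1.0
        for n in range(N):
            a = x                       # propensity a(x) = theta * x, theta = 1
            if a == 0.0:
                break                   # absorbing state: no further events
            t_next = (n + 1) * dt / T   # ansatz time scaled to [0, 1]
            # IS control (2.16); strictly positive since u_hat > 0
            delta = a * np.sqrt(u_hat(t_next, max(x - 1.0, 0.0)) / u_hat(t_next, x))
            P = rng.poisson(delta * dt)
            # Poisson likelihood-ratio factor (original vs. IS measure)
            L *= np.exp((delta - a) * dt) * (a / delta) ** P
            x = max(x - P, 0.0)
        total += L * (1.0 if x > gamma else 0.0)
    return total / M

p_hat = is_tl_estimator(2000)  # IS estimate of P(X(T) > 50)
```

The control \(\delta < a\) slows the decay whenever the ansatz assigns lower value to the post-reaction state, steering paths toward the rare set while the accumulated likelihood \(L\) keeps the estimator unbiased with respect to the TL dynamics.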

Remark 2.10

The explicit pathwise derivatives in Lemma 2.9 have the following advantages over a finite difference approach: (i) the explicit pathwise derivatives are unbiased with respect to the TL scheme, so only the MC error for evaluating the expectation remains (i.e., there is no additional finite difference error), and (ii) the gradient computation in (2.18) requires estimating an expected value with a high relative error because g is fitted to an indicator function; using the IS-TL paths, we better control the related statistical error.

2.4 Computational cost for the learning-based approach

This section discusses the computational complexity of the learning-based approach to achieve a prescribed tolerance \(\text {TOL}\). Recall that the proposed approach comprises two steps; hence, two types of costs occur: (i) the offline cost of learning the ansatz function parameters \(\varvec{\beta }\), and (ii) the online cost of obtaining the MC estimator (2.21) based on M simulated paths using the derived IS measure (see (2.16)).

The offline cost for (i) can be expressed as

$$\begin{aligned} W_{pl}(I,M_0,\Delta t_{pl})\approx I \cdot M_0 \cdot \frac{T}{\Delta t_{pl}} \cdot J \cdot (C_{Poi} + C_{grad}), \end{aligned}$$

where I is the number of optimizer steps, \(M_0\) is the number of paths used to estimate the gradient per optimizer step, \(C_{Poi}\) is the cost of generating one Poisson random variable, \(C_{grad}\) is the cost of the algebraic evaluation in the gradient update (2.18), and \(\Delta t_{pl}\) is the step size. In contrast to (2.14), this offline cost does not scale exponentially with dimension d.

Fig. 1

Example 3.1 with step size \(\Delta t_{pl}=\Delta t_f=1/2^4\) for the proposed IS-MC estimator: a sample mean; b squared coefficient of variation; c parameters; d kurtosis for each optimizer step. Adam optimizer gradient, sample variance, and kurtosis were estimated using \(M_0=10^4\) samples. The reference value for the standard MC-TL approach was derived from a single run with \(M=10^6\) samples and with step size \(\Delta t=1/2^4\)

Fig. 2

Example 3.2 with step size \(\Delta t_{ pl}=\Delta t_f=1/2^4\) for the proposed IS-MC estimator: a sample mean; b squared coefficient of variation; c parameters; d kurtosis for each optimizer step. The gradient for the Adam optimization, the sample variance, and the kurtosis were estimated using \(M_0=10^5\) samples. Standard MC-TL with step size \(\Delta t=1/2^4\) and \(M=10^7\) samples was used for comparison

Fig. 3

Example 3.3 with step size \(\Delta t_{ pl}=\Delta t_f=1/2^4\) for the proposed IS-MC estimators: a sample mean; b squared coefficient of variation; c parameters; d kurtosis for each optimizer step. The gradient for Adam optimization, the sample variance, and the kurtosis were estimated using \(M_0=10^5\) samples. Standard MC-TL with \(M=10^6\) samples and step size \(\Delta t=1/2^4\) was used for comparison

The cost for one IS-TL path based on \({\widehat{u}}(\cdot ,\cdot ;\varvec{\beta })\) is the same as for a TL path with negligible additional factors \(C_{{\widehat{\delta }}}\) for evaluating (2.16) and \(C_{lik}\) for deriving the likelihood update, as given in (2.4),

$$\begin{aligned} W_{forward}(\Delta t_f)\approx \frac{T}{\Delta t_f} \cdot J \cdot (C_{Poi} +C_{lik}+C_{{\widehat{\delta }}}), \end{aligned}$$

where \(\Delta t_f\) is the step size. Thus, the total cost is

$$\begin{aligned}&W_{IS-TL}(M,\Delta t_{pl},\Delta t_f) \\&\quad \approx W_{pl} (I,M_0,\Delta t_{pl}) + M\cdot W_{forward}(\Delta t_f). \end{aligned}$$

Following the same derivation as for (1.11)–(1.14), we choose \(\Delta t_{f}=\frac{\text {TOL}}{2 \cdot C}\), where C is the constant from (1.9), to obtain the total computational complexity for achieving a prescribed tolerance TOL,

$$\begin{aligned}&W_{IS-TL}(\text {TOL})=W_{pl} (I,M_0,\Delta t_{pl}) \nonumber \\&\quad +const \cdot \frac{\text {Var}\left[ g\left( \overline{{\textbf{X}}}^{\Delta t,\varvec{\beta }}_N\right) \cdot L\right] }{\text {TOL}^3}, \end{aligned}$$

where L is the likelihood factor corresponding to the IS path \(\overline{{\textbf{X}}}^{\Delta t,\varvec{\beta }}\) (refer to (2.22)).

Our numerical simulations suggest that the variance reduction achieved with the proposed approach does not depend on \(\Delta t_{pl}\) (see Fig. 4). Therefore, we can achieve a low offline parameter learning cost (\(W_{pl} (I,M_0,\Delta t_{pl})\)) by using \(\Delta t_{pl}\gg \Delta t_{f}\).

For comparison, Sect. 1.3 shows that the standard MC-TL approach has total computational complexity

$$\begin{aligned} W_{MC-TL}(TOL)=const_{TL}\cdot \frac{Var[g(\widehat{{\textbf{X}}}^{\Delta t}_N)]}{TOL^3}. \end{aligned}$$

The proposed IS approach reduces this cost by variance reduction \(\left( \text {Var}[g(\overline{{\textbf{X}}}^{\Delta t}_N)\cdot L]\ll Var[g(\widehat{{\textbf{X}}}^{\Delta t}_N)]\right) \) (refer to Figs. 1, 2, 3). The TL variance becomes increasingly large in the asymptotic regime for very rare event probabilities, such that the additional cost \(W_{pl} (I,M_0,\Delta t_{pl})\) for learning \(\varvec{\beta }\) in (2.23) becomes negligible. Therefore, we obtain \(W_{IS-TL}(TOL)\ll W_{MC-TL}(TOL)\) in the rare event regime.

3 Numerical experiments and results

Through Examples 3.1, 3.2, and 3.3, we demonstrate the advantages of the proposed IS approach over the standard MC approach. We numerically show that the proposed approach achieves substantial variance reduction compared with standard MC estimators when applied to SRNs of different dimensions.

Example 3.1

(Pure decay) This example considers one species and a single reaction,

$$\begin{aligned} X\overset{\theta _1}{\rightarrow }\ \emptyset , \end{aligned}$$

where \(\theta _1=1\), and the final time \(T = 1\). Thus, the propensity is \(a(x)=\theta _1x\), the stoichiometric vector is \(\nu =-1\), and the observable is \(g(x)=\textbf{1}_{\{x>50\}}\) with \(X_0=100\).

Example 3.2

(Michaelis–Menten enzyme kinetics) The Michaelis–Menten enzyme kinetics [43] describe the catalytic conversion of a substrate S into a product P through three reactions,

$$\begin{aligned} E+S\overset{\theta _1}{\rightarrow }\ C, ~~C\overset{\theta _2}{\rightarrow }\ E+S,~~ C\overset{\theta _3}{\rightarrow }\ E+P, \end{aligned}$$

where E denotes the enzyme and \(\theta = (0.001,0.005,0.01)^\top \). We consider the initial state \({\textbf{X}}_0=(E(0),S(0),C(0),P(0))^\top =(100, 100, 0, 0)^\top \) and the final time \(T=1\). The corresponding propensity and the change of the state matrix are

$$\begin{aligned} a({\textbf {x}})=\left( \begin{array}{c} \theta _{1} E S \\ \theta _{2} C \\ \theta _{3} C \end{array}\right) , \quad \varvec{\nu }=\left( \begin{array}{ccc} -1 &{} 1 &{} 1 \\ -1 &{} 1 &{} 0 \\ 1 &{} -1&{} -1 \\ 0&{} 0&{} 1 \end{array}\right) . \end{aligned}$$

The observable of interest is \(g({\textbf{x}})={\textbf{1}}_{\{x_3>22\}}\).

Fig. 4

Example 3.2, parameters \(\varvec{\beta }^{space}\) and \(\beta ^{time}\) learned with \(\Delta t_{pl} =1/2^4\) (see final optimizer step in Fig. 2) and applied to forward runs with different \(\Delta t_f\) values. The squared coefficient of variation was estimated with \(M=10^6\) sample paths. The standard MC-TL approach is used as reference (dashed red line)

Example 3.3

(Enzymatic futile cycle model) The enzymatic futile cycle [36] comprises two instances of the elementary single-substrate enzymatic reaction scheme and can be described by six reactions,

$$\begin{aligned}&R_{1}: S_{1}+S_{2} {\mathop {\longrightarrow }\limits ^{\theta _{1}}} S_{3}\text{, } \quad R_{2}: S_{3} {\mathop {\longrightarrow }\limits ^{\theta _{2}}} S_{1}+S_{2} \text{, } \quad \\&R_{3}: S_{3} {\mathop {\longrightarrow }\limits ^{\theta _{3}}} S_{1}+S_{5} \text{, } \\&R_{4}: S_{4}+S_{5} {\mathop {\longrightarrow }\limits ^{\theta _{4}}} S_{6} \text{, } \quad R_{5}: S_{6} {\mathop {\longrightarrow }\limits ^{\theta _{5}}} S_{4}+S_{5} \text{, } \quad \\&R_{6}: S_{6} {\mathop {\longrightarrow }\limits ^{\theta _{6}}} S_{4}+S_{2} \text{. } \end{aligned}$$

Initial states are \({\textbf{X}}(0)=\left( S_1(0),\dots ,S_6(0)\right) =\left( 1, 50, 0, 1, 50, 0 \right) \), and we take the rates as \(\theta _{1}=\theta _{2}=\theta _{4}=\theta _{5}=1\), and \(\theta _{3}=\theta _{6}=0.1\). The propensity \(a({\textbf{x}})\) follows the stochastic mass-action kinetics in (1.5) and the final time is \(T=2\). We consider \(g({\textbf{x}})={\textbf{1}}_{\{x_5>60\}}\) as the observable.

Since all three examples estimate rare event probabilities with observable \(g(\mathbf {{\textbf {x}}})={\textbf{1}}_{\{x_i>\gamma \}}\), we use the ansatz function (2.15) with initial parameters \(\varvec{\beta }^{space}=0\) and \(\beta ^{time}=0\). The relative error is more relevant than the absolute error for rare event occurrences; hence, we use a relative version of the variance, the squared coefficient of variation [12, 35], which, for a random variable X, is given by

$$\begin{aligned} Var_{rel}[X]=\frac{Var[X]}{{\mathbb {E}}[X]^2}. \end{aligned}$$

To judge the robustness of our variance estimators, we estimate the kurtosis, \(\kappa :=\frac{\textrm{E}\left[ \left( X-E\left[ X\right] \right) ^4\right] }{\left( {\text {Var}}\left[ X\right] \right) ^2}\), because the standard deviation of the sample variance [10] is given by

$$\begin{aligned} \sigma _{{\mathcal {S}}^2\left( X\right) }=\frac{{\text {Var}}\left[ X\right] }{\sqrt{M}} \sqrt{\left( \kappa -1\right) +\frac{2}{M-1}}, \end{aligned}$$

where M is the number of samples.
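The two diagnostics above are straightforward to compute; the following Python sketch implements the squared coefficient of variation and the kurtosis-based standard deviation of the sample variance (the Gaussian test sample is an illustrative assumption).

```python
import numpy as np

# Sketch of the diagnostics above: the squared coefficient of variation
# Var[X] / E[X]^2 and the standard deviation of the sample variance,
# computed from the (non-excess) kurtosis.
def var_rel(x):
    """Squared coefficient of variation Var[X] / E[X]^2."""
    return np.var(x) / np.mean(x) ** 2

def kurtosis(x):
    """Non-excess kurtosis E[(X - E[X])^4] / Var[X]^2."""
    c = x - np.mean(x)
    return np.mean(c**4) / np.var(x) ** 2

def std_sample_variance(x):
    """Standard deviation of the sample variance via the kurtosis formula."""
    M = len(x)
    return np.var(x) / np.sqrt(M) * np.sqrt((kurtosis(x) - 1.0) + 2.0 / (M - 1))
```

For a Gaussian sample, the kurtosis is close to 3, so the last formula reduces to the familiar \(\sigma _{{\mathcal {S}}^2}\approx \sqrt{2/M}\,{\text {Var}}[X]\); a bounded kurtosis well below that of the standard TL estimator signals a robust variance estimate.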

We set the Adam optimizer step size \(\alpha =0.1\) for the simulations.

Figure 1 shows 100 Adam optimization steps for the decay example (Example 3.1) with step size \(\Delta t_{pl}=\Delta t_f=1/2^4\). The quantity of interest is a rare event probability of magnitude \(10^{-3}\). To estimate the gradient, we use \(M_0=10^4\) samples per Adam iteration. The squared coefficient of variation is reduced by a factor of \(10^{2}\) compared with the standard MC-TL variance after 13 Adam iterations. After reaching this minimum, the squared coefficient of variation increases over the subsequent iteration steps. This behavior might be avoided by employing a smaller step size in the Adam algorithm. Figure 1d confirms that the kurtosis is bounded at a level below that of standard TL, indicating a robust variance estimator.

For the 4-dimensional stochastic reaction network (Example 3.2), the rare event probability for the event \(\{X_3(T)>22\}\) is of magnitude \(10^{-5}\). Figure 2b confirms that the proposed learning-based approach reduces the variance by a factor of \( 4\times 10^3\) compared with standard TL for step size \(\Delta t_{pl}=\Delta t_f=1/2^4\). Although Fig. 2c seems to show that parameters \(\beta ^{time}\) and \(\beta ^{space}_4\) overlap, this is an artifact of the scale of the y-axis; in fact, the final values are \(\beta ^{time}=-3.2\times 10^{-4}\) and \(\beta ^{space}_4=-3.0\times 10^{-3}\). The intrinsic structure of Example 3.2 results in similar molecule counts for E(t) and S(t) and hence similar values for \(\beta ^{space}_1\) and \(\beta ^{space}_2\). Figure 2d confirms that the kurtosis for the proposed approach is substantially reduced compared with that of the standard TL approach.

The 6-dimensional example (Example 3.3) has a rare event probability with magnitude \(10^{-6}\). Figure 3 shows the Adam optimization results for step size \(\Delta t_{pl}=\Delta t_f=1/2^4\). The TL mean differs from the mean for the proposed approach (Fig. 3a) because the standard MC-TL estimator requires more than \(10^6\) runs to accurately estimate a probability of order \(10^{-6}\). The proposed learning-based approach reduces the variance by a factor of more than 50 after 43 iterations. The kurtosis is bounded and lower than the kurtosis for the TL approach, confirming that the proposed approach results in a robust variance estimator.

Examples 3.2 and 3.3 show that a good choice of the ansatz function, combined with reasonable initial parameters, provides substantial variance reduction from the first optimization step. However, we do not expect this behavior in general, particularly in high dimensions; therefore, we performed several optimization iterations.

The examples used step size \(\Delta t_{pl}=1/2^4\) and showed the squared coefficient of variation with respect to the same step size. To demonstrate that the learned parameters \(\varvec{\beta }\) can be used for forward runs with smaller step sizes (i.e., \(\Delta t_{f}\ll \Delta t_{pl}\)), as claimed in Sect. 2.4, we consider Example 3.2 and apply the final parameters from Fig. 2 to forward runs with different \(\Delta t_{f}\) values (Fig. 4). The results show that the variance reduction is constant with respect to \(\Delta t_f\), suggesting that a coarse \(\Delta t_{pl}\) is sufficient for parameter learning. The same behavior was observed for other tested examples.

Remark 3.4

We used the ansatz (2.15) based on a single sigmoid for the numerical experiments to demonstrate the potential for the proposed learning-based IS. Further variance reduction may be achieved either by summing several sigmoid functions as ansatz or selecting a different basis function shape. Relevant analyses will be pursued in future work.

4 Conclusions and future work

This work developed an efficient path-dependent IS scheme to estimate statistical quantities for SRN processes, particularly rare event probabilities. Optimal IS parameters were obtained within a pre-selected class of change of measure using the proposed connection to an associated SOC problem, which could be solved via dynamic programming. To mitigate the curse of dimensionality encountered by the dynamic programming relation, we proposed a method for multi-dimensional SRNs based on approximating the value function via an ansatz function (e.g., a neural network), where the parameters were learned using a stochastic optimization algorithm. Numerical examples and subsequent analyses verified that the proposed estimator achieved substantial variance reduction compared with the standard MC method, providing lowered computational complexity in the rare event regime.

Future work will further analyze the proposed learning-based approach and expand it to derive a multilevel MC estimator. We also plan to combine an implementation of the dynamic programming principle as derived in Sect. 2.2 with dimension reduction methods for SRNs.