1 Introduction

Transcription, the production of RNA from a gene, is an inherently stochastic process. Specifically, the interval of time between two successive transcription events is a random variable whose statistics depend on multiple single-molecule events behind transcription (Sanchez and Golding 2013). When the distribution of this random variable is exponential, we say that expression is constitutive; in that case, the number of transcripts produced in a certain interval of time follows a Poisson distribution. On the other hand, when the distribution of times between two successive transcripts is non-exponential, then the number of transcripts is non-Poissonian. A special case of such non-constitutive behaviour is bursty expression, whereby transcripts are produced in short bursts that are separated by long silent intervals (Suter et al. 2011; Halpern et al. 2015). In yeast, genes whose expression is constitutive include MDN1, KAP104, and DOA1, whereas PDR5 is an example of a gene whose expression is bursty (Zenklusen et al. 2008).

For two decades, mathematical models of gene expression have been developed to predict the distribution of RNA abundance. By matching the theoretical distribution with experimental measurements from microscopy-based methods (Raj et al. 2008), one hopes to obtain insight into the underlying kinetics of transcription and to estimate transcriptional parameters. The standard model of gene expression which has been used for these analyses is the telegraph model (Peccoud and Ycart 1995), whereby a gene can be in two states. Transcription occurs in one of the states, whereupon RNA degrades; first-order kinetics is assumed for all processes. While the distribution obtained from the telegraph model can typically fit cellular RNA abundance data, there are innate difficulties with the interpretation of that fit: fluctuations in cellular RNA numbers and, hence, the shape of the experimental RNA distribution do not only reflect transcription, but also many processes downstream thereof, such as splicing, RNA degradation, and partitioning during cell division.

To counteract these difficulties, in the past few years, mathematical models (Choubey et al. 2015; Choubey 2018; Heng et al. 2016; Cao and Grima 2020) have been developed to predict the statistics of nascent RNA, i.e. of RNA in the process of being synthesised by the RNA polymerase molecule (RNAP), which can be visualised and quantified due to recent advances in fluorescence microscopy (Lenstra et al. 2016; Skinner et al. 2016; Larson et al. 2011; Antoine et al. 2014; Brouwer and Lenstra 2019). In contrast to cellular RNA, the statistics of nascent RNA is a direct reflection of the transcription process; hence, these models can potentially give more insight than the simpler, but cruder telegraph model. Choubey and collaborators (Choubey et al. 2015; Choubey 2018) have developed a stochastic model with the following properties: (i) a gene can be in two states (active or inactive); (ii) from the active state, transcription initiation occurs in two sequential steps: the pre-initiation complex is formed, after which the RNA polymerase escapes the promoter; (iii) once on the gene, the polymerase moves from one base pair to the next (with some probability) until the end of the gene is reached, when transcription is terminated and polymerase detaches. Queuing theory is used to derive analytical expressions for the transient and steady-state means and variances of numbers of RNAP that are attached to the gene in the long-gene limit when the elongation time is practically deterministic. Heng et al. (2016) have considered a coarse-grained version of that model, whereby the movement of RNAP from one base pair to the next is not explicitly modelled, obtaining an analytical expression for the total RNAP distribution in steady-state conditions. More recently, Cao and Grima (2020) have studied a model of eukaryotic gene expression that yields approximate time-dependent distributions of both nascent and cellular RNA abundance as a function of the parameters controlling gene switching, DNA duplication, partitioning at cell division, gene dosage compensation, and RNA degradation; in their coarse-grained model, the movement of RNAP is not explicitly modelled, while the elongation time is assumed to be exponentially distributed, which simplifies the requisite analysis.

The complexity of nascent RNA models has thus far not allowed the same detailed level of analysis as has been possible with the much simpler telegraph model. A few shortcomings of current models can be summarised as follows: (i) distributions of nascent RNA have been derived from models that do not explicitly model the movement of RNAP along a gene (Heng et al. 2016; Cao and Grima 2020), resulting in a disconnect between theoretical description and the microscopic processes underlying transcription; (ii) while the analysis of single-cell sequencing data and electron micrograph data yields the positions of individual polymerases along the gene, allowing for the calculation of statistics (means and variances) of the numbers of RNAP on gene segments that are obtained after binning, detailed models of RNAP elongation (Choubey et al. 2015; Choubey 2018) provide analytical results only for total RNAP on a gene and hence cannot be used to understand gene segment data; (iii) analytical calculations of the statistics of nascent RNA ignore important details of the transcription process such as pausing, traffic jams, backtracking, and premature termination, some of which have to date been explored via stochastic simulation (Klumpp and Hwa 2008; Rajala et al. 2010; Choubey et al. 2015; Rodriguez et al. 2019; Md Zulfikar et al. 2020).

In this paper, we overcome some of the aforementioned shortcomings of analytically tractable models for the transcription process. In Sect. 2, we study a stochastic model for promoter switching and the stochastic movement of RNAP along a gene, allowing for premature termination. We derive exact closed-form expressions for the first and second moments (means and variances) of local RNAP fluctuations on gene segments of arbitrary length, which allows us to study how these statistics vary along a gene as a function of transcriptional parameters; we also obtain expressions for the mean and variance of the total RNAP on the gene which generalise previous work by Choubey et al. (2015). In Sect. 3, we investigate approximations for the distributions of total RNAP and mature RNA, showing in particular that Negative Binomial distributions can provide an accurate approximation in certain biologically meaningful limits. In Sect. 4, we illustrate the difference between the statistics of local and total RNAP fluctuations and those of light fluorescence due to tagged nascent RNA. In Sect. 5, we extend our model to include pausing by deriving approximate expressions for the mean, variance, and distribution of observables. We conclude with a discussion of our results in Sect. 6.

2 Detailed Stochastic Model of Transcription: Set-up and Analysis

In this section, we specify the stochastic model studied here; then, we derive closed-form expressions for the moments of mature RNA and of local and total RNAP fluctuations in various parameter regimes.

Fig. 1
figure 1

(Color Figure Online) Model of transcription. a The gene is arbitrarily divided into L segments, with RNAP (blue) on gene segment i denoted by \(P_i\). The promoter switches from the active state \(G_\mathrm{on}\) to the inactive state \(G_\mathrm{off}\) with rate \(s_b\), while the reverse switching occurs with rate \(s_u\). When the promoter is active, initiation of RNAP occurs with rate r. Initiation is followed by elongation, which is modelled as RNAP ‘hopping’ from gene segment i to the neighbouring segment \(i+1\) with rate k, i.e. as the transformation of species \(P_i\) to \(P_{i+1}\). RNAP prematurely detaches from the gene with rate d. A nascent RNA tail (red), attached to the RNAP, grows as elongation proceeds. Termination is modelled by the change of \(P_L\) with rate k to mature RNA (M), which subsequently degrades with rate \(d_m\). In b, we show the probability distribution P(T) of the total elongation time T—the time between initiation and termination—as predicted by the stochastic simulation algorithm (SSA; histogram) and our theory (Erlang distribution with shape parameter L and rate \(k + d\); solid line). The parameter values used are \(L=50\), \(k=10\)/min, and \(d=1.5\)/min. In c, we show the dependence of the mean of the distribution P(T) on the RNAP detachment rate (d), as predicted by SSA (dots) and our theory (\(\langle {}T\rangle =L/(k+d)\); solid line). The relevant parameter values are \(L=50\) and \(k=\) 10/min (Color figure online)

2.1 Set-up of Model

We consider a stochastic model of transcription that includes the processes of initiation, elongation, and termination, as illustrated in Fig. 1. For simplicity, we divide the gene into L segments; the RNAP on gene segment i is then denoted by \(P_i\). The promoter can be either in the inactive state (\(G_{\mathrm{off}}\)) or the active state (\(G_{\mathrm{on}}\)), switching from the inactive state to the active one with rate \(s_u\) and from the active state to the inactive one with rate \(s_b\). When the promoter is active, initiation commences via the binding of an RNAP with rate r, denoted by \(P_1\). Subsequently, the RNAP either moves from a gene segment to the neighbouring segment with rate k, or it prematurely detaches with rate d. Note that here we have made two assumptions: (i) the movement of RNAP is unidirectional, away from the promoter site and hence left to right, with no pausing or backtracking allowed; (ii) the detachment and elongation rates are independent of the position of RNAP on the gene. Each RNAP has associated with it a nascent RNA tail that grows longer as the RNAP transcribes more of the gene. When the RNAP reaches the last gene segment, termination occurs, i.e. the RNAP–nascent RNA complex gets dissociated from the gene leading to a mature RNA (M) which degrades with rate \(d_m\). Note that for simplicity, we have not considered excluded-volume interaction between adjacent RNAPs here; hence, we make the implicit assumption of low ‘traffic’, which is plausible when the initiation rate is sufficiently low. (We test the validity of this assumption through simulations below.)

Note that, while the choice of L is arbitrary, it should be kept in mind that L needs to be sufficiently large for the dynamics to be described at a fine spatial resolution. However, L also has to be small enough for the length of each gene segment to be much larger than the footprint of an RNAP; the latter is needed to ensure the validity of the low-traffic assumption. The elongation time which is the total time T from initiation to termination, that is, conditioning on those realisations for which the RNAP does not prematurely detach, is Erlang distributed with mean \(L/(k+d)\) and coefficient of variation \(1/\sqrt{L}\); see ‘Appendix A’ for a derivation and Fig. 1b, c for verification through stochastic simulation (SSA).

Note that the total number of RNAPs transcribing the gene is equal to the number of nascent RNA molecules present, irrespective of their lengths; to shed light on the fluctuations of nascent RNA, in this section we therefore focus on the calculation of statistics of local and total RNAP fluctuations. We define the vector of molecule numbers \(\vec {m}=(n_0,n_1,\dots {},n_L,n)\), and we write \(\langle n_0 \rangle \), \(\langle n_i \rangle \) (\(i=1,2,\dots {},L\)), and \(\langle n \rangle \) for the average numbers of molecules of active gene, RNAP, and mature RNA, respectively. The above model can then be conveniently described by \(L+2\) species interacting via a set of \(2L+4\) reactions with the following rate functions:

Species

Molecule numbers

Position (in \(\vec {m}\))

\(G_\mathrm{on}\)

\(n_0\)

1

\(P_i,\quad i\in {}\{1,\dots {},L\}\)

\(n_i\)

\(i+1\)

M

n

\(L+2\)

Reaction

Rate function \(f_j\)

\(G_\mathrm{on} {\mathop {\longrightarrow }\limits ^{s_b}}G_\mathrm{off}\)

\(f_1=s_b \langle n_0 \rangle \)

\(G_\mathrm{off}{\mathop {\longrightarrow }\limits ^{s_u}}G_\mathrm{on}\)

\(f_2=s_u (1 - \langle n_0 \rangle )\)

\(G_\mathrm{on} {\mathop {\longrightarrow }\limits ^{r}}G_\mathrm{on}+P_1\)

\(f_3=r\langle n_0 \rangle \)

\(P_i {\mathop {\longrightarrow }\limits ^{k}}P_{i+1}\),   \(i\in {}\{1,\dots {},L-1\}\)

\(f_{i+3}=k \langle n_i \rangle \)

\(P_L {\mathop {\longrightarrow }\limits ^{k}}M\)

\(f_{L+3}=k \langle n_L \rangle \)

\(P_i {\mathop {\longrightarrow }\limits ^{d}}\varnothing \),    \(i\in {}\{1,\dots {},L\}\)

\(f_{i+L+3}=d \langle n_i \rangle \)

\(M {\mathop {\longrightarrow }\limits ^{d_m}}\varnothing \)

\(f_{2L+4}=d_m \langle n \rangle \)

Note that \(G_\mathrm{off}\) is not an independent species; the reason is that the binary state of the gene implies a conservation law, with the sum of the numbers of \(G_\mathrm{on}\) and \(G_\mathrm{off}\) equalling 1. Hence, the number of independent species in the model is \(L+2\). The rate functions \(f_j\) are the averaged propensities from the underlying chemical master equation (CME); note that, because our reaction network is composed of first-order reactions, these rate functions also equal the reaction rates in the corresponding deterministic rate equations. The description of our model is completed by the \((L+2)\times {}(2L+4)\)-dimensional stoichiometric matrix \(\mathbf {S}\); the element \(\mathbf {S}_{ij}\) of \(\mathbf {S}\) gives the net change in the number of molecules of the ith species when the jth reaction occurs. Given the ordering of species and reactions as described in the tables above, it follows that the matrix \(\mathbf {S}\) has the simple form

$$\begin{aligned} \begin{aligned} \mathbf {S}_{11}&=-1,&\qquad {} \mathbf {S}_{12}&=1,\\ \mathbf {S}_{i,i+1}&=1,&\qquad {} \mathbf {S}_{i,i+2}&=-1,&\qquad {} \mathbf {S}_{i,i+L+2}&=-1, \\ \mathbf {S}_{L+2,L+3}&=1,&\qquad {} \mathbf {S}_{L+2,2L+4}&=-1, \end{aligned} \end{aligned}$$
(1)

where \(i=2,\dots {},L+1\).

2.2 Closed-Form Expressions for Moments of Mature RNA and Local RNAP

In this subsection, we outline the derivation of the steady-state means and variances of local RNAP fluctuations (on each gene segment), as well as of mature RNA. Our results are summarised in the following two propositions.

Proposition 1

Let \(\eta = s_u/(s_u + s_b)\) be the fraction of time the gene spends in the active state, let \(\rho _k = r/k\) be the mean number of RNAPs binding to the promoter site in the time it takes for a single RNAP to move from one gene segment to the next, let \(\rho = r/d_m\) be the mean number of RNAPs binding to the promoter site in the time it takes for a mature RNA to decay, and let \(\mu = k/(k+d)\) be the probability that an RNAP molecule moves to the next gene segment rather than detaching prematurely. Then, the steady-state mean numbers of molecules of active gene, RNAP, and mature RNA are given by

$$\begin{aligned} \langle n_0 \rangle&= \eta , \end{aligned}$$
(2a)
$$\begin{aligned} \langle n_{i} \rangle&= \eta \rho _k \mu ^{i} \qquad \text {for }i=1,\dots {},L, \end{aligned}$$
(2b)
$$\begin{aligned} \langle n\rangle&= \eta \rho \mu ^{L}, \end{aligned}$$
(2c)

respectively.

Proposition 1 can be proved in a straightforward fashion, as follows. Using the underlying CME, one can show from the corresponding moment equations (Warren et al. 2006) that the time evolution of the vector \(\vec {\langle m \rangle }\) of mean molecule numbers in a system of zeroth-order or first-order reactions, i.e. with propensities that are linear in the number of molecules, is given by the time derivative \(\hbox {d}\vec {\langle m \rangle }/\hbox {d}t = \mathbf {S}\cdot \vec {f}\). Given the form of the stoichiometric matrix \(\mathbf {S}\) and of the rate functions \(f_j\), as described in Sect. 2.1, it follows that the mean numbers of all species in steady state can be obtained by solving the following system of \(L+2\) algebraic equations:

$$\begin{aligned} \begin{aligned} 0&= s_u (1 - \langle n_0 \rangle ) - s_b \langle n_0 \rangle , \\ 0&= r \langle n_0 \rangle - (k+d) \langle n_1 \rangle , \\ 0&= k \langle n_{i-1} \rangle - (k + d) \langle n_{i} \rangle \qquad \mathrm{for} \ i=2,\dots {},L, \\ 0&= k \langle n_L \rangle - d_m \langle n\rangle . \end{aligned} \end{aligned}$$
(3)

These equations can easily be solved simultaneously to yield the steady-state value of \(\vec {\langle m \rangle }\), as given in Eq. (2).

Proposition 2

Let \(\tau _p = 1/(d+k)\), \(\tau _g = 1/(s_u+s_b)\), and \(\tau _m = 1/d_m\) be the timescales of fluctuations of RNAP, gene, and mature RNA, respectively, and define the three new parameters

$$\begin{aligned} \alpha =\frac{1}{1+\tau _p/\tau _g}, \quad \gamma =\frac{1}{1+\tau _p/\tau _m}, \quad \text {and} \quad \theta =\frac{1}{1+\tau _m/\tau _g}. \end{aligned}$$

Furthermore, let \(\beta =s_b/s_u\) denote the ratio of gene inactivation and activation rates. Then, the variances and covariances of molecule number fluctuations of active gene, RNAP, and mature RNA are given by

$$\begin{aligned} \mathrm{Var}(n_0)&=\langle {}n_0\rangle {}^2\beta {}, \end{aligned}$$
(4a)
$$\begin{aligned} \mathrm{Cov}(n_0,n_i)&= \langle {}n_0\rangle {}\langle {}n_i\rangle {}\alpha \beta {}\cdot {}f_{1i},&\quad {}&\text {where }f_{1i}=\alpha ^{i-1}; \end{aligned}$$
(4b)
$$\begin{aligned} \mathrm{Cov}(n_0,n)&= \langle {}n_0\rangle {}\langle {}n\rangle {}\alpha \beta {}\cdot {}f_{1M},&\quad {}&\text {where }f_{1M}=\theta {}\alpha ^{L-1}, \end{aligned}$$
(4c)
$$\begin{aligned} \mathrm{Cov}(n_i,n_j)&=\delta _{ij}\langle {}n_i\rangle {}+ \langle {}n_i\rangle {}\langle {}n_j\rangle {}\alpha \beta {}\cdot {}f_{ij},&\quad {}&\text {where }f_{ij}=f(i,j)+f(j,i), \end{aligned}$$
(4d)
$$\begin{aligned} \mathrm{Cov}(n_i,n)&= \langle {}n_i\rangle {}\langle {}n\rangle {}\alpha \beta {}\cdot {}f_{iM},&\quad {}&\text {where }f_{iM}=\gamma {}^{i}\theta \alpha ^{L-1}+(1-\gamma {})\sum \limits _{q=1}^{i}\gamma {}^{i-q}{f}_{qL}, \end{aligned}$$
(4e)
$$\begin{aligned} \mathrm{Var}(n,n)&=\langle {}n\rangle {}+ \langle {}n\rangle {}^2\alpha \beta {}\cdot {}f_{MM},&\quad {}&\text {where }f_{MM}=f_{LM}, \end{aligned}$$
(4f)

and where \(i,j=1,\dots {},L\). Here, \(\delta _{ij}\) is the Kronecker delta; moreover,

$$\begin{aligned} {f}(i,j) =\dfrac{\alpha ^{i+j-1}}{(2\alpha -1)^{i}} +\dfrac{1}{2^{i+j-1}}\left( {\begin{array}{c}i+j-1\\ i\end{array}}\right) \Big [ 1-\dfrac{2\alpha -1}{2\alpha }{}_2F_1\big (1,i+j;j;\tfrac{1}{2\alpha }\big ) \Big ], \end{aligned}$$

where \({}_2F_1\) denotes the generalised hypergeometric function of the second kind (Digital Library of Mathematical Functions 2020a), which is defined as

$$\begin{aligned} {}_2F_1(a_1,a_2;b_1;z)=\sum _{s=0}^{\infty }\dfrac{(a_1)_s(a_2)_s}{(b_1)_s}\dfrac{z^s}{s!}, \end{aligned}$$

with \((a)_s=\Gamma (a+s)/\Gamma (a)\) the Pochhammer symbol.

Here, we note that an alternative representation of the functions \(f_{ij}\) in Eq. (4d), in terms of finite sums, is given in Eq. (B.33) of ‘Appendix B’.

As above, since the underlying propensities are linear in the number of molecules, the CME implies (Warren et al. 2006) that the corresponding second moments in steady state are exactly given by a Lyapunov equation. That equation, which is precisely the same as the one that is obtained from the linear-noise approximation (LNA) (Elf and Ehrenberg 2003), takes the form

$$\begin{aligned} \mathbf {J}\cdot {}\mathbf {C}+\mathbf {C}\cdot {}\mathbf {J}^{T}+\mathbf {D}=\mathbf {0}. \end{aligned}$$
(5)

Here, \(\mathbf {C}\), \(\mathbf {J}\), and \(\mathbf {D}\) are \((L+2)\times {}(L+2)\)-dimensional matrices; \(\mathbf {C}\) is a variance–covariance matrix that is symmetric (\(\mathbf {C}_{ij}=\mathbf {C}_{ji}\)), \(\mathbf {J}\) is the Jacobian matrix with elements \(\mathbf {J}_{ij}=\partial (\mathbf {S}\cdot {}\vec {f})_{i}/\partial {}\langle n_j\rangle \), and \(\mathbf {D}=\mathbf {S}\cdot {}\mathbf {Diag}(\vec {f})\cdot {}\mathbf {S}^{T}\) is a diffusion matrix, where \(\mathbf {Diag}(\vec {f})\) is a diagonal matrix whose elements are the entries in the rate function vector \(\vec {f}\). The nonzero elements of \(\mathbf {J}\) are given by

$$\begin{aligned} \begin{aligned} \mathbf {J}_{11}&=-(s_u+s_b),\\ \mathbf {J}_{21}&=r,&\qquad {} \mathbf {J}_{22}&=-(k+d),\\ \mathbf {J}_{i,i-1}&=k,&\qquad {} \mathbf {J}_{ii}&=-(k+d) \qquad \text {for }i=3,\dots {},L+1,\\ \mathbf {J}_{L+2,L+1}&=k,&\qquad {} \mathbf {J}_{L+2,L+2}&=-d_m,\\ \end{aligned} \end{aligned}$$
(6)

while the nonzero elements \(\mathbf {D}_{i}\) read

$$\begin{aligned} \begin{aligned} \mathbf {D}_{11}&=s_b \langle n_0 \rangle +s_u(1-\langle n_0 \rangle ),\\ \mathbf {D}_{22}&=r \langle n_0 \rangle +(k+d)\langle n_1 \rangle ,&\mathbf {D}_{23}&=-k\langle n_1 \rangle ,&\\ \mathbf {D}_{i,i-1}&=-k\langle n_{i-2} \rangle ,&\mathbf {D}_{ii}&=k\langle n_{i-1}\rangle +(k+d)\langle n_{i} \rangle&\&\text {for }i=3,\dots {},L+1,\\ \mathbf {D}_{i,i+1}&=-k\langle n_{i-1} \rangle&&\&\text {for }i=3,\dots {},L,\\ \mathbf {D}_{L+2,L+1}&=-k\langle n_{L} \rangle ,&\mathbf {D}_{L+2,L+2}&=k\langle n_{L} \rangle +d_m\langle n \rangle .&\end{aligned} \nonumber \\ \end{aligned}$$
(7)

Given the structure of the matrices \(\mathbf {J}\) and \(\mathbf {D}\) above, the Lyapunov Eq. (5) can be solved explicitly for the covariance matrix \(\mathbf {C}\) whose elements are given by Eq. (4). The solution by induction is involved and can be found in ‘Appendix B’, which proves Proposition 2.

2.2.1 Simplification in Bursty and Constitutive Limits

Bursty limit: We now consider a particular parameter regime—the limit of large initiation rate r and large gene inactivation rate \(s_b\) such that \(b=r/s_b\) is constant. Since the fraction of time spent in the active state is \(\eta \), it follows that the gene is mostly in the inactive state in that limit. During the short periods of time when it transitions to the active state, a burst of initiation events occur; in particular, a mean number b of RNAPs bind to the promoter during activation. Hence, such genes are often termed bursty, since transcription proceeds via sporadic bursts of activity and b is called the mean transcriptional burst size. For r and \(s_b\) large with b constant, the expressions for the first two moments of RNAP at every gene segment and of mature RNA from Eqs. (2) and (4), respectively, simplify to

$$\begin{aligned} \langle n_i\rangle _b&=b\upsilon _k\mu {}^{i}, \end{aligned}$$
(8a)
$$\begin{aligned} \langle n\rangle _b&=b\upsilon _m\mu {}^{L},&\end{aligned}$$
(8b)
$$\begin{aligned} \mathrm{Cov}(n_i,n_j)_b&=\delta _{ij}\langle {}n_i\rangle {}_b+\langle {}n_i\rangle {}_b\langle {}n_j\rangle {}_b(\upsilon _k\mu )^{-1}\cdot {}{h}_{ij},&\text {where }h_{ij}=\dfrac{1}{2^{i+j-2}}\dfrac{\Gamma (i+j-1)}{\Gamma (i)\Gamma (j)}, \end{aligned}$$
(8c)
$$\begin{aligned} \mathrm{Cov}(n_i,n)_b&=\langle {}n_i\rangle {}_b\langle {}n\rangle {}_b(\upsilon _k\mu )^{-1}\cdot {}{h}_{iM},&\text {where }h_{iM}=(1-\gamma {})\sum _{q=1}^{i}\gamma {}^{i-q}\cdot {}{h}_{qL} \end{aligned}$$
(8d)
$$\begin{aligned} \mathrm{Var}(n)_b&=\langle {}n\rangle {}_b+\langle {}n\rangle {}^2_b(\upsilon _k\mu )^{-1}\cdot {}{h}_{MM},&\text {where }h_{MM}=h_{LM}; \end{aligned}$$
(8e)

here, the subscript b denotes the moments in the bursty limit. Moreover, \(\upsilon _k=s_u/k\), \(\upsilon _m=s_u/d_m\), and \({h}_{ij}={f}_{ij}|_{\alpha \rightarrow {}0}\) denotes the simplified function \({f}_{ij}\) in the limit of \(\alpha {\longrightarrow {}}0\), which is achieved when \(s_b\rightarrow \infty \). We note that the above expressions for the functions \(h_{ij}\) are derived from the expressions for \(f_{ij}\) that are given in Eq. (B.33), rather than from those in Eq. (4d). The reason is that, in the bursty limit, we have that \(\frac{1}{2\alpha }\rightarrow {}\infty \), in which case the identity in Eq. (B.36) does not hold. The bursty limit in Eq. (B.33) is simply taken by collecting terms that are not dependent on \(\alpha \), since \(\alpha \longrightarrow 0\) in that limit.

To test the accuracy of our theory, in Fig. 2 we compare our analytical expressions for the mean of local RNAP numbers, as well as for various measures of local RNAP fluctuations—the coefficient of variation \(\mathrm {CV}\), the Fano factor \(\mathrm {FF}\), and the Pearson correlation coefficient \(\mathrm {CC}\)—with those calculated from stochastic simulation using Gillespie’s algorithm (SSA) (Gillespie 1977). Simulations are performed for two different scenarios: (i) without volume exclusion, where the footprint of RNAPs is not taken into account; and (ii) with volume exclusion, where RNAPs are treated as solid objects with a footprint of 35 bp, which is the value reported in Md Zulfikar et al. (2020). For our simulations in Fig. 2, we use parameter values characteristic for the gene PDR5 of length 3070 bp, as reported in Zenklusen et al. (2008). Our choice of \(L=30\) implies that the length of each gene segment is about 100 bp and, hence, that at most 3 RNAPs can fit in each segment when volume exclusion is taken into account. In this case, Gillespie’s algorithm is modified such that the initiation and RNAP ‘hopping’ rates are proportional to the available volume in the gene segment which the RNAP is moving to. That is achieved by rescaling the transcription initiation rate as \(r\mapsto {}r(1-n_1/3)\) and the RNAP hopping rate from the ith to the \((i+1)\)th gene segment as \(k\mapsto {}k(1-n_{i+1}/3)\). Since we use parameters measured for a gene that demonstrates bursty expression (\(\mathrm {PDR}5\)) (Zenklusen et al. 2008), we test the accuracy of both the exact theory from Eqs. (2) and (4) and the approximate expressions given in Eq. (8).

The perfect agreement between our exact theory (solid lines) and simulation without volume exclusion (dots) provides a numerical validation of that theory. Our approximate theory (dashed lines) also yields a reasonably good approximation; the mismatch can be decreased if the degree of burstiness is increased, i.e. by increasing the parameters r and \(s_b\) relative to the other rates in the model. We also note that the theory is in good agreement with simulation with volume exclusion (open circles), which shows that the ‘low traffic’ assumption upon which our theory is based is valid.

The following interesting observations can be made from these figures: (i) if the rate of premature detachment is greater than zero, then the mean of local RNAP decreases monotonically with the distance i from the promoter according to a power law, whereas that mean is constant along the gene if there is no premature detachment, as expected; (ii) the size of RNAP fluctuations, as measured by \(\mathrm{CV}\), decreases with i for small premature detachment rates, but increases with i for sufficiently large values of the detachment rate; (iii) the Fano factor approaches 1—the value of \(\mathrm{FF}\) for a Poissonian distribution—as i increases, which is due to the dispersal of the burst as stochastic elongation proceeds; (iv) the correlation coefficient between the local RNAP on two neighbouring gene segments decreases monotonically with i, which is exacerbated by premature detachment and is a direct result of the stochasticity inherent in the elongation process.

The observation in (iii) can be explained in detail as follows. When the detachment rate is zero, a burst of RNAPs rapidly bind to the promoter, leading to large fluctuations near that site; however, thereafter each RNAP moves distinctly from all others due to stochastic elongation. Hence, the burst is gradually dispersed as elongation proceeds, which implies a decrease in the variance of fluctuations with increasing i. When the detachment rate is nonzero, then the same effect is at play; however, the increase in the variance of fluctuations along the gene is now counteracted by the decrease in mean RNAP numbers, which leads to two types of behaviour: for small i, \(\mathrm{CV}\) decreases with i, since the variance dominates over the mean, while for large i, the opposite occurs and \(\mathrm{CV}\) increases with i.

Fig. 2
figure 2

(Color Figure Online) First and second moments of the distribution of local RNAP for the PDR5 gene in yeast, which demonstrates bursty expression. In ad, we show the dependence of the mean, coefficient of variation squared, Fano factor, and Pearson correlation coefficient, respectively, of local RNAP fluctuations on gene segment i, as predicted by our exact theory (Eqs. (2), (2); solid lines), the approximate theory in the bursty limit (Eq. (8); dashed lines), and simulation via Gillespie’s stochastic simulation algorithm (SSA), respectively. We performed simulations for two different cases: without volume exclusion (dots) and with volume exclusion (open circles). The parameters are fixed to \(s_u\)=0.44/min, \(s_b\)=4.7/min, and r=6.7/min, which are characteristic of the PDR5 gene in yeast, as reported in Supplemental Table 2 of Zenklusen et al. (2008). The number of gene segments is arbitrarily chosen to be \(L=30\). The total elongation time \(\langle {}T\rangle =4.5\) min is also reported for PDR5, described as the synthesis time and denoted by \(\tau \) in Zenklusen et al. (2008). The elongation rate by definition takes the value of the ratio \(k = L/\langle {}T\rangle {} - d \approx L/\langle {}T\rangle {}\), since \(d \ll k\). The detachment rate d is arbitrarily chosen to be \(d = 0.01\)/min (red lines and dots) or \(d = 0.2\)/min (black lines and dots). Note that, for the SSA, moments are calculated from one long trajectory with a few million time points, sampled at unit intervals (Color figure online)

Constitutive limit: The other common parameter regime is that of constitutive gene expression, where the gene spends most of its time in the active state and transcription is continuous, which corresponds to the limit of very small \(s_b\). In that limit, the expressions from Eqs. (2) and (4) simplify to

$$\begin{aligned} \begin{aligned} \langle n_i\rangle _c =\mathrm{Var}(n_i)_c=\rho _k\mu {}^{i} \qquad {}{} \text {and} \qquad {}{} \langle n\rangle _c&=\mathrm{Var}(n)_c=b\rho \mu {}^{L}, \end{aligned} \end{aligned}$$
(9)

while the covariances \(\mathrm{Cov}(n_i,n_j)_c\) and \(\mathrm{Cov}(n_i,n)_c\) between the species are zero; here, the subscript c denotes the constitutive limit. This drastic simplification reflects the fact that, in the constitutive limit, the distributions of mature RNA and local RNAP are Poissonian: as the regulatory network is effectively given by \(\emptyset \rightarrow P_1 \rightarrow P_2 \rightarrow ... \rightarrow P_L \rightarrow M \rightarrow \emptyset \) then, the result follows directly from the exact solution provided in Jahnke and Huisinga (2007).

To further test the accuracy of our theory, in Fig. 3 we compare our analytical expressions for the mean of local RNAP numbers, as well as for various measures of local RNAP fluctuations, with those calculated from stochastic simulation using Gillespie’s algorithm, where we use parameters measured for a gene that demonstrates constitutive expression (\(\mathrm {DOA}1\)) (Zenklusen et al. 2008). As before, we test the accuracy of both the exact theory given by Eqs. (2) and (4) and the approximate expressions from Eq. (9). Unsurprisingly, we observe agreement between exact theory (solid lines) and simulation (dots); the mismatch between our approximate theory and simulation is due to the fact that the gene does not spend 100% of its time in the active state—the true constitutive limit—but, rather, \(s_u/(s_u + s_b)\approx 85\%\). The local mean RNAP number decreases with distance from the promoter, as was the case for bursty expression in the previous subsubsection, which is to be expected. The various measures which depend on the second moments are, however, considerably different: \(\mathrm {CV}\) increases monotonically with i, independently of the rate of premature detachment, while \(\mathrm {FF}\) and \(\mathrm {CC}\) are very close to 1 and zero, respectively; moreover, the latter two measures practically show very little variation along the gene. The lack of transcriptional bursting explains all these effects in a straightforward fashion.

Finally, we remark that the accuracy of our expressions for the mean and variance of mature RNA, as given in Eq. (2) and (4), is verified by simulation (SSA) in Fig. 4a, b for parameters typical of the bursty \(\mathrm {PDR}5\) gene. The meaning of the dependence of descriptive statistics on L is discussed in the next section.

Fig. 3
figure 3

(Color Figure Online) First and second moments of the distribution of local RNAP for the DOA1 gene in yeast, which demonstrates constitutive expression. In ad, we show the dependence of the mean, coefficient of variation squared, Fano factor, and Pearson correlation coefficient, respectively, of local RNAP fluctuations on gene segment i, as predicted by our exact theory (Eqs. (2) and (2); solid lines), the approximate theory in the constitutive limit (Eq. (9); dashed lines), and simulation via Gillespie’s stochastic simulation algorithm (SSA; dots), respectively. The parameters are fixed to \(s_u=0.7\)/min, \(s_b=0.12\)/min and \(r=0.14\)/min, which are characteristic of the DOA1 gene in yeast, as reported in Supplemental Table 2 of Zenklusen et al. (2008). The number of gene segments is arbitrarily chosen to be \(L=30\). The total elongation time \(\langle {}T\rangle =2.9\) min is also reported for DOA1, described as the synthesis time and denoted by \(\tau \) in Zenklusen et al. (2008). The elongation rate by definition takes the value of the ratio \(k = L/\langle {}T\rangle {} - d \approx L/\langle {}T\rangle {}\), since \(d \ll k\). The detachment rate d is arbitrarily chosen to be \(d = 0.01\)/min (red lines and dots) or \(d = 0.2\)/min (black lines and dots). Note that, for the SSA, moments are calculated from one long trajectory with a few billion time points, sampled at unit intervals (Color figure online)

2.3 Closed-Form Expressions for Moments of Total RNAP

While local RNAP fluctuations are measurable in experiment, as discussed in the Introduction, measurements of total RNAP on a gene are typically reported. Hence, in this section, we briefly discuss descriptive statistics of total RNAP fluctuations.

Recalling that \(n_i\) is the number of RNAP molecules on the ith gene segment, the total number of RNAPs on the gene—arbitrarily divided into L segments—is given by \(n_\mathrm{tot}=\sum _{i=1}^{L}n_i\). Given Eq. (2) and (4), the steady-state mean \(\langle n_\mathrm{tot} \rangle =\sum _{i=1}^{L}\langle n_i\rangle \) and the steady-state variance \(\mathrm{Var}(n_\mathrm{tot})=\sum _{i,j=1}^{L}\mathrm{Cov}(n_i,n_j)\) of the total RNAP distribution are given by

$$\begin{aligned} \begin{aligned} \langle n_\mathrm{tot}\rangle =\eta {}\rho _k\mu \dfrac{\mu ^L-1}{\mu -1} \quad \text {and}\quad \mathrm{Var}(n_\mathrm{tot}) =\langle {}n_\mathrm{tot}\rangle {}+\alpha \beta (\eta {}\rho {}_k)^2\sum _{i,j=1}^{L}\mu ^{i+j}\cdot {}{f}_{ij}. \end{aligned}\nonumber \\ \end{aligned}$$
(10)
Fig. 4
figure 4

Mean and variance of the distributions of mature RNA and total RNAP for the PDR5 gene in yeast. In a, b, we show the dependence of the moments of mature RNA fluctuations on the number of gene segments L, as predicted by our theory (Eqs. (2) and (2); solid lines) and SSA (dots). In c, d, we show the dependence of the moments of total RNAP on L, as predicted by our exact theory (Eq. (10); solid lines) and SSA (dots). The parameters \(s_u\), \(s_b\), r, and \(\langle {}T\rangle \) are characteristic of the PDR5 gene and are the same as in Fig. 2. The premature detachment rate is chosen to be \(d=0.01\)/min; the elongation rate is then given by \(k\approx L/\langle {}T\rangle \). The degradation rate of mature RNA is \(d_m=0.04\)/min, which is chosen such that the mean mature RNA is roughly consistent with that reported in Fig. 6(b) of Zenklusen et al. (2008). Note that, for the SSA, moments are calculated from one long trajectory with a few billion time points, sampled at unit intervals

For a detailed derivation of the variance in Eq. (10), we refer to ‘Appendix C’. These expressions for the mean and variance of the total RNAP distribution simplify in the bursty and constitutive limits, as can be seen in ‘Appendix D’. The accuracy of Eq. (10) is tested by comparing against stochastic simulation with SSA in Fig. 4c, d. Both mean and variance are seen to increase monotonically with the number of gene segments L, as we keep the mean elongation time constant; the mean shows very little dependence on L, while the dependence of the variance is more pronounced. We recall that, while the parameter L is arbitrary in principle, it actually determines the size of fluctuations in the elongation time. Since that time is the sum of L independent exponential variables with mean \(1/(k+d)\) each, it follows that the distribution of the elongation time T is Erlang with mean \(\langle T \rangle =L/(k+d)\) and coefficient of variation squared equal to 1/L. Hence, the larger L is, the narrower is the distribution of T and the more deterministic is elongation itself. Thus, Fig. 4c, d predicts that the mean and variance of total RNAP increase rapidly with decreasing fluctuations in the elongation time T. It hence follows that models in which the elongation rate is assumed to be exponentially distributed (Cao and Grima 2020), which correspond to the case where \(L = 1\) in our model, underestimate the size of nascent RNA fluctuations.

2.4 Special Case of Deterministic Elongation

Next, we derive expressions for the descriptive statistics of total RNAP and mature RNA in the limit of large L taken at constant mean elongation time, which corresponds to deterministic elongation. As is shown in Fig. 4, these statistics converge quickly to the ones obtained in the large-L limit; hence, the resulting limiting expressions are likely to be useful across a variety of genes.

Moments of total RNAP distribution: We define the non-dimensional parameters \(\updelta _g=\tau _g/\tau _d\), \(T_g=\langle {}T\rangle {}/\tau _g\), and \(T_d=\langle {}T\rangle {}/\tau _d\), which correspond to the ratio of the gene timescale and the polymerase detachment timescale, the ratio of the mean elongation time and the gene timescale, and the ratio of the mean elongation time and the polymerase detachment timescale, respectively; here, \(\tau _d=1/d\), as before. Substituting \(k\mapsto {}L/\langle {}{}T\rangle {}{}-d\) into Eq. (10) and taking the limit of deterministic elongation, i.e. letting \(L\rightarrow {}\infty \) at constant \(\langle T\rangle \), we obtain the following expressions for the mean, variance, and \(\mathrm{CV}^2\) of total RNAP:

$$\begin{aligned} \begin{aligned} \langle {}n_\mathrm{tot}\rangle {}_{\infty }&= \eta \dfrac{r}{d}(1-\hbox {e}^{-T_d}) , \\ \mathrm{Var}(n_\mathrm{tot})_{\infty }&=\langle {}n_\mathrm{tot}\rangle {}_{\infty } +\langle {}n_\mathrm{tot}\rangle {}_{\infty }^2\cdot {} \beta {}\updelta _g \dfrac{(\updelta _g-1)+(\updelta _g+1)\hbox {e}^{-2T_d}-2\updelta _g\hbox {e}^{-T_g}\hbox {e}^{-T_d}}{(\updelta _g-1)(\updelta _g+1)(1-\hbox {e}^{-T_d})^2} ,\\ \mathrm{CV}^2(n_\mathrm{tot})_{\infty }&=\langle {}n_\mathrm{tot}\rangle {}_{\infty }^{-1} + \beta {}\updelta _g \dfrac{(\updelta _g-1)+(\updelta _g+1)\hbox {e}^{-2T_d}-2\updelta _g\hbox {e}^{-T_g}\hbox {e}^{-T_d}}{(\updelta _g-1)(\updelta _g+1)(1-\hbox {e}^{-T_d})^2}. \end{aligned} \nonumber \\ \end{aligned}$$
(11)

Here, the subscript \(\infty \) denotes the limit of \(L\rightarrow {}\infty \). A detailed derivation of the variance in Eq. (11) can be found in Lemma C.1 of ‘Appendix C’.

In the special case when RNAP does not prematurely detach from the gene, i.e. for \(d=0\), the expressions in Eq. (11) simplify to

$$\begin{aligned} \begin{aligned} \langle {}n_\mathrm{tot}\rangle {}_{(\infty ;0)}&=\eta {}r\langle {}T\rangle {}, \\ \mathrm{Var}(n_\mathrm{tot})_{(\infty ;0)}&=\langle {}n_\mathrm{tot}\rangle {}_{(\infty ;0)}+ \langle {}n_\mathrm{tot}\rangle {}_{(\infty ;0)}^2\cdot {} 2\beta {}T_g^{-1}\big (1-T_g^{-1}+T_g^{-1}\hbox {e}^{-T_g}\big ) ,\\ \mathrm{CV}^2_{(\infty ;0)}&=\langle {}n_\mathrm{tot}\rangle {}_{(\infty ;0)}^{-1}+ 2\beta {}T_g^{-1}\big (1-T_g^{-1}+T_g^{-1}\hbox {e}^{-T_g}\big ), \end{aligned} \end{aligned}$$
(12)

where the subscript \((\infty ;0)\) denotes the limit of \((L,d)\rightarrow {}(\infty ,0)\). The expressions in Eq. (12) have been previously reported in Choubey et al. (2015), where they were derived using queuing theory. Hence, our expressions in Eq. (11) constitute a generalisation of known results, by further taking into account premature detachment of RNAP from the gene.

Equation (12) shows that the coefficient of variation squared of total RNAP, denoted by \(\mathrm{CV}^2_{(\infty ;0)}\), can be written as the sum of two terms: (i) the inverse of the mean which is expected if the distribution of total RNAP is Poissonian, and (ii) a term that increases with increasing \(\beta \) and decreasing \(T_g\). Hence, the latter term provides a measure for the deviation of the total RNAP distribution from a Poissonian. In particular, it shows that the deviation is significant in genes for which (i) the fraction of time spent in the inactive state is large (large \(\beta \)), and (ii) the elongation time is much shorter than the switching time between the active and inactive states (small \(T_g\)).

Moments of mature RNA distribution: Similarly, in the limit of deterministic elongation, it is straightforward to show that the expressions for the mean and variance of the distribution of mature RNA given by Eqs. (2) and (4) reduce to

$$\begin{aligned} \begin{aligned} \langle {} n \rangle {}_{\infty }&=\eta \rho \hbox {e}^{-T_d}&\quad {}\text {and}\quad \mathrm{Var}(n)_{\infty }&=\langle {} n \rangle {}_{\infty }+\langle {} n \rangle {}_{\infty }^2\cdot {}\beta \theta . \end{aligned} \end{aligned}$$
(13)

These expressions can be further simplified in the special case of no premature detachment to read

$$\begin{aligned} \begin{aligned} \langle {} n \rangle {}_{(\infty ;0)}=\eta \rho \quad \text {and}\quad \mathrm{Var}(n)_{(\infty ;0)}&=\langle {} n \rangle {}_{(\infty ;0)}+\langle {} n \rangle {}_{(\infty ;0)}^2\cdot {}\beta \theta . \end{aligned} \end{aligned}$$
(14)

Note that the mean and variance are precisely the same as would be obtained from the telegraph model, for which the corresponding Fano factor in the bursty limit is given by Eq. (16) below. Hence, we anticipate that, in the limit of no premature detachment and deterministic elongation, the distribution of mature RNA from our transcription model is the same as the distribution obtained from the coarser telegraph model. A formal proof of that claim will be given in Sect. 3.

Relationship between Fano factors of total RNAP and mature RNA: Specifying to the case of no premature detachment, it is interesting to note that in the bursty limit, i.e. for \(r,s_b \rightarrow \infty \) at constant mean burst size \(b=r/s_b\) in Eq. (12), the Fano factor of total RNAP is given by

$$\begin{aligned} {\mathrm{FF}_n}_{(b;\infty ;0)} = 1 + 2b; \end{aligned}$$
(15)

see also Eq. (D.3) in ‘Appendix D’. Here, the subscript n denotes nascent RNA (total RNAP). Eq. (15) is in contrast to the Fano factor of mature RNA in the same bursty limit:

$$\begin{aligned} {\mathrm{FF}_m}_{(b;\infty ;0)} = 1 + b, \end{aligned}$$
(16)

see Eq. (D.8) in ‘Appendix D’, where the subscript m denotes mature RNA. (Note that \({\mathrm{FF}_m}_{(b;\infty ;0)}\) also equals the Fano factor of the telegraph model in the same bursty limit (Raj et al. 2006).) Hence, by comparing Eqs. (15) and (16), we can deduce the following for bursty expression: (i) if the telegraph model is used to estimate the mean transcriptional burst size from total RNAP data where the elongation time is deterministic, then the mean burst size will be overestimated by a factor of two—in other words, the implicit assumption that the elongation time is exponentially distributed is inadequate; (ii) fluctuations in total RNAP (nascent RNA) deviate more from Poisson statistics, for which the Fano factor equals one, than fluctuations in mature RNA.

Fig. 5
figure 5

Comparison between the Fano factors of nascent and mature RNA. Contour plot showing the variation of \(\Xi \)—a measure of the difference between the two Fano factors which is defined in Eq. (18)—with the non-dimensional parameters \(T_g\) and \(T_m\) which denote the ratio of the mean elongation time to the timescales of promoter switching and decay of mature RNA, respectively. As can be appreciated from Eq. (17), \(\Xi \) is positive if the Fano factor of nascent RNA is larger than that of mature RNA and negative if the reverse is true. The line \(T_m \approx 1 - \frac{5}{8} T_g\), where \(\Xi = 0\), shows where the two Fano factors are identical

More generally, if we do not enforce the bursty limit, then we find the following relationship between the Fano factors of total RNAP and mature RNA, which are calculated from Eqs. (12) and (14), respectively:

$$\begin{aligned} \begin{aligned} \frac{{\mathrm{FF}}_{n(\infty ;0)}}{{\mathrm{FF}}_{m(\infty ;0)}} = 1 + \frac{\hbox {e}^{-T_g} T_r T_{s_b} \Xi }{T_g^2 \big [T_r T_{s_b}+T_g(T_g+T_m)\big ]}. \end{aligned} \end{aligned}$$
(17)

Here,

$$\begin{aligned} \begin{aligned} \Xi = 2 (T_g+T_m)+\hbox {e}^{T_g} [2 (T_g-1) T_m+(T_g-2)T_g], \end{aligned} \end{aligned}$$
(18)

while \(T_g = (s_u + s_b)\langle T \rangle \), \(T_r = r\langle T \rangle \), \(T_m = d_m \langle T \rangle \), and \(T_{s_b} = s_b \langle T \rangle \) are non-dimensional parameters representing the ratio of the mean elongation time to the timescales of promoter switching, initiation, decay of mature RNA, and gene deactivation, respectively. From Eq. (17), we deduce that \(\mathrm {FF}_{n(\infty ;0)}>\mathrm {FF}_{m(\infty ;0)}\) if and only if \(\Xi >0\). From the contour plot of \(\Xi \) in Fig. 5, one can deduce that

$$\begin{aligned} \Xi > 0\quad \text {if and only if}\quad T_m > rapprox 1 - \frac{5}{8} T_g. \end{aligned}$$
(19)

Hence, the Fano factor of nascent RNA is larger than that of mature RNA if and only if the above (approximate) condition is satisfied. In the bursty limit, \(T_g \rightarrow \infty \) due to \(s_b \rightarrow \infty \) which, together with \(T_m > 0\), implies that Eq. (19) holds; the condition is also satisfied if promoter switching is very fast compared to elongation. By contrast, if \(T_m < 1\) and \(T_g < 1\), then it is possible to have the opposite scenario where the Fano factor of mature RNA is larger than that of nascent RNA, which occurs, for example, if promoter switching and mature RNA decay are very slow compared to elongation.

Sensitivity of coefficient of variation of total RNAP and mature RNA: Since we have found explicit expressions for the first two moments of the distributions of total RNAP and of mature RNA, we can now estimate the sensitivity of the noise in each of those to small perturbations in the transcriptional parameters. Specifically, we calculate the logarithmic sensitivity (LS), which is also known as the relativity sensitivity, of the coefficient of variation (CV) to a parameter s, which is defined as \(\Lambda {}_{s}=(s/{\mathrm{CV}})(\partial {}{\mathrm{CV}}/\partial {}s)\). (That definition implies that a \(1\%\) change in the value of the parameter s results in a change of \(\Lambda _{s}\%\) in \(\mathrm{CV}\).)

In Table 1b, we report the logarithmic sensitivity of the coefficient of variation of total RNAP fluctuations, which is obtained from Eq. (12), to perturbations in the parameters \(s_u\), \(s_b\), r, and \(\langle {}T\rangle {}\). Similarly, in Table 1c, we report the logarithmic sensitivity of the coefficient of variation of mature RNA fluctuations from Eq. (14) to perturbations in the parameters \(s_u\), \(s_b\), r, and \(d_m\). In both cases, these sensitivities are calculated for parameter values estimated for five genes in yeast, as reported in Zenklusen et al. (2008); see Table 1a.

The following observations can be made regarding the sensitivity of the noise in total RNAP fluctuations: (i) for the two genes PDR5 and POL1 which spend most of their time in the inactive state due to \(s_b \gg s_u\), \(\mathrm {CV}\) is most sensitive to changes in the parameters \(s_u\) and \(\langle {}T\rangle {}\); (ii) for the genes DOA1, MDN1, and KAP104 which spend most of their time in the active state due to \(s_u \gg s_b\), \(\mathrm {CV}\) is most sensitive to changes in the parameters r and \(\langle {}T\rangle {}\); (iii) the size of mature RNA fluctuations is found to be most sensitive to perturbations in \(s_u\) and \(d_m\) for PDR5 and POL1, and to perturbations in r and \(d_m\) for the other three genes. We furthermore note that for both total RNAP and mature RNA, r is the least sensitive parameter for the genes which are mostly inactive, whereas it is among the most sensitive parameters for genes that are mostly active.

Table 1 Logarithmic sensitivity (LS) of the coefficient of variation CV of total RNAP and mature RNA fluctuations for five genes in yeast; see Sect. 2.4 for a discussion. (a) Parameter values from Supplemental Tables 2 and 4 in Zenklusen et al. (2008). The degradation rate \(d_m\) of mature mRNA is estimated from the reported mean number of mature RNA, the parameters \(s_u\), \(s_b\), r, and Eq. (14) for the mean. (b) Logarithmic sensitivity of CV of total RNAP fluctuations. (c) Logarithmic sensitivity of CV of mature mRNA fluctuations. The most sensitive parameter and the next most sensitive one are marked in dark bold and italic, respectively

3 Approximate Distributions of Total RNAP and Mature RNA

Thus far, we have derived expressions for the first two moments of the distributions of total RNAP and mature RNA. Naturally, it would also be useful to derive closed-form expressions for the distributions themselves; such a derivation is, however, analytically intractable in general (Jahnke and Huisinga 2007) due to the presence of the catalytic reaction \(G_{\mathrm{on}} \rightarrow G_{\mathrm{on}} + P_1\), which models initiation of the transcription process. Still, there are two special cases where analytical distributions are known: (i) when the elongation time is considered to be fixed, which corresponds to our model with \(L \rightarrow \infty \) at constant \(\langle T \rangle \) (Heng et al. 2016; ii) when the elongation time is exponentially distributed, corresponding to our model with \(L=1\), in which case the distribution of total RNAP is identical to the one which is derived from the telegraph model (Peccoud and Ycart 1995; Raj et al. 2006). While one may argue that the analytical distribution of RNAP for deterministic elongation times may well approximate the stochastic (finite-L) case, the issue remains that the exact solution is not given in terms of simple functions unless promoter switching is slow compared to initiation, elongation, and termination, in which case the solution reduces to a weighted sum of two Poisson distributions (Heng et al. 2016). Hence, it is generally very difficult to apply in practice, such as to infer parameters from data using a Bayesian approach. Moreover, to our knowledge, no exact solutions are known for the distribution of mature RNA in our model. In this section, we aim to devise a simple approximation for the distribution of total RNAP numbers in terms of the Negative Binomial (NB) distribution; these simple distributions have shown great flexibility in describing complex gene expression models with a large number of parameters (Cao and Grima 2020). Finally, by means of singular perturbation theory, we will obtain the distribution of mature RNA under the assumption that RNA polymerase elongation is faster than degradation of mature RNA.

3.1 Approximation of Total RNAP Distribution

We approximate the distribution of total RNAP transcribing the gene via a Negative Binomial distribution, as follows. The mean and variance of the Negative Binomial distribution NB(qp) are given by \(pq/(1-p)\) and \(pq/(1-p)^2\), respectively. By assuming that these are equal to the exact mean and variance, respectively, of the total RNAP distribution, see Eq. (10), we obtain effective values for the parameters p and q:

$$\begin{aligned} n_\mathrm{tot}\sim \mathrm{NB}(q,p)\equiv {}\mathrm{NB}\bigg (\dfrac{\langle {}n_\mathrm{tot}\rangle {}^2}{\mathrm{Var}(n_\mathrm{tot})-\langle {}n_\mathrm{tot}\rangle {}},\dfrac{\mathrm{Var}(n_\mathrm{tot})-\langle {}n_\mathrm{tot}\rangle {}}{\mathrm{Var}(n_\mathrm{tot})} \bigg ). \end{aligned}$$
(20)

In Fig. 6, we show a comparison between the distributions of total RNAP obtained from SSA (dots) and the Negative Binomial approximation in Eq. (20) (solid lines). Our results are presented for two different values of the number of gene segments: \(L=1\) (exponentially distributed elongation time; left column) and \(L=50\) (quasi-deterministic elongation time; right column). Additionally, we rescale our gene inactivation rate as \(s_b\mapsto {}s_b\epsilon \), and we present results for three different values of the parameter \(\epsilon \): \(10^{-3}\), the constitutive limit of the gene being mostly in the active state (top row); \(10^{-1}\), where the gene spends almost equal amounts of time in the active and inactive states, with \(s_b\approx {}s_u\) (middle row); and 1, the bursty limit, where the gene spends most of its time in the inactive state (bottom row).

We can make several observations, as follows. For both \(L = 1\) and \(L = 50\), the Negative Binomial approximation performs well for bursting and constitutive expression (top and bottom rows), whereas it is appreciably poor when expression is in between those two limits (middle row). Intuitively, this observation can be explained via the following reasoning. In the limits of the gene being mostly in the active state (constitutive expression) or the inactive state (bursty expression), the distribution of total RNAP is necessarily unimodal. However, when the gene spends a considerable amount of time in each state, the distribution is the sum of two conditional distributions which can manifest either as bimodality or as a wide unimodal distribution, neither of which can be captured by a Negative Binomial distribution. Assuming bursty expression, the Negative Binomial distribution is a more accurate approximation to the distribution obtained from SSA for \(L = 1\) than it is for \(L = 50\); the reason is that \(L = 1\) corresponds to the telegraph model (Raj et al. 2006), in which case it can be proven analytically that the distribution reduces to a Negative Binomial in the limit of bursty expression. For constitutive expression, the Negative Binomial approximation is equally good for \(L=1\) and \(L=50\), as the distribution is necessarily Poissonian then and as it is well known that a Negative Binomial distribution can approximate a Poissonian to a high degree of accuracy. In summary, our results hence indicate that Eq. (20) yields a good approximation for the total RNAP distribution of bursty and constitutively expressed genes.

We also note from Fig. 6 that the comparison between the SSA distributions for \(L = 1\) and \(L = 50\), with equal mean elongation times, highlights the importance of modelling elongation with the correct distribution of elongation times for genes that are non-constitutive, i.e. for \(\epsilon = 10^{-1}\) or \(\epsilon =1\). In particular, if the elongation time is quasi-deterministic (\(L = 50\)), there appears to be a significant increase in the probability of observing zero total RNAP transcribing the gene compared to models with an exponentially distributed elongation time (\(L = 1\)).

Fig. 6
figure 6

(Color Figure Online) Steady-state distribution of total RNAP and its approximation by a Negative Binomial distribution. We compare the approximation from Eq. (20) (blue lines) with the distribution of total RNAP obtained from stochastic simulation (SSA; red dots). With the exception of \(s_b\), the parameters are for the PDR5 gene in yeast and are hence the same as in Fig. 2, with \(d=0.01/\mathrm {min}\). Results are presented for two different values of L, corresponding to an exponentially distributed elongation time (\(L = 1\)) and a quasi-deterministic elongation time (\(L = 50\)); k is rescaled such that the two have the same mean elongation time. Additionally, we rescale the gene inactivation rate via \(s_b\mapsto {}s_b\epsilon \), where \(\epsilon = 10^{-3},10^{-1},1\), corresponding to constitutive, general, and bursty expression, respectively. (Here, general expression is neither clearly constitutive nor bursty, since the gene spends roughly equal amounts of time in the inactive and active states.) Note that \(\epsilon = 1\) results in a distribution of nascent RNA that is consistent with that measured for PDR5; the experimental data from Fig. 6(b) of Zenklusen et al. (2008) are plotted for comparison. The Negative Binomial approximation is found to be accurate in the limits of constitutive and bursty expression (top and bottom rows), independently of L

3.2 Approximation of Mature RNA Distribution

Next, we apply singular perturbation theory to formally derive the distribution of mature RNA when the elongation rate is much larger than the degradation rate of mature RNA.

We start by defining \(P_j(\vec {n};t)\) (\(j=0,1\)) as the probability of the state \(\vec {n}=(n_1,\dots {},n_L,n)\) at time t while the gene is either active (0) or inactive (1). Note that \(n_i\) is the number of RNAPs on gene segment i for \(i=1,\dots ,L\), while n is the number of mature RNAs. The time evolution of the probabilities \(P_{j}(\vec {n};t)\) can be described by a system of coupled CMEs:

$$\begin{aligned} \partial {}_{t}P{}_{0}= & {} s_uP{}_{1}-s_bP{}_{0} +r(\mathbb {E}_{n_1}^{-1}-1)P_{0} +k\sum _{i=1}^{L-1}\big (\mathbb {E}_{n_i}\mathbb {E}_{n_{i+1}}^{-1}-1\big )n_iP_{0} \\&+k\big (\mathbb {E}_{n_L}\mathbb {E}_{n}^{-1}-1\big )n_LP_{0}+d\sum _{i=1}^{L}(\mathbb {E}_{n_i}-1)n_iP_{0} +d_m(\mathbb {E}_{n}-1)nP{}_{0}, \\ \partial {}_{t}P_{1}= & {} s_bP_{0}-s_uP_{1} +k\sum _{i=1}^{L-1}\big (\mathbb {E}_{n_i}\mathbb {E}_{n_{i+1}}^{-1}-1\big )n_iP_{1} +k\big (\mathbb {E}_{n_L}\mathbb {E}_{n}^{-1}-1\big )n_LP_{1} \\&+d\sum _{i=1}^{L}(\mathbb {E}_{n_i}-1)n_iP_{1} +d_m(\mathbb {E}_{n}-1)nP_{1}, \end{aligned}$$
(21)

where \(\mathbb {E}_{n_i}^c[f(\vec {n})]=f(n_1,n_2,\dots {},n_i+c,\dots {},n_{L},n)\), with \(c\in \mathbb {Z}\), denotes the standard step operator. We assume that the elongation rate k is faster than the degradation rate \(d_m\) of mature RNA, i.e. that \(k/d_m\gg {}1\). Since \(k=L/\langle T \rangle -d\), it follows that in the limit of deterministic elongation (\(k \rightarrow \infty \)), i.e. for \(L \rightarrow \infty \) at constant mean elongation time \(\langle T \rangle \), the condition \(k/d_m\gg {}1\) is naturally satisfied.

In order to find an analytical expression for the propagator probabilities \(P(\vec {n};t)\) which satisfies the system of CMEs in Eq. (21), we define the probability-generating function as \(F=\sum _jF_j\), with \(F_{j}(\vec {z};t)=\sum _{\vec {n}=\vec {0}}^{\infty {}}P_{j}({\vec {n}};t)\vec {z}^{\vec {n}}\); here, \(\vec {z}=(z_1,\dots {},z_{L},z)\) is a vector of variables corresponding to the state \(\vec {n}\). Given the equations for \(P_{j}({\vec {n}};t)\) from Eq. (21), we obtain the following systems of PDEs for the corresponding generating functions \(F_{j}(\vec {z};t)\):

$$\begin{aligned} \begin{aligned} \mathbb {L}[F_{0}]&=s_uF_{1}-s_bF_{0}+r(z_1-1)F_{0}, \\ \mathbb {L}[F_{1}]&={}s_bF_{0}-s_uF_{1}, \end{aligned} \end{aligned}$$
(22)

where

$$\begin{aligned} \mathbb {L}=\partial {}_{t} +k\sum _{i=1}^{L-1}(z_{i}-z_{i+1})\partial {}_{z_{i}} +k(z_{L}-z)\partial {}_{z_{L}} +d\sum _{i=1}^{L}(z_i-1)\partial {}_{z_{i}} +d_m(z-1)\partial {}_{z} \nonumber \\ \end{aligned}$$
(23)

is a differential operator acting on the generating functions \(F_0\) and \(F_1\). Eq. (22) represents a system of coupled, linear, first-order partial differential equations (PDEs). Now, we introduce the new variables \(u_i=z_i-1\) (\(i=1,\dots {},L\)) and \(u=z-1\) to rewrite Eq. (22) as

$$\begin{aligned} \begin{aligned} \mathbb {L}[F_{0}]&=s_uF_{1}-s_bF_{0}+r u_1F_{0}, \\ \mathbb {L}[F_{1}]&=s_bF_{0}-s_uF_{1}; \end{aligned} \end{aligned}$$
(24)

here, the operator in Eq. (23) now takes the form

$$\begin{aligned} \mathbb {L}=\partial {}_{t} +k\sum _{i=1}^{L-1}(u_{i}-u_{i+1})\partial {}_{u_{i}} +k(u_{L}-u)\partial {}_{u_{L}} +d\sum _{i=1}^{L}u_i\partial {}_{u_{i}} +d_mu\partial {}_{u}. \end{aligned}$$
(25)

In order to find an analytical solution to Eq. (24), we rescale all rates and the time variable by the decay rate of mature RNA; then, we apply the method of characteristics, with s being the characteristic variable. The first characteristic equation gives \(d_m(\hbox {d}t/\hbox {d}s)=1\), with solution \(s\equiv {}t'=d_mt\); hence, we can use the variable \(t'\) as the independent variable and thus convert the system of PDEs in Eq. (24) into a characteristic system of ordinary differential equations (ODEs),

$$\begin{aligned} {\dot{u}}_i&=(k/d_m)[u_i-u_{i+1}+(d/k) u_i] \qquad \text {for }i=1,\dots {},L-1, \end{aligned}$$
(26a)
$$\begin{aligned} {\dot{u}}_L&=(k/d_m)[u_L-u+(d/k) u_L], \end{aligned}$$
(26b)
$$\begin{aligned} {\dot{u}}&=u, \end{aligned}$$
(26c)
$$\begin{aligned} {\dot{F}}_{0}&=(s_u/d_m)F_{1}-(s_b/d_m)F_{0}+(r/d_m) u_1F_{0}, \end{aligned}$$
(26d)
$$\begin{aligned} {\dot{F}}_{1}&=(s_b/d_m)F_{0}-(s_u/d_m)F_{1}, \end{aligned}$$
(26e)

where the overdot denotes differentiation with respect to \(t'\). The existence of an integral-form solution to Eq. (26) follows from the fact that the reaction scheme in Fig. 1 contains first-order reactions only. Under the assumption that \(k\gg {}d_m\), we define \(\varepsilon =d_m/k\); then, we apply Geometric Singular Perturbation Theory (GSPT) (Fenichel 1979; Jones 1995), with \(0<\varepsilon \ll {}1\) as the (small) singular perturbation parameter. We hence separate the system in Eq. (26) into fast and slow dynamics, which will allow us to find an asymptotic approximation for \(F_{0}\) and \(F_{1}\) in steady state. A brief introduction to GSPT can be found in ‘Appendix E’. Given the above definition of \(\varepsilon \), Eqs. (26a) and (26b), the governing equations for \(u_i\) in the ‘slow system’, become

$$\begin{aligned} \begin{aligned} \varepsilon {\dot{u}}_i&=u_i-u_{i+1}+(d/k) u_i \qquad \text {for }i=1,\dots {},L-1,\\ \varepsilon {\dot{u}}_L&=u_L-u+(d/k) u_L, \end{aligned} \end{aligned}$$
(27)

where \(u_i\) (\(i,\dots {},L\)) are the fast variables and u, \(F_{0}\), and \(F_{1}\) are the slow ones. Setting \(\varepsilon =0\) in Eq. (27), we can express the variables \(u_i\) as \(u_i=\mu \cdot {}u_{i+1}\), with \(\mu =k/(k+d)\) for \(i=1,\dots ,L\). Finally, we write the variable \(u_1\) as \(u_1=\mu {}^L\cdot {}u\). Next, given Eq. (26c), we apply the chain rule, with \(\hbox {d}t'\equiv {}\hbox {d}u\cdot {} u\), to rewrite Eqs. (26d) and (26e) as

$$\begin{aligned} F'_0d_mu&=s_uF_1-s_bF_0+r \mu ^L uF_0, \end{aligned}$$
(28a)
$$\begin{aligned} F'_1d_mu&=s_bF_0-s_uF_1, \end{aligned}$$
(28b)

where the prime now denotes differentiation with respect to u. Solving Eq. (28a) for \(F_1\) and substituting the result into Eq. (28b), we obtain the second-order ODE

$$\begin{aligned} d_m^2uF''_0+d_m(d_m+s_b+s_u-r \mu ^L u)F'_0-r \mu {}^L(d_m+s_u)F_0=0 \end{aligned}$$
(29)

for \(F_0(u)\). Eq. (29) is a confluent hypergeometric differential equation (Kummer’s equation) (Digital Library of Mathematical Functions 2020b) which admits the solution

$$\begin{aligned} F_0(u)=C\cdot {}_1F_1\Big (\dfrac{d_m+s_u}{d_m};\dfrac{d_m+s_b+s_u}{d_m};\dfrac{r}{d_m}\mu {}^L u\Big ), \end{aligned}$$
(30)

where \({}_1F_1\) denotes the confluent hypergeometric function; here, we consider only one of two independent fundamental solutions of Kummer’s differential equation, as we are seeking a solution in steady state where the variable u is bounded. The constant C in Eq. (30) is a constant of integration that is determined from the normalisation condition on the full generating function: \(F=F_0+F_1\). From Eq. (28), one finds that F satisfies

$$\begin{aligned} F'=\dfrac{r}{d_m}\mu {}^LF_0. \end{aligned}$$
(31)

Making use of Eq. (31) and applying the normalisation condition \(F|_{u=0}=1\), we find that the generating function in steady state reads

$$\begin{aligned} F(z)={}_1F_1\Big (\dfrac{s_u}{d_m};\dfrac{s_b+s_u}{d_m};\dfrac{r}{d_m} \mu {}^L(z-1)\Big ). \end{aligned}$$
(32)

The probability distribution P(n) of mature RNA can be found from the formula

$$\begin{aligned} P(n)=\frac{1}{n!}\frac{d^n}{dz^n}F(z)|_{z=0}, \end{aligned}$$

which yields the analytical expression

$$\begin{aligned} P(n)=\dfrac{1}{n!}\dfrac{(s_u)_n}{(s_b+s_u)_n}\Big (\dfrac{r}{d_m}\Big )^n(\mu ^{L})^n{}_1F_1\big (\tfrac{s_u}{d_m}+n;\tfrac{s_b+s_u}{d_m}+n;-\tfrac{r}{d_m}\mu ^L\big ), \end{aligned}$$
(33)

where \((\cdot )_n\) is the Pochhammer symbol, as before. Note that the mean and variance of mature mRNA, as calculated from the distribution in Eq. (33), agree exactly with Eqs. (2c) and (4f) in the limit of fast elongation (\(k\rightarrow {}\infty \)). Note also that the solution in Eq. (33) depends on the parameter \(\mu {}^L\), which represents the survival probability of an RNAP molecule, i.e. the probability that RNAP will not prematurely detach from the gene. Finally, we take the limit of deterministic elongation, letting \(L \rightarrow \infty \) at constant \(\langle T \rangle \), which leads to

$$\begin{aligned} P(n)=\dfrac{1}{n!}\dfrac{(s_u)_n}{(s_b+s_u)_n}\Big (\dfrac{r}{d_m}\Big )^n \hbox {e}^{-nd\langle T \rangle }{}_1F_1\big (\tfrac{s_u}{d_m}+n;\tfrac{s_b+s_u}{d_m}+n;-\tfrac{r}{d_m}\hbox {e}^{-d\langle T \rangle }\big ). \end{aligned}$$
(34)

Note that in the limit of no premature detachment (\(d = 0\)), Eq. (34) is precisely equal to the distribution of mature RNA predicted by the telegraph model, which is in wide use in the literature (Raj et al. 2006). Hence, our perturbative approach can be seen as a means to formally derive the conventional telegraph model of gene expression starting from a more fundamental and microscopic model. In Fig. 7, we verify our analytical solution with stochastic simulation for two different genes in yeast. We also note that, for nonzero premature detachment rates (\(d \ne 0\)), Eq. (34) is the steady-state solution predicted by the telegraph model, with parameter r renormalised to \(r \hbox {e}^{-d\langle T \rangle }\); that is to be expected, as the latter is the rate at which RNAPs undergo termination, leading to mature RNAs.

Fig. 7
figure 7

Steady-state distribution of mature RNA for two different genes in yeast. We compare the distribution obtained from SSA (dots) to the perturbative approximation in Eq. (33) (solid lines) for two different genes. In a, we consider the PDR5 gene, fixing the parameters as in Fig. 2: \(s_u=0.44\)/min, \(s_b=4.7\)/min, \(r=6.7\)/min, \(d=0.01\)/min, and \(\langle {}T\rangle {}=4.5\) min. The degradation rate of mature RNA takes the values \(d_m=0.04, 0.10, 0.40\)/min; note that the experimental value is \(d_m = 0.04\)/min. In b, we consider the DOA1 gene, fixing the parameters as in Fig. 3: \(s_u=0.7\)/min, \(s_b=0.12\)/min, \(r=0.14\)/min, \(d=0.01\)/min, and \(\langle {}T\rangle {}=2.9\) min. The degradation rate of mature RNA again takes the values \(d_m=0.04, 0.10, 0.40\)/min; the experimental value is \(d_m = 0.05\)/min. For both genes, the agreement between SSA and our perturbative approximation increases with \(k/d_m\), as expected, since Eq. (33) is derived under the assumption that \(k \gg d_m\). Note that the distribution is practically independent of L, since Eq. (33) depends on L only through \(\mu ^L\), which for small premature detachment rates d implies \(\mu ^L \approx 1\) for any L (Color figure online)

4 Statistics of Fluorescent Nascent RNA Signal

Thus far, we have determined the statistics of the total number of RNAP transcribing the given gene; these are also the statistics of the number of nascent RNA molecules. However, in experiments using single-molecule fluorescence in situ hybridisation [smFISH (Heng et al. 2016)], molecule numbers of nascent RNA cannot be directly determined. Rather, the experimentally measured RNA ‘abundance’ is the fluorescent signal emitted by oligonucleotide probes bound to the RNA. Since the length of the nascent RNA grows as RNAP moves away from the promoter, it follows that we must account for the increase in the fluorescent signal as elongation proceeds.

In this section, we take into account these experimental details to obtain closed-form expressions for the mean and variance of the fluorescent signal of local and total nascent RNA. We assume that the signal from nascent RNA on the ith gene segment is given by \(r_i=(\nu {}/L)in_i\) for \(i=1,\dots {},L\), where \(\nu \) is some experimental constant; the value of the parameter \((\nu {}/L)i\) is increasing with i, which models the fact that the fluorescent signal becomes stronger as RNAP moves along the gene. The formula for the mean fluorescent signal at gene segment i is then given by \(\langle {}r_i\rangle {}=(\nu {}/L)i\langle {}n_i\rangle \), where \(\langle {}n_i\rangle \) follows from Eq. (2b); the covariance of two fluorescent signals along the gene, \(r_i\) and \(r_j\) (\(i,j=1,\dots ,L\)), is given by \(\mathrm{Cov}(r_i,r_j)=(\nu /L)^2ij\mathrm{Cov}(n_i,n_j)\), where \(\mathrm{Cov}(n_i,n_j)\) is obtained from Eq. (4d). In Fig. 8a, b, we plot the mean and Fano factor of the local signal as a function of the gene segment i; note the contrast between the statistics of the fluorescent signal and the corresponding statistics of local RNAP—which is the statistics of nascent RNA—shown in Fig. 2a, c.

Similarly, denoting by \(r_\mathrm{tot}=\sum _{i=1}^{L}r_i\) the total fluorescent signal across the gene, we find the following expressions for the steady-state mean \(\langle r_\mathrm{tot}\rangle =\sum _{i=1}^{L}\langle {}r_i\rangle {}\) and the steady-state variance \(\mathrm{Var}(r_\mathrm{tot})=\sum _{i,j=1}^{L}\mathrm{Cov}(r_i,r_j)\):

$$\begin{aligned} \begin{aligned} \langle r_\mathrm{tot}\rangle&= \nu \eta \rho _k\mu \dfrac{\mu ^L[L\mu -(L+1)]+1}{L(\mu -1)^2}, \\ \mathrm{Var}(r_\mathrm{tot})&=\Big (\dfrac{\nu }{L}\Big )^2\eta \rho {}_k \sum _{i=1}^{L}i^2\mu ^{i}+ \Big (\dfrac{\nu }{L}\Big )^2\alpha \beta (\eta \rho {}_k)^2\sum _{i,j=1}^{L}ij\cdot {}\mu ^{i+j}\cdot {}{f}_{ij}. \end{aligned} \end{aligned}$$
(35)

For a detailed derivation of the variance in Eq. (35), see Eq. (F.1) in ‘Appendix F’; see also ‘Appendix G’ for the corresponding expressions in the bursty, constitutive, and deterministic elongation limits. In Fig. 8c, d, we show the mean and Fano factor of the total signal as a function of the number of gene segments (L); as above, we note the contrasting difference between the statistics of the fluorescent signal and the corresponding statistics of total RNAP—which is the statistics of total nascent RNA—shown in Fig. 4c, d.

Hence, the calculation of the statistics of the number of nascent RNAs from the raw signal intensity presents a challenge and has to be approached carefully. The expressions presented above allow for the inference of transcriptional parameters from the first two moments of the fluorescent signal by means of moment-based inference techniques (Zechner et al. 2012). Quantitative information about nascent RNA can also be obtained from electron micrograph images (El Hage et al. 2010), which avoids the challenges presented by smFISH.

Fig. 8
figure 8

First and second moments of the local and total fluorescent signal for the bursty gene PDR5 in yeast. In a, b, we show the dependence of the mean and the Fano factor of local fluorescent signal fluctuations on the gene segment i, as predicted by our exact theory (solid lines) and SSA (dots), respectively. The plots for \(CV^2(r_i)\) and \(CC(r_i,r_{i+1})\) are identical to those of \(CV^2(n_i)\) and \(CC(n_i,n_{i+1})\) in Fig. 2. The number of gene segments is arbitrarily chosen to be \(L=30\). In c, d, we show the dependence of the mean and variance of total fluorescent signal fluctuations on the number of gene segments L, as predicted by our exact theory (Eq. (35); solid lines) and SSA (dots). The parameters \(s_u\), \(s_b\), r, and \(\langle {}T\rangle {}\) are characteristic of the PDR5 gene and take the same values as in Fig. 2, as do the rates of elongation and RNAP detachment. The value of the parameter \(\nu {}\) is arbitrarily chosen to be \(\nu {}=10\)

5 Model Extension with Pausing of RNAP

Thus far, we have studied a model where RNAPs do not pause as they move along the gene. A natural extension is provided by a modified model in which RNAPs pause along the gene at random sites and elongation is characterised by three processes: forward hopping, pausing, and unpausing of RNAP. The motivation for studying this extended model, which has recently been considered via stochastic simulation in Md Zulfikar et al. (2020), is that experiments have revealed that RNAP exhibits pauses of varying duration, typically on the timescale of few seconds (Forde et al. 2002; Adelman et al. 2002).

Fig. 9
figure 9

Model of transcription that includes RNAP pausing. In a, we extend the model in Fig. 1 so that it takes into account pausing of RNAP at random segments on the gene. Pausing on gene segment i is modelled by the transition from the active state \(P_i\) to the paused state \(\bar{P}_i\) with rate \(r_p\), while the reverse (‘unpausing’) transition occurs with rate \(r_a\). Premature termination of RNAP occurs with rate \(d_a\) from the actively moving state, and with rate \(d_p\) from the paused state. In b, we show the dependence of the coefficient of variation squared (CV\(^2_T\)) of the elongation time distribution on the pausing rate (\(r_p\)), as predicted from SSA (dots) and theory (Eq. (A.7); solid lines). Results are shown for two different parameter regimes: \(D_0\equiv {}{}\{d_a=0\mathrm{/ min}=d_p\}\) (no premature polymerase detachment) and \(D_1\equiv {}{}\{d_a=0.05\mathrm{/ min}=d_p\}\) (premature polymerase detachment). The remaining parameters are fixed to \(L=50\), \(k=10\)/min, and \(r_a=0.1\)/min

5.1 Closed-Form Expressions for Moments of Local RNAP Fluctuations

We extend the model described in Fig. 1 by assuming that the RNAP on gene segment i can switch between a non-paused (actively moving) state \(P_i\) and a paused state \(\bar{P}_i\). The actively moving state \(P_i\) switches to \(\bar{P}_i\) with rate \(r_p\), while the reverse reaction occurs with rate \(r_a\). Premature detachment from the actively moving RNAP occurs with rate \(d_a\), whereas it occurs with rate \(d_p\) from the paused RNAP. The resulting extended model is illustrated in Fig. 9a. In ‘Appendix A’, we derive the mean and variance of the corresponding elongation time, which is not Erlang distributed now, as was the case for the model without pausing. Furthermore we find two interesting properties of the coefficient of variation \({\mathrm{CV}}_T^2\) of the elongation time: (i) in the limit of large L at constant mean elongation time, \({\mathrm{CV}}_T^2\) does not tend to zero, which implies that elongation is not deterministic; (ii) for small rates of premature detachment, \({\mathrm{CV}}_T^2\) is at its maximum when \(r_p \approx r_a\), i.e. when RNAP spends roughly half of its time in the paused state. See ‘Appendix A’ for details and Fig. 9b for a confirmation through stochastic simulation.

Proposition 3

Let the number of RNAP molecules in the active state \(P_i\) be denoted by \(n^a_i\), let the number of molecules in the paused state \(\bar{P}_i\) be \(n^p_i\), and let the number of molecules of mature RNA be denoted by n. Let \(\sigma =r_p/r_a\) be the ratio of the pausing and activation rates, let \(\pi _{r_a}=r_a/(r_a+d_p)\) be the probability of RNAP switching to the actively moving state from the paused state, and let \(\pi _{d_p}=d_p/(r_a+d_p)\) be the probability of premature RNAP detachment from the paused state. Furthermore, define the new parameters \({\tilde{\mu }}=k/(k+d_a+r_p\pi _{d_p})\) and \({\lambda }=\sigma \pi _{r_a}\).

Then, it follows that the steady-state mean number of RNAP molecules in the active and paused states on gene segment i (\(i=1,\dots {}L\)) is given by

$$\begin{aligned} \langle {}{n}^a_i\rangle {}= \eta {}\rho {}_k\tilde{\mu {}}^{i} \quad \text {and}\quad \langle {}{n}^p_i\rangle {} = \langle n^a_i\rangle \lambda . \end{aligned}$$
(36)

Hence, the total mean number of RNAP molecules on each gene segment i reads

$$\begin{aligned} \langle {}{n}_i\rangle {}=\langle {}{n}^{a}_i\rangle {}+\langle {}{n}^{p}_i\rangle {}=\langle {}{n}^a_i\rangle {}(1+\lambda ). \end{aligned}$$
(37)

The proof of Proposition 3 can be found in ‘Appendix H’. Note that in the limit of no pausing, i.e. for \(r_p = 0\), Eq. (37) reduces to the expression for the mean of RNAP reported in Eq. (2b).

Fig. 10
figure 10

Dependence of the steady-state probability distributions of total RNAP and mature RNA on the RNAP pausing rate \(r_p\) for two different genes in yeast. In a, b, we compare the distribution \(P({n}_\mathrm{tot})\) of the total number of RNAP molecules, as predicted by our model (solid lines), with that obtained from SSA (dots) for yeast genes PDR5 and DOA1, respectively. The model prediction involves fitting a Negative Binomial distribution with a mean and variance given by the closed-form expressions in Eqs. (41) and (42). In c, d, we compare the distribution P(n) of mature RNA, as obtained from singular perturbation theory (Eq. (43); solid lines) with the SSA (dots) for yeast genes PDR5 and DOA1, respectively. Note that for both genes, we keep all parameters fixed (including the elongation rate k) while varying the pausing rate \(r_p\) to simulate an experiment where the pausing rate can be perturbed directly. The parameters for each gene can be found in Table 1a; we furthermore used \(L = 50\) and fixed k to \(L/\langle T \rangle \), where \(\langle T \rangle \) is the mean elongation time measured experimentally and reported in Table 1a. Note that the actual mean elongation time is not fixed, as it depends on the pausing rate (\(r_p\)) via Eq. (40). The remaining parameters are fixed to \(r_a=0.1\)/min, \(d_a=0.01\)/min, and \(d_p=0.03\)/min. The value of \(d_a\) is taken from Table 1 in Rajala et al. (2010), where it is reported as the premature termination rate of polymerase in E. coli; the value of \(d_p\) was chosen to be larger than that of \(d_a\) to simulate a scenario where premature detachment is enhanced in the paused state. Note that our theory is less accurate for PDR5 than it is for DOA1, as all parameters are very small compared to the elongation rate in the latter case, hence satisfying better the assumptions behind the theory (Color figure online)

Proposition 4

Let \(\tau _{r_a}=1/r_a\) be the timescale of RNAP activation from the paused state, let \(\tau _{d_p}=1/d_p\) be the timescale of premature termination of paused RNAP, let \(\tau _p=1/(k+d_a)\) be the typical time that an actively moving RNAP spends on a gene segment, and let \(\tau _{pp}=1/(r_a+d_p)\) be the typical time spent in the paused state. Furthermore, define the new parameters \(\lambda _{r_p}=\pi _{r_p}/(1-\pi _{r_p})\), where \(\pi _{r_p}=r_p/(r_p+k+d_a)\) is the probability of the actively moving RNAP switching to the paused state, as well as

$$\begin{aligned} \begin{aligned} \omega _{r_a}=\dfrac{\pi _{r_a}\tau _g}{\pi _{r_a}\tau _{r_a}+\tau _g},\quad {} {\tilde{\alpha }}{}=\dfrac{\tau _g+\lambda _{r_p}\pi _{d_p}\tau _g}{\tau _g+\tau _p+\lambda _{r_p}\tau _g(1-\omega _{r_a})}, \quad {}\text {and }\quad \omega =\dfrac{\tau _g}{\tau _{pp}+\tau _g}. \end{aligned} \nonumber \\ \end{aligned}$$
(38)

Assume that the elongation rate is faster than the rates of RNAP pausing, activation, and premature termination, i.e. that \(k\gg {}r_a,r_p,d_a,d_p\). Then, it follows that to leading order in 1/k, asymptotic expressions for the variances and covariances of molecule number fluctuations of active and paused RNAP are given by

$$\begin{aligned} \begin{aligned}&\mathrm{Cov}({n}^a_i,{n}^a_j)=\delta _{ij} \langle {}{n}^a_i\rangle {}+\langle {}{n}^a_i\rangle {}\langle {}{n}^a_j\rangle {}{\tilde{\alpha }}{}\beta {}\cdot {}{g}^{aa}_{ij},&\qquad {}{} \text {where } {g}^{aa}_{ij}&={g}^{aa}(i,j)+{g}^{aa}(j,i), \\&\mathrm{Cov}({n}^a_i,{n}^p_j)=\langle {}{n}^a_i\rangle {}\langle {}{n}^p_j\rangle {}{\tilde{\alpha }}{}\beta {}\cdot {}{g}^{ap}_{ij},&\qquad {}{} \text {where } {g}^{ap}_{ij}&=\omega {\tilde{\alpha }}{}^{j-1}, \\&\mathrm{Cov}({n}^p_i,{n}^a_j)=\langle {}{n}^p_i\rangle {}\langle {}{n}^a_j\rangle {}{\tilde{\alpha }}{}\beta {}\cdot {}{g}^{pa}_{ij},&\qquad {}{} \text {where } {g}^{pa}_{ij}&=\omega {\tilde{\alpha }}{}^{i-1}, \\&\mathrm{Cov}({n}^p_i,{n}^p_j)=\delta _{ij} \langle {}{n}^p_i\rangle {}+\langle {}{n}^p_i\rangle {}\langle {}{n}^p_j\rangle {}{\tilde{\alpha }}{}\beta {}\cdot {}{g}^{pp}_{ij},&\qquad {}{} \text {where } {g}^{pp}_{ij}&=({g}^{ap}_{ij}+{g}^{pa}_{ij})/2; \end{aligned} \end{aligned}$$
(39)

here, \(i,j=1,2,\dots {},L\) and

$$\begin{aligned} {g}^{aa}(i,j) =\dfrac{{\tilde{\alpha }}{}^{i+j-1}}{(2{\tilde{\alpha }}{}-1)^{i}} +\dfrac{1}{2^{i+j-1}}\left( {\begin{array}{c}i+j-1\\ i\end{array}}\right) \Big [ 1-\dfrac{2{\tilde{\alpha }}{}-1}{2{\tilde{\alpha }}{}}{}_2F_1\big (1,i+j;j;\tfrac{1}{2{\tilde{\alpha }}{}}\big ) \Big ]. \end{aligned}$$

These results are proved in full in ‘Appendix H’. From ‘Appendix A’, we also have that the mean elongation time in the pausing model is given by

$$\begin{aligned} \langle T \rangle = L\frac{(r_a+d_p)^2+r_a r_p}{(r_a+d_p) [(k+d_a)(r_a+d_p)+d_p r_p]}. \end{aligned}$$
(40)

Solving Eq. (40) for the elongation rate k, we find that in the limit of \(L \rightarrow \infty \) taken at constant mean elongation time, k tends to infinity and hence is much larger than \(r_a\), \(r_p\), \(d_a\), and \(d_p\), which implies that the results of Proposition 4 hold naturally in that limit.

5.2 Approximate Distributions of Total RNAP and Mature RNA

Negative Binomial approximation of total RNAP distribution: We define the total number of RNAP molecules as \({n}_\mathrm{tot}=\sum _{i=1}^{L}{n}_i\). It then immediately follows from Eq. (37) that the mean of the total RNAP distribution in the pausing model is given by

$$\begin{aligned} \begin{aligned} \langle {n}_\mathrm{tot}\rangle =\eta {}\rho _k(1+\lambda ){\tilde{\mu }}\dfrac{{\tilde{\mu }}^L-1}{{\tilde{\mu }}-1}. \end{aligned} \end{aligned}$$
(41)

It can also be shown that the variance of total RNAP fluctuations reads

$$\begin{aligned} \mathrm{Var}(n_\mathrm{tot})=\langle {}n_\mathrm{tot}\rangle {}+(\eta {}\rho _k)^2{\tilde{\alpha }}{}\beta {}\bigg [2\sum _{i,j=1}^{L}{g}^{aa}(i,j)+ \lambda (2+\lambda )\omega {}L\dfrac{{\tilde{\alpha }}{}^L-1}{{\tilde{\alpha }}{}-1}\bigg ]; \end{aligned}$$
(42)

see ‘Appendix H’. Next, we approximate the distribution of total RNAP by a Negative Binomial distribution whose mean and variance match those just derived, i.e. we consider Eq. (20) with the mean and variance of the total RNAP distribution given by Eqs. (41) and (42) now, respectively. The resulting approximate Negative Binomial distribution is compared with the distribution obtained from SSA in Fig. 10a, b for two different yeast genes, PDR5 and DOA1. The results verify that our approximation is accurate provided the elongation rate k is significantly larger than the other parameters, as assumed in Proposition 4.

Perturbative approximation of mature RNA distribution: We can apply singular perturbation theory to formally derive the distribution of mature RNA, assuming that \(k/d_m\gg {}1\) and \(r_a/d_m\gg {}1\). Following the derivation in Sect.  3.2, we find the following analytical expression for the steady-state probability distribution of mature RNA:

$$\begin{aligned} P(n)=\dfrac{1}{n!}\dfrac{(s_u)_n}{(s_b+s_u)_n}\Big (\dfrac{r}{d_m}\Big )^n\big ({\tilde{\mu }}^{L}\big )^n{}_1F_1\Big (\dfrac{s_u}{d_m}+n;\dfrac{s_b+s_u}{d_m}+n;-\dfrac{r}{d_m}{\tilde{\mu }}^L\Big ); \end{aligned}$$
(43)

see ‘Appendix I’ for details. Note that the solution in Eq. (43) is dependent on the parameter \({\tilde{\mu }}{}^L\), which gives the probability that an RNAP molecule does not prematurely detach before termination; see ‘Appendix A’. Also, note that in the limit of zero premature termination, i.e. for \(d_a=0=d_p\), Eq. (43) is identical to the distribution of mature RNA predicted by the telegraph model. Finally, by solving Eq. (40) for k, then substituting the resulting expression into Eq. (43) and taking the long-gene limit of \(L \rightarrow \infty \) at constant \(\langle T \rangle \), we obtain that the probability distribution of mature RNA has the same functional form as in Eq. (43), albeit with

$$\begin{aligned} \lim _{L\rightarrow {}\infty }{\tilde{\mu }}^L=\hbox {e}^{-\psi \langle {}T\rangle {}}, \quad {}{} \text {where } \psi =\dfrac{d_a+r_p\pi _{d_p}}{1+\sigma \pi _{r_a}}. \end{aligned}$$
(44)

Note that Eqs. (43) and (44) equal the steady-state solution predicted by the telegraph model, with the initiation rate r renormalised to \(r{\tilde{\mu }}^L\) or \(r \hbox {e}^{-\psi \langle T \rangle }\), respectively. In Fig. 10c, d, we verify the accuracy of our analytical solution using stochastic simulation for two different genes in yeast. Note that a change in the pausing rate \(r_p\) has relatively little effect on the distribution of mature RNA, as compared to the effect on the distribution of total RNAP; cf. panels (a) and (b) of Fig. 10 in comparison with panels (c) and (d), respectively.

Table 2 Summary of main results
Table 3 Definition of parameters and functions

6 Summary and Conclusion

In this paper, we have analysed a detailed stochastic model of transcription. Our model extends previous analytical work (Choubey et al. 2015; Heng et al. 2016) by (i) taking into account salient processes, such as premature detachment and pausing of RNAP, that were previously not considered analytically; (ii) deriving explicit expressions for the mean and variance of RNAP numbers (nascent RNA) on gene segments as well as on the entire gene; (iii) deriving explicit expressions for the mean and variance of the fluorescent nascent RNA signal obtained from smFISH and identifying differences between the statistics thereof and those of direct measurements of nascent RNA; and (iv) finding approximate distributions of total nascent RNA fluctuations on a gene, without assuming slow promoter switching. A number of interesting observations from our work include the following:

  1. (i)

    When the premature detachment rate of RNAP is nonzero and gene expression is bursty, the coefficient of variation of local RNAP fluctuations can either decrease or increase with distance from the promoter. By contrast, when expression is constitutive, the coefficient of variation increases monotonically with distance from the promoter. Other statistical measures such as the mean, Fano factor, and correlation coefficient of local RNAP numbers decrease monotonically with distance from the promoter.

  2. (ii)

    In the limits of bursty expression, deterministic elongation, and no premature detachment or pausing, the Fano factor of total nascent RNA equals \(1 + 2b\), whereas that of mature RNA is \(1 + b\), where b denotes the mean burst size. An implication is that the telegraph model will result in an overestimate of the mean burst size from nascent RNA data by a factor of 2. Another implication is that deviations from Poisson fluctuations are more apparent in data for nascent RNA than they are for mature RNA. One can further state the following relationship: the Fano factor of nascent RNA equals twice the Fano factor of mature RNA, minus 1. If expression is non-bursty, then the Fano factor of nascent RNA can be larger or smaller than that of mature RNA, as determined by the condition in Eq. (19).

  3. (iii)

    For genes characterised by bursty expression, the sensitivity of the noise in total RNAP fluctuations is highest to perturbations in the gene activation rate and the mean elongation time; for constitutive genes, the most sensitive parameters are the initiation rate and the mean elongation time.

  4. (iv)

    A Negative Binomial distribution, parameterised with the expressions for the mean and variance of total nascent RNA derived here, provides a good approximation to the true distribution of total nascent RNA fluctuations on a gene when expression is either bursty or constitutive; the approximation is not accurate when the gene spends roughly equal amounts of time in the active and inactive states. We show that the distribution of nascent RNA is highly sensitive to the distribution of elongation times. In particular, if the elongation time is assumed to be exponentially distributed, as is implicitly assumed by telegraph models of nascent RNA, then the probability of observing zero RNA is much lower than if the elongation time is assumed to be fixed.

  5. (v)

    Using geometric singular perturbation theory (GSPT), we have rigorously proven that, in the limit of deterministic elongation (or fast elongation), no pausing and premature detachment, the steady-state distribution of mature RNA in our model is identical to that in the telegraph model (Raj et al. 2006). Consideration of pausing and premature detachment leads to a distribution that can also be obtained from a telegraph model with appropriately renormalised parameters.

A summary of the main theoretical results can be found in Table 2, with all requisite parameters and functions defined in Table 3. The main limiting assumption of our theoretical approach is that the initiation rate is slow enough such that RNAP molecules do not frequently collide with each other while moving along the gene. Hence, the expressions we have derived are reasonable for all but the strongest promoters which are characterised by very fast initiation rates. We anticipate that approximate closed-form expressions for the corresponding moments can also be derived when volume exclusion between RNAPs is taken into account by a modification of methods previously devised to understand molecular movement and kinetics in crowded conditions (Cianci et al. 2016; Smith et al. 2017). It is also possible to extend our model by including translation of mature RNA to protein; one can then again apply GSPT to derive distributions for protein numbers in the limit of RNA decaying much faster than protein; however, given item (v) above, we anticipate that the resulting protein distribution will be very similar to those derived from models that do not explicitly take into account nascent RNA (Shahrezaei and Swain 2008; Popović et al. 2016). Further research is required to develop simple approximations of the nascent RNA distribution that are accurate independently of the ratio of gene switching rates. Finally, given the strong recent interest in the development of statistical inference techniques in molecular biology (Gorin et al. 2020; Zechner et al. 2012; Kaan Öcal et al. 2019), we expect that our closed-form expressions for the moments and distributions of nascent and mature RNA will be useful for developing computationally efficient and accurate methods for estimating transcriptional parameters.