We propose a Markov chain approximation of the delayed stochastic simulation algorithm to infer properties of the mechanisms in prokaryote transcription from the dynamics of RNA levels. We model transcription using the delayed stochastic modelling strategy and realistic parameter values for the rates of transcription initiation and RNA degradation. From the model, we generate time series of RNA levels at the single molecule level and use the method to infer from them the duration of the promoter open complex formation. This is found to be possible even when external Gaussian noise is added to the RNA levels.
Gene expression dynamics are influenced by even small fluctuations in the levels of various molecular species, such as RNA polymerases and transcription factors. In some cases, even the presence of a single molecule can cause phenotypic switching. This makes cellular metabolism inherently stochastic.
The stochasticity in the abundance of a substance is generally regarded as noise that obscures a signal carrying information relevant to the cell. However, recent evidence suggests that cells may be able to use the noise component to benefit their survival. For this reason, several modelling strategies have been proposed for accurately accounting for noise in the dynamics of gene regulatory networks (GRNs) [2, 4–7].
The chemical master equation is a probabilistic description of the dynamics of interacting molecules that fully captures the stochasticity of their kinetics. However, it is intractable to solve in the biologically relevant cases.
The stochastic simulation algorithm (SSA) is a Monte Carlo simulation of the chemical master equation, allowing the study of complex models of gene expression. In the SSA, all chemical reactions are assumed to be instantaneous. However, several processes during the transcription and translation of a gene are highly complex, either involving many molecular species or involving reactions that are not bimolecular (e.g., the promoter open complex formation). To account for the effects of these events on the dynamics of RNA and proteins, the delayed SSA (DSSA) was proposed. The ability of the DSSA to model chemical reactions with noninstantaneous events makes it a good tool for modelling GRNs.
Assessing a model's accuracy and validity is important. Even if experimental data have been used in model building, one must also be able to quantitatively rank the models based on the data. This ranking can be used to determine realistic parameter values, if these have not been measured directly, and to choose between models. As single molecule measurements of gene expression are becoming available, even the most detailed stochastic models can now be ranked.
Inference methods have been proposed to assess stochastic models of gene expression based on the SSA [11, 12]. Such methods are still lacking for the DSSA. Here, we present a method that, while requiring additional developments for analyzing complex gene networks, can be used to determine underlying features of single gene expression when simulated by the DSSA.
One feature in gene expression that has been proposed to influence noise in RNA and protein levels is the promoter open complex formation. We use the proposed method to determine the duration of the promoter open complex formation from the dynamics of RNA levels of a delayed stochastic model of transcription.
2.1. Stochastic and Delayed Stochastic Simulation Algorithms
The Stochastic Simulation Algorithm (SSA) is a Monte Carlo simulation of the chemical master equation and, thus, is an exact procedure for numerically simulating the time evolution of a well-stirred reacting system. Each chemical species quantity is treated as an independent variable, and each reaction is executed explicitly. Time is advanced by stepping from one reaction event to the next. At each step, the number of molecules of each affected species is updated according to the reaction formula.
For each reaction $\mu$, the stochastic rate constant, $c_\mu$, depends on the reactive radii of the molecules involved in the reaction and their relative velocities. The velocities depend on the temperature and molecular masses. After setting the initial species populations, $X_i$, the SSA calculates the propensities $a_\mu = c_\mu h_\mu$ for all possible reactions, where $h_\mu$ is the number of distinct molecular reactant combinations available at a given moment. Then, it generates two random numbers: $\tau$, the time until the next reaction occurs, and $\mu$, the reaction to occur. The probability density for the pair $(\tau, \mu)$ is $P(\tau, \mu) = a_\mu \exp(-a_0 \tau)$, with $a_0 = \sum_j a_j$. Finally, the system time is increased by $\tau$, and the quantities $X_i$ are adjusted to account for the occurrence of reaction $\mu$, assuming it to be an instantaneous reaction. This process is repeated until no more reactions can occur or for a defined time interval.
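The step described above can be sketched in a few lines of Python. This is a minimal illustration of the direct-method SSA, not the simulator used in this work; the function names `ssa_step` and `simulate_decay` and the toy decay system are assumptions introduced here for the example.

```python
import math
import random


def ssa_step(propensities, rng=random):
    """Draw (tau, mu) for one step of the direct-method SSA.

    propensities[mu] is a_mu >= 0. Returns None if no reaction can occur.
    """
    a0 = sum(propensities)
    if a0 <= 0.0:
        return None
    # Time to the next reaction: exponential with rate a0.
    tau = -math.log(1.0 - rng.random()) / a0
    # Reaction index: mu is chosen with probability a_mu / a0.
    threshold = rng.random() * a0
    cumulative = 0.0
    chosen = None
    for mu, a in enumerate(propensities):
        if a <= 0.0:
            continue
        cumulative += a
        chosen = mu
        if threshold < cumulative:
            break
    return tau, chosen


def simulate_decay(n0, c, t_end, seed=0):
    """Toy system X -> (nothing) with propensity c * X; returns X at t_end."""
    rng = random.Random(seed)
    t, x = 0.0, n0
    while True:
        step = ssa_step([c * x], rng)
        if step is None or t + step[0] > t_end:
            return x
        t += step[0]
        x -= 1
```

Sampling the waiting time from the total propensity and then the reaction index in proportion to each propensity is equivalent to sampling the pair jointly, which is why the two draws can be made independently.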
Several steps in gene expression, such as transcript assembly, are time consuming. Such complex processes involve many reactions and events that cannot be modelled as uni- or bimolecular reaction events. To account for these events, the "delayed SSA" was proposed. It uses a "waitlist" to store delayed output events. Multidelayed reactions are represented as $A \rightarrow B + C(\tau_1) + D(\tau_2)$. In this reaction, $B$ is instantaneously produced and $C$ and $D$ are placed on a waitlist until they are released, after $\tau_1$ and $\tau_2$ seconds, respectively.
The delayed SSA proceeds as follows.
(1) Set $t \leftarrow 0$ and $t_{\mathrm{stop}} \leftarrow$ stop time, set the initial number of molecules and reactions, and create an empty waitlist $L$. Go to step (2).
(2) Generate an SSA step for reacting events to get the next reacting event $R_1$ and the corresponding occurrence time $t_1$. Go to step (3).
(3) Compare $t_1$ with the least time in $L$, $\tau_{\min}$. If $t_1 < \tau_{\min}$ or $L$ is empty, set $t \leftarrow t + t_1$. Update the number of molecules by performing $R_1$, adding to $L$ both any delayed products and the time delay for which they have to stay in $L$. This time can be chosen from a defined distribution. Go to step (4).
(4) If $L$ is not empty and if $t_1 \geq \tau_{\min}$, set $t \leftarrow t + \tau_{\min}$. Update the number of molecules and $L$, by releasing the first element in $L$; otherwise go to step (5).
(5) If $t < t_{\mathrm{stop}}$, go to step (2); otherwise stop.
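The steps above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the implementation used in this work: the waitlist is kept as a min-heap of absolute release times (rather than remaining delays), and the names `delayed_ssa` and `pick_reaction`, as well as the way reactions are encoded, are hypothetical choices made for the example.

```python
import heapq
import math
import random


def pick_reaction(props, u):
    """Index mu chosen with probability props[mu] / sum(props); u in [0, 1)."""
    threshold, cumulative, chosen = u * sum(props), 0.0, None
    for mu, a in enumerate(props):
        if a > 0.0:
            cumulative += a
            chosen = mu
            if threshold < cumulative:
                break
    return chosen


def delayed_ssa(state, reactions, t_stop, seed=0):
    """Sketch of a delayed SSA with the waitlist stored as a min-heap.

    reactions: list of (propensity_fn, instant, delayed) where
      propensity_fn(state) -> propensity of the reaction,
      instant: {species: change applied when the reaction fires},
      delayed: [(species, amount, delay_fn)], delay_fn(rng) -> seconds.
    Returns a list of (time, state) pairs, one per event.
    """
    rng = random.Random(seed)
    state = dict(state)
    t = 0.0
    waitlist = []  # entries: (absolute release time, species, amount)
    history = [(t, dict(state))]
    while True:
        # Step (2): draw the time of the next reacting event.
        props = [fn(state) for fn, _, _ in reactions]
        a0 = sum(props)
        t_react = t - math.log(1.0 - rng.random()) / a0 if a0 > 0 else math.inf
        t_release = waitlist[0][0] if waitlist else math.inf
        if min(t_react, t_release) > t_stop:
            return history  # step (5): no further event before t_stop
        if t_react < t_release:
            # Step (3): perform the reaction and store its delayed products.
            t = t_react
            _, instant, delayed = reactions[pick_reaction(props, rng.random())]
            for species, change in instant.items():
                state[species] = state.get(species, 0) + change
            for species, amount, delay_fn in delayed:
                heapq.heappush(waitlist, (t + delay_fn(rng), species, amount))
        else:
            # Step (4): release the earliest element of the waitlist.
            t, species, amount = heapq.heappop(waitlist)
            state[species] = state.get(species, 0) + amount
        history.append((t, dict(state)))
```

When a release happens first, the pending reaction time is discarded and a new one is drawn from the updated state, which is valid because the waiting times are exponentially distributed and therefore memoryless.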
2.2. Delayed Stochastic Model of Transcription
A delayed stochastic model of transcription that includes the promoter open complex formation was proposed by Ribeiro et al. This model was shown to match the dynamics of transcription at the single RNA molecule level.
Our model is identical, except that it does not include an explicit representation of the RNA polymerase. This simplification is valid when the number of RNA polymerases does not vary significantly over time in the cell, which is likely to be the case in normal conditions in E. coli (Reaction (1)):
$$\mathrm{Pro} \xrightarrow{k_t} \mathrm{Pro}(\tau_P) + \mathrm{RNA}(\tau_R). \qquad (1)$$
In Reaction (1), $\mathrm{Pro}$ (set to 1 at the beginning of the simulation) is the promoter region of the gene, while $k_t$ is the stochastic rate constant of transcription initiation and its value is set to . This value assumes that the number of RNA polymerases available for transcription is always 40 and that the binding affinity between the RNA polymerase and the transcription start site equals the one measured for the lac promoter. The promoter delay, $\tau_P$, is set to 40 s, in agreement with measurements for the lac promoter. Also, RNA stands for a fully transcribed RNA molecule, and $\tau_R$ is the time that it takes for the transcription process to be completed, once initiated. This delay accounts for the promoter open complex formation (40 s), transcription elongation (mean value 60 s), and termination. Its value is randomly generated from a Gaussian distribution with a mean of 102 s and a standard deviation of 14 s. These values assume a lac promoter and a gene 2445 nucleotides long [16, 18].
Note that while Reaction (1) has a rate of $k_t$, each activation cycle includes the open complex formation delay of $\tau_P$ seconds, making the effective mean cycle duration equal to $k_t^{-1} + \tau_P$.
$$\mathrm{RNA} \xrightarrow{k_d} \emptyset. \qquad (2)$$
Reaction (2) models RNA degradation. $k_d$ is the rate of degradation and is set to $1/600\ \mathrm{s}^{-1}$ (10 min mean lifetime), which is within realistic parameter values for E. coli.
Figure 1 shows, as examples, levels of RNA molecules produced by independent simulations. Each simulation ran for 6000 s, of which the data from the last 3000 s were used as "steady state" data.
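As a rough illustration of how such time series could be generated, the sketch below simulates transcription with a promoter delay and RNA degradation, and samples the RNA level every 30 s. It is not the simulator used in this work. In particular, the value `K_T = 0.05` is a hypothetical placeholder for the transcription initiation rate constant, and the function name `simulate_rna` and the waitlist encoding are assumptions made for the example.

```python
import heapq
import math
import random

K_T = 0.05             # HYPOTHETICAL initiation rate constant (1/s), placeholder
K_D = 1.0 / 600.0      # degradation rate constant (1/s): 10 min mean lifetime
TAU_P = 40.0           # promoter delay (s)
TAU_R = (102.0, 14.0)  # mean and standard deviation (s) of the RNA delay


def simulate_rna(t_end=6000.0, dt=30.0, tau_p=TAU_P, k_t=K_T, k_d=K_D, seed=0):
    """Return RNA counts sampled every dt seconds from t = 0 to t_end."""
    rng = random.Random(seed)
    t, pro, rna = 0.0, 1, 0
    waitlist = []  # entries: (absolute release time, species)
    samples, next_sample = [], 0.0
    while True:
        a_init, a_deg = k_t * pro, k_d * rna
        a0 = a_init + a_deg
        t_react = t - math.log(1.0 - rng.random()) / a0 if a0 > 0 else math.inf
        t_release = waitlist[0][0] if waitlist else math.inf
        t_event = min(t_react, t_release)
        # The state is constant until t_event, so record the pending samples.
        while next_sample < t_event and next_sample <= t_end:
            samples.append(rna)
            next_sample += dt
        if t_event > t_end:
            return samples
        t = t_event
        if t_release <= t_react:
            # Release the earliest delayed product (promoter or RNA).
            _, species = heapq.heappop(waitlist)
            if species == "Pro":
                pro += 1
            else:
                rna += 1
        elif rng.random() * a0 < a_init:
            # Transcription initiation: both products go to the waitlist.
            pro -= 1
            heapq.heappush(waitlist, (t + tau_p, "Pro"))
            delay = max(0.0, rng.gauss(*TAU_R))
            heapq.heappush(waitlist, (t + delay, "RNA"))
        else:
            rna -= 1  # degradation of one RNA molecule
```

With the default arguments the function returns 201 samples; taking `simulate_rna()[-100:]` keeps 100 time points spanning 2970 s from the second half of the run.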
2.3. Approximative Inference
The system is approximated as a Markov chain with stationary distribution $\pi$ and transition matrix $P$. As we are only considering steady state conditions, $\pi$ and $P$ can be built by thoroughly sampling ( samples) from the simulated model. To compensate for the sampling error, both $\pi$ and $P$ are "smeared out" with a kernel of . For example, if the raw sampling yields , then after the smearing , , .
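One way to build and smooth these two objects from sampled time series is sketched below. The kernel `(0.25, 0.5, 0.25)`, the choice to smooth each row of the transition matrix along the destination state, and the function names `smear` and `estimate_chain` are all assumptions made for illustration; they are not taken from the method as originally specified.

```python
def smear(weights, kernel=(0.25, 0.5, 0.25)):
    """Spread each entry over its neighbours, then renormalise to sum 1.

    The default kernel is a HYPOTHETICAL choice used only for illustration.
    """
    half = len(kernel) // 2
    out = [0.0] * len(weights)
    for i, w in enumerate(weights):
        for k, kw in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(out):
                out[j] += w * kw
    total = sum(out)
    return [v / total for v in out] if total > 0 else out


def estimate_chain(series_list, n_states, kernel=(0.25, 0.5, 0.25)):
    """Estimate (pi, P) from integer-valued series sampled at a fixed interval.

    pi[x] approximates the stationary probability of RNA level x, and
    P[a][b] the probability of moving from level a to level b in one interval.
    Rows of P for levels that were never visited are left as all zeros.
    """
    level_counts = [0] * n_states
    pair_counts = [[0] * n_states for _ in range(n_states)]
    for series in series_list:
        for x in series:
            level_counts[x] += 1
        for a, b in zip(series, series[1:]):
            pair_counts[a][b] += 1
    pi = smear(level_counts, kernel)
    P = [smear(row, kernel) for row in pair_counts]
    return pi, P
```

Smoothing assigns a small nonzero probability to levels and transitions adjacent to observed ones, which reduces the chance that a finite sample yields a zero probability for an event that is possible in the model.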
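The log likelihood of the parameter $\tau_P$ (the promoter delay), given a time series $x_1, \ldots, x_n$, can then be computed by
$$\log L(\tau_P \mid x_1, \ldots, x_n) = \log \pi(x_1) + \sum_{i=2}^{n} \log P(x_{i-1}, x_i),$$
where $x_i$ is the RNA level at time $t_i$, and the stationary distribution $\pi$ and the transition matrix $P$ are those built for the given value of $\tau_P$.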
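A minimal sketch of this computation for a first-order Markov chain is given below. The function names and the small probability floor used to avoid taking the logarithm of zero are assumptions introduced for the example; independent time series are combined by summing their log likelihoods.

```python
import math


def log_likelihood(series, pi, P, floor=1e-12):
    """Log likelihood of one integer-valued series under the chain (pi, P).

    floor is an ASSUMED safeguard against log(0) for unseen transitions.
    """
    ll = math.log(max(pi[series[0]], floor))
    for a, b in zip(series, series[1:]):
        ll += math.log(max(P[a][b], floor))
    return ll


def total_log_likelihood(series_list, pi, P, floor=1e-12):
    """Independent series contribute additively to the log likelihood."""
    return sum(log_likelihood(s, pi, P, floor) for s in series_list)
```

In an inference run, `pi` and `P` would be rebuilt for each candidate value of the parameter, and the function evaluated on the same observed series each time.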
The likelihood term is evaluated at suitable points over the full range of possible values of the parameter, ranging from zero to the maximum determined by dividing the mean RNA lifetime by the mean RNA level (in our case study, this ratio is around 60). Due to the approximation of the stationary distribution and the transition matrix, the likelihood term will be nonsmooth and cannot be used as such. Instead, a quadratic polynomial is fitted to the point samples. The quadratic fit was chosen because, when applied to the log likelihood, it gives a likelihood proportional to a truncated normal distribution. Similar to the application of Bayes' theorem with a flat, noninformative prior, the likelihood is converted to a probability distribution by normalizing it to unit probability.
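The fitting and normalization can be sketched as follows, assuming the quadratic is fitted to log likelihood values by ordinary least squares and the result is normalized on a uniform grid. The function names, the use of the normal equations, and the grid-based normalization are illustrative assumptions, not a description of the original implementation.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit y ~ a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]              # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal-equation matrix for the unknowns (c, b, a).
    A = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    c, b, a = (A[i][3] / A[i][i] for i in range(3))
    return a, b, c


def posterior_on_grid(xs, loglik, grid):
    """Density on a uniform grid from a quadratic fit to the log likelihood.

    A flat prior is assumed, so the density is the normalised likelihood.
    """
    a, b, c = fit_quadratic(xs, loglik)
    logp = [a * g * g + b * g + c for g in grid]
    peak = max(logp)
    weights = [math.exp(v - peak) for v in logp]  # subtract peak for stability
    step = grid[1] - grid[0]
    norm = sum(weights) * step
    return [w / norm for w in weights]
```

Because the exponential of a concave quadratic is a Gaussian, restricting it to the evaluated range yields a truncated normal density whose mode is the point estimate of the parameter.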
2.4. Error Model
To simulate measurement error, normally distributed noise with zero mean and a standard deviation of 0.5 was added to the simulated time series used for inference. Any resulting negative values were set to zero.
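This error model amounts to a few lines of code. The sketch below is illustrative; the function name `add_measurement_noise` and the optional seed argument are assumptions made for the example.

```python
import random


def add_measurement_noise(series, sd=0.5, seed=None):
    """Add zero-mean Gaussian noise to each sample and zero any negatives."""
    rng = random.Random(seed)
    return [max(0.0, x + rng.gauss(0.0, sd)) for x in series]
```

Note that the noisy values are real numbers rather than molecule counts, so a scheme for mapping them back to the states of the Markov chain (for example rounding) would be needed before evaluating the likelihood; that choice is not specified here.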
In all simulations we set the sample interval to 30 s, as this is currently the shortest interval possible in real measurements of RNA numbers at the single molecule level. The inference was made using these point samples.
We applied the method to sample sizes of 10, 100, and 1000 independent time series of length 2970 s (100 time points). As no external noise sources are applied to these data, we refer to them as "noiseless" data. Results are shown in Figures 2, 3, and 4, respectively. As the sample size increases, the inference of the true duration of the promoter open complex formation improves.
Interestingly, these results show that, even with a small sample size of 10, the method can detect that the duration of the promoter open complex formation measurably affects the dynamics of RNA levels, as previously shown by comparing numerical simulations with a null model.
We now test the robustness of the method to experimental measurement error. For this, we add Gaussian noise to the previous time series, as described in the Methods section, and refer to the result as "noisy" data. Results of the inference, using 10, 100, and 1000 time series, are shown in Figures 5, 6, and 7, respectively. As the results show, the accuracy of the method is not significantly affected when the standard deviation of the external noise is in the range 0 to 0.5. If the noise level in the data is increased beyond this, the results become biased.
Finally, we note that, when using 1000 time series for the inference procedure, the method takes 15 min to complete on a contemporary personal computer.
We tested a method for inferring, from time series data, kinetic parameters affecting the dynamics of RNA levels subject to degradation. When inferring the duration of the promoter open complex formation, we showed that, for known values of the RNA degradation rate, the method is accurate and fast. When a reasonable amount of noise is added to the data, the performance is not significantly affected.
The inference was shown to be possible when considering only one previous sample point, by approximating the system with a time-homogeneous Markov chain. This is especially relevant as, in E. coli, most mean RNA levels range from 1 to a few, implying that the system may have very little memory of far past events.
While experimentally challenging, it is already possible to collect time series of RNA levels of living cells close to the accuracy assumed by the model. This can be done using a technique that is based on the ability of the MS2d-GFP protein complex to bind to a target RNA. This system has some limitations, such as the need to maintain a weak transcription rate so as to distinguish individual RNA molecules.
While the proposed approximative method is still far from an analytical likelihood, it can serve as a crude statistical tool to analyze experimental time series data. In the future, we aim to extend this method to infer other kinetic parameters associated with the dynamics of RNA and protein levels in prokaryotes. We will also apply this method to determine, from real measurements of RNA levels, whether these are influenced by currently unknown processes.
Choi PJ, Cai L, Frieda K, Xie XS: A stochastic single-molecule event triggers phenotype switching of a bacterial cell. Science 2008,322(5900):442-446. 10.1126/science.1161427
McAdams HH, Arkin A: It's a noisy business! Genetic regulation at the nanomolar scale. Trends in Genetics 1999,15(2):65-69. 10.1016/S0168-9525(98)01659-X
Kærn M, Elston TC, Blake WJ, Collins JJ: Stochasticity in gene expression: from theories to phenotypes. Nature Reviews Genetics 2005,6(6):451-464. 10.1038/nrg1615
Bratsun D, Volfson D, Tsimring LS, Hasty J: Delay-induced stochastic oscillations in gene regulation. Proceedings of the National Academy of Sciences of the United States of America 2005,102(41):14593-14598. 10.1073/pnas.0503858102
Roussel MR, Zhu R: Validation of an algorithm for delay stochastic simulation of transcription and translation in prokaryotic gene expression. Physical Biology 2006,3(4):274-284. 10.1088/1478-3975/3/4/005
Ribeiro A, Zhu R, Kauffman SA: A general modeling strategy for gene regulatory networks with stochastic dynamics. Journal of Computational Biology 2006,13(9):1630-1639. 10.1089/cmb.2006.13.1630
Karlebach G, Shamir R: Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology 2008,9(10):770-780. 10.1038/nrm2503
Gillespie DT: Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 1977,81(25):2340-2361. 10.1021/j100540a008
Wilkinson DJ: Stochastic modelling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics 2009,10(2):122-133. 10.1038/nrg2509
Golding I, Paulsson J, Zawilski SM, Cox EC: Real-time kinetics of gene activity in individual bacteria. Cell 2005,123(6):1025-1036. 10.1016/j.cell.2005.09.031
Boys RJ, Wilkinson DJ, Kirkwood TBL: Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing 2008,18(2):125-135. 10.1007/s11222-007-9043-x
Wang Y, Christley S, Mjolsness E, Xie X: Parameter inference for discretely observed stochastic kinetic models using stochastic gradient descent. BMC Systems Biology 2010, 4: article 99.
Ribeiro AS, Häkkinen A, Mannerström H, Lloyd-Price J, Yli-Harja O: Effects of the promoter open complex formation on gene expression dynamics. Physical Review E 2010,81(1).
Ota K, Yamada T, Yamanishi Y: Comprehensive analysis of delay in transcriptional regulation using expression profiles. Genome Informatics 2003, 14: 302-303.
Ribeiro AS: Stochastic and delayed stochastic models of gene expression and regulation. Mathematical Biosciences 2010,223(1):1-11. 10.1016/j.mbs.2009.10.007
Zhu R, Ribeiro AS, Salahub D, Kauffman SA: Studying genetic regulatory networks at the molecular level: delayed reaction stochastic models. Journal of Theoretical Biology 2007,246(4):725-745. 10.1016/j.jtbi.2007.01.021
McClure WR: Rate-limiting steps in RNA chain initiation. Proceedings of the National Academy of Sciences of the United States of America 1980,77(10 II):5634-5638.
Yu JI, Xiao J, Ren X, Lao K, Xie XS: Probing gene expression in live cells, one protein molecule at a time. Science 2006,311(5767):1600-1603. 10.1126/science.1119623
Bernstein JA, Khodursky AB, Lin P-H, Lin-Chao S, Cohen SN: Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America 2002,99(15):9697-9702. 10.1073/pnas.112318199
Fusco D, Accornero N, Lavoie B, Shenoy SM, Blanchard JM, Singer RH, Bertrand E: Single mRNA molecules demonstrate probabilistic movement in living mammalian cells. Current Biology 2003,13(2):161-167. 10.1016/S0960-9822(02)01436-7
This work was supported by Academy of Finland and FiDiPro program of Tekes.
Mannerstrom, H., Yli-Harja, O. & Ribeiro, A.S. Inference of Kinetic Parameters of Delayed Stochastic Models of Gene Expression Using a Markov Chain Approximation. J Bioinform Sys Biology 2011, 572876 (2011). https://doi.org/10.1155/2011/572876
- Stochastic Simulation Algorithm
- Single Molecule Level
- Chemical Master Equation
- Markov Chain Approximation
- Independent Time Series