Inference on autoregulation in gene expression with variance-to-mean ratio

Wang, Yue; He, Siqi

doi:10.1007/s00285-023-01924-6

Open access
Published: 03 May 2023

Volume 86, article number 87, (2023)

Journal of Mathematical Biology

Abstract

Some genes can promote or repress their own expressions, which is called autoregulation. Although gene regulation is a central topic in biology, autoregulation is much less studied. In general, it is extremely difficult to determine the existence of autoregulation with direct biochemical approaches. Nevertheless, some papers have observed that certain types of autoregulations are linked to noise levels in gene expression. We generalize these results by two propositions on discrete-state continuous-time Markov chains. These two propositions form a simple but robust method to infer the existence of autoregulation from gene expression data. This method only needs to compare the mean and variance of the gene expression level. Compared to other methods for inferring autoregulation, our method only requires non-interventional one-time data, and does not need to estimate parameters. Besides, our method has few restrictions on the model. We apply this method to four groups of experimental data and find some genes that might have autoregulation. Some inferred autoregulations have been verified by experiments or other theoretical works.

1 Introduction

In general, genes are transcribed to mRNAs and then translated to proteins. We can use the abundance of mRNA or protein to represent the expression levels of genes. Both the synthesis and degradation of mRNAs/proteins can be affected (activated or inhibited) by the expression levels of other genes (Karamyshev and Karamysheva 2018), which is called (mutual) gene regulation. Genes and their regulatory relations form a gene regulatory network (GRN) (Cunningham and Duester 2015), generally represented as a directed graph: each vertex is a gene, and each directed edge is a regulatory relationship. See Fig. 1 for an example of a GRN.

The expression of one gene could promote/repress its own expression, which is called positive/negative autoregulation (Carrier and Keasling 1999). Autoregulation is very common in E. coli (Shen-Orr et al. 2002). Positive autoregulation is also called autocatalysis or autoactivation, and negative autoregulation is also called autorepression (Baumdick et al. 2018; Fang et al. 2017). For instance, HOX proteins form and maintain spatially inhomogeneous expression of HOX genes (Sheth et al. 2014). For genes with position-specific expressions during development, it is common that the increase of one gene can further increase or decrease its level (Wang et al. 2020b). Autoregulation has the effect of stabilizing transposons in genomes (Bouuaert et al. 2013), which can affect cell behavior (Kang et al. 2015). Autoregulation can also stabilize the cell phenotype (Barros et al. 2011), which is related to cancer development (Zhou et al. 2014; Niu et al. 2015; Chen et al. 2016).

While countless works infer the regulatory relationships between different genes (the GRN structure) (Wang and Wang 2022), determining the existence of autoregulation is an equally important yet less-studied field. Due to technical limitations, it is difficult and sometimes impossible to directly detect autoregulation in experiments. Instead, we can measure gene expression profiles and infer the existence of autoregulation. In this paper, we consider a specific data type: measure the expression levels of certain genes without intervention for a single cell (which reaches stationarity) at a single time point, and repeat for many different cells to obtain a probability distribution for expression levels. Such single-cell non-interventional one-time gene expression data can be obtained with a relatively low cost (Luecken and Theis 2019).

With such single-cell level data for one gene V, we can calculate the ratio of variance and mean of the expression level (mRNA or protein count). This quantity is called the variance-to-mean ratio (VMR) or the Fano factor. Many papers that study gene expression systems with autoregulations have found that negative autoregulation can decrease noise (smaller VMR), and positive autoregulation can increase noise (larger VMR) (Thattai and Van Oudenaarden 2001; Swain 2004; Hornos et al. 2005; Munsky et al. 2012; Grönlund et al. 2013; Dessalles et al. 2017; Czuppon and Pfaffelhuber 2018). This means VMR can be used to infer the existence of autoregulation.
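As a minimal sketch of how the VMR could be computed from such data, the snippet below assumes the measurements for one gene are available as a one-dimensional array of per-cell counts. The function name `variance_to_mean_ratio` and the synthetic Poisson sample are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def variance_to_mean_ratio(counts):
    """Return the VMR (Fano factor) of a 1-D array of per-cell counts."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    if mean == 0:
        raise ValueError("VMR is undefined when the mean count is zero")
    # Population variance (ddof=0); for large samples the ddof choice matters little.
    return counts.var() / mean

# Synthetic stand-in for measured data (an assumption for illustration):
# Poisson counts, for which the VMR should be close to 1.
rng = np.random.default_rng(0)
poisson_counts = rng.poisson(lam=20.0, size=50_000)
vmr_poisson = variance_to_mean_ratio(poisson_counts)
print(f"VMR of synthetic Poisson counts: {vmr_poisson:.3f}")
```

With real data, sampling error in the estimated VMR would need to be taken into account before comparing it with 1.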

We generalize the above observation and develop two mathematical results that use VMR to determine the existence of autoregulation. They apply to some genes that have autoregulation. For genes without autoregulation, these results cannot determine that autoregulation does not exist. We apply these results to four experimental gene expression data sets and detect some genes that might have autoregulation.

We start with some setup and introduce our main results (Sect. 2). Then we cite some previous works on this topic and compare them with our results (Sect. 3). For a single gene that is not regulated by other genes (Sect. 4) and multiple genes that regulate each other (Sect. 5), we develop mathematical results to identify the existence of autoregulation. These two mathematical sections can be skipped. We summarize the procedure of our method and apply it to experimental data (Sect. 6). We finish with some conclusions and discussions (Sect. 7).

2 Setup and main results

One possible mechanism of “the increase of one gene’s expression level further increases its expression level” is a positive feedback loop between two genes (Hui et al. 2020). Here $V_1$ and $V_2$ promote each other, so that the increase of $V_1$ increases $V_2$, which in return further increases $V_1$. We should not regard this feedback loop as autoregulation. When we define autoregulation for a gene V, we should fix environmental factors and other genes that regulate V, and observe whether the expression level of V can affect itself. If V is in a feedback loop that contains other genes, then those genes (which regulate V and are regulated by V) cannot be fixed when we change V. Therefore, it is essentially difficult to determine whether V has autoregulation in this scenario. In the following, we need to assume that V is not contained in a feedback loop that involves other genes.

The actual gene expression mechanism might be complicated. Besides other genes/factors that can regulate a gene, for a gene V itself, it might switch between inactivated (off) and activated (on) states (Cao and Grima 2020). These states correspond to different transcription rates to produce mRNAs. When mRNAs are translated into proteins, those proteins might affect the transition of gene activation states, which forms autoregulation (Firman et al. 2018). See Fig. 2 for an illustration. Therefore, for a gene V, we should regard the gene activation state, mRNA count, and protein count as a triplet of random variables (G, M, P), which depend on each other.

When we fix environmental factors and other genes that affect V, the triplet (G, M, P) should follow a continuous-time Markov chain. A state consists of the gene activation state on/off (for G), the mRNA count on ${\mathbb {Z}}$ (for M), and the protein count on ${\mathbb {Z}}$ (for P). Thus the total space is $\{0,1\}\times {\mathbb {Z}} \times {\mathbb {Z}}$. When we consider the expression level M or P (but have no access to the value of G), sometimes it is itself Markovian (its dynamics can be fully determined by itself, without the knowledge of G), and we call this scenario “autonomous”. In other cases, M or P itself is no longer Markovian (its dynamics explicitly depends on G), and we call this scenario “non-autonomous”. We need to consider the triplet (G, M, P) in the non-autonomous scenario. This is similar to a hidden Markov model, where a two-dimensional Markov chain is no longer Markovian if projected to one dimension (since this dimension depends on the other dimension).

For the autonomous scenario, we can fully classify autoregulation for a gene V. Assume environmental factors and other genes that affect the expression of V are kept at constants. Define the expression level (mRNA count for example) of one cell to be $X=n$, the mRNA synthesis rate at $X=n-1$ to be $f_n$, and the degradation rate for each mRNA molecule at $X=n$ to be $g_n$. This is a standard continuous-time Markov chain on ${\mathbb {Z}}$ with transition rates

$$\begin{aligned} \frac{1}{\Delta t}{\mathbb {P}}[X(t+\Delta t)=n\mid X(t)=n-1]&=f_n,\\ \frac{1}{\Delta t}{\mathbb {P}}[X(t+\Delta t)=n-1\mid X(t)=n]&=ng_n. \end{aligned}$$

Define the relative growth rate $h_n=f_n/g_n$. If there is no autoregulation, then $h_n$ is a constant. Positive autoregulation means $h_n>h_{n-1}$ for some n, so that $f_n>f_{n-1}$ and/or $g_n<g_{n-1}$; negative autoregulation means $h_n<h_{n-1}$ for some n, so that $f_n<f_{n-1}$ and/or $g_n>g_{n-1}$. Notice that we can have $h_n>h_{n-1}$ for some n and $h_{n'}<h_{n'-1}$ for some other $n'$, meaning that positive autoregulation and negative autoregulation can both exist for the same gene, but occur at different expression levels. Such non-monotonicity in regulating gene expression often appears in reality (Angelini et al. 2022).

For the non-autonomous scenario, we can still define autoregulation. Consider the expression level X of V (mRNA count or protein count) and its interior factor I. If X is the mRNA count, then I is the gene state; if X is the protein count, then I is the gene state and the mRNA count. If there is no autoregulation, then X cannot affect I, and for each value of I, the relative growth rate $h_n$ of X is a constant. If X can affect I, or $h_n$ is not a constant, then there is autoregulation. When X can affect I, there is a directed cycle ($X\rightarrow I\rightarrow X$), and the change of X can affect itself through I. In this case, it is not always easy to distinguish between positive autoregulation and negative autoregulation.

Quantitatively, for the autonomous scenario, when we fix other factors that might regulate this gene V, if V has no autoregulation, then $h_n=f_n/g_n$ is a constant h for all n. In this case, the stationary distribution of V satisfies ${\mathbb {P}}(X=n)/{\mathbb {P}}(X=n-1)=h/n$, meaning that the distribution is Poissonian with parameter h, ${\mathbb {P}}(X=n)=h^ne^{-h}/n!$, and $\text {VMR}=1$. If there exists positive autoregulation of certain forms, $\text {VMR}>1$; if there exists negative autoregulation of certain forms, $\text {VMR}<1$. However, such results are derived by assuming that $f_n,g_n$ take certain functional forms, such as linear functions (Paulsson 2005; Ramos et al. 2015), quadratic functions (Giovanini et al. 2020), or Hill functions (Stewart et al. 2013). There are other papers that consider Markov chain models in gene expression/regulation (Jia 2017; Sharma and Adlakha 2014; Shmulevich et al. 2003; Chen and Jia 2020; Shen et al. 2019; Ko et al. 2019), but the role of VMR is not thoroughly studied.
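The Poisson claim can be checked directly. The short derivation below is a sketch that uses only the rates $f_n$, $g_n$ defined above and the balance of probability flux between the neighboring states $n-1$ and $n$ at stationarity, with $P_n={\mathbb {P}}(X=n)$:

$$\begin{aligned} P_n\, n g_n&=P_{n-1}f_n\ \Longrightarrow \ P_n=P_{n-1}\frac{h}{n}=P_0\frac{h^n}{n!},\\ 1=\sum _{n=0}^{\infty }P_n&=P_0e^{h}\ \Longrightarrow \ P_0=e^{-h},\\ {\mathbb {E}}(X)&=h,\quad {\mathbb {E}}[X(X-1)]=h^2\ \Longrightarrow \ \sigma ^2(X)=h^2+h-h^2=h,\quad \text {VMR}=1. \end{aligned}$$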

In this paper, we generalize the above result of inferring autoregulation with VMR by dropping the restrictions on parameters. Consider a gene V in a known GRN, and assume it is not regulated by other genes, or assume other factors that regulate V are fixed. Assume we have the autonomous scenario, meaning that its expression level $X=n$ satisfies a general Markov chain with synthesis rate $f_n$ and per molecule degradation rate $g_n$. We do not add any restrictions on $f_n$ and $g_n$. Use the single-cell non-interventional one-time gene expression data to calculate the VMR of V. Proposition 1 states that $\text {VMR}>1$ or $\text {VMR}<1$ means the existence of positive/negative autoregulation.

Nevertheless, the autonomous condition requires some assumptions, and often does not hold in reality (Bokes et al. 2012; Jia et al. 2017a; Jia 2020, 2017). Consider a gene V that is not regulated by other genes, and has no autoregulation. The mRNA count or the protein count is regulated by the gene activation state (an interior factor), which cannot be fixed. Due to this non-controllable factor, there might be transcriptional bursting (Shahrezaei and Swain 2008; Dobrinić et al. 2021) or translational bursting (Cagnetta et al. 2019), where transcription or translation can occur in bursts, and we have $\text {VMR}>1$. This does not mean that Proposition 1 is wrong. Instead, it means that the expression level itself is not Markovian, and the scenario is non-autonomous. In this scenario, we should apply Proposition 2, described below, which states that no autoregulation means $\text {VMR}\ge 1$.
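To illustrate how an interior factor alone can push the VMR above 1, the following sketch computes the stationary mRNA distribution of a two-state (on/off) gene whose switching rates do not depend on the mRNA count, so there is no autoregulation. The function name `telegraph_vmr`, the parameter names `k_on`, `k_off`, `s`, `d`, their values, and the truncation level are all assumptions chosen for illustration. The closed-form value used for comparison comes from a standard moment calculation for this model and is not taken from the text.

```python
import numpy as np

def telegraph_vmr(k_on, k_off, s, d, n_max=100):
    """Stationary VMR of the mRNA count M in a two-state (on/off) gene model.

    Gene switching does not depend on M and degradation is linear, so the
    model has no autoregulation. The mRNA count is truncated at n_max.
    """
    size = 2 * (n_max + 1)

    def idx(g, m):
        # Flatten (gene state, mRNA count) into a single state index.
        return g * (n_max + 1) + m

    Q = np.zeros((size, size))
    for m in range(n_max + 1):
        Q[idx(0, m), idx(1, m)] += k_on       # off -> on
        Q[idx(1, m), idx(0, m)] += k_off      # on -> off
        if m < n_max:
            Q[idx(1, m), idx(1, m + 1)] += s  # synthesis only when on
        if m > 0:
            for g in (0, 1):
                Q[idx(g, m), idx(g, m - 1)] += d * m  # degradation
    Q -= np.diag(Q.sum(axis=1))  # rows of a generator sum to zero

    # Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation.
    A = Q.T.copy()
    A[-1, :] = 1.0
    rhs = np.zeros(size)
    rhs[-1] = 1.0
    pi = np.linalg.solve(A, rhs)

    p_m = pi[: n_max + 1] + pi[n_max + 1:]  # marginal distribution of M
    n = np.arange(n_max + 1)
    mean = (n * p_m).sum()
    var = (n**2 * p_m).sum() - mean**2
    return var / mean

k_on, k_off, s, d = 0.5, 0.5, 20.0, 1.0
vmr_numeric = telegraph_vmr(k_on, k_off, s, d, n_max=100)
# Moment-based closed form for this model (an outside result, stated as an assumption).
vmr_moment = 1 + s * k_off / ((k_on + k_off) * (k_on + k_off + d))
print(vmr_numeric, vmr_moment)
```

Under these assumed parameters both values should be well above 1, even though the expression level does not regulate itself.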

We extend the idea of inferring autoregulation with VMR to a gene that is regulated by other genes, or with non-autonomous expression. Consider a gene $V'$ in a known GRN. Denote other genes that regulate $V'$ and the interior factors (gene state and/or mRNA count) of $V'$ by $\varvec{F}$. Denote the values of $V',\varvec{F}$ as $X,\varvec{Y}$. Assume $V'$ is not contained in a feedback loop, and assume $g_n$, the per molecule degradation rate of $V'$ at $X=n$, is not regulated by other genes or its interior factors (gene state and/or mRNA count). We do not add any restrictions on the synthesis rate $f_n$. Proposition 2 states that if $V'$ has no autoregulation, then $\text {VMR}(X)\ge 1$. Therefore, $\text {VMR}(X)<1$ means autoregulation for $V'$.

Proposition 2 is derived in a “one-step” Markov chain model, where at one time point, only transitions to the nearest neighbors are allowed: $(X=n,\varvec{Y}=\varvec{a})\rightarrow (X=n+1,\varvec{Y}=\varvec{a})$, $(X=n,\varvec{Y}=\varvec{a})\rightarrow (X=n-1,\varvec{Y}=\varvec{a})$, and $(X=n,\varvec{Y}=\varvec{a})\rightarrow (X=n,\varvec{Y}=\varvec{a}')$. This one-step Markov chain model is the most common approach in stochastic representations of gene regulation (Thattai and Van Oudenaarden 2001; Hornos et al. 2005; Paulsson 2005; Munsky et al. 2012; Czuppon and Pfaffelhuber 2018). Recently, some studies have considered “multi-step” Markov chain models, where at one time point, the change of mRNA/protein count can be accompanied by a change of other factors, such as the gene state (Braichenko et al. 2021; Karmakar and Das 2021; Voliotis et al. 2008). For example, the following transition is allowed: $(G^*,M=n)\rightarrow (G,M=n+1)$. In this multi-step model, Proposition 2 is no longer valid: even without autoregulation, it is possible that $\text {VMR}(X)<1$. Consider an example in which the production of one mRNA molecule needs many steps of gene state transition, and the gene returns to the initial step after producing one mRNA molecule: $G_1\rightarrow G_2\rightarrow \cdots \rightarrow G_k\rightarrow G_1+M$, $M\rightarrow \emptyset $. Since there are many steps, the total time for one cycle of $G_1\rightarrow \cdots \rightarrow G_k\rightarrow G_1+M$ can be highly deterministic, such as 1 second. Assume the degradation probability for each mRNA molecule in 1 second is 0.01. Then the mRNA count is highly concentrated near 100, and $\text {VMR}(X)<1$ (close to 0.5 in numerical simulations).
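The “close to 0.5” figure can be reproduced under a simplified reading of this example. The sketch below assumes that exactly one mRNA is produced per second (a fully deterministic cycle) and that each existing molecule independently survives each second with probability 0.99. This discrete-time simplification, the variable names, and the sample sizes are assumptions for illustration, not the simulation used in the paper.

```python
import numpy as np

survive = 0.99      # per-second survival probability of each mRNA (1 - 0.01)
n_cells = 10_000
n_seconds = 1_500   # long enough for the count to forget its initial value

rng = np.random.default_rng(1)
counts = np.zeros(n_cells, dtype=np.int64)
for _ in range(n_seconds):
    counts = rng.binomial(counts, survive)  # independent degradation
    counts += 1                             # exactly one new mRNA per cycle

sim_mean = counts.mean()
sim_vmr = counts.var() / sim_mean

# Exact value for this simplified model: right after production, the count is
# a sum of independent Bernoulli(survive**k) variables, k = 0, 1, 2, ...
k = np.arange(5_000)
p = survive**k
exact_vmr = (p * (1 - p)).sum() / p.sum()

print(f"simulated mean {sim_mean:.1f}, simulated VMR {sim_vmr:.3f}, exact VMR {exact_vmr:.4f}")
```

In this simplified model the exact ratio equals $q/(1+q)$ with $q=0.99$, which is just under one half, so a VMR below 1 arises without any autoregulation.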

Since multi-step models allow more transitions, they are more general than one-step models. However, it remains an open question whether such generalizations are necessary, since one-step models fit experimental data well (Jia et al. 2017b; Dessalles et al. 2017; Cao and Grima 2018). Proposition 2 provides a way to test this: if a gene has $\text {VMR}(X)<1$, but other methods determine that it has no autoregulation, then Proposition 2 implies that one-step models deviate from reality, and multi-step models should be adopted. Therefore, when one-step models hold, Proposition 2 is a valid method to determine the existence of autoregulation; when one-step models do not hold, combined with other methods to determine autoregulation, Proposition 2 can detect the failure of one-step models.

In scenarios where Proposition 2 applies, if $\text {VMR}\ge 1$, Proposition 2 cannot determine whether autoregulation exists. In fact, with VMR, or even the full probability distribution, we might not be able to distinguish a non-autonomous system with autoregulation from a non-autonomous system without autoregulation, which both have $\text {VMR}\ge 1$ (Cao and Grima 2018). In the non-autonomous scenario, we only focus on the less complicated case of $\text {VMR}<1$, and derive Proposition 2, which firmly links VMR and autoregulation.

In reality, Proposition 1 and Proposition 2 can only apply to a few genes (which are not regulated by other genes or have $\text {VMR}<1$), and they cannot determine negative results. Thus the inference results about autoregulation are a few “yes” and many “we do not know”. Besides, for the results inferred by Proposition 1, especially those with $\text {VMR}>1$ (positive autoregulation), we cannot verify whether their expression is autonomous, and the inference results are less reliable.

Current experimental methods can hardly determine the existence of autoregulation, and determining that a gene does not have autoregulation is even more difficult. Therefore, for the genes in a GRN, experiments do not give “yes” or “no” answers about autoregulation, but a few “yes” and many “we do not know”. Thus there is no gold standard to thoroughly evaluate the performance of our inference results. We can only report that some genes inferred by our method to have autoregulation are also verified by experiments or other inference methods to have autoregulation. Besides, if the result by Proposition 2 does not match with other methods, it is possible that the one-step model fails. As an alternative evaluation, in Appendix A, we test our methods with numerical simulations, and the performances of both Propositions are satisfactory.

3 Related works

There have been some results of inferring autoregulation with VMR (Thattai and Van Oudenaarden 2001; Swain 2004; Hornos et al. 2005; Munsky et al. 2012; Grönlund et al. 2013; Dessalles et al. 2017; Czuppon and Pfaffelhuber 2018). However, these VMR-based methods have various restrictions on the model, and some of them are derived by applying linear noise approximations, which are not always reliable in gene regulatory networks (Thomas et al. 2013).

Besides VMR-based methods, there are other mathematical approaches to infer the existence of autoregulation in gene expression (Sanchez-Castillo et al. 2018; Xing and Van Der Laan 2005; Feigelman et al. 2016; Veerman et al. 2021; Jia et al. 2018; Zhou and Zhang 2012; Jia and Grima 2020a, b). We introduce some works and compare them with our method. (A) Sanchez-Castillo et al. (2018) considered an autoregressive model for multiple genes. This method (1) needs time series data; (2) requires the dynamics to be linear; (3) estimates a group of parameters. (B) Xing and Van Der Laan (2005) applied causal inference to a complicated gene expression model. This method (1) needs promoter sequences and information on transcription factor binding sites; (2) requires linearity for certain steps; (3) estimates a group of parameters. (C) Feigelman et al. (2016) applied a Bayesian method for model selection. This method (1) needs time series data; (2) estimates a group of parameters. (D) Veerman et al. (2021) considered the probability-generating function of a propagator model. This method (1) needs time series data; (2) estimates a group of parameters; (3) needs to approximate a Cauchy integral. (E) Jia et al. (2018) compared the relaxation rate with degradation rate. This method (1) needs interventional data; (2) only works for a single gene that is not regulated by other genes; (3) requires that the per molecule degradation rate is a constant.

Compared to other more complicated methods, VMR-based methods (including ours) have two advantages: (1) VMR-based methods use non-interventional one-time data. Time series data require measuring the same cell multiple times without killing it, and interventional data require some techniques to interfere with gene expression, such as gene knockdown. Therefore, non-interventional one-time data used in VMR-based methods are much easier and cheaper to obtain. (2) VMR-based methods do not estimate parameters, and only calculate the mean and variance of the expression level. Some other methods need to estimate many parameters or approximate some complicated quantities, meaning that they need large data size and high data accuracy. Therefore, our method is easy to compute, and needs lower data accuracy and smaller data size.

Compared to other VMR-based methods, our method has few restrictions on the model, making it applicable to various scenarios with different dynamics. Besides, our derivations do not use any approximations.

In sum, compared to other VMR-based methods, our method is universal. Compared to other non-VMR-based methods, our method is simple, and has lower requirements on data quality.

Compared to other non-VMR-based methods, our method has some disadvantages: (1) The GRN structure needs to be known. (2) Our method does not work for certain genes, depending on regulatory relationships. Proposition 1 only works for a gene that is not regulated by other genes, and we require its expression to be autonomous; Proposition 2 only works for a gene that is not in a feedback loop. (3) Proposition 2 requires the per molecule degradation rate to be a constant, and it cannot provide information about autoregulation if $\text {VMR}\ge 1$. (4) Our method only works for cells at equilibrium. Thus time series data that contain time-specific information can only be used by treating them as one-time data. With just the stationary distribution, sometimes it is impossible to build the causal relationship (including autoregulation) (Wang and Wang 2020). Thus with this data type, some disadvantages are inevitable.

4 Scenario of a single isolated gene

4.1 Setup

We first consider the expression level (e.g., mRNA count) of one gene V in a single cell. At the single-cell level, gene expression is essentially stochastic, and we do not further consider dynamical system approaches with deterministic (Wang et al. 2020a) or stochastic (Ye et al. 2016) operators. We use a random variable X to represent the mRNA count of V. We assume V is not in a feedback loop. We also assume all environmental factors and other genes that can affect X are kept at constant levels, so that we can focus on V alone. This can be achieved if no other genes point to gene V in the GRN, such as PIP3 in Fig. 1. Then we assume that the expression of V is autonomous, thus X satisfies a time-homogeneous Markov chain defined on ${\mathbb {Z}}^*$.

Assume that the mRNA synthesis rate at $X(t)=n-1$, namely the transition rate from $X=n-1$ to $X=n$, is $f_n\ge 0$. Assume that with n mRNA molecules, the degradation rate for each mRNA molecule is $g_n>0$. Then the overall degradation rate at $X(t)=n$, namely the transition rate from $X=n$ to $X=n-1$, is $g_nn$. The associated master equation is

$$\begin{aligned} \begin{aligned} \frac{\textrm{d}{\mathbb {P}}[X(t)=n]}{\textrm{d}t}=&\,{\mathbb {P}}[X(t)=n+1]g_{n+1}(n+1)+{\mathbb {P}}[X(t)=n-1]f_n\\&-{\mathbb {P}}[X(t)=n](f_{n+1}+g_nn). \end{aligned} \end{aligned}$$

(1)

When $f_n,g_n$ take specific forms, this master equation also corresponds to a branching process, so that related techniques can be applied (Jiang et al. 2017). Define the relative growth rate $h_n=f_n/g_n$. We assume that as time tends to infinity, the process reaches equilibrium, where (1) the stationary probability distribution $P_n=\lim _{t\rightarrow \infty }{\mathbb {P}}[X(t)=n]$ exists, and $P_n=P_{n-1}h_n/n$; (2) the mean $\lim _{t\rightarrow \infty }{\mathbb {E}}[X(t)]$ and the variance $\lim _{t\rightarrow \infty }\sigma ^2[X(t)]$ are finite. Such requirements can be satisfied under simple assumptions, such as assuming $h_n$ has a finite upper bound (Norris 1998; Wang et al. 2022).

If $h_n>h_{n-1}$ for some n, then there exists positive autoregulation. If $h_n<h_{n-1}$ for some n, then there exists negative autoregulation. If there is no autoregulation, then $h_n$ is a constant h, and the stationary distribution is Poissonian with parameter h. In this setting, positive autoregulation and negative autoregulation might coexist, meaning that $h_{n+1}<h_n$ for some n and $h_{n'+1}>h_{n'}$ for some $n'$.

4.2 Theoretical results

With single-cell non-interventional one-time gene expression data for one gene, we have the stationary distribution of the Markov chain X. We can infer the existence of autoregulation with the VMR of X, defined as $\text {VMR}(X)=\sigma ^2(X)/{\mathbb {E}}(X)$. The idea is that if we let $f_n$ increase/decrease with n, and control $g_n$ to make ${\mathbb {E}}(X)$ invariant, then the variance $\sigma ^2(X)$ increases/decreases (Wang 2018, Section 2.5.1). We shall prove that $\text {VMR}>1$ implies the occurrence of positive autoregulation, and $\text {VMR}<1$ implies the occurrence of negative autoregulation. Notice that $\text {VMR}>1$ does not exclude the possibility that negative autoregulation exists for some expression level. This also applies to $\text {VMR}<1$ and positive autoregulation.

We can illustrate this result with a linear model:

Example 1

Consider a Markov chain that satisfies Eq. 1, and set $f_n=k+b(n-1)$, $g_n=c$. Here b (can be positive or negative) is the strength of autoregulation, and c satisfies $c>0$ and $c-b>0$. We can calculate that $\text {VMR}=1+b/(c-b)$ (see Appendix B.1 for details). Therefore, $\text {VMR}>1$ means positive autoregulation, $b>0$; $\text {VMR}<1$ means negative autoregulation, $b<0$; $\text {VMR}=1$ means no autoregulation, $b=0$.
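The closed form in Example 1 can be checked numerically. The sketch below builds the stationary distribution from the recursion $P_n=P_{n-1}h_n/n$ and compares the resulting VMR with $1+b/(c-b)$. The helper names `stationary_vmr` and `linear_h`, the truncation level, and the parameter values are assumptions for illustration. For $b<0$ the values are chosen so that $f_n$ reaches exactly zero at an integer $n$, which keeps the linear form valid on the whole reachable state space.

```python
import numpy as np

def stationary_vmr(h, n_max):
    """VMR of the stationary law P_n = P_{n-1} * h(n) / n, truncated at n_max."""
    log_w = np.zeros(n_max + 1)
    for n in range(1, n_max + 1):
        hn = h(n)
        if hn <= 0:               # synthesis has shut off: states >= n are unreachable
            log_w[n:] = -np.inf
            break
        log_w[n] = log_w[n - 1] + np.log(hn / n)
    w = np.exp(log_w - log_w.max())   # work in logs to avoid overflow
    p = w / w.sum()
    n = np.arange(n_max + 1)
    mean = (n * p).sum()
    return ((n**2 * p).sum() - mean**2) / mean

def linear_h(k, b, c):
    # h_n = f_n / g_n with f_n = k + b(n-1) and g_n = c, as in Example 1
    return lambda n: (k + b * (n - 1)) / c

k, c = 10.0, 1.0
results = {}
for b in (0.5, 0.0, -0.5):
    results[b] = stationary_vmr(linear_h(k, b, c), n_max=400)
    print(f"b = {b:+.1f}: numerical VMR = {results[b]:.6f}, "
          f"closed form = {1 + b / (c - b):.6f}")
```

The same `stationary_vmr` helper accepts any positive sequence $h_n$, so it could also be used to explore non-linear choices of $f_n$ and $g_n$.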

Lemma 1

Consider a Markov chain X(t) that follows Eq. 1 with general transition coefficients $f_n,g_n$. Here X(t) models the mRNA/protein count of one gene whose expression is autonomous. Calculate $\text {VMR}(X)$ at stationarity. (1) Assume $h_{n+1}\ge h_n$ for all n. We have $\text {VMR}(X)\ge 1$; moreover, $\text {VMR}(X)= 1$ if and only if $h_{n+1}= h_n$ for all n. (2) Assume $h_{n+1}\le h_n$ for all n. We have $\text {VMR}(X)\le 1$; moreover, $\text {VMR}(X)= 1$ if and only if $h_{n+1}= h_n$ for all n.

Taking the contrapositive of Lemma 1 yields the following proposition.

Proposition 1

In the setting of Lemma 1, (1) If $\text {VMR}(X)>1$, then there exists at least one value of n for which $h_{n+1}>h_n$; thus this gene has positive autoregulation. (2) If $\text {VMR}(X)<1$, then there exists at least one value of n for which $h_{n+1}<h_n$; thus this gene has negative autoregulation. (3) If $\text {VMR}(X)=1$, then either (A) $h_{n+1}=h_n$ for all n, meaning that this gene has no autoregulation; or (B) $h_{n+1}<h_n$ for one n and $h_{n'+1}>h_{n'}$ for another $n'$, meaning that this gene has both positive and negative autoregulation (at different expression levels).

Remark 1

Results similar to Proposition 1 have been proven by Jia et al. in another model of expression for a single gene (Jia et al. 2017b). However, they require that $g_i=g_j$ for any i, j. Proposition 1 can handle arbitrary $g_i$, thus being novel.

Proof of Lemma 1

Define $\lambda =-\log P_0$, so that $P_0=\exp (-\lambda )$. Define $d_n=\prod _{i=1}^{n}h_i>0$ and stipulate that $d_0=1$. We can see that

$$\begin{aligned} \frac{d_nd_{n+2}}{d_{n+1}^2}=\frac{h_{n+2}}{h_{n+1}}. \end{aligned}$$

Also,

$$\begin{aligned} P_n=P_{n-1}f_n/(g_n n)=P_{n-1}h_n/n=\cdots =P_0\left( \prod _{i=1}^{n}h_i\right) /n!\ =e^{-\lambda }\frac{d_n}{n!}. \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} {\mathbb {E}}(X^2)-{\mathbb {E}}(X)&=\sum _{n=1}^{\infty }(n^2-n)P_n=e^{-\lambda }\sum _{n=1}^{\infty }(n^2-n)\frac{d_n}{n!}\\&=e^{-\lambda }\sum _{n=2}^{\infty }\frac{d_n}{(n-2)!}=e^{-\lambda }\sum _{n=0}^{\infty }\frac{d_{n+2}}{n!}, \end{aligned} \end{aligned}$$

$$\begin{aligned}{}[{\mathbb {E}}(X)]^2=\left( \sum _{n=1}^{\infty }nP_n\right) ^2=e^{-2\lambda } \left( \sum _{n=1}^{\infty }n\frac{d_n}{n!}\right) ^2=e^{-2\lambda }\left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}\right) ^2. \end{aligned}$$

Besides,

$$\begin{aligned} 1=\sum _{n=0}^{\infty }P_n=e^{-\lambda }\sum _{n=0}^{\infty }\frac{d_n}{n!}. \end{aligned}$$

Now we have

$$\begin{aligned} {\mathbb {E}}(X^2)-{\mathbb {E}}(X)-[{\mathbb {E}}(X)]^2=e^{-2\lambda }\left( \sum _{n=0}^{\infty } \frac{d_n}{n!}\right) \left( \sum _{n=0}^{\infty }\frac{d_{n+2}}{n!}\right) -e^{-2\lambda } \left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}\right) ^2. \end{aligned}$$

(1) Assume $h_{n+1}\ge h_n$ for all n. Then

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}(X^2)-{\mathbb {E}}(X)-[{\mathbb {E}}(X)]^2\\ \ge&e^{-2\lambda }\left( \sum _{n=0}^{\infty }\frac{\sqrt{d_nd_{n+2}}}{n!}\right) ^2 - e^{-2\lambda }\left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}\right) ^2\ge 0. \end{aligned} \end{aligned}$$

(2)

Here the first inequality is from the Cauchy-Schwarz inequality, and the second inequality is from $d_nd_{n+2}\ge d_{n+1}^2$ for all n. Then $\text {VMR}(X)=\{{\mathbb {E}}(X^2)-[{\mathbb {E}}(X)]^2\}/{\mathbb {E}}(X)\ge 1$. Equality holds if and only if $d_n/d_{n+2}=d_{n+1}/d_{n+3}$ for all n (the first inequality of Eq. 2) and $d_nd_{n+2}=d_{n+1}^2$ for all n (the second inequality of Eq. 2). The equality condition is equivalent to $h_{n+1}=h_n$ for all n.

(2) Assume $h_{n+1}\le h_n$ for all n. Then $d_{n+2}/d_{n+1}\le d_{n+1}/d_n$, and $d_n\le h_1^n$ for all n. Define

$$\begin{aligned} H(t)=\sum _{n=0}^{\infty }\frac{d_n}{n!}t^n. \end{aligned}$$

Since $0<d_n\le h_1^n$, this series converges for all $t\in {\mathbb {C}}$, so that H(t) is a well-defined analytic function on ${\mathbb {C}}$, and

$$\begin{aligned} H'(t)=\sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}t^n,\ \text { and }\ H''(t)=\sum _{n=0}^{\infty }\frac{d_{n+2}}{n!}t^n. \end{aligned}$$

In the following, we only consider $H(t),H'(t),H''(t)$ as real functions for $t\in {\mathbb {R}}$.

To prove $\text {VMR}(X)\le 1$, we just need to prove ${\mathbb {E}}(X^2)-{\mathbb {E}}(X)-[{\mathbb {E}}(X)]^2=e^{-2\lambda }\{H(1)H''(1)-[H'(1)]^2\}\le 0$. In fact, we shall prove the stronger statement that $H''(t)H(t)\le [H'(t)]^2$ for all $t\in \mathfrak {I}$, where $\mathfrak {I}=(a,b)$ is a fixed interval in ${\mathbb {R}}$ with $0<a<1$ and $1<b<\infty $. Thus $t=1$ is an interior point of $\mathfrak {I}$. Since $H(t),H'(t),H''(t)$ have positive lower bounds on $\mathfrak {I}$, the following statements are equivalent: (i) $H''(t)H(t)\le [H'(t)]^2$ for all $t\in \mathfrak {I}$; (ii) $\{\log [H'(t)/H(t)]\}'\le 0$ for all $t\in \mathfrak {I}$; (iii) $\log [H'(t)/H(t)]$ is non-increasing on $\mathfrak {I}$; (iv) $H'(t)/H(t)$ is non-increasing on $\mathfrak {I}$. To prove (i), we just need to prove (iv).

Consider any $t_1,t_2\in \mathfrak {I}$ with $t_1\le t_2$ and any $p,q\in {\mathbb {N}}$ with $p\ge q$. Since $d_{p+1}/d_p\le d_{q+1}/d_q$, and $t_1^{p-q}\le t_2^{p-q}$, we have

$$\begin{aligned} d_pd_qt_1^qt_2^q\left( \frac{d_{p+1}}{d_p}-\frac{d_{q+1}}{d_q}\right) (t_1^{p-q}-t_2^{p-q})\ge 0, \end{aligned}$$

which means

$$\begin{aligned} d_{p+1}d_qt_1^pt_2^q+d_{q+1}d_p t_1^qt_2^p\ge d_{p+1}d_qt_2^pt_1^q+d_{q+1}d_pt_2^qt_1^p. \end{aligned}$$

Sum over all $p,q\in {\mathbb {N}}$ with $p\ge q$ to obtain

$$\begin{aligned} \begin{aligned} H'(t_1)H(t_2)&=\left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}t_1^n\right) \left( \sum _{n=0}^{\infty }\frac{d_n}{n!}t_2^n\right) \\&\ge \left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}t_2^n\right) \left( \sum _{n=0}^{\infty }\frac{d_n}{n!}t_1^n\right) =H'(t_2)H(t_1). \end{aligned} \end{aligned}$$

Thus $H'(t_1)/H(t_1)\ge H'(t_2)/H(t_2)$ for all $t_1,t_2\in \mathfrak {I}$ with $t_1\le t_2$. This means $H''(t)H(t)\le [H'(t)]^2$ for all $t\in \mathfrak {I}$, and $\text {VMR}(X)\le 1$.

For the equality condition, assume $h_{n'+1}<h_{n'}$ for some $n'$. Then

$$\begin{aligned} d_{n'}d_{n'-1}t_1^{n'-1}t_2^{n'-1}\left( \frac{d_{n'+1}}{d_{n'}}-\frac{d_{n'}}{d_{n'-1}}\right) (t_1-t_2)\ge C(t_2-t_1) \end{aligned}$$

for all $t_1,t_2\in \mathfrak {I}$ with $t_1\le t_2$ and a constant $C>0$ that does not depend on $t_1,t_2$. Therefore,

$$\begin{aligned} \begin{aligned}&[H'(t_1)/H(t_1)-H'(t_2)/H(t_2)]\cdot [H(t_1)H(t_2)]\\&\quad =\left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}t_1^n\right) \left( \sum _{n=0}^{\infty }\frac{d_n}{n!}t_2^n\right) - \left( \sum _{n=0}^{\infty }\frac{d_{n+1}}{n!}t_2^n\right) \left( \sum _{n=0}^{\infty }\frac{d_n}{n!}t_1^n\right) \\&\quad \ge d_{n'}d_{n'-1}t_1^{n'-1}t_2^{n'-1}\left( \frac{d_{n'+1}}{d_{n'}}-\frac{d_{n'}}{d_{n'-1}}\right) (t_1-t_2)\\&\quad \ge C(t_2-t_1). \end{aligned} \end{aligned}$$

Since H(t) has a finite positive upper bound A and a positive lower bound B on $\mathfrak {I}$, we have

$$\begin{aligned} H'(t_1)/H(t_1)-H'(t_2)/H(t_2)\ge C(t_2-t_1)/A^2, \end{aligned}$$

meaning that

$$\begin{aligned} \forall t\in \mathfrak {I},\,\,[H'(t)/H(t)]'=\{H(t)H''(t)-[H'(t)]^2\}/[H(t)]^2\le -C/A^2, \end{aligned}$$

and thus

$$\begin{aligned} \forall t\in \mathfrak {I},\,\,H(t)H''(t)-[H'(t)]^2\le -CB^2/A^2<0. \end{aligned}$$

Therefore, ${\mathbb {E}}(X^2)-{\mathbb {E}}(X)-[{\mathbb {E}}(X)]^2=e^{-2\lambda }\{H(1)H''(1)-[H'(1)]^2\}<0$, and $\text {VMR}(X)<1$.

We have proved in (1) that if $h_{n+1}=h_n$ for all n, then $\text {VMR}(X)=1$. Thus when $h_{n+1}\le h_n$ for all n, $\text {VMR}(X)=1$ if and only if $h_{n+1}=h_n$ for all n. $\square $

In sum, for the Markov chain model of one gene (by assuming the expression to be autonomous), when we have the stationary distribution from single-cell non-interventional one-time gene expression data, we can calculate the VMR of X. $\text {VMR}(X)>1$ means the existence of positive autoregulation (while negative autoregulation might still be possible at different expression levels), and $\text {VMR}(X)<1$ means the existence of negative autoregulation (while positive autoregulation might still be possible at different expression levels). $\text {VMR}(X)=1$ means either (1) no autoregulation exists; or (2) both positive autoregulation and negative autoregulation exist (at different expression levels). In reality, many genes are non-autonomous, and transcriptional/translational bursting can make the VMR larger than 100 (Paulsson 2005). Since Proposition 1 does not apply to non-autonomous cases, such genes might have a large VMR without any autoregulation.
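The three cases above can be illustrated numerically. The following is a minimal sketch, assuming that the stationary distribution of the autonomous one-step chain is $P_n\propto h_1h_2\cdots h_n/n!$ (so that $P_n/P_{n-1}=h_n/n$), which is what the normalization used in the proof suggests. The function name `stationary_vmr`, the truncation level `n_max`, and the three example choices of $h_n$ are hypothetical and only serve as illustration.

```python
def stationary_vmr(h, n_max=400):
    """VMR of the law P_n proportional to h(1)...h(n)/n!, truncated at n_max."""
    weights = [1.0]                                  # unnormalised P_0
    for n in range(1, n_max + 1):
        weights.append(weights[-1] * h(n) / n)       # P_n / P_{n-1} = h_n / n
    total = sum(weights)
    mean = sum(n * w for n, w in enumerate(weights)) / total
    second = sum(n * n * w for n, w in enumerate(weights)) / total
    return (second - mean ** 2) / mean

# Constant h_n (no autoregulation): Poisson distribution, VMR = 1.
vmr_none = stationary_vmr(lambda n: 4.0)
# Decreasing h_n (negative autoregulation): VMR below 1.
vmr_negative = stationary_vmr(lambda n: 8.0 / (1 + 0.2 * n))
# Increasing but bounded h_n (positive autoregulation): VMR above 1.
vmr_positive = stationary_vmr(lambda n: 2.0 + 3.0 * n / (n + 5))

print(vmr_none, vmr_negative, vmr_positive)
```

The increasing example is kept bounded so that the truncated sum is a good approximation of the infinite one.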

5 Scenario of multiple entangled genes

5.1 Setup

We consider m genes $V_1,\ldots ,V_m$ for a single cell. Denote their expression levels by random variables $X_1,\ldots ,X_m$. The change of $X_i$ can depend on $X_j$ (mutual regulation) and $X_i$ itself (autoregulation). Since these genes regulate each other, and their expression levels are not fixed, we cannot consider them separately. If the expression of gene $V_k$ is non-autonomous, we also need to add its interior factors (gene state and/or mRNA count) into $X_1,\ldots ,X_m$.

We can use a continuous-time one-step Markov chain on $({\mathbb {Z}}^*)^m$ to describe the dynamics. Each state of this Markov chain, $(X_1=n_1,\ldots ,X_i=n_i,\ldots ,X_m=n_m)$, can be abbreviated as $\varvec{n}=(n_1,\ldots ,n_i,\ldots ,n_m)$. For gene $V_i$, the transition rate of $n_i-1\rightarrow n_i$ is $f_i(\varvec{n})$, and the transition rate of $n_i\rightarrow n_i-1$ is $g_i(\varvec{n})n_i$. Transitions with more than one step are not allowed. The master equation of this process is

$$\begin{aligned} \begin{aligned} \frac{\textrm{d}{\mathbb {P}}(\varvec{n})}{\textrm{d}t} =&\sum _i{\mathbb {P}}(n_1,\ldots ,n_i+1,\ldots ,n_m)g_i(n_1,\ldots ,n_i+1,\ldots ,n_m)(n_i+1)\\&+\sum _i{\mathbb {P}}(n_1,\ldots ,n_i-1,\ldots ,n_m)f_i(\varvec{n})\\&-{\mathbb {P}}(\varvec{n})\sum _i[f_i(n_1,\ldots ,n_i+1,\ldots ,n_m)+g_i(\varvec{n})n_i]. \end{aligned} \end{aligned}$$

(3)

Define $\varvec{n}_{\bar{i}}=(n_1,\ldots ,n_{i-1},n_{i+1},\ldots ,n_m)$. Define $h_i(\varvec{n})=f_i(\varvec{n})/g_i(\varvec{n})$ to be the relative growth rate of gene $V_i$. Autoregulation means for some fixed $\varvec{n}_{\bar{i}}$, $h_i(\varvec{n})$ is (locally) increasing/decreasing with $n_i$, thus $f_i(\varvec{n})$ increases/decreases and/or $g_i(\varvec{n})$ decreases/increases with $n_i$. For the non-autonomous scenario, another possibility for autoregulation is that $V_i$ can affect its interior factors (gene state and/or mRNA count).
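To make the dynamics of Eq. 3 concrete, here is a minimal sketch of a Gillespie-type simulation of the one-step chain. It assumes the rate convention stated above: the jump $n_i\rightarrow n_i+1$ fires at rate $f_i$ evaluated at the target state, and the jump $n_i\rightarrow n_i-1$ fires at rate $g_i(\varvec{n})n_i$. The function name `simulate_one_step_chain` and the example rates are hypothetical; time averages over a long trajectory are used as a stand-in for the stationary distribution.

```python
import random

def simulate_one_step_chain(f, g, n0, t_end, seed=0):
    """Simulate the chain up to t_end; return time-averaged means and variances."""
    rng = random.Random(seed)
    n = list(n0)
    m = len(n)
    t = 0.0
    first = [0.0] * m     # time integrals of n_i
    second = [0.0] * m    # time integrals of n_i^2
    while True:
        rates = []
        for i in range(m):
            target = n.copy()
            target[i] += 1
            rates.append(f[i](target))      # n_i -> n_i + 1, rate f_i at the target state
            rates.append(g[i](n) * n[i])    # n_i -> n_i - 1, rate g_i(n) * n_i
        total = sum(rates)
        remaining = t_end - t
        hold = rng.expovariate(total) if total > 0 else remaining
        dt = min(hold, remaining)
        for i in range(m):
            first[i] += n[i] * dt
            second[i] += n[i] ** 2 * dt
        if hold >= remaining:
            break                            # time horizon reached, no further jump
        t += hold
        # Pick the jump with probability proportional to its rate.
        u = rng.random() * total
        chosen = None
        for idx, r in enumerate(rates):
            if r <= 0:
                continue
            chosen = idx
            u -= r
            if u < 0:
                break
        n[chosen // 2] += 1 if chosen % 2 == 0 else -1
    means = [s / t_end for s in first]
    variances = [s2 / t_end - mu ** 2 for s2, mu in zip(second, means)]
    return means, variances

# Single gene with constant synthesis and per-molecule degradation rates.
means, variances = simulate_one_step_chain(
    f=[lambda n: 5.0], g=[lambda n: 1.0], n0=[5], t_end=5000.0)
vmr_estimate = variances[0] / means[0]
print(means[0], vmr_estimate)
```

For this constitutive example the stationary distribution is Poisson, so the estimated VMR should be close to 1 up to sampling error.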

5.2 Theoretical results

With expression data for multiple genes, there are various methods to infer the regulatory relationships between different genes, so that the GRN can be reconstructed (Wang and Wang 2022). In the GRN, if there is a directed path from gene $V_i$ to gene $V_j$, meaning that $V_i$ can directly or indirectly regulate $V_j$, then $V_i$ is an ancestor of $V_j$, and $V_j$ is a descendant of $V_i$.

Fix a gene $V_k$ in a GRN. We consider the simple case in which $V_k$ is not contained in any directed cycle (feedback loop), which means no gene is both an ancestor and a descendant of $V_k$, such as PIP2 in Fig. 1. This means $V_k$ itself is a strongly connected component of the GRN. This condition is automatically satisfied if the GRN has no directed cycle. If the expression of $V_k$ is non-autonomous, we need to add the interior factors (gene state and/or mRNA count) of $V_k$ into $V_1,\ldots ,V_m$, and it is acceptable that $V_k$ regulates its interior factors. In this case, if the one-step model holds, we can prove that if $V_k$ does not regulate itself, meaning that $h_k(\varvec{n})$ is a constant for fixed $\varvec{n}_{\bar{k}}$ and different $n_k$, and $X_k$ does not affect its interior factors (if non-autonomous), then $\text {VMR}(X_k)\ge 1$. The reason is that $\text {VMR}<1$ requires either a feedback loop or autoregulation. Certainly, $\text {VMR}<1$ might also mean that the one-step model fails. One intuition is to assume the transitions of $V_{\bar{k}}$ are extremely slow, so that the distribution of $X_k$ is approximately a mixture of many Poisson distributions. It is easy to verify that a mixture of Poisson distributions has $\text {VMR}\ge 1$. We need to assume that the per molecule degradation rate $g_k(\cdot )$ for $V_k$ is not affected by $V_1,\ldots ,V_m$, which is not always true in reality (Karamyshev and Karamysheva 2018). With this result, when $\text {VMR}<1$, there might be autoregulation.

Proposition 2

Consider the one-step Markov chain model for multiple genes, described by Eq. 3. Assume the GRN has no directed cycle, or at least there is no directed cycle that contains gene $V_k$. Assume $g_k(\cdot )$ is a constant for all $\varvec{n}$. If $V_k$ has no autoregulation, meaning that $h_k(\cdot )$ and $f_k(\cdot )$ do not depend on $n_k$, and $V_k$ does not regulate its interior factors (gene state and/or mRNA count), then $V_k$ has $\text {VMR}\ge 1$. Therefore, $\text {VMR}< 1$ for $V_k$ means that $V_k$ has autoregulation, or that the one-step model fails.

Paulsson et al. study a similar problem (Hilfinger et al. 2016; Yan et al. 2019), and they state Proposition 2 in an unpublished work [personal communication from Dr. Jiawei Yan in Jan. 2022]. Proposition 2 also appears in a preprint by Mahajan et al. (2021), but the proof is based on a linear noise approximation, which requires that $f_k(\cdot )$ is linear in $\varvec{n}_{\bar{k}}$. We propose a rigorous proof independently.

Proof

Denote the expression level of $V_k$ by W. Assume the ancestors of $V_k$ are $V_1,\ldots ,V_l$. For simplicity, denote the expression levels of $V_1,\ldots ,V_l$ by a (high-dimensional) random variable Y. Assume $V_k$ has no autoregulation. Since $V_k$ does not regulate $V_1,\ldots ,V_l$, W does not affect Y. Denote the transition rate from $Y=i$ to $Y=j$ by $q_{ij}\ge 0$. Stipulate that $q_{ii}=-\sum _{j\ne i}q_{ij}$. When $Y=i$, the transition rate from $W=n$ to $W=n+1$ is $F_i$ (does not depend on n), and the transition rate from $W=n$ to $W=n-1$ is $Gn$, where the per molecule degradation rate G is a constant.

The master equation of this process is

$$\begin{aligned} \begin{aligned}&\frac{\textrm{d}{\mathbb {P}}[W(t)=n,Y(t)=i]}{\textrm{d}t}\\&\quad ={\mathbb {P}}[W(t)=n-1,Y(t)=i]F_i+{\mathbb {P}}[W(t)=n+1,Y(t)=i]G(n+1)\\&\qquad +\sum _{j\ne i}{\mathbb {P}}[W(t)=n,Y(t)=j]q_{ji}-{\mathbb {P}}[W(t)=n,Y(t)=i](F_i+Gn+\sum _{j\ne i}q_{ij}). \end{aligned} \end{aligned}$$

Assume there is a unique stationary probability distribution $P_{n,i}=\lim _{t\rightarrow \infty }{\mathbb {P}}[W(t)=n,Y(t)=i]$. This can be guaranteed by assuming the process to be irreducible. Then we have

$$\begin{aligned} P_{n,i}\Big [F_i+Gn+\sum _{j}q_{ij}\Big ]=P_{n-1,i}F_i+P_{n+1,i}G(n+1)+\sum _{j}P_{n,j}q_{ji}. \end{aligned}$$

(4)

Define $P_i=\sum _n P_{n,i}$. Sum over n for Eq. 4 to obtain

$$\begin{aligned} P_i\sum _{j}q_{ij}=\sum _{j}P_jq_{ji}, \end{aligned}$$

(5)

meaning that $P_i$ is the stationary probability distribution of Y.

Define $W_i$ to be W conditioned on $Y=i$ at stationarity. Then ${\mathbb {P}}(W_i=n)={\mathbb {P}}(W=n\mid Y=i)=P_{n,i}/P_i$, and ${\mathbb {E}}(W_i)=\sum _n n P_{n,i}/P_i$. Multiply Eq. 4 by n and sum over n to obtain

$$\begin{aligned} \Big (G+\sum _j q_{ij}\Big )P_i{\mathbb {E}}(W_i)=F_iP_i+\sum _j q_{ji}P_j{\mathbb {E}}(W_j). \end{aligned}$$

(6)

Here and in the following, we repeatedly apply the tricks of splitting n and shifting the index of summation. For example,

$$\begin{aligned} \begin{aligned}&\sum _{n=1}^{\infty } P_{n-1,i}F_i n-\sum _{n=1}^{\infty } P_{n,i}F_i n\\&\quad =\sum _{n=1}^{\infty } P_{n-1,i}F_i (n-1)+\sum _{n=1}^{\infty } P_{n-1,i}F_i-\sum _{n=1}^{\infty } P_{n,i}F_i n\\&\quad =\sum _{n-1=0}^{\infty } P_{n-1,i}F_i (n-1)+\sum _{n-1=0}^{\infty } P_{n-1,i}F_i-\sum _{n=0}^{\infty } P_{n,i}F_i n\\&\quad =\sum _{n=0}^{\infty } P_{n,i}F_i n+F_i\sum _{n=0}^{\infty } P_{n,i}-\sum _{n=0}^{\infty } P_{n,i}F_i n=F_iP_i. \end{aligned} \end{aligned}$$

Sum over i for Eq. 6 to obtain

$$\begin{aligned} G\sum _iP_i{\mathbb {E}}(W_i)=\sum _iF_iP_i. \end{aligned}$$

(7)

Multiply Eq. 4 by $n^2$ and sum over n to obtain

$$\begin{aligned} \Big (2G+\sum _j q_{ij}\Big )P_i{\mathbb {E}}(W_i^2)=F_iP_i+(2F_i+G)P_i{\mathbb {E}}(W_i)+\sum _j q_{ji}P_j{\mathbb {E}}(W_j^2). \end{aligned}$$

(8)

Sum over i for Eq. 8 to obtain

$$\begin{aligned} 2G\sum _iP_i{\mathbb {E}}(W_i^2)=\sum _iF_iP_i+2\sum _i F_iP_i{\mathbb {E}}(W_i)+G\sum _iP_i{\mathbb {E}}(W_i). \end{aligned}$$

(9)

Multiply Eq. 6 by ${\mathbb {E}}(W_i)$ and sum over i to obtain

$$\begin{aligned} \begin{aligned}&G\sum _iP_i[{\mathbb {E}}(W_i)]^2+\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_i)]^2\\&\quad =\sum _iF_iP_i{\mathbb {E}}(W_i)+\sum _{i,j}P_jq_{ji}{\mathbb {E}}(W_i){\mathbb {E}}(W_j). \end{aligned} \end{aligned}$$

(10)

Then we have

$$\begin{aligned} \begin{aligned}&\sum _iF_iP_i{\mathbb {E}}(W_i)-G\sum _iP_i[{\mathbb {E}}(W_i)]^2 \\&\quad =\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_i)]^2-\sum _{i,j}P_jq_{ji}{\mathbb {E}}(W_i){\mathbb {E}}(W_j)\\&\quad =\frac{1}{2}\Big \{\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_i)]^2+\sum _i[{\mathbb {E}}(W_i)]^2\sum _jP_iq_{ij} -2\sum _{i,j}P_iq_{ij}{\mathbb {E}}(W_i){\mathbb {E}}(W_j)\Big \}\\&\quad =\frac{1}{2}\Big \{\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_i)]^2+\sum _i[{\mathbb {E}}(W_i)]^2\sum _jP_jq_{ji} -2\sum _{i,j}P_iq_{ij}{\mathbb {E}}(W_i){\mathbb {E}}(W_j)\Big \}\\&\quad =\frac{1}{2}\Big \{\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_i)]^2+\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_j)]^2 -2\sum _{i,j}P_iq_{ij}{\mathbb {E}}(W_i){\mathbb {E}}(W_j)\Big \}\\&\quad =\frac{1}{2}\sum _{i,j}P_iq_{ij}[{\mathbb {E}}(W_i)-{\mathbb {E}}(W_j)]^2\ge 0. \end{aligned} \end{aligned}$$

(11)

Here the first equality is from Eq. 10, the third equality is from Eq. 5, and other equalities are equivalent transformations.

Now we have

$$\begin{aligned}&{\mathbb {E}}(W^2)-{\mathbb {E}}(W)-[{\mathbb {E}}(W)]^2\nonumber \\&\quad =\sum _iP_i{\mathbb {E}}(W_i^2)-\sum _iP_i{\mathbb {E}}(W_i)-\Big [\sum _iP_i{\mathbb {E}}(W_i)\Big ]^2\nonumber \\&\quad =\frac{1}{G}\sum _i F_iP_i{\mathbb {E}}(W_i)+\sum _i P_i{\mathbb {E}}(W_i)-\sum _iP_i{\mathbb {E}}(W_i) -\Big [\sum _iP_i{\mathbb {E}}(W_i)\Big ]^2\nonumber \\&\quad \ge \sum _iP_i[{\mathbb {E}}(W_i)]^2-\Big [\sum _iP_i{\mathbb {E}}(W_i)\Big ]^2\nonumber \\&\quad =\Big (\sum _iP_i\Big )\sum _iP_i[{\mathbb {E}}(W_i)]^2-\Big [\sum _iP_i{\mathbb {E}}(W_i)\Big ]^2\ge 0, \end{aligned}$$

(12)

where the first equality is by definition, the second equality is from Eqs. 7, 9, the first inequality is from Eq. 11, the third equality is from $\sum _i P_i=1$, and the second inequality is the Cauchy-Schwarz inequality.

Since ${\mathbb {E}}(W^2)-[{\mathbb {E}}(W)]^2\ge {\mathbb {E}}(W)$, $\text {VMR}(W)=\{{\mathbb {E}}(W^2)-[{\mathbb {E}}(W)]^2\}/ {\mathbb {E}}(W)\ge 1$. $\square $
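The moment equations in the proof are linear, so they can be solved numerically for a small upstream chain. The following is a minimal sketch, assuming a finite state space for Y: it solves Eq. 5 for $P_i$, Eq. 6 for $P_i{\mathbb {E}}(W_i)$, and Eq. 8 for $P_i{\mathbb {E}}(W_i^2)$, and then evaluates the VMR of W. The helper names `solve_linear` and `downstream_vmr` and the two-state example rates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def downstream_vmr(q, f_rates, g_rate):
    """VMR of W given off-diagonal rates q[i][j] of Y, synthesis rates F_i, and G."""
    k = len(q)
    # Generator of Y with q_ii = -sum_{j != i} q_ij.
    gen = [[q[i][j] if i != j else -sum(q[i][l] for l in range(k) if l != i)
            for j in range(k)] for i in range(k)]
    # Eq. 5: sum_i P_i q_ij = 0, with the last equation replaced by sum_i P_i = 1.
    a = [[gen[i][j] for i in range(k)] for j in range(k)]
    a[-1] = [1.0] * k
    p = solve_linear(a, [0.0] * (k - 1) + [1.0])
    # Eq. 6 in matrix form: (G I - Q^T) m = (F_i P_i), where m_i = P_i E(W_i).
    a1 = [[(g_rate if i == j else 0.0) - gen[j][i] for j in range(k)] for i in range(k)]
    m1 = solve_linear(a1, [f_rates[i] * p[i] for i in range(k)])
    # Eq. 8 in matrix form: (2G I - Q^T) s = F_i P_i + (2 F_i + G) m_i, s_i = P_i E(W_i^2).
    a2 = [[(2 * g_rate if i == j else 0.0) - gen[j][i] for j in range(k)] for i in range(k)]
    rhs = [f_rates[i] * p[i] + (2 * f_rates[i] + g_rate) * m1[i] for i in range(k)]
    m2 = solve_linear(a2, rhs)
    mean, second = sum(m1), sum(m2)
    return (second - mean ** 2) / mean

switch = [[0.0, 1.0], [1.0, 0.0]]          # Y flips between two states at rate 1
vmr_switching = downstream_vmr(switch, f_rates=[0.0, 2.0], g_rate=1.0)
vmr_constant = downstream_vmr(switch, f_rates=[2.0, 2.0], g_rate=1.0)
print(vmr_switching, vmr_constant)
```

When $F_i$ is the same for all i, the upstream state is irrelevant and the VMR equals 1; when $F_i$ differs between states, the VMR exceeds 1, as Proposition 2 states.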

Remark 2

In gene expression, the total noise ($\sigma ^2(X)/({\mathbb {E}}X)^2$) can be decomposed into intrinsic (cellular) noise and extrinsic (environmental) noise (Baudrimont et al. 2019; Thomas 2019; Ham et al. 2020; Lin and Amir 2021; Wang et al. 2019). Inspired by that, we can decompose the VMR into intrinsic and extrinsic components. Denote the intrinsic and extrinsic stochastic factors by I and E, and let the expression level X be a deterministic function of these factors: $X=X(I,E)$. Then

$$\begin{aligned} \text {VMR}_{\text {int}}&=\frac{{\mathbb {E}}_E({\mathbb {E}}_{I\mid E}X^2)- {\mathbb {E}}_E[({\mathbb {E}}_{I\mid E}X)^2]}{{\mathbb {E}}X},\\ \text {VMR}_{\text {ext}}&=\frac{{\mathbb {E}}_E[({\mathbb {E}}_{I\mid E}X)^2]-[{\mathbb {E}}_E({\mathbb {E}}_{I\mid E}X)]^2}{{\mathbb {E}}X}, \end{aligned}$$

where ${\mathbb {E}}_{I\mid E}$ is the expectation conditioned on E. This decomposition might lead to further understanding of Proposition 2.
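As a brief check of this decomposition (a sketch that uses only the law of total expectation; the Poisson special case and the symbol $\Lambda $ are illustrative assumptions, not part of the remark), the two components add up to the total VMR:

$$\begin{aligned} \text {VMR}_{\text {int}}+\text {VMR}_{\text {ext}}=\frac{{\mathbb {E}}_E({\mathbb {E}}_{I\mid E}X^2)-[{\mathbb {E}}_E({\mathbb {E}}_{I\mid E}X)]^2}{{\mathbb {E}}X}=\frac{{\mathbb {E}}(X^2)-({\mathbb {E}}X)^2}{{\mathbb {E}}X}=\text {VMR}(X). \end{aligned}$$

If, for instance, X conditioned on E is Poisson with mean $\Lambda (E)$, then ${\mathbb {E}}_{I\mid E}X^2-({\mathbb {E}}_{I\mid E}X)^2={\mathbb {E}}_{I\mid E}X=\Lambda (E)$, so that

$$\begin{aligned} \text {VMR}_{\text {int}}=\frac{{\mathbb {E}}_E\Lambda }{{\mathbb {E}}_E\Lambda }=1,\qquad \text {VMR}_{\text {ext}}=\frac{\sigma ^2(\Lambda )}{{\mathbb {E}}_E\Lambda }\ge 0, \end{aligned}$$

and therefore $\text {VMR}(X)\ge 1$, which agrees with the slow-environment intuition given before Proposition 2.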

We hypothesize that the requirement for $g_k(\cdot )$ in Proposition 2 can be dropped:

Conjecture 1

Assume $V_k$ is not contained in a directed cycle in the GRN, and $V_k$ does not regulate its interior factors (gene state and/or mRNA count). If $V_k$ has no autoregulation, meaning that $h_k(\cdot )$ does not depend on $n_k$ (but might depend on $\varvec{n}_{\bar{k}}$), then $V_k$ has $\text {VMR}\ge 1$.

The main obstacle to proving this conjecture is that the second equality in Eq. 12 does not hold. The reason is that the degradation rate, which now depends on the state of Y and is denoted by $G_i$, cannot be extracted from the summation, and we cannot link $\sum _iP_i{\mathbb {E}}(W_i^2)$ and $\sum _i G_iP_i{\mathbb {E}}(W_i^2)$.

If the GRN has directed cycles, there is a result by Hilfinger et al. (2016) and Yan et al. (2019), which is proved under first-order approximations of covariances. The general case (when the approximations do not apply) has been numerically verified but not proved yet:

Conjecture 2

Assume for each $V_i$, $g_i(\cdot )$ does not depend on $\varvec{n}$, and $f_i(\cdot )$ does not depend on $n_i$ (no autoregulation). Then for at least one gene $V_j$, we have $\text {VMR}\ge 1$ (Hilfinger et al. 2016; Yan et al. 2019).

Due to the existence of directed cycles, one gene can affect itself through other genes, and we cannot study them separately.

Notice that Conjecture 2 does not hold if $g_i$ depends on $\varvec{n}_{\bar{i}}$:

Example 2

Consider a one-step Markov chain that satisfies Eq. 3, where $m=2$, $f_1(n_2)=g_1(n_2)=1$ for $n_2=2$, $f_1(n_2)=g_1(n_2)=0$ for $n_2\ne 2$, and $f_2(n_1)=g_2(n_1)=1$ for $n_1=2$, $f_2(n_1)=g_2(n_1)=0$ for $n_1\ne 2$. The initial state is $(n_1=2,n_2=2)$. Then $\text {VMR}=2e/(4e-1)\approx 0.55$ for both genes (see Appendix B.2 for details).
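The value in Example 2 can be checked numerically. The following is a minimal sketch under assumptions that are not taken from Appendix B.2: starting from $(2,2)$, the reachable states are taken to be the cross formed by $(k,2)$ and $(2,k)$ for $k\ge 0$, and detailed balance along each arm is assumed to give stationary weights proportional to $1/k!$, with $(2,2)$ counted once. The function name `example2_vmr` and the truncation level are hypothetical.

```python
import math

def example2_vmr(n_max=60):
    """VMR of n_1 in Example 2 under the assumed cross-shaped stationary law."""
    # Unnormalised weight 1/k! for the states (k, 2), and likewise for (2, k).
    arm = [1.0 / math.factorial(k) for k in range(n_max + 1)]
    marginal = arm.copy()                  # contribution of the states (k, 2) to n_1 = k
    marginal[2] += sum(arm) - arm[2]       # states (2, k) with k != 2 all have n_1 = 2
    total = sum(marginal)
    mean = sum(k * w for k, w in enumerate(marginal)) / total
    second = sum(k * k * w for k, w in enumerate(marginal)) / total
    return (second - mean ** 2) / mean

vmr_example2 = example2_vmr()
closed_form = 2 * math.e / (4 * math.e - 1)
print(vmr_example2, closed_form)
```

By symmetry the same computation applies to $n_2$.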

Assume Conjecture 2 is correct. For m genes, if we find that VMR for each gene is less than 1, then we can infer that autoregulation exists, although we do not know which gene has autoregulation. Another possibility is that the one-step model fails.

6 Applying theoretical results to experimental data

We summarize our theoretical results in Algorithm 1. Proposition 1 applies to a gene that has no ancestor in the GRN. However, it requires that the corresponding gene has autonomous expression (or that the transition rates of gene states are high enough, so that the non-autonomous process is close to an autonomous process), which is difficult to validate and often does not hold in reality. Thus the inference result by Proposition 1 for $\text {VMR}>1$ (positive autoregulation) is not very reliable. When $\text {VMR}<1$ and Proposition 1 could apply, we should instead apply Proposition 2 to determine the existence of autoregulation, since Proposition 2 does not require the expression to be autonomous and is thus much more reliable, although it may fail if the one-step model does not hold. Proposition 2 applies when the gene is not in a feedback loop and has $\text {VMR}<1$. Notice that our result cannot determine that a gene has no autoregulation.

For a given gene without autoregulation, its expression level follows a Poisson distribution, and its VMR is 1. If we have n samples of its expression level, then the sample VMR (sample variance divided by sample mean) asymptotically follows a Gamma distribution $\Gamma [(n-1)/2,2/(n-1)]$ (shape $(n-1)/2$ and scale $2/(n-1)$, hence mean 1), and we can determine the confidence interval of the sample VMR (Eden and Kramer 2010). If the sample VMR is outside this confidence interval, then we know that the VMR is significantly different from 1, and Propositions 1, 2 might apply.
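The confidence interval described above can be computed with the standard library alone. The following is a minimal sketch, assuming that $\Gamma [(n-1)/2,2/(n-1)]$ is parameterized by shape and scale and that an equal-tailed interval is wanted; the quantiles are obtained by bisection on a power-series evaluation of the Gamma distribution function, which is adequate for moderate n. All function names are hypothetical.

```python
import math
import statistics

def gamma_cdf(x, shape, scale):
    """Gamma distribution function via the power series of the incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = total = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= z / (shape + k)
        total += term
    return math.exp(shape * math.log(z) - z - math.lgamma(shape + 1)) * total

def gamma_quantile(p, shape, scale):
    """Invert gamma_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def vmr_confidence_interval(n, level=0.95):
    """Equal-tailed interval for the sample VMR of n Poisson samples."""
    shape, scale = (n - 1) / 2, 2 / (n - 1)
    alpha = 1 - level
    return (gamma_quantile(alpha / 2, shape, scale),
            gamma_quantile(1 - alpha / 2, shape, scale))

def sample_vmr(samples):
    """Sample variance (n - 1 denominator) divided by sample mean."""
    return statistics.variance(samples) / statistics.mean(samples)

lower, upper = vmr_confidence_interval(100)
print(lower, upper)
```

A sample VMR below `lower` would then suggest $\text {VMR}<1$, and one above `upper` would suggest $\text {VMR}>1$, at the chosen level.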

We apply our method to four groups of single-cell non-interventional one-time gene expression data from experiments, where the corresponding GRNs are known. Notice that we need to convert indirect measurements into protein/mRNA count. See Table 1 for our inference results and theoretical/experimental evidence that partially validates our results. See Appendix C for details. There are 186 genes in these four data sets, and we can only determine that 12 genes have autoregulation (7 genes determined by Proposition 1, and 5 genes determined by Proposition 2). Not every VMR is less than 1, so that Conjecture 2 does not apply. For the other 174 genes, (1) some of them are not contained in the known GRN, and we cannot determine if they are in directed cycles; (2) some of them are in directed cycles; (3) some of them have ancestors, and we cannot reject the hypothesis that $\text {VMR}\ge 1$; (4) some of them have no ancestors, and we cannot reject the hypothesis that $\text {VMR}= 1$. Therefore, Proposition 1 and Proposition 2 do not apply, and we do not know whether they have autoregulation.

In some cases, we have experimental evidence that some genes have autoregulation, so that we can partially validate our inference results. Nevertheless, as discussed in the Introduction, there is no gold standard to evaluate our inference results. Besides, Proposition 2 requires that the one-step model holds, which we cannot verify.

In the data set by Guo et al. (2010), Sanchez-Castillo et al. (2018) inferred that 17 of 39 genes have autoregulation, and 22 genes do not have autoregulation. We infer that 5 genes have autoregulation, and 34 genes cannot be determined. Here 3 genes are inferred to have autoregulation by both methods. Consider a random classifier that randomly picks 5 genes and claims they have autoregulation. Using Sanchez-Castillo et al. as the standard, this random classifier has probability $62.55\%$ of being worse than our result, and probability $10.17\%$ of being better than our result. Thus our inference result is better than a random classifier, but the advantage is not substantial.
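The two probabilities can be reproduced with a hypergeometric count. The following is a minimal sketch, assuming that "worse" means fewer than 3 of the 5 randomly picked genes are among the 17 genes identified by Sanchez-Castillo et al., and "better" means more than 3; the function name `hit_distribution` is hypothetical.

```python
from math import comb

def hit_distribution(total=39, positives=17, picked=5):
    """Probability of k hits, k = 0..picked, when picking genes uniformly at random."""
    denom = comb(total, picked)
    return [comb(positives, k) * comb(total - positives, picked - k) / denom
            for k in range(picked + 1)]

probs = hit_distribution()
p_worse = sum(probs[:3])     # 0, 1, or 2 hits
p_better = sum(probs[4:])    # 4 or 5 hits
print(round(100 * p_worse, 2), round(100 * p_better, 2))
```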

Table 1 The autoregulation inference results by our method on four data sets


7 Conclusions

For a single gene that is not affected by other genes, or a group of genes that form a connected GRN, we develop rigorous theoretical results (without applying approximations) to determine the existence of autoregulation. These results generalize known relationships between autoregulation and VMR by dropping restrictions on parameters. Our results only depend on VMR, which is easy to compute and more robust than other complicated statistics. We also apply our method to experimental data and detect some genes that might have autoregulation.

Our method requires independent and identically distributed samples from the exact stationary distribution of a fully observed Markov chain, plus a known GRN. Proposition 1 requires that the expression is autonomous. Proposition 2 requires that the Markov chain model is one-step, the GRN has no directed cycle, and degradation is not regulated. If our inference fails, then some requirements are not met: (1) cells might affect each other, making the samples dependent; (2) cells are heterogeneous; (3) the measurements have extra errors; (4) the cells are not at stationarity; (5) there exist unobserved variables that affect gene expression; (6) the GRN is inferred by a theoretical method, which can be confounded by the existence of autoregulation; (7) the expression is non-autonomous; (8) the Markov chain is multi-step; (9) the GRN has unknown directed cycles; (10) the degradation rate is regulated by other genes. Such situations, especially the unobserved variables, are unavoidable. Therefore, current data might not satisfy these requirements, and our inference results should be interpreted as informative findings, not ground truths.

There are some known methods that overcome the above obstacles, and there are also some possible solutions that might appear in the future. (1) The dependency can be resolved by better measurements for isolated cells that do not affect each other. In fact, the relationship between autoregulation and cell-cell interaction has been studied (Levenberg et al. 1998). (2) About cell heterogeneity, we prove a result in Appendix D that if several cell types have $\text {VMR}\ge 1$, then for a mixed population of such cell types, we still have $\text {VMR}\ge 1$. Therefore, cell heterogeneity does not invalidate Proposition 2, since $\text {VMR}<1$ for the mixture of several cell types means $\text {VMR}<1$ for at least one cell type. (3) With the development of experimental technologies, we expect that the measurement error can decrease. (4) Some works study autoregulation in non-stationary situations (Cao and Grima 2020; Swain et al. 2002; Skinner et al. 2016; Jia and Grima 2021). (5) Since hidden variables hurt any mechanism-based models, we can develop methods (especially with machine learning tools) that determine autoregulation based on similarities between gene expression profiles (Wang et al. 2021; Yang et al. 2020; Wang 2022). (6) Some GRN inference methods can also determine the existence of autoregulation (Sanchez-Castillo et al. 2018). (7) Many methods (including our Proposition 2) work in non-autonomous situations. (8) Some works study multi-step models (Braichenko et al. 2021; Karmakar and Das 2021; Voliotis et al. 2008). (9) We expect the appearance of more advanced GRN inference methods. (10) If probabilists can prove Conjecture 1, then the restriction on degradation rate can be lifted.

In fact, other theoretical works that determine gene autoregulation, or general gene regulation, also need various assumptions and might fail. Nevertheless, with the development of experimental technologies and theoretical results, we believe that some obstacles will be lifted, and our method will be more applicable in the future. Besides, our method can be further developed and combined with other methods.

Availability of data and materials

All data are available in https://github.com/YueWangMathbio/Autoregulation.

Code availability

All code files are available in https://github.com/YueWangMathbio/Autoregulation.

References

Angelini E, Wang Y, Zhou JX, Qian H, Huang S (2022) A model for the intrinsic limit of cancer therapy: duality of treatment-induced cell death and treatment-induced stemness. PLoS Comput Biol 18(7):e1010319
Barros R, da Costa LT, Pinto-de Sousa J, Duluc I, Freund JN, David L et al (2011) CDX2 autoregulation in human intestinal metaplasia of the stomach: impact on the stability of the phenotype. Gut 60(3):290–298
Baudrimont A, Jaquet V, Wallerich S, Voegeli S, Becskei A (2019) Contribution of RNA degradation to intrinsic and extrinsic noise in gene expression. Cell Rep 26(13):3752–3761
Baumdick M, Gelléri M, Uttamapinant C, Beránek V, Chin JW, Bastiaens PI (2018) A conformational sensor based on genetic code expansion reveals an autocatalytic component in EGFR activation. Nat Commun 9(1):1–13
Bokes P, King JR, Wood AT, Loose M (2012) Multiscale stochastic modelling of gene expression. J Math Biol 65(3):493–520
Bouuaert CC, Lipkow K, Andrews SS, Liu D, Chalmers R (2013) The autoregulation of a eukaryotic DNA transposon. eLife 2:e00668
Braichenko S, Holehouse J, Grima R (2021) Distinguishing between models of mammalian gene expression: telegraph-like models versus mechanistic models. J R Soc Interface 18(183):20210510
Cagnetta R, Wong HHW, Frese CK, Mallucci GR, Krijgsveld J, Holt CE (2019) Noncanonical modulation of the eIF2 pathway controls an increase in local translation during neural wiring. Mol Cell 73(3):474–489
Cao Z, Grima R (2018) Linear mapping approximation of gene regulatory networks with stochastic dynamics. Nat Commun 9(1):1–15
Cao Z, Grima R (2020) Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proc Natl Acad Sci USA 117(9):4682–4692
Carrier TA, Keasling JD (1999) Investigating autocatalytic gene expression systems through mechanistic modeling. J Theor Biol 201(1):25–36
Chahar S, Gandhi V, Yu S, Desai K, Cowper-Sal-lari R, Kim Y et al (2014) Chromatin profiling reveals regulatory network shifts and a protective role for hepatocyte nuclear factor 4$\alpha $ during colitis. Mol Cell Biol 34(17):3291–3304
Chan TE, Stumpf MP, Babtie AC (2017) Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst 5(3):251–267
Chen X, Jia C (2020) Limit theorems for generalized density-dependent Markov chains and bursty stochastic gene regulatory networks. J Math Biol 80(4):959–994
Chen X, Wang Y, Feng T, Yi M, Zhang X, Zhou D (2016) The overshoot and phenotypic equilibrium in characterizing cancer dynamics of reversible phenotypic plasticity. J Theor Biol 390:40–49
Cunningham TJ, Duester G (2015) Mechanisms of retinoic acid signalling and its roles in organ and limb development. Nat Rev Mol Cell Biol 16(2):110–123
Czuppon P, Pfaffelhuber P (2018) Limits of noise for autoregulated gene expression. J Math Biol 77(4):1153–1191
Dessalles R, Fromion V, Robert P (2017) A stochastic analysis of autoregulation of gene expression. J Math Biol 75(5):1253–1283
Dobrinić P, Szczurek AT, Klose RJ (2021) PRC1 drives Polycomb-mediated gene repression by controlling transcription initiation and burst frequency. Nat Struct Mol Biol 28(10):811–824
Eden UT, Kramer MA (2010) Drawing inferences from Fano factor calculations. J Neurosci Methods 190(1):149–152
Fang J, Ianni A, Smolka C, Vakhrusheva O, Nolte H, Krüger M et al (2017) Sirt7 promotes adipogenesis in the mouse by inhibiting autocatalytic activation of Sirt1. Proc Natl Acad Sci USA 114(40):E8352–E8361
Feigelman J, Ganscha S, Hastreiter S, Schwarzfischer M, Filipczyk A, Schroeder T et al (2016) Analysis of cell lineage trees by exact Bayesian inference identifies negative autoregulation of Nanog in mouse embryonic stem cells. Cell Syst 3(5):480–490
Firman T, Wedekind S, McMorrow T, Ghosh K (2018) Maximum caliber can characterize genetic switches with multiple hidden species. J Phys Chem B 122(21):5666–5677
Giovanini G, Sabino AU, Barros LR, Ramos AF (2020) A comparative analysis of noise properties of stochastic binary models for a self-repressing and for an externally regulating gene. Math Biosci Eng 17(5):5477–5503
Grönlund A, Lötstedt P, Elf J (2013) Transcription factor binding kinetics constrain noise suppression via negative feedback. Nat Commun 4(1):1–5
Guo G, Huss M, Tong GQ, Wang C, Sun LL, Clarke ND et al (2010) Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst. Dev Cell 18(4):675–685
Ham L, Brackston RD, Stumpf MP (2020) Extrinsic noise and heavy-tailed laws in gene expression. Phys Rev Lett 124(10):108101
Hara T, Abe M, Inoue H, Yu L, Veenstra TD, Kang Y et al (2006) Cytokinesis regulator ECT2 changes its conformation through phosphorylation at Thr-341 in G2/M phase. Oncogene 25(4):566–578
Hilfinger A, Norman TM, Vinnicombe G, Paulsson J (2016) Constraints on fluctuations in sparsely characterized biological systems. Phys Rev Lett 116(5):058101
Hornos JE, Schultz D, Innocentini GC, Wang J, Walczak AM, Onuchic JN et al (2005) Self-regulating gene: an exact solution. Phys Rev E 72(5):051907
Hui Z, Jiang Z, Qiao D, Bo Z, Qiyuan K, Shaohua B et al (2020) Increased expression of LCN2 formed a positive feedback loop with activation of the ERK pathway in human kidney cells during kidney stone formation. Sci Rep 10(1):1–12
Jia C (2017) Simplification of Markov chains with infinite state space and the mathematical theory of random gene expression bursts. Phys Rev E 96(3):032402
Jia C (2020) Kinetic foundation of the zero-inflated negative binomial model for single-cell RNA sequencing data. SIAM J Appl Math 80(3):1336–1355
Jia C, Grima R (2020a) Small protein number effects in stochastic models of autoregulated bursty gene expression. J Chem Phys 152(8):084115
Jia C, Grima R (2020b) Dynamical phase diagram of an auto-regulating gene in fast switching conditions. J Chem Phys 152(17):174110
Jia C, Grima R (2021) Frequency domain analysis of fluctuations of mRNA and protein copy numbers within a cell lineage: theory and experimental validation. Phys Rev X 11(2):021032
Jia C, Zhang MQ, Qian H (2017a) Emergent Lévy behavior in single-cell stochastic gene expression. Phys Rev E 96(4):040402
Jia C, Xie P, Chen M, Zhang MQ (2017b) Stochastic fluctuations can reveal the feedback signs of gene regulatory networks at the single-molecule level. Sci Rep 7(1):1–9
Jia C, Qian H, Chen M, Zhang MQ (2018) Relaxation rates of gene expression kinetics reveal the feedback signs of autoregulatory gene networks. J Chem Phys 148(9):095102
Jiang DQ, Wang Y, Zhou D (2017) Phenotypic equilibrium as probabilistic convergence in multi-phenotype cell population dynamics. PLoS ONE 12(2):e0170916
Kang Y, Gu C, Yuan L, Wang Y, Zhu Y, Li X et al (2014) Flexibility and symmetry of prokaryotic genome rearrangement reveal lineage-associated core-gene-defined genome organizational frameworks. MBio 5:e01867
Karamyshev AL, Karamysheva ZN (2018) Lost in translation: ribosome-associated mRNA and protein quality controls. Front Genet 9:431
Karmakar R, Das AK (2021) Effect of transcription reinitiation in stochastic gene expression. J Stat Mech Theory Exp 2021(3):033502
Kidder BL, Palmer S (2010) Examination of transcriptional networks reveals an important role for TCFAP2C, SMARCA4, and EOMES in trophoblast stem cell maintenance. Genome Res 20(4):458–472
Ko Y, Kim J, Rodriguez-Zas SL (2019) Markov chain Monte Carlo simulation of a Bayesian mixture model for gene network inference. Genes Genom 41(5):547–555
Levenberg S, Katz BZ, Yamada KM, Geiger B (1998) Long-range and selective autoregulation of cell-cell or cell-matrix adhesions by cadherin or integrin ligands. J Cell Sci 111(3):347–357
Lin J, Amir A (2021) Disentangling intrinsic and extrinsic gene expression noise in growing cells. Phys Rev Lett 126(7):078101
Luecken MD, Theis FJ (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 15(6):e8746
Mahajan T, Singh A, Dar R (2021) Topological constraints on noise propagation in gene regulatory networks. bioRxiv
Moignard V, Woodhouse S, Haghverdi L, Lilly AJ, Tanaka Y, Wilkinson AC et al (2015) Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat Biotechnol 33(3):269–276
Munsky B, Neuert G, Van Oudenaarden A (2012) Using gene expression noise to understand gene regulation. Science 336(6078):183–187
Niu Y, Wang Y, Zhou D (2015) The phenotypic equilibrium of cancer cells: From average-level stability to path-wise convergence. J Theor Biol 386:7–17
Norris JR (1998) Markov chains. Cambridge University Press, Cambridge
Paulsson J (2005) Models of stochastic gene expression. Phys Life Rev 2(2):157–175
Pramono A, Zahabi A, Morishima T, Lan D, Welte K, Skokowa J (2016) Thrombopoietin induces hematopoiesis from mouse ES cells via HIF-1$\alpha$-dependent activation of a BMP4 autoregulatory loop. Ann N Y Acad Sci 1375(1):38–51
Psaila B, Barkas N, Iskander D, Roy A, Anderson S, Ashley N et al (2016) Single-cell profiling of human megakaryocyte-erythroid progenitors identifies distinct megakaryocyte and erythroid differentiation pathways. Genome Biol 17(1):1–19
Ramos AF, Hornos JEM, Reinitz J (2015) Gene regulation and noise reduction by coupling of stochastic processes. Phys Rev E 91(2):020701
Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721):523–529
Sanchez-Castillo M, Blanco D, Tienda-Luna IM, Carrion M, Huang Y (2018) A Bayesian framework for the inference of gene regulatory networks from time and pseudo-time series data. Bioinformatics 34(6):964–970
Shahrezaei V, Swain PS (2008) Analytical distributions for stochastic gene expression. Proc Natl Acad Sci USA 105(45):17256–17261
Sharma A, Adlakha N (2014) Markov chain model to study the gene expression. Adv Appl Sci Res 5(2):387–393
Shen H, Huo S, Yan H, Park JH, Sreeram V (2019) Distributed dissipative state estimation for Markov jump genetic regulatory networks subject to round-robin scheduling. IEEE Trans Neural Netw Learn Syst 31(3):762–771
Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68
Sheth R, Bastida MF, Kmita M, Ros M (2014) “Self-regulation,” a new facet of Hox genes’ function. Dev Dyn 243(1):182–191
Shmulevich I, Gluhovsky I, Hashimoto RF, Dougherty ER, Zhang W (2003) Steady-state analysis of genetic regulatory networks modelled by probabilistic Boolean networks. Comp Funct Genom 4(6):601–608
Skinner SO, Xu H, Nagarkar-Jaiswal S, Freire PR, Zwaka TP, Golding I (2016) Single-cell analysis of transcription kinetics across the cell cycle. Elife 5:e12175
Stewart AJ, Seymour RM, Pomiankowski A, Reuter M (2013) Under-dominance constrains the evolution of negative autoregulation in diploids. PLoS Comput Biol 9(3):e1002992
Swain PS (2004) Efficient attenuation of stochasticity in gene expression through post-transcriptional control. J Mol Biol 344(4):965–976
Swain PS, Elowitz MB, Siggia ED (2002) Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci USA 99(20):12795–12800
Thattai M, Van Oudenaarden A (2001) Intrinsic noise in gene regulatory networks. Proc Natl Acad Sci USA 98(15):8614–8619
Thomas P (2019) Intrinsic and extrinsic noise of gene expression in lineage trees. Sci Rep 9(1):1–16
Thomas P, Matuschek H, Grima R (2013) How reliable is the linear noise approximation of gene regulatory networks? BMC Genom 14(4):1–15
Veerman F, Popović N, Marr C (2021) Parameter inference with analytical propagators for stochastic models of autoregulated gene expression. Int J Nonlinear Sci Numer Simul 23:565–577
Voliotis M, Cohen N, Molina-París C, Liverpool TB (2008) Fluctuations, pauses, and backtracking in DNA transcription. Biophys J 94(2):334–348
Wang Y (2018) Some problems in stochastic dynamics and statistical analysis of single-cell biology of cancer. Ph.D. thesis, University of Washington
Wang Y (2022) Two metrics on rooted unordered trees with labels. Algorithms Mol Biol 17(1):1–17
Wang Y, Qian H (2020) Mathematical representation of Clausius’ and Kelvin’s statements of the second law and irreversibility. J Stat Phys 179(3):808–837
Wang Y, Wang L (2020) Causal inference in degenerate systems: an impossibility result. In: International conference on artificial intelligence and statistics, PMLR, pp 3383–3392
Wang Y, Wang Z (2022) Inference on the structure of gene regulatory networks. J Theor Biol 539:111055
Wang DG, Wang S, Huang B, Liu F (2019) Roles of cellular heterogeneity, intrinsic and extrinsic noise in variability of p53 oscillation. Sci Rep 9(1):1–11
Wang Y, Minarsky A, Penner R, Soulé C, Morozova N (2020a) Model of morphogenesis. J Comput Biol 27(9):1373–1383
Wang Y, Kropp J, Morozova N (2020b) Biological notion of positional information/value in morphogenesis theory. Int J Dev Biol 64:453–463
Wang Y, Zhang B, Kropp J, Morozova N (2021) Inference on tissue transplantation experiments. J Theor Biol 520:110645
Wang Y, Mistry BA, Chou T (2022) Discrete stochastic models of SELEX: aptamer capture probabilities and protocol optimization. J Chem Phys 156(24):244103
Werhli AV, Grzegorczyk M, Husmeier D (2006) Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics 22(20):2523–2531
Xing B, Van Der Laan MJ (2005) A causal inference approach for constructing transcriptional regulatory networks. Bioinformatics 21(21):4007–4013
Yan J, Hilfinger A, Vinnicombe G, Paulsson J et al (2019) Kinetic uncertainty relations for the control of stochastic reaction networks. Phys Rev Lett 123(10):108101
Yang W, Peng L, Zhu Y, Hong L (2020) When machine learning meets multiscale modeling in chemical reactions. J Chem Phys 153(9):094117
Ye FXF, Wang Y, Qian H (2016) Stochastic dynamics: Markov chains and random transformations. Discrete Contin Dyn Syst B 21(7):2337
Zhou T, Zhang J (2012) Analytical results for a multistate gene model. SIAM J Appl Math 72(3):789–818
Zhou D, Wang Y, Wu B (2014) A multi-phenotypic cancer model with cell plasticity. J Theor Biol 357:35–45

Acknowledgements

Y.W. would like to thank Jiawei Yan for fruitful discussions, and Xiangting Li, Zikun Wang, Mingtao Xia for helpful comments.

Funding

This research was partially supported by NIH Grant R01HL146552 (Y.W.).

Author information

Authors and Affiliations

Department of Computational Medicine, University of California, Los Angeles, CA, 90095, USA
Yue Wang
Institut des Hautes Études Scientifiques (IHÉS), Bures-sur-Yvette, 91440, Essonne, France
Yue Wang
Simons Center for Geometry and Physics, Stony Brook University, Stony Brook, NY, 11794, USA
Siqi He

Authors

Yue Wang
Siqi He

Contributions

Conceptualization: YW; Methodology: YW, SH; Formal analysis and investigation: YW, SH; Writing—original draft preparation: YW; Writing—review and editing: YW; Resources: YW.

Corresponding author

Correspondence to Yue Wang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Simulation results

A.1 Test Proposition 1 without autoregulation

Simulation 1: Consider the Markov chain in Example 1 with $k=1,b=0,c=1$. The stationary distribution is Poissonian with parameter 1. This process has no autoregulation, and the true $\text {VMR}=1$. For sample sizes $n=100$, $n=1000$, and $n=10{,}000$, we repeat the experiment 10,000 times and calculate the rate at which the sample VMR falls in the $95\%$ confidence interval. See Table 2 for results. Since the confidence level is $95\%$, Proposition 1 produces the correct result that $\text {VMR}=1$ (no autoregulation) with probability about $95\%$.
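The experiment in Simulation 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the interval is taken to be the chi-square dispersion interval for Poisson data, its quantiles are computed with the Wilson-Hilferty approximation so that only the standard library is needed, and the names `vmr_interval`, `sample_poisson`, and `coverage` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def vmr_interval(n, level=0.95):
    """Approximate interval for the sample VMR of n Poisson draws.

    Assumption: (n - 1) * VMR is treated as chi-square with n - 1 degrees
    of freedom; the quantiles use the Wilson-Hilferty approximation.
    """
    df = n - 1
    z = NormalDist().inv_cdf(0.5 + level / 2)
    a = 2.0 / (9.0 * df)
    lower = (1.0 - a - z * math.sqrt(a)) ** 3  # chi-square quantile / df
    upper = (1.0 - a + z * math.sqrt(a)) ** 3
    return lower, upper


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def coverage(n, repeats, lam=1.0, seed=0):
    """Fraction of repeats whose sample VMR lies inside the interval."""
    rng = random.Random(seed)
    lower, upper = vmr_interval(n)
    hits = 0
    for _ in range(repeats):
        xs = [sample_poisson(lam, rng) for _ in range(n)]
        mean = sum(xs) / n
        if mean == 0:
            continue  # VMR is undefined when every draw is zero
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)
        if lower <= var / mean <= upper:
            hits += 1
    return hits / repeats


if __name__ == "__main__":
    print(coverage(n=100, repeats=2000))
```

With the Poisson stationary distribution of Simulation 1, the returned fraction should be close to the nominal $95\%$ level, up to Monte Carlo error and the error of the chi-square approximation.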

Table 2 Success rates for determining VMR in Simulation 1

A.2 Test Proposition 1 with autoregulation

Simulation 2: Consider the Markov chain in Example 1 with $k=1,b=1,c=2$. The stationary distribution is geometric with parameter 0.5. This process has positive autoregulation, and the true $\text {VMR}=2$. For sample sizes $n=100$, $n=1000$, and $n=10{,}000$, we repeat the experiment 10,000 times and calculate the rate at which the sample VMR falls in the $95\%$ confidence interval. See Table 3 for results. When $n$ is not too small, Proposition 1 always produces the correct result that $\text {VMR}>1$ (positive autoregulation).

Table 3 Success rates for determining VMR in Simulation 2

A.3 Test Proposition 2 without autoregulation

Simulation 3: Consider a Markov chain (G, M) that satisfies Eq. 3. G can take values 0 and 1, and M can take values in ${\mathbb {Z}}$. G does not depend on M, and the transition rates for $G=0 \rightarrow G=1$ and $G=1\rightarrow G=0$ are both $10^{-10}$. Restricted to $G=0$, M is the same as Example 1 with $k=1,b=0,c=1$. Restricted to $G=1$, M is the same as Example 1 with $k=2,b=0,c=1$. The stationary distribution is the average of two Poisson distributions with parameters 1 and 2. This process has no autoregulation, and the true $\text {VMR}=1.167$. For sample sizes $n=100$, $n=1000$, and $n=10{,}000$, we repeat the experiment 10,000 times and calculate the rate at which the sample VMR falls in the $95\%$ confidence interval. See Table 4 for results. As $n$ increases, we are very likely to obtain the correct result that $\text {VMR}>1$, but Proposition 2 cannot determine whether autoregulation exists.

Table 4 Success rates for determining VMR in Simulation 3

A.4 Test Proposition 2 with negative autoregulation

Simulation 4: Consider a Markov chain (G, M) that satisfies Eq. 3. G can take values 0 and 1, and M can take values in ${\mathbb {Z}}$. G does not depend on M, and the transition rates for $G=0 \rightarrow G=1$ and $G=1\rightarrow G=0$ are both $10^{-10}$. Restricted to $G=0$, M is the same as Example 1 with $k=10,b=-1,c=1$. Restricted to $G=1$, M is the same as Example 1 with $k=10,b=-1,c=2$. This process has (negative) autoregulation, and the true $\text {VMR}=0.733$. For sample sizes $n=100$, $n=1000$, and $n=10{,}000$, we repeat the experiment 10,000 times and calculate the rate at which the sample VMR falls in the $95\%$ confidence interval. See Table 5 for results. When $n$ is not too small, Proposition 2 always produces the correct result that $\text {VMR}<1$ (autoregulation).

Table 5 Success rates for determining VMR in Simulation 4

A.5 Test Proposition 2 with positive autoregulation

Simulation 5: Consider a Markov chain (G, M) that satisfies Eq. 3. G can take values 0 and 1, and M can take values in ${\mathbb {Z}}$. G does not depend on M, and the transition rates for $G=0 \rightarrow G=1$ and $G=1\rightarrow G=0$ are both $10^{-10}$. Restricted to $G=0$, M is the same as Example 1 with $k=10,b=1,c=4$. Restricted to $G=1$, M is the same as Example 1 with $k=10,b=1,c=5$. This process has (positive) autoregulation, and the true $\text {VMR}=1.357$. For sample sizes $n=100$, $n=1000$, and $n=10{,}000$, we repeat the experiment 10,000 times and calculate the rate at which the sample VMR falls in the $95\%$ confidence interval. See Table 6 for results. When $n$ is not too small, we can always obtain the correct result that $\text {VMR}>1$, but Proposition 2 cannot determine whether autoregulation exists.
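The true VMR values quoted for Simulations 3 to 5 can be recomputed from the closed-form mean and variance of Example 1 (Appendix B.1) combined with the variance of a two-component mixture. The sketch below assumes that G spends half of its time in each state, since the two switching rates are equal, and that switching is slow enough for the stationary law of M to be the equal-weight mixture of the two restricted chains. The function names are illustrative.

```python
def example1_moments(k, b, c):
    """Stationary mean and variance of Example 1, using the formulas of
    Appendix B.1: E X = k / (c - b), Var X = c * k / (c - b)**2."""
    mean = k / (c - b)
    var = c * k / (c - b) ** 2
    return mean, var


def mixture_vmr(p, comp0, comp1):
    """VMR of a variable equal to component 0 with probability p and to
    component 1 with probability 1 - p; each component is (mean, var)."""
    m0, v0 = comp0
    m1, v1 = comp1
    mean = p * m0 + (1 - p) * m1
    # Law of total variance: within-component part plus between-component part.
    var = p * v0 + (1 - p) * v1 + p * (1 - p) * (m0 - m1) ** 2
    return var / mean


# (k, b, c) for G = 0 and G = 1 in each simulation, as given in the text.
simulations = {
    "Simulation 3": ((1, 0, 1), (2, 0, 1)),
    "Simulation 4": ((10, -1, 1), (10, -1, 2)),
    "Simulation 5": ((10, 1, 4), (10, 1, 5)),
}

true_vmr = {
    name: mixture_vmr(0.5, example1_moments(*g0), example1_moments(*g1))
    for name, (g0, g1) in simulations.items()
}

if __name__ == "__main__":
    for name, value in true_vmr.items():
        print(f"{name}: VMR = {value:.3f}")
```

The three values agree with the quoted 1.167, 0.733, and 1.357 to three decimal places.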

Table 6 Success rates for determining VMR in Simulation 5

Appendix B Details of examples

B.1 Details of Example 1

In Example 1, the stationary distribution $P_n$ exists, and satisfies

$$\begin{aligned} {[}b(n-1)+k]P_{n-1}=cnP_n. \end{aligned}$$

(B1)

Taking summation for Eq. (B1), we have

$$\begin{aligned} \begin{aligned}&\sum _{n=1}^{\infty }[b(n-1)+k]P_{n-1}=\sum _{n=1}^{\infty }cnP_n,\\&\quad \Rightarrow b\sum _{n=0}^{\infty }nP_n+k\sum _{n=0}^{\infty }P_n=c\sum _{n=0}^{\infty }nP_n,\\&\quad \Rightarrow b{\mathbb {E}}X+k=c{\mathbb {E}}X. \end{aligned} \end{aligned}$$

Thus ${\mathbb {E}}X=k/(c-b)$.

Also, multiplying Eq. (B1) by $n-1$ and taking summation, we have

$$\begin{aligned} \begin{aligned}&\sum _{n=1}^{\infty }[b(n-1)^2+k(n-1)]P_{n-1}=\sum _{n=1}^{\infty }cn^2P_n-\sum _{n=1}^{\infty }cnP_n,\\&\quad \Rightarrow b\sum _{n=0}^{\infty }n^2P_n+k\sum _{n=0}^{\infty }nP_n=c\sum _{n=0}^{\infty }n^2P_n-c\sum _{n=0}^{\infty }nP_n,\\&\quad \Rightarrow b{\mathbb {E}}(X^2)+k{\mathbb {E}}X=c{\mathbb {E}}(X^2)-c{\mathbb {E}}X. \end{aligned} \end{aligned}$$

Thus ${\mathbb {E}}(X^2)=(c+k)k/(c-b)^2$, and the variance $\sigma ^2(X)=ck/(c-b)^2$. Then $\text {VMR}(X)=1+b/(c-b)$.
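As a numerical cross-check of these formulas, the stationary distribution can be built directly from the recursion in Eq. (B1) and its moments compared with $k/(c-b)$ and $1+b/(c-b)$. This is a minimal sketch: the truncation level `n_max` and the function name are assumptions, and the truncation is harmless only when the tail of $P_n$ is negligible beyond `n_max`.

```python
def stationary_mean_vmr(k, b, c, n_max=400):
    """Mean and VMR of the stationary distribution defined by Eq. (B1):
    [b(n-1) + k] P_{n-1} = c n P_n, truncated at n_max."""
    weights = [1.0]  # unnormalized P_0
    for n in range(1, n_max + 1):
        birth = b * (n - 1) + k  # rate of the jump n-1 -> n
        if birth <= 0:
            break  # states above n-1 are unreachable (possible when b < 0)
        weights.append(weights[-1] * birth / (c * n))
    total = sum(weights)
    mean = sum(n * w for n, w in enumerate(weights)) / total
    second = sum(n * n * w for n, w in enumerate(weights)) / total
    return mean, (second - mean ** 2) / mean


if __name__ == "__main__":
    for k, b, c in [(1, 0, 1), (1, 1, 2), (10, -1, 1)]:
        mean, vmr = stationary_mean_vmr(k, b, c)
        print((k, b, c), mean, k / (c - b), vmr, 1 + b / (c - b))
```

The parameter triples are those used in Simulations 1, 2, and 4; in each case the numerical mean and VMR should match the closed forms up to floating-point error.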

B.2 Details of Example 2

In Example 2, the process is restricted to two lines: $n_1=2$ and $n_2=2$. Since this process has no cycle, it is detailed balanced (Wang and Qian 2020), and the stationary distribution $P(n_1,n_2)$ satisfies

$$\begin{aligned} P(k,2)=(k+1)P(k+1,2) \end{aligned}$$

and

$$\begin{aligned} P(2,k)=(k+1)P(2,k+1). \end{aligned}$$

Restricted to $n_1=2$ or $n_2=2$, the stationary distribution is Poissonian, $P(k,2)=c/(ek!)$. After normalization, we find that $c=2e/(4e-1)$. Thus

$$\begin{aligned} P(k,2)=P(2,k)=\frac{2}{k!(4e-1)}. \end{aligned}$$

Besides,

$$\begin{aligned} P(2,\cdot )=\sum _{k=0}^\infty P(2,k)=\sum _{k=0}^\infty \frac{2}{k!(4e-1)}=\frac{2e}{4e-1}. \end{aligned}$$

Then for the first gene X, we have

$$\begin{aligned} \begin{aligned} {\mathbb {E}}X&=\sum _{k\ne 2} kP(k,2)+2P(2,\cdot )=\sum _{k=0}^\infty kP(k,2)+2[P(2,\cdot )-P(2,2)] \\&= \sum _{k=0}^\infty \frac{2k}{k!(4e-1)}+\frac{4e-2}{4e-1}=\frac{6e-2}{4e-1}, \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} {\mathbb {E}}(X^2)&=\sum _{k\ne 2} k^2P(k,2)+4P(2,\cdot )=\sum _{k=0}^\infty k^2P(k,2)+4[P(2,\cdot )-P(2,2)] \\&= \sum _{k=0}^\infty \frac{2k(k-1)}{k!(4e-1)}+\sum _{k=0}^\infty \frac{2k}{k!(4e-1)}+\frac{8e-4}{4e-1}=\frac{12e-4}{4e-1}. \end{aligned} \end{aligned}$$

Thus

$$\begin{aligned} \text {VMR}(X)=\frac{{\mathbb {E}}(X^2)}{{\mathbb {E}}X}-{\mathbb {E}}X=2-\frac{6e-2}{4e-1}=\frac{2e}{4e-1}\approx 0.55. \end{aligned}$$

Due to symmetry, the other gene also has $\text {VMR}(Y)\approx 0.55$.
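The value $\text {VMR}(X)\approx 0.55$ can be checked numerically by enumerating the states on the two lines $n_1=2$ and $n_2=2$ with weights proportional to $1/k!$, as implied by the detailed-balance relations above. The sketch below is illustrative; the truncation level `k_max` and the function name are assumptions.

```python
import math


def example2_mean_vmr(k_max=60):
    """Mean and VMR of the first gene X = n_1 in Example 2."""
    states = {}
    for k in range(k_max + 1):
        weight = 1.0 / math.factorial(k)
        states[(k, 2)] = weight  # line n_2 = 2
        states[(2, k)] = weight  # line n_1 = 2; (2, 2) is shared, same weight
    total = sum(states.values())
    mean = sum(n1 * w for (n1, _), w in states.items()) / total
    second = sum(n1 * n1 * w for (n1, _), w in states.items()) / total
    return mean, second / mean - mean


if __name__ == "__main__":
    mean, vmr = example2_mean_vmr()
    print(mean, (6 * math.e - 2) / (4 * math.e - 1))  # numerical vs closed-form mean
    print(vmr, 2 * math.e / (4 * math.e - 1))         # numerical vs closed-form VMR
```

The numerical VMR equals $2e/(4e-1)$ up to floating-point error, which rounds to 0.55.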

Appendix C Details of applications on experimental data

In experiments, the expression levels of genes are not directly measured as mRNA or protein counts. Rather, they are measured as cycle threshold (Ct) values or fluorescence intensity values. Such indirect measurements need to be converted. Related details can be found in other papers (Jia et al. 2017b).

Guo et al. (2010) measured the expression (mRNA) levels of 48 genes for mouse embryo cells at different developmental stages. We consider the three groups (16-cell stage, 32-cell stage, 64-cell stage) that have more than 50 samples. Sanchez-Castillo et al. (2018) used these data to infer the GRN structure, including autoregulation, but the inferred GRN contains only 39 genes. For the other 9 genes, we cannot guarantee that they have no ancestors in the true GRN (to apply Proposition 1) or that they are not contained in directed cycles (to apply Proposition 2). Thus we ignore the 9 genes not in this GRN.

In the inferred GRN, genes BMP4, CREB312, and TCFAP2C are not contained in directed cycles. In the 16-cell stage group with 75 samples, if there is no autoregulation, then the $95\%$ confidence interval of VMR is [0.7041, 1.3470]. BMP4 ($\text {VMR}=0.2139$), CREB312 ($\text {VMR}=0.1971$), and TCFAP2C ($\text {VMR}=0.3468$) have VMR well below this interval, so we can apply Proposition 2 to infer that BMP4, CREB312, and TCFAP2C might have autoregulation. In the other two groups, these genes do not have $\text {VMR}<1$, so the results are relatively weak.

Besides, in the inferred GRN, genes FN1 and HNF4A have no ancestors. For the 16-cell stage with 75 samples, the VMRs of FN1 and HNF4A are 3.4522 and 1.3599, outside of the $95\%$ confidence interval [0.7041, 1.3470]; for the 32-cell stage with 113 samples, the VMRs of FN1 and HNF4A are 93.1070 and 46.7688, outside of the $95\%$ confidence interval [0.7554, 1.2784]; for the 64-cell stage with 159 samples, the VMRs of FN1 and HNF4A are 117.3059 and 93.9589, outside of the $95\%$ confidence interval [0.7917, 1.2322]. Thus we can apply Proposition 1 to infer that FN1 and HNF4A ($\text {VMR}>1$ for all three cell groups) might have positive autoregulation. Nevertheless, it is more likely that the expressions of FN1 and HNF4A are non-autonomous, and there is no autoregulation.

Sanchez-Castillo et al. (2018) inferred that BMP4, HNF4A, and TCFAP2C have autoregulation. Besides, there is experimental evidence that BMP4 (Pramono et al. 2016), HNF4A (Chahar et al. 2014), and TCFAP2C (Kidder and Palmer 2010) have autoregulation. Therefore, our inference results are partially validated.

Psaila et al. (2016) measured the expression (mRNA) levels of 90 genes for human megakaryocyte-erythroid progenitor cells. Chan et al. (2017) inferred the GRN structure (autoregulation not included). In the inferred GRN, genes BIM, CCND1, ECT2, PFKP have no ancestors. BIM has 214 effective samples, and VMR is 187.7, outside of the $95\%$ confidence interval [0.8191, 1.1987]. CCND1 has 68 effective samples, and VMR is 111.3, outside of the $95\%$ confidence interval [0.6905, 1.3660]. ECT2 has 56 effective samples, and VMR is 8.2, outside of the $95\%$ confidence interval [0.6618, 1.4069]. PFKP has 134 effective samples, and VMR is 82.1, outside of the $95\%$ confidence interval [0.7742, 1.2543]. Thus we can apply Proposition 1 to infer that BIM, CCND1, ECT2, PFKP might have positive autoregulation. Nevertheless, it is more likely that the expressions of these four genes are non-autonomous, and there is no autoregulation. There is experimental evidence that ECT2 has autoregulation (Hara et al. 2006), which partially validates our inference results. No other gene fits the requirement of Proposition 2.

Moignard et al. (2015) measured the expression (mRNA) levels of 46 genes for mouse embryo cells. Chan et al. (2017) inferred the GRN structure (autoregulation not included). Gene EIF2B1 has 3934 effective samples, and VMR is 0.66, outside of the $95\%$ confidence interval [0.9563, 1.0447]. Gene HOXD8 has 12 effective samples, and VMR is 0.24, outside of the $95\%$ confidence interval [0.3469, 1.9927]. We can apply Proposition 2 to infer that EIF2B1 and HOXD8 might have autoregulation. No other gene fits the requirement of Proposition 1.

Sachs et al. (2005) measured the expression (protein) levels of 11 genes in the RAF signaling pathway for human T cells. The measurements were repeated for 14 groups of cells under different interventions. Werhli et al. (2006) inferred the GRN structure (autoregulation not included). In the inferred GRN (Fig. 1), PIP3 gene has no ancestor, and its VMRs in all 14 groups are larger than 5, while the $95\%$ confidence intervals for all 14 groups are contained in [0.8, 1.2]. Therefore, we can apply Proposition 1 and infer that PIP3 might have positive autoregulation. Nevertheless, it is more likely that the expression of PIP3 is non-autonomous, and there is no autoregulation. No other gene fits the requirement of Proposition 2.
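This appendix does not restate how the $95\%$ confidence intervals are constructed. One standard construction that appears consistent with the quoted numbers is the chi-square interval for the index of dispersion of Poisson data; the following is a sketch under that assumption rather than a statement of the procedure actually used. For $n$ samples with sample variance-to-mean ratio $\widehat{\text {VMR}}$,

$$\begin{aligned} (n-1)\,\widehat{\text {VMR}}\ \overset{\text {approx.}}{\sim }\ \chi ^2_{n-1} \quad \Longrightarrow \quad \left[ \frac{\chi ^2_{0.025,\,n-1}}{n-1},\ \frac{\chi ^2_{0.975,\,n-1}}{n-1}\right] . \end{aligned}$$

For example, with $n=12$ the tabulated quantiles $\chi ^2_{0.025,\,11}\approx 3.816$ and $\chi ^2_{0.975,\,11}\approx 21.920$ give $[3.816/11,\ 21.920/11]\approx [0.3469, 1.9927]$, which matches the interval quoted above for the gene with 12 effective samples. The interval narrows as $n$ grows, which is why the gene with 3934 effective samples has the much tighter interval [0.9563, 1.0447].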

Appendix D Heterogeneity and VMR

Proposition 3

Consider n independent random variables $X_1,\ldots ,X_n$ and probabilities $p_1,\ldots ,p_n$ with $\sum p_i=1$. Consider an independent random variable R that equals i with probability $p_i$. Construct a random variable Z that equals $X_i$ when $R=i$. If each $X_i$ has $\text {VMR}\ge 1$, then Z has $\text {VMR}\ge 1$.

Proof

We only need to prove this for $n=2$. The case for general n can be proved by mixing two variables iteratively.

Consider random variables X, Y and construct Z that equals X or Y with probability p or $1-p$. Since $\text {VMR}(X)\ge 1$, $\text {VMR}(Y)\ge 1$, we have ${\mathbb {E}}(X^2)-[{\mathbb {E}}(X)]^2\ge {\mathbb {E}}(X)$ and ${\mathbb {E}}(Y^2)-[{\mathbb {E}}(Y)]^2\ge {\mathbb {E}}(Y)$. Then

$$\begin{aligned}&\text {VMR}(Z)\\&\quad =\frac{p{\mathbb {E}}(X^2)+(1-p){\mathbb {E}}(Y^2)}{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)}\\&\qquad +\frac{-p^2[{\mathbb {E}}(X)]^2-2p(1-p){\mathbb {E}}(X){\mathbb {E}}(Y)-(1-p)^2[{\mathbb {E}}(Y)]^2}{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)}\\&\quad =\frac{p{\mathbb {E}}(X^2)-p[{\mathbb {E}}(X)]^2+(1-p){\mathbb {E}}(Y^2)-(1-p)[{\mathbb {E}}(Y)]^2}{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)}\\&\qquad +\frac{p(1-p)[{\mathbb {E}}(X)]^2-2p(1-p){\mathbb {E}}(X){\mathbb {E}}(Y)+p(1-p)[{\mathbb {E}}(Y)]^2}{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)}\\&\quad \ge \frac{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)}{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)} +\frac{p(1-p)[{\mathbb {E}}(X)-{\mathbb {E}}(Y)]^2}{p{\mathbb {E}}(X)+(1-p){\mathbb {E}}(Y)}\\&\quad \ge 1. \end{aligned}$$

$\square $
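Proposition 3 can also be checked numerically, since the VMR of the mixture Z depends only on the mixing probabilities and the means and variances of the components. The sketch below draws random mixtures whose components all have $\text {VMR}\ge 1$ and records the smallest mixture VMR found. The sampling ranges, the seed, and the function names are arbitrary choices for illustration.

```python
import random


def mixture_vmr(probs, means, variances):
    """VMR of Z, where Z equals component i with probability probs[i]."""
    mean_z = sum(p * m for p, m in zip(probs, means))
    # E(Z^2) = sum_i p_i * E(X_i^2) = sum_i p_i * (Var X_i + (E X_i)^2)
    second_z = sum(p * (v + m * m) for p, m, v in zip(probs, means, variances))
    return (second_z - mean_z ** 2) / mean_z


def smallest_mixture_vmr(trials=10_000, seed=0):
    """Smallest VMR over random mixtures of components with VMR >= 1."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        n = rng.randint(2, 6)
        raw = [rng.random() + 1e-3 for _ in range(n)]
        probs = [r / sum(raw) for r in raw]
        means = [rng.uniform(0.1, 20.0) for _ in range(n)]
        # Variance >= mean, so every component has VMR >= 1.
        variances = [m * rng.uniform(1.0, 5.0) for m in means]
        smallest = min(smallest, mixture_vmr(probs, means, variances))
    return smallest


if __name__ == "__main__":
    print(smallest_mixture_vmr())
```

A random search of this kind cannot replace the proof, but a value below 1 (beyond rounding error) would contradict the proposition.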

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, Y., He, S. Inference on autoregulation in gene expression with variance-to-mean ratio. J. Math. Biol. 86, 87 (2023). https://doi.org/10.1007/s00285-023-01924-6

Received: 16 December 2022
Revised: 14 April 2023
Accepted: 18 April 2023
Published: 03 May 2023
DOI: https://doi.org/10.1007/s00285-023-01924-6
