Identifying atypically expressed chromosome regions using RNA-Seq data


The number of studies dealing with RNA-Seq data analysis has increased rapidly in recent years, making this type of gene expression data a strong competitor to DNA microarrays. This paper proposes a Bayesian model to detect lowly and highly expressed chromosome regions using RNA-Seq data. The methodology builds on recent work designed to detect highly expressed (overexpressed) regions in the context of microarray data. A hidden Markov model is developed by considering a mixture of Gaussian distributions with ordered means, such that the first and last mixture components accommodate the under- and overexpressed genes, respectively. The model is flexible enough to deal efficiently with the highly irregular spacing of the data by assuming a hierarchical Markov dependence structure. The analysis of four cancer data sets (breast, lung, ovarian and uterus) is presented. Results indicate that the proposed model is selective in determining the expression status, is robust with respect to prior specifications, and provides tools for a global or local search of under- and overexpressed chromosome regions.




References

  1. Albert JH (1992) Bayesian estimation of normal ogive item response curves using Gibbs sampling. J Educ Behav Stat 17:251–269
  2. Anders S, Huber W (2010) Differential expression analysis for sequencing count data. Genome Biol 11:R106
  3. Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Maguire XAJ, Johnson LA, Robinson J, Verhaak RG, Sougnez C, Onofrio RC, Ziaugra L, Cibulskis K, Laine E, Barretina J, Winckler W, Fisher DE, Getz G, Meyerson M, Jaffe DB, Gabriel SB, Lander ES, Dummer R, Gnirke A, Nusbaum C, Garraway LA (2010) Integrative analysis of the melanoma transcriptome. Genome Res 20:413–427
  4. Bivand R, Piras G (2015) Comparing implementations of estimation methods for spatial econometrics. J Stat Softw 63(18):1–36
  5. Broet P, Lewin A, Richardson S, Dalmasso C, Magdelenat H (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 20:2562–2571
  6. Bullard JH, Purdom E, Hansen KD, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinform 11:94
  7. Chu Y, Corey DR (2012) RNA sequencing: platform selection, experimental design and data interpretation. Nucl Acid Ther 22(4):271–274
  8. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak MW, Gaffney D, Elo LL, Zhang X, Mortazavi A (2016) A survey of best practices for RNA-Seq data analysis. Genome Biol 17:13
  9. Dean N, Raftery AE (2005) Normal uniform mixture differential gene expression detection for cDNA microarrays. BMC Bioinform 6(1):173–187
  10. Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloe D, Le-Gall C, Schaeffer B, Le-Crom S, Guedj M, Jaffrezic F (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6):671–683
  11. Do KA, Muller P, Tang F (2005) A Bayesian mixture model for differential gene expression. J R Stat Soc Ser C 54(3):627–644
  12. Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014) Differential expression analysis of RNA-Seq data at single-base resolution. Biostatistics 15(3):413–426
  13. Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini A, Sawitzki G, Smith C, Smyth G, Tierney L, Yang J, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:R80
  14. Geweke J (1992) Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bernardo JM, Berger J, Dawid AP, Smith AFM (eds) Bayesian statistics, vol 4. Oxford University Press, Oxford, pp 169–193
  15. Green PJ (1995) Reversible jump MCMC and Bayesian model determination. Biometrika 82(4):711–732
  16. Han Y, Chen J, Zhao X, Liang C, Wang Y, Sun L, Jiang Z, Zhang Z, Yang R, Chen J, Li Z, Tang A, Li Z, Ye J, Guan Z, Gui Y, Cai Z (2011) MicroRNA expression signatures of bladder cancer revealed by deep sequencing. PLoS One 6(3):e18286
  17. Hansen KD, Irizarry RA, Wu Z (2012) Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13(2):204–216
  18. Hebenstreit D, Fang M, Gu M, Charoensawan V, Van-Oudenaarden A, Teichmann SA (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7:497
  19. Lewin A, Bochkina N, Richardson S (2007) Fully Bayesian mixture model for differential gene expression: simulations and model checks. Stat Appl Genet Mol Biol 6:36
  20. Liu JS (1994) The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J Am Stat Assoc 89:958–966
  21. Lucas JE, Kung HN, Chi JTA (2010) Latent factor analysis to discover pathway associated putative segmental aneuploidies in human cancers. PLoS Comput Biol 6:e1000920
  22. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458(7234):97–101
  23. Mayrink VD, Gonçalves FB (2017) A Bayesian hidden Markov mixture model to detect overexpressed chromosome regions. J R Stat Soc Ser C 66(2):387–412
  24. McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucl Acids Res 40:4288–4297
  25. Moran PAP (1950) Notes on continuous stochastic phenomena. Biometrika 37(1):17–23
  26. Nueda MJ, Tarazona S, Conesa A (2014) Next maSigPro: updating maSigPro bioconductor package for RNA-Seq time series. Bioinformatics 30(18):2598–2602
  27. Oshlack A, Robinson MD, Young MD (2010) From RNA-Seq reads to differential expression results. Genome Biol 11(12):220
  28. Papastamoulis P, Rattray M (2018) A Bayesian model selection approach for identifying differentially expressed transcripts from RNA sequencing data. J R Stat Soc Ser C 67(1):3–23
  29. Plummer M, Best N, Cowles K, Vines K (2006) CODA: convergence diagnosis and output analysis for MCMC. R News 6(1):7–11
  30. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Dale ALB, Brown PO (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA 99:12963–12968
  31. R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Accessed 10 Oct 2019
  32. Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biol 11:R25
  33. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140
  34. Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-Seq data. BMC Bioinform 14:91
  35. Van-De-Wiel MA, Leday GGR, Pardo L, Rue H, Van-Der-Vaart AW, Van-Wieringen WN (2013) Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics 14(1):113–128
  36. Wagner GP, Kin K, Lynch VJ (2013) A model based criterion for gene expression calls using RNA-Seq data. Theory Biosci 132(3):159–164
  37. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
  38. Zhang H, Xu J, Jiang N, Hu X, Luo Z (2015) PLNseq: a multivariate Poisson lognormal distribution for high-throughput matched RNA-sequencing read count data. Stat Med 34:1577–1589


The authors would like to thank an anonymous referee for constructive comments to improve this work.

Author information



Corresponding author

Correspondence to Vinícius Diniz Mayrink.



Appendix A: joint and full conditional distributions

The joint density of X and all the unknown components in the model is:

$$\begin{aligned}&\pi (X,Z,W,V,q_0,Q,\psi ,\beta ) \; \nonumber \\&\quad = \left[ \prod _{i=1}^n \pi (X_i|Z_i,\psi ) \, \pi (Z_i|Z_{i-1},W_i,q_0,Q) \, \pi (W_i|V_i) \, \pi (V_i|\beta ) \right] \, \pi (q_0,Q,\psi ,\beta ), \nonumber \\&\quad =\Bigg [ \prod _{i=1}^n \left[ \prod _{k=1}^{K} f_k(X_i|\psi )^{Z_{i,k}} (q_{0k}^{Z_{i,k}})^{1-W_i}(q_{k_{(i-1)}k}^{Z_{i,k}})^{W_i} \right] \nonumber \\&\qquad \times \; \left[ \mathbb {1}(W_i = 1) \; \mathbb {1}(V_i > 0) + \mathbb {1}(W_i = 0) \; \mathbb {1}(V_i \le 0) \right] \nonumber \\&\qquad \times \; \phi (V_i-\beta '\mathbf {d_i}) \Bigg ] \left[ \prod _{k=1}^{K} q_{0k}^{Z_{0,k}}q_{0k}^{r_{0k}-1} \right] \; \left[ \prod _{k_1=1}^{K} \prod _{k=1}^{K}q_{k_1 k}^{r_{k_1 k}-1} \right] \; \pi (\psi ) \; \pi (\beta ), \end{aligned}$$

where \(k_{(i-1)} = j\) if \(Z_{i-1,j}=1\),   and   \(\pi (\beta )=|\varSigma _0|^{-1/2}\phi _2[\varSigma _{0}^{-1/2}(\beta -\mu _0)]\). Notation: \(\phi (.)\) and \(\phi _2(.)\) are the densities of the uni- and bi-dimensional standard Gaussian distributions, respectively.

$$\begin{aligned} \pi (\psi )= & {} \displaystyle \prod _{k=1}^{K} \; \pi _{\text{ NIG }}(\mu _k,\sigma _{k}^{2}; m_k, v_k, s_{1k},s_{2k}) \; \mathbb {1}(\mu _1<\cdots <\mu _K). \end{aligned}$$

The full conditional distributions of \((q_0, Q, \psi , \beta )\) are:

$$\begin{aligned} (q_0|\cdot )\sim & {} \text{ Dir } \left[ r_{0} + \sum _{i=1}^n Z_{i} (1-W_i) \right] , \end{aligned}$$
$$\begin{aligned} (q_k|\cdot )\sim & {} \text{ Dir } \left[ r_{k} + \sum _{i=2}^n (Z_{i-1,k} W_i) Z_{i} \right] , \end{aligned}$$
$$\begin{aligned} (\beta |\cdot )\sim & {} \text{ N }_2(\mu _0^*,\varSigma _0^*), \end{aligned}$$

with \(\varSigma _{0}^{*} = \left( \varSigma _{0}^{-1} + \sum _{i = 1}^{n} \mathbf {d}_{i} \mathbf {d}_{i}' \right) ^{-1}\) and \(\mu _{0}^{*} = \varSigma _{0}^{*} \left( \varSigma _{0}^{-1} \mu _0 + \sum _{i=1}^{n} V_i \mathbf {d}_i \right) \).
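As a rough illustration of the conjugate updates above, the following Python sketch computes the Dirichlet parameters for \(q_0\) and for the rows of Q, and the moments of the Gaussian full conditional of \(\beta \). It assumes that Z is stored as an n × K one-hot matrix, W as a 0/1 vector and the vectors \(\mathbf {d}_i\) as the rows of an n × 2 matrix D; the function names and this data layout are illustrative choices, not the authors' implementation.

```python
import numpy as np

def dirichlet_updates(Z, W, r0, R):
    """Posterior Dirichlet parameters for q_0 and for each row q_k of Q.

    Z: (n, K) one-hot allocations; W: (n,) 0/1 dependence indicators;
    r0: (K,) prior parameters of q_0; R: (K, K) prior parameters, row k for q_k.
    """
    Z = np.asarray(Z, dtype=float)
    W = np.asarray(W, dtype=float)
    # r_0 + sum_{i=1}^n Z_i (1 - W_i)
    post_r0 = np.asarray(r0, dtype=float) + Z.T @ (1.0 - W)
    # entry (k, l) = sum_{i=2}^n Z_{i-1,k} W_i Z_{i,l}
    transitions = (Z[:-1] * W[1:, None]).T @ Z[1:]
    return post_r0, np.asarray(R, dtype=float) + transitions

def beta_full_conditional(D, V, mu0, Sigma0):
    """Mean and covariance of the Gaussian full conditional of beta."""
    D, V = np.asarray(D, dtype=float), np.asarray(V, dtype=float)
    prec0 = np.linalg.inv(Sigma0)
    Sigma_star = np.linalg.inv(prec0 + D.T @ D)        # (Sigma_0^{-1} + sum d_i d_i')^{-1}
    mu_star = Sigma_star @ (prec0 @ mu0 + D.T @ V)     # Sigma* (Sigma_0^{-1} mu_0 + sum V_i d_i)
    return mu_star, Sigma_star
```

Given a NumPy generator `rng`, draws could then be obtained with `rng.dirichlet(post_r0)` and `rng.multivariate_normal(mu_star, Sigma_star)`.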

$$\begin{aligned} (\mu _k|\cdot ) \sim \text{ N }(m_{k}^{*},v_{k}^{*}) \quad \text{ and } \quad (\sigma _{k}^{2}|\cdot ) \sim \text{ IG }(s_{1k}^{*},s_{2k}^{*}), \quad \text{ for } \; k = 1, \ldots , K, \end{aligned}$$

where \(v_{k}^{*} = \frac{v_k \sigma _{k}^{2}}{1 + v_{k} \sum _{i=1}^{n} Z_{i,k}}\), \(m_{k}^{*} = v_{k}^{*} \left( \frac{m_{k} + v_{k} \sum _{i=1}^{n} Z_{i,k} X_{i}}{v_{k} \sigma _{k}^{2}} \right) \), \(s_{1k}^{*} = s_{1k} + \frac{1}{2} \sum _{i=1}^{n} Z_{i,k}\)

and \(s_{2k}^{*} = s_{2k} + \frac{1}{2} \left[ \frac{m_{k}^{2}}{v_{k}} + \sum _{i=1}^{n} Z_{i,k} X_{i}^{2} -\frac{v_{k}}{1+v_{k} \sum _{i=1}^{n} Z_{i,k}} \left( \frac{m_{k}}{v_{k}} + \sum _{i=1}^{n} Z_{i,k} X_{i} \right) ^2 \right] \).
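The update of \((\mu _k, \sigma _{k}^{2})\) can be sketched in Python as follows. This is a minimal sketch under two assumptions that are not stated in this appendix: \(\text{ IG }(s_1, s_2)\) is read as shape \(s_1\) and scale \(s_2\), and the ordering constraint \(\mu _1<\cdots <\mu _K\) is ignored (in practice it would have to be enforced, for example by rejecting draws that violate it). The function names are hypothetical.

```python
import numpy as np

def nig_update(x, z_k, m_k, v_k, s1_k, s2_k):
    """Posterior quantities for component k given expressions x and indicators z_k.

    Returns (m_star, shrink, s1_star, s2_star), where v_star = shrink * sigma2.
    """
    x = np.asarray(x, dtype=float)
    z_k = np.asarray(z_k, dtype=float)
    n_k = z_k.sum()                      # sum_i Z_{i,k}
    sum_x = (z_k * x).sum()              # sum_i Z_{i,k} X_i
    sum_x2 = (z_k * x ** 2).sum()        # sum_i Z_{i,k} X_i^2
    shrink = v_k / (1.0 + v_k * n_k)
    m_star = shrink * (m_k / v_k + sum_x)   # sigma2 cancels in m_star
    s1_star = s1_k + 0.5 * n_k
    s2_star = s2_k + 0.5 * (m_k ** 2 / v_k + sum_x2
                            - shrink * (m_k / v_k + sum_x) ** 2)
    return m_star, shrink, s1_star, s2_star

def draw_mu_sigma2(params, rng):
    """Draw sigma2 from IG(s1_star, s2_star), then mu given sigma2."""
    m_star, shrink, s1_star, s2_star = params
    # Assumption: IG(shape, scale), i.e. 1 / Gamma(shape, rate = scale).
    sigma2 = 1.0 / rng.gamma(shape=s1_star, scale=1.0 / s2_star)
    mu = rng.normal(m_star, np.sqrt(shrink * sigma2))
    return mu, sigma2
```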

The sampling step of \((Z, W)\) is

$$\begin{aligned} Z_1\sim & {} \text{ Mult }(1,q_{0}^*), \\ (W_i|Z_{i-1,j}=1,\cdot )\sim & {} \text{ Ber }(1,p_{(i,j)}^*), \;\; i = 2, \ldots , n, \\ (Z_i|W_i=l,Z_{i-1,j}=1,\cdot )\sim & {} \text{ Mult }(1,q_{(i,j,l)}^*), \;\; i = 2, \ldots , n, \end{aligned}$$


where

$$\begin{aligned} q_{0k}^*= & {} c_{1,k}q_{0k}/a_1, \quad p_{(i,j)}^* = b_{i,j}\varPhi _{i}^+/c_{i,j}, \\ q_{(i,j,0)k}^*= & {} c_{i+1,k}f_k(X_i|\psi )q_{0k}/a_i, \quad q_{(i,j,1)k}^* = c_{i+1,k}f_k(X_i|\psi )q_{j,k}/b_{i,j}. \end{aligned} \tag{A.6}$$

Consider \(c_{n+1,j}=1\),   \(\forall j\),   and:

$$\begin{aligned} a_n= & {} \sum _{k=1}^{K}f_k(X_n|\psi )q_{0k}, \quad b_{n,j} = \sum _{k=1}^{K}f_k(X_n|\psi )q_{j,k}, \\&c_{n,j}=b_{n,j}\varPhi _{n}^++a_n\varPhi _{n}^- , \\ a_{i}= & {} \sum _{k=1}^{K}c_{i+1,k}f_k(X_i|\psi )q_{0k}, \quad b_{i,j} = \sum _{k=1}^{K}c_{i+1,k}f_k(X_i|\psi )q_{j,k}, \\&c_{i,j} = b_{i,j}\varPhi _{i}^++a_i\varPhi _{i}^-, \quad \text{ for } \;\; i = n-1, \ldots , 2, \\ a_{1}= & {} \sum _{k=1}^{K}c_{1,k}q_{0k}. \end{aligned} \tag{A.7}$$

First, the quantities in (A.7) are computed recursively, starting from \(i = n\) and moving backwards (the filtering part); here, \(b_i\) and \(c_i\) are K-vectors and \(a_i\) is a scalar. Next, the probabilities shown in (A.6) are calculated and \((Z, W)\) is sampled moving forwards from the first location.
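The backward recursion in (A.7) followed by forward sampling with the probabilities in (A.6) can be sketched in Python as below. The sketch makes several assumptions that are not spelled out in this appendix: the components \(f_k\) are Gaussian densities with parameters `mu` and `sigma2`, `phi_plus[t]` holds \(\varPhi _{t}^{+}\) (with \(\varPhi _{t}^{-} = 1 - \varPhi _{t}^{+}\)), the first location is drawn from \(q_0\) weighted by \(c_{2,k} f_k(X_1|\psi )\), and \(W_1\) is fixed at 0. The function name `sample_zw` and the array layout are hypothetical.

```python
import numpy as np

def sample_zw(x, q0, Q, phi_plus, mu, sigma2, rng):
    """One joint draw of (Z, W): backward filtering, then forward sampling."""
    x, q0, Q = np.asarray(x, float), np.asarray(q0, float), np.asarray(Q, float)
    phi_plus = np.asarray(phi_plus, float)
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    n, K = len(x), len(q0)
    # f[t, k] = Gaussian density f_k(X_t | psi)
    f = np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

    g = np.empty((n, K))      # g[t, k] = c_{t+1,k} f_k(X_t), up to a constant
    p_w = np.zeros((n, K))    # p_w[t, j] = P(W_t = 1 | Z_{t-1} = j, data)
    c_next = np.ones(K)       # c_{n+1,j} = 1 for all j
    for t in range(n - 1, 0, -1):        # backward pass, last location first
        g[t] = c_next * f[t]
        a = g[t] @ q0                    # scalar a_t
        b = Q @ g[t]                     # K-vector b_{t,j}
        c = b * phi_plus[t] + a * (1.0 - phi_plus[t])
        p_w[t] = b * phi_plus[t] / c
        c_next = c / c.sum()             # rescaling only; all ratios are unchanged
    g[0] = c_next * f[0]

    z = np.empty(n, dtype=int)
    w = np.zeros(n, dtype=int)           # assumption: W is 0 at the first location
    probs = g[0] * q0
    z[0] = rng.choice(K, p=probs / probs.sum())
    for t in range(1, n):                # forward pass
        j = z[t - 1]
        w[t] = int(rng.random() < p_w[t, j])
        probs = g[t] * (Q[j] if w[t] == 1 else q0)
        z[t] = rng.choice(K, p=probs / probs.sum())
    return z, w
```

The rescaling of `c_next` is a numerical convenience: every probability in (A.6) is a ratio of quantities that share the same constant, so the draws are unaffected. For long sequences or expressions far from all component means, a log-scale implementation would be safer.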

Appendix B: sensitivity analysis

Fig. 9 Histogram of all expressions (breast in row 1, ovarian in row 2) overlaid by the estimated mixture density (dashed curve) and its Gaussian components (grey and black curves) for seven different choices of K

Table 6 Posterior estimates (for each size K, Ovarian data) of the weight, \(\mu _k\) and \(\sigma _{k}^{2}\) related to the lower and upper Gaussian components and the mean (expec.) and variance (var.) of the (normalised) mixture of internal Gaussians (without the lower and upper components); standard errors in parentheses

Figure 9 and Table 6 report the sensitivity of the results to the choice of K (number of Gaussian components). Here, the model is fitted assuming seven configurations: \(K = 3\), 4, 5, 6, 7, 8 and 9. The Dirichlet priors for \(q_0\) and Q are specified so that the degree of information in the prior (sum of the hyperparameters of the Dirichlet) is the same across all configurations; see the description in Sect. 5 of the main article.

As expected, the results for \(K = 3\) indicate lower/upper Gaussian components with high variance that accommodate too many expressions. The weights of the lower/upper Gaussian components decrease as K increases. In terms of variance, the two smallest values are observed for large K in both cases. Table 6 also shows that the variance of the mixture involving only the internal Gaussians increases with K. Note that the results become quite similar for \(K = 7\) and 8 (Breast) and \(K = 8\) and 9 (Ovarian). Recall that the Ovarian data set is larger (6541 more aligned genes) than the Breast data; this might explain the differences observed between them. In conclusion, it seems reasonable to choose the model with \(K = 8\) (results are in Sect. 5).

One may argue that the Reversible Jump MCMC (RJMCMC), proposed in Green (1995), should be considered as an alternative to estimate K. Note that the value of K obtained via RJMCMC will depend on the goodness of fit, which is directly related to the likelihood “penalized” by the prior distribution of K. This strategy may not determine an optimal choice of K in our context of low/high expression detection. As mentioned in Sect. 5, the value of K should be such that the first and last mixture components are not too wide and have small posterior weights. The RJMCMC result may not comply with these features. In addition, the RJMCMC can select a different K for two RNA-Seq data sets, which may imply that the criterion for low/high expression classification has changed between the analyses.

Fig. 10 Comparison of prior specifications (breast data). Histogram of all expressions overlaid by the estimated mixture density (dashed curve) and its components (grey and black curves)

Table 7 Comparing prior specifications (Breast data). Posterior estimates of: weights, \(\mu _k\), \(\sigma _{k}^{2}\), for \(k = 1\) and 8 (lower and upper Gaussian components), mean (expec.) and variance (var.) of the normalised mixture of internal Gaussians; standard errors in parentheses

The next results concern a prior sensitivity analysis for the parameters of the Gaussian components in the mixture with \(K=8\). Two specifications are studied. The first is the configuration explored in Sect. 5 of the main article: \((\mu _k, \sigma _{k}^{2}) \sim \text{ NIG }(m_k,\;10,\; 2.1,\;1.1)\) with \(m_k = -10\), \(-7.14\), \(-4.29\), \(-1.43\), 1.43, 4.29, 7.14 and 10, for \(k = 1, \ldots , 8\), respectively. The second specification assumes higher variability by doubling the original standard deviations, i.e., \((\mu _k, \sigma _{k}^{2}) \sim \text{ NIG }(m_k,\;40,\; 2.025,\;1.025)\); the values of \(m_k\) and \(E(\sigma _{k}^{2})\) are unchanged with respect to the first option. Figure 10 and Table 7 show the results; they are visually indistinguishable, which indicates robustness to the prior uncertainty about these parameters.
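The claim that the second specification doubles the standard deviations while keeping \(E(\sigma _{k}^{2})\) fixed can be checked with a short calculation. The check assumes the parametrisation suggested by the full conditionals in Appendix A, namely \(\mu _k|\sigma _{k}^{2} \sim \text{ N }(m_k, v_k\sigma _{k}^{2})\) and \(\sigma _{k}^{2} \sim \text{ IG }(s_{1k}, s_{2k})\) with shape \(s_{1k}\) and scale \(s_{2k}\):

$$\begin{aligned} E(\sigma _{k}^{2})&= \frac{s_{2k}}{s_{1k}-1}:&\quad \frac{1.1}{1.1} = 1&\quad \text{ and } \quad \frac{1.025}{1.025} = 1, \\ \text{ Var }(\sigma _{k}^{2})&= \frac{s_{2k}^{2}}{(s_{1k}-1)^{2}(s_{1k}-2)}:&\quad \frac{1.1^{2}}{1.1^{2} \times 0.1} = 10&\quad \text{ and } \quad \frac{1.025^{2}}{1.025^{2} \times 0.025} = 40, \\ \text{ Var }(\mu _k)&= v_k \, E(\sigma _{k}^{2}):&\quad 10 \times 1 = 10&\quad \text{ and } \quad 40 \times 1 = 40. \end{aligned}$$

Under this reading, both variances are multiplied by 4 from the first to the second specification, so both standard deviations are multiplied by 2, while the prior mean of \(\sigma _{k}^{2}\) stays at 1.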

Appendix C: MCMC diagnostics

This section presents some results to show that the MCMC designed for our gene expression analysis has good convergence properties. Two aspects explaining this behaviour are the chosen blocking scheme for the algorithm and the fact that all parameters, including \((Z, W)\), can be sampled directly from their full conditional distributions. The graphs in Figs. 11 and 12 explore some MCMC chains for the block \((Z, W)\). Here, the trace plots indicate good mixing properties and the trajectories of the ergodic averages suggest fast convergence.

Fig. 11 Trace plots of: \(Z_{i1}\) and \(W_i\) (from location \(i = 1379\)) and \(Z_{i8}\) and \(W_i\) (from location \(i = 24{,}714\)). These are locations with posterior mean near 0.5 (Breast data)

Fig. 12 Evolution of the ergodic average for three pairs of \((Z_{i1},W_i)\) and three pairs of \((Z_{i8},W_i)\) along the MCMC chain (breast data)

Table 8 shows the z-scores of the Geweke convergence diagnostic test (Geweke 1992) for the Markov chains related to \(\mu _k\), \(\sigma _{k}^{2}\) and \(\beta \) (assuming \(K = 8\)). The R package coda (Plummer et al. 2006) is used to apply the test. As can be seen, all z-scores are within the N(0,1) symmetric 95% interval around zero given by (\(-1.96\), 1.96). This result gives no evidence against the convergence of any of the chains evaluated in the table.
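To make the idea of the test concrete, the Python sketch below computes a simplified Geweke-type z-score that compares the mean of the first 20% of a chain with the mean of the last 30%, the fractions used in Table 8. It is a simplification and not a reimplementation of coda: coda estimates the two variances from the spectral density at frequency zero to account for autocorrelation, whereas this sketch uses plain sample variances and is therefore only appropriate for nearly uncorrelated draws. The function name is hypothetical.

```python
import numpy as np

def geweke_z_naive(chain, first=0.2, last=0.3):
    """Simplified Geweke z-score: early-segment mean versus late-segment mean."""
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    head = chain[: int(round(first * n))]          # first 20% of the chain
    tail = chain[n - int(round(last * n)):]        # last 30% of the chain
    # Naive standard error; assumes (nearly) independent draws within segments.
    se2 = head.var(ddof=1) / len(head) + tail.var(ddof=1) / len(tail)
    return (head.mean() - tail.mean()) / np.sqrt(se2)
```

A value outside (\(-1.96\), 1.96) would suggest that the early and late parts of the chain have different means, i.e. that the chain has not yet stabilised.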

Table 8 Z-scores of the Geweke convergence diagnostic test for the chains of \(\mu _k\), \(\sigma _{k}^{2}\) (\(K = 8\) components), \(\beta _0\) and \(\beta _1\). The test involves \(20\%\) (at the beginning) and \(30\%\) (at the end) of the chain
Table 9 Comparing results: RNA-Seq versus microarray (Mayrink and Gonçalves 2017). Number of highly-expressed genes (detected in the RNA-Seq study) and the percentages of them near overexpressed locations identified in the microarray analysis

Appendix D: comparative study

This section is dedicated to a comparative analysis involving the highly expressed genome locations identified in the present RNA-Seq study and those from Mayrink and Gonçalves (2017) using microarray data. The analysis is focused on Table 9, which shows the percentages of highly expressed genes (in our RNA-Seq study) located in a chromosome region near overexpressed locations from the microarray study. This strategy is used because the number of aligned locations is larger in the microarray case and the lists of locations are not the same in the two studies. The criterion to delimit the search neighbourhood is empirical. Let \(\epsilon _j\) be the median (less affected by outliers) of the distances between all genes (in the RNA-Seq study) aligned to chromosome j. The target regions are given by the locations of those genes \(\pm \epsilon _j\). The values exhibited in Table 9 are small (the highest percentage is \(19.355\%\)); however, the reader should note that the number of highly expressed genes is also small relative to the total number of genes under investigation. In addition, the specific type of tumor may not be the same in the two studies. As can be seen, all percentages are non-zero, indicating some consensus between the studies.
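The neighbourhood criterion can be illustrated with the following Python sketch for a single chromosome. It assumes that \(\epsilon _j\) is the median gap between consecutive RNA-Seq genes once they are sorted by position, which is one possible reading of "distances between all genes"; the function and argument names are hypothetical.

```python
import numpy as np

def percent_near_microarray(rna_all, rna_high, micro_over):
    """Percentage of highly expressed RNA-Seq genes within eps_j of an
    overexpressed microarray location, for one chromosome.

    rna_all: positions of all RNA-Seq genes aligned to the chromosome;
    rna_high: positions of the highly expressed RNA-Seq genes;
    micro_over: positions of overexpressed microarray locations.
    """
    rna_all = np.sort(np.asarray(rna_all, dtype=float))
    # Assumption: eps_j is the median distance between consecutive genes.
    eps = np.median(np.diff(rna_all))
    micro_over = np.asarray(micro_over, dtype=float)
    hits = [bool(np.any(np.abs(micro_over - p) <= eps)) for p in rna_high]
    return 100.0 * np.mean(hits), eps
```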

Appendix E: synthetic data analysis

In this section, we show some results using simulated data to confirm that the model performs well and that the corresponding code is correct. The synthetic gene expressions were generated from the model in Sect. 3 assuming \(K = 8\) mixture components. The true parameter values are: \(\beta _0 = 5.05\), \(\beta _1 = -9.55\), \(\mu = \{-11, -7, -5, -3, -2, 0, 1, 4\}\) and \(\sigma ^2 = \{2, 3, 2, 5, 2, 3, 4, 1\}\). The true \(q_0\) and \(q_k\) were chosen based on the specifications of \(r_0\) and r presented in Sect. 5; we use the specified vector divided by 2000. These parameter values produce synthetic expressions on the same scale as the real cancer data sets. The data are generated assuming 23 chromosomes and 1000 aligned genes per chromosome, i.e., the whole synthetic sequence has 23,000 genes. The distances between genes were generated from the \(\text{ Uniform }(0,1)\) distribution as a strategy to obtain distances of all magnitudes, ranging from low (\(\approx 0\)) to high (\(\approx 1\)), in this simulation study. The MCMC, described in Sect. 4.1, is run with the same configuration (number of iterations and priors) presented in Sect. 5.
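The data-generating process described above can be sketched in Python for one chromosome. The values of \(\beta \), \(\mu \) and \(\sigma ^2\) are those quoted in the text. Everything else is an assumption made for illustration: \(\mathbf {d}_i = (1, \text{distance}_i)'\), the dependence indicator follows \(W_i \sim \text{ Ber }[\varPhi (\beta '\mathbf {d}_i)]\), the first gene is drawn from \(q_0\), and `q0` and `Q` are placeholder values because the vectors from Sect. 5 are not reproduced in this appendix.

```python
import math
import numpy as np

def simulate_chromosome(n, mu, sigma2, q0, Q, beta, rng):
    """Simulate expressions for n genes on one chromosome."""
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    q0, Q = np.asarray(q0, float), np.asarray(Q, float)
    K = len(mu)
    dist = rng.uniform(0.0, 1.0, size=n)     # distance to the previous gene
    z = np.empty(n, dtype=int)
    w = np.zeros(n, dtype=int)               # assumption: no dependence for gene 1
    z[0] = rng.choice(K, p=q0)
    for i in range(1, n):
        eta = beta[0] + beta[1] * dist[i]
        p_dep = 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))   # Phi(eta)
        w[i] = int(rng.random() < p_dep)
        z[i] = rng.choice(K, p=Q[z[i - 1]] if w[i] == 1 else q0)
    x = rng.normal(mu[z], np.sqrt(sigma2[z]))
    return x, z, w, dist

# Values quoted in the text.
mu = [-11, -7, -5, -3, -2, 0, 1, 4]
sigma2 = [2, 3, 2, 5, 2, 3, 4, 1]
beta = (5.05, -9.55)
# Placeholder (hypothetical) initial and transition probabilities.
q0 = np.full(8, 1.0 / 8.0)
Q = 0.5 * np.eye(8) + 0.5 / 8.0
x, z, w, dist = simulate_chromosome(1000, mu, sigma2, q0, Q, beta, np.random.default_rng(2019))
```

With these values of \(\beta \), nearby genes (distance close to 0) are almost always dependent on their predecessor, while distant genes (distance close to 1) are almost always drawn from \(q_0\).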

Fig. 13 Histogram of the synthetic expressions overlaid by the mixture density (dashed lines) and its components (lower and upper Gaussian in black and internal components in grey). Real mixture on the left and the estimated mixture on the right

Figure 13 shows the histogram of the synthetic data overlaid by the real (left panel) and the estimated (right panel) mixture densities and their components. As can be seen, the shapes and locations of each Gaussian component and the overall mixture density are quite similar in the two panels. This is a strong indication that the model is working properly and estimating the parameters well. Another interesting result is the percentage of genes correctly detected as underexpressed (or overexpressed), relative to the total number of genes receiving the same classification from the model (criterion: posterior probability \(> 0.5\)). In this synthetic data analysis, we have observed \(83.67\%\) and \(74.11\%\) of correct low and high expression detection, respectively.
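The percentage described above is the proportion of true members among the genes that the model assigns to a target component. A minimal Python sketch, with hypothetical function and argument names, is:

```python
import numpy as np

def detection_precision(post_prob, truly_in_component):
    """Percentage of detected genes (posterior probability > 0.5) that
    truly belong to the target component in the simulation."""
    detected = np.asarray(post_prob, dtype=float) > 0.5
    correct = detected & np.asarray(truly_in_component, dtype=bool)
    return 100.0 * correct.sum() / detected.sum()
```

Here `post_prob` would hold, for each gene, the posterior probability of belonging to the lower (or upper) Gaussian component, and `truly_in_component` the simulated allocation.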

Appendix F: exploring the size of the clusters

This section shows an additional result related to Panels (a) and (b) in Fig. 6. The barplots in Fig. 14 show the frequencies of the different sizes of runs of consecutive locations detected as atypical. For each location, we evaluate the posterior probability of belonging to the lower/upper Gaussian component to classify it as atypical. As can be seen, many underexpressed genes are found isolated, without a neighbour with the same classification. The overexpressed case does not show the same high concentration in the category “cluster of size 1”.
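The cluster sizes can be obtained by measuring runs of consecutive atypical locations. The Python sketch below is an illustration with a hypothetical function name; it treats its input as one ordered sequence, so in practice it would presumably be applied chromosome by chromosome to avoid joining runs across chromosome boundaries.

```python
from itertools import groupby
import numpy as np

def cluster_size_percentages(post_prob, threshold=0.5):
    """Percentage of clusters of each size among runs of consecutive
    locations whose posterior probability exceeds the threshold."""
    flags = np.asarray(post_prob, dtype=float) > threshold
    # Length of every maximal run of consecutive atypical locations.
    sizes = [sum(1 for _ in run) for flag, run in groupby(flags) if flag]
    values, counts = np.unique(sizes, return_counts=True)
    return {int(v): 100.0 * int(c) / len(sizes) for v, c in zip(values, counts)}
```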

Fig. 14 Bar-plots indicating how frequent (in %) are the number of consecutive locations detected with atypical expression. Graphs in the first row are related to the lower Gaussian component. The second row corresponds to the upper Gaussian case. The high posterior probability (\(> 0.5\)) of belonging to the target Gaussian component is the criterion to establish atypical expression

Appendix G: effect of the Markov dependence

This section is intended to provide empirical evidence to justify the behaviour observed for the upper Gaussian component in Fig. 4. A visual analysis reveals that some observations, with high expressions in the right tail of the histograms, are accommodated by adjacent Gaussian components. In order to justify such behaviour, we analyse all genes with expressions in (0, 5), which is where the upper Gaussian component concentrates most of its mass. To simplify the notation, denote by \(G_8\) the set of genes allocated to the upper Gaussian and by \(G_\bullet \) the remaining ones. As described in Sect. 4.2, a gene is allocated to the upper Gaussian if its posterior probability of belonging to this component is greater than 0.5.

Fig. 15 Comparison of two sets: \(\text {D}(\bullet ,8) =\) distances between each non-upper Gaussian gene and its nearest upper Gaussian neighbour and \(\text {D}(8,8) =\) distances between each upper Gaussian gene and its nearest upper Gaussian neighbour. The graphs in row 1 consider all genes with expression in the interval (0, 5). The graphs in row 2 consider specific subsets of the smallest distances in each set. The histogram shows the nearest neighbour expression distribution for highly expressed genes (\(> 5\)) accommodated by internal Gaussian components

The boxplots in the first row of Fig. 15 compare two sets: \(\text {D}(\bullet ,8)\), which contains the distances between genes in \(G_\bullet \) and their nearest neighbour in \(G_8\), and \(\text {D}(8,8)\), which contains the distances between the members of \(G_8\) and their nearest neighbour in \(G_8\). As expected from our conjecture, the values in \(\text {D}(8,8)\) tend to be smaller than those in \(\text {D}(\bullet ,8)\). The second row of boxplots shows a similar comparison involving the smallest distances of \(\text {D}(8,8)\) and \(\text {D}(\bullet ,8)\). In this case, for each data set, both boxplots use the same number of smallest observations, namely the number of genes for which \(\text {D}(8,8)\) is smaller than 1. If we combine all four data sets, \(92.61\%\) of the values in \(\text {D}(8,8)\) are lower than 1 (188 out of 203 distances). The graphs show that most of the smallest distances for genes in \(G_\bullet \) are larger than the smallest distances for genes in \(G_8\), which again supports our conjecture. In fact, the 92.61th percentile of the four combined \(\text {D}(8,8)\) sets is 0.690, i.e. the largest distance smaller than 1 is 0.69, whereas the corresponding percentile for the four combined \(\text {D}(\bullet ,8)\) sets is 2461.779. The histogram in Fig. 15 shows the distribution of the expressions for the nearest neighbour of genes having expressions higher than 5 that were not allocated to the upper Gaussian. The histogram indicates that most of these expressions (\(82.64\%\)) are lower than zero. Furthermore, a deeper investigation reveals that all of those expressions belong to genes not allocated to the upper Gaussian component. As stated in our conjecture, this is a consequence of the Markov dependence in the model, which uses the moderate expressions in the neighbourhood as evidence against allocating a gene with high expression to the upper Gaussian component.

The small noise in this analysis is explained by the stochastic noise from the model fitting and the fact that we only considered the one nearest neighbour when, in fact, the dependence modeled by the Markov structure involves the whole neighbourhood of a gene.
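The two distance sets compared in Fig. 15 can be computed as in the following Python sketch for the genes of one chromosome. It is an illustration only: the function name and inputs are hypothetical, the restriction to expressions in (0, 5) is assumed to have been applied beforehand, and the pairwise distance matrix is only practical for moderately sized inputs.

```python
import numpy as np

def nearest_upper_distances(pos, is_upper):
    """Return (D(., 8), D(8, 8)): distances from each gene to its nearest
    neighbour in G_8, split by whether the gene itself is in G_8."""
    pos = np.asarray(pos, dtype=float)
    is_upper = np.asarray(is_upper, dtype=bool)
    upper_rows = np.flatnonzero(is_upper)
    # d[i, k] = distance between gene i and the k-th gene of G_8
    d = np.abs(pos[:, None] - pos[upper_rows][None, :])
    # A gene in G_8 must not be matched with itself.
    d[upper_rows, np.arange(len(upper_rows))] = np.inf
    nearest = d.min(axis=1)
    return nearest[~is_upper], nearest[is_upper]
```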


Mayrink, V.D., Gonçalves, F.B. Identifying atypically expressed chromosome regions using RNA-Seq data. Stat Methods Appl 29, 619–649 (2020).



  • Bayesian inference
  • Mixture model
  • Gibbs sampling
  • Gene expression
  • Cancer