## Abstract

The number of studies analysing RNA-Seq data has grown quickly in recent years, making this type of gene expression data a strong competitor to DNA microarrays. This paper proposes a Bayesian model to detect under- and overexpressed chromosome regions using RNA-Seq data. The methodology builds on recent work designed to detect highly expressed (overexpressed) regions in the context of microarray data. A hidden Markov model is developed by considering a mixture of Gaussian distributions with ordered means, such that the first and last mixture components accommodate the under- and overexpressed genes, respectively. The model is flexible enough to deal efficiently with the highly irregular spacing of the data by assuming a hierarchical Markov dependence structure. The analysis of four cancer data sets (breast, lung, ovarian and uterus) is presented. Results indicate that the proposed model is selective in determining the expression status, is robust with respect to prior specifications, and provides tools for a global or local search of under- and overexpressed chromosome regions.

## References

Albert JH (1992) Bayesian estimation of normal ogive item response curves using Gibbs sampling. J Educ Behav Stat 17:251–269

Anders S, Huber W (2010) Differential expression analysis for sequencing count data. Genome Biol 11:R106

Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Maguire XAJ, Johnson LA, Robinson J, Verhaak RG, Sougnez C, Onofrio RC, Ziaugra L, Cibulskis K, Laine E, Barretina J, Winckler W, Fisher DE, Getz G, Meyerson M, Jaffe DB, Gabriel SB, Lander ES, Dummer R, Gnirke A, Nusbaum C, Garraway LA (2010) Integrative analysis of the melanoma transcriptome. Genome Res 20:413–427

Bivand R, Piras G (2015) Comparing implementations of estimation methods for spatial econometrics. J Stat Softw 63(18):1–36

Broet P, Lewin A, Richardson S, Dalmasso C, Magdelenat H (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 20:2562–2571

Bullard JH, Purdom E, Hansen KD, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinform 11:94

Chu Y, Corey DR (2012) RNA sequencing: platform selection, experimental design and data interpretation. Nucl Acid Ther 22(4):271–274

Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak MW, Gaffney D, Elo LL, Zhang X, Mortazavi A (2016) A survey of best practices for RNA-Seq data analysis. Genome Biol 17:13

Dean N, Raftery AE (2005) Normal uniform mixture differential gene expression detection for cDNA microarrays. BMC Bioinform 6(1):173–187

Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloe D, Le-Gall C, Schaeffer B, Le-Crom S, Guedj M, Jaffrezic F (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6):671–683

Do KA, Muller P, Tang F (2005) A Bayesian mixture model for differential gene expression. J R Stat Soc Ser C 54(3):627–644

Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT (2014) Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics 15(3):413–426

Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini A, Sawitzki G, Smith C, Smyth G, Tierney L, Yang J, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:R80

Geweke J (1992) Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bernardo JM, Berger J, Dawid AP, Smith AFM (eds) Bayesian statistics, vol 4. Oxford University Press, Oxford, pp 169–193

Green PJ (1995) Reversible jump MCMC and Bayesian model determination. Biometrika 82(4):711–732

Han Y, Chen J, Zhao X, Liang C, Wang Y, Sun L, Jiang Z, Zhang Z, Yang R, Chen J, Li Z, Tang A, Li Z, Ye J, Guan Z, Gui Y, Cai Z (2011) MicroRNA expression signatures of bladder cancer revealed by deep sequencing. PLoS One 6(3):e18286

Hansen KD, Irizarry RA, Wu Z (2012) Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13(2):204–216

Hebenstreit D, Fang M, Gu M, Charoensawan V, Van-Oudenaarden A, Teichmann SA (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7:497. https://doi.org/10.1038/msb.2011.28

Lewin A, Bochkina N, Richardson S (2007) Fully Bayesian mixture model for differential gene expression: simulations and model checks. Stat Appl Genet Mol Biol 6:36. https://doi.org/10.2202/1544-6115.1314

Liu JS (1994) The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J Am Stat Assoc 89:958–966

Lucas JE, Kung HN, Chi JTA (2010) Latent factor analysis to discover pathway associated putative segmental aneuploidies in human cancers. PLoS Comput Biol 6:e1000920

Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458(7234):97–101

Mayrink VD, Gonçalves FB (2017) A Bayesian hidden Markov mixture model to detect overexpressed chromosome regions. J R Stat Soc Ser C 66(2):387–412

McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucl Acids Res 40:4288–4297

Moran PAP (1950) Notes on continuous stochastic phenomena. Biometrika 37(1):17–23

Nueda MJ, Tarazona S, Conesa A (2014) Next maSigPro: updating maSigPro bioconductor package for RNA-Seq time series. Bioinformatics 30(18):2598–2602

Oshlack A, Robinson MD, Young MD (2010) From RNA-Seq reads to differential expression results. Genome Biol 11(12):220. https://doi.org/10.1186/gb-2010-11-12-220

Papastamoulis P, Rattray M (2018) A Bayesian model selection approach for identifying differentially expressed transcripts from RNA sequencing data. J R Stat Soc Ser C 67(1):3–23

Plummer M, Best N, Cowles K, Vines K (2006) CODA: convergence diagnosis and output analysis for MCMC. R News 6(1):7–11

Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Dale ALB, Brown PO (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA 99:12963–12968

R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org. Accessed 10 Oct 2019

Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biol 11:R25

Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140

Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-Seq data. BMC Bioinform 14:91

Van-De-Wiel MA, Leday GGR, Pardo L, Rue H, Van-Der-Vaart AW, Van-Wieringen WN (2013) Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics 14(1):113–128

Wagner GP, Kin K, Lynch VJ (2013) A model based criterion for gene expression calls using RNA-Seq data. Theory Biosci 132(3):159–164. https://doi.org/10.1007/s12064-013-0178-3

Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63

Zhang H, Xu J, Jiang N, Hu X, Luo Z (2015) PLNseq: a multivariate Poisson lognormal distribution for high-throughput matched RNA-sequencing read count data. Stat Med 34:1577–1589

## Acknowledgements

The authors would like to thank an anonymous referee for constructive comments to improve this work.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix A: joint and full conditional distributions

The joint density of *X* and all the unknown components in the model is:

where \(k_{(i-1)} = j\) if \(Z_{i-1,j}=1\), and \(\pi (\beta )=|\varSigma _0|^{-1/2}\phi _2[\varSigma _{0}^{-1/2}(\beta -\mu _0)]\). Notation: \(\phi (.)\) and \(\phi _2(.)\) are the densities of the uni- and bi-dimensional standard Gaussian distributions, respectively.

The full conditional distribution of \((q_0, Q, \psi , \beta )\) is:

with \(\varSigma _{0}^{*} = \left( \varSigma _{0}^{-1} + \sum _{i = 1}^{n} \mathbf {d}_{i} \mathbf {d}_{i}^{\prime } \right) ^{-1}\) and \(\mu _{0}^{*} = \varSigma _{0}^{*} \left( \varSigma _{0}^{-1} \mu _0 + \sum _{i=1}^{n} V_i \mathbf {d}_i \right) \).

where \(v_{k}^{*} = \frac{v_k \sigma _{k}^{2}}{1 + v_{k} \sum _{i=1}^{n} Z_{i,k}}\), \(m_{k}^{*} = v_{k}^{*} \left( \frac{m_{k} + v_{k} \sum _{i=1}^{n} Z_{i,k} X_{i}}{v_{k} \sigma _{k}^{2}} \right) \), \(s_{1k}^{*} = s_{1k} + \frac{1}{2} \sum _{i=1}^{n} Z_{i,k}\),

and \(s_{2k}^{*} = s_{2k} + \frac{1}{2} \left[ \frac{m_{k}^{2}}{v_{k}} + \sum _{i=1}^{n} Z_{i,k} X_{i}^{2} -\frac{v_{k}}{1+v_{k} \sum _{i=1}^{n} Z_{i,k}} \left( \frac{m_{k}}{v_{k}} + \sum _{i=1}^{n} Z_{i,k} X_{i} \right) ^2 \right] \).
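The four updated hyperparameters above can be evaluated directly from the data and the allocation indicators. The following is a minimal sketch that only transcribes the expressions for \(v_{k}^{*}\), \(m_{k}^{*}\), \(s_{1k}^{*}\) and \(s_{2k}^{*}\); the function name `nig_update` and its argument names are hypothetical and not part of the original work.

```python
import numpy as np

def nig_update(x, z_k, m_k, v_k, s1_k, s2_k, sigma2_k):
    """Return (v_star, m_star, s1_star, s2_star) for one mixture component k.

    x        : expressions X_1, ..., X_n
    z_k      : 0/1 indicators Z_{1,k}, ..., Z_{n,k} for component k
    sigma2_k : current value of sigma_k^2 (enters v_star and m_star only)
    """
    x = np.asarray(x, dtype=float)
    z_k = np.asarray(z_k, dtype=float)
    n_k = z_k.sum()                      # sum_i Z_{i,k}
    sum_x = (z_k * x).sum()              # sum_i Z_{i,k} X_i
    sum_x2 = (z_k * x ** 2).sum()        # sum_i Z_{i,k} X_i^2

    v_star = v_k * sigma2_k / (1.0 + v_k * n_k)
    m_star = v_star * (m_k + v_k * sum_x) / (v_k * sigma2_k)
    s1_star = s1_k + 0.5 * n_k
    s2_star = s2_k + 0.5 * (
        m_k ** 2 / v_k
        + sum_x2
        - v_k / (1.0 + v_k * n_k) * (m_k / v_k + sum_x) ** 2
    )
    return v_star, m_star, s1_star, s2_star
```

With no observations allocated to component *k* (all \(Z_{i,k}=0\)), the function returns the prior values \(v_k \sigma _k^2\), \(m_k\), \(s_{1k}\) and \(s_{2k}\), which is a quick sanity check on the expressions.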

The sampling step of (*Z*, *W*) is

where

Consider \(c_{n+1,j}=1\), \(\forall j\), and:

First, the calculations are performed recursively, starting from *n* and moving backwards in the filtering part given in (A.7); here, \(b_i\) and \(c_i\) are *K*-vectors and \(a_i\) is a scalar. Next, the probabilities shown in (A.6) are calculated and (*Z*, *W*) are sampled.
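The two-stage scheme described above (a backward recursion followed by sampling of the states) can be illustrated on a plain, homogeneous hidden Markov chain. This is only a generic sketch and not the recursion in (A.6) and (A.7): it ignores the indicators *W* and the dependence on the distances between genes, and the names `backward_messages`, `sample_states` and `lik` are hypothetical. Here `lik[i, k]` is assumed to hold the density of observation *i* under component *k*.

```python
import numpy as np

def backward_messages(Q, lik):
    """Backward recursion, started from a vector of ones at the last position.

    c[i, j] is proportional to the density of observations i+1, ..., n-1
    given that the state at position i equals j.
    """
    n, K = lik.shape
    c = np.ones((n, K))
    for i in range(n - 2, -1, -1):
        c[i] = Q @ (lik[i + 1] * c[i + 1])
        c[i] /= c[i].sum()   # rescaling avoids underflow; it cancels when sampling
    return c

def sample_states(q0, Q, lik, rng):
    """Draw one state path from its joint conditional distribution."""
    n, K = lik.shape
    c = backward_messages(Q, lik)
    z = np.empty(n, dtype=int)
    p = q0 * lik[0] * c[0]
    z[0] = rng.choice(K, p=p / p.sum())
    for i in range(1, n):
        p = Q[z[i - 1]] * lik[i] * c[i]
        z[i] = rng.choice(K, p=p / p.sum())
    return z
```

Sampling the whole path in one block, rather than one state at a time, is the design choice that the text associates with good mixing of the chain.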

### Appendix B: sensitivity analysis

Figure 9 and Table 6 report the sensitivity of the results to the choice of *K* (number of Gaussian components). Here, the model is fitted assuming seven configurations: \(K = 3\), 4, 5, 6, 7, 8 and 9. The Dirichlet priors for \(q_0\) and *Q* are specified so that the degree of information in the prior (sum of the hyperparameters of the Dirichlet) is the same across all configurations; see the description in Sect. 5 of the main article.

As expected, the results for \(K = 3\) indicate lower/upper Gaussian components with high variance that accommodate too many expressions. The weights of the lower/upper Gaussian components decrease as *K* increases. In terms of variance, the two smallest values are observed for large *K* in both cases. Table 6 also shows that the variance of the mixture involving only the internal Gaussians increases with *K*. Note that the results become quite similar for \(K = 7\) and 8 (Breast) and for \(K = 8\) and 9 (Ovarian). Recall that the Ovarian data set is larger than the Breast data set (6541 more aligned genes); this might explain the differences observed between them. In conclusion, it seems reasonable to choose the model with \(K = 8\) (results are in Sect. 5).

One may argue that the Reversible Jump MCMC (RJMCMC), proposed in Green (1995), should be considered as an alternative to estimate *K*. Note that the value of *K* obtained via RJMCMC depends on the goodness of fit, which is directly related to the likelihood “penalized” by the prior distribution of *K*. This strategy may not determine an optimal choice of *K* in our context of low/high expression detection. As mentioned in Sect. 5, the value of *K* should be such that the first and last mixture components are not too wide and have small posterior weights. The RJMCMC result may not comply with these features. In addition, the RJMCMC can determine a different *K* for two RNA-Seq data sets, which would imply that the criterion for low/high expression classification has changed between the analyses.

The next results concern a prior sensitivity analysis for the parameters of the Gaussian components in the mixture with \(K=8\). Two specifications are studied here; the first one is the configuration explored in Sect. 5 of the main article. Consider \((\mu _k, \sigma _{k}^{2}) \sim \text{ NIG }(m_k,\;10,\; 2.1,\;1.1)\) with \(m_k = -10\), \(-7.14\), \(-4.29\), \(-1.43\), 1.43, 4.29, 7.14 and 10, for \(k = 1, \ldots , 8\), respectively. The second specification assumes higher variability by doubling the original standard deviations, i.e., \((\mu _k, \sigma _{k}^{2}) \sim \text{ NIG }(m_k,\;40,\; 2.025,\;1.025)\); the values of \(m_k\) and \(E(\sigma _{k}^{2})\) are unchanged with respect to the first option. Figure 10 and Table 7 show the results; they are visually the same, which indicates robustness to the prior uncertainty of these parameters.
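The statement that the second prior doubles the standard deviations while keeping \(E(\sigma _{k}^{2})\) fixed can be checked numerically. The sketch below assumes that \(\text{ NIG }(m, v, s_1, s_2)\) means \(\mu \mid \sigma ^2 \sim N(m, v\sigma ^2)\) and \(\sigma ^2 \sim \text{ Inverse-Gamma }\) with shape \(s_1\) and scale \(s_2\); this parametrization is an assumption consistent with the updates in Appendix A, and the helper name `inv_gamma_mean_var` is hypothetical.

```python
import numpy as np

def inv_gamma_mean_var(shape, scale):
    """Mean and variance of an Inverse-Gamma(shape, scale); needs shape > 2."""
    mean = scale / (shape - 1.0)
    var = scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))
    return mean, var

# First and second prior specifications for sigma_k^2.
mean1, var1 = inv_gamma_mean_var(2.1, 1.1)
mean2, var2 = inv_gamma_mean_var(2.025, 1.025)

# Ratio of prior standard deviations of sigma_k^2 (second over first).
sd_ratio_sigma2 = np.sqrt(var2 / var1)

# Ratio of prior standard deviations of mu_k given sigma_k^2 (v = 40 versus v = 10).
sd_ratio_mu = np.sqrt(40.0 / 10.0)

# Eight equally spaced prior means between -10 and 10.
m = np.round(np.linspace(-10.0, 10.0, 8), 2)
```

Under this parametrization both prior means of \(\sigma _{k}^{2}\) equal 1, both standard deviation ratios equal 2, and `m` reproduces the listed values of \(m_k\).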

### Appendix C: MCMC diagnostics

This section presents some results to show that the MCMC designed for our gene expression analysis has good convergence properties. Two aspects explaining this behaviour are: the chosen blocking scheme for the algorithm and the fact that all parameters, including (*Z*, *W*), can be directly sampled from their full conditional distributions. The graphs in Figs. 11 and 12 explore some MCMC chains for the block (*Z*, *W*). Here, the trace plots indicate good mixing properties and the trajectories of the ergodic averages suggest fast convergence.

Table 8 shows the z-scores of the Geweke convergence diagnostic test (Geweke 1992) for the Markov chains related to \(\mu _k\), \(\sigma _{k}^{2}\) and \(\beta \) (assuming \(K = 8\)). The R package coda (Plummer et al. 2006) is used to apply the test. As can be seen, all z-scores are within the symmetric 95% interval of the N(0,1) distribution around zero, given by (\(-1.96\), 1.96). This result is consistent with convergence of all the chains evaluated in the table.
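The idea behind the diagnostic is to compare the mean of an early segment of a chain with the mean of a late segment. The sketch below is a simplified Geweke-type z-score and not the coda implementation: it uses naive sample variances, which is only reasonable for weakly autocorrelated draws, whereas coda estimates the variances from the spectral density at frequency zero. The function name `geweke_z` and the default fractions (first 10% and last 50% of the chain) are assumptions made for illustration.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke-type z-score comparing early and late segment means."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    a = chain[: int(first * n)]          # early segment
    b = chain[n - int(last * n):]        # late segment
    # Naive variance of each segment mean (ignores autocorrelation).
    se2 = a.var(ddof=1) / a.size + b.var(ddof=1) / b.size
    return (a.mean() - b.mean()) / np.sqrt(se2)
```

A value of the z-score outside (\(-1.96\), 1.96) would suggest that the early and late parts of the chain have different means, i.e. that the chain has not yet stabilized.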

### Appendix D: comparative study

This section presents a comparative analysis of the highly expressed genome locations identified in the present RNA-Seq study and those from Mayrink and Gonçalves (2017), which used microarray data. The analysis is focused on Table 9, which shows the percentages of highly expressed genes (in our RNA-Seq study) located in a chromosome region near overexpressed locations from the microarray study. This strategy is used because the number of aligned locations is larger in the microarray case and the lists of locations are not the same in the two studies. The criterion to delimit the search neighbourhood is empirical. Let \(\epsilon _j\) be the median (less affected by outliers) of the distances between all genes (in the RNA-Seq study) aligned to chromosome *j*. The target regions are given by the locations of those genes \(\pm \epsilon _j\). The values in Table 9 are small (the highest percentage is \(19.355\%\)); however, the reader should note that the number of highly expressed genes is also small relative to the total number of genes under investigation. In addition, the specific types of tumors may not be the same in the comparison. As can be seen, all percentages are non-zero, indicating some agreement between the studies.
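The neighbourhood criterion can be sketched for a single chromosome as follows. The code is an illustration under stated assumptions: \(\epsilon _j\) is computed here as the median distance between consecutive genes (one possible reading of "distances between all genes"), and the function name `pct_near_microarray` and its arguments are hypothetical.

```python
import numpy as np

def pct_near_microarray(rnaseq_pos, rnaseq_high_pos, micro_over_pos):
    """Return (eps, percentage) for one chromosome.

    rnaseq_pos      : locations of all RNA-Seq genes on the chromosome
    rnaseq_high_pos : locations of the highly expressed RNA-Seq genes
    micro_over_pos  : overexpressed locations from the microarray study
    """
    pos = np.sort(np.asarray(rnaseq_pos, dtype=float))
    # Assumption: median of the gaps between consecutive genes.
    eps = np.median(np.diff(pos))
    high = np.asarray(rnaseq_high_pos, dtype=float)
    micro = np.asarray(micro_over_pos, dtype=float)
    # A gene counts as "near" if some microarray location lies in [g - eps, g + eps].
    near = [bool(np.any(np.abs(micro - g) <= eps)) for g in high]
    return eps, 100.0 * np.mean(near)
```

For example, with genes at 0, 10, 20, 40 and 100, the gaps are 10, 10, 20 and 60, so \(\epsilon _j = 15\); a highly expressed gene at 20 is then matched by a microarray location at 30, while a gene at 100 is not.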

### Appendix E: synthetic data analysis

In this section, we show some results using simulated data to confirm that the model performs well and that the corresponding code is correct. The synthetic gene expressions were generated from the model in Sect. 3 assuming \(K = 8\) mixture components. The true parameter values are: \(\beta _0 = 5.05\), \(\beta _1 = -9.55\), \(\mu = \{-11, -7, -5, -3, -2, 0, 1, 4\}\) and \(\sigma ^2 = \{2, 3, 2, 5, 2, 3, 4, 1\}\). The true \(q_0\) and \(q_k\) were chosen based on the specifications of \(r_0\) and *r* presented in Sect. 5; we use the specified vector divided by 2000. These parameter values produce synthetic expressions on the same scale as the real cancer data sets. The data are generated assuming 23 chromosomes and 1000 aligned genes per chromosome, i.e., the whole synthetic sequence has 23,000 genes. The distances between genes were generated from the \(\text{ Uniform }(0,1)\) distribution as a strategy to obtain distances of all magnitudes, ranging from low (\(\approx 0\)) to high (\(\approx 1\)), in this simulation study. The MCMC, described in Sect. 4.1, is run with the same configuration (number of iterations and priors) presented in Sect. 5.

Figure 13 shows the histogram of the synthetic data overlaid by the true (left panel) and the estimated (right panel) mixture densities and their components. As can be seen, the shapes and locations of each Gaussian component and the overall mixture density are quite similar in both panels. This is a strong indication that the model is working properly and estimating the parameters well. Another interesting result is the percentage of genes correctly detected as underexpressed (or overexpressed), relative to the total number of genes receiving the same classification from the model (criterion: posterior probability \(> 0.5\)). In this synthetic data analysis, we have observed \(83.67\%\) and \(74.11\%\) of correct low and high expression detection, respectively.
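The percentage of correct detection described above is the share of flagged genes whose true component matches the flagged one. A minimal sketch of this computation is given below; the function name `pct_correct_detection` and the argument names are hypothetical, and the true component labels are only available for synthetic data.

```python
import numpy as np

def pct_correct_detection(post_prob, true_component, component, threshold=0.5):
    """Percentage of flagged genes that truly belong to `component`.

    post_prob      : posterior probability of each gene belonging to `component`
    true_component : true component label of each synthetic gene
    """
    flagged = np.asarray(post_prob, dtype=float) > threshold
    correct = flagged & (np.asarray(true_component) == component)
    # Denominator: genes receiving this classification from the model.
    return 100.0 * correct.sum() / flagged.sum()
```

Note that the denominator is the number of genes flagged by the model, not the number of genes truly generated from the component, so the measure penalizes false detections rather than missed ones.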

### Appendix F: exploring the size of the clusters

This section shows an additional result connected to Panels (a) and (b) in Fig. 6. The barplots in Fig. 14 show the frequencies of the different sizes of clusters of consecutive locations detected as atypical. For each location, we evaluate the posterior probability of belonging to the lower/upper Gaussian component to categorize it as atypical. As can be seen, many underexpressed genes are found isolated, without a neighbour with the same classification. The overexpressed case does not show the same high concentration in the category “cluster of size 1”.
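The cluster sizes are the lengths of runs of consecutive atypical locations. The following sketch counts them for one ordered sequence of 0/1 flags; the function name `cluster_size_counts` is hypothetical, and applying it separately to each chromosome (so that runs do not cross chromosome boundaries) is an assumption.

```python
from collections import Counter

def cluster_size_counts(atypical):
    """Map each run length of consecutive atypical flags to its frequency."""
    counts = Counter()
    run = 0
    for flag in atypical:
        if flag:
            run += 1             # extend the current cluster
        elif run:
            counts[run] += 1     # a cluster has just ended
            run = 0
    if run:
        counts[run] += 1         # cluster reaching the end of the sequence
    return dict(counts)
```

For instance, the flags `[1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]` contain two clusters of size 1, one of size 2 and one of size 3.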

### Appendix G: effect of the Markov dependence

This section provides empirical evidence to justify the behaviour observed for the upper Gaussian component in Fig. 4. A visual analysis reveals that some observations, with high expressions in the right tail of the histograms, are accommodated by adjacent Gaussian components. In order to justify such behaviour, we analyse all genes with expressions in (0, 5), which is where the upper Gaussian component concentrates most of its mass. To simplify the notation, denote by \(G_8\) the set of genes allocated to the upper Gaussian and by \(G_\bullet \) the remaining ones. As described in Sect. 4.2, a gene is allocated to the upper Gaussian if its posterior probability of belonging to this component is greater than 0.5.

The boxplots in the first row of Fig. 15 compare two sets: \(\text {D}(\bullet ,8)\), which includes the distances between genes in \(G_\bullet \) and their nearest neighbour in \(G_8\), and \(\text {D}(8,8)\), which includes the distances between the members of \(G_8\) and their nearest neighbour in \(G_8\). As expected from our conjecture, the values in \(\text {D}(8,8)\) tend to be smaller than those in \(\text {D}(\bullet ,8)\). The second row of boxplots shows a similar comparison involving the smallest distances of \(\text {D}(8,8)\) and \(\text {D}(\bullet ,8)\). In this case, for each data set, both boxplots use the same number of smallest observations, namely the number of genes for which \(\text {D}(8,8)\) is smaller than 1. If we combine all four data sets, \(92.61\%\) of the values in \(\text {D}(8,8)\) are lower than 1 (188 out of 203 distances). The graphs show that most of the smallest distances in \(G_\bullet \) are larger than the smallest distances in \(G_8\), which again supports our conjecture. In fact, the \(92.61\%\) quantile of the four combined \(\text {D}(8,8)\) sets is 0.690, i.e. the largest distance smaller than 1 is 0.69, whereas the corresponding quantile for the four combined \(\text {D}(\bullet ,8)\) sets is 2461.779. The histogram in Fig. 15 shows the distribution of the expressions of the nearest neighbours of genes that have expressions higher than 5 but were not allocated to the upper Gaussian. The histogram indicates that most of these expressions (\(82.64\%\)) are lower than zero. Furthermore, a deeper investigation reveals that all of those expressions belong to genes not allocated to the upper Gaussian component. As stated in our conjecture, this is a consequence of the Markov dependence in the model, which uses the moderate expressions in the neighbourhood as evidence against allocating a gene with high expression to the upper Gaussian component.

The small noise in this analysis is explained by the stochastic noise from the model fitting and by the fact that we considered only the single nearest neighbour when, in fact, the dependence modelled by the Markov structure involves the whole neighbourhood of a gene.
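The two distance sets compared in this appendix can be computed from gene locations and an allocation flag. The sketch below is an illustration for a single chromosome under assumptions not stated in the source: locations are scalar positions, distance is the absolute difference of positions, and at least two genes belong to \(G_8\). The function name `nearest_g8_distances` is hypothetical.

```python
import numpy as np

def nearest_g8_distances(pos, in_g8):
    """Return (D(rest, 8), D(8, 8)) for one chromosome.

    pos   : gene locations
    in_g8 : True for genes allocated to the upper Gaussian component
    """
    pos = np.asarray(pos, dtype=float)
    in_g8 = np.asarray(in_g8, dtype=bool)
    g8 = pos[in_g8]
    rest = pos[~in_g8]
    # Distance from each remaining gene to its nearest gene in G_8.
    d_rest = np.abs(rest[:, None] - g8[None, :]).min(axis=1)
    # Distance from each gene in G_8 to its nearest *other* gene in G_8.
    pairwise = np.abs(g8[:, None] - g8[None, :])
    np.fill_diagonal(pairwise, np.inf)
    d_g8 = pairwise.min(axis=1)
    return d_rest, d_g8
```

The pairwise distance matrices make the sketch quadratic in the number of genes, which is acceptable for illustration; a sorted search would be preferable for long chromosomes.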

## About this article

### Cite this article

Mayrink, V.D., Gonçalves, F.B. Identifying atypically expressed chromosome regions using RNA-Seq data. *Stat Methods Appl* **29**, 619–649 (2020). https://doi.org/10.1007/s10260-019-00496-4

### Keywords

- Bayesian inference
- Mixture model
- Gibbs sampling
- Gene expression
- Cancer