Identifying atypically expressed chromosome regions using RNA-Seq data

  • Vinícius Diniz Mayrink
  • Flávio B. Gonçalves
Original Paper


The number of studies analyzing RNA-Seq data has grown rapidly in recent years, making this technology for measuring gene expression a strong competitor to DNA microarrays. This paper proposes a Bayesian model to detect underexpressed and overexpressed chromosome regions using RNA-Seq data. The methodology builds on recent work designed to detect overexpressed regions in microarray data. A hidden Markov model is developed from a mixture of Gaussian distributions with ordered means, so that the first and last mixture components accommodate the underexpressed and overexpressed genes, respectively. By assuming a hierarchical Markov dependence structure, the model is flexible enough to handle the highly irregular spacing of the data efficiently. The analysis of four cancer data sets (breast, lung, ovarian and uterus) is presented. Results indicate that the proposed model is selective in determining the expression status, is robust to prior specifications, and provides tools for a global or local search of underexpressed and overexpressed chromosome regions.
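
As a rough illustration of the mixture structure described above (not the authors' model or code), the following Python sketch simulates expression values along a chromosome from a Gaussian mixture with ordered means and a hidden state that tends to persist between nearby genes. The number of components (three), the parameter values, the function name `simulate_expression`, and the exponential distance rule for state persistence are all assumptions made for illustration only.

```python
import math
import random


def simulate_expression(positions, means, sd, weights, scale, seed=0):
    """Simulate hidden states and expression values along a chromosome.

    Hypothetical sketch: component 0 plays the role of "underexpressed",
    the last component "overexpressed", and the ones in between "typical".
    """
    if list(means) != sorted(means):
        raise ValueError("component means must be ordered")
    rng = random.Random(seed)
    k = len(means)
    states, values = [], []
    for i, pos in enumerate(positions):
        if i == 0:
            keep = False
        else:
            gap = pos - positions[i - 1]
            # Assumed rule: the closer two genes are, the more likely
            # they share the same hidden state.
            keep = rng.random() < math.exp(-gap / scale)
        if keep:
            state = states[-1]
        else:
            # Otherwise draw a fresh state from the mixture weights.
            state = rng.choices(range(k), weights=weights)[0]
        states.append(state)
        values.append(rng.gauss(means[state], sd))
    return states, values


# Irregularly spaced gene positions (made-up numbers).
positions = [100, 150, 400, 5000, 5050, 5200, 20000]
states, values = simulate_expression(
    positions, means=(-2.0, 0.0, 2.0), sd=0.5,
    weights=(0.2, 0.6, 0.2), scale=1000.0, seed=1)
labels = ["under", "typical", "over"]
for pos, s, v in zip(positions, states, values):
    print(pos, labels[s], round(v, 2))
```

The ordering check on `means` mirrors the identifiability role that ordered means play in the description: it is what lets the first and last components be read as the under- and overexpressed groups.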


Keywords: Bayesian inference; Mixture model; Gibbs sampling; Gene expression; Cancer



The authors would like to thank an anonymous referee for constructive comments to improve this work.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Departamento de Estatística, ICEx, Universidade Federal de Minas Gerais, Av. Antônio Carlos, Belo Horizonte, Brazil
