Transcriptomics and RNA-Seq Data Analysis

  • Xuhua Xia


High-throughput sequencing data (HTS) has been used in detecting not only differential gene expression, but also alternative splicing events and different transcription start and termination sites. It has also been used in ribosome profiling for characterizing translation efficiency and Hi-C method for constructing genome 3-D architecture. There are two main difficulties in analyzing HTS data: the large file size, often in terabytes, and the allocation of reads to paralogous genes which impacts the accuracy of computed RPKM values, especially for multicellular eukaryotes with many paralogues. This chapter provides a conceptual framework for analyzing HTS data and offers numerical illustrations of solutions to both problems mentioned above. It includes examples from real data on how to compare performance of different software packages.


  1. Abolbaghaei A, Silke JR, Xia X (2017) How changes in anti-SD sequences would affect SD sequences in Escherichia coli and Bacillus subtilis. G3 (Bethesda, Md) 7(5):1607–1615CrossRefGoogle Scholar
  2. Abraham JM, Feagin JE, Stuart K (1988) Characterization of cytochrome c oxidase III transcripts that are edited only in the 3′ region. Cell 55(2):267–272CrossRefPubMedGoogle Scholar
  3. Alatortsev VS, Cruz-Reyes J, Zhelonkina AG, Sollner-Webb B (2008) Trypanosoma brucei RNA editing: coupled cycles of U deletion reveal processive activity of the editing complex. Mol Cell Biol 28(7):2437–2445CrossRefPubMedPubMedCentralGoogle Scholar
  4. Arava Y, Wang Y, Storey JD, Liu CL, Brown PO, Herschlag D (2003) Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl Acad Sci USA 100(7):3889–3894CrossRefGoogle Scholar
  5. Arvaniti E, Moulos P, Vakrakou A, Chatziantoniou C, Chadjichristos C, Kavvadas P, Charonis A, Politis PK (2016) Whole-transcriptome analysis of UUO mouse model of renal fibrosis reveals new molecular players in kidney diseases. Sci Rep 6:26235CrossRefPubMedPubMedCentralGoogle Scholar
  6. Awan AR, Manfredo A, Pleiss JA (2013) Lariat sequencing in a unicellular yeast identifies regulated alternative splicing of exons that are evolutionarily conserved with humans. Proc Natl Acad Sci USA 110(31):12762–12767CrossRefPubMedGoogle Scholar
  7. Bell D, Bell AH, Bondaruk J, Hanna EY, Weber RS (2016) In-depth characterization of the salivary adenoid cystic carcinoma transcriptome with emphasis on dominant cell type. Cancer 122(10):1513–1522CrossRefPubMedGoogle Scholar
  8. Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G (2015) Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform 16:288CrossRefGoogle Scholar
  9. Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Adiconis X, Maguire J, Johnson LA, Robinson J, Verhaak RG, Sougnez C et al (2010) Integrative analysis of the melanoma transcriptome. Genome Res 20(4):413–427CrossRefPubMedPubMedCentralGoogle Scholar
  10. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE et al (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146):799–816CrossRefGoogle Scholar
  11. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC et al (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29(4):365–371CrossRefPubMedGoogle Scholar
  12. Deng Q, Ramskold D, Reinius B, Sandberg R (2014a) Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343(6167):193–196CrossRefPubMedGoogle Scholar
  13. Diehn M, Eisen MB, Botstein D, Brown PO (2000) Large-scale identification of secreted and membrane-associated gene products using DNA microarrays. Nat Genet 25(1):58–62CrossRefPubMedGoogle Scholar
  14. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21CrossRefPubMedGoogle Scholar
  15. Epstein CB, Butow RA (2000) Microarray technology – enhanced versatility, persistent challenge. Curr Opin Biotechnol 11(1):36–41CrossRefPubMedGoogle Scholar
  16. Furukawa R, Hachiya T, Ohmomo H, Shiwa Y, Ono K, Suzuki S, Satoh M, Hitomi J, Sobue K, Shimizu A (2016) Intraindividual dynamics of transcriptome and genome-wide stability of DNA methylation. Sci Rep 6:26424CrossRefPubMedPubMedCentralGoogle Scholar
  17. Gaasterland T, Bekiranov S (2000) Making the most of microarray data [news]. Nat Genet 24(3):204–206CrossRefPubMedGoogle Scholar
  18. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321CrossRefGoogle Scholar
  19. Haustead DJ, Stevenson A, Saxena V, Marriage F, Firth M, Silla R, Martin L, Adcroft KF, Rea S, Day PJ et al (2016) Transcriptome analysis of human ageing in male skin shows mid-life period of variability and central role of NF-kappaB. Sci Rep 6:26846CrossRefPubMedPubMedCentralGoogle Scholar
  20. Heath JR, Ribas A, Mischel PS (2016) Single-cell analysis tools for drug discovery and development. Nat Rev Drug Discov 15(3):204–216PubMedPubMedCentralCrossRefGoogle Scholar
  21. Ingolia NT (2010) Genome-wide translational profiling by ribosome footprinting. Methods Enzymol 470:119–142CrossRefGoogle Scholar
  22. Ingolia NT (2014) Ribosome profiling: new views of translation, from single codons to genome scale. Nat Rev Genet 15(3):205–213CrossRefGoogle Scholar
  23. Ingolia NT (2016) Ribosome footprint profiling of translation throughout the Genome. Cell 165(1):22–33PubMedPubMedCentralCrossRefGoogle Scholar
  24. Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324(5924):218–223PubMedPubMedCentralCrossRefGoogle Scholar
  25. Janin L, Schulz-Trieglaff O, Cox AJ (2014) BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics 30(19):2796–2801CrossRefPubMedGoogle Scholar
  26. Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 9(4):286–298CrossRefPubMedGoogle Scholar
  27. Kawashima T, Douglass S, Gabunilas J, Pellegrini M, Chanfreau GF (2014) Widespread use of non-productive alternative splice sites in Saccharomyces cerevisiae. PLoS Genet 10(4):e1004249CrossRefPubMedPubMedCentralGoogle Scholar
  28. Kingsford C, Patro R (2015) Reference-based compression of short-read sequences using path encoding. Bioinformatics 31(12):1920–1928CrossRefPubMedPubMedCentralGoogle Scholar
  29. Kodama Y, Shumway M, Leinonen R (2012) The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res 40(Database issue):D54–D56CrossRefPubMedGoogle Scholar
  30. Lamond AI (1988) RNA editing and the mysterious undercover genes of trypanosomatid mitochondria. Trends Biochem Sci 13(8):283–284CrossRefPubMedGoogle Scholar
  31. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359CrossRefPubMedPubMedCentralGoogle Scholar
  32. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009a) Searching for SNPs with cloud computing. Genome Biol 10(11):R134CrossRefPubMedPubMedCentralGoogle Scholar
  33. Langmead B, Trapnell C, Pop M, Salzberg SL (2009b) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25CrossRefPubMedPubMedCentralGoogle Scholar
  34. Langmead B, Hansen KD, Leek JT (2010) Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol 11(8):R83CrossRefPubMedPubMedCentralGoogle Scholar
  35. Leinonen R, Sugawara H, Shumway M (2011) The sequence read archive. Nucleic Acids Res 39(Database):D19–D21CrossRefPubMedGoogle Scholar
  36. Li F, Ge P, Hui WH, Atanasov I, Rogers K, Guo Q, Osato D, Falick AM, Zhou ZH, Simpson L (2009) Structure of the core editing complex (L-complex) involved in uridine insertion/deletion RNA editing in trypanosomatid mitochondria. Proc Natl Acad Sci U S A 106(30):12306–12310CrossRefPubMedPubMedCentralGoogle Scholar
  37. Liu X, Jiang H, Gu Z, Roberts JW (2013) High-resolution view of bacteriophage lambda gene expression by ribosome profiling. Proc Natl Acad Sci U S A 110(29):11928–11933PubMedPubMedCentralCrossRefGoogle Scholar
  38. Livesey R (2002) Have microarrays failed to deliver for developmental biology? Genome Biol 3(9):comment2009CrossRefGoogle Scholar
  39. MacKay VL, Li X, Flory MR, Turcott E, Law GL, Serikawa KA, Xu XL, Lee H, Goodlett DR, Aebersold R et al (2004) Gene expression analyzed by high-resolution state array analysis and quantitative proteomics: response of yeast to mating pheromone. Mol Cell Proteomics 3(5):478–489CrossRefGoogle Scholar
  40. Mlera L, Lam J, Offerdahl DK, Martens C, Sturdevant D, Turner CV, Porcella SF, Bloom ME (2016) Transcriptome analysis reveals a signature profile for tick-borne Flavivirus persistence in HEK 293T cells. MBio 7(3):e00314–e00316CrossRefPubMedPubMedCentralGoogle Scholar
  41. Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M (2008a) Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. BioTechniques 45(1):81–94CrossRefPubMedGoogle Scholar
  42. Morin RD, O’Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M et al (2008b) Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 18(4):610–621CrossRefPubMedPubMedCentralGoogle Scholar
  43. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628CrossRefGoogle Scholar
  44. Nicolae M, Pathak S, Rajasekaran S (2015) LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31(20):3276–3281CrossRefPubMedPubMedCentralGoogle Scholar
  45. Numanagic I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, Mattavelli M, Sahinalp SC (2016) Comparison of high-throughput sequencing data compression tools. Nat Methods 13(12):1005–1008CrossRefPubMedGoogle Scholar
  46. Pleiss JA, Whitworth GB, Bergkessel M, Guthrie C (2007) Rapid, transcript-specific changes in splicing in response to environmental stress. Mol Cell 27(6):928–937CrossRefPubMedPubMedCentralGoogle Scholar
  47. Pobre V, Arraiano CM (2015) Next generation sequencing analysis reveals that the ribonucleases RNase II, RNase R and PNPase affect bacterial motility and biofilm formation in E. coli. BMC Genomics 16:72CrossRefPubMedPubMedCentralGoogle Scholar
  48. Rahi SJ, Pecani K, Ondracka A, Oikonomou C, Cross FR (2016) The CDK-APC/C oscillator predominantly entrains periodic cell-cycle transcription. Cell 165(2):475–487CrossRefPubMedPubMedCentralGoogle Scholar
  49. Roberts A, Pachter L (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10(1):71–73CrossRefPubMedGoogle Scholar
  50. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol 12(3):R22CrossRefPubMedPubMedCentralGoogle Scholar
  51. Roberts A, Feng H, Pachter L (2013a) Fragment assignment in the cloud with eXpress-D. BMC Bioinform 14:358CrossRefGoogle Scholar
  52. Roberts A, Schaeffer L, Pachter L (2013b) Updating RNA-Seq analyses after re-annotation. Bioinformatics 29(13):1631–1637CrossRefPubMedPubMedCentralGoogle Scholar
  53. Rogers MF, Thomas J, Reddy AS, Ben-Hur A (2012) SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data. Genome Biol 13(1):R4CrossRefPubMedPubMedCentralGoogle Scholar
  54. Rogozin IB, Managadze D, Shabalina SA, Koonin EV (2014) Gene family level comparative analysis of gene expression in mammals validates the ortholog conjecture. Genome Biol Evol 6(4):754–762CrossRefPubMedPubMedCentralGoogle Scholar
  55. Saadatpour A, Lai S, Guo G, Yuan GC (2015) Single-cell analysis in cancer genomics. Trends Genet 31(10):576–586PubMedPubMedCentralCrossRefGoogle Scholar
  56. Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE (2002) Using the transcriptome to annotate the genome. Nat Biotechnol 20(5):508–512CrossRefGoogle Scholar
  57. Schena M (1996) Genome analysis with gene expression microarrays. BioEssays 18(5):427–431PubMedPubMedCentralCrossRefGoogle Scholar
  58. Schena M (2003) Microarray analysis. Wiley-Liss, New YorkGoogle Scholar
  59. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235):467–470CrossRefGoogle Scholar
  60. Schena M, Heller RA, Theriault TP, Konrad K, Lachenmeier E, Davis RW (1998) Microarrays: biotechnology’s discovery platform for functional genomics [see comments]. Trends Biotechnol 16(7):301–306CrossRefPubMedGoogle Scholar
  61. Shoemaker DD, Schadt EE, Armour CD, He YD, Garrett-Engele P, McDonagh PD, Loerch PM, Leonardson A, Lum PY, Cavet G et al (2001) Experimental annotation of the human genome using microarray technology. Nature 409(6822):922–927CrossRefPubMedGoogle Scholar
  62. Simpson RM, Bruno AE, Bard JE, Buck MJ, Read LK (2016) High-throughput sequencing of partially edited trypanosome mRNAs reveals barriers to editing progression and evidence for alternative editing. RNA 22(5):677–695CrossRefPubMedPubMedCentralGoogle Scholar
  63. Smircich P, Eastman G, Bispo S, Duhagon MA, Guerra-Slompo EP, Garat B, Goldenberg S, Munroe DJ, Dallagiovanna B, Holetz F et al (2015) Ribosome profiling reveals translation control as a key mechanism generating differential gene expression in Trypanosoma cruzi. BMC Genomics 16:443PubMedPubMedCentralCrossRefGoogle Scholar
  64. Stepankiw N, Raghavan M, Fogarty EA, Grimson A, Pleiss JA (2015) Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic Acids Res 43(17):8488–8501CrossRefPubMedPubMedCentralGoogle Scholar
  65. Team GE (2011) Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biol 12(3):402Google Scholar
  66. Tjaden B (2015) De novo assembly of bacterial transcriptomes from RNA-seq data. Genome Biol 16:1CrossRefPubMedPubMedCentralGoogle Scholar
  67. Trapnell C (2015) Defining cell types and states with single-cell genomics. Genome Res 25(10):1491–1498CrossRefPubMedPubMedCentralGoogle Scholar
  68. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111CrossRefPubMedPubMedCentralGoogle Scholar
  69. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515CrossRefPubMedPubMedCentralGoogle Scholar
  70. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578CrossRefPubMedPubMedCentralGoogle Scholar
  71. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31(1):46–53CrossRefPubMedGoogle Scholar
  72. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270(5235):484–487PubMedPubMedCentralCrossRefGoogle Scholar
  73. Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, Hieter P, Vogelstein B, Kinzler KW (1997) Characterization of the yeast transcriptome. Cell 88(2):243–251PubMedPubMedCentralCrossRefGoogle Scholar
  74. Velculescu VE, Madden SL, Zhang L, Lash AE, Yu J, Rago C, Lal A, Wang CJ, Beaudry GA, Ciriello KM et al (1999) Analysis of human transcriptomes. Nat Genet 23(4):387–388CrossRefPubMedGoogle Scholar
  75. Velculescu VE, Vogelstein B, Kinzler KW (2000) Analysing uncharted transcriptomes with SAGE. Trends Genet 16(10):423–425CrossRefPubMedGoogle Scholar
  76. Vlasschaert C, Xia X, Gray DA (2016) Selection preserves Ubiquitin Specific Protease 4 alternative exon skipping in therian mammals. Sci Rep 6:20039CrossRefPubMedPubMedCentralGoogle Scholar
  77. Wei Y, Silke JR, Xia X (2017) Elucidating the 16S rRNA 3′ boundaries and defining optimal SD/aSD pairing in Escherichia coli and Bacillus subtilis using RNA-Seq data. Sci Rep.
  78. Wu J, Tzanakakis ES (2013) Deconstructing stem cell population heterogeneity: single-cell analysis and modeling approaches. Biotechnol Adv 31(7):1047–1062PubMedPubMedCentralCrossRefGoogle Scholar
  79. Xia X (2013) DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol 30:1720–1728PubMedPubMedCentralCrossRefGoogle Scholar
  80. Xia X (2017a) ARSDA: a new approach for storing, transmitting and analyzing transcriptomic data. G3: Genes|Genomes|Genetics. Scholar
  81. Xia X (2017c) DAMBE6: new tools for microbial genomics, phylogenetics and molecular evolution. J Hered 108(4):431–437. Scholar
  82. Xia X, MacKay V, Yao X, Wu J, Miura F, Ito T, Morris DR (2011) Translation initiation: a regulatory role for poly(A) tracts in front of the AUG codon in saccharomyces cerevisiae. Genetics 189(2):469–478CrossRefPubMedPubMedCentralGoogle Scholar
  83. Yoon JH, De S, Srikantan S, Abdelmohsen K, Grammatikakis I, Kim J, Kim KM, Noh JH, White EJ, Martindale JL et al (2014) PAR-CLIP analysis uncovers AUF1 impact on target RNA fate and genome integrity. Nat Commun 5:5248PubMedPubMedCentralCrossRefGoogle Scholar
  84. Zhu Z, Li L, Zhang Y, Yang Y, Yang X (2015a) CompMap: a reference-based compression program to speed up read mapping to related reference sequences. Bioinformatics 31(3):426–428CrossRefPubMedGoogle Scholar
  85. Zhu Z, Zhang Y, Ji Z, He S, Yang X (2015b) High-throughput DNA sequence data compression. Brief Bioinform 16(1):1–15CrossRefPubMedGoogle Scholar

Copyright information

© Springer Science+Business Media LLC 2018

Authors and Affiliations

  • Xuhua Xia
    • 1
  1. 1.University of Ottawa CAREG and Biology DepartmentOttawaCanada

Personalised recommendations