How to Analyze Gene Expression Using RNA-Sequencing Data

  • Daniel Ramsköld
  • Ersen Kavak
  • Rickard Sandberg
Part of the Methods in Molecular Biology book series (MIMB, volume 802)


RNA-Seq is emerging as a powerful method for transcriptome analysis that will eventually make microarrays obsolete for gene expression analysis. Improvements in high-throughput sequencing and efficient sample barcoding now enable tens of samples to be run cost-effectively, competing with microarrays on price and exceeding them in performance. Still, most studies use microarrays, partly because of the ease of data analysis with programs and modules that quickly turn raw microarray data into spreadsheets of gene expression values and lists of significantly differentially expressed genes. In contrast, RNA-Seq data analysis is still in its infancy: researchers face new challenges and have to combine different tools to carry out an analysis. In this chapter, we provide a tutorial on RNA-Seq data analysis that enables researchers to quantify gene expression, identify splice junctions, and find novel transcripts using publicly available software. We focus on analyses performed in organisms for which a reference genome is available, and we discuss issues with current methodology that have to be solved before RNA-Seq can reach its full potential.

Key words

RNA-Seq, Genomics, Tutorial



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Daniel Ramsköld (1)
  • Ersen Kavak (1)
  • Rickard Sandberg (1)
  1. Department of Cell and Molecular Biology, Karolinska Institutet and Ludwig Institute for Cancer Research, Stockholm, Sweden
