Computational Analysis of RNA-seq

  • Scott A. Givan
  • Christopher A. Bottoms
  • William G. Spollen
Part of the Methods in Molecular Biology book series (MIMB, volume 883)


Using High-Throughput DNA Sequencing (HTS) to examine gene expression is rapidly becoming a viable choice and is typically referred to as RNA-seq. Often the depth and breadth of coverage of RNA-seq data can exceed what is achievable using microarrays. However, the strengths of RNA-seq are often its greatest weaknesses. Accurately and comprehensively mapping millions of relatively short reads to a reference genome sequence can require not only specialized software, but also more structured and automated procedures to manage, analyze, and visualize the data. Additionally, the computational hardware required to efficiently process and store the data can be a necessary and often-overlooked component of a research plan. We discuss several aspects of the computational analysis of RNA-seq, including file management and data quality control, analysis, and visualization. We provide a framework for a standard nomenclature system that can facilitate automation and the ability to track data provenance. Finally, we provide a general workflow of the computational analysis of RNA-seq and a downloadable package of scripts to automate the processing.

Key words

High-throughput DNA sequencing; RNA-seq; Gene expression; Data processing



This work was supported by NSF grant 0701731 and a Missouri Life Sciences Trust Fund Research Grant.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Scott A. Givan (1)
  • Christopher A. Bottoms (2)
  • William G. Spollen (2)
  1. Department of Molecular Microbiology and Immunology, Informatics Research Core Facility, University of Missouri, Columbia, USA
  2. Informatics Research Core Facility, University of Missouri, Columbia, USA
