Quality Control of RNA-Seq Experiments

  • Xing Li
  • Asha Nair
  • Shengqin Wang
  • Liguo Wang
Part of the Methods in Molecular Biology book series (MIMB, volume 1269)

Abstract

Direct sequencing of complementary DNA (cDNA) using high-throughput sequencing technologies (RNA-seq) is widely used and allows a more comprehensive understanding of the transcriptome than microarrays. In theory, RNA-seq should be able to precisely identify and quantify all RNA species, small or large, at low or high abundance. However, RNA-seq is a complicated, multistep process involving reverse transcription, amplification, fragmentation, purification, adaptor ligation, and sequencing. Improper handling at any of these steps can produce biased or even unusable data. Additionally, biases intrinsic to RNA-seq (such as GC bias and nucleotide composition bias) and transcriptome complexity can also degrade data quality. Therefore, comprehensive quality assessment is the first and most critical step for all downstream analyses and interpretation of results. This chapter discusses the most widely used quality control metrics, including sequence quality, sequencing depth, read duplication rates (clonal reads), alignment quality, nucleotide composition bias, PCR bias, GC bias, rRNA and mitochondrial contamination, and coverage uniformity.
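
As a rough illustration of three of the metrics named above (sequence quality, GC content, and read duplication rate), the following minimal Python sketch computes them from FASTQ records. It is not the chapter's method or the implementation of any particular tool. The function names `parse_fastq` and `qc_summary` are hypothetical, the input is assumed to be uncompressed four-line FASTQ with Phred+33 (Sanger) quality encoding, and duplicates are assumed to be reads with identical sequences.

```python
from collections import Counter
from io import StringIO


def parse_fastq(handle):
    """Yield (sequence, quality) pairs, assuming strict 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def qc_summary(handle, phred_offset=33):
    """Summarize basic QC metrics; phred_offset=33 assumes Sanger encoding."""
    n_reads = n_bases = gc_bases = qual_sum = 0
    seen = Counter()  # counts identical read sequences
    for seq, qual in parse_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        gc_bases += sum(base in "GCgc" for base in seq)
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
        seen[seq] += 1
    if n_bases == 0:
        raise ValueError("no reads found")
    return {
        "reads": n_reads,
        "mean_phred": qual_sum / n_bases,
        "gc_fraction": gc_bases / n_bases,
        # fraction of reads that repeat an already-seen sequence
        "duplication_rate": 1 - len(seen) / n_reads,
    }


# Toy input: three reads, two of them identical.
toy = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGCC\n+\n!!!!\n")
summary = qc_summary(toy)
print(summary)
```

Sequence identity is a crude proxy for clonal reads. Highly expressed transcripts legitimately produce identical reads, so a high duplication rate from this sketch would need to be interpreted alongside alignment positions and expression levels.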

Key words

Quality control, RNA-seq, High-throughput sequencing, Next-generation sequencing

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, USA
  2. School of Biological Science and Medical Engineering, Southeast University, Nanjing, P.R. China
