A Guide for Designing and Analyzing RNA-Seq Data

  • Protocol
Gene Expression Analysis

Part of the book series: Methods in Molecular Biology (MIMB, volume 1783)

Abstract

The identity of a cell or an organism is defined, at least in part, by its gene expression; analyzing gene expression therefore remains one of the most frequently performed experimental techniques in molecular biology. The development of RNA sequencing (RNA-Seq) provides an unprecedented opportunity to analyze the expression of protein-coding and noncoding RNAs and to perform de novo transcript assembly for a new species or organism. However, the planning and design of an RNA-Seq experiment have important implications for addressing the desired biological question and for maximizing the value of the data obtained. In addition, RNA-Seq generates a huge volume of data, and accurate analysis of these data involves several different steps and choices of tools. This can be challenging and overwhelming, especially for bench scientists. In this chapter, we describe an entire workflow for performing RNA-Seq experiments. We describe critical aspects of the wet lab work, such as RNA isolation and library preparation, as well as the initial design of an experiment. Further, we provide a step-by-step description of the bioinformatics workflow for RNA-Seq data analysis. This includes power calculations, setting up a computational environment, acquisition and processing of publicly available data if desired, quality control measures, preprocessing steps for the raw data, differential expression analysis, and data visualization. We highlight important considerations for each step to provide a guide for designing and analyzing RNA-Seq data.
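To make the normalization and differential expression steps named above more concrete, the following minimal Python sketch scales raw read counts to counts per million (CPM) and computes a log2 fold change between two groups. It is an illustration only and not the chapter's own pipeline: the function names, the toy count table, the sample grouping, and the pseudocount of 1 are all assumptions made for this example. A real analysis would rely on dedicated statistical packages (for example edgeR or DESeq2) that model count dispersion and test for significance.

```python
import math


def counts_per_million(counts):
    """Scale each sample's raw counts by its library size (total reads) times 1e6.

    `counts` maps gene name -> list of raw read counts, one entry per sample.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    # Library size = total number of counted reads in each sample.
    library_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [counts[g][j] / library_sizes[j] * 1e6 for j in range(n_samples)]
        for g in genes
    }


def log2_fold_change(cpm, control_idx, treated_idx, pseudocount=1.0):
    """Return log2(mean treated CPM / mean control CPM) for every gene.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    result = {}
    for gene, values in cpm.items():
        control_mean = sum(values[j] for j in control_idx) / len(control_idx)
        treated_mean = sum(values[j] for j in treated_idx) / len(treated_idx)
        result[gene] = math.log2(
            (treated_mean + pseudocount) / (control_mean + pseudocount)
        )
    return result


# Hypothetical toy data: two genes, two control samples then two treated samples.
toy_counts = {
    "geneA": [100, 120, 400, 380],
    "geneB": [900, 880, 600, 620],
}

cpm = counts_per_million(toy_counts)
lfc = log2_fold_change(cpm, control_idx=[0, 1], treated_idx=[2, 3])

for gene, value in lfc.items():
    print(f"{gene}: log2FC = {value:.2f}")
```

A fold change computed this way says nothing about statistical significance; with only a few replicates per group, the variance estimate is the hard part, which is why power calculations and replicate numbers are treated as part of the experimental design.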

Acknowledgments

A.C. and M.R.E. are grateful to the New Zealand Institute for Cancer Research Trust for supporting their respective positions.

Author information

Corresponding author

Correspondence to Aniruddha Chatterjee.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Chatterjee, A., Ahn, A., Rodger, E.J., Stockwell, P.A., Eccles, M.R. (2018). A Guide for Designing and Analyzing RNA-Seq Data. In: Raghavachari, N., Garcia-Reyero, N. (eds) Gene Expression Analysis. Methods in Molecular Biology, vol 1783. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7834-2_3

  • DOI: https://doi.org/10.1007/978-1-4939-7834-2_3

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-7833-5

  • Online ISBN: 978-1-4939-7834-2

  • eBook Packages: Springer Protocols
