Abstract
RNA-Seq is now routinely used to measure gene expression. As the technique has matured over the last decade, so have the dedicated analytic tools. In this chapter, we first describe the mainstream as well as the most up-to-date protocols and their implications for downstream analysis. We then detail the steps involved in RNA-Seq analysis in three main stages: (i) preprocessing and data preparation, (ii) upstream processing, and (iii) high-level analyses. We review the most recent and relevant tools as a single workflow in stepwise order and cover their features in depth. The required code and the underlying statistics are provided throughout the chapter. We illustrate these steps with an analysis of publicly available RNA-Seq data.
Acknowledgment
This work was supported by the Max Planck Society and the German Cancer Research Centre (DKFZ).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Pavlovich, P.V., Cauchy, P. (2022). Sequences to Differences in Gene Expression: Analysis of RNA-Seq Data. In: Christian, S.L. (eds) Cancer Cell Biology. Methods in Molecular Biology, vol 2508. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2376-3_20
DOI: https://doi.org/10.1007/978-1-0716-2376-3_20
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2375-6
Online ISBN: 978-1-0716-2376-3
eBook Packages: Springer Protocols