Information and Statistical Analysis Pipeline for High-Throughput RNA Sequencing Data

  • Shinji Nakaoka
  • Keita Matsuyama
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 2109)

Abstract

RNA sequencing (RNA-seq) is now widely applied across many subfields of the life sciences. Extracting biological insight from the raw data generated by next-generation sequencers requires a pipeline for information processing and statistical analysis. In this chapter, we introduce a common pipeline for RNA-seq data. We also provide a collection of notes on related advanced topics that are useful when conducting such analyses in practice.

Keywords

Next-generation sequencer, RNA sequencing, Bioinformatics, Python

Notes

Acknowledgments

This work was supported by JST PRESTO Grant Number JPMJPR16E9 and by Japan Society for the Promotion of Science (JSPS) Grants-in-Aid (C) JP16K05265 and (S) JP15H05707. The authors are grateful to Ms. Mai Suganami for typesetting the references and correcting typos.

Copyright information

© Springer Science+Business Media New York 2019

Authors and Affiliations

  1. Faculty of Advanced Life Science, Hokkaido University, Sapporo, Japan
