Analysis and Visualization of RNA-Seq Expression Data Using RStudio, Bioconductor, and Integrated Genome Browser
Sequencing costs are falling, but the cost of data analysis remains high, often because unforeseen problems arise, such as insufficient sequencing depth or batch effects. Experimenting with data analysis methods during the planning phase of an experiment can reveal unanticipated problems and build valuable bioinformatics expertise in the organism or process being studied. This protocol describes how to use R Markdown and RStudio, user-friendly tools for statistical analysis and reproducible research in bioinformatics, to analyze an example RNA-Seq data set from tomato pollen undergoing chronic heat stress and to document that analysis. We also show how to use Integrated Genome Browser to visualize read coverage graphs for differentially expressed genes. Working through the protocol with the provided data sets is a useful first step toward building RNA-Seq data analysis expertise in a research group.
Key words: Integrated genome browser; Tomato; Pollen; Visualization; RNA-Seq; R; Differential gene expression; edgeR
The example data set was from the Workshop in Next-Generation Sequencing (WiNGS), which was co-sponsored by the NSF Research Coordination Network on Integrative Pollen Biology (award 0955431), the NSF Plant Genome Research Program (award 1238051), and the Department of Bioinformatics and Genomics at UNC Charlotte. NIH R01 grant number 21737838 supports development of the IGB software.