Next Generation Microarray Bioinformatics pp 259-274
How to Analyze Gene Expression Using RNA-Sequencing Data
Abstract
RNA-Seq is emerging as a powerful method for transcriptome analysis that will eventually make microarrays obsolete for gene expression analysis. Improvements in high-throughput sequencing and efficient sample barcoding now enable tens of samples to be run cost-effectively, competing with microarrays in price and excelling in performance. Still, most studies use microarrays, partly because of the ease of data analysis with programs and modules that quickly turn raw microarray data into spreadsheets of gene expression values and significantly differentially expressed genes. In contrast, RNA-Seq data analysis is still in its infancy, and researchers face new challenges and have to combine different tools to carry out an analysis. In this chapter, we provide a tutorial on RNA-Seq data analysis to enable researchers to quantify gene expression, identify splice junctions, and find novel transcripts using publicly available software. We focus on analyses performed in organisms where a reference genome is available and discuss issues with current methodology that have to be solved before RNA-Seq data can be used to its full potential.
Key words
RNA-Seq, Genomics, Tutorial