Analysis of RNA-Seq Data Using TopHat and Cufflinks

Part of the Methods in Molecular Biology book series (MIMB, volume 1374)

Abstract

Recent advances in high-throughput RNA sequencing (RNA-Seq) make it possible to generate huge amounts of data from a single sample in a very short time. These data have required the parallel development of computing tools that organize and interpret them meaningfully in terms of their biological implications while using minimal computing resources to keep computation costs low. Here we describe a method for analyzing RNA-Seq data with the open-source programs of the Tuxedo suite: TopHat and Cufflinks. TopHat aligns RNA-Seq reads to a reference genome, while Cufflinks assembles the mapped reads into possible transcripts and then generates a final transcriptome assembly. Cufflinks also includes Cuffdiff, which accepts the assembled reads from two or more biological conditions and analyzes the differential expression of genes and transcripts, thus aiding the investigation of transcriptional and post-transcriptional regulation under different conditions. We also describe the use of an accessory tool, CummeRbund, which processes the output files of Cuffdiff and produces publication-quality plots and figures of the user's choice. We demonstrate the effectiveness of the Tuxedo suite by analyzing RNA-Seq datasets from Arabidopsis thaliana roots subjected to two different conditions.
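The order of the tools described above can be sketched as a small Python helper that only builds the command lines and does not run them. This is an illustrative sketch and not the chapter's own protocol: the function name, sample labels, and file paths are hypothetical, and the options shown (-p, -G, -o, -g, -s, -b, -L, -u) follow commonly documented usage of the Tuxedo tools, so they should be checked against the installed versions.

```python
def build_tuxedo_commands(samples, genome_index, genome_fasta, annotation_gtf,
                          out_root="tuxedo_out", threads=4):
    """Return (commands, manifest_lines) for a TopHat/Cufflinks/Cuffmerge/Cuffdiff run.

    samples maps a condition label to a list of FASTQ paths, one per replicate.
    Nothing is executed; each command is an argument list suitable for
    subprocess.run. All names and paths here are hypothetical placeholders.
    """
    commands = []
    manifest_lines = []     # per-sample assemblies, to be listed in the cuffmerge manifest
    bams_by_condition = {}  # condition label -> TopHat BAM files of its replicates

    for condition, fastqs in samples.items():
        for rep, fastq in enumerate(fastqs, start=1):
            name = f"{condition}_rep{rep}"
            tophat_dir = f"{out_root}/tophat_{name}"
            cufflinks_dir = f"{out_root}/cufflinks_{name}"
            bam = f"{tophat_dir}/accepted_hits.bam"

            # Step 1: align the reads of this replicate to the reference genome.
            commands.append(["tophat", "-p", str(threads), "-G", annotation_gtf,
                             "-o", tophat_dir, genome_index, fastq])
            # Step 2: assemble the mapped reads into candidate transcripts.
            commands.append(["cufflinks", "-p", str(threads),
                             "-o", cufflinks_dir, bam])

            manifest_lines.append(f"{cufflinks_dir}/transcripts.gtf")
            bams_by_condition.setdefault(condition, []).append(bam)

    # Step 3: merge the per-sample assemblies into one transcriptome assembly.
    # The manifest file must contain manifest_lines, one path per line.
    merged_dir = f"{out_root}/merged_asm"
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", str(threads), "-o", merged_dir,
                     f"{out_root}/assemblies.txt"])

    # Step 4: test genes and transcripts for differential expression.
    # Replicates of one condition are comma-joined; conditions are separate arguments.
    labels = list(bams_by_condition)
    commands.append(["cuffdiff", "-o", f"{out_root}/diff_out", "-b", genome_fasta,
                     "-p", str(threads), "-L", ",".join(labels), "-u",
                     f"{merged_dir}/merged.gtf"]
                    + [",".join(bams_by_condition[label]) for label in labels])
    return commands, manifest_lines
```

Each returned list could be passed to subprocess.run once the manifest file has been written. CummeRbund is an R package that reads the Cuffdiff output directory, so it is not part of this Python sketch.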

Key words

RNA-seq, Bowtie, TopHat, Cufflinks, Cuffmerge, Cuffcompare, Cuffdiff, CummeRbund, Differential gene expression, Transcriptome assembly

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Biotechnology, National Institute of Technology, Durgapur, India
  2. School of Plant Biology, University of Western Australia, Crawley, Perth, Australia
