De Novo Plant Transcriptome Assembly and Annotation Using Illumina RNA-Seq Reads

  • Stephanie C. Kerr
  • Federico Gaiti
  • Milos Tanurdzic
Part of the Methods in Molecular Biology book series (MIMB, volume 1933)


The ability to identify and quantify transcribed sequences from a multitude of organisms using high-throughput RNA sequencing has revolutionized our understanding of genetics and plant biology. However, a number of computational tools used in these analyses still require a reference genome sequence, which is seldom available for non-model organisms. Computational tools that employ de Bruijn graphs to reconstruct full-length transcripts from short sequence reads allow for de novo transcriptome assembly. Here we provide detailed methods for generating and annotating a de novo transcriptome assembly from plant RNA-seq data.

Key words

RNA-seq, Long noncoding RNA, Trinity, De novo transcriptome assembly

Supplementary material
Supplementary File 1: The Perl script used in Subheading 3.5, step 5 for identifying the longest open reading frame from each transcript present in your transcriptome assembly (PL 1 kb)
Supplementary File 2: The bash script used in Subheading 3.5, step 9 (SH 4 kb)
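
The supplementary Perl script is not reproduced on this page. As a rough illustration of the kind of task Supplementary File 1 describes (picking the longest open reading frame per transcript), the following is a minimal Python sketch, not the authors' script. It rests on several assumptions that are not taken from the chapter: an ORF is defined as an ATG followed by the first in-frame stop codon, only the three forward-strand frames are scanned, and the function names longest_orf, read_fasta, and longest_orfs are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG..stop ORF (stop codon included) found in the
    three forward frames of seq, or "" if no complete ORF exists.
    Assumption: the reverse strand is not searched in this sketch."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close this ORF and look for the next ATG
    return best


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines.
    The name is the first whitespace-delimited token of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_orfs(lines):
    """Map each transcript name to its longest forward-strand ORF."""
    return {name: longest_orf(seq) for name, seq in read_fasta(lines)}
```

A call such as longest_orfs(open("assembly.fasta")) would return a dictionary keyed by transcript name, where assembly.fasta is a placeholder file name. A real pipeline would usually also scan the reverse complement and apply a minimum ORF length.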



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Stephanie C. Kerr (1)
  • Federico Gaiti (2)
  • Milos Tanurdzic (1, corresponding author)
  1. School of Biological Sciences, The University of Queensland, St Lucia, Australia
  2. New York Genome Center and Department of Medicine, Weill Cornell Medicine, New York, USA
