Suitability of Illumina deep mRNA sequencing for reliable gene expression profiling in a non-model conifer species (Pseudotsuga menziesii)

Abstract

Pseudotsuga menziesii (Douglas-fir) is an ideal model system for studying the effect of local adaptation and intraspecific variation in transcriptome responses to the environment. Nonetheless, the lack of genomic resources and standardized microarray platforms for gene expression profiling has limited the testing of hypotheses on transcriptome organization and variation. Only recently has deep mRNA sequencing become a promising alternative for overcoming these limitations. However, information on the transcript abundance distribution is needed for unbiased gene expression profiling from mRNA sequencing data. Since this information is not available for adult conifer needle tissue, we inferred the transcript abundance distribution and tested the effect of sequencing depth on the reliable detection and quantification of transcripts from the needle tissue of 50-year-old Douglas-fir trees. We obtained a similar distribution of GO-slim categories in our mRNA-sequencing libraries and in previously published putative unique transcripts (PUTs) for Douglas-fir, which were used as the alignment reference. However, the GO-slim distribution in the Douglas-fir libraries and the Douglas-fir PUTs differed from the GO-slim distributions reported for deep mRNA sequencing libraries obtained from Arabidopsis thaliana leaf tissue. Several highly abundant PUTs associated with proteins involved in photosynthesis apparently limited the benefits of increased sequencing depth. Simulations and empirical data indicated that a 3-fold increase from 5 to 15 million aligned reads results in about twice the number of PUTs that surpass the threshold of 100 aligned reads used for robust transcript quantification.
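
The relationship between sequencing depth and the number of PUTs above the 100-read threshold can be illustrated with a minimal sketch. It is not the simulation used in the study: the log-normal abundance distribution, its parameters, the number of PUTs (30,000) and the function names are all assumptions made for illustration, and expected read counts are used in place of random sampling.

```python
import random


def simulate_abundances(n_puts, mu=0.0, sigma=2.0, seed=1):
    """Draw hypothetical relative transcript abundances.

    Assumption: abundances follow a log-normal distribution; the
    parameters are illustrative and not taken from the study.
    """
    rng = random.Random(seed)
    raw = [rng.lognormvariate(mu, sigma) for _ in range(n_puts)]
    total = sum(raw)
    # Normalize so that the fractions of all PUTs sum to 1.
    return [x / total for x in raw]


def puts_above_threshold(fractions, aligned_reads, threshold=100):
    """Count PUTs whose expected aligned read count exceeds the threshold."""
    return sum(1 for f in fractions if f * aligned_reads > threshold)


fractions = simulate_abundances(n_puts=30_000)
for depth in (5_000_000, 15_000_000):
    print(depth, puts_above_threshold(fractions, depth))
```

How strongly the count increases between the two depths depends on the shape of the assumed abundance distribution, which is why an empirical estimate of that distribution is needed before choosing a sequencing depth.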

Abbreviations

CA: Canada

CV: coefficient of variation

EST: expressed sequence tags

GEO: gene expression omnibus

GO: gene ontology

KDE: kernel density estimate

MAQC: Micro-Array Quality Control

Mreads: million reads

PUT: putative unique transcript

qPCR: quantitative polymerase chain reaction

References

  1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25

  2. Ausin I, Greenberg M, Simanshu D, Hale C, Vashisht A, Simon S, Lee T, Feng S, Española S, Meyers B et al (2012) INVOLVED IN DE NOVO 2-containing complex involved in RNA-directed DNA methylation in Arabidopsis. Proc Natl Acad Sci 109(22):8374–8381

  3. Bullard J, Purdom E, Hansen K, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinforma 11(1):94

  4. Cai G, Li H, Lu Y, Huang X, Lee J, Müller P, Ji Y, Liang S (2012) Accuracy of RNA-Seq and its dependence on sequencing depth. BMC Bioinforma 13(Suppl 13):S5

  5. Chang S, Puryear J, Cairney J (1993) A simple and efficient method for isolating RNA from pine trees. Plant Mol Biol Report 11(2):113–116

  6. Clark J, Brooksbank C, Lomax J (2005) It's all go for plant scientists. Plant Physiol 138(3):1268–1279

  7. Daines B, Wang H, Wang L, Li Y, Han Y, Emmert D, Gelbart W, Wang X, Li W, Gibbs R et al (2011) The Drosophila melanogaster transcriptome by paired-end RNA sequencing. Genome Res 21(2):315–324

  8. Draghici S, Khatri P, Eklund A, Szallasi Z (2006) Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 22(2):101

  9. Gan X, Stegle O, Behr J, Steffen J, Drewe P, Hildebrand K, Lyngsoe R, Schultheiss S, Osborne E, Sreedharan V et al (2011) Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477(7365):419–423

  10. Götz S, García-Gómez JM, Terol J et al (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 36:3420–3435

  11. Gugger P, Sugita S, Cavender-Bares J (2010) Phylogeography of Douglas-fir based on mitochondrial and chloroplast DNA sequences: testing hypotheses from the fossil record. Mol Ecol 19(9):1877–1897

  12. Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann S (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7(1):497

  13. Hermann R, Lavender D (1999) Douglas-fir planted forests. New For 17(1):53–70

  14. Holliday J, Ralph S, White R, Bohlmann J, Aitken S (2008) Global monitoring of autumn gene expression within and among phenotypically divergent populations of Sitka spruce (Picea sitchensis). New Phytol 178(1):103–122

  15. Howe GT, Yu J, Knaus B et al (2013) A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation. BMC Genomics 14:137

  16. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95

  17. Łabaj P, Leparc G, Linggi B, Markillie L, Wiley H, Kreil D (2011) Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 27(13):i383–i391

  18. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359

  19. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760

  20. Li P, Ponnala L, Gandotra N, Wang L, Si Y, Tausta S, Kebrom T, Provart N, Patel R, Myers C et al (2010) The developmental dynamics of the maize leaf transcriptome. Nat Genet 42(12):1060–1067

  21. Liu S, Lin L, Jiang P, Wang D, Xing Y (2011) A comparison of RNA-Seq and high-density exon array for detecting differential gene expression between closely related species. Nucleic Acids Res 39(2):578–588

  22. Lorenz W, Alba R, Yu Y, Bordeaux J, Simões M, Dean J (2011) Microarray analysis and scale-free gene networks identify candidate regulators in drought-stressed roots of Loblolly pine (P. taeda L.). BMC Genomics 12(1):264

  23. Mane S, Evans C, Cooper K, Crasta O, Folkerts O, Hutchison S, Harkins T, Thierry-Mieg D, Thierry-Mieg J, Jensen R (2009) Transcriptome sequencing of the microarray quality control (MAQC) RNA reference samples using next generation sequencing. BMC Genomics 10(1):264

  24. Marioni J, Mason C, Mane S, Stephens M, Gilad Y (2008) RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18(9):1509–1517

  25. McIntyre L, Lopiano K, Morse A, Amin V, Oberg A, Young L, Nuzhdin S (2011) RNA-Seq: technical variability and sampling. BMC Genomics 12(1):293

  26. Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628

  27. Müller T, Ensminger I, Schmid K (2012) A catalogue of putative unique transcripts from Douglas-fir (Pseudotsuga menziesii) based on 454 transcriptome sequencing of genetically diverse, drought stressed seedlings. BMC Genomics 13(1):673

  28. Ning Z, Cox A, Mullikin J (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11(10):1725–1729

  29. Oberg A, Bot B, Grill D, Poland G, Therneau T (2012) Technical and biological variance structure in mRNA-Seq data: life in the real world. BMC Genomics 13(1):304

  30. Raherison E, Rigault P, Caron S, Poulin P, Boyle B, Verta J, Giguère I, Bomal C, Bohlmann J, MacKay J (2012) Transcriptome profiling in conifers and the PiceaGenExpress database show patterns of diversification within gene families and interspecific conservation in vascular gene expression. BMC Genomics 13(1):434

  31. Ramsköld D, Wang E, Burge C, Sandberg R (2009) An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol 5(12):e1000598

  32. Raz T, Kapranov P, Lipson D, Letovsky S, Milos P, Thompson J (2011) Protocol dependence of sequencing-based gene expression measurements. PLoS One 6(5):e19287

  33. Rigault P, Boyle B, Lepage P, Cooke J, Bousquet J, MacKay J (2011) A white spruce gene catalog for conifer genome analyses. Plant Physiol 157(1):14–28

  34. Shi L, Reid L, Jones W, Shippy R, Warrington J, Baker S, Collins P, de Longueville F, Kawasaki E, Lee K et al (2006) The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24(9):1151–1161

  35. Smith A, Heisler L, Onge R, Farias-Hesson E, Wallace I, Bodeau J, Harris A, Perry K, Giaever G, Pourmand N et al (2010) Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res 38(13):e142

  36. Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A (2011) Differential expression in RNA-Seq: a matter of depth. Genome Res 21(12):2213–2223

  37. R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, pp 1–1731

  38. Toung J, Morley M, Li M, Cheung V (2011) RNA-sequence analysis of human B-cells. Genome Res 21(6):991–998

Acknowledgments

This project is part of the collaborative project "DougAdapt" with funding from the Deutsche Forschungsgemeinschaft to IE (DFG-project EN 829/4-1). The authors are grateful to Anita Kleiber and Anna-Maria Weisser for technical assistance with RNA extraction. The authors also thank Wolfgang Hess for valuable comments and discussion.

Conflict of interests

The authors declare that they have no competing interests.

Ethical standards

All experiments comply with the current laws of the Federal Republic of Germany.

Data archiving statement

All sequence data have been submitted to the NCBI Sequence Read Archive (SRA, www.ncbi.nlm.nih.gov/sra). Accession numbers are SRR908308 (COA1), SRR908309 (COA2), SRR868709 (INT1) and SRR908307 (INT2). The accession number of the study is SRP026170.

Author information

Corresponding author

Correspondence to Ingo Ensminger.

Additional information

Communicated by J. Wegrzyn

Electronic supplementary material

Below is the link to the electronic supplementary material.

Online Resource 1

Distribution of GO-slim categories in the namespace "Biological Process" within the 100 most abundant PUTs. Distribution of GO-slim categories within the 100 most highly expressed PUTs in the deep sequencing libraries COA1, COA2, INT1 and INT2 that were detected when using the Müller or the Howe PUT set as alignment reference. Annotation and GO-slimming are described in the Methods section (XLS 14 kb)

Online Resource 2

Annotation statistics for Howe PUT set. Summary of numbers of detected PUTs and detected PUTs with functional annotation. Functionally annotated PUTs have a hit in the Arabidopsis thaliana peptide database (ARA) or in the NCBI Plant RefSeq peptide database. "Unique annotations" are the set of all unique hits in the A. thaliana or the NCBI Plant RefSeq peptide database. GO annotations refer to the PUTs with GO slim annotation (details of functional and GO-slim annotation in Methods section). Blast2GO annotations refer to annotations inferred by Blast2GO. Relative numbers with respect to the PUT set are shown in parentheses, relative numbers with respect to the deep sequencing libraries are shown in square brackets (XLS 11 kb)

Online Resource 3

Distribution of GO-slim terms (Howe PUT set). The relative abundance of functional categories, represented by plant GO-slims in the four libraries and the Douglas-fir PUT set, compared with the relative abundance detected in deep mRNA-sequencing data generated from Arabidopsis thaliana whole seedlings (NCBI GEO [GSM762070]) and leaves (NCBI GEO [GSM881683]). The functional annotation of Douglas-fir PUTs was obtained by aligning the PUTs to the NCBI Plant RefSeq peptide database and feeding the alignment to the Blast2GO pipeline (for details, see Methods section). The distribution of the deviation of GO-slim abundances relative to the Howe PUT set or the A. thaliana samples in the namespace "molecular function" is shown as smoothed kernel density estimates (KDE) (a). The relative abundance of a GO-slim category in one of the four Douglas-fir libraries or the Douglas-fir PUT set is normalized by the relative abundance of this GO-slim category in an A. thaliana full seedling (b) or leaf (c) deep mRNA-sequencing library. A value of 0 plotted on the y-axis implies an equal distribution of GO-slim terms in the Douglas-fir libraries compared to A. thaliana deep mRNA sequencing libraries or the Douglas-fir PUT set (PDF 1702 kb)

Online Resource 4

Impact of sequencing depth on the number of reliably quantified PUTs when using the Howe PUT set. The number of PUTs with a hit in the NCBI Plant RefSeq peptide database that were detected with more than x aligned reads (x shown on the x-axis). To demonstrate the effect of sequencing depth, sub-samples of library COA2 are included (gradient: yellow to red). The number of aligned reads is given in the legend. Estimates of the expected binomial sampling error (as coefficient of variation [CV]), which depends on the number of aligned reads per PUT, are shown for 10, 100 and 1,000 aligned reads per PUT (PDF 858 kb)
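
The binomial sampling error mentioned for Online Resource 4 can be sketched as follows. This is a minimal illustration under the assumption that reads are sampled independently, so that the read count of a PUT follows a binomial distribution; the function name and the library size of 5 million aligned reads are chosen here for illustration and are not taken from the supplementary file.

```python
import math


def binomial_cv(expected_reads, total_aligned_reads):
    """Coefficient of variation of a binomially distributed read count.

    With N total aligned reads and success probability p = k / N, the
    count has mean N*p and variance N*p*(1 - p), so
    CV = sqrt(N*p*(1 - p)) / (N*p) = sqrt((1 - p) / k).
    """
    p = expected_reads / total_aligned_reads
    variance = total_aligned_reads * p * (1.0 - p)
    return math.sqrt(variance) / expected_reads


# Assumed library size for illustration: 5 million aligned reads.
for k in (10, 100, 1000):
    print(k, binomial_cv(k, 5_000_000))
```

Because p is very small for any single PUT, the CV is close to 1/sqrt(k): roughly 32% at 10 aligned reads, 10% at 100 and 3% at 1,000.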

Online Resource 5

Shared annotations among Müller and Howe PUT sets, P. glauca transcript clusters and Arabidopsis peptides. Venn diagram showing the overlap of the functional annotations inferred by Blast2GO for the Müller PUT set (Muller), the Howe PUT set (Howe), the P. glauca transcript cluster database (Picea) and the Arabidopsis thaliana peptide database (Ara). All sequence sets were aligned to the NCBI Plant RefSeq peptides. This alignment was used for detecting annotations using Blast2GO (for details, see Methods section) (PDF 307 kb)

Online Resource 6

Top 1,000 most abundant PUTs in the deep sequencing libraries (alignment to Müller and Howe PUT set). Top 1,000 most abundant PUTs in the libraries COA1, COA2, INT1 and INT2 sorted by the number of aligned reads. For each PUT, the PUT name with the associated annotation inferred by Blast2GO and the number of aligned reads are printed in the form: PUT name-Blast2GO Annotation-counts (XLS 5787 kb)

About this article

Cite this article

Hess, M., Wildhagen, H. & Ensminger, I. Suitability of Illumina deep mRNA sequencing for reliable gene expression profiling in a non-model conifer species (Pseudotsuga menziesii). Tree Genetics & Genomes 9, 1513–1527 (2013). https://doi.org/10.1007/s11295-013-0656-2

Keywords

  • Illumina
  • Deep mRNA sequencing
  • Conifer
  • Sequencing depth
  • Transcriptome
  • Next-generation sequencing