Abstract
Pseudotsuga menziesii (Douglas-fir) is an ideal model system for studying local adaptation and intraspecific variation in transcriptome responses to the environment. However, the lack of genomic resources and standardized microarray platforms for gene expression profiling has limited the testing of hypotheses on transcriptome organization and variation. Only recently has deep mRNA sequencing become a promising alternative for overcoming these limitations. Unbiased gene expression profiling from mRNA sequencing data, however, requires information on the transcript abundance distribution. Since this information is not available for adult conifer needle tissue, we inferred the transcript abundance distribution and tested the effect of sequencing depth on the reliable detection and quantification of transcripts from the needle tissue of 50-year-old Douglas-fir trees. The distribution of GO-slim categories in our mRNA-sequencing libraries was similar to that in previously published putative unique transcripts (PUTs) for Douglas-fir, which were used as the alignment reference. However, the GO-slim distribution in the Douglas-fir libraries and the Douglas-fir PUTs differed from the GO-slim distributions reported for deep mRNA sequencing libraries obtained from Arabidopsis thaliana leaf tissue. Several highly abundant PUTs associated with proteins involved in photosynthesis appeared to limit the benefits of increased sequencing depth. Simulations and empirical data indicated that a 3-fold increase from 5 to 15 million aligned reads results in about twice the number of PUTs that surpass the threshold of 100 aligned reads that was used for robust transcript quantification.
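The relationship between sequencing depth and the number of PUTs above the 100-read threshold can be illustrated with a minimal sketch. It is not the simulation used in the study: the log-normal abundance distribution, its `sigma` parameter, the number of PUTs, and the function names are all assumptions chosen for illustration, and expected read counts are used in place of random sampling.

```python
import random


def relative_abundances(n_puts, sigma=2.0, seed=0):
    """Draw hypothetical log-normal abundances and normalise them to sum to 1.

    The log-normal shape and sigma are illustrative assumptions, not values
    estimated from the Douglas-fir libraries.
    """
    rng = random.Random(seed)
    raw = [rng.lognormvariate(0.0, sigma) for _ in range(n_puts)]
    total = sum(raw)
    return [x / total for x in raw]


def puts_above_threshold(abundances, depth, threshold=100):
    """Count PUTs whose expected number of aligned reads exceeds the threshold."""
    return sum(1 for p in abundances if depth * p > threshold)


if __name__ == "__main__":
    abund = relative_abundances(n_puts=30000)
    for depth in (5_000_000, 15_000_000):
        print(depth, puts_above_threshold(abund, depth))
```

With a strongly skewed distribution, a few highly abundant transcripts absorb a large share of the reads, so the count of PUTs above the threshold tends to grow more slowly than the depth itself.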
Abbreviations
- CA: Canada
- CV: coefficient of variation
- EST: expressed sequence tags
- GEO: gene expression omnibus
- GO: gene ontology
- KDE: kernel density estimate
- MAQC: Micro-Array Quality Control
- Mreads: million reads
- PUT: putative unique transcript
- qPCR: quantitative polymerase chain reaction
References
Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25
Ausin I, Greenberg M, Simanshu D, Hale C, Vashisht A, Simon S, Lee T, Feng S, Española S, Meyers B et al (2012) INVOLVED IN DE NOVO 2-containing complex involved in RNA-directed DNA methylation in Arabidopsis. Proc Natl Acad Sci 109(22):8374–8381
Bullard J, Purdom E, Hansen K, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinforma 11(1):94
Cai G, Li H, Lu Y, Huang X, Lee J, Müller P, Ji Y, Liang S (2012) Accuracy of RNA-Seq and its dependence on sequencing depth. BMC Bioinforma 13(Suppl 13):S5
Chang S, Puryear J, Cairney J (1993) A simple and efficient method for isolating RNA from pine trees. Plant Mol Biol Report 11(2):113–116
Clark J, Brooksbank C, Lomax J (2005) It's all go for plant scientists. Plant Physiol 138(3):1268–1279
Daines B, Wang H, Wang L, Li Y, Han Y, Emmert D, Gelbart W, Wang X, Li W, Gibbs R et al (2011) The Drosophila melanogaster transcriptome by paired-end RNA sequencing. Genome Res 21(2):315–324
Draghici S, Khatri P, Eklund A, Szallasi Z (2006) Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 22(2):101
Gan X, Stegle O, Behr J, Steffen J, Drewe P, Hildebrand K, Lyngsoe R, Schultheiss S, Osborne E, Sreedharan V et al (2011) Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477(7365):419–423
Götz S, García-Gómez JM, Terol J et al (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 36:3420–3435
Gugger P, Sugita S, Cavender-Bares J (2010) Phylogeography of Douglas-fir based on mitochondrial and chloroplast DNA sequences: testing hypotheses from the fossil record. Mol Ecol 19(9):1877–1897
Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann S (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7(1):497
Hermann R, Lavender D (1999) Douglas-fir planted forests. New For 17(1):53–70
Holliday J, Ralph S, White R, Bohlmann J, Aitken S (2008) Global monitoring of autumn gene expression within and among phenotypically divergent populations of Sitka spruce (Picea sitchensis). New Phytol 178(1):103–122
Howe GT, Yu J, Knaus B et al (2013) A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation. BMC Genomics 14:137
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95
Łabaj P, Leparc G, Linggi B, Markillie L, Wiley H, Kreil D (2011) Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 27(13):i383–i391
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760
Li P, Ponnala L, Gandotra N, Wang L, Si Y, Tausta S, Kebrom T, Provart N, Patel R, Myers C et al (2010) The developmental dynamics of the maize leaf transcriptome. Nat Genet 42(12):1060–1067
Liu S, Lin L, Jiang P, Wang D, Xing Y (2011) A comparison of RNA-Seq and high-density exon array for detecting differential gene expression between closely related species. Nucleic Acids Res 39(2):578–588
Lorenz W, Alba R, Yu Y, Bordeaux J, Simões M, Dean J (2011) Microarray analysis and scale-free gene networks identify candidate regulators in drought-stressed roots of Loblolly pine (P. taeda L.). BMC Genomics 12(1):264
Mane S, Evans C, Cooper K, Crasta O, Folkerts O, Hutchison S, Harkins T, Thierry-Mieg D, Thierry-Mieg J, Jensen R (2009) Transcriptome sequencing of the microarray quality control (MAQC) RNA reference samples using next generation sequencing. BMC Genomics 10(1):264
Marioni J, Mason C, Mane S, Stephens M, Gilad Y (2008) RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18(9):1509–1517
McIntyre L, Lopiano K, Morse A, Amin V, Oberg A, Young L, Nuzhdin S (2011) RNA-Seq: technical variability and sampling. BMC Genomics 12(1):293
Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628
Müller T, Ensminger I, Schmid K (2012) A catalogue of putative unique transcripts from Douglas-fir (Pseudotsuga menziesii) based on 454 transcriptome sequencing of genetically diverse, drought stressed seedlings. BMC Genomics 13(1):673
Ning Z, Cox A, Mullikin J (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11(10):1725–1729
Oberg A, Bot B, Grill D, Poland G, Therneau T (2012) Technical and biological variance structure in mRNA-Seq data: life in the real world. BMC Genomics 13(1):304
Raherison E, Rigault P, Caron S, Poulin P, Boyle B, Verta J, Giguère I, Bomal C, Bohlmann J, MacKay J (2012) Transcriptome profiling in conifers and the PiceaGenExpress database show patterns of diversification within gene families and interspecific conservation in vascular gene expression. BMC Genomics 13(1):434
Ramsköld D, Wang E, Burge C, Sandberg R (2009) An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol 5(12):e1000598
Raz T, Kapranov P, Lipson D, Letovsky S, Milos P, Thompson J (2011) Protocol dependence of sequencing-based gene expression measurements. PLoS One 6(5):e19287
Rigault P, Boyle B, Lepage P, Cooke J, Bousquet J, MacKay J (2011) A white spruce gene catalog for conifer genome analyses. Plant Physiol 157(1):14–28
Shi L, Reid L, Jones W, Shippy R, Warrington J, Baker S, Collins P, de Longueville F, Kawasaki E, Lee K et al (2006) The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24(9):1151–1161
Smith A, Heisler L, Onge R, Farias-Hesson E, Wallace I, Bodeau J, Harris A, Perry K, Giaever G, Pourmand N et al (2010) Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res 38(13):e142
Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A (2011) Differential expression in RNA-Seq: a matter of depth. Genome Res 21(12):2213–2223
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, pp 1–1731
Toung J, Morley M, Li M, Cheung V (2011) RNA-sequence analysis of human B-cells. Genome Res 21(6):991–998
Acknowledgments
This project is part of the collaborative project "DougAdapt" with funding from the Deutsche Forschungsgemeinschaft to IE (DFG-project EN 829/4-1). The authors are grateful to Anita Kleiber and Anna-Maria Weisser for technical assistance with RNA extraction. The authors also thank Wolfgang Hess for valuable comments and discussion.
Conflict of interests
The authors declare that they have no competing interests.
Ethical standards
All experiments comply with the current laws of the Federal Republic of Germany.
Data archiving statement
All sequence data have been submitted to the NCBI Sequence Read Archive (SRA, www.ncbi.nlm.nih.gov/sra). Accession numbers are SRR908308 (COA1), SRR908309 (COA2), SRR868709 (INT1), and SRR908307 (INT2). Accession number of the study: SRP026170.
Additional information
Communicated by J. Wegrzyn
Electronic supplementary material
Below is the link to the electronic supplementary material.
Online Resource 1
Distribution of GO-slim categories in the namespace "Biological Process" within the 100 most abundant PUTs. Distribution of GO-slim categories within the 100 most highly expressed PUTs in the deep sequencing libraries COA1, COA2, INT1 and INT2 that were detected when using the Müller or the Howe PUT set as alignment reference. Annotation and GO-slimming are described in the Methods section (XLS 14 kb)
Online Resource 2
Annotation statistics for Howe PUT set. Summary of numbers of detected PUTs and detected PUTs with functional annotation. Functionally annotated PUTs have a hit in the Arabidopsis thaliana peptide database (ARA) or in the NCBI Plant RefSeq peptide database. "Unique annotations" are the set of all unique hits in the A. thaliana or the NCBI Plant RefSeq peptide database. GO annotations refer to the PUTs with GO slim annotation (details of functional and GO-slim annotation in Methods section). Blast2GO annotations refer to annotations inferred by Blast2GO. Relative numbers with respect to the PUT set are shown in parentheses, relative numbers with respect to the deep sequencing libraries are shown in square brackets (XLS 11 kb)
Online Resource 3
Distribution of GO-slim terms (Howe PUT set). The relative abundance of functional categories, represented by plant GO-slims in the four libraries and the Douglas-fir PUT set, compared with the relative abundance detected in deep mRNA-sequencing data generated from Arabidopsis thaliana whole seedlings (NCBI GEO [GSM762070]) and leaves (NCBI GEO [GSM881683]). The functional annotation of Douglas-fir PUTs was obtained by aligning the PUTs to the NCBI Plant RefSeq peptide database and feeding the alignment to the Blast2GO pipeline (for details, see Methods section). The distribution of the deviation of GO-slim abundances relative to the Howe PUT set or the A. thaliana samples in the namespace "molecular function" is shown as smoothed kernel density estimates (KDE) (a). The relative abundance of a GO-slim category in one of the four Douglas-fir libraries or the Douglas-fir PUT set is normalized by the relative abundance of this GO-slim category in an A. thaliana full seedling (b) or leaf (c) deep mRNA-sequencing library. A value of 0 plotted on the y-axis implies an equal distribution of GO-slim terms in the Douglas-fir libraries compared to A. thaliana deep mRNA sequencing libraries or the Douglas-fir PUT set (PDF 1702 kb)
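A normalization in which 0 means "equal relative abundance" can be sketched as a log ratio. The caption does not state the exact formula, so the base-2 logarithm below is an assumption (a relative difference would also give 0 for equal abundances), and the function names and the count dictionaries are hypothetical.

```python
import math


def relative_abundance(counts):
    """Convert raw GO-slim category counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def log2_ratio(sample_counts, reference_counts):
    """Log2 ratio of relative abundances, sample versus reference.

    Assumed normalization: 0 means the category is equally represented in
    both data sets. Categories missing or empty in either set are skipped.
    """
    s = relative_abundance(sample_counts)
    r = relative_abundance(reference_counts)
    return {
        term: math.log2(s[term] / r[term])
        for term in s
        if term in r and s[term] > 0 and r[term] > 0
    }
```

Here `sample_counts` would hold the GO-slim counts of one Douglas-fir library and `reference_counts` those of an A. thaliana library or the PUT set.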
Online Resource 4
Impact of sequencing depth on the number of reliably quantified PUTs when using the Howe PUT set. The number of PUTs with a hit in the NCBI Plant RefSeq peptide database detected with more than x number of aligned reads (value shown on the x-axis). To demonstrate the effect of sequencing depth, sub-samples of library COA2 are included (gradient: yellow to red). The number of aligned reads is printed in the legend. Estimates of expected binomial sampling error (as coefficient of variation [CV]), dependent on the number of aligned reads per PUT are shown for 10, 100 and 1,000 aligned reads per PUT (PDF 858 kb)
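The expected binomial sampling error can be worked out directly. As a sketch, assume the read count of a PUT follows a binomial distribution with N total aligned reads and success probability p, so that the expected count is n = Np. The coefficient of variation is then sqrt(Np(1 - p)) / (Np) = sqrt((1 - p) / n), which is close to 1 / sqrt(n) when p is small. The function name and the 15 million total are illustrative choices.

```python
import math


def binomial_cv(reads_per_put, total_aligned_reads):
    """Coefficient of variation of a binomial read count.

    reads_per_put is the expected count n = N * p;
    total_aligned_reads is the library size N.
    """
    p = reads_per_put / total_aligned_reads
    return math.sqrt((1.0 - p) / reads_per_put)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(binomial_cv(n, 15_000_000), 4))
```

Under these assumptions the expected CV is roughly 32 % at 10 aligned reads, 10 % at 100, and 3 % at 1,000, which is consistent with using 100 aligned reads as the threshold for robust quantification.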
Online Resource 5
Shared annotations among Müller and Howe PUT sets, P. glauca transcript clusters and Arabidopsis peptides. Venn diagram which shows the overlap of the functional annotations inferred by Blast2GO of the Müller PUT set (Muller), the Howe PUT set (Howe), the P. glauca transcript cluster database (Picea) and the Arabidopsis thaliana peptide database (Ara). All sequence sets have been aligned to the NCBI Plant RefSeq peptides. This alignment was used for detecting annotations using Blast2GO (for details, see Methods section) (PDF 307 kb)
Online Resource 6
Top 1,000 most abundant PUTs in the deep sequencing libraries (alignment to Müller and Howe PUT set). Top 1,000 most abundant PUTs in the libraries COA1, COA2, INT1 and INT2 sorted by the number of aligned reads. For each PUT, the PUT name with the associated annotation inferred by Blast2GO and the number of aligned reads are printed in the form: PUT name-Blast2GO Annotation-counts (XLS 5787 kb)
About this article
Cite this article
Hess, M., Wildhagen, H. & Ensminger, I. Suitability of Illumina deep mRNA sequencing for reliable gene expression profiling in a non-model conifer species (Pseudotsuga menziesii). Tree Genetics & Genomes 9, 1513–1527 (2013). https://doi.org/10.1007/s11295-013-0656-2
DOI: https://doi.org/10.1007/s11295-013-0656-2