Tree Genetics & Genomes, Volume 9, Issue 4, pp 1031–1041

Transcriptome characterization and detection of gene expression differences in aspen (Populus tremuloides)


  • Hardeep S. Rai
    • Ecology Center and Department of Wildland Resources, Utah State University
  • Karen E. Mock
    • Ecology Center and Department of Wildland Resources, Utah State University
  • Bryce A. Richardson
    • Rocky Mountain Research Station, USDA Forest Service
  • Richard C. Cronn
    • Pacific Northwest Research Station, USDA Forest Service
  • Katherine J. Hayden
    • Department of Environmental Science, Policy, and Management, University of California
  • Jessica W. Wright
    • Pacific Southwest Research Station, USDA Forest Service
  • Brian J. Knaus
    • Pacific Northwest Research Station, USDA Forest Service
    • Ecology Center and Department of Biology, Utah State University
Original Paper

DOI: 10.1007/s11295-013-0615-y

Cite this article as:
Rai, H.S., Mock, K.E., Richardson, B.A. et al. Tree Genetics & Genomes (2013) 9: 1031. doi:10.1007/s11295-013-0615-y


Aspen (Populus tremuloides) is a temperate North American tree species with a geographical distribution more extensive than any other tree species on the continent. Because it is economically important for pulp and paper industries and ecologically important for its role as a foundation species in forest ecosystems, the decline of aspen in large portions of its range is of serious concern. The availability and annotation of the black cottonwood (Populus trichocarpa) genome enables a range of high throughput sequencing approaches that can be used to understand rangewide patterns of genetic variation, adaptation, and responses to environmental challenges in other Populus species, including aspen. Gene expression studies are particularly useful for understanding the molecular basis of ecological responses, but are limited by the availability of transcriptome data. We explored the aspen transcriptome through the use of high-throughput sequencing with two main goals: (1) characterization of the expressed portion of the P. tremuloides genome in leaves and (2) assessment of variation in gene expression among genets collected from distinct latitudes but reared in a common garden. We also report a large single nucleotide polymorphism dataset that provides the groundwork for future studies of aspen evolution and ecology, and we identify a set of differentially expressed genes across individuals and population boundaries for the leaf transcriptome of P. tremuloides.


Trembling aspen; Quaking aspen; RNA-Seq; Differential expression; SNP (single nucleotide polymorphism); Populus trichocarpa


The poplars (Populus L., Salicaceae) are widely distributed northern hemisphere trees that comprise 22–85 species, usually in six sections, depending on taxonomic scheme (Eckenwalder 1977; Hamzeh and Dayanandan 2004). Here, we have sequenced the leaf tissue transcriptome of Populus tremuloides Michx. (trembling aspen), a close relative of the fully sequenced model organism black cottonwood, Populus trichocarpa Torr. & Gray. P. tremuloides has the broadest distribution, in terms of both latitude and longitude, of any North American tree species (Jones 1985) and has been proposed as an ideal study system for both adaptation and responses to climate change (Hogg and Hurdle 1995; Isebrands et al. 2001). Its distribution across many climatic zones suggests that P. tremuloides is highly plastic, highly adaptable, or both (Mitton and Grant 1996). P. tremuloides has been shown to have extremely high levels of genetic variation at the population level, both in terms of trait heritability and molecular marker diversity (Cole 2005; Liu and Furnier 1993). Aspen is economically important for pulp and paper industries and ecologically important for its role as a foundation species in forest ecosystems (Mitton and Grant 1996). Dramatic mortality in certain portions of the species’ range has recently been documented, raising concerns about the ability of P. tremuloides to respond to a changing climate (Frey et al. 2004; Iverson and Prasad 2002; Rehfeldt et al. 2009; Worrall et al. 2008; Hogg et al. 2008). A pervasive loss of this tree species at a continental scale would have significant ecological and economic impact and could result in dramatic carbon fluxes (Kurz and Apps 1999; Hogg and Hurdle 1995).

The availability of a high-quality genomic reference in the genus, P. trichocarpa (Tuskan et al. 2006), along with dramatic advances in sequencing technologies (reviewed by Fang and Cui 2011; Metzker 2009), provide new opportunities to understand the evolutionary history of Populus species and to characterize molecular responses to environmental challenges (e.g., Jiang et al. 2012; Qiu et al. 2011). In P. tremuloides, these technologies enable studies at the molecular level to examine how a single species can achieve such an enormous ecological amplitude (e.g., through plasticity, adaptation, migration) and what factors limit its persistence in changing climates. Answers to these questions are critical for effective management of natural populations as well as for silvicultural improvement aimed at specific applications (e.g., biofuels, carbon sequestration).

Here, we present an initial characterization of the P. tremuloides transcriptome. The high-throughput sequencing of cDNA libraries (RNA-Seq) enables the direct study of the expressed portion of the genome in sampled tissues (Marioni et al. 2008; Nagalakshmi et al. 2008; Perkins et al. 2009; Wang et al. 2009). Compared to array-based technologies, RNA-Seq has the advantage of higher sensitivity and the ability to capture a significantly larger component of gene expression (Severin et al. 2010), allowing for the assay of tens of thousands of transcripts, or even the entire transcriptome (e.g., Bajgain et al. 2011; Bräutigam et al. 2011; Coppe et al. 2010; Geraldes et al. 2011; Nagalakshmi et al. 2008; Severin et al. 2010; Wang et al. 2009). A distinct advantage of RNA-Seq is that the relative abundance of sequencing reads, aligned with or without the existence of a reference genome, can be used as an estimate of transcript abundance, which can be compared among individuals or under different environmental conditions (Bashir et al. 2010; Wang et al. 2009).

Variation within the transcriptome of other Populus species has been investigated for responses to various sources of stress using microarrays (Grisel et al. 2010; Hamanishi et al. 2010), for comparative studies (using real-time PCR and microarray analysis: Quesada et al. 2008), and even as a tool for understanding ecosystem interactions (using RNA-Seq data: Larsen et al. 2011). However, little attention has been paid to the descriptive, functional, or comparative genomics of P. tremuloides. Here, we present an overview of the aspen transcriptome, focusing on expression in leaf tissues using an RNA-Seq approach. Our major objectives are (1) to provide a snapshot of the expressed portion of the aspen genome and (2) to characterize variation present within and among aspen from two latitudinally distinct regions.

Within the transcribed regions reported here, we characterize sequence variation [single nucleotide polymorphisms (SNPs)] present in the expressed regions of the genome of aspen within and among individual genotypes. Gene expression patterns vary during development, between tissue types, and with external stimuli, and play a key role in the functioning of all organisms. We also highlight novel transcribed regions of the genome relative to the transcript assemblies of black cottonwood (P. trichocarpa v2; Phytozome v8.0). The availability of this type of data has utility in future studies of aspen, including the discovery and identification of novel genes, identification of genetic variants useful for mapping and association studies, assessment of plasticity and adaptive mechanisms, and identification of genetic markers for quantitative trait loci (QTL).

Materials and methods

Plant material and RNA isolation

We obtained root segments from five aspen genets in two US states (two individuals from a site in Arizona and three from a site in Montana; Table 1). Root segments were propagated following Schier (1978), and the shoots were grown in a controlled greenhouse (currently housed at the USDA Forest Sciences Laboratory in Logan, UT, USA). Each genet was represented by a single ramet. From each of the five ramets in the greenhouse, we extracted total RNA from a single undamaged, mature, mid-season leaf. We used 75 mg of leaf tissue from the proximal end of the leaf, avoiding the mid-vein. We isolated RNA using a Spectrum Plant Total RNA kit (Sigma-Aldrich), with a DNase digestion step, and the final product was eluted twice from each column. We examined the quantity and quality of RNA using a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA).
Table 1

Geographic sources of root cuttings used for greenhouse common garden propagation

| Root sample | Collection date | Location | Latitude | Longitude |
| | | Trout Creek, MT, USA | 47°52′2.454″ N | 115°36′42.078″ W |
| | | Trout Creek, MT, USA | 47°53′24.630″ N | 115°37′33.050″ W |
| | | Trout Creek, MT, USA | 47°53′28.169″ N | 115°37′14.896″ W |
| | | Mt. Lemon, AZ, USA | 32°25′7.368″ N | 110°43′55.673″ W |
| | | Mt. Lemon, AZ, USA | 32°27′7.721″ N | 110°46′59.474″ W |

cDNA library preparation

Sera-Mag Oligo(dT) beads (Thermo Scientific, Fremont, CA, USA) were used to enrich poly-adenylated RNA by hybridization, following the manufacturer’s protocol, and the enriched RNA was resuspended in 50 μl of 10 mM Tris–HCl. Library preparation followed the Illumina GAII RNA-Seq protocol (San Diego, CA; part no. 1004898 Rev. A), except that custom barcoded adapters were used during ligation steps (Cronn et al. 2008). We used PCR to enrich libraries for fragments containing adapters ligated in the correct orientation. Following each enzymatic step, a cleaning step was performed using AMPure XP beads (Agencourt, Danvers, MA, USA). Libraries were quantified using a Qubit fluorometer and their quality validated using an Agilent 2100 Bioanalyzer (Santa Clara, CA, USA). The five libraries were pooled in equimolar quantities for a 5 pM final concentration and run on three separate lanes of the Illumina sequencer (two of these lanes were part of the same flow cell).

High-throughput sequencing

Illumina high-throughput sequencing was performed at the Oregon State University Center for Genome Research and Biocomputing, on an Illumina GAIIx using 80-bp single-end indexing runs, resulting in 73-bp sequences and a 6-bp index. The samples were sequenced across three Illumina lanes. Base calling was performed in Casava v1.8 (Illumina, San Diego, CA, USA). We culled low-quality reads using the Illumina analysis pipeline. We then used custom Perl scripts for barcode (index) deconvolution. The final 43.7 M reads remaining after passing quality control steps are summarized in Table 2.
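The barcode (index) deconvolution step can be illustrated with a minimal Python sketch. This is not the custom Perl code used in the study: the function name `demultiplex`, the exact-match rule, the example index sequences, and their pairing with genet names are all assumptions made for illustration.

```python
from collections import defaultdict

def demultiplex(records, barcode_to_sample):
    """Bin reads by their 6-bp index sequence (exact matches only).

    records: iterable of (index_sequence, read_sequence) pairs.
    barcode_to_sample: dict mapping an index sequence to a sample name.
    Reads whose index matches no known barcode go to "unassigned".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in records:
        sample = barcode_to_sample.get(index_seq, "unassigned")
        bins[sample].append(read_seq)
    return dict(bins)

# Hypothetical indexes; the real barcode sequences are not given in the text.
barcodes = {"ACGTAC": "MT5", "TGCATG": "AZ8"}
records = [
    ("ACGTAC", "A" * 73),
    ("TGCATG", "C" * 73),
    ("NNNNNN", "G" * 73),  # unreadable index, cannot be assigned
]
binned = demultiplex(records, barcodes)
```

A production script would usually also tolerate one mismatch in the index, which this sketch deliberately omits.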
Table 2

Summary of reads generated from Illumina sequencing for five Populus tremuloides individuals from two populations in Western North America

Columns: Average length; Total bases

Transcriptome assembly

We trimmed adapter sequences and binned single-end reads by barcode (deconvolution) and aligned them to the P. trichocarpa v2 transcript assemblies [45,033 protein-coding transcripts (file name: Ptrichocarpa_156_transcript.fa.gz); Phytozome v8.0] using CLC Genomics Workbench (v4.9; CLC Bio, Cambridge, MA, USA). We mapped to transcript assemblies rather than to the genome to avoid problems with gapped alignments at splice junctions. The maximum gap and mismatch count were set to 2, and insertion and deletion costs were set to 3, with a minimum contig length of 200 bp. Length fraction and similarity parameters were set to 0.5 and 0.8, respectively.

Gene ontology (GO) annotations were assigned using Blast2GO (Conesa et al. 2005). GO terms were searched and retrieved to match the results from a blastx search against the NCBI nr database (maximum of 10 hits for each contig with an e-value cutoff of 1 × 10^−10), and the annotation function of Blast2GO was used to select final GO terms for each contig. Finally, the terms were analyzed through the GO Slim for plants function of Blast2GO to provide a broad overview of the gene product functions contained within the ontology content.

We also used CLC to perform a de novo assembly of those unmapped sequences that did not produce significant alignments to the reference in the P. trichocarpa reference-guided assembly above. CLC’s de novo assembler performs a two-step assembly by first assembling simple contigs using De Bruijn graphs and then mapping back all reads using the simple contigs as references. The mismatch count was set to 2, and insertion and deletion costs were set to 3, with a minimum contig length of 200 bp. Length fraction and similarity parameters were set to 0.5 and 0.8, respectively. Nonspecific matches were discarded because we chose to be conservative in our treatment of reads that mapped to more than one position. We then performed a blast search of the contigs resulting from this de novo assembly against the NCBI “nr” database, to identify potential contaminants (see Table 3).
Table 3

Summary of blast results of the unmapped reads in Populus tremuloides from the reference-guided assembly. A de novo assembly of these reads resulted in 13,544 contigs

Columns: Number of contigs; Percent of total. Row categories include: Other plants; Bacterial origin; No significant match


SNP searching

We used our reference-guided assembly (above) and CLC Genomics Workbench to detect SNPs occurring between P. tremuloides and P. trichocarpa as well as SNPs between the two sampled populations of P. tremuloides. Only variants with a minimum coverage of 8× and a minimum variant count (same as minimum allele depth) of 4 were called. Maximum coverage for the SNP search was set at 500× in order to avoid highly repetitive regions of the genome. The resulting set of variable sites was filtered for SNPs specific to aspen and also filtered for SNPs within open reading frames (ORFs) of more than 200 bp.
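The coverage thresholds described above amount to a simple per-site filter. The following Python sketch shows that logic under stated assumptions: the function name, the dictionary layout, and the example sites are hypothetical, and the actual calls were made inside CLC Genomics Workbench rather than with code like this.

```python
def passes_snp_filter(coverage, variant_count,
                      min_coverage=8, min_variant_count=4, max_coverage=500):
    """Return True if a candidate site meets the thresholds in the text:
    coverage between 8x and 500x and at least 4 reads carrying the variant."""
    return (min_coverage <= coverage <= max_coverage
            and variant_count >= min_variant_count)

# Hypothetical candidate sites (position, total coverage, variant read count).
candidates = [
    {"pos": 101, "coverage": 7, "variant_count": 5},    # too shallow
    {"pos": 245, "coverage": 40, "variant_count": 3},   # too few variant reads
    {"pos": 388, "coverage": 40, "variant_count": 12},  # kept
    {"pos": 512, "coverage": 900, "variant_count": 400},  # likely repetitive
]
called = [c["pos"] for c in candidates
          if passes_snp_filter(c["coverage"], c["variant_count"])]
```

The upper bound is the part that is easy to forget: very deep sites are excluded because they are more likely to come from repetitive regions than from a single locus.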

Differential expression

We used our raw reads to explore differential expression among P. tremuloides genets within and among sampling sites representing different latitudes. To quantify transcript abundance, the 73-bp reads were mapped to the P. trichocarpa reference using the reference-based aligner Bowtie version 0.12.7, resulting in SAM format files. Bowtie deals with multimapping reads by assigning them randomly among transcript sites (Langmead et al. 2009). The maximum number of mismatches was set to 2 in the -n alignment mode. The number of reads mapped to each transcript was tallied using a custom Perl script. Unbalanced library sizes were down-sampled using the method implemented in NBPSeq (Di et al. 2011); the smallest library size was identified, and the other libraries were down-sampled to this number of reads, based on a probabilistic step, resulting in approximately equal library sizes. Once libraries were normalized to balanced sizes, comparisons of differential expression between genets were made using Fisher’s exact test (Fisher 1922), which is appropriate for small sample sizes (1 or 2). We corrected for multiple comparisons by setting the false discovery rate to 0.05 (Benjamini and Hochberg 1995). For comparison across the two sampling sites, the dispersion parameter for the negative binomial distribution was estimated for each transcript using the “estimateTagwiseDisp” function of the Bioconductor package edgeR (Robinson et al. 2010), and negative-binomial exact tests were performed for each gene using these dispersion estimates. Results from the statistical analysis of differential expression are presented as MA plots (log ratio versus log abundance) generated with the plotSmear function in edgeR (Robinson et al. 2010).
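The pairwise workflow (down-sampling, Fisher’s exact test, false discovery rate correction) can be sketched in plain Python. This is an illustrative re-implementation, not the NBPSeq or Perl code used in the study: binomial thinning is assumed as the "probabilistic step", and the function names and read counts are hypothetical.

```python
import math
import random

def thin_counts(counts, target_size, rng):
    """Down-sample a library by keeping each read with probability
    target_size / library_size (binomial thinning; an assumed stand-in
    for the NBPSeq procedure)."""
    keep = min(1.0, target_size / sum(counts))
    return [sum(rng.random() < keep for _ in range(c)) for c in counts]

def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(k1, n1, k2, n2):
    """Two-sided Fisher's exact test comparing k1 of n1 reads in one
    library with k2 of n2 reads in another."""
    total, hits = n1 + n2, k1 + k2

    def pmf(x):  # hypergeometric probability of x hits in library 1
        return math.exp(_log_comb(hits, x) + _log_comb(total - hits, n1 - x)
                        - _log_comb(total, n1))

    lo, hi = max(0, n1 - (total - hits)), min(hits, n1)
    observed = pmf(k1)
    # Sum every table that is no more probable than the observed one.
    p = sum(pmf(x) for x in range(lo, hi + 1)
            if pmf(x) <= observed * (1 + 1e-7))
    return min(1.0, p)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-transcript read counts for two genets.
rng = random.Random(1)
lib_a = [120, 30, 0, 850]
lib_b = [60, 10, 5, 300]
target = min(sum(lib_a), sum(lib_b))
lib_a_thin = thin_counts(lib_a, target, rng)
lib_b_thin = thin_counts(lib_b, target, rng)
pvals = [fisher_exact_two_sided(a, sum(lib_a_thin), b, sum(lib_b_thin))
         for a, b in zip(lib_a_thin, lib_b_thin)]
qvals = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(qvals) if q <= 0.05]
```

Log-gamma is used for the binomial coefficients so that the test remains practical when library sizes run to millions of reads.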

Results and discussion

In this study, we characterized the leaf transcriptome of P. tremuloides, an ecologically important tree species in western North America. Although we lacked biological and experimental replication, our study design included five individual aspen genotypes from two distinct and latitudinally separated sampling sites. This enabled us to make the first examination of variation in gene expression in aspen based on RNA-Seq data. We mapped the trimmed 73-bp sequence reads to the P. trichocarpa reference genome (Tuskan et al. 2006), from which we generated a functional characterization of the transcriptome and a set of putative SNPs. We also characterized the differences in gene expression patterns between sampled regions and among individual genets.

Transcriptome assembly and analysis

We generated 43,742,783 sequence reads from three independent runs of pooled and barcoded RNA-Seq libraries (Table 2). All reads are deposited at the NCBI Sequence Read Archive (accession number SRA057223). Assuming a high degree of sequence and transcript similarity (Cervera et al. 2005; Chen et al. 2010; Chen et al. 2007; Cole 2005; Hamzeh and Dayanandan 2004; Hamzeh et al. 2006; Unneberg et al. 2005), sequences from all P. tremuloides individuals were combined into a single reference-mapped assembly against the 45,033 annotated, protein-coding transcripts of P. trichocarpa. A large proportion of the total sequencing reads, 84 % (∼37M), were successfully mapped to the reference library with an average depth of ∼66×. The remaining 6,813,822 of the ∼44M total reads were not mapped to the reference transcripts in this assembly. We detected 7.8M nonspecific reads, which mapped to more than one P. trichocarpa transcript. In these cases, we assigned the reads randomly. Thus, we may have slightly overestimated the total coverage and underestimated the sequencing depth.

The above assembly resulted in the pooled P. tremuloides sequencing reads mapping to 38,177 of the 45,033 P. trichocarpa reference transcripts with at least a single read in each contig (of these, 20,906 had an average read depth of 4×). Even when this was examined for each of the five genets individually, a large proportion of the P. trichocarpa transcripts were recovered: MT5, 32,963; MT6, 32,049; MT8, 31,272; AZ8, 33,109; AZ9, 34,007 total transcripts for each genet, respectively. This demonstrates our ability to detect low-abundance transcripts using RNA-Seq. It also shows that we have captured a large proportion of the genes expressed in the P. tremuloides leaf transcriptome and that this transcriptome contains a substantial proportion of the total annotated P. trichocarpa transcripts.

In order to test whether the depth of our individual RNA-Seq libraries was sufficient to maximize recovery of the expressed portion of the aspen leaf genome, we randomly sampled our P. tremuloides reads and counted the number that corresponded to the reference (P. trichocarpa) transcripts. We plotted the number of reads for each individual separately. All individuals show the same asymptote for transcripts recovered after the initial 2 million reads (Fig. 1). The results of these rarefaction plots indicate that our current level of sequencing for each of the genets is sufficient to recover a reasonable sampling of the entire leaf transcriptome and that additional sequencing is likely to yield only a small number of additional expressed transcripts (likely very low-expression level genes in this portion of the transcriptome).
Fig. 1

Example of a rarefaction plot for one Populus tremuloides clone (from Arizona). The raw reads were plotted against the P. trichocarpa transcript assembly (v2). Line indicates total number of whole-plant transcripts in reference (P. trichocarpa)
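The rarefaction procedure described above (randomly sampling reads and counting the reference transcripts they hit) can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function name, the toy read-to-transcript assignments, and the sampling depths are assumptions.

```python
import random

def rarefaction_curve(read_to_transcript, depths, seed=0):
    """Return (depth, distinct transcripts seen) for each sampling depth.

    read_to_transcript: one transcript ID per mapped read.
    depths: numbers of reads to sample; reads are drawn without
    replacement by taking prefixes of a single random shuffle.
    """
    rng = random.Random(seed)
    shuffled = list(read_to_transcript)
    rng.shuffle(shuffled)
    curve, seen, consumed = [], set(), 0
    for depth in sorted(depths):
        depth = min(depth, len(shuffled))
        seen.update(shuffled[consumed:depth])  # add only the new reads
        consumed = depth
        curve.append((depth, len(seen)))
    return curve

# Toy data: a few highly expressed transcripts and some rare ones.
reads = ["t1"] * 500 + ["t2"] * 300 + ["t3"] * 150 + ["t4"] * 5 + ["t5"] * 1
curve = rarefaction_curve(reads, depths=[10, 100, 500, len(reads)])
```

With real data the curve flattens once most expressed transcripts have been seen at least once, which is the asymptote referred to in the text.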

We assigned GO annotations to the assembled contigs with a significant match within the Viridiplantae using Blast2GO (Conesa et al. 2005). Of the 20,906 contigs that we analyzed, 15,421 were annotated with a GO term. The distribution of GO terms assigned to the contigs is presented in Fig. 2.
Fig. 2

Distribution of assembled sequences of Populus tremuloides in three main GO categories. The annotated sequences were run through GO-Slim to provide a high level summary of functions. a Biological processes (10,556 contigs annotated). b Cellular components (12,593 contigs annotated). c Molecular function (10,503 contigs annotated). Note that a contig can be assigned to more than one category

The de novo assembly of the unmapped reads resulted in 13,544 assembled contigs >200 bp with an N50 of 493 bp (4,042,046 unmapped reads assembled). An initial search using blastx against the NCBI “nr” database was performed (with an e-value cutoff of 1 × 10^−10; Table 3). Approximately 27 % of the de novo assembled contigs had no significant match in the NCBI “nr” database, whereas 28 % had significant homology to at least one angiosperm sequence (25 % of these were reported from the genus Populus). Not surprisingly, almost half (44 %) of the contigs assembled from the unmapped reads showed sequence similarity (based on the above blast criteria) with bacteria, suggesting low-level contamination (fewer than 1 % of all high-throughput sequencing reads).
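For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows how it is commonly computed: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative assumptions, not values from this assembly.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L together contain at
    least half of the total assembled bases."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Toy example: total length 32, so half is 16; 8 + 8 reaches it.
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```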

SNP discovery and genotyping

We found 396,909 total SNPs, of which 220,212 (55.5 %) were fixed differences between P. tremuloides and P. trichocarpa. Of these differences, 177,880 (80.8 %) were within ORFs and 83,201 (46.8 %) of these were synonymous differences. Within P. tremuloides, we found 176,697 SNPs, equivalent to 1 SNP/224 bp. These putative SNPs include heterozygous sites within aspen individuals as well as nucleotide substitutions among the individuals. Of the 146,970 aspen SNPs detected in ORFs, 69,888 (47.6 %) were synonymous substitutions. We detected 77,082 nonsynonymous substitutions, of which 59,866 were found in transcripts with only a single nonoverlapping ORF. Thus, the proportion of nonsynonymous SNPs within P. tremuloides was similar to that between P. tremuloides and P. trichocarpa. Within the aspen leaf transcriptome, we found levels of SNP polymorphism similar to those found in studies of other plant species (e.g., Bajgain et al. 2011; Novaes et al. 2008; Coppe et al. 2010; Geraldes et al. 2011). A summary of SNP frequency and mutation type is provided in Table 4 and the distribution of read coverage depth in Fig. 3. We have made the full list of putative SNPs within aspen publicly available at the AspenDB website, which is part of the Dendrome Project. The data reported for each putative SNP include the position of the SNP relative to the P. trichocarpa genome assembly, the reference base, the nucleotide variants and their frequencies present in the reads mapped here, and the coverage depth for each nucleotide site. These putative SNPs should help guide future population studies, comparative analyses, or even possibly development of QTLs within aspen, and add to the growing SNP resources for aspen (Kelleher et al. 2012) and other Populus species (Fladung and Buschbom 2009). Further characterization and validation of these data will be invaluable for research related to aspen.
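The SNP tallies above can be cross-checked with a few lines of arithmetic. The short Python sketch below uses only the counts quoted in the text; the variable names and the helper `percent` are introduced here for illustration.

```python
total_snps = 396_909
fixed_between_species = 220_212
fixed_in_orfs = 177_880
fixed_synonymous = 83_201
within_aspen = 176_697
aspen_in_orfs = 146_970
aspen_synonymous = 69_888

def percent(part, whole):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)

# The two SNP classes partition the total.
snp_sum = fixed_between_species + within_aspen

pct_fixed = percent(fixed_between_species, total_snps)           # 55.5
pct_fixed_in_orfs = percent(fixed_in_orfs, fixed_between_species)  # 80.8
pct_fixed_syn = percent(fixed_synonymous, fixed_in_orfs)         # 46.8
pct_aspen_syn = percent(aspen_synonymous, aspen_in_orfs)         # 47.6
aspen_nonsynonymous = aspen_in_orfs - aspen_synonymous           # 77,082
```

The synonymous fractions (46.8 % between species, 47.6 % within aspen) are the basis for the statement that the nonsynonymous proportions are similar.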
Table 4

Frequency of SNPs in Populus tremuloides by mutation type

Columns: SNP type; Percent count; Percent total. Row categories include: Complex SNP

Fig. 3

Distribution of SNPs by read depth in Populus tremuloides (176,697 total SNPs). SNPs with read coverage >100× are not shown. The coverage depth for each SNP is the total number of reads that includes representatives for both alleles present in each diploid individual

Differential expression

To characterize the variation present among P. tremuloides samples, we analyzed gene expression differences between the RNA-Seq libraries and determined whether there were statistically significant differences in expression levels of transcripts across samples. We made two separate comparisons: one among the five genets and a second between the two geographic regions. All of our RNA-Seq libraries were prepared from leaves at the same developmental stage, collected at the same time on the same date from trees grown in a greenhouse common garden.

One characteristic of the individual pairwise comparisons for differential expression was the large number of genes that differed statistically in expression among sampled genets, whether or not they originated from geographically distant regions (Fig. 4 and Table 5). In other studies of differential gene expression, the high sensitivity of the RNA-Seq method has allowed for a new level of detail and detection when considering closely related genotypes (Zenoni et al. 2010; Kawahara-Miki et al. 2011). The amount of expression variation between individuals of aspen is higher than in similar comparisons of gene expression in other plants at this taxonomic scale (Cohen et al. 2010; Matzkin 2012, 2008; Zhou et al. 2012; Beritognolo et al. 2011; Kawahara-Miki et al. 2011), but it is consistent with results from studies that used other measures of genotypic diversity in P. tremuloides (Chong et al. 1994; Cole 2005; Lund et al. 1992). We suspect that the high interindividual expression variability may be due in part to our lack of replication and suggest that future transcriptome studies include replication of populations at contrasting latitudes, increased representation of individuals within populations, as well as within-individual (technical) replication. Otherwise, it is difficult to determine whether the large number of differentially expressed genes is a genus-wide phenomenon resulting from unaccounted-for sources of variation, such as temporal or mechanical effects or stochastic variation, or whether it simply reflects false positives resulting from lack of replication.
Fig. 4

Example MA plot of differentially expressed genes in a comparison of two Populus tremuloides individuals from the same population, using Fisher’s exact test to assess the significance of expression differences at a false discovery rate of 0.05 (7,886 genes; see Table 5 for complete pairwise results)
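The two axes of an MA plot (log ratio versus log abundance) can be computed as in the following Python sketch. The function name and the pseudocount used to avoid taking the logarithm of zero are assumptions for illustration; edgeR's plotSmear handles zero counts in its own way, which this sketch does not reproduce.

```python
import math

def ma_values(count_a, count_b, pseudocount=0.5):
    """Return (M, A) for one transcript from two size-normalized counts.

    M is the log2 ratio of the two counts; A is their mean log2 abundance.
    """
    log_a = math.log2(count_a + pseudocount)
    log_b = math.log2(count_b + pseudocount)
    return log_a - log_b, (log_a + log_b) / 2

# With no pseudocount, counts of 8 and 2 give M = 2 (fourfold) and A = 2.
m_value, a_value = ma_values(8, 2, pseudocount=0)
```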

Table 5

Pairwise comparison results of the numbers of significantly differentially expressed genes among Populus tremuloides individuals: significance was based on Fisher’s exact test with a false discovery rate of 0.05

We discovered a small set of genes that were expressed at significantly different levels between two latitudinally separated sampling regions, despite the enormous variation present between individuals (Fig. 5). We further characterized these differentially expressed transcripts by querying against NCBI’s “nr” database and assigning GO terms in Blast2GO (Fig. 6). Overall, the GO term contributions were similar to those for total transcripts (Fig. 2), although it appears that the organellar component of differentially expressed genes represents a smaller overall contribution compared to the proportion in the totals.
Fig. 5

Differential expression between Populus tremuloides populations (two and three individuals from Arizona and Montana, respectively). The MA (log ratio versus log abundance) plot highlights 154 statistically significant transcripts at a false discovery rate value of 0.05
Fig. 6

Gene ontology (GO) graphs of differentially expressed genes from pooled individuals of Populus tremuloides within two collection sites: Trout Creek, MT, USA and Mount Lemon, AZ, USA. a Biological processes. b Cellular component. c Molecular function


Aspen has the potential to be an important study system for responses to climate change and its effects across a landscape that spans a large majority of the North American continent. The characterization of the leaf transcriptome marks an important step toward high-throughput comparative genomics in this ecologically significant tree species. We have identified a large number of SNPs in transcribed genes that should prove useful for species-wide population genetic studies of P. tremuloides, enabling a range of high-throughput sequencing and genotyping methodologies (e.g., microfluidic SNP genotyping across large numbers of individuals). Our study of differential gene expression in aspen is part of a small but growing body of literature that documents differential gene expression between latitudinally separate populations (Geraldes et al. 2011; Kawahara-Miki et al. 2011; Beritognolo et al. 2011; Eckert et al. 2009; Ellison et al. 2011) and provides a first look into the differentiation and gene expression changes within the transcriptome of aspen.


Thanks to Tim Benedict, Mary Lou Fairweather, and E. Pfalzer for sample collections. Thanks to Tara Jennings for laboratory assistance and Chris Sullivan for biocomputing assistance. This research was funded by the USDA Forest Service Western Forest Transcriptome Survey and National Fire Plan (2012-NFP-GSD-1).

Copyright information

© Springer-Verlag Berlin Heidelberg (outside the USA) 2013