Background

Four remarkable evolutionary phenomena are associated with organellar genomes of Geraniaceae. First, mitochondrial genomes show multiple, major shifts in rates of synonymous substitutions, especially in the genus Pelargonium[1, 2]. Rate fluctuations of such magnitude have been documented in only two other plant lineages, Plantago[3] and Silene[46]. Second, mitochondrial genomes have experienced extensive loss of genes and sites of RNA editing. At least 12 putative gene losses have been documented in Erodium[7], and mitochondrial genes sequenced from Pelargonium x hortorum had a drastic reduction in predicted or verified RNA editing sites compared to all other angiosperms examined [1]. Third, genome-wide comparisons of nucleotide substitutions in plastid DNA indicated rapid rate acceleration in genes encoding ribosomal proteins, RNA polymerase, and ATP synthase subunits in some lineages. In the case of RNA polymerase genes there was evidence for positive selection [8, 9]. Fourth, plastid genomes of Geraniaceae are the most highly rearranged of any photosynthetic land plants examined [1013]. Multiple and extreme contractions and expansions of the inverted repeat (IR) have resulted in genomes with both the largest IR (74,571 bp, [11]) as well as the complete loss of this feature [12, 13]. Considerable accumulation of dispersed repeats associated with changes in gene order has been documented along with disruption of highly conserved operons and repeated losses and duplications of genes [12]. In P. x hortorum plastids, these genomic changes have generated several fragmented and highly divergent rpoA-like ORFs of questionable functionality [8, 1012], despite the fact that rpoA encodes an essential component of the plastid-encoded RNA polymerase (PEP).

Because nuclear genes supply both organelles with the majority of their proteins, it is likely that the extensive organellar genomic upheaval in Geraniaceae will also influence the evolution of organelle-targeted genes in the nuclear genome. For example, given the drastic reduction of RNA editing in Geraniaceae mitochondrial transcripts, it is reasonable to expect a correlated reduction of nucleus-encoded pentatricopeptide repeat (PPR) proteins, many of which are critical for organellar RNA editing [1417]. The uncertain status of the P. x hortorum plastid-encoded rpoA gene is also likely to have nuclear consequences. If this plastid gene is not functional, then a functional copy might have been relocated to the nuclear genome, which has only occurred once in the evolution of land plants in mosses [18, 19]. Alternatively, it is possible that PEP has become nonfunctional in P. x hortorum, as observed in the holoparasite Phelipanche aegyptiaca[20]. In P. aegyptiaca, loss of all plastid-encoded PEP components (rpoA, rpoB, rpoC1 and rpoC2) resulted in the parallel loss of the requisite nucleus-encoded components (sigma factors) that assemble with the plastid encoded proteins to form the core of the PEP holoenzyme [20]. In contrast, if the highly divergent plastid rpoA gene is still functional in P. x hortorum, then the typical set of sigma factors should be present in the nuclear genome.

One prerequisite to begin to address the effects of organellar genomic upheaval on the nuclear genome in Geraniaceae, is availability of nuclear sequence information. Transcriptome sequencing provides a tractable proxy for nuclear gene space. The use of next-generation sequencing (NGS) for transcriptome sequencing is widespread because volumes of data can be generated rapidly at a low cost relative to traditional Sanger sequencing [21]. The assembly of reads into contigs may be executed using a de novo or a reference-based approach [22]. In studies of non-model organisms, de novo assembly is more commonly used due to the absence of a closely related reference [23, 24]. A survey of recent transcriptome studies in comparative biology demonstrates that most sequencing projects are focusing on non-model organisms where little or no genomic data is available [22, 2531]. The lack of a reference genome makes the reconstruction and evaluation of the transcriptome assembly challenging. Several issues must be addressed when performing transcriptome sequencing of non-model organisms, including which NGS platform should be employed, how much sequence data is needed to provide a comprehensive transcriptome, which assembler should be utilized, and what tissues should be sampled.

This paper provides a comprehensive comparison of the transcriptomes of two non-model plant species, Pelargonium x hortorum and Geranium maderense, from the two largest genera of Geraniaceae. There were three primary goals for the initial comparative transcriptome analysis in Geraniaceae: (1) What are the best sequencing platforms and assembly methods for generating a high-quality transcriptome that broadly covers gene space in the absence of a reference genome? (2) Does sequencing from multiple tissue types improve the breadth of transcriptome coverage? (3) Are there any losses of PPR proteins involved in RNA editing and sigma factors associated with PEP in Geraniaceae?

Results

Ribosomal RNA content and Illumina library complexity

To assess the efficiency of ribosomal RNA (rRNA) depletion in Geraniaceae transcriptome libraries rRNA contigs were identified using rRNA from Arabidopsis thaliana as a reference. All Illumina reads (146,690,142 reads for Geranium maderense and 148,749,374 reads for Pelargonium x hortorum) were mapped to rRNA contigs as described in methods, and 0.7% and 2% of the reads of G. maderense and P. x hortorum were identified as rRNA reads, respectively. Library complexity was analyzed using Picard [32] and rRNA reads were eliminated prior to the analysis. The percentages of unique start sites were 42.7% and 46.1% for G. maderense and P. x hortorum, respectively. The values for rRNA content and library complexity were comparable to other transcriptome analyses using similar approaches [33, 34].

Assessment of sequencing platforms and assemblers for transcriptome assembly

To determine the optimal sequencing and assembly strategy, the efficacy of five different assemblers was examined using two initial data sets generated by Roche/454 FLX and Illumina Hiseq 2000 platforms for P. x hortorum. The Illumina run produced approximately 40 times more sequence data than the 454 run, even though the cost of the 454 data was at least four times more than the Illumina data (Table 1). A comparison of basic assembly statistics (Table 2) showed that the Trinity assembler outperformed all other platform/software combinations in terms of number of contigs, number of assembled nucleotides, mean and maximum contig length, and N50. More generally, the Illumina assemblers consistently outperformed the 454 assemblers, although the MIRA and Newbler 454 assemblers produced longer maximal contigs than SOAPdenovo and SOAPdenovo-trans (SOAPtrans). To determine the amount of usable protein sequence information generated by each assembler, the assemblies were translated as described in methods and compared (Table 3). Again, the Illumina assemblers outperformed the 454 assemblers in all metrics, with the Trinity assembler providing the most amino acids with the longest mean and maximal sequences. The length distribution of assembled nucleotides and translated amino acids further confirms that Trinity outperformed SOAPdenovo and SOAPtrans, and all three Illumina assemblers outperformed the 454 assemblers (Figure 1).

Table 1 The Pelargonium x hortorum transcriptome dataset read statistics
Table 2 Basic assembly statistics for the Pelargonium x hortorum transcriptome
Table 3 Translated contig statistics for Pelargonium x hortorum
Figure 1
figure 1

Contig length. (A) Contig length distribution. The vertical dashed line shows the arbitrary cutoff of 200 base pairs (bp). Contigs shorter than 200 bp were disregarded for this analysis. (B) Translated contigs length distribution. The vertical dashed line shows the arbitrary cutoff of 50 amino acids (aa). Contigs shorter than 50 aa were disregarded for this analysis.

Two important considerations in assembly analysis are the breadth of gene space coverage and the degree of coverage fragmentation. A good assembler should generate high-quality assemblies that contain as many reference transcripts as possible, and each reference transcript should be covered as completely as possible with a single long contig rather than a combination of several short contigs. To assess assembly coverage and fragmentation, two published data bases were used, 357 ultra-conserved ortholog (UCO) coding sequence [35] from Arabidopsis and 959 single copy nuclear genes shared between Arabidopsis, Oryza, Populus, and Vitis[36]. Trinity and SOAPtrans outperformed all other assemblers in terms of the percentage of reference genes identified, completeness of coverage (i.e. fraction of reference gene coverage by one or more contigs), and contiguity of coverage (i.e. fraction of reference gene coverage by a single long contig), with Trinity performance slightly better than SOAPtrans at higher thresholds (Figures 2 and 3).

Figure 2
figure 2

Contiguity (A) and completeness (B) of different assemblers at different thresholds. The assemblies were aligned with two published reference data bases: 357 ultra-conserved ortholog (UCO) coding sequence [35] and 959 single copy nuclear genes [36].

Figure 3
figure 3

Completeness and contiguity results at threshold 80% using two published reference protein sets. Data sets: (A) 959 single copy nuclear genes (APVO); (B) 357 ultra-conserved ortholog (UCO) coding sequence [35, 36]. Cont = contiguity, comp = completeness, % hits = percentage of hits in reference transcriptome.

To examine whether the superior performance of Trinity and SOAPtrans was due to the much larger amount (40 times) of Illumina data than 454 data, the Illumina assemblers were re-anlayzed using a data set containing 1/40th of the Illumina reads (Additional file 1). In terms of contiguity and completeness, the performance of Trinity using the reduced Illumina data set remained superior to the 454 programs (Newbler, MIRA) that used the entire 454 data sets. In contrast, the performance of SOAPdenovo and SOAPtrans were noticeably worse with the reduced Illumina data set than with the full data set, producing results that were generally worse than the original 454 assemblies.

Effect of sequencing depth on assembly coverage breadth and fragmentation

To determine how much sequence data is needed to assemble a high-quality transcriptome with broad coverage, 146,690,142 reads for G. maderense and 148,749,374 reads for P. x hortorum were generated on the Illumina Hiseq 2000 platform assembled using Trinity with different increments of reads from 5% to 100% of the total. While the number of contigs assembled continued to increase with increasing numbers of reads (Figure 4A), the percentage of reference genes recovered and their contiguity and completeness plateaued at approximately 40% of the total reads (Figure 4B-D). Including the remaining 60% of the reads increased contiguity and completeness by only 1% to 2% (Figure 4B-C). Although there were more translated contigs of G. maderense than P. x hortorum, the contiguity and completeness of both species were very similar.

Figure 4
figure 4

Comparisons of Geranium maderense and Pelargonium x hortorum for four assembly parameters using different percentages of sequencing reads. (A) number of contigs, (B) contiguity, (C) completeness, and (D) percentage of hits. For completeness and contiguity two published reference protein sets were used (357 ultra-conserved ortholog (UCO) coding sequence [35] and 959 single copy nuclear genes [36]). Assemblies were aligned with the reference data sets using BLASTX with an E-value of 1 E-10.

Although increasing the number of reads beyond 10% contributed little to finding novel hits to the local Arabidopsis data base, increasing the amount of data did help extend the existing contigs and generate longer alignments to reference genes. To evaluate this, the contiguity of all contigs relative to the two published databases was calculated at different contiguity thresholds up to 100% (Figure 5). The inclusion of more reads generated assemblies with higher contiguity, especially when contiguity thresholds were greater than 50%. To allow for the high level of sequence divergence between Geraniaceae and Arabidopsis, the number of contigs that had contiguity thresholds ≥80% was calculated. When 100% of the reads were used 4185 contigs and 4494 contigs were found in G. maderense and P. x hortorum, respectively. Reducing the read input to 40% reduced contiguity values by 7% (4163/4494) in G. maderense and 11% (3731/4185) in P. x hortorum.

Figure 5
figure 5

Contiguity of Geranium maderense (A) and Pelargonium x hortorum (B) at different threshold values with differ percentages of reads using all Arabidopsis proteins from Uniprot/Swissprot database [[85]]. Assemblies were aligned with the database using BLASTX with an evalue of 1 E-10.

Functional assessment of Geraniaceae nuclear transcriptomes

The assemblies generated using 100% of the reads for both Geraniaceae species were used for functional annotation. Assemblies were first aligned against the NCBI nr database and the alignment results were used to generate the gene ontology (GO) terms. Of the 114,762 contigs in P. x hortorum, 56,283 (49%) had blast hits; 42,506 (37%) were annotated and 222,765 GO terms were retrieved (Table 4). Of the 119,217 contigs in G. maderense, 76,332 (64%) had blast hits; 58,461 (49%) were annotated (Table 4) and 311,108 GO terms were retrieved. The annotation files are shown in Additional file 2. The distribution of gene ontology annotations was examined using GO-slim (plant) ontology to compare the transcriptomes of G. maderense and P. x hortorum. Although the number of annotated contigs differed substantially between the two transcriptomes (Table 4), the proportion of annotated contigs in all categories with >1% representation within the categories cellular component, molecular function, and biological process were very similar (Figure 6). This similarity persists even though only emergent leaves were sampled for G. maderense versus four tissue types (emergent and expanded leaves, roots and flowers) for P. x hortorum.

Figure 6
figure 6

Gene ontology assignments for Geranium maderense (GMR) and Pelargonium x hortorum . The proportion of annotated contigs in all categories with >1% representation within the ontology (GO) categories for cellular component, molecular function, and biological process.

Table 4 Statistics of transcriptome annotations for Geranium maderense (GMR) and Pelargonium x hortorum (PHR)

To more directly address the question whether sequencing from multiple tissue types improves the breadth of transcriptome coverage, orthologous genes between G. maderense and A. thaliana and between P. x hortorum and A. thaliana were identified. Of the 35,386 protein sequences from A. thaliana, the G. maderense assembly had homologs to 11,131 sequences and the P. x hortorum assembly had homologs to 11,583 sequences. The comparable numbers of orthologous genes found for the two Geraniaceae species indicated that there was little improvement on the breadth of transcriptome coverage by sequencing from multiple tissue types (1 versus 4 tissues for G. maderense and P. x hortorum, respectively).

Identification of selected organelle targeted genes

Pentatricopeptide repeat proteins (PPRs) are a large family of RNA binding proteins encoded by over 400 genes in angiosperms; most are organelle targeted and involved in regulating organelle gene expression. The transcriptomes of P. x hortorum and G. maderense were annotated using 429 Arabidopsis PPR sequences as a reference database (Table 5). The overall number of PPR genes varied considerably between the two Geraniaceae and Arabidopsis, with PPR gene number reduced in P. x hortorum. The numbers of P class PPR genes were found to be similar in all three species, whereas many fewer PLS class genes were found in the Geraniaceae, especially in P. x hortorum.

Table 5 PPR protein a and sigma factor b distribution

Sigma factors are nuclear encoded, plastid targeted proteins that assemble with four plastid encoded proteins (rpoA, rpoB, rpoC1 and rpoC2) to form the core of the PEP holoenzyme. At least one copy of each of the six Arabidopsis sigma factors was detected in both the G. maderense and P. x hortorum transcriptomes (Table 5). The nucleotide and amino acid sequence identities between Arabidopsis/Geranium and Arabidopsis/Pelargonium for all six sigma factors were very similar (Table 6). The four contigs from G. maderense that aligned to sigma factor 2 were similar to each other in nucleotide sequence identity (87%), suggesting that they may represent variant copies of the same gene. Two of the three contigs from G. maderense that aligned to sigma factor 5 were very similar to each other but less so to the third contig (98% versus 71% nucleotide sequence identity). Sigma factors 2 and 6 were each represented by two P. x hortorum contigs, however only one of the contigs for each sigma factor appeared functional having start/stop codons at the 5′ and 3′ ends and lacking internal stop codons. Further experiments are needed to determine if the copies with internal stop codons are pseudogenes or assembly artifacts.

Table 6 Sequence identities between intact contigs in Geraniaceae and Arabidopsis thaliana sigma factors

Discussion

Strategies for de novoassembly of transcriptomes

The use of NGS platforms is widespread and is applied in many research fields as volumes of data can be generated rapidly at a low cost relative to traditional Sanger sequencing [21]. RNA-seq, one popular NGS application, provides an efficient and cost-effective way of obtaining transcriptome data. There are a number of platforms available for generating NGS data [37, 38]. Currently among the most popular are the Roche/454 FLX (http://www.roche.com) and the Illumina Hiseq 2000 (formerly Solexa; http://www.illumina.com) platforms. The Roche/454 FLX system is advantageous when longer reads are important (average read length 700 bp), whereas the Illumina system provides deeper sequencing coverage at a reduced cost per base, albeit with shorter read length (average length 100 bp).

For each platform various assemblers have emerged but during the past several years Roche 454 sequencing and the platform-specific assembler Newbler has been the most common approach for de novo assembly of transcriptome data [3943]. This may be attributed to the idea that longer reads are more likely to overcome the specific challenges of de novo transcriptome assembly. Illumina sequencing has been used mainly when a related organism’s genome was available for reference-based assembly [44, 45], although due to recently increased read length it is becoming more common for use in de novo assembly as well [46, 47]. Several recent studies compared the performance of different sequencing platforms and assembly methods [4850] but none of these comparisons evaluated the level of completeness or contiguity of their assemblies, nor was the performance of the assemblers evaluated without known genome information, which is the situation for any project on non-model organisms.

Our comparisons of sequencing platforms and assemblers for the Geraniaceae clearly indicated that the Illumina platform with Trinity assembly delivered the best performance in assembling a complete transcriptome in the absence of a reference genome. The Illumina assemblers (Trinity, SOAPdenovo, SOAPtrans) generated more contigs containing a greater total number of bases than the Roche/454 FLX assemblers (Newbler, MIRA). hile the MIRA assembly generated many more long contigs (>6 kb) than SOAPdenovo, the Trinity assembly out-performed all others in delivering long contigs, suggesting that the Trinity assembly contained more useful information than any of the other assemblies analyzed. While the Roche/454 FLX assemblies and the Illumina SOAPdenovo assembly produced similar results with regard to completeness and contiguity, the Illumina Trinity and SOAPtrans assemblies obtained much higher values for both parameters indicating that these assemblies comprise many more nearly complete transcripts (Figures 2 and 3).

Functional annotation of Geraniaceae transcriptomes

A total of 58,461 (49%) and 42,506 (37%) contigs were annotated from G. maderense and P. x hortorum, respectively. The low percentage of annotated contigs is most likely due to the large number of total contigs assembled. The number of aligned and annotated contigs is comparable to nine other recently published transcriptomes [22, 27, 5156]. The number of annotated contigs in assemblies from both Geraniaceae species was very similar for the three major categories cellular component, molecular function, and biological process (Figure 6). This is encouraging since different tissues were sampled for the two species; only one tissue, emergent leaves for Geranium and four tissues, emergent leaves, expanded leaves, roots and flowers for Pelargonium. Particularly noteworthy is the detection of genes associated with flower and embryo development and pollen-pistil interaction since flowers were not sampled for Geranium. Overall, this comparison indicates that there is no marked improvement in transcriptome breadth of coverage when sampling four tissues compared to only emergent leaves.

PPR proteins and sigma factors in Geraniaceae

PPRs are a large family of RNA binding proteins encoded by over 450 genes in sequenced angiosperms. Most are organelle targeted and involved in regulating organelle gene expression [57]. Of the two classes (P and PLS) within the PPR family, those from PLS class (E and DYW subclasses) have been reported to be involved in RNA editing [1416, 5864]. Previous studies have demonstrated correlated evolution of PLS genes and RNA editing sites in plants [17, 65]. Consistent with these results, a reduction in PLS genes (Table 5) in Geraniaceae was detected, where reduced editing frequency was previously demonstrated [1]. The reduced editing frequency and reduced PPR content in Geraniaceae is especially intriguing with respect to the increased mitochondrial substitution rate in this family. Although an inverse correlation between editing frequency and substitution rate has been noted previously in Geraniaceae and other taxa [1, 6668], the finding that PPR gene content is also reduced in Geraniaceae indicates that this family is ideally suited for future studies assessing the evolutionary dynamics of editing frequency, PPR content, and mitochondrial substitution rates.

One long-standing question regarding the plastid genomes in Geraniaceae is the putative loss of the rpoA gene from P. x hortorum[6971]. The complete plastid genome sequence of this species revealed several rpoA-like open reading frames (ORFs) that are highly divergent relative to rpoA genes in other angiosperms or even other Geraniaceae [11, 12]. Two alternative explanations were suggested for these observations: (1) a copy of the gene in the nucleus had gained functionality; or (2) at least one of the highly divergent rpoA-like ORFs remains functional. Extensive evolutionary rate comparisons of plastid genes across the Geraniaceae revealed that the other three PEP subunits (rpoB, rpoC1, rpoC2) have significantly elevated nucleotide substitution rates and have likely experienced positive selection [8, 9]. Despite exhaustive searching of the nuclear transcriptome of P. x hortorum no copy of the rpoA gene was detected. However, intact copies of all six sigma factors, which are required for PEP to function [72], were identified in the transcriptome. The holoparasite Phelipanche aegyptiaca lacks a functional PEP and mining unigene files published in a recent transcriptomic study of parasitic plants [20] failed to uncover a single sigma factor suggesting that in species where PEP sequences are lost from the plastid the requisite sigma factors are also absent from the nuclear transcriptome. The identification of all six sigma factors in the P. x hortorum transcriptome supports the likelihood that PEP is active in P. x hortorum plastids.

Conclusions

With the widespread application of NGS techniques, the ability to process and analyze massive quantities of sequence data in a timely manner becomes imperative to a successful project. Regardless of the goals of a particular project, it is desirable to obtain data that is as accurate and complete as possible in a way that is cost effective as well as timely. In this study a cross-platform comparison of de novo transcriptome assembly was conducted using representative species from the two largest genera of Geraniaceae, G. maderense and P. x hortorum. As no reference genome is available for Geraniaceae, or any of its close relatives, this approach represents a truly de novo assembly allowing evaluation of efficacy among the platforms/assemblers that more closely resembles current NGS research. The assembly of Illumina Hiseq 2000 reads with Trinity or SOAPtrans was highly effective in reconstructing, as completely as is currently feasible, the protein-coding transcripts of Geraniaceae. As for the differences between the two assemblers, Trinity generated slightly more single contiguous contigs and reconstructed more reference genes with a combination of multiple contigs, while SOAPtrans ran much faster than Trinity. These differences in contiguity and completeness became more obvious with the reduced set of input data (1/40th in this case). These findings recommend the Illumina platform with Trinity assembly to obtain the most complete gene coverage by a single contig, especially when a small amount of reads are available. In instances where a large amount of data is available and there are limited computational resources, Illumina SOAPtrans assembly may be preferred as it generated a relatively complete assembly much more quickly than Trinity. Furthermore, evaluation of the amount of Illumina sequence data required for generating a complete transcriptome is approximately 60 million reads.

Geraniaceae organelle genomes have been shown to exhibit a number of unusual features relative to other angiosperms, including highly accelerated rates of nucleotide substitutions in both mitochondrial and plastid genes [1, 8, 9], reduced RNA editing in mitochondrial genomes [1] and highly rearranged plastid genomes [1013]. This comparative transcriptome analysis of G. maderense and P. x hortorum detected a reduction in PPR proteins associated with RNA editing, which corresponds with reduced RNA editing in the mitochondria. Examination of nuclear encoded, plastid targeted sigma factors required for PEP function supports the hypothesis that PEP is active in P. x hortorum plastids, possibly incorporating the product of at least one of the highly divergent rpoA-like ORFs in the plastid genome.

Methods

RNA isolation

Plant tissues were collected from live plants grown in the University of Texas (UT) greenhouse and frozen in liquid nitrogen for two species from different genera of Geraniaceae, Geranium maderense and Pelargonium x hortorum cv ringo white. For Pelargonium leaf and inflorescence samples were collected. Leaves were of two developmental stages, newly emerged and fully expanded. Entire inflorescences were harvested prior to anthesis. Root samples of P. x hortorum were harvested from specimens grown aseptically in agar media. For Geranium, only emergent leaves were collected. Total RNA was isolated separately from each sample type by grinding in liquid nitrogen followed by 30 min incubation at 65°C in two volumes of extraction buffer (2% Cetyltrimethylammonium bromide, 3% Polyvinylpyrrolidone-40, 3% 2-Mercaptoethanol, 25 mM Ethylenediaminetetraacetic acid, 100 mM Tris(hydroxymethyl)aminomethane-HCl pH 8, 2 M NaCl, 2.5 mM spermidine trihydrochloride) with vortexing at 5 min intervals. Phase separation with chloroform:isolamyl alcohol (24:1) was performed twice and the aqueous phase was adjusted to 2 M LiCl. Samples were precipitated overnight at 4°C and total RNA was pelleted by centrifugation at 17,000 × g for 20 min at 4°C. RNA pellets were washed once with 70% ethanol and air dried at room temperature. Following resuspension in RNase free water, RNAs were analyzed by denaturing gel electrophoresis and by spectrophotometry. For Pelargonium, the four tissue types were pooled in equimolar ratio. All RNAs were treated with DNase I (Fermentas, Glen Burnie MD, USA) according to the product protocol. DNase I was removed from the solution by extraction with phenol:chloroform:isoamyl alcohol (25:24:1) and the aqueous phase was adjusted to 0.3 M sodium acetate. RNA was precipitated with 2.5 volumes of cold absolute ethanol for 20 min at −80°C. Pellets were washed with 70% ethanol, air-dried and resuspended in water to 1 μg μL-1. Total RNA sample aliquots were frozen in liquid nitrogen and shipped on dry ice to the Beijing Genomics Institute (BGI) in Hong Kong or delivered to the Genome Sequencing Analysis Facility (GSAF) at UT. Confirmation of sample quality and concentration was conducted at each facility using the Agilent 2100 Bioanalyzer instrument (Agilent Technologies, Santa Clara CA, USA).

Illumina sequencing

Sample preparation for Illumina sequencing was performed at BGI according to Illumina’s protocol (Part # 1004898 Rev. D). Total RNA was treated with the Ribo-Zero™ rRNA Removal Kit (Epicentre Biotechnolgies, Madison WI, USA) prior to fragmentation and priming with random hexamers for first strand cDNA synthesis using SuperScript® III Reverse Transcriptase (Invitrogen, Beijing, China). Second strand cDNA synthesis was carried out using RNase H (Invitrogen) and DNA polymerase I (New England BioLabs, Beijing, China). The resulting cDNA fragments were purified with QIAQuick® PCR extraction kit (Qiagen, Shanghai, China) and normalized with Duplex-Specific thermostable nuclease (DSN) enzyme from Kamchatka crab (Evrogen, Moscow, Russia) according to the protocol outlined by Invitrogen (Part # 15014673 Rev. C). End repair and adenylation of the normalized cDNA library was followed by ligation to the paired-end (PE) sequencing adapters. Following gel electrophoresis for size selection (180–220 bp) the library was PCR amplified for sequencing using the Illumina HiSeq™ 2000. The PE library was sequenced for 101 bp.

Roche/454 FLX sequencing

The method for cDNA library construction and normalization was based on that of Meyer et al. [73]. Briefly, total RNA was reverse-transcribed using oligo-dT coupled to a PCR-suppression primer. The reverse complement of this primer was incorporated at the 3′ end of the first-strand cDNA using the template switching capability of the SuperScript II Reverse Transcriptase (Invitrogen). Duplex-specific nuclease was added to digest the abundant double-stranded cDNA. After purification, PCR was performed, and the product was purified and sheared by nebulization. The fragmented DNA was then end-repaired and ligated to Roche Rapid library adaptors using the NEBNext® Quick DNA Sample Prep Master Mix Set 2 and NEBNext® DNA Sample Prep Master Mix Set 2 (New England BioLabs). Final library size and concentration were measured on the Agilent BioAnalyzer and by qPCR before sequencing on the Roche/454 FLX sequencer.

Read pre-processing

Raw reads were preprocessed to eliminate contaminant and low quality sequences. Filtering of Illumina Hiseq 2000 reads included the removal of low quality bases, reads where (poly) adenosine constitutes more than 6% of bases, and reads containing specialized features such as adaptors and other artifacts arising from library construction. Roche/454 FLX reads were preprocessed by removing reads shorter than 50 bp and reads with artificial sequences based on a vector reference file. The complete data set is available at NCBI Sequence Read Archive (Accession numbers SRA059171 for Geranium and SRA053016.1 for Pelargonium).

Ribosomal RNA content and Illumina library complexity

Ribosomal RNA (rRNA) contigs were identified using reciprocal blast of rRNA from Arabidopsis (5.8S, 18S and 25S from nucleus, 5S, 16S and 23S in chloroplast, and 5S, 18S and 26S in mitochondria) as reference. The rRNA sequences from Arabidopsis were downloaded from TAIR [74]. Ribosomal RNA reads were removed prior to the library complexity analysis. Due to a lack of nuclear genome sequence, the remaining reads were mapped back to the whole transcriptome data using bowtie2 [75]. The mapping results were sorted using samtools [76] and then analyzed with MarkDuplicates module of Picard [32].

Assembly

Transcritpome assemblies were initially performed on Pelargonium using a variety of assemblers to compare the efficacy of different platforms and assemblers. After these initial comparisons, all subsequent assembles were performed on both Geranium and Pelargonium using Trinity and Illumina data. For assembly of clean Illumina reads, Trinity [77], SOAPdenovo and SOAPtrans (http://soap.genomics.org.cn/SOAPdenovo-Trans.html) [78, 79] were used. Trinity, released on 2011-08-20 (http://sourceforge.net/projects/trinityrnaseq/), was run with parameters “--seqType fq --CPU 10 --paired_fragment_length 200 --run_butterfly” on a 24-core 3.33 GHz linux work station with 1 TB memory at the Texas Advanced Computing Center (TACC, http://www.tacc.utexas.edu/). The assembly was split into three steps according to the provided script trinity.pl released with the software. The split scripts run the corresponding three steps in Trinity: inchworm, chrysalis, and butterfly. The parameters were the same for each step, and each step picked up the previous step’s output as input and processed it. The scripts will be provided by JZ upon request. The SOAPtrans assembly was run with the parameters “kmer = 61, max_rd_length = 100, avg_ins = 200” on the same server as that of Trinity. For SOAPtrans kmer lengths from 23 bp to 81 bp were explored; 61 bp was selected because it generated the best contiguity compared with other kmer values. The SOAPdenovo assembly was done at BGI on a 48-core 2.67 GHz Linux workstation with 50 GB memory with parameters “Kmer = 41, insert size = 200, overlap threshold = 50” for assembly, and “Kmer + 1” to fill the gaps. The generated fasta file was postprocessed by BGI to remove the sequences shorter than 150 bp. Assembly of Roche/454 FLX utilized MIRA [80] and Newbler [81]. MIRA 3.4.0 for a 64-bit linux system (http://sourceforge.net/projects/mira-assembler/files/MIRA/stable/) was released on 2011-08-21. MIRA was run with parameters “--job = denovo, est, accurate, 454 --fasta 454_SETTINGS” on a 12-core 3.33 GHz linux work station with 24 GB memory at TACC. Newbler 2.6 accompanies the Roche/454 FLX platform and assembly was conducted at UT GSAF on 24-core 2.40 GHz linux work station with 64 GB memory using the parameters “runAssembly -cpu 8 -urt -cdna –vt vector.fa”.

Comparative analysis of assemblies

Trinity, SOAPdenovo and SOAPtrans assembly output comprised a single contig file each and these were used in the analyses. Unpadded fasta files were selected from the MIRA output and the isotig file was selected from the Newbler output for use in analyses.

The initial assembly quality was evaluated using the following metrics: number of assembled contigs, maximum, minimum and mean contig length, N50 and redundancy. Initial assembly statistics and contig length distribution analysis was done by custom perl scripts and MATLAB version R2011b. Contig clustering and removal of redundant contig sequences was performed using CD-HIT [82]. CD-HIT version 4.5.4 (downloaded from http://code.google.com/p/cdhit/downloads/list) was executed using parameters “cd-hit -c 1.0 -n 5 -T 12” for cDNA sequences and “cd-hit-est -c 1.0 -n 10 -T 12” for protein sequences. Redundancy was calculated from the difference between the number of contigs before and after clustering. Maximum, minimum, and mean contig length, N50 and total bases were calculated from the contigs after clustering and removal of those contigs < 200 bp.

The assemblies were aligned to two published reference databases: 357 ultra-conserved ortholog (UCO) coding sequence [35] from Arabidopsis (sequences available at: http://compgenomics.ucdavis.edu/compositae_reference.php), and a list of 959 single copy nuclear genes shared between Arabidopsis, Oryza, Populus, and Vitis[36] using BLASTX with evalue of 1 E-10. Contig alignment to the reference databases utilized the standalone BLAST + [83] program for 64-bit linux system (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). The parameters for BLAST + DNA alignment were “blastn -task blastn -evalue 1 E-10 -word_size 11 -outfmt 6 -num_threads 12”. Parameters for protein alignment were “blastp -task blastp -num_threads 12 -outfmt 6”. For blastp, two different e values were used, 1 E-10 and 1 E-20, in order to address the generality of the results. Multiple sequence alignment was done by muscle [84]. Muscle for 64-bit linux system (http://www.drive5.com/muscle/downloads.htm) was used with default parameters.

The local reference database for identifying the open reading frames contained four proteomes downloaded from Phytozome (http://www.phytozome.net/search.php): Citrus clementina, C. sinensis, Eucalyptus grandis and Arabidopsis thaliana. Contigs were translated by alignment to the local database using blastx to identify open reading frames. The blastx parameter was “blastx -evalue 1e-6 -max_target_Seqs 1 -num_threads 48 –outfmt’6 std qframe”. The reading frame parameter was added to the output in order to facilitate the following analysis. The aligned regions of contigs were translated, extracted, and then extended by translating the contigs in both directions according to standard codon usage until a stop codon was encountered. The translated contigs were clustered again using CD-HIT at a threshold of 100% and all other parameters used the default settings. Two parameters, contiguity and completeness as described by Martin and Wang [85] were used to evaluate the alignment results. Briefly, contiguity is defined as the percentage of the reference transcripts covered at some arbitrary coverage threshold by a single longest contig. Completeness is defined as the percentage of the reference transcripts covered at a threshold by multiple assembled contigs (Box one in [85]). In this study a range of thresholds up to 100% was evaluated, and 80% was selected as the threshold for both contiguity and completeness calculations. Both parameters were calculated with protein sequence alignment, and the alignment results were analyzed using custom perl scripts available from JZ upon request.

Evaluation of assemblies with different proportion of reads

To assess how much data (number of reads) is needed to construct the complete transcriptome, different proportions of sequencing data ranging from 5% to 100% were extracted for both species. The extracted reads were assembled with Trinity using the parameters described above. Extraction and assembly were repeated three times for each proportion except 100%, and the assembly statistics (contig number, contiguity and etc.) were averaged.

Basic statistics and assembly parameters such as contiguity and completeness were calculated using the same local database described above. To determine how well the assemblies cover a complete transcriptome, the custom Arabidopsis protein database was constructed by extracting all Arabidopsis proteins from Uniprot/Swissprot database [86], and protein sequences with name “hypothetical” or “predicted” were discarded. The assemblies were aligned with the database using BLASTX with an E-value of 1 E-10.

Orthologous genes identification

Orthologous genes between transcriptomes of G. maderense, P. x hortorum and A. thaliana were identified with reciprocal blast with parameters “blastp -task blastp -num_threads 12 -max_target_Seqs 1 -evalue 1e-10 -outfmt = '6 std qlen slen”. Blast results were analyzed with custom perl scripts.

Functional annotation

The assemblies were aligned with the NCBI nr database using BLASTX with an E-value of 1 E-6 and taking the best 10 hits for annotation. The blast results were used to annotate each sequence with gene ontology (GO) terms using Blast2GO [8789]. To improve the efficiency of annotation, local blast2go database was downloaded (http://www.blast2go.com/b2glaunch/resources/35-localb2gdb). GO terms were mapped to the reduced GO-slim (plant) ontology to get a broader functional representation of the transcriptome.

Identification of selected organelle targeted genes

PPR proteins were searched for using HMMER [90, 91] with previously established PPR motif alignment files [92]. Transcript sequences with more than one PPR motif were considered PPR genes. Sigma factor protein sequences from Arabidopsis were downloaded from TAIR [74] and used as reference. Sigma factor structure and conserved domain information were obtained from previous studies [9395]. Putative transit peptides were predicted with targetP [96, 97]. Orthologs from two transcriptomes of G. maderense and P. x hortorum were identified by reciprocal blast at E-value 1 E-10.