Massively parallel pyrosequencing of two samples of T. hassleriana (Additional file 1: Table S1) yielded 1,254,286 sequencing reads in total. The raw sequencing data are deposited in the DDBJ (DNA Data Bank of Japan, http://trace.ddbj.nig.ac.jp/index_e.html) under the accessions SRR1051360 (https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR1051360) and SRX393170 (https://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX393170). The histogram of reads by length (Additional file 2: Figure S1a) showed an average read length of ~316 nucleotides. Roughly 45% of the reads could be mapped against A. thaliana TAIR10 coding sequences for counting gene expression.
Assembling the reads de novo resulted in 49,237 contigs with an N50 of 690 bases (Additional file 2: Figure S1b). Of these, 41,320 could be annotated by mapping against Arabidopsis. A total of 537 contigs (1.1%) were detected as chimeric.
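For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed from a list of contig lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly described here.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths for illustration only (not data from this study).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because the N50 weights contigs by the bases they contain, it is less sensitive to a large number of very short contigs than the mean contig length.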
Rarefied libraries were constructed separately for the two biological replicates 1 and 2 and for a merged sample to illustrate possible differences in gene discovery rates. Although the gene discovery rate of replicate 1 was lower than that of replicate 2 (Figure 2), the curves for both replicates had already flattened, indicating that a large part of the T. hassleriana floral transcriptome was detected. Merging the information of both libraries nevertheless increased the overall output, and the merged rarefaction curve nearly reached a plateau (Figure 2). This shows that each library comprised genes not detected in the other one. Thus, the merged data set allows drawing a detailed view of the transcriptome of T. hassleriana, and increasing the sequencing depth would only result in the detection of extremely rare genes.
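The idea behind a rarefaction curve can be illustrated with a short sketch. It assumes that each mapped read is represented by the identifier of the gene it maps to, and it counts the distinct genes seen as more reads are sampled. The function name, the step size and the toy reads are hypothetical, and the procedure actually used to rarefy the libraries in this study may differ.

```python
import random


def rarefaction_curve(read_gene_ids, step, seed=0):
    """Shuffle the reads once, then record how many distinct genes
    have been seen after every `step` reads (and at the very end)."""
    rng = random.Random(seed)
    reads = list(read_gene_ids)
    rng.shuffle(reads)
    seen = set()
    curve = []
    for i, gene in enumerate(reads, start=1):
        seen.add(gene)
        if i % step == 0 or i == len(reads):
            curve.append((i, len(seen)))
    return curve


# Toy library: one abundant, one moderate and one rare gene.
toy_reads = ["g1"] * 50 + ["g2"] * 30 + ["g3"] * 5
for n_reads, n_genes in rarefaction_curve(toy_reads, step=10):
    print(n_reads, n_genes)
```

A curve that stops rising as more reads are added suggests that additional sequencing would mainly recover rare transcripts.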
qRT-PCR expression analysis validates transcriptome sequencing expression (TSE)
The robustness of expression data generated by the transcriptome sequencing was analyzed independently using a qRT-PCR assay (Figure 3). A normalized expression profile for T. hassleriana reads mapped to A. thaliana CDS sequences was created by calculating the ratio of reads mapped to an individual gene against the reads mapped to A. thaliana ACT7. A subset of 14 genes was randomly chosen to represent genes with high (normalized expression ratio 1.0 – 10.0, Figure 4a), moderate (normalized expression ratio 0.3 – 1.0, Figure 4b) and low (normalized expression ratio 0.05 – 0.3, Figure 4c) expression levels. The expression of the putative T. hassleriana orthologs of the A. thaliana genes RBCS1A, MVP1, GAPC1, TT4, BGLUC19, GAMMAVPE, ATP3, SCE1A, SFGH, ARF6, PGLUHYD, GI, OMR1, and SPL7 was analyzed in T. hassleriana floral tissue (A. thaliana gene identifier, full gene names are shown in Additional file 1: Table S3). The qRT-PCR expression data were also normalized to the expression of the T. hassleriana ACT7.
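The normalization described above can be sketched as follows: the read count of each gene is divided by the read count of the reference gene ACT7, and the resulting ratio is binned into the three expression classes. The function names, the toy read counts and the treatment of values that fall exactly on a class boundary are assumptions made for illustration only.

```python
def normalize_to_reference(read_counts, reference_gene="ACT7"):
    """Divide every gene's read count by the reference gene's read count."""
    reference = read_counts[reference_gene]
    if reference == 0:
        raise ValueError("reference gene has no mapped reads")
    return {gene: count / reference for gene, count in read_counts.items()}


def expression_class(ratio):
    """Bin a normalized ratio into the classes given in the text.
    How exact boundary values are assigned is an assumption."""
    if 1.0 <= ratio <= 10.0:
        return "high"
    if 0.3 <= ratio < 1.0:
        return "moderate"
    if 0.05 <= ratio < 0.3:
        return "low"
    return "outside the sampled ranges"


# Toy read counts with placeholder gene names (not data from this study).
toy_counts = {"ACT7": 200, "GENE_A": 500, "GENE_B": 100, "GENE_C": 20}
ratios = normalize_to_reference(toy_counts)
for gene, ratio in ratios.items():
    print(gene, ratio, expression_class(ratio))
```

Normalizing both the read counts and the qRT-PCR values to the same reference gene places the two measurements on a comparable relative scale.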
Generally, transcript abundance detected by qRT-PCR in T. hassleriana matched the reads mapped to the A. thaliana orthologs (TSE1) better than the reads mapped to the T. hassleriana contigs (TSE2). A correlation plot comparing expression measured by qRT-PCR and TSE was generated (Additional file 2: Figure S3). When the expression values of all 14 genes obtained by the two methods were plotted, a positive linear correlation was observed (Additional file 2: Figure S3a), as indicated by an R² value of 0.55. The MVP1 and BGLUC19 gene homologs, which belong to large gene families with 41 and 66 homologs in A. thaliana, respectively, were the most pronounced outliers in this plot. When the expression data for the MVP1 and BGLUC19 gene homologs were removed and the data plotted again, a very strong positive linear correlation between TSE1 and qRT-PCR expression values was obtained, with an R² value of 0.91 (Additional file 2: Figure S3b). This indicated that the TSE1 approach for measuring gene expression was very robust except for genes belonging to large gene families with highly similar homologs, in which case the read mapping may be incorrect. Nonetheless, a positive linear expression correlation for all genes corroborates the TSE1 expression data. In particular, similar normalized fold expression between qRT-PCR data and reads mapped to the A. thaliana orthologs was observed for the genes RBCS1A (high expression), ATP3, SCE1A, SFGH (moderate expression), and ARF6, PGLUHYD, OMR1, and SPL7 (low expression), P > 0.01 (Additional file 1: Table S4 shows the comparative P values for the ANOVA tests). For the T. hassleriana homologs of GAMMAVPE, MVP1 and TT4, the transcript abundance detected by qRT-PCR was more similar to that obtained when reads were mapped to the T. hassleriana contigs (TSE2), P > 0.01. For the GAPC1 and BGLUC19 homologs, the difference between qRT-PCR expression and both TSE1 and TSE2 was statistically significant, P < 0.01.
It was further observed that the number of reads mapped to the T. hassleriana contigs grossly overestimated gene expression in all cases except TT4.
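The R² values reported for the comparison of qRT-PCR and TSE can be illustrated with a small sketch. For a simple linear fit of one variable on another, R² equals the squared Pearson correlation coefficient, which is what the function below computes. The toy values are invented to show how a single outlier can depress R²; they are not the measurements of this study.

```python
def r_squared(x, y):
    """Coefficient of determination of a simple linear fit of y on x,
    computed as the squared Pearson correlation coefficient."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return sxy ** 2 / (sxx * syy)


# Toy expression values; the last pair acts as an outlier.
qpcr = [1.0, 2.0, 3.0, 4.0, 5.0]
tse = [1.1, 2.0, 2.9, 4.2, 0.5]
print(r_squared(qpcr, tse))            # all genes
print(r_squared(qpcr[:-1], tse[:-1]))  # outlier removed
```

Removing the outlier raises R² in this toy example, which mirrors the effect described above for genes from large families, where reads may be assigned to the wrong homolog.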
Expression of genes controlling floral traits in the flower and leaf transcriptome
Genes controlling various floral traits and flower development in A. thaliana, Antirrhinum majus, Fagopyrum esculentum etc. were identified based on literature [26–31]. The expression pattern of their putative T. hassleriana orthologs identified by a bidirectional BLATX search with the A. thaliana CDS sequences was analyzed in the flower and leaf transcriptomes to learn more about the regulation of the special floral traits of T. hassleriana (Figure 4). The selected genes were first grouped into different classes such as homeotic transcription factors, regulators of homeotic genes etc. and ordered within their groups according to transcript abundance. Of the genes analyzed, 49 (41.9%) were specific to the flower transcriptome and not found in the leaf transcriptome. (A. thaliana gene identifier, full gene names are shown in Additional file 1: Table S3).
Amongst the putative class ABCDE homeotic transcription factor orthologs, the highest expression was observed among the class B gene homologs AP3 and PI and the class E gene homologs SEP1 and SEP3. The putative ortholog of the C class gene AG was expressed at a 10 fold lower level compared to the class B and E genes. The expression of the putative orthologs of the D class genes SHP1, SHP2 and STK, which regulate ovule and fruit development in A. thaliana, was found to be considerably lower than that of the class ABCE genes. AP3, SEP3, SEP1, and STK transcripts were not present in the leaf transcriptome, while PI, AP1, AP2, and SEP4 were expressed at a very low level in leaves. In addition to these, 25 other putative MADS box transcription factors without floral homeotic function that are members of the MIKC, Mα, Mβ, Mγ, Mδ subfamilies were also found to be expressed in the floral transcriptome.
Amongst the genes putatively regulating the class ABCDE homeotic transcription factors, the LUG, LUH and SEU orthologs showed highest expression in the flower transcriptome, while a 10 fold lower expression of these genes was observed in the leaf transcriptome. The expression of the LFY homolog was also observed in the floral transcriptome albeit at very low levels. Interestingly, putative homologs of genes regulating class B gene activities like UFO and SUP and class A gene activity like SAP were not identified in the floral transcriptome library. The orthologs for genes regulating patterning and symmetry also showed expression in the floral transcriptome. The putative orthologs of TCP4, TCP2, and PTF1 showed the highest expression. The expression of these genes was also observed in the leaf transcriptome; in case of the TCP4 ortholog a 100 fold higher expression was observed in the floral transcriptome when compared to the leaf transcriptome, while the TCP2 ortholog expression was almost equal in both the transcriptomes. Comparatively low expression of other putative patterning gene orthologs like TCP14, TCP15, TCP18, CUC2 and CUC3 was also observed specific to the floral transcriptome.
While A. thaliana flowers are mostly free of pigments, the petals and reproductive organs of T. hassleriana are pink and dark magenta; hence, the expression of putative orthologs of genes involved in anthocyanin production, regulation, and deposition was analyzed. Very high expression was observed for the putative orthologs of TT4, F3H, CHIL, DFR and LDOX. Most genes showed a higher expression in flowers than in leaves, and for several, such as DFR, LDOX, and TT8, expression was specific to the flower, suggesting key roles in flower pigmentation. Very low expression of TT4 and F3H orthologs (about 300 and 200 fold lower, respectively) was also observed in the leaf transcriptome, whereas the CHIL ortholog expression was only about 5 fold lower in the leaf transcriptome. The spatiotemporal expression pattern of the A. thaliana orthologs of these genes was investigated using the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi). The expression patterns for the homologs of TT4, F3H, CHIL, and FLS1 were very similar in T. hassleriana and A. thaliana. The enzymes encoded by these genes are required for the synthesis of flavonoids like quercetin, dihydroquercetin, myricetin etc., which are intermediates of anthocyanin biosynthesis. The products of the genes DFR, LDOX, and UGTD2, which were found to be expressed in the T. hassleriana floral transcriptome but only in senescing leaves in A. thaliana (Table 1), are involved in downstream processes that convert the flavonoids into anthocyanins like pelargonidin and cyanidin, which determine the characteristic pink-magenta flower color. Genes like PAP2, MYB111, MYB113, and EGL3 are regulators of flavonoid and anthocyanin biosynthesis and were also expressed in T. hassleriana floral tissue, whereas in A. thaliana their expression was restricted to senescing leaves and seeds during early stages of embryo development.
Expression of gene orthologs governing time to flower was also analyzed. Expression of both antagonistic groups of genes that prolong time to flower or enhance the transition into flowering was observed. Among the orthologs inducing flowering AGL20, GI, EBS, and FLK had the highest expression; expression of these genes was also observed in the leaf transcriptome at very comparable levels. Amongst the orthologs of genes delaying flowering SUF4, HUA1, COL9, and FLC had high levels of expression which was also observed at comparable levels in the leaf transcriptome. The orthologs of FRI, HUA2, SMZ, and ATC showed moderate to low floral transcriptome specific expression.
T. hassleriana homologs of meristem activity regulators such as GAI, ANT and CLV1, which are involved in decreasing meristem proliferation, were expressed at high levels in the flower and at varying levels in the leaf transcriptome, except for ANT, whose expression was not detected in the leaf transcriptome. Putative homologs of the genes BAM1, BAM2, BAM3 and WUS, which enhance meristem proliferation, were also found to have moderate expression levels in the floral transcriptome. Interestingly, putative homologs of FTA, ERA1, and STM were found to be expressed in the floral transcriptome, whereas their A. thaliana counterparts show very low expression in the flower.
Another important category of gene orthologs analyzed for expression are the genes that co-regulate floral organ development alongside the ABCDE floral homeotic transcription factors. High expression was observed in case of orthologs of YAB2, AFO and PEP in both the floral and leaf transcriptomes whereas the expression of the CAL ortholog was about 100 fold higher in the flower transcriptome. Other floral organ developmental regulators, such as ENP, CRC, INO, NUB, JAG, and SPL were not identified in the T. hassleriana leaf transcriptome, but only in floral transcriptome whereas they are also expressed in A. thaliana leaves at very low levels. No expression was observed for ROXY gene homologs which are responsible for anther and male gametophyte development downstream of SPL.
Other putative homologs of A. thaliana floral regulators were also identified. Amongst them were the highly expressed homologs of the genes ASK1, CEV1, SKP1, and RBX1, which are part of SCF ubiquitin protein ligase complexes that regulate multiple aspects of flower development together with UFO in A. thaliana. Homologs of the genes ER, ERL1 and ERL2, which encode protein kinases that influence meristem cell fate and patterning in the inflorescence meristem, were also highly expressed. Interestingly, the homolog of BPEP was found to be expressed only in the floral transcriptome, while the two distinct BPEP transcripts in A. thaliana are expressed in floral and vegetative organs, respectively. Homologs of the genes PIN1 and PID, which are known to affect size, floral organ number and total number of flowers in A. thaliana, were also expressed.
This in silico expression analysis of genes related to flower development demonstrates that the chosen RNAseq method allows gene expression to be monitored on a logarithmic scale covering more than two orders of magnitude. In addition, the two library preparations for this sequencing experiment only rarely differ in RPKM. Detailed expression analysis of putative T. hassleriana homologs of A. thaliana genes in the T. hassleriana floral transcriptome is provided in Additional file 3 along with the AGI identifiers.
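Since the expression values are given in RPKM (reads per kilobase of transcript per million mapped reads), a minimal sketch of the standard RPKM formula is shown here, together with a helper that expresses the span between the lowest and highest value in orders of magnitude. The function names and all numbers are illustrative assumptions, not values from this data set.

```python
import math


def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


def orders_of_magnitude(values):
    """log10 of the ratio between the largest and the smallest
    positive value, i.e. the dynamic range on a logarithmic scale."""
    positive = [v for v in values if v > 0]
    return math.log10(max(positive) / min(positive))


# Toy example: 500 reads on a 2,000 bp transcript, 1,000,000 mapped reads.
print(rpkm(500, 2000, 1_000_000))
print(orders_of_magnitude([0.5, 50.0, 500.0]))
```

Dividing by transcript length and library size makes values comparable between genes of different length and between libraries of different sequencing depth.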
Characterization of genes putatively governing sterility in T. hassleriana
The particular T. hassleriana hybrid used in this study was sterile. While orthologs of A. thaliana regulators of anther development were expressed in the T. hassleriana flower, no expression of ROXY1 and ROXY2 was detected. These two genes redundantly control anther lobe and pollen mother cell differentiation downstream of SPL. The genome of one of the parents of this hybrid, T. hassleriana Purple Queen (ES1100), was recently published, and this plant, unlike its hybrid offspring, is fertile. Only a ROXY1 ortholog was found in the T. hassleriana genome. To learn more about the possible causes of the sterility, we compared the expression pattern of homologs of SPL, ROXY1 and their A. thaliana downstream targets DYT1 and MYB35, which affect stamen development and microsporogenesis, in these two plants by qRT-PCR in small, medium and large buds (Figure 5).
Expression analysis by qRT-PCR indeed revealed that the expression of the ROXY1 homolog was very low (10³ fold lower compared to ACT7) and well beyond the scope of detection by RNA seq. ROXY1 expression was down regulated in the sterile hybrid only at bud stage M when compared to the fertile parent (Figure 5b), whereas it was similar to the parent at the younger and later developmental stages. Along with the down regulation of ROXY1, expression of the DYT1 and MYB35 homologs, which most likely act downstream of ROXY1, was also down regulated in stage M buds. In stages other than M, i.e. both the early and late developmental stages, the expression of the DYT1 and MYB35 homologs in the sterile T. hassleriana hybrid was several fold higher than the respective expression in fertile parent buds. Expression of the SPL homolog in the sterile hybrid buds was 3–4 fold higher than in the fertile plant buds in stages S and M, whereas in the later stage L it was 8 fold higher. Thus, our expression data suggest that the complex network governing stamen development and microsporogenesis is disrupted in the T. hassleriana hybrid, which could provide a causal link to its sterility.
Characterization of T. hassleriana floral transcriptome specific genes in comparison to A. thaliana
We described above that the flower of T. hassleriana is morphologically distinct from the A. thaliana flower, and our aim was to identify genes that may contribute to the differences by comparing the A. thaliana floral transcriptome with that of T. hassleriana. However, as our data are based on RPKM and the A. thaliana data are microarray data, the two datasets may be compared only qualitatively but not quantitatively. We thus chose the more cautious approach of scoring only the presence/absence of transcripts of A. thaliana/T. hassleriana putatively orthologous gene pairs. Of the 21,107 genes in A. thaliana for which microarray expression data for the floral transcriptome could be compiled, ~1200 genes were not expressed in the A. thaliana floral transcriptome. The expression analysis of the homologs of these genes in T. hassleriana revealed that a majority (~750) were also not expressed in the T. hassleriana floral transcriptome. However, 351 gene homologs were identified that were expressed differentially between the floral transcriptomes of the two species. To identify the molecular processes that are distinct between T. hassleriana and A. thaliana, these differentially expressed Tarenaya transcripts were assigned GO annotations using Blast2GO® by performing a BLASTX search with a cut off value of e-100. Of these, 81 genes were annotated as genes with unknown function. The remaining 270 genes were assigned multiple GO annotations based on the biological processes associated with the function of these genes (Additional file 1: Table S5). Of special interest were genes annotated to be involved in anthocyanin accumulation, cell growth, flower development and other developmental processes. Candidate genes were selected for further analysis (Table 2). High expression of the PGP10 homolog, a gene involved in anthocyanin accumulation in response to UV light, was observed in the T. hassleriana floral transcriptome, whereas its expression is limited to pollen in A. thaliana.
The homolog of TTFP, which codes for a tyrosine transaminase family protein, was also expressed at high levels in the T. hassleriana floral transcriptome; this gene is involved in the regulation of cell growth in response to external stimulus and is primarily expressed in the roots of A. thaliana. Other notable gene homologs involved in various aspects of cell growth were LRX2, HAT4 and PIP5K3. Of the gene homologs involved in various aspects of floral development, ICMTA and TEM2 were prominent. ICMTA encodes an enzyme belonging to the methyltransferase family, which is induced during floral morphogenesis. TEM2 is a transcription factor known for its role in flowering time regulation by controlling FT expression. Amongst the genes annotated as governing various aspects of development were JAL33, MTSP1, EMB2217 and GLUDOXRP, which are involved in embryo and root development.
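The presence/absence scoring of putatively orthologous gene pairs described in this section can be sketched as a simple set comparison. The function name, the category labels and the placeholder identifiers are hypothetical; the sketch only illustrates the qualitative logic and does not reproduce the thresholds used to call a gene expressed.

```python
def score_presence_absence(ortholog_pairs, expressed_first, expressed_second):
    """For each (gene_first, gene_second) ortholog pair, record whether
    a transcript was detected in both species, in only one, or in neither."""
    calls = {}
    for gene_first, gene_second in ortholog_pairs:
        in_first = gene_first in expressed_first
        in_second = gene_second in expressed_second
        if in_first and in_second:
            call = "both"
        elif in_first:
            call = "only_first"
        elif in_second:
            call = "only_second"
        else:
            call = "neither"
        calls[(gene_first, gene_second)] = call
    return calls


# Placeholder identifiers for illustration only.
pairs = [("At_gene1", "Th_contig1"), ("At_gene2", "Th_contig2"),
         ("At_gene3", "Th_contig3")]
expressed_at = {"At_gene1"}
expressed_th = {"Th_contig1", "Th_contig2"}
for pair, call in score_presence_absence(pairs, expressed_at, expressed_th).items():
    print(pair, call)
```

Pairs scored as detected in only one species correspond to the candidates for differential expression between the two floral transcriptomes.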
Identification and characterization of Cleome lineage specific genes
To identify genes shared between Cleome and other closely related rosids and genes that are specific to the Cleome lineage, a BLASTX search with a cut off value of e-10 was performed with the 49,237 Tarenaya floral transcriptome contigs against the A. thaliana, Brassica rapa, C. papaya (all malvids, order Brassicales) and Populus trichocarpa (fabid, order Malpighiales) protein databases in a systematic manner (Figure 6a). This allows the assessment of gene births and gene losses in the rosid lineage. Figure 6b shows the result of the comparative analysis. A large number of the contigs, 37,989 (subset I), represent sequences shared between malvids and fabids. According to our analysis, only 684 genes are shared between all Brassicales, but 1375 genes (subset B) are shared between the core Brassicales, which include T. hassleriana, A. thaliana, and B. rapa. This suggests a high rate of gene births in the lineage leading to core Brassicales after their split from the lineage leading to C. papaya. Conversely, 148 genes (subset K) are shared between T. hassleriana, C. papaya and P. trichocarpa and not found in the Brassicaceae, suggesting that these genes were lost in the lineage leading to A. thaliana and B. rapa after its separation from the lineage leading to T. hassleriana. Another 132 genes (subset G) are found only in C. papaya and T. hassleriana, indicating that these are Brassicales-specific genes that were lost in the Brassicaceae. A further 453 genes are shared between T. hassleriana and A. thaliana but not found in B. rapa, suggesting that they were lost in the lineage leading to B. rapa. Conversely, only 246 genes were lost in the lineage leading to A. thaliana and are shared between B. rapa and T. hassleriana (subset C).
An astonishing number of 5600 contigs (subset Z) could not be matched with high confidence to any sequence from P. trichocarpa, C. papaya, A. thaliana or B. rapa. Of these contigs, only 82 could be assigned to 353 GO terms; the vast majority could not be annotated owing to the lack of significant BLAST hits. A sequence length histogram for these contigs (Additional file 2: Figure S2) shows a bias towards shorter sequences when compared to the sequence length histogram of all contigs (Additional file 2: Figure S1b), suggesting that these were too short for proper annotation and/or may represent 5’ and 3’ UTR regions of transcripts. Another reason for the small number of annotated genes is that most of the current annotations are based on A. thaliana, B. rapa and P. trichocarpa, and we already subtracted the sequences orthologous to them. The GO annotations for the T. hassleriana specific genes are the following: cellular process (26.98%), metabolic process (29.36%), response to stimulus (4.76%), biological regulation (4.7%), development (1.5%), cell proliferation (1.5%), reproduction (6.34%) and signaling processes (4.76%) (Additional file 1: Table S6).
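The assignment of contigs to subsets according to the species in which they have a significant hit can be sketched as follows. The sketch assumes that the BLASTX results have already been reduced to, for each contig, the set of reference species with a hit below the cut off; the function name and the toy hits are hypothetical, and the sequential search scheme of Figure 6a is not reproduced.

```python
from collections import Counter


def count_by_species_pattern(hits_by_contig, reference_species):
    """Count contigs by the exact combination of reference species
    in which they have a significant hit (an empty set means no hit)."""
    reference = frozenset(reference_species)
    counts = Counter()
    for hit_species in hits_by_contig.values():
        counts[frozenset(hit_species) & reference] += 1
    return counts


species = ["A. thaliana", "B. rapa", "C. papaya", "P. trichocarpa"]
# Toy hit table for illustration only.
toy_hits = {
    "contig1": {"A. thaliana", "B. rapa", "C. papaya", "P. trichocarpa"},
    "contig2": {"A. thaliana", "B. rapa"},
    "contig3": set(),
    "contig4": {"C. papaya"},
    "contig5": {"A. thaliana", "B. rapa"},
}
for pattern, n in count_by_species_pattern(toy_hits, species).items():
    print(sorted(pattern), n)
```

In this representation, contigs with an empty pattern would be the candidates for lineage specific sequences, while patterns that lack one species point to possible gene loss in the lineage leading to that species.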