In silico identification and characterization of C. sinensisGST transcripts
The number of GSTs and the multiplicity of their functions in plant genomes are surprisingly large [13]. The question of whether they are all expressed can be addressed by querying EST databases [19] and/or by experimental analyses [8, 20]. In the present work we used both computational and experimental analyses to identify, classify and validate tissue expression profiling of C. sinensis GST encoding sequences. The whole procedure is traced in the Additional file 1.
Different members of the GST gene family were identified by screening a collection of 94,127 EST sequences from C. sinensis. 370 ESTs, putatively encoding GST proteins, were filtered out and used to feed a clustering/assembling procedure to attempt the definition of full-length transcripts. Sixty-two putative unique transcripts, including 28 ESTs assembled into tentative consensus sequences (TCs), and 34 single ESTs (sESTs), were defined. The Tau and the Phi classes show the highest number of unique transcripts: 29 and 20, respectively, confirming evidence from other plants [19, 21]. The remaining classes, Lambda (6 sequences), Zeta (3 sequences), Theta (2 sequences) and Mapeg (2 sequences), have fewer components.
The collection of the 62 transcripts was compared versus the GenBank non-redundant nucleotide database in order to identify sequence data, to be exploited to further elongate the 62 transcripts. This analysis also highlighted that TCs CITSI00:1 and CITSI02:1 were contiguous sequences, that can be collapsed into a unique transcript corresponding to the sequence DQ198153 from GenBank. As a consequence, the total number of unique transcripts was reduced to 61 sequences. It is interesting to note that the transcript DQ198153 is exclusively represented by ESTs from fruit flesh. To be precise, these ESTs come from the SSH library described in [22] and appear to be Moro specific.
All the transcripts in each GST class were multiple aligned and analysed so to define the mRNA structure in terms of 5′ UTR, coding region and 3’ UTR segments. This task was also supported by the detection of the longest open reading frame (ORF).
The six different multiple alignments, generated by grouping the 61 transcripts according to the GST class membership, can be scrolled at http://biosrv.cab.unina.it/citrusGST/, while a schematic view summarizing the structures and the similarity relationships among the segments within each class is reported in Figure 1. Each mRNA sequence is represented by a bar that is split into segments corresponding to protein coding exons, 5′ and 3′ UTRs. In addition, each transcript is associated to a tag describing the full- or the partial-length nature of the sequence as inferred from the ORF analysis. The transcripts defined as full-length mRNAs (FL, Figure 1) are the ones showing a complete ORF. The remaining transcripts exhibited partial ORFs. Those including the ATG starting triplet but lacking the stop codon were classified as 5′ fragments (5 F). On the other hand, those transcripts lacking the initiating ATG, but presenting a termination triplet, were classified as 3′ fragments (3 F). The remaining segments were classified as fragment (F). Finally, transcripts showing interspersed stop codons were classified as no good ORFs (NGO).
Pairwise DNA distances were calculated and highly similar sequence segments were identified using a cut-off of 15% divergence. In Figure 1, highly similar sequence segments within each GST class are reported with the same colour. However, the sequence similarity obtained by this analysis was not sufficient to collapse these sequences into a unique transcript during the EST assembly step. Sequence variability is further highlighted by the nucleotide sequence alignment available at http://biosrv.cab.unina.it/citrusGST/.
Multiple alignments revealed that one class Tau GST (CITSI24:CX077363) and three class Phi GSTs (CITSI23:1; CITSI35:CK936125; CITSI26:1) could be intron-retaining (IR) sequences. This is also confirmed by the ORF analysis and by BLAST searches that showed the absence of similarity between the putative intron regions and peptides available in protein databases. Putative introns (dashed lines) as well as splice acceptor/donor sites, were indicated in Figure 1. The intron retained by CITSI24:CX077363 is limited by the canonical GT-AG sites. Within the Phi class, CITSI23:1 is an intron retaining isoform compared to CITSI38:CK934228; it was generated by assembling two ESTs (CX671312 and CX671311) and retains both the intron 1 and 2 being the first intron supported by CX671312 and the second by CX671312 and CX671311. The sequence CITSI34:CK939385 is highly similar to both CITSI38:CK934228 and CITSI23:1 sequences (first and second intron excluded), even if several polymorphisms in the 5′ UTR, CDS and 3′ UTR are evident. The singleton CITSI35:CK936125 is considered the intron retaining variant of CITSI29:1, though it is characterized by a non-canonical acceptor site (TC) in the second intron. CITSI26:1, built by assembling 6 ESTs, is the intron retaining isoform of CITSI18:1.
Genome based assessment of the citrus GST collection
The list of the 61 GST-encoding transcripts and a larger subset of ESTs putatively encoding GSTs (1,611 from a total of 213,830 EST sequences), derived from an independent update of the EST collection (see Methods), were aligned along the C. sinensis genome.
This allowed to compare sweet orange gene models (see Methods) and two custom genome annotation tracks, representing the 61 GST encoding transcripts we assembled, and 1,611 ESTs putatively encoding GSTs on the basis of their alignment to genomic sequences. All these annotation tracks can be accessed through the web based resource we made available.
The current C. sinensis genome annotation consists of 62 GST gene loci with a total of 84 possible transcripts. Twenty-eight of the GST genes have correspondence to one or more transcripts from our collection; 13 are gene loci whose exon/intron structure is validated by EST evidence only; 21 are GST gene loci with no EST evidence, 7 of which resulted to have a gene structure defined only by one exon.
Furthermore, some of the transcripts we described are associated to the same gene loci and it is interesting to note that this is valid also for full-length transcripts. This means that, although the transcripts present differences at sequence level, many of them can be associated to a single locus revealing the high level of heterozygosity of the C. sinensis genome and/or the high frequency of alternative transcripts in the orange genome. A short class-by-class description of the results follows (see also Additional file 2).
Tau class
Twenty-one out of the 29 transcripts from the Tau class were aligned versus 14 gene loci from C. sinensis. Moreover, among the remaining 8 transcripts, 5 were mapped on the C. clementina genome.
Specifically, the pairs of sequences CITSI16:1 and CITSI55:CB293075, CITSI25:1 and CITSI40:CK701666, CITSI58:BQ624883 and CITSI39:CK739807 and CITSI47:1 and CITSI31:CN188035 are transcribed from the same locus and their exon/intron structure overlaps the one of the corresponding gene. For this reason and considering results from the DNA distance matrix, it is very likely that these transcripts represent allelic isoforms of the corresponding locus.
CITSI53:1, CITSI36:1 and CITSI01:EG358290 reflect the exon/intron structure of the corresponding predicted genes.
A cluster of three Tau class GSTs (CITSI41:1; CITSI59:1; CITSI61:BQ623696) was successfully aligned in the same genomic region. In the region, two different gene structures were also predicted.
Sequences CITSI51:1 and CITSI04:1 correspond to the genes orange1.1g036834m and orange1.1g027333m, respectively.
The singletons CITSI60:BQ624512 and CITSI13:CX300684 were aligned in the same genomic region where a partial gene (only one exon) is predicted. Interestingly, both of the sequences are longer than the predicted gene.
CITSI61:BQ623696 matches the C. sinensis genome in three different scaffolds, finding similarity with GST annotated genes. CITSI13:CX300684 overlaps the second exon of the corresponding predicted gene, while CITSI15:DN618611 corresponds to the first exon of the predicted gene in the same locus.
CITSI47:2 and CITSI17:CX675467 were aligned in the same genomic region and have correspondence to the second exon of the predicted gene. Finally, no alignments were obtained versus Citrus sinensis for the remaining eight sequences from our collection.
To this end, we decided to exploit the C. clementina genome in order to check the existence of genes that could be related to these transcripts. Five transcripts out of 8 were aligned along this genome overlapping glutathione S-transferase annotated loci, revealing the still incomplete nature of the C. sinensis genome.
Phi class
All transcripts we defined from the Phi class were aligned to the C. sinensis genome except CITSI11:1, which was successfully mapped on the C. clementina genome.
CITSI52:1 and the very short singleton CITSI54:CB293267 are transcribed from the same locus. A cluster of eight Phi class GSTs was successfully aligned in the same genomic region, where two different gene models are available.
CITSI02:1 presents the same exon/intron structure of the gene orange1.1g041226m. Sequences CITSI29:1 and CITSI35:CK936125 are associated to the same locus. Two gene structures were predicted in the same region: the first one includes three exons and reflects the structure of CITSI29:1; the second retains intron II as CITSI35:CK936125 does. The spliced alignment of CITSI18:1 determines three exons, exactly as in the case of the predicted gene. However, two additional sequences CITSI26:1 and CITSI08:CV716584 were aligned in the same region. Both of them describe a unique exon.
CITSI22:CX672147 corresponds to the third exon of the predicted gene in the same locus.
Lambda class
CITSI42:CK665050 was successfully aligned in a region that includes two predicted gene structures. CITSI37:CK935114 and CITSI63:BQ623555 gene structures, even though partial, reflect the ones of the corresponding predicted genes.
The spliced alignment of CITSI30:CV884511 along the reference genome determines four exons, one more than the predicted gene. Finally, in case of sequence CITSI46:1 and CITSI32:CN187483 no spliced alignment was obtained.
Theta class
No spliced alignment was obtained for both singletons from the Theta class GSTs.
Zeta class
All the three Zeta class GSTs are transcribed from the same locus. In this region ten different gene structures were predicted.
Mapeg class
Both the Mapeg sequences are associated to the same locus and their exon/intron structure duplicates the one of the corresponding predicted gene. It is very likely that they represent allelic variants since they could not be assembled into one single consensus sequence, during the EST processing.
Expression profiling of full-length transcripts in different tissues/organs in Cadenera and Moro cultivars
In order to address the expression profiling in citrus, we performed SemiQ RT-PCR analyses, considering six different tissues/organs (albedo, flavedo, flesh, young leaf, adult leaf and ovary). Similar experimental approaches are reported for model plant species (Arabidopsis and rice) and further important crops, such as maize and soybean [8], in which GST expression patterns were analysed in a variety of tissues, developmental stages and environmental conditions [21, 23]. Moreover GST family-based expression analyses were performed in Arabidopsis [24], poplar [25] and rice [23], while transcript profiling of all the family members are not available in other plants [26].
Here we considered expression in two different cultivars of sweet orange: Cadenera and Moro, common and pigmented ones, respectively. The analysis aimed to point out differences with respect to the different tissues/organs as well as to the two cultivars, especially because GSTs may have implication in differential anthocyanin accumulation. Recently, Butelli et al. [27] demonstrated that blood orange phenotype results from the insertion of an LTR retrotransposon in the promoter region of the Ruby gene (which encodes for a MYB transcription factor) and that this gene is specifically activated in fruit in a cold-dependent manner. The implication of cold temperatures in activating the anthocyanin pathway was recently discussed also in [28]. However, it is very likely that Ruby itself is necessary, but not sufficient, in the determination of the anthocyanin pigmentation of the flesh fruit and that GST proteins may play a crucial role in the definition of the phenotype. Indeed, Sun et al. [29] proved that the Arabidopsis GST encoding Transparent Testa 19 gene (TT19) acts as a carrier to transport anthocyanins to the tonoplast. In addition, TT19 mutations do not modify the expression level of genes implicated in anthocyanin biosynthesis but at the same time limit anthocyanin accumulation since they cause the transformation of anthocyanin precursors into flavonols.
Among those sequences that resulted to be full-length mRNAs, the ones whose identifier in the Figure 1 is highlighted in red were considered for analyses based on SemiQ RT-PCR experiments. Because of the lack of full-length sequences, no SemiQ RT-PCR experiments were carried out for sequences included in the Theta class.
In Figure 1, primers are drawn as zebra bars below the line representing the sequence to be amplified. Forward primers within the Phi class were selected over the second and the third coding exons, to detect the expression of the mature transcripts. Forward primers within the Tau class were selected over the 5′ UTR and the ATG initial codon, because of higher sequence variability in the 5’ UTR regions that could ensure a higher specificity in primer selection. In Table 1 primer characteristics are indicated.
Table 1
List of sequences and primers designed to perform SemiQ RT-PCR experiments
Expression patterns for each of the 25 citrus GST transcripts are summarized in Table 2. Electrophoresis gel images from the experimental analysis are reported at http://biosrv.cab.unina.it/citrusGST/. Positive symbols showed the expression at different cycles ['+' (weak) to '++++' (strong)] depending on the amplicon presence at the PCR cycles 20, 25, 30 and 35. Negative symbol indicates absent expression.
Table 2
Summary of tissue/organ expression profiling by SemiQ RT-PCR
The results presented in Table 2 show two different aspects: transcripts presence/abundance (i) in various tissues/organs and (ii) in both cultivars. In the case of Tau class GSTs, the expression is detected in all the tissues/organs, with the only exception of CITSI04:1, which results expressed only in adult leaves of both cultivars; CITSI51:1, which is not expressed in young leaves of Cadenera and Moro, and in ovary of common orange; CITSI16:1, which is not present in flavedo and ovary of both cultivars. Considering the cultivar related expression, differences of more than ten cycles are indicated by asterisks in Table 2. Moreover, some of the transcripts were not detected at all in the ovary samples from Moro, such as CITSI47:3, CITSI57:1, CITSI51:1 and CITSI44:1 (indicated by ‘-’ in correspondence of the cultivar were the expression was not detected in Table 2). Multiple-bands indicated by uppercase a in Table 2) generally occur only in specific organs in both cultivars as in CITSI24:CX077363, CITSI47:3, CITSI57:1 and CITSI41:1. Expression levels marked by uppercase b (CITSI47:3, CITSI51:1 and CITSI44:1) indicate RT-PCR products in the column 'ovary' with size greater than the expected on the basis of the primer design and the transcripts. Interestingly, in some samples, this behaviour is cultivar specific, as in the ovary of CITSI43:1, CITSI53:1, CITSI41:1 and CITSI36:1 in Cadenera and CITSI47:3 in the albedo in Moro. The presence of multiple bands is not suspicious since, as discussed in the previous section, alternative transcription for citrus GST genes seems to be very common.
Phi class GSTs show more specific expression patterns. Indeed, CITSI33:1 and CITSI05:DY257328 are expressed in all the tissues/organs. CITSI38:CK934228 and CITSI23:1 were detected with similar expression patterns in flavedo and leaves; both transcripts are absent in the albedo of both cultivars and in the ovary of Moro. The only difference in the expression pattern of the two transcripts is the absence of CITSI38:CK934228 in the flesh of Moro. CITSI34:CK939385 follows a similar behaviour with a higher expression in the adult leaf of Cadenera and absence of expression in albedo and in the flesh for both cultivars. Heterogeneous patterns were detected for the remaining transcripts considered in the different tissues/organs, indicating a higher variability in the expression pattern within the Phi class. Table 2 shows remarkable differences (i.e. transcripts with expression differing of more than 10 cycles) between the two cultivars. Higher expression levels were detected in Cadenera for CITSI34:CK939385 in adult leaves, and in the flesh column for the sequence DQ198153 (Table 2). Interestingly, the early over-expression of DQ198153 transcript in the Moro is corroborated by the expression patterns inferred when considering the EST-based tissue/organ distribution, as reported in the section 'In silico identification and characterization of C. sinensis GST transcripts'. Moreover, the low expression detected in flavedo of the Moro cultivar compared to a completely absent signal in Cadenera may be associated to the presence of some pigmentation also in this part of the fruit, in Moro, even if at a lowest extent. This finding suggests that DQ198153 may be responsible for anthocyanin accumulation in the Moro fruit flesh, similarly to the Mapeg GST CITSI20:1, which also resulted over-expressed in the flesh of the Moro fruit. As pointed out by Conn [30], this microsomial GST can be involved in anthocyanin transport to the vacuole or be associated to their biosynthesis at the cytosolic face of the endoplasmic reticulum (reviewed in [31]).
As stated, no expression in the flesh of Moro was evident for CITSI38:CK934228 (Table 2). Moreover, CITSI18:1 is only detected in the adult leaf and in the ovary of Cadenera thus indicating distinct roles of GSTs members in the different tissues of different cultivars. Multiple bands are not common in the Phi class, probably thanks to more reliable gene structures inferred by the computational analysis and, as a consequence, to a more specific primer design (Figure 1). However, multiple-bands are evident only for CITSI05:DY257328 in the adult leaves of Cadenera.
No relevant difference concerning the expression in all the tissues/organs and between Cadenera and Moro is evident in Lambda, Zeta and Mapeg classes.
Double-bands are also evident in the Zeta class (Table 2). This could be a direct result of alternative transcription of the corresponding gene locus, as it is clear from what is described in the section 'Genome based assessment of the citrus GST collection'.
As already shown in the Figure 1, some of the citrus GST sequences share a high similarity at nucleotide level (bars with identical colours). This is generally confirmed at protein level (as said after) as often the same sequences cluster together also sharing minimum distances (Table 3). However, it is also evident that the nucleotide sequences do not fall in the same tentative consensus when assembled (see Methods). Indeed, high similar sequences within a heterozygous species, like C. sinensis, may also account for allelic isoforms.
Table 3
Minimum distance matrix of GST proteins between
C. sinensis
GSTs and other species
Similar sequences here considered often show differences in the expression patterns within a cultivar, confirming their structural difference, and indicating different functional roles. As an example, sequences CITSI38:CK934228, CITSI23:1 and CITSI34:CK939385 show similar expression patterns though the primers are associated to regions showing nucleotide differences among the sequences. In particular, CITSI38:CK934228 and CITSI23:1 show expression differences in the flesh tissue when compared to CITSI34:CK939385 in Cadenera, while CITSI38:CK934228 and CITSI34:CK939385 show differences with CITSI23:1 in Moro. More striking differences are evident within the whole group of sequences, as evident from Table 2.
Considering the Tau class, differences in expression within a cultivar are evident within the group of sequences CITSI57:1, CITSI43:1 and CITSI16:1, which prominently diverge in the 5′ UTR and 3' UTR regions, as well as within the group of sequences CITSI51:1, CITSI44:1, CITSI36:1 and CITSI24:CX077363.
In addition, the interesting aspect is that the sequences mentioned may show different behaviors when considering each cultivar, as already shown for the Phi sequences CITSI38:CK934228, CITSI23:1 and CITSI34:CK939385. This is valid also in the Tau class. As an example, sequence CITSI51:1 show lack of expression in the young leaves in both cultivars when compared to the other similar sequences (CITSI44:1, CITSI36:1, CITSI24:CX077363). However, the expression in the ovary is similar to the one reported for sequence CITSI44:1, where the two cultivars show different behaviors.
Comparative analysis of citrus GSTs and sequences from other plant species
We compared the protein products of the experimentally analysed full-length sequences in our dataset, with GSTs from A. thaliana, Glycine max, Vitis vinifera, Oryza sativa and Zea mays. In Additional file 3, the number of sequences per species grouped by GST class is reported. Sequence identifiers as well as chromosome positions are indicated when known. All the considered species have representatives within each GST class with the exception of Lambda, Mapeg, Theta and DHAR. Furthermore, Lambda and DHAR classes are represented only in dicots. This is probably due to the different evolutionary fates in each clade [26] or to limits in the annotation available for each of the species considered here. A manually edited multiple alignment was drawn (available at http://biosrv.cab.unina.it/citrusGST/). The alignment shows groups of sequences clearly associated to different GST classes. Moreover, it reveals high heterogeneity within the most representative classes, Tau and Phi.
Sequences from dicots generally group separated from monocots, confirming their independent expansion [26]. In addition, the alignment reveals both sequences that are grouped by species as well as clusters of similar sequences from different species. Sequences revealing high similarity are not necessarily those that co-localize along a chromosome. This highlights that sequence similarities within GSTs in a plant genome cannot be only explained by recent tandem duplication events, as previously described [21], but can be due to sequence conservation linked to specific functional requirements. A distance matrix was also defined from the multiple alignment per each class. In Table 3, the identifiers of sequences showing the minimum distance with the citrus GSTs are reported. In case available, references concerning known roles of the sequences are included in the table. Distance relationships in which citrus GSTs have minimum distances with sequences from other plant species are reported too (e.g. see sequences number 1, 7, 8, 10, 11 and 12 within the Tau class; 13, 20, 21 and 22 within the Phi class; sequence 23 within the Lambda class; sequences 24 and 25 within the Zeta and the Mapeg classes).
In general, GSTs have been studied as stress related genes [8, 32–34]. The role reported in the literature for the Tau Class GST sequences At_NP_172508.1, Gm_AAG34804.1 and At_NP_180510.1, which are the most similar to CITSI47:3, CITSI24:CX077363, CITSI25:1, and the Phi class GST At|NP_171792.1 similar to CITSI52:1, is associated to stress. Though the sequence similarity, we cannot confirm this role for the citrus GSTs since the expression patterns here considered do not include tissues under stress conditions. However, a high similarity level was detected between the sequence DQ198153 from C. sinensis and sequence NP_197224 (i.e. TT19) from A. thaliana, which is reported to be involved in anthocyanins accumulation [29, 35, 36]. The over expression of DQ198153 in Moro flesh when compared to Cadenera (Table 2), as well as the tissue specificity of the ESTs this sequence is assembled from, highlight that the sequence is differentially expressed in Moro flesh fruit, and suggest that this GST may be specifically involved in the accumulation of anthocyanins in the blood cultivar [37]. In fact, it is known that although the anthocyanin’s biosynthetic enzymes are localized in the cytoplasm [38], stable colouration develops only after the pigment is transferred to the vacuole thank to a GST, as observed in maize [39]. The functional role of DQ198153 is also corroborated (i) by its sequence similarity with the Anthocyanin 9 (An9) from P. hybrida[22], and (ii) by GST enzyme assays [40]. AN9, indeed, has been the first plant GST proposed to be a cytoplasmic flavonoid carrier protein [7].