Gene organization and evolutionary history

Classification

Glycosyltransferases (EC 2.4.x.y) catalyze the transfer of sugars to a wide range of acceptor molecules. The enzymes can be classified into families on the basis of sequence similarity, catalytic specificity and the existence of consensus sequences [1,2,3]. This review concerns only one family of UDP glycosyltransferases in higher plants: that defined by the presence of a carboxy-terminal consensus sequence, the UDP-glycosyltransferases signature, that is thought to be involved in binding of the protein to the UDP moiety of the sugar nucleotide (Figure 1) [3,4]. This consensus sequence can be identified in open reading frames from animal, plant, yeast and bacterial genomes, and it probably defines a single multigene superfamily. The term 'UGT' will be used throughout this review to refer specifically to those glycosyltransferases containing this consensus sequence. The nomenclature of the UGT superfamily is coordinated by a group of scientists who were invited by the relevant international nomenclature committees to help with the systematic naming of these and other carbohydrate-handling enzymes [3]. Figure 2 summarizes the system currently used by this group; it is based primarily on amino acid sequence identity. All Arabidopsis thaliana UGT genes discussed in this review have been named using this nomenclature system. A broader classification system for all NDP-sugar hexosyltransferases (EC 2.4.1.x) has also been described, and this groups all known glycosyltransferases into 47 distinct families [1,5]. The UGT genes discussed in this review fall into family 1 within the latter classification system. The evolutionary relationship of the different glycosyltransferase families is not yet clear, although clues have recently emerged as an increasing number of three-dimensional structures are determined (see the Characteristic structural features section).

Figure 1
Amino acid consensus sequence of UDP-glycosyltransferases taken from the PROSITE database of protein families and domains, which was used to identify the 107 A. thaliana UGT genes [7]. Letters in brackets denote alternative amino acids at a particular position; X denotes any amino acid.

Figure 2
Summary of the current UGT superfamily nomenclature system. The diagram illustrates the system currently used to name plant UDP glycosyltransferases. Further details of this nomenclature system can be found in [3] and on the UDP Glucuronosyltransferase home page [4].
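As a rough illustration of how names built under this system can be handled programmatically, the sketch below splits a UGT-style name into its parts. It assumes the commonly used layout of the root symbol "UGT", a family number, a subfamily letter and an individual gene number; Figure 2 is not reproduced here, so that layout, the helper name parse_ugt_name and the example name are assumptions made for illustration only.

```python
import re
from typing import NamedTuple


class UGTName(NamedTuple):
    family: int      # Arabic number denoting the family
    subfamily: str   # letter(s) denoting the subfamily
    gene: int        # Arabic number denoting the individual gene


# Assumed layout: "UGT" + family number + subfamily letter(s) + gene number.
_UGT_NAME = re.compile(r"UGT(\d+)([A-Z]+)(\d+)")


def parse_ugt_name(name: str) -> UGTName:
    """Split a UGT-style gene name into family, subfamily and gene number."""
    match = _UGT_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a UGT-style name: {name!r}")
    return UGTName(int(match.group(1)), match.group(2), int(match.group(3)))


# Illustrative name only; it is not taken from the text above.
EXAMPLE = parse_ugt_name("UGT84B1")
```

Sorting a list of parsed names then groups genes by family and subfamily, which is convenient when comparing a gene list against a phylogenetic grouping.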

Evolution

The sequencing of the model plant species A. thaliana has recently been completed [6]. Using the UGT amino acid consensus sequence shown in Figure 1 as a search tool, we have screened its genome and identified a very large glycosyltransferase superfamily containing 107 putative UGT genes and 10 UGT pseudogenes [7]. Analysis of this superfamily has allowed the first characterization of higher plant UGTs at the genomic level to be performed. Information available on other plant UGTs can now begin to be integrated into the results from this genomic analysis.
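A consensus-based screen of this kind can be sketched in a few lines of Python. The example translates a PROSITE-style pattern into a regular expression and reports the first match in each protein sequence. The pattern and sequences used here are toy values invented for illustration; they are not the UGT signature of Figure 1, and the function names are hypothetical rather than the tools used in [7].

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat count,
        # e.g. "x(2)" or "x(4,6)".
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # "[ABC]" alternatives and single letters are already valid regex.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


def scan(sequences: dict, pattern: str) -> dict:
    """Return {name: (1-based start, matched text)} for sequences with a hit."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, seq in sequences.items():
        m = regex.search(seq)
        if m:
            hits[name] = (m.start() + 1, m.group())
    return hits


# Toy pattern and sequences, invented for illustration only.
TOY_PATTERN = "[FW]-x(2)-Q-x(2,3)-G"
TOY_SEQUENCES = {"seqA": "MKTWAAQLLGSS", "seqB": "MSSPLLKKA"}
TOY_HITS = scan(TOY_SEQUENCES, TOY_PATTERN)
```

A regular expression only finds exact matches to the pattern, so more divergent relatives would be missed by this approach and need profile-based searches instead.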

We have performed a detailed analysis of the amino acid sequences of the open reading frames of 88 A. thaliana UGT genes using neighbor-joining and parsimony-based analysis methods with statistical confidence measurements by bootstrap analysis [7], resulting in an unrooted phylogenetic tree consisting of 12 well-defined major evolutionary groups. An updated but less detailed analysis using 107 A. thaliana UGT genes (including the 88 analyzed in [7]) is shown in Figure 3. Further refinement of more closely related sequences has been shown in the equivalent analysis of 88 Arabidopsis UGTs [7]. Bootstrap analysis of this expanded tree shows that the superfamily is likely to contain 14 distinct groups that evolved from an equivalent number of ancestral UGT genes.
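The idea behind bootstrap confidence values can be illustrated with a deliberately simplified sketch: alignment columns are resampled with replacement, and the fraction of replicates in which a given pair of sequences remains each other's closest relatives is recorded. This is a stand-in for, not a reproduction of, the neighbor-joining and parsimony analyses of [7]; the p-distance measure, the "closest pair" criterion, the function names and the toy alignment are all assumptions made for illustration.

```python
import random


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_pair(alignment: dict) -> tuple:
    """Return the names (sorted) of the two least-distant sequences."""
    names = sorted(alignment)
    best = min(
        (p_distance(alignment[a], alignment[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return best[1], best[2]


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment: dict, pair: tuple, replicates: int = 200,
                      seed: int = 0) -> float:
    """Fraction of replicates in which `pair` is still the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


# Toy ungapped alignment, invented for illustration only.
TOY_ALIGNMENT = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKTAYIAKQK",
    "seq3": "MRSAYLGKHR",
    "seq4": "LRSGYLGRHR",
}
SUPPORT = bootstrap_support(TOY_ALIGNMENT, closest_pair(TOY_ALIGNMENT))
```

In a full analysis the same resampling step is applied before rebuilding the whole tree, and the support for each node is the percentage of replicate trees that contain it.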

Figure 3
Phylogenetic analysis of the Arabidopsis UGT superfamily. Neighbor-joining and parsimony-based analysis of nine conserved amino acid sequences shown in Figure 4 was performed as described previously [7]. Bootstrap values over 60% are indicated above the nodes, with the number on the left indicating neighbor-joining and that on the right indicating parsimony. Dashes indicate bootstrap values under 60%. Further refinement of more closely related sequences has been shown in the equivalent analysis of 88 Arabidopsis UGTs [7]. Hypothetical intron gains and losses are indicated by diamonds with the intron number (I) shown (see Figure 4). Postulated intron gains are indicated by filled diamonds, intron losses by unfilled diamonds and the questionable intron loss by a striped diamond.

Using programs capable of detecting more distantly related sequences, such as PSI-BLAST, two additional A. thaliana genes have recently been identified that contain amino acid sequences similar to the UGT consensus sequence. These genes encode proteins 100 residues longer than those encoded by any of the previously identified A. thaliana UGT genes, and each contains 13 introns (see the Gene organization section of this article for details of the intron organization of UGT genes). One of these genes has been previously identified as a UDP-glucose sterol β-D-glucosyltransferase [8].

No comparable analysis has yet been carried out for other plant species. Given that so many UGT sequences can be found in the comparatively small genome of A. thaliana, however, it is probable that large numbers will also be detected in species throughout the plant kingdom. Large numbers can similarly be detected in animal species; Caenorhabditis elegans, for example, has a 60-gene UGT superfamily. A complete list of UGTs currently annotated can be found at the UDP Glucuronosyltransferase home page [4].

Gene organization

The A. thaliana UGT genes are scattered throughout the genome but do show clustering into groups of two to seven genes; clustered genes show a high degree of amino acid sequence similarity. The genes encoding the A. thaliana UGTs contain up to two introns, but over half (58/107) contain no introns (Figure 4). An analysis of the intron-containing UGT genes suggests that a minimum of ten independent intron-insertion events and either one or two intron-loss events have occurred during A. thaliana UGT evolution (Figure 3) [7].

Figure 4
The conserved regions and intron positions of the UGT genes of A. thaliana. The nine conserved amino acid regions are shown as red boxes. Segments between these boxes represent regions with a variable number of residues. The positions of introns are indicated by arrows and inverted triangles. Examples of UGT genes containing one or more of the nine introns are shown.

Characteristic structural features

Sequence features

The amino acid sequences encoded by the UGT genes containing the consensus sequence shown in Figure 1, which vary in length from 435 to 507 amino acids, have all been found to possess nine conserved regions, including the UGT-defining consensus sequence (Figure 4) [7]. The level of similarity between these UGT amino acid sequences varies from over 95% to lower than 30% identity. The amino-terminal regions are more variable than the carboxy-terminal regions, supporting the suggestion that the domain involved in the recognition and binding of the diverse aglycone substrates is located towards the amino terminus of the protein whereas the carboxy-terminal region encodes a domain involved in binding the nucleotide sugar substrate [9].
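The contrast between a variable amino-terminal region and a more conserved carboxy-terminal region can be checked on any pair of aligned sequences by computing percent identity separately for each part. The sketch below does this with a crude split at the alignment midpoint; the midpoint split, the function names and the toy sequences are assumptions for illustration, since the text does not define a domain boundary.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identical residues over aligned columns without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not cols:
        return 0.0
    return 100.0 * sum(x == y for x, y in cols) / len(cols)


def half_identities(a: str, b: str) -> tuple:
    """Percent identity of the amino-terminal and carboxy-terminal halves."""
    mid = len(a) // 2  # crude split; not a real domain boundary
    return percent_identity(a[:mid], b[:mid]), percent_identity(a[mid:], b[mid:])


# Toy aligned sequences, invented for illustration only.
TOY_A = "MKTAYIAKQRGHLLPF"
TOY_B = "LRSGYLAKQRGHLMPF"
N_TERM_ID, C_TERM_ID = half_identities(TOY_A, TOY_B)
```

For real data the boundary would be better placed using the conserved regions of Figure 4, and identities would be averaged over many sequence pairs rather than read from a single comparison.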

Structural features

To date, none of the proteins encoded by the UGT superfamily has been crystallized and their three-dimensional structures are not known. Six glycosyltransferases from other superfamilies have been analyzed structurally, however, and these analyses suggest that, although they were previously thought to be unrelated, they may fall into just two superfamilies [10]. The first of these contains bacteriophage T4 β-glucosyltransferase (BGT) and the Escherichia coli N-acetylglucosaminyltransferase MurG, and the second contains Bacillus subtilis glycosyltransferase SpsA, bovine β-1,4-galactosyltransferase 1, rabbit N-acetylglucosaminyltransferase I and the catalytic fragment of the human glucuronyltransferase I. Interestingly, an approximately 30 amino acid sequence motif in MurG, suggested by the structure to be involved in nucleotide-sugar binding, has been shown to be similar to the UGT consensus sequence described above (Figure 1) [2,11]. Further insight into UGT structure and subsequent structure-function relationship now awaits the resolution of a three-dimensional structure for an enzyme from this superfamily.

Localization and function

Localization

Mammalian UGTs, which transfer glucuronic acid to hydrophobic substrates, are membrane-bound enzymes localized in the endoplasmic reticulum with their catalytic sites facing the lumen. These enzymes contain an amino-terminal leader sequence that is cleaved on cotranslational segregation into the rough endoplasmic reticulum, and a hydrophobic carboxy-terminal halt sequence that anchors the enzyme to the membrane [11]. Our analyses of A. thaliana UGTs using the TopPred2, SignalP and Psort programs have not identified either of these motifs, supporting the widely held belief that plant UGTs are cytoplasmic enzymes.

Very little information is available from plants regarding the expression of UGT genes. Tomato and tobacco UGTs have been shown to respond rapidly to signals from wounds and pathogen attack [12,13]. There are also now significant data available from the Stanford microarray website on expression [14] of 14 of the 106 A. thaliana UGTs. The high level of sequence homology between family members suggests, however, that expression data obtained using either expressed sequence tag (EST) or full-length cDNA probes should be treated with caution, as full-length probes may well hybridize to several closely related UGTs and produce misleading expression profiles. No data are yet available to evaluate whether UGT expression is regulated principally at the DNA, RNA or protein level.

Functions

The UGT superfamily in higher plants is thought to encode enzymes that glycosylate a broad array of aglycones, including plant hormones, all major classes of plant secondary metabolites, and xenobiotics such as herbicides [15]. Glycosylation regulates many properties of the aglycones, such as their bioactivity, their solubility and their transport properties within the cell and throughout the plant. In addition to the A. thaliana UGT genes, numerous UGT genes have been isolated from a wide range of different plant species and their corresponding gene products either characterized biochemically or defined by genetic analysis [15]. An alignment of these sequences with the A. thaliana UGT superfamily and their phylogenetic analysis predicts their position on the A. thaliana UGT tree [7], which is shown in Figure 5 along with the available data on substrate specificity of these enzymes.

Figure 5
The relationship of the groups of A. thaliana UGTs with other published plant UGTs. A simplified version of the A. thaliana UGT phylogenetic tree is shown with other plant UGTs added. The bootstrap values, which give the degree of confidence in the branching pattern presented, are 60-90% unless otherwise stated. The published substrate specificities for UGTs other than those of A. thaliana are listed to the right of the figure. Full species names referred to in the figure are as follows: Brassica napus, Citrus unshiu, Nicotiana tabacum, Perilla frutescens, Zea mays, Sorghum bicolor, Gentiana triflora, Petunia hybrida, Vitis vinifera, Phaseolus lunatus, Phaseolus vulgaris, Dorotheanthus bellidiformis, Solanum tuberosum.

The task of comprehensively assaying UGT substrate specificity is a formidable one and much work remains to be done. Nevertheless, the identification of substrate specificity of higher plant UGTs is beginning to allow some conclusions to be drawn and some interesting relationships between different UGTs to be detected. For example: enzymes that catalyze the formation of salicylic acid glucose ester and indole-3-acetic acid glucose ester share the highest sequence homology with Group L from Arabidopsis, which contains enzymes that produce hydroxycinnamoyl glucose ester [15,16,17]; three UGTs known to be involved in the 3-O-glucosylation of anthocyanidin in both monocotyledons and dicotyledons are all clustered with the Arabidopsis Group F [18]; and two highly homologous sequences encoding enzymes that glycosylate the plant hormone zeatin are distinct from all the major UGT groups of Arabidopsis, suggesting the possible presence of Arabidopsis zeatin glycosyltransferases that have not been identified in the A. thaliana UGT superfamily [19].

These data, taken together, provide a useful foundation for starting to understand the structure-activity relationships of the UGT family. It will be interesting to compare the catalytic specificity in vitro with the consequences of changing the level of individual enzymes in vivo. A broad specificity of recombinant enzymes in vitro may not provide insight into the activity in planta, because substrate availability will also be relevant in the cellular context.

It has been suggested that many UGTs may not exhibit high substrate specificity at all, but rather recognize individual hydroxyl groups present on a wide range of different aglycones [15]. Our substrate-specificity data do not seem to support this suggestion, as screening of 36 Arabidopsis UGTs revealed only one enzyme capable of glucosylating indole-3-acetic acid [16]. Thus, for at least certain UGTs, reactions may be directed by substrate specificity rather than regiospecificity. A much clearer picture will emerge when substrates of more Arabidopsis enzymes have been identified and these data are considered within the context of temporal and spatial expression profiles in planta.

Enzyme mechanism

UGTs transfer nucleotide-diphosphate-activated sugars to low-molecular-weight aglycone substrates. In plants, the activated sugar is usually UDP-glucose but other sugars such as UDP-xylose [19] are also found. The conjugation of the sugar can lead to the formation of a range of glycosylated molecules including glucose esters, cyanogenic glucosides, phenolic glucosides and glucosinolates containing a β-thioglucose moiety. Many aglycones, such as the flavonols, can also accept more than one sugar if a number of sites are available for glycosylation. The exact catalytic mechanism used by UGTs is not yet known. As discussed above, the enzymes are generally thought to contain an aglycone-binding amino terminus and a UDP-sugar-binding carboxyl terminus but any conclusions regarding enzymatic mechanism await determination of the crystal structure.

Frontiers

It will be essential to integrate data from in vitro and in vivo studies to gain a more complete picture of the potential biological roles of UGTs in plants. This is now feasible with current technology: microarray data, details of the catalytic activities of specific recombinant proteins, metabolite profiles of plants over-expressing or lacking individual UGTs, as well as information on the cell- and tissue-specificity of gene expression, can all be accessed and integrated. Similarly, once the three-dimensional structure of one UGT has been determined, molecular modeling will provide very rapid insights into the structural relatedness of other superfamily members and how this relatedness is reflected in catalytic activities.

The recent realization that Arabidopsis, with such a small genome relative to other species in the plant kingdom, has so many UGTs opens up a whole range of new frontiers, both for the fundamental understanding of UGT functions and for the many strategic applications of the UGT superfamily.

Additional data file

An Excel file containing the accession numbers of the Arabidopsis BAC clones that contain UGT genes and the location of each gene in the clone is included (file added on 17 July 2001).