Background

Glutathione transferases (GSTs; EC 2.5.1.18) comprise a superfamily of genes encoding ubiquitous enzymes that strongly influence the clinical outcome of cancer therapy because they metabolize and inactivate anticancer agents (in particular four classes of GST: alpha, mu, pi and theta) [1, 2]. Furthermore, the interaction of GSTs with pesticides and pollutants makes them an interesting target for protein engineering in plants [3]. The general role of GSTs in detoxification and the metabolism of xenobiotics has been well documented (reviewed in [3-7]) and makes GSTs a promising subject for evolutionary analyses, especially considering temporal fluctuations in toxic chemicals in the environment.

GSTs are an excellent example of how multiple gene duplication events followed by sub- or neofunctionalization have resulted in groups of enzymes with a myriad of functions. Their occurrence as either homodimers or heterodimers further increases the diversity of GST catalytic activities, which are all centered on chemical reactions that use the tripeptide glutathione (GSH). GSTs are involved in the metabolic detoxification of reactive electrophiles, the biosynthesis of leukotrienes, prostaglandins, testosterone and progesterone, and the degradation of tyrosine [3, 5]. Furthermore, non-enzymatic regulatory roles have been attributed to some GSTs [8, 9].

The expression of GSTs can be regulated by different stresses (e.g. variation in temperature, oxidative damage, and exposure to toxins), and the promoters of cytosolic GSTs (cGSTs) contain antioxidant response elements [3]. This suggests that GSTs could be part of an adaptive response to cellular stress [5, 10]. GSH is a scavenger of reactive oxygen species, and GSTs are part of the cell machinery responsible for metabolizing by-products of oxidative stress [5]. Indeed, soluble GSTs have been detected in the mitochondria, where they are thought to play an important role in protection against the effects of reactive oxygen species produced by the mitochondrial respiratory chain [11]. However, this protection can also extend to cells that are too damaged and should be eliminated, which can lead to cancer and other diseases [12]. Likewise, a number of GST polymorphisms have been implicated in tumor resistance to chemotherapy [13] and in a number of diseases (Table 1).

Table 1 Diseases associated with cGSTs.

GSTs also have noncatalytic roles that include binding (covalently and noncovalently) dangerous chemicals and biomolecules such as bilirubin and hormones [5, 14]. Binding to reactive electrophiles is thought to be important for preventing DNA damage whereas for other molecules GSTs act as intracellular carriers [5, 14].

Cytosolic GSTs (cGSTs) are by far the most abundant GST subfamily and can be found in all aerobic organisms [3]. They are actively involved in the detoxification of generally nonpolar compounds that contain an electrophilic carbon, nitrogen, or sulfur atom [5]. cGSTs are divided into several classes, the most ubiquitous being alpha (GSTA), mu (GSTM), pi (GSTP), sigma (GSTS), omega (GSTO), zeta (GSTZ) and theta (GSTT) (see Table 2 for the occurrence of these classes in various taxa). These are also the only classes present in mammals. Additional classes are specific to plants (phi and tau), fish (rho), insects (delta and epsilon), and bacteria (beta) [6, 15]. The number of isoforms per class varies widely, ranging from one to forty [3].

Table 2 Distribution across major taxa of the cGST classes that are present in mammals (sequences obtained from GenBank; details in Table 1 in Additional file 1).

In this work, we used Bayesian and maximum likelihood-based molecular phylogenetic approaches to ascertain the overall evolutionary pattern in cGSTs. We then focused on mammalian cGSTs and their natural substrates in order to determine whether the role of GSTs in detoxification is the major force driving the expansion of this gene family, as has been suggested [3, 5]. More than 20 mammalian cGSTs have been identified to date, many metabolizing the same substrates, especially those of anthropogenic origin. This catalytic promiscuity probably enhances the formation of duplicates, increasing the probability that duplicated genes become fixed in a population, as adaptation will promote reactions that are already catalyzed by the enzyme [16]. We therefore tested available sequence data for the existence of positive selection using both gene-based and protein-based statistical approaches. Although positive selection has mostly been detected in genes involved in host-pathogen interactions, it has also been shown to influence the active site of enzymes and protein-protein interfaces in membrane receptors [17-19]. The gene-based approach rests on the idea that nonsynonymous substitutions may influence the fitness of an individual or population. Thus, adaptive molecular evolution may cause the nonsynonymous substitution rate (dN) to be higher than the synonymous rate (dS), with the ratio ω (dN/dS) being higher than 1 [20]. Likelihood ratio tests (LRTs) implemented in PAML were used to identify genes under positive selection (ω > 1) by comparing two probabilistic models of variable ω ratios among sites: a simpler one that does not allow sites with ω > 1 and a more general one that does [21]. ConTest was used to measure evolutionary rates in protein sequences while accounting for the variation of specific biochemical properties, such as volume, polarity and charge [22].
Finally, we integrated the results from the evolutionary analyses with information on the functions of the different classes of cGSTs, and concluded that the activity of cGSTs on endogenous substrates is sufficient to explain the overall expansion pattern of this subfamily.

Results & Discussion

Structure conservation

Within specific GST classes, amino acid sequence identity between paralogues is typically >40%, whereas among classes it can be less than 25% (Table 3 in Additional file 1). cGSTs are characterized by two domains, each containing an active site. The N-terminal domain adopts a thioredoxin-like fold (βαβαββα) that is mostly responsible for binding GSH, and the C-terminal domain contains varying numbers of α-helices and encloses the substrate-binding domain (Figure 1). Active cGSTs are assembled as either homodimers or heterodimers formed between members of the same class (Figure 1B).
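
The identity thresholds quoted above can be checked on any pair of aligned sequences. The following is a minimal Python sketch, not the procedure used to produce Table 3 in Additional file 1; the function name `percent_identity`, the toy sequences, and the choice to exclude gapped columns from the denominator are assumptions made for illustration (other identity conventions exist).

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both strings are rows of the same multiple sequence alignment,
    so they have equal length and column k of one matches column k of the other.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (one of several conventions)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment rows, not real GST sequences.
toy_a = "MKT-AY"
toy_b = "MRTLAY"
identity = percent_identity(toy_a, toy_b)  # 4 matches over 5 compared columns
```

With this convention, values above 40 would fall in the range reported for paralogues within a class, and values below 25 in the range reported between classes.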

Table 3 Biological information about the bacterial species present in Figure 3.
Figure 1

GST structure. A) Active site close-ups showing the different amino acids responsible for activating GSH in different GSTs (left: structures from human GST classes sigma, zeta and omega, from top to bottom; right: the amino acids from closely related isoforms (e.g. M1 is GSTM1), depicted after superposition of the corresponding structures); B) Structure of a pi GST dimer; C) Neighbor-joining trees obtained from distance matrices corresponding to the RMSD in ångström between C-alpha carbons of representative structures for each cGST class (Table 4 in Additional file 1).

The N-terminal domain contains residues that are critical for activation of the sulfhydryl group of GSH. Its structure is quite conserved compared with that of the substrate-binding C-terminal domain (Figure 1C). Conservation in the C-terminal domain is apparent among the members of the more recently evolved and closely related classes (alpha, mu, pi and sigma), which is in line with their overlapping substrate specificities.

Phylogeny

Given the high sequence divergence within cytosolic GSTs (Table 3 in Additional file 1), building an accurate phylogenetic tree that includes representatives of all GST gene classes is challenging, and inferences based on these trees should be drawn cautiously. Since cGST protein structure is quite conserved, we overcame the problem of nucleotide homoplasy by using structural information when building sequence alignments for cGSTs from multiple classes, which increased confidence in the phylogenetic analyses relative to previous studies (see [6, 23] and references therein). The phylogenetic relationship among the cGST classes in mammals is shown in Figure 2. Bayesian and maximum likelihood approaches result in the same overall clade topology (Figure 2A in Additional file 1). Orthology and paralogy are difficult to assess for genes in classes alpha and mu, where duplication events were very profuse. The phylogenetic relationships within classes that have a single isoform (pi, sigma and zeta) are incongruent with known phylogenetic relationships, namely for Muridae GSTs (Figure 2) [24]. Muridae are also the mammals with the highest number of GST isoforms, another indication that GSTs might have followed an independent evolutionary path in this group. The phylogenetic relationship among the cGST classes in a wide variety of taxa is shown in Figure 3, where we can see a perfect correspondence with the currently accepted phylogeny of species at the animal class level (mammals, amphibians, insects, etc.). Theta, zeta and omega cGSTs are generally considered to be the most ancestral, and are found in a wide variety of organisms (Table 2) [3]. However, this has been difficult to prove (Figure 3) [3, 6]. The omega class is a good candidate because it uses a cysteine residue to activate GSH (Figure 1A), similarly to glutaredoxins, which are the suggested ancestors of the N-terminal domain of cGSTs [25].
Class theta has previously been proposed as the root of the GST tree, mainly because many sequences were initially allocated to this class and it seemed to be the class represented in the largest diversity of organisms (Table 2). However, this pattern could be biased if some of the classes were lost during evolution because they became nonessential in some species. The fact that class theta is involved in the metabolism of products of oxidative damage could have made it more essential than the other two ancient classes, omega and zeta. The emergence of new genes could also have led to species-specific redundancy and the elimination of members of a GST class. Another possibility is that these genes have not yet been detected in certain taxa because of a lack of sampling, since more and more GST genes are being detected in newly sequenced genomes. In fact, classes alpha and sigma were thought until recently to be present only in metazoans [3, 23], but our BLAST searches retrieved bacterial homologs (Figure 3), revealing that the increasing amount of available sequence data is still shaping our knowledge of the evolution of cGSTs. We found bacterial homologs for nearly all GST classes (Figure 3), which again shows that GSTs are more ubiquitous than first thought. Even though hypothetical events of horizontal gene transfer from metazoans to bacteria are possible, we suggest that these bacterial cGST sequences indeed have an ancient origin because i) there is a deep divergence among bacterial GST sequences (Figure 3), ii) the corresponding organisms inhabit distinct habitats (Table 3), and iii) these bacteria are phylogenetically divergent (e.g. cyanobacteria vs proteobacteria, divergence time over 2 billion years ago [26]; see Figure 1 in Additional file 1).

Figure 2

The nucleotide phylogenetic tree of mammalian cGSTs. The tree was built in MrBayes [36] after excluding the third codon position. Posterior probability (PP) values are shown (PP values obtained in a tree reconstruction using amino acid sequences are given in parentheses). The multiple sequence alignment was first done at the amino acid level using structural information in 3D-Coffee [32]. Changes in the key catalytic amino acid are depicted in black boxes. The enzymatic activities of each class are described on the right. Classes theta, zeta and omega are thought to be the most ancient (see Table 2 for distribution in various taxa).

Figure 3

Phylogenetic tree of cGSTs from various taxonomic groups. The tree was built in MrBayes [36] and posterior probability values are depicted. The multiple sequence alignment was done at the amino acid level using structural information in PROMALS3D [33]. Bacterial sequences are shown in boxes. The first four letters of the sequence name are an abbreviation of the species name, and the last characters correspond to the GST isoform, when defined (see Table 1 in Additional file 1).

The cGST phylogenetic tree is supported by two fundamental changes in the chemistry of GSTs involving a conserved residue before the third β strand that provides the main-chain hydrogen-bond donors and acceptors for GSH. This residue, which activates GSH during catalysis, changes (1) from a cysteine to a serine, and then (2) from a serine to a tyrosine (Figure 1A). Classes alpha and mu, which use a tyrosine in the active site, have the highest number of successful duplicates, with up to seven isoforms per species. It is precisely within these classes that we find extensive signatures of positive selection (Table 4). In the next section we discuss the functional diversification of mammalian cGSTs considering the relevance of their natural substrates. Also noteworthy is the strong agreement between the phylogenetic tree of the mammalian classes and the tree based on the root mean square deviation (RMSD) between the corresponding structures of the ligand-binding domain (Figure 1C).

Table 4 Positive selection analysis on mammalian sequences using site models M7 and M8 in PAML [21] (probabilistic models of variable ω ratios among sites: the simpler M7, which does not allow sites with ω > 1, and the more general M8, which does).

Functional diversification and asymmetrical family expansion

The GST family has expanded through multiple duplications [3, 5]. The fact that GST enzymes accumulate multiple roles, both enzymatic and non-enzymatic, facilitates neofunctionalization through the optimization of what was a secondary function [16]. This is clearest for omega cGSTs, where the inference of orthology is straightforward. We suggest that the main function of the ancestral enzyme was the regeneration of ascorbate (the main reaction catalyzed by GSTO2 [27]), since we find that GSTO1 is under positive selection (Table 5). The duplicate then diverged from this initial function and optimized what was a secondary reaction in the ancestral enzyme, the metabolism of arsenic. We expect this pattern to be general in GSTs: duplicates sometimes retain the previous main function but are optimized for a new one (e.g. both GSTOs are involved in the metabolism of arsenic and ascorbate, but GSTO1 is better at metabolizing arsenic and GSTO2 has an activity towards ascorbate 70-100 times greater than that of GSTO1 [27]).

Table 5 Sites under positive selection detected by the BEB approach in PAML (bold, PP >0.9) and by ConTest (underlined, p < 0.05).

The expansion of the cGSTs was particularly profuse but, as we can see in mammals, not all the resulting classes of enzymes have duplicated extensively. For example, sigma and zeta cGSTs are involved in pathways that require more precise regulation, and for which the existence of a duplicate could have been harmful. GSTZ is a key enzyme in the metabolic degradation of phenylalanine and tyrosine, and its product can cause the fatal hereditary disease tyrosinemia type I [28]. GSTS produces prostaglandin D2 [29] (prostaglandins are lipid mediators involved in the regulation of the respiratory, cardiovascular, central nervous, genitourinary, endocrine, and immune systems). In contrast, alpha, mu and theta cGSTs are all involved in the cellular response to stress conditions and have multiple duplicates (up to seven, five and four for the alpha, mu and theta classes, respectively, compared with only one isoform in e.g. the sigma and zeta classes). These duplicates probably contributed to an increase in fitness through the elimination of a broader range of reactive and harmful chemicals, especially through variation in the substrate-binding pocket. Changes at a hypervariable site in this region have been shown to have a central role in defining enzyme specificity [30]. Our hypothesis is thus supported by evidence of several positively selected sites in the substrate-binding pocket (Table 5).

The catalytic promiscuity of GSTs has certainly facilitated neofunctionalization and duplicate gene retention in the most populated cGST classes. We further suggest that subsequent loss of duplicates arising from functional redundancy was probably avoided by variation in tissue expression (e.g. human GSTA3 is expressed solely in steroidogenic tissues and GSTM3 is selectively expressed in testis and brain [31]) and by increased specificity for a particular molecule (such as GSTO1, which has an improved efficiency in metabolizing arsenic; see above).

Conclusions

Glutathione transferase enzymes represent a superfamily that has acquired many functional roles throughout evolution. This work has revealed that cGSTs are more widespread and more ancient than previously thought, with many cGST classes having a bacterial isoform. The need for protection against the products of oxidative damage has no doubt driven the expansion of the family via positive selection on GST duplicates, but along the way GSTs have acquired many other roles, such as the metabolism of sex hormones and the regulation of apoptosis, which were likely vital for the retention of duplicates. The metabolism of dangerous xenobiotics is also a major role of GSTs and has been suggested to be driving the expansion of the family. In fact, at times when the levels and toxicity of environmentally available chemicals changed, catalytic promiscuity and an elevated number of cGST isoforms would have constituted a fitness advantage. However, functional redundancy towards many of these chemicals would eventually lead to a reduction in the number of cGST isoforms. We suggest that the expansion of the cGST family was strongly favored by the fact that the different isoforms within a class have acquired new functions towards endogenous substrates. We hypothesize that these roles dictate the number of duplicates per class, which is higher for classes involved in the metabolism of products of oxidative stress (alpha, mu and theta) but reduced to one for classes involved in tightly regulated biological processes (as is the case for class zeta, involved in the metabolism of the aromatic amino acids tyrosine and phenylalanine). The need for duplicates with sometimes overlapping substrate specificities is further supported by differences in tissue specificity and by increased specificity for a particular reaction that is only weakly catalyzed by other isoforms.

Presently it is still difficult to ascertain the most ancient class of cGSTs, with theta and omega being the best candidates. Nevertheless, the information arising from whole-genome sequencing of an increasing number of non-mammalian species will certainly provide further insight into the origins and evolution of this ancient gene family.

Materials and methods

The cGST sequences used in this work were obtained from GenBank and are listed in Table 1 in Additional file 1.

Because of the high sequence divergence of cGSTs, amino acid multiple sequence alignments were done using protein structure information in 3D-Coffee [32] and PROMALS3D [33] (see Additional files 2 and 3 for the alignments used to build the trees presented in Figures 2 and 3, respectively). The codons in the nucleotide alignment were aligned accordingly. We chose a limited number of representative sequences for all clades (the insect epsilon class is omitted as it forms a clade with the delta class). DAMBE and MEGA4 were used for sequence editing and formatting [34, 35].
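
Aligning codons "accordingly" means placing each codon under the residue it encodes in the protein alignment, with three gap characters wherever the protein row has a gap. The sketch below illustrates that idea in Python under stated assumptions: the function name `thread_codons` and the toy sequences are hypothetical, the coding sequence is assumed to be in frame with the stop codon already removed, and this is not necessarily the tool or procedure used in this work.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Build a codon-alignment row from an aligned protein row and its unaligned CDS.

    Assumptions: the CDS is in frame, has no stop codon, and encodes exactly
    the residues present in the aligned protein row.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be three times the number of residues")
    # Consume codons left to right as residues are encountered.
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    row = []
    for aa in aligned_protein:
        row.append(gap * 3 if aa == gap else next(codons))
    return "".join(row)


# Hypothetical toy example: protein row "M-KT" encoded by ATG AAA ACC.
codon_row = thread_codons("M-KT", "ATGAAAACC")
```

Because every protein column maps to exactly three nucleotide columns, the resulting nucleotide alignment inherits the structure-informed gap placement of the protein alignment.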

We built the various phylogenetic trees from the nucleotide and amino acid alignments using MrBayes [36] and PhyML [37], after determining the optimal model of sequence substitution with Modeltest 3.04 (TVM+I+G) [38] and ProtTest (JTT+I+G) [39]. One cold and four incrementally heated chains were run for 2,000,000 generations, with chains i = 2, 3, 4, and 5 heated according to heat = 1/(1 + [i - 1]T), with T = 0.2. Trees were sampled every 100 generations and, following a burn-in of 500,000 generations (well after the chains reached stationarity), 15,000 trees were used for inferring Bayesian posterior probabilities. The nucleotide tree was obtained after removal of the third codon position. The trees obtained with PhyML are presented in Figure 2 in Additional file 1. All trees were drawn in FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
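
As a quick arithmetic check of the run settings described above, the following Python sketch evaluates the heating formula for the five chains, counts the post-burn-in samples, and shows what removing third codon positions does to an in-frame alignment row. The function names are hypothetical, and the sample count assumes that exactly one tree is kept per 100 generations after the burn-in.

```python
from typing import List


def chain_heats(n_chains: int = 5, temp: float = 0.2) -> List[float]:
    """Heat of chain i (i = 1 is the cold chain): 1 / (1 + (i - 1) * T)."""
    return [1.0 / (1.0 + (i - 1) * temp) for i in range(1, n_chains + 1)]


def retained_samples(generations: int, burn_in: int, sample_every: int) -> int:
    """Number of sampled trees left after discarding the burn-in generations."""
    return (generations - burn_in) // sample_every


def drop_third_positions(codon_row: str) -> str:
    """Keep codon positions 1 and 2 of an in-frame nucleotide alignment row."""
    return "".join(base for k, base in enumerate(codon_row) if k % 3 != 2)


heats = chain_heats()            # cold chain first, then the four heated chains
n_trees = retained_samples(2_000_000, 500_000, 100)
trimmed = drop_third_positions("ATGAAAACC")  # hypothetical toy row
```

With T = 0.2 the cold chain has heat 1 and the hottest chain has heat 1/1.8, and the sample count evaluates to 15,000, in agreement with the number of trees stated above.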

Likelihood ratio tests (LRTs) implemented in PAML [21] were used to identify genes under positive selection. These tests identify adaptive molecular evolution, which occurs when the nonsynonymous substitution rate (dN) is higher than the synonymous rate (dS), with the ratio ω (dN/dS) being higher than 1 [20]. The LRTs compare two probabilistic models of variable ω ratios among sites: a simpler one that does not allow sites with ω > 1 and a more general one that does [21] (model M7 vs M8). Amino acid sites under positive selection were detected with Bayes empirical Bayes (BEB) inference under the M8 model in PAML [21], and with ConTest, which assesses changes in biochemical constraints on proteins to calculate evolutionary rates [22]. ConTest accounts for the variation of specific biochemical properties, such as volume, polarity and charge, using a statistical method based on the comparison of two rate measures: a site is considered constrained for a given property if it shows high conservation of that property relative to its total evolutionary rate (some positions may be constrained while having a high substitution rate, provided these substitutions do not affect the biochemical property under constraint). ConTest can also be used to infer positively selected positions, as it looks for sites that have experienced more nonconservative substitutions than expected by chance under the neutral hypothesis. Sequence alignments were built to perform the positive selection tests. For classes where orthology/paralogy was difficult to assign, tests were done using sets of genes existing in single species (human, macaque, mouse and rat; Table 4 and Table 5).
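
For readers unfamiliar with the M7 vs M8 comparison, the test statistic is twice the difference in maximized log-likelihood between the two models, conventionally compared with a chi-square distribution with two degrees of freedom (M8 adds two parameters to M7). The Python sketch below illustrates this calculation only; the function name and the log-likelihood values are hypothetical and are not results from this study, and the closed-form tail probability used here holds specifically for two degrees of freedom.

```python
import math
from typing import Tuple


def lrt_m7_vs_m8(lnl_m7: float, lnl_m8: float) -> Tuple[float, float]:
    """Likelihood ratio test of M8 (allows sites with omega > 1) against M7.

    Returns (test statistic, p-value). The p-value uses the chi-square
    survival function for 2 degrees of freedom, which is exp(-x / 2).
    """
    # M8 nests M7, so its log-likelihood should not be lower; clamp at zero
    # to guard against small negative values from numerical optimization.
    statistic = max(0.0, 2.0 * (lnl_m8 - lnl_m7))
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods, for illustration only.
statistic, p_value = lrt_m7_vs_m8(lnl_m7=-2500.0, lnl_m8=-2492.0)
significant = p_value < 0.05
```

A significant test indicates only that a model with a class of sites with ω > 1 fits better; identifying which sites belong to that class is a separate step, done here with the BEB posterior probabilities.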

MultiProt [40] was used to calculate the root mean square deviation (RMSD) between C-alpha atoms of cGST structures (Table 2 in Additional file 1). Neighbor-joining trees were obtained from the corresponding distance matrices in PAUP [41].
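
The RMSD entries of such a distance matrix follow from a simple formula once two structures have been superposed and their C-alpha atoms paired. The Python sketch below shows only that final step, under the assumption that superposition and residue matching have already been done by the structural alignment program; the function name and coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """RMSD (in the units of the input, e.g. ångström) between paired C-alpha atoms.

    Assumption: the two coordinate lists are already superposed and list the
    matched atoms in the same order.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Hypothetical toy coordinates: two atoms, the second displaced by 2 along z.
toy_rmsd = ca_rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                   [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
```

Computing this value for every pair of representative structures yields the symmetric matrix that a neighbor-joining method takes as input.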