Long-term balancing selection maintains trans-specific polymorphisms in the human TRIM5 gene
- Cite this article as:
- Cagliani, R., Fumagalli, M., Biasin, M. et al. Hum Genet (2010) 128: 577. doi:10.1007/s00439-010-0884-6
The human TRIM5 gene encodes a retroviral restriction factor (TRIM5α). Evolutionary analyses of this gene in mammals have revealed a complex and multifaceted scenario, suggesting that TRIM5 has been the target of exceptionally strong selective pressures, possibly exerted by recurrent waves of retroviral infections. TRIM5 displays inter-individual expression variability in humans, and high levels of TRIM5 mRNA have been associated with a reduced risk of HIV-1 infection. We resequenced TRIM5 in chimpanzees and identified two polymorphisms in intron 1 that are shared with humans. Analysis of the gene region encompassing the two trans-specific variants in human populations identified exceptional nucleotide diversity levels and an excess of polymorphism compared to fixed divergence. Most tests rejected the null hypothesis of neutral evolution for this region, and haplotype analysis revealed the presence of two deeply separated clades. Calculation of the time to the most recent common ancestor (TMRCA) for TRIM5 haplotypes yielded estimates ranging between 4 and 7 million years. Overall, these data indicate that long-term balancing selection, an extremely rare process outside MHC genes, has maintained trans-specific polymorphisms in the first intron of TRIM5. Bioinformatic analyses indicated that variants in intron 1 may affect transcription factor-binding sites and, therefore, TRIM5 transcriptional activity. The data herein confirm an extremely complex evolutionary history of TRIM5 genes in primates and open the possibility that regulatory variants in the gene modulate susceptibility to HIV-1.
The TRIM5 gene encodes a member of the tripartite motif protein family, which counts more than 70 members in the human genome. TRIM5 is located on human chromosome 11, in a cluster of four TRIM genes. Several transcripts originate from TRIM5 by alternative splicing; the longest splicing isoform (TRIM5α) contains a SPRY domain and possesses antiviral activity (Stremlau et al. 2004). Human TRIM5α has been shown to restrict some retroviruses, but it is only weakly effective against HIV (Stremlau et al. 2004; Kaiser et al. 2007). Conversely, orthologs from macaque and other primates are highly efficient in restricting HIV, possibly by binding to the incoming viral capsid and leading to its premature disassembly (Stremlau et al. 2006).
The species-specificity of TRIM5α against retroviruses is thought to be the result of amino acid variations that have been selected along primate evolution to fend off the threat imposed by ancient or ongoing retroviral infections, and TRIM5 genes have been selection targets in many mammalian species (reviewed in Johnson and Sawyer 2009). Thus, Sawyer and co-workers (2007) showed that the SPRY domain has undergone multiple episodes of positive selection in primates, and the same protein region has experienced length variation and segmental duplications in different primate lineages (Song et al. 2005). Expansions, deletions and duplications of the entire TRIM5 gene have also been observed in different mammalian species (Sawyer et al. 2007; Tareen et al. 2009), and chimeric TRIM5-cyclophilin genes have arisen independently at least twice during the evolutionary history of primates (Virgen et al. 2008).
Recent data indicated that TRIM5 has evolved under long-term balancing selection in some primate species, and trans-specific polymorphisms in macaques and sooty mangabeys have been identified (Newman et al. 2006). Old polymorphisms shared by multiple species are extremely rare and are generally considered compelling evidence of long-term balancing selection, as the maintenance of a neutral allele over long evolutionary times is quite unlikely (unless species are very closely related) (Charlesworth 2006). The best known examples of trans-specific polymorphisms involve MHC loci in multiple species (including humans) and the self-incompatibility genes of certain plants and fungi (Charlesworth 2006).
In humans, a recent survey for shared polymorphisms with chimpanzees outside the MHC revealed no instance that could be ascribed to the action of balancing selection but rather to coincidental mutation (Asthana et al. 2005).
We analyzed nucleotide variation at the TRIM5 locus in humans and chimpanzees; results indicate that intron 1 harbours trans-specific polymorphisms maintained by long-term balancing selection and displays extreme nucleotide diversity levels.
Materials and methods
DNA samples and sequencing
Human genomic DNA was obtained from the Coriell Institute for Medical Research; all individuals have been included in the HapMap project. These samples only partially coincide with those resequenced by the NIEHS SNP discovery Program; therefore, all analyses were performed separately using either our resequencing data (population genetic analyses) or the NIEHS sample data (sliding window analysis). The genetic material of three unrelated chimpanzees (Pan troglodytes) was purchased from the European Collection of Cell Cultures. All analyzed regions were PCR amplified and directly sequenced; primer sequences are available upon request. PCR products were treated with ExoSAP-IT (USB Corporation, Cleveland, OH, USA), directly sequenced on both strands with a Big Dye Terminator sequencing Kit (v3.1 Applied Biosystems) and run on an Applied Biosystems ABI 3130 XL Genetic Analyzer (Applied Biosystems). Sequences were assembled using AutoAssembler version 1.4.0 (Applied Biosystems), and inspected manually by two distinct operators.
Data retrieval and haplotype construction
Genotype data for 2-kb regions from 238 resequenced human genes were derived from the NIEHS SNPs Program web site. In particular, we selected genes that had been resequenced in populations of defined ethnicity including Europeans (EU), Yoruba (YRI) and Asians (AS) (NIEHS panel 2).
Haplotypes were inferred using PHASE version 2.1 (Stephens et al. 2001; Stephens and Scheet 2005), a programme for reconstructing haplotypes from unrelated genotype data through a Bayesian statistical method. When inferring haplotypes using PHASE, we monitored the confidence probabilities associated with each phase call. Most of them were close to 1, and the overall probabilities associated with each individual were quite high (>75%), with only a few exceptions due to singleton assignments. Haplotypes for individuals resequenced in this study are available as supplemental material (Supplementary Table 1).
Tajima’s D (1989), Fu and Li’s D* and F* (1993) statistics, as well as diversity parameters θW (Watterson 1975) and π (Nei and Li 1979) were calculated using libsequence (Thornton 2003), a C++ class library providing an object-oriented framework for the analysis of molecular population genetic data. Calibrated coalescent simulations were performed using the cosi package (Schaffner et al. 2005) and its best-fit parameters for YRI, EU and AS populations with 10,000 iterations. The maximum-likelihood-ratio HKA test was performed using the MLHKA software (Wright and Charlesworth 2004), as previously proposed (Fumagalli et al. 2009). Briefly, 16 reference loci were randomly selected among NIEHS loci shorter than 20 kb that have been resequenced in the three populations; the only criterion was that Tajima’s D did not suggest the action of natural selection (i.e. Tajima’s D is higher than the 5th and lower than the 95th percentiles in the distribution of NIEHS genes). The reference set was accounted for by the following genes: VNN3, PLA2G2D, MB, MAD2L2, HRAS, CYP17A1, ATOX1, BNIP3, CDC20, NGB, TUBA1, MT3, NUDT1, PRDX5, RETN and JUND.
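As an illustration of the summary statistics named above, the following is a minimal Python sketch (not the libsequence implementation used in the study) that computes the number of segregating sites, θW, π and Tajima's D from a set of aligned haplotypes, following the formulas of Watterson (1975) and Tajima (1989). The function name and the toy haplotypes are hypothetical, and the values are per region rather than per site.

```python
import math
from itertools import combinations

def diversity_stats(haplotypes):
    """Return (S, theta_W, pi, Tajima's D) for aligned, equal-length haplotypes.

    Values are per region (not per site); gaps and missing data are not handled.
    """
    n = len(haplotypes)
    # Segregating sites: alignment columns carrying more than one allele.
    columns = [col for col in zip(*haplotypes) if len(set(col)) > 1]
    s = len(columns)
    # pi: average number of differences over all pairs of haplotypes.
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(a != b for a, b in combinations(col, 2)) for col in columns) / n_pairs
    # Watterson's estimator: S scaled by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    if s == 0:
        return s, theta_w, pi, float("nan")  # D is undefined without variation
    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return s, theta_w, pi, d

# Toy example with hypothetical haplotypes (not data from this study).
print(diversity_stats(["AAAA", "AAAT", "AATT", "ATTT"]))
```

A positive D arises when π exceeds θW, that is, when intermediate-frequency variants are in excess.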
In all analyses, the chimpanzee sequence was used as the out-group because the orthologous regions in orangutan and macaque are interrupted by several transposon insertions.
In order to test for gene conversion events, we applied Sawyer’s gene conversion algorithm (Sawyer 1989) implemented in the GENECONV program. Significance was assessed using the approximate p value method described by Karlin and Altschul (1990, 1993). We performed several tests by varying the mismatch penalty from 0 to 3. In all these runs, no pairwise or global p value involving TRIM5 intron 1 was significant, suggesting no apparent gene conversion in this region.
Haplotype analysis and TMRCA calculation
The reduced-median network to infer haplotype genealogy was constructed using NETWORK 4.5 (Bandelt et al. 1995). An estimate of the time to the most recent common ancestor (TMRCA) was obtained using a phylogeny-based approach implemented in NETWORK 4.5, with a mutation rate based on the number of fixed differences between chimpanzees and humans (Forster et al. 1996). A second TMRCA estimate was obtained by applying a previously described method (Evans et al. 2005) that calculates the average pairwise difference between all chromosomes and the MRCA; this value was converted into years based on the mutation rate. A third TMRCA estimate was derived by applying a maximum-likelihood coalescent method implemented in GENETREE (Griffiths and Tavare 1994, 1995). Again, the mutation rate μ was obtained on the basis of the divergence between human and chimpanzee, assuming that the species separation occurred 6 million years (MY) ago (Glazko and Nei 2003) and a generation time of 25 years. The migration matrix was derived from previously estimated migration rates (Schaffner et al. 2005). Using this μ and the maximum-likelihood estimate of θ (θML), we estimated the effective population size parameter (Ne). With these assumptions, the coalescence time, scaled in 2Ne units, was converted into years. For the coalescence process, 10^6 simulations were performed.
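The conversion from a scaled coalescence time to years can be sketched as follows. This is a simplified illustration, not the GENETREE procedure: it assumes θ = 4Neμ, derives μ per region per generation from the number of human-chimpanzee fixed differences (ignoring ancestral polymorphism), and uses the 6 MY split time and 25-year generation time stated above. The function name and all input values in the example are hypothetical.

```python
def tmrca_in_years(fixed_differences, theta_ml, t_scaled,
                   split_years=6e6, generation_years=25.0):
    """Convert a coalescence time in 2Ne units into years.

    fixed_differences: human-chimpanzee fixed differences in the region.
    theta_ml: maximum-likelihood estimate of theta (= 4*Ne*mu) for the region.
    t_scaled: coalescence time in units of 2Ne generations.
    """
    # Divergence accumulates along both lineages, hence the factor 2.
    generations_since_split = split_years / generation_years
    mu = fixed_differences / (2 * generations_since_split)  # per region per generation
    ne = theta_ml / (4 * mu)                                # from theta = 4*Ne*mu
    years = t_scaled * 2 * ne * generation_years
    return mu, ne, years

# Hypothetical inputs, chosen only to show the arithmetic.
mu, ne, years = tmrca_in_years(fixed_differences=24, theta_ml=4.0, t_scaled=1.5)
print(mu, ne, years)
```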
All calculations were carried out in the R environment (R Development Core Team 2008).
Analysis of transcription factor-binding sites
Transcription factor-binding site (TFBS) analysis was performed using TFSEARCH (http://www.cbrc.jp/research/db/TFSEARCH.html) with a threshold score of 80.0. Following this prediction, single matrices for TFBS were retrieved from the Transfac 7.0 database (Heinemeyer et al. 1998) and manually inspected. In addition to the results reported in the text, TFSEARCH predicted the loss of an Arp-1-binding site for the CTC deleted allele. Yet, inspection of the Arp-1 matrix revealed that the consensus was based on a small number (n = 9) of sequences, and this prediction was therefore disregarded. Pictograms for E2F1, AP-1 and AML-1-binding sites were derived from the M00516, M00517 and M00271 Transfac consensus matrices, respectively; each cell value was converted into bits by multiplying its frequency by the information content of that position.
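The conversion of matrix cells into bits can be sketched as below. This is a minimal illustration that assumes the standard sequence-logo definition of information content for DNA (2 bits minus the Shannon entropy of the position) with no small-sample correction; the exact formula used for the pictograms is not stated in the text, and the function name and count matrix are hypothetical.

```python
import math

def logo_heights(count_matrix):
    """Convert per-position base counts into heights in bits.

    count_matrix: list of dicts mapping base -> count, one dict per position.
    Height of a base = frequency * information content of its position.
    """
    heights = []
    for counts in count_matrix:
        total = sum(counts.values())
        freqs = {base: c / total for base, c in counts.items()}
        # Shannon entropy of the position; zero-frequency bases contribute nothing.
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        info = 2.0 - entropy  # maximum of 2 bits for a four-letter alphabet
        heights.append({base: f * info for base, f in freqs.items()})
    return heights

# Hypothetical three-position matrix (not a Transfac matrix).
example = [
    {"A": 10, "C": 0, "G": 0, "T": 0},  # fully conserved position
    {"A": 5, "C": 5, "G": 0, "T": 0},   # two equally frequent bases
    {"A": 5, "C": 5, "G": 5, "T": 5},   # uninformative position
]
print(logo_heights(example))
```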
Trans-specific polymorphisms and nucleotide diversity
[Table: Frequency in human populations, ancestral alleles and chimpanzee genotypes for the two trans-specific polymorphisms. Column headings recovered: frequencies in human populations; genotypes of Pan troglodytes. Out-group states recovered: CTC insertion (gorilla and orangutan); A (gorilla), T (orangutan), T (macaque).]
As reported above, trans-specific polymorphisms are extremely rare and represent a hallmark of long-standing balancing selection. In order to verify whether this is the case for TRIM5, we resequenced a 2,230-bp intron 1 region encompassing the two trans-specific variants in two additional HapMap populations, namely Europeans (EU) and East Asians (AS) (as shown in Fig. 1, this region was only partially covered by NIEHS data). Including data from YRI, we identified a total of 57 variants: 3 small indels and 54 single base substitutions. The region displays relatively high linkage disequilibrium (LD) in the three populations and is covered by a single LD block which excludes a few SNPs at the 3′ end (Fig. 1).
[Table 2: Nucleotide diversity and neutrality tests for the TRIM5 intron 1 region. Column headings recovered: Fu and Li’s D*; Fu and Li’s F*.]
Under neutral evolution, values of θW and π are expected to be roughly equal; for the TRIM5 intron 1 region this is not the case, π being markedly higher than θW in all populations (Table 2). Tajima’s D (DT) (1989) evaluates departure from neutrality by comparing θW and π. Positive values of DT indicate an excess of intermediate frequency variants and are a hallmark of balancing selection. Fu and Li’s F* and D* (1993) are also based on SNP frequency spectra and differ from Tajima’s D in that they also take into account whether mutations occur in external or internal branches of a genealogy. The statistical significance of these statistics is calculated by performing coalescent simulations. Since, in addition to selective processes, population demographic history affects allele frequency spectra, we performed coalescent simulations using population genetics models that incorporate demographic scenarios (Schaffner et al. 2005; Voight et al. 2005; Marth et al. 2004) (Supplementary Table 2). Also, in order to disentangle the effects of selection and population history, we exploited the fact that selection is a locus-specific force while demography affects the whole genome. Thus, we compared data obtained for the TRIM5 intron 1 region to those of 2-kb windows deriving from NIEHS genes. Neutrality tests for the TRIM5 intron 1 region indicated departure from neutrality in all populations, with significantly positive values for most statistics and under all demographic models. In line with these findings, DT and Fu and Li’s F* calculated for the TRIM5 intron 1 region rank above the 95th percentile in the distribution of 2-kb reference windows in all populations.
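The comparison against reference windows amounts to an empirical percentile rank. A minimal sketch, assuming a simple "fraction of reference values strictly below the observed value" definition (the exact ranking convention is not stated in the text); the function name and the numbers are hypothetical.

```python
def percentile_rank(observed, reference_values):
    """Percentage of reference values that fall strictly below the observed one."""
    below = sum(value < observed for value in reference_values)
    return 100.0 * below / len(reference_values)

# Hypothetical Tajima's D values for reference windows (not NIEHS data).
reference_d = [i / 10 for i in range(-10, 10)]  # -1.0, -0.9, ..., 0.9
rank = percentile_rank(1.5, reference_d)
print(rank, rank > 95.0)
```

The same function applies to statistics obtained from coalescent simulations, where the reference values are the simulated replicates instead of empirical windows.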
4.06 × 10^-5
1.25 × 10^-5
6.08 × 10^-6
As reported above, TRIM5 is part of a large gene family and is located within a cluster of TRIM genes, raising the possibility that non-homologous gene conversion is responsible for the high nucleotide diversity we observed. Although sequence homology among TRIM genes is low in intronic regions, we wished to exclude that TRIM5 intron 1 undergoes non-allelic gene conversion with other paralogous TRIM genes on chromosome 11. To this purpose, we applied Sawyer’s gene conversion algorithm (Sawyer 1989) which identified no region of apparent gene conversion within TRIM5 intron 1 (see “Materials and methods”).
Finally, a copy number polymorphism encompassing human TRIM5 has been described (Zogopoulos et al. 2007): the variant is quite rare, with two duplication instances described in 1,190 subjects, indicating that it is unlikely to affect our results.
Haplotype analysis and TMRCA estimates
Two nonsynonymous variants (H43Y and R136Q, rs3740996 and rs10838525, respectively) in the first coding exon of TRIM5 (exon 2) have been shown to alter antiviral activity (Sawyer et al. 2006; Javanbakht et al. 2006) and their role in modulating the clinical course of HIV-1 infection has been addressed in several studies (Javanbakht et al. 2006; Speelmon et al. 2006; Goldschmidt et al. 2006; van Manen et al. 2008). In order to study how these two variants relate to intron 1 haplotypes, we typed them in YRI, EU and AS. Results indicated that a similar proportion of chromosomes in clades A and B carry the low-frequency 43Y and 136Q alleles (not shown). This is likely the result of historical recombination events along intron 1, as the two variants are in no LD with SNPs in the region we analyzed (Fig. 1).
Host–pathogen interactions are a major driver of molecular evolution. TRIM5, with its complex evolutionary history, perfectly exemplifies this concept. Along mammalian evolutionary history the gene has undergone copy number variation, acquisition of new domains by exon capture, protein sequence diversification, and maintenance of balanced polymorphisms (Johnson and Sawyer 2009). These events testify to the central role of TRIM5 in the antiviral response and suggest that the gene has been constantly subjected to exceptionally strong selective pressures.
Our data add further complexity to the evolutionary history of TRIM5 by showing that a region in intron 1 has been the target of long-standing balancing selection. We resequenced the TRIM5 gene in chimpanzees and identified two polymorphisms that are shared with humans (a CTC deletion and a G to A substitution). Analysis of the gene region encompassing the two variants indicated that it displays exceptional nucleotide diversity levels and an excess of polymorphism compared to fixed divergence. Consistent with these data, most tests rejected the null hypothesis of neutral evolution for this gene region, and calculation of TMRCA estimates yielded extremely deep coalescence times. Specifically, TMRCAs ranging between 4 and 7 MY were obtained, suggesting that the two haplotype clades have been maintained since the time when the human and chimpanzee lineages split (Glazko and Nei 2003). Such deep coalescence times are extremely rare in the human genome (Tishkoff and Verrelli 2003). In humans, the only instances of trans-specific polymorphisms that are thought to be maintained by long-term balancing selection are located in the MHC (Charlesworth 2006). Additionally, we have previously described a trans-specific SNP in the defensin beta 1 (DEFB1) promoter, a region that is subject to long-standing balancing selection (Cagliani et al. 2008). Like one of the two shared variants we identified here, the trans-specific polymorphism in DEFB1 occurs at a CpG dinucleotide, raising the possibility that these SNPs result from coincidental mutation in humans and chimpanzees. Asthana and co-workers (2005) performed a genome-wide search for polymorphisms shared between humans and chimpanzees and retrieved only 11 such instances, a number similar to the one that would be expected by chance (i.e. as a consequence of coincidental mutation).
As the likelihood of recurrent mutations is higher at CpG dinucleotides, it is quite possible that the G>A variant we identified arose independently in humans and chimpanzees, rather than being maintained by a selective process. Nonetheless, the two possibilities are not mutually exclusive, as balancing selection might maintain variants that independently arose at the same position in two species.
The situation is different for the CTC deletion. Although the evolutionary dynamics of indels are less well understood than those of SNPs, the frequency of small (<10 bp in length) insertion and deletion events in mammals is about 20 times lower than that of single base pair substitutions (Cooper et al. 2004); therefore, the probability that a deletion of the same size arose independently in two species is extremely low, suggesting that the indel polymorphism represents a true trans-specific variant. As shown in Fig. 4, the polymorphic CTC deletion only segregates in the African population, suggesting either that different selective pressures have acted in distinct geographic locations or that demographic effects have influenced the distribution of this variant in different populations.
As mentioned above, both trans-specific polymorphisms occur within transposable elements, namely ERV1 sequences. This observation does not detract from the possibility that these variants are functional, as a large portion of human regulatory sequences was acquired from repetitive elements (van de Lagemaat et al. 2003; Jordan et al. 2003). The localization of the region subjected to balancing selection within intron 1 suggests that it may harbour variants that modulate TRIM5 expression, and we noticed that the G>A variant in intron 1 potentially affects TFBSs. Although these data suggest that the two SNP alleles might result in different regulation of TRIM5 transcription, caution should be used in interpreting bioinformatic predictions, as TFBS consensuses are typically short and degenerate. Moreover, knowledge of the binding specificity of several transcription factors is presently too limited to allow prediction of their binding sites in DNA sequences. Therefore, further studies will be needed to clarify the role of intron 1 in regulating TRIM5 expression. In this respect, it is worth mentioning that signatures of balancing selection have previously been identified at the promoter/cis-regulatory regions of HLA genes (Cagliani et al. 2008; Tan et al. 2005; Loisel et al. 2006; Liu et al. 2006) and other loci involved in immune response (e.g. CCR5 and DEFB1) (Cagliani et al. 2008; Bamshad et al. 2002); these findings have been interpreted in terms of nucleotide diversity conferring increased regulatory flexibility. Specifically, different alleles/haplotypes might confer preferential expression in a tissue- or cell type-specific manner, as well as modulate transcription in response to distinct stimuli. This same hypothesis holds for TRIM5, especially in light of its almost ubiquitous expression in human tissues (Sawyer et al. 2007).
Interestingly, high levels of TRIM5 mRNA in peripheral blood mononuclear cells (PBMC) have been associated with a reduced risk of HIV-1 infection (Sewram et al. 2009), suggesting that inter-individual variability in TRIM5 expression does exist (at least in PBMC) and may affect susceptibility to viral infections. A number of studies have also addressed the role of TRIM5 variants in modulating the clinical course of HIV-1 infection. Most studies have focused on two nonsynonymous variants located in exon 2, H43Y and R136Q, but contrasting results have been obtained (Javanbakht et al. 2006; Speelmon et al. 2006; Goldschmidt et al. 2006; van Manen et al. 2008). Herein, we analyzed the distribution of H43Y and R136Q in clade A and B haplotypes. The finding that both variants are equally common in the two clades suggests that studies based solely on typing the two coding SNPs end up grouping subjects who carry the same amino acid alleles but different alleles in intron 1. Assuming a functional role for variants in intron 1, this observation might partially explain the low consistency among studies.
Sawyer and co-workers (2006) have shown that the 43Y variant decreases TRIM5 restriction activity against two distantly related retroviruses, namely HIV-1 and N-MLV (a murine γ retrovirus). The derived, low-efficiency allele is widespread in many human populations. In line with the data herein, Sawyer et al. (2006) ruled out the possibility that balancing selection acting on exon 2 is responsible for the maintenance of the impaired allele and suggested that relaxation of selective pressure on antiviral response genes, together with genetic drift, might result in the persistence of this variant in human populations. Our data suggest an alternative possibility: although balancing selection is typically limited to narrow genomic regions (Charlesworth 2006) (as is evident from Fig. 2), depending on the degree of linkage disequilibrium and on physical proximity, variants in flanking genomic regions may be affected by the maintenance of balanced polymorphisms. Therefore, the persistence of the 43Y allele might in part be explained by its location very close to the region under balancing selection.
The introduction of HIV in human populations likely occurred too recently for its selection signatures to be detectable in the human genome. Therefore, we must assume that both the balancing selection signature we identified and the multiple selective events that have acted upon TRIM5 genes in primates and, more generally, in mammals have resulted from other infective agents, possibly extinct retroviruses. The role of past infections in shaping the repertoire and diversity of human antiviral genes has recently been discussed (Emerman and Malik 2010). Our data add further insight into this complex scenario and suggest that a regulatory region in the first intron of TRIM5 has conferred (and possibly still confers) protection against viral infections.
MC is supported by grants from Istituto Superiore di Sanita’ “Programma Nazionale di Ricerca sull’ AIDS”, the EMPRO and AVIP EC WP6 Projects, the nGIN EC WP7 Project, the Japan Health Science Foundation, 2008 Ricerca Finalizzata [Italian Ministry of Health], 2008 Ricerca Corrente [Italian Ministry of Health], Progetto FIRB RETI: Rete Italiana Chimica Farmaceutica CHEM-PROFARMA-NET [RBPR05NWWC], and Fondazione CARIPLO. MS is a member of the Doctorate School in Molecular Medicine, University of Milan.