HLA-A and HLA-B supertype frequencies and their geographic distributions
In a previous study addressing population differentiation at the supertype level (the only one, to our knowledge, apart from our own study on HLA-DRB1; Gibert and Sanchez-Mazas 2003), Sidney et al. (1996) used five population samples and reported that all supertypes were present in all world regions. The present study, with 55 populations, greatly extends those original observations and shows that some supertypes are not observed in all populations while reaching frequencies of more than 50 % in others (Figs. 1a, c, 2, and 3). Among the HLA-A supertypes, A1 is the rarest, showing frequencies below 9 % in more than half of the populations (Fig. 1a) and being virtually absent in five of them (Fig. 1c). A1 alleles are found at high frequencies (22 % on average) in Africa, Southwest Asia, and Europe (Fig. 2), resulting in a significant geographic structure, i.e., with most of the variation being found among populations of different geographic regions (F_CT > F_SC; Table 2). The A1 supertype is represented by a small number of alleles, with one or two alleles in more than half of the populations (Fig. 1b) and only one in 14 of them (Fig. 1c). The A2 and A3 supertypes exhibit more even distributions, half of the populations having frequencies ranging from 14 to 29 % for A2 and 14 to 32 % for A3 (Figs. 1a and 2). As a consequence, among the HLA-A supertypes, A2 and A3 present either the lowest or no geographic structure at all (F_CT < F_SC for A2 and F_CT not significantly different from 0 for A3; Table 2). All populations present at least one allele of supertype A2 (eight of them showing just one), while the A3 supertype is represented by a large number of alleles (Fig. 1b, c). The A24 supertype is observed in all populations (Fig. 1c), with frequencies ranging from 13 to 40 % in half of them (Fig. 1a). Despite its broad distribution, A24 is often represented by only two alleles, A*23:01 and A*24:02, with 26 and 10 populations showing just one or both of these alleles, respectively (Fig. 1b, c). This supertype is found at higher frequencies (40 % on average) in SEA, PAC, AUS, NEA, and AME (Fig. 2). Although A24 exhibits the highest level of population differentiation among the four HLA-A supertypes (F_ST = 11 %, p < 0.0001), most of the variation is found within geographic regions (F_CT < F_SC). The frequencies of the HLA-A non-classified alleles (NCAs) vary greatly between populations, ranging from 2 to 14 % in half of them (Fig. 1a). The NCA group presents a strong geographic structure (F_CT being twice as large as F_SC) and a very high F_ST value (almost 16 %) (Table 2). The highest NCA frequencies are found in African and Australian populations (averages of 16 and 43 %, respectively) (Fig. 2).
Table 2 Supertype differentiation indexes among populations (F_ST), among populations within geographic regions (F_SC), and among geographic regions (F_CT)
The HLA-B supertypes fall into two main categories regarding their frequency distributions. On the one hand, B7 and B44 exhibit a pattern resembling A2 and A3, with high average frequencies (Figs. 1a and 3) and relatively low levels of geographic structure (Table 2). Half of the populations present frequencies ranging from 18 to 31 % for B7 and from 21 to 32 % for B44 (Figs. 1a and 3). Both B7 and B44 are observed in all populations (except B7 in the Yami; Figs. 1c and 3), with large numbers of alleles per population (Fig. 1b, c). By contrast, B58 and B62 exhibit very low frequencies, ranging from 0 to 5.8 % and from 2.9 to 18 % in half of the populations, respectively (Fig. 1a). Among the five HLA-B supertypes, B62 presents the highest level of population differentiation (F_ST = 11.38 %, p < 0.0001; Table 2), although with no clear geographic structure (F_CT < F_SC; Table 2). Such a geographic structure is only found for B58 (F_CT of 5.6 %, almost twice as great as F_SC; Table 2), which is observed in SSA populations at an average frequency of 33 % (from 23 to 60 %; Fig. 3), against 4.2 % in the other regions (Fig. 3) and no observation at all in many populations (18 out of 55; Fig. 3). The B27 supertype presents an intermediate pattern between B7/B44 and B58/B62. It exhibits relatively lower frequencies (from 7 to 19 % in half of the populations; Fig. 1a) and a higher level of population differentiation than B7 and B44 (F_ST = 7.5 %, p < 0.0001; Table 2) but no geographic structure (F_CT very close to zero; Table 2). In contrast to what is observed for the NCA, the non-classified alleles for HLA-B (NCB) are quite frequent, with frequencies ranging from 10 to 17 % in half of the populations (Fig. 1a). More than 75 % of populations present at least two different NCBs (Fig. 1b), and only two populations lack one of these alleles (Fig. 1c). The NCBs also exhibit a significant geographic structure, although not as strong as for NCA (Table 2).
In summary, based on the observed data, supertypes can be allocated to two main categories. On the one hand, A2, A3, B7, B27, and B44 fit the classical view that supertypes are evenly distributed (Figs. 1a, 2, and 3), poorly structured geographically (Table 2), and represented by a large number of alleles (Fig. 1b, c). On the other hand, A1, A24, B58, and B62 show greater frequency variation among populations (Figs. 2 and 3 and Table 2) and, in some cases, significant geographic structure (i.e., for A1 and B58, both being very common in Africa), and are represented by a smaller number of alleles. Although the unclassified alleles have brought noise to the analysis, they should not be ignored. They are a consequence of the functional supertype classification, and they were kept to understand exactly how they influence the variations in HLA-A and HLA-B. As discussed above, the NCA consists of a small group of alleles, which reach high frequencies in island populations. NCB, by contrast, is a more heterogeneous group appearing in almost all populations.
Heterozygosity and interpopulation differentiation
Using both complete and reduced datasets (see “Materials and methods” section), the heterozygosity estimated for the data treated at the allelic level is always larger than that estimated for the data treated at the supertype level (Table 3). This result is expected because alleles are nested within supertypes, and the heterozygosity of the latter is thus constrained to be equal to or smaller than that of the former.
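The nesting argument can be checked numerically. The following is a minimal sketch, assuming that He is computed as 1 − Σp² without any sample-size correction (the estimator actually used is described in the “Materials and methods” section and may differ). The allele frequencies are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def expected_heterozygosity(freqs):
    """Return He = 1 - sum(p_i^2) for a dict mapping category -> frequency."""
    return 1.0 - sum(p * p for p in freqs.values())

def collapse_to_supertypes(allele_freqs, allele_to_supertype):
    """Sum allele frequencies within each supertype."""
    totals = defaultdict(float)
    for allele, p in allele_freqs.items():
        totals[allele_to_supertype[allele]] += p
    return dict(totals)

# Invented frequencies for one hypothetical population (they sum to 1).
allele_freqs = {
    "A*01:01": 0.10, "A*02:01": 0.30, "A*02:06": 0.10,
    "A*03:01": 0.20, "A*11:01": 0.15, "A*24:02": 0.15,
}
# Illustrative allele-to-supertype assignment.
allele_to_supertype = {
    "A*01:01": "A1", "A*02:01": "A2", "A*02:06": "A2",
    "A*03:01": "A3", "A*11:01": "A3", "A*24:02": "A24",
}

he_alleles = expected_heterozygosity(allele_freqs)
he_supertypes = expected_heterozygosity(
    collapse_to_supertypes(allele_freqs, allele_to_supertype)
)
print(f"He (alleles)    = {he_alleles:.3f}")
print(f"He (supertypes) = {he_supertypes:.3f}")
```

Merging two categories with frequencies p and q replaces p² + q² by (p + q)², which is never smaller, so collapsing alleles into supertypes can only leave He unchanged or reduce it.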
Table 3 Expected heterozygosity (He) of alleles and supertypes
To determine the degree to which genetic differentiation between populations, measured by G_ST, was concordant at the supertype and allelic levels, we estimated the correlation between these measures and tested its significance using Mantel tests. The results suggest that when using the complete population dataset, the patterns of population differentiation observed at the supertype and allelic levels are very similar, especially for HLA-A (r = 0.956, p < 0.0005; Fig. 4a) but also for HLA-B (r = 0.75, p < 0.0005; Fig. 4b). The removal of the Pacific, Australian, Taiwanese, and Native American populations causes an overall drop of both the G_ST values and their correlations. Despite this decrease, a high correlation coefficient is still observed for HLA-A (r = 0.62, p < 0.0005; Fig. 4c), whereas the value is much lower for HLA-B (r = 0.3, p < 0.0005; Fig. 4d). Because Pacific, Australian, Taiwanese, and Native American populations contribute to large differentiation values, lower correlation coefficients were expected after removing them. Furthermore, these populations also exhibit a reduced set of alleles per supertype, which may explain the higher correlations between alleles and supertypes when they are taken into account. The difference between alleles and supertypes is less pronounced for HLA-A, which presents a smaller number of alleles per supertype in all populations (Fig. 1b, c).
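The logic of a Mantel test can be sketched as follows. This is a simplified illustration rather than the implementation used in the study: it correlates the upper triangles of two pairwise matrices and obtains a one-sided p value by jointly permuting the rows and columns of one matrix. The two matrices are invented, and the function names are hypothetical.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def upper_triangle(matrix, order):
    """Upper-triangle entries after relabelling populations by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i, j in combinations(range(n), 2)]

def mantel(m1, m2, n_perm=999, seed=0):
    """Return (r, one-sided p) for two symmetric pairwise matrices."""
    rng = random.Random(seed)
    identity = list(range(len(m1)))
    x = upper_triangle(m1, identity)
    r_obs = pearson(x, upper_triangle(m2, identity))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of m2 together
        if pearson(x, upper_triangle(m2, perm)) >= r_obs:
            hits += 1
    return r_obs, hits / (n_perm + 1)

# Invented example: five populations placed on a line; "allelic" distances
# are proportional to their separation, and "supertype" distances are a
# slightly perturbed, rescaled copy of them.
positions = [0.0, 1.0, 2.5, 4.0, 7.0]
n = len(positions)
gst_alleles = [[0.01 * abs(positions[i] - positions[j]) for j in range(n)]
               for i in range(n)]
gst_supertypes = [[0.6 * gst_alleles[i][j] + 0.001 * ((i + j) % 2)
                   if i != j else 0.0 for j in range(n)] for i in range(n)]

r_obs, p_value = mantel(gst_alleles, gst_supertypes)
print(f"Mantel r = {r_obs:.3f}, p = {p_value:.3f}")
```

Permuting population labels, rather than individual matrix entries, preserves the dependence among the pairwise values that share a population, which is why an ordinary correlation test is not appropriate for distance matrices.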
Patterns of molecular variability for different PBR pockets of HLA-A and HLA-B
Our goal in this part of the study was to test the prediction that the B and F pockets of the PBR exhibit the highest levels of variation as a consequence of their crucial role in peptide binding, which is expected to result in a stronger effect of balancing selection.
We first estimated the global levels of variation at the PBR and observed significantly higher levels of nucleotide diversity (π_total) at HLA-B, compared to HLA-A (p < 0.0000005; Wilcoxon rank sum test). Moreover, these two genes differ in the way molecular variation is distributed among the A, B, CDE, and F pockets within the PBR (Fig. 5). The rank order of π_total is pCDE ≫ pB ≫ pA > pF at HLA-A, and pB ≫ pF > pCDE ≫ pA at HLA-B (where p stands for “pocket”, ≫ indicates a difference that is significant at the 0.00001 level, and > indicates a greater but non-significant value, according to a Wilcoxon rank sum test; Fig. 5). Among the HLA-A pockets, most of the variation is found in the CDE pockets, which make up the central region of the PBR, and significantly less in pB (π_total values ranging from 0.14 to 0.15 and from 0.11 to 0.12 in half of the populations, respectively; Fig. 5). The pA and pF pockets exhibit the smallest levels of variation (π_total values ranging from 0.07 to 0.09 in half of the populations; Fig. 5). Among the HLA-B pockets, pB exhibits by far the highest variation, with π_total values ranging from 0.18 to 0.21 in half of the populations, whereas the other pockets exhibit a relatively narrow π_total distribution (ranging from 0.10 to 0.12 in half of the populations; Fig. 5).
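As an illustration of how a per-pocket nucleotide diversity can be obtained, the sketch below computes the frequency-weighted average proportion of pairwise differences over the sites assigned to a pocket. It assumes the uncorrected form π = 2 Σ p_i p_j d_ij, which may not match the exact estimator used here. The sequences, frequencies, site indices, and pocket names (pX, pY) are invented and do not correspond to the codons of Saper et al. (1991).

```python
from itertools import combinations

def pocket_pi(seqs, freqs, sites):
    """Frequency-weighted mean pairwise differences per site over `sites`.

    seqs  : dict allele -> aligned nucleotide string
    freqs : dict allele -> frequency in the population
    sites : list of 0-based alignment positions belonging to the pocket
    """
    pi = 0.0
    for a, b in combinations(sorted(seqs), 2):
        n_diff = sum(seqs[a][s] != seqs[b][s] for s in sites)
        pi += 2.0 * freqs[a] * freqs[b] * n_diff / len(sites)
    return pi

# Invented toy alignment of three alleles, ten sites each.
seqs = {
    "allele_1": "ACGTACGTAC",
    "allele_2": "ACGAACGTTC",
    "allele_3": "TCGAACGTAC",
}
freqs = {"allele_1": 0.5, "allele_2": 0.3, "allele_3": 0.2}

# Hypothetical pockets defined by arbitrary site indices.
pockets = {"pX": [0, 1, 2, 3], "pY": [4, 5, 6, 7, 8, 9]}

pi_by_pocket = {name: pocket_pi(seqs, freqs, sites)
                for name, sites in pockets.items()}
for name, value in pi_by_pocket.items():
    print(f"{name}: pi = {value:.3f}")
```

Computing one such value per population and per pocket yields the distributions that a rank-based test, such as the Wilcoxon rank sum test, can then compare between pockets.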
The hypothesis that the pockets B and F are the main targets of balancing selection is thus partially supported for HLA-B, since pB presents by far the highest level of nucleotide diversity. Interestingly, van Deutekom and Kesmir (2015) recently showed that changes involving several of the B pocket’s amino acids had a profound impact on peptide-binding properties, which corroborates our interpretation. On the other hand, pF, which is not significantly different from pA at HLA-A, and from pCDE at HLA-B, does not present an increased value of π_total, which would be evidence against balancing selection. It is important to note that these results were obtained independently from the classification of alleles into supertypes, since the determination of the pockets’ codons was taken from the classical study of Saper et al. (1991).
We also analyzed how the nucleotide diversity was distributed between supertypes. Since the supertype categorization is based on variations of pB and pF, these pockets were expected to present more differences between supertypes than the others. This prediction was confirmed for pF at HLA-A and pB at HLA-B (Fig. 6).
As pB presents the highest levels of variation at HLA-B and also accounts for most of the differences between HLA-B supertypes, we conclude that the variation between HLA-B supertypes accounts for most of the differences observed between HLA-B alleles. In other words, alleles classified within the same HLA-B supertype share more similarities than alleles assigned to different HLA-B supertypes. By contrast, most of the differences between HLA-A supertypes lie within pF, the pocket presenting the lowest π_total values for this gene. Therefore, at this locus, the supertypes do not account for most of the variation between alleles (Fig. 6). In other words, HLA-A presents more variation within than between supertypes.
Simulation approach to test selection on supertypes
According to the definition of Sidney et al. (1996), alleles included within the same supertype have overlapping peptide-binding specificities. To test the effects of the supertype classification on expected heterozygosities (He) and pairwise differentiation (G_ST), we generated null distributions for these two statistics under the hypothesis that alleles within supertypes are a random collection, with no shared functional attributes. To this end, the assignment of alleles to supertypes was randomized by permuting the supertype labels attributed to each allele motif, as described in the “Materials and methods” section. As the same patterns were obtained using the two different simulation approaches (see “Materials and methods” section), we only present the results for the case without any constraint on the number of alleles associated with a specific supertype.
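The randomization scheme can be sketched as follows for He in a single population. This is an illustrative reading of the unconstrained approach, in which each allele receives a supertype label drawn independently and uniformly at random. The actual procedure is the one given in the “Materials and methods” section and may differ in detail. The allele names, frequencies, supertype labels (S1 to S3), function names, and the definition of the p value as the fraction of randomizations with He above the observed value are all assumptions made for this example.

```python
import random
from collections import defaultdict

def supertype_he(allele_freqs, assignment):
    """He = 1 - sum(q^2), where q are the summed frequencies per supertype."""
    totals = defaultdict(float)
    for allele, p in allele_freqs.items():
        totals[assignment[allele]] += p
    return 1.0 - sum(q * q for q in totals.values())

def null_he(allele_freqs, labels, n_sim=1000, seed=1):
    """He values under random, unconstrained allele-to-supertype assignment."""
    rng = random.Random(seed)
    alleles = sorted(allele_freqs)
    sims = []
    for _ in range(n_sim):
        random_assignment = {a: rng.choice(labels) for a in alleles}
        sims.append(supertype_he(allele_freqs, random_assignment))
    return sims

# Invented population: the two most common alleles share a supertype.
allele_freqs = {"a1": 0.35, "a2": 0.30, "a3": 0.15,
                "a4": 0.10, "a5": 0.05, "a6": 0.05}
observed_assignment = {"a1": "S1", "a2": "S1", "a3": "S2",
                       "a4": "S2", "a5": "S3", "a6": "S3"}
labels = ["S1", "S2", "S3"]

he_obs = supertype_he(allele_freqs, observed_assignment)
sims = null_he(allele_freqs, labels)
# Assumed convention: fraction of randomizations with He above the observed.
p_upper = sum(h > he_obs for h in sims) / len(sims)
print(f"observed He = {he_obs:.3f}, fraction of simulations above = {p_upper:.3f}")
```

The same randomized assignments can be reused to recompute G_ST between pairs of populations, giving the null distribution for differentiation.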
For HLA-A, no population shows a significant difference in He between the real and random supertype assignments. For HLA-B, 6 out of 55 populations exhibit significantly lower He (permutation-based p < 0.05) than that obtained via simulations. These six populations belong to the reduced dataset. Because the number of populations with individually significant p values in either direction (i.e., with significantly lower or greater He compared to the simulated value) is small, we investigated whether the distribution of the p values itself was informative regarding selective effects. To do this, we used an exact binomial test to assess whether the observed distribution of p values deviated from one composed of equal numbers of values on either side of 0.5 (the expected proportion of deviation in either direction under the null hypothesis; Fig. 7). For HLA-A, no significant deviation is found (p value > 0.05 for both complete and reduced datasets). For HLA-B, however, a significant skew towards p values greater than 0.5 is observed, indicating an overall significant excess of populations with lower He than those obtained through simulations (p value < 0.05 and p value < 0.005 for complete and reduced datasets, respectively).
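The exact binomial test on the distribution of p values can be illustrated with the short sketch below. It counts how many per-population p values fall above 0.5 and computes a two-sided exact probability against an expected proportion of 0.5. The list of p values is invented, the function name is hypothetical, and excluding values exactly equal to 0.5 is an assumption of this example.

```python
from math import comb

def exact_binomial_two_sided(k, n):
    """Two-sided exact binomial p value for k successes in n trials (p0 = 0.5).

    Sums the probabilities of all outcomes that are no more likely than
    the observed one.
    """
    pmf = [comb(n, i) / 2 ** n for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-12)  # tolerance for floating-point ties
    return min(1.0, sum(p for p in pmf if p <= threshold))

# Invented per-population permutation p values.
p_values = [0.91, 0.74, 0.66, 0.58, 0.83, 0.52, 0.97, 0.61, 0.77, 0.32]

informative = [p for p in p_values if p != 0.5]  # assumed handling of ties
n_above = sum(p > 0.5 for p in informative)
n_total = len(informative)
skew_p = exact_binomial_two_sided(n_above, n_total)
print(f"{n_above} of {n_total} p values above 0.5; binomial p = {skew_p:.4f}")
```

This test uses only the direction of each population's deviation, so it can detect a consistent skew even when few populations are individually significant.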
For both HLA-A and HLA-B, G_ST values were not significantly different from those of the randomized data when using the complete dataset. This is also true when using the reduced dataset for HLA-A but not for HLA-B. Indeed, after removing the Pacific, Australian, Taiwanese, and Native American populations, the observed G_ST is higher than 98 % of the simulations for HLA-B (Fig. 8). This finding differs from the expectations of Sidney et al. (1996), who predicted an overall decrease of differentiation at the supertype level. However, it is in agreement with our description of the observed data. Indeed, in our simulations, alleles were randomly assigned to supertypes, creating randomized supertypes with similar contents of common and rare alleles. The common alleles are expected to be assigned to different randomized supertypes in most of the simulations because they are less numerous than the rare alleles. Such a pattern is similar to that described for real HLA-A supertypes, which present a low number of common alleles per population (Fig. 1b, c). As discussed above, this pattern also explains the high correlation found between G_ST values measured at the allelic and supertype levels for this locus (Fig. 4). Finally, as also discussed above for the PBR pockets, less variation is found between than within HLA-A supertypes. This indicates that HLA-A supertypes are composed of heterogeneous sets of alleles with few sequence similarities at pF (Figs. 5 and 6), which explains the similarity between the results based on the observed and randomized data. On the other hand, HLA-B supertypes appear to be composed of alleles sharing more sequence similarities, as shown by the molecular analysis of the PBR pockets (Figs. 5 and 6).
In summary, HLA-B supertypes are sets of alleles with B pocket resemblances, and these similarities can be interpreted directly in terms of peptide presentation profiles because HLA-B supertypes exhibit major differences regarding the chemical properties of pB. Thus, our results showing an increased differentiation at the level of HLA-B supertypes are consistent with an effect of natural selection resulting in local adaptation of populations to different pathogen environments. Through our simulations, the functional grouping of alleles reflected by the HLA-B supertypes is disrupted, creating randomized groups in the same way as described for HLA-A. The frequent allocation of common alleles into different randomized supertypes in the simulations thus causes both an increase of He and a decrease of population differentiation (G_ST) when compared with the observed data (Figs. 7 and 8). In agreement with this interpretation, the inclusion of the Pacific, Australian, Taiwanese, and Native American populations reduces this effect because the patterns of variation at HLA-B for these populations resemble those observed at HLA-A, with a relatively low number of alleles belonging to different supertypes.