The large extent of polymorphism of major histocompatibility complex (MHC) genes is believed to be maintained by balancing selection for the extent of the peptide binding repertoire between individuals (Hughes and Nei 1988, 1989; Takahata and Nei 1990; Hughes and Yeager 1998). A unique effect of balancing selection is the long persistence time of alleles in populations and, consequently, trans-species polymorphism (Klein 1987; Takahata 1990; Takahata et al. 1992; Klein et al. 1998, 2007). However, it is difficult to show direct evidence of such selection by experiments and to measure selection intensity directly. Satta et al. (1994) estimated the intensity of selection at the human MHC (human leukocyte antigen (HLA)) loci by using the available collection of allelic sequences and a simple model based on symmetric overdominant selection and the theory of allelic genealogy (Kimura and Crow 1964; Takahata 1990; Takahata and Nei 1990; Takahata et al. 1992).

In recent years, a number of HLA allelic nucleotide sequences have become available through the IMGT/HLA database (http://www.ebi.ac.uk/imgt/hla/, Robinson et al. 2011). Currently (2012), the database contains 7,670 alleles. This large dataset of sequences provides an opportunity to estimate evolutionary parameters, such as the intensity of natural selection, more reliably. Hence, we re-estimated the selection coefficient and compared the estimates with those in the previous study that was based on a limited number of sequences (Satta et al. 1994).

A large number of nucleotide sequences at the six functional HLA loci (HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQB1, and HLA-DPB1), which play important roles in peptide presentation, were obtained from the IMGT/HLA database. In addition, nucleotide sequences of alleles at the HLA class II A (DQA1 and DPA1) and class II B (DRB3 and DRB5) loci were also used in this analysis. Because the inclusion of recombinants would lead to a biased estimate of the selection intensity, possible recombinant alleles were excluded by using the method described by Satta (1992). This method assumes that the relationship between the number of substitutions in a particular region and the number of substitutions in the entire region is binomially distributed. At the HLA-B locus, an exceptionally divergent HLA-B*73:01 allele (Abi-Rached et al. 2011), which might have been transmitted to extant humans from a distinct Homo by interbreeding, was also excluded from this analysis. Because we applied the theory of allelic genealogy under symmetric overdominant selection, we used only dominant alleles, that is, alleles with a frequency >1 % throughout various human populations (the NCBI dbMHC database, http://www.ncbi.nlm.nih.gov/gv/mhc, Meyer et al. 2007). We also excluded nucleotide sequences with extensive stretches of undetermined nucleotides from this analysis (Table 1). Consequently, the number of alleles used in this analysis was limited to 9 HLA-A alleles, 19 HLA-B, 20 HLA-C, 25 HLA-DRB1, 13 HLA-DQB1, 10 HLA-DPB1, 6 HLA-DQA1, 3 HLA-DPA1, 13 HLA-DRB3, and 5 HLA-DRB5. These HLA alleles are listed in Online Resource 1. Interestingly, the great majority of the nucleotide sequences in the current database represent minor or private alleles.
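The binomial assumption behind the recombinant screening can be illustrated with a minimal sketch. This is not the procedure of Satta (1992) itself: the function names, the use of the region's share of sites as the binomial probability, and the cutoff `alpha` are all assumptions made here for illustration only.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def region_tail_probabilities(k_region, n_total, region_fraction):
    """Lower and upper binomial tail probabilities for a region.

    k_region        substitutions observed in the region of interest
    n_total         substitutions observed over the entire sequence
    region_fraction share of sites lying in the region (assumed here to
                    be the per-substitution probability of falling there)
    """
    lower = sum(binomial_pmf(i, n_total, region_fraction)
                for i in range(0, k_region + 1))
    upper = sum(binomial_pmf(i, n_total, region_fraction)
                for i in range(k_region, n_total + 1))
    return lower, upper


def is_possible_recombinant(k_region, n_total, region_fraction, alpha=0.01):
    """Flag a pair whose regional count is improbably low or high.

    alpha is a hypothetical cutoff, not a value taken from the source.
    """
    lower, upper = region_tail_probabilities(k_region, n_total, region_fraction)
    return min(lower, upper) < alpha
```

Under these assumptions, a pair of alleles whose substitutions are concentrated in, or conspicuously absent from, one region would be flagged as a candidate recombinant and set aside before the selection coefficient is estimated.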

Table 1 The number of alleles and dominant alleles in the database

According to the theory described in Takahata (1990) and Takahata et al. (1992), to estimate the selection coefficient s, two estimators, γ and K_B, must be calculated. The estimator γ is the ratio of the number of nonsynonymous substitutions per peptide-binding region (PBR) site to that of synonymous substitutions per site among given pairs of alleles, whereas K_B is the mean number of pairwise nonsynonymous substitutions in the PBR. The numbers of synonymous and nonsynonymous sites were estimated using the modified Nei–Gojobori method (Zhang et al. 1998) with the Jukes–Cantor correction (Jukes and Cantor 1969). Because the number of nonsynonymous substitutions in the PBR reaches a ceiling relatively early, owing to the acceleration of the nucleotide substitution rate by balancing selection, Satta et al. (1994) developed five methods for estimating K_B and evaluated them by computer simulations. Here, we used method II because this method minimized errors in the multiple-hit correction (Satta et al. 1994). In this method, selection coefficients can be adequately estimated by using only sets of sequences that are relatively closely related.
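As an illustration of how γ is formed from corrected distances, the following sketch applies the standard Jukes–Cantor correction, d = -(3/4) ln(1 - 4p/3), to the proportions of differences and then takes their ratio. It assumes that the numbers of differences and sites have already been counted (for example, by the modified Nei–Gojobori method, which is not implemented here); the function and argument names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


def gamma_ratio(nonsyn_diffs_pbr, nonsyn_sites_pbr, syn_diffs, syn_sites):
    """Ratio of PBR nonsynonymous to synonymous substitutions per site.

    All four inputs are assumed to be pre-computed counts for one pair
    (or an average over pairs) of alleles.
    """
    d_n_pbr = jukes_cantor(nonsyn_diffs_pbr / nonsyn_sites_pbr)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0:
        raise ValueError("no synonymous substitutions: ratio is undefined")
    return d_n_pbr / d_s
```

The correction grows rapidly as p approaches 0.75, which is one way to see why restricting the analysis to relatively closely related sequences keeps the multiple-hit correction small.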

The estimated values of K_B and γ at the six major HLA loci described above are provided in Table 2. Using these values, we obtained other estimators, M and S, which were also necessary for estimating the selection coefficient, s (see Satta et al. 1994). Assuming that the long-term effective population size of humans is 10^5, the s values of the HLA-B and HLA-DRB1 loci (s = 4.4 and 1.9 %, respectively) in the present study were the highest for the class I and class II loci, respectively. This result was consistent with that of the previous study (Satta et al. 1994). All s values were broadly similar to those of the previous study, with the exception of the DQB1 and DPB1 loci: the current estimate for DQB1 was lower than the previous estimate, and the value for DPB1 was much higher than the previous estimate (Satta et al. 1994). One possible reason for this is that the set of nucleotide sequences used here differs from that of the previous study. In fact, for both DQB1 and DPB1, the number of dominant alleles used in the present analysis is larger than that used in the previous one.

Table 2 Estimates of the mean number of nonsynonymous substitutions, the relative nonsynonymous substitution rate in the PBR, and the selection coefficient (s)

Allelic genealogy predicts that K_B is approximately equal to the number of dominant alleles (n_a) in a population. In fact, n_a showed good agreement with K_B at three class II B loci (Table 2). Among the class I loci, HLA-C showed relatively good agreement between n_a and K_B, whereas for the HLA-A and HLA-B loci, the observed number of dominant alleles was less than the expected number. This discrepancy might indicate that the definition of dominant alleles is inappropriate for class I loci. Originally, we regarded an allele with a frequency of more than 1 % over all populations examined as a dominant allele. According to the dbMHC database, the number of chromosomes examined at all three class I loci was more than 10,000 in total, although it varied from allele to allele. Thus, we took 1 % of 10,000 chromosomes (100 chromosomes) as the threshold for a class I dominant allele. In addition, the mean number of populations in which class II dominant alleles were observed was about 25. Therefore, for class I loci, we considered an allele detected on >100 chromosomes across >25 populations to be a dominant allele. Surprisingly, n_a of class I loci under this new definition showed good agreement with K_B (Table 2). This might imply that some dominant alleles with <1 % allele frequency in the entire world population were widely distributed throughout the human population until quite recently and that they have decreased in frequency because they might have been replaced by other alleles that had an advantage in the modern environments of some populations. The number of different dominant alleles in the PBR also shows good agreement with expectations (Table 1). After the exclusion of possible recombinants, the numbers at each locus were 26 at HLA-A, 39 at HLA-B, and 19 at HLA-C. However, when we included rare alleles, these numbers increased to 32, 113, and 60, respectively. The number of rare alleles that have de novo PBR nonsynonymous mutations is large, and they may have emerged quite recently through a population expansion (Fu et al. 2013).
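The revised class I definition of a dominant allele (detected on >100 chromosomes across >25 populations) amounts to a simple two-threshold filter, sketched below. The record layout, the field names, and the example counts are hypothetical placeholders and are not taken from dbMHC.

```python
def class1_dominant_alleles(records, min_chromosomes=100, min_populations=25):
    """Return names of alleles exceeding both thresholds (strictly greater)."""
    return [
        r["allele"]
        for r in records
        if r["chromosomes"] > min_chromosomes and r["populations"] > min_populations
    ]


# Hypothetical example records; the names and counts are placeholders.
records = [
    {"allele": "allele_1", "chromosomes": 450, "populations": 40},
    {"allele": "allele_2", "chromosomes": 120, "populations": 12},
    {"allele": "allele_3", "chromosomes": 100, "populations": 30},
    {"allele": "allele_4", "chromosomes": 101, "populations": 26},
]

dominant = class1_dominant_alleles(records)
print(dominant)  # ['allele_1', 'allele_4']
```

Because both comparisons are strict, an allele observed on exactly 100 chromosomes is not counted, and an allele that is frequent in only a few populations is likewise excluded.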

In addition to the above estimates, we further estimated the selection coefficients for DRB3, DRB4, DRB5, DQA1, and DPA1 (Table 2). With the exception of DRB4 (see below), the selection coefficients (s) of all four of these HLA class II loci were lower than those of the six major HLA loci (HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQB1, and HLA-DPB1), indicating that the six major loci have been strongly affected by balancing selection. The present s estimate for DQA1 is lower than the previous one, but the present K_B value is similar to n_a. We therefore consider that the present estimate is close to the true value. For DRB4, 15 alleles were deposited in the database; they are identical at the PBR sites and nearly identical at the neutral (synonymous and non-PBR nonsynonymous) sites. Thus, inference of the γ and K_B values is difficult. The relatively recent emergence of DRB4 (the per-site nucleotide divergence from DRB2 is 0.015∼0.017: Satta et al. 1996) supports this observation. In addition, the small amount of nucleotide divergence at neutral sites for DRB4 indicates a relatively small effective population size for DRB4. This suggests that the frequency of the DR53 haplotype, on which DRB4 resides, is lower than that of other HLA haplotypes. DRB3 and DRB5 also show a smaller effective size than other HLA loci (the estimated N_e values of DRB3 and DRB5 are considerably smaller than 10^5). This is likewise because DRB3 and DRB5 are located on a limited set of DR haplotypes, whereas other HLA loci exist in all humans.

Our findings show that although the number of sequences in the database has greatly increased in the past 20 years, most of the accumulated sequences are minor or private alleles, and the number of dominant alleles has not changed substantially since the previous estimation. Therefore, most of the selection coefficients at the six major HLA loci estimated in the present study were similar to those of the previous study. One may consider that the assumption of symmetrical overdominance is too strict for the actual data. However, the simulation study by Takahata and Nei (1990) revealed that the asymmetrical overdominance model does not fit the mode of polymorphism in actual data: for a given selection coefficient, the number of alleles and the average heterozygosity under the asymmetrical model become smaller than those under the symmetrical overdominance model. In fact, the number of dominant alleles at all HLA loci was consistent with the K_B values under symmetrical overdominance, suggesting consistency between our assumed model and the actual data. Therefore, the symmetrical overdominance model is appropriate for the present estimation. Through this analysis, we confirmed that the selection intensity (selection coefficient, s) at HLA loci in modern humans is at most 4.4 %, even though HLA is a prominent example of loci on which natural selection acts.