A broadly accepted model for the genetic architecture of complex disease is the common disease – common variant (CDCV) hypothesis. This hypothesis proposes that risk alleles for common complex diseases should be common (i.e. ≥ 5%) and thus are likely old and found in multiple human populations, rather than being population specific [1–4]. From analyses of genome-wide polymorphism data from populations of African, Asian and European ancestry, it has been shown that common alleles in one population are frequently both shared and common among human populations [5–7]. However, a recent comprehensive study of 3,873 genes from African, Asian, Latino/Hispanic, and European Americans found that common alleles in one population were frequently not common in another population [8]. Similarly, from a meta-analysis of disease-association studies, Ioannidis et al. (2004) argued that the frequencies of disease-associated alleles show "large heterogeneity between races" [9]. These observations suggest that the frequency of a risk allele discovered in one population is not always a strong predictor of the frequency of that risk allele in other populations. This raises the question of whether risk alleles discovered in one population account for disease prevalence across all human populations. Thus, it remains unknown how well the CDCV model accounts for disease prevalence across populations on a worldwide scale.

In addition to evaluating the extent to which disease-associated alleles differ in frequency between populations, it is of great interest to determine what evolutionary forces are responsible for the observed degree of population differentiation at disease-associated SNPs. Because disease is so tightly linked to survival and reproductive success, it follows that disease has likely been a strong selective force in human evolution. Moreover, alleles that cause disease in contemporary environments may have been positively selected in ancestral environments. For example, the thrifty gene hypothesis posits that populations whose ancestral environments were characterized by periods of feast and famine may have experienced selection for a "thrifty genotype" that promotes efficient fat and carbohydrate storage [10]. Though formerly advantageous, thrifty genotypes may be causing obesity and type 2 diabetes in contemporary environments where food is often abundantly available. Previous studies have suggested that genes associated with complex diseases such as cardiovascular disease [11–14] and type 2 diabetes [15–17] have been targets of positive natural selection. If disease genes have often been targeted by selection, then identifying loci that have experienced selection may aid in disease-related research [18].

Local (i.e. geographically restricted) positive selection results in large allele frequency differences between populations (e.g. [19, 20]). The Fst statistic captures the difference in allele frequency between populations at any given SNP and ranges from 0 (no differentiation) to 1 (fixed difference between populations). Thus, when compared to a set of random SNPs in the genome, positively selected alleles tend to accumulate in the top tail of the Fst distribution [21–23]. It has previously been shown that local positive selection has had no widespread effect on disease allele frequency differences between populations: on average, disease-associated SNPs showed allele frequency differences between populations similar to those observed for random SNPs [24]. Individually, however, several disease-associated alleles appear to have been driven to high frequency by positive selection in certain human populations and thus may be responsible for large differences in disease prevalence between populations [15, 25].
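The two endpoints of the Fst scale can be illustrated with a small calculation. The sketch below uses the simple heterozygosity-based definition Fst = (H_T − H_S) / H_T for one biallelic SNP. This is an assumption made for illustration and is not necessarily the estimator used in the analyses reported here; the function name and the equal default weights are likewise hypothetical.

```python
def fst_from_frequencies(freqs, weights=None):
    """Heterozygosity-based Fst = (H_T - H_S) / H_T for one biallelic SNP.

    freqs:   allele frequency of the same allele in each population.
    weights: optional relative population weights (assumed equal if omitted).
    """
    if weights is None:
        weights = [1.0] * len(freqs)
    total = sum(weights)
    w = [x / total for x in weights]

    # Expected heterozygosity in the pooled (total) population.
    p_bar = sum(wi * p for wi, p in zip(w, freqs))
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # allele is monomorphic everywhere: no differentiation

    # Mean expected heterozygosity within populations.
    h_s = sum(wi * 2 * p * (1 - p) for wi, p in zip(w, freqs))
    return (h_t - h_s) / h_t


print(fst_from_frequencies([0.5, 0.5]))  # identical frequencies
print(fst_from_frequencies([0.0, 1.0]))  # fixed difference
print(fst_from_frequencies([0.2, 0.8]))  # intermediate differentiation
```

Identical frequencies give 0 and a fixed difference gives 1, matching the range described above. Estimators that correct for sample size will give somewhat different values on real data.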

The conclusions drawn from previous studies that have evaluated levels of population differentiation at disease-associated SNPs are limited for two reasons. First, these studies relied on many disease-gene associations that have not been successfully replicated and thus likely do not represent true associations. Second, previous studies made use of disease allele frequencies from a small number of populations (i.e. ≤ 4). To address the strength of the CDCV model on a worldwide scale and to evaluate the effects of local positive selection on worldwide risk allele frequencies, we present allele frequencies and levels of population differentiation across 53 populations for 25 SNPs which show replicated association with the following common complex human diseases: Crohn's disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, coronary artery disease and obesity [17, 26–42]. These newly identified genetic variants came from recent genome-wide association (GWA) study data, which have revolutionized approaches for identifying disease loci [43].


The 25 SNPs from Table 1 were genotyped in the HGDP-CEPH Panel [44]. Atypical and related individuals were removed [45], which resulted in 952 individuals from 53 populations. SNPs were genotyped by KBioscience using the KASPar chemistry, a competitive allele specific PCR SNP genotyping system [46].

Table 1 Worldwide risk allele frequencies and global Fst for 25 disease-associated SNPs typed in the CEPH-HGDP panel.

All of the genotype calls were confirmed by visual inspection. After Bonferroni correction for 25 comparisons, there remained 4 SNPs for which a population was out of Hardy-Weinberg equilibrium at p < 0.002. The genotype calls in these cases were re-confirmed by visual inspection of the cluster plots and no data were removed. The amount of missing data per SNP ranged from 2.0% – 5.4% with a mean of 3.3%. These data are accessible from the CEPH database [47] or by request to the corresponding author.
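As an illustration of this kind of check, the sketch below tests one population's genotype counts at one SNP for departure from Hardy-Weinberg equilibrium and compares the result to a Bonferroni-adjusted threshold. The text does not state which test was used, so the chi-square goodness-of-fit test (1 degree of freedom) and the nominal level of 0.05 are assumptions. The genotype counts are invented for the example.

```python
import math


def hwe_chi_square_p(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Arguments are counts of the three genotypes in one population.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Assumed nominal level of 0.05 divided by the 25 comparisons.
bonferroni_threshold = 0.05 / 25

p_value = hwe_chi_square_p(n_aa=10, n_ab=4, n_bb=11)  # hypothetical counts
print(p_value, p_value < bonferroni_threshold)
```

With population samples as small as some of those in the panel, an exact test is often preferred over the chi-square approximation; the approximation is used here only to keep the sketch short.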

Global Fst [48], the degree of differentiation among the 7 geographic regions represented in the CEPH-HGDP panel, was calculated for each of the 25 SNPs. Results were largely the same when global Fst was calculated among all 53 populations. We obtained an empirical Fst distribution from 2750 autosomal markers (2540 SNPs [49] and 210 indels [50]) previously typed in 927 individuals from the CEPH-HGDP panel. Global Fst values for the disease-associated SNPs were calculated from the same set of 927 individuals to allow for an unbiased comparison to the empirical distribution. For each disease-associated SNP, a P value was calculated as the proportion of Fst values from the empirical distribution that were ≥ the observed Fst value. We found that global Fst is weakly but significantly correlated with global minor allele frequency (R² = 0.0152, P = 5.04 × 10⁻²³, see Additional file 1) and that the Fst distribution often differs significantly between minor allele frequency bins (Additional file 2). We therefore provide corrected P values (P cor ) for each Fst value by comparing only to SNPs from the empirical distribution that fall into the same minor allele frequency bin.
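The two P values described above can be sketched as follows. The uncorrected P value is the proportion of empirical Fst values at or above the observed value; the corrected P value restricts the comparison to markers in the same minor allele frequency (MAF) bin. The bin boundaries are not given in the text, so the 10 equal-width bins over the MAF range 0 to 0.5 are an assumption, as are all function names and the toy reference values.

```python
def empirical_p(observed_fst, reference_fsts):
    """Proportion of reference Fst values that are >= the observed value."""
    return sum(f >= observed_fst for f in reference_fsts) / len(reference_fsts)


def maf_bin(maf, n_bins=10):
    """Index of the equal-width MAF bin on [0, 0.5] (bin count is assumed)."""
    return min(int(maf * 2 * n_bins), n_bins - 1)


def corrected_p(observed_fst, observed_maf, reference, n_bins=10):
    """Empirical P value using only reference markers in the same MAF bin.

    reference: list of (fst, maf) pairs from the empirical marker set.
    Returns None if no reference marker falls in the bin.
    """
    target = maf_bin(observed_maf, n_bins)
    same_bin = [f for f, m in reference if maf_bin(m, n_bins) == target]
    if not same_bin:
        return None
    return empirical_p(observed_fst, same_bin)


# Toy reference set of (Fst, MAF) pairs; values are invented.
reference = [(0.05, 0.11), (0.10, 0.12), (0.20, 0.13), (0.30, 0.41), (0.02, 0.46)]
print(empirical_p(0.10, [f for f, _ in reference]))
print(corrected_p(0.10, 0.12, reference))
```

In the toy data the uncorrected and corrected P values differ because the markers outside the bin have a different Fst distribution, which is the situation the correction is meant to handle.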


We genotyped the HGDP-CEPH Human Genome Diversity Cell Line Panel [44] for 25 disease-associated SNPs recently identified from GWA studies [26–28]. The global and regional allele frequencies for each disease-associated SNP are summarized in Table 1. To visualize worldwide risk allele frequencies, Figure 1 shows the allele frequency distribution across populations for each disease-associated SNP. A summary of the maximum allele frequency difference between any 2 of the 53 populations for each disease-associated SNP is presented in Figure 2.
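The maximum pairwise difference summarized in Figure 2 reduces to the gap between the highest and lowest population frequency for a SNP. A minimal sketch, with hypothetical population labels and invented frequencies:

```python
def max_frequency_difference(freqs):
    """Largest allele frequency difference between any two populations.

    freqs: mapping of population name -> risk allele frequency.
    The maximum over all pairs equals max(freqs) - min(freqs).
    """
    values = list(freqs.values())
    return max(values) - min(values)


# Hypothetical populations and frequencies, for illustration only.
example = {"pop_A": 0.10, "pop_B": 0.90, "pop_C": 0.50}
print(max_frequency_difference(example))
```

This summary is sensitive to populations with small sample sizes, since a single poorly estimated frequency can set the maximum or minimum.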

Figure 1

Risk allele frequency across populations for 25 disease-associated SNPs. The title of each histogram includes the dbSNP ID and the disease with which each SNP is associated. Abbreviations for disease names can be found in Table 1. Note that the Y axes have different scales across histograms.

Figure 2

The maximum difference in risk allele frequency between any 2 of the 53 populations in the CEPH-HGDP panel across the 25 disease-associated SNPs.

For each disease-associated SNP, global Fst, a measure of allele frequency difference, was calculated among the 7 geographical regions represented in the HGDP-CEPH panel. It has been shown from empirical data and from simulations with varying parameters that alleles that have been targets of local positive selection tend to accumulate in the top tail of the Fst distribution [19–23]. Uncorrected P values (P) and P values corrected for allele frequency (P cor ) were generated by comparing each observed global Fst value to an empirical global Fst distribution from 2750 markers typed in the same samples (see Materials and Methods for details). The global Fst value and the corresponding P value for each of the 25 disease-associated SNPs are summarized in Table 1. The empirical global Fst distribution is shown in Figure 3 along with the 4 most highly differentiated disease-associated SNPs (i.e. SNPs with uncorrected P values < 0.1).

Figure 3

Empirical global Fst distribution of 2750 markers typed in 927 individuals from the CEPH-HGDP panel. Disease-associated SNPs with global Fst values within the top 10% of the empirical distribution are indicated.

To determine whether the mean global Fst of 0.100 for the 25 disease-associated SNPs is unusually high, this value was compared to a distribution of mean global Fst values from 25 SNPs sampled at random 10,000 times from the empirical distribution. We found that disease-associated SNPs are not more differentiated than random markers (P = 0.462, P cor = 0.500). This analysis was repeated for groups of SNPs associated with each of the diseases listed in Table 1. In no case were the disease-associated SNPs more differentiated than expected at random (P and P cor > 0.3 in every case).
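The resampling comparison can be sketched as below: draw 25 markers from the empirical set, record their mean Fst, repeat 10,000 times, and report the proportion of draws whose mean is at least the observed mean. Whether markers were drawn with or without replacement is not stated, so sampling without replacement within each draw is an assumption; the function name, the seed and the toy reference values are hypothetical.

```python
import random


def mean_fst_resampling_p(observed_mean, reference_fsts, n_snps=25,
                          n_draws=10_000, seed=0):
    """Proportion of random marker sets whose mean Fst >= the observed mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_draws):
        # Assumed: markers are drawn without replacement within a set.
        sample = rng.sample(reference_fsts, n_snps)
        if sum(sample) / n_snps >= observed_mean:
            hits += 1
    return hits / n_draws


# Toy reference distribution standing in for the 2750 empirical markers.
toy_rng = random.Random(1)
toy_reference = [toy_rng.betavariate(1.5, 12) for _ in range(2750)]
print(mean_fst_resampling_p(0.100, toy_reference))
```

A P value near 0.5, as reported above, means the observed mean sits close to the centre of the resampled means.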

Global Fst provides a rough measure of the magnitude of allele frequency differentiation worldwide, but local positive selection acting at finer geographical scales will likely remain undetected using this measure. To examine the patterns of population differentiation at a more refined geographical scale, we calculated Fst for every pairwise comparison among the 53 populations and 7 geographic regions to produce 53 × 53 and 7 × 7 Fst matrices, respectively. Each Fst value was then compared to the corresponding empirical distribution of Fst values to generate a P value without correction for allele frequency.
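A pairwise matrix of this kind can be built by applying a two-population Fst to every pair of populations. The sketch below again uses the simple heterozygosity-based definition with equal weights, which is an assumption and may differ from the estimator used for the reported matrices; population names and frequencies are invented.

```python
def pairwise_fst(p1, p2):
    """Heterozygosity-based Fst between two populations, equally weighted."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # allele absent or fixed in both populations
    # Mean of the within-population heterozygosities 2p(1-p).
    h_s = p1 * (1 - p1) + p2 * (1 - p2)
    return (h_t - h_s) / h_t


def fst_matrix(freqs):
    """Nested dict: matrix[a][b] is the pairwise Fst of populations a and b."""
    names = list(freqs)
    return {a: {b: pairwise_fst(freqs[a], freqs[b]) for b in names}
            for a in names}


# Hypothetical populations and risk allele frequencies.
freqs = {"pop_A": 0.0, "pop_B": 1.0, "pop_C": 0.5}
matrix = fst_matrix(freqs)
for a in freqs:
    print(a, [round(matrix[a][b], 3) for b in freqs])
```

Each entry would then be compared to the empirical distribution for the same pair of populations, so 53 populations require a separate reference distribution for each of the 1,378 distinct pairs.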

Figure 4 shows risk allele frequencies across populations and the two Fst matrices for the most highly-differentiated disease SNP rs10761659, a variant associated with Crohn's disease. Allele frequency and Fst estimates for populations with small sample sizes and/or missing genotypes may be unreliable and sample size is therefore also included in Figure 4. For rs10761659, the risk allele is rare in Africa but is found at high frequency in most non-African populations. The degree of differentiation at this SNP is unusually high compared to the empirical distribution as indicated by the low P values (i.e. dark boxes in Figure 4) in population pairwise comparisons between Africans and most non-African populations. We have produced similar plots for all 25 disease-associated SNPs for visual inspection (Additional file 3).

Figure 4

Worldwide risk allele frequencies and population differentiation for rs10761659, a SNP associated with Crohn's disease. The vertical bar chart displays risk allele frequencies in each of the populations represented in the CEPH-HGDP panel with sample sizes in number of individuals on the left. The shaded boxes in the 53 × 53 and 7 × 7 matrices show which pairwise Fst values are significant compared to the empirical distribution at three P value thresholds (see the boxed-in P value legend).


The extent to which the CDCV hypothesis is applicable across human populations depends in part on the extent to which common risk alleles identified in one population are also common in other populations. The majority of disease association studies are conducted using case-control cohorts of European ancestry. The degree to which associations established in these studies can be extended to other populations remains an open question. In addition, it remains unclear how often differences in risk allele frequencies between populations are due to the action of local positive selection. The present study takes a first step in addressing these issues by quantifying the degree of allele frequency differentiation between worldwide populations for 25 SNPs associated with 6 common complex diseases.

Many of the disease-associated SNPs studied here show substantial heterogeneity in allele frequencies across human populations (Figure 1). In some cases, risk allele frequencies remain generally low or high across all 53 populations. However, in several cases risk allele frequencies vary across a large portion of the allele frequency spectrum. Maximum allele frequency differences between any 2 populations ranged from 0.10 to 1.0 across SNPs with a mean of 0.65 (Figure 2). For 7 of the 25 SNPs, the maximum allele frequency difference between any 2 populations was > 0.75. Thus, some risk alleles are found at substantially different frequencies between populations.

To further quantify the allele frequency differences between populations for the disease-associated SNPs, we compared Fst values for the disease-associated SNPs to an empirical Fst distribution generated from 2750 random markers genotyped in the same samples. The average global Fst of the disease-associated SNPs is not unusually high compared to the empirical global Fst distribution. This is also the case when global Fst values were averaged across SNPs in each disease category. Thus, disease-associated SNPs do not show more population differentiation than random SNPs, in agreement with a previous study that examined a different set of disease-associated markers in a more limited set of populations [24].

Although disease-associated SNPs do not show high Fst as a set, individual disease-associated SNPs may be unusually differentiated. Previous studies have identified disease-associated loci that show evidence of local positive selection in the form of unusually large allele frequency differences between populations [14, 15, 17, 25, 51–53]. In some cases it is the protective allele [17, 53], and in others the risk allele [15], which appears to have been driven to high frequency by positive selection. Several of the disease-associated SNPs studied here show considerable worldwide population differentiation and have global Fst values within the top 10% of the empirical global Fst distribution (Figure 3). At a more refined geographical scale, the patterns of population differentiation are extremely varied across SNPs and many population-pairwise Fst values lie within the top 5% and even the top 1% of the empirical distribution (see Additional file 3). For example, the risk allele at SNP rs10761659 is absent in some African populations and is near or at fixation in a number of populations outside of Africa. The global Fst value for this SNP lies within the top 5% of the empirical distribution (Figure 3) and most population pairwise comparisons between Africans and non-Africans are highly significant (Figure 4). A type 1 diabetes-associated SNP, rs11171739, also shows high levels of differentiation between Africans and non-Africans, but in this case the risk allele is near fixation in Africans but is at low to intermediate frequency elsewhere in the world (Additional file 3). There are also cases in which a risk allele frequency is unusually high or low in only one or a few populations. For example, the risk allele at rs564398, a SNP associated with type 2 diabetes, is found at unusually low frequencies only in the Kalash of Pakistan and in Melanesians (Additional file 3). These SNPs may therefore turn out to have been the targets of local positive selection. However, evidence for selection based on single marker Fst values should be interpreted with caution [54]. A more in-depth investigation of the patterns of genetic variation in and around these loci and their effects on the phenotype is required before conclusions can be confidently drawn.

Regardless of whether large risk allele frequency differences between populations are the result of selection or genetic drift, these data provide several useful insights. First, it is reasonable to assume that, if a risk allele is fixed, absent, or close to either, it does not contribute to disease risk variation within that population. Thus, assuming that the risk conferred by these alleles is constant across populations (as may be the case for risk alleles found in genes related to fundamental biological activity, e.g. cyclin dependent kinase function and T2D/CAD risk), our data suggest that the CDCV model does not necessarily extend across populations since risk alleles discovered in a European population are sometimes absent, fixed or found at extremely low or high frequencies in other populations.

Second, combining evidence of selection and association may enhance power to identify genotype-phenotype relationships: a SNP with a large difference in risk allele frequency between populations is a strong candidate to explain large differences in disease prevalence between populations [15, 18]. However, despite the pattern observed for the Crohn's disease-associated SNP rs10761659 (Figure 4), there is no strong evidence to suggest that the risk of developing Crohn's disease differs dramatically between individuals of African and European ancestry [55]. Future studies are required to determine the extent to which differences in risk allele frequencies between populations predict disease prevalence differences between populations.

Finally, power estimates for disease association studies rely on estimates of the risk allele frequency in a population [56]. Inaccurate risk allele frequency estimates can result in overestimates of power and, consequently, in underpowered studies [57, 58]. Thus, these data can aid in the design of future association studies in populations for which allele frequency data are scarce.
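The dependence of power on risk allele frequency can be illustrated with a rough normal-approximation calculation for an allelic case-control test. This is a generic textbook approximation, not the method of the cited power studies. The odds ratio, sample sizes and frequencies are invented, and the function name is hypothetical.

```python
from statistics import NormalDist


def allelic_test_power(risk_allele_freq, odds_ratio, n_cases, n_controls,
                       alpha=0.05):
    """Approximate power of a two-sided allelic test (normal approximation).

    risk_allele_freq: risk allele frequency in controls.
    odds_ratio:       assumed allelic odds ratio.
    Each individual contributes two alleles, hence the factors of 2.
    The small contribution of the opposite tail is ignored.
    """
    p0 = risk_allele_freq
    # Risk allele frequency in cases implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    se = (p1 * (1 - p1) / (2 * n_cases)
          + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    if se == 0:
        return 0.0  # allele absent or fixed: no variation to test
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


# Same study design, two different assumed risk allele frequencies.
print(allelic_test_power(0.30, 1.3, n_cases=1000, n_controls=1000))
print(allelic_test_power(0.02, 1.3, n_cases=1000, n_controls=1000))
```

Under these assumptions, a study planned with a risk allele frequency of 0.30 borrowed from another population would be badly underpowered if the true frequency in the study population were 0.02.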

Some of the risk alleles studied here may not be disease causing, but instead may be in linkage disequilibrium (LD) with the disease-causing allele. Although recombination hotspot locations are generally shared across human populations and there is substantial conservation of haplotype structure worldwide [49, 59], the extent of LD can vary markedly across populations [60–63]. Because LD breaks down differently in different populations, the risk alleles studied here may not be associated with disease across all human populations. Our analyses assume that the degree of LD between the genotyped risk allele and the true causal allele is conserved across populations. Our interpretations should be considered in light of this caveat.

Disease-association studies have primarily made use of case-control cohorts of European ancestry. Studies of worldwide patterns of genetic variation in disease-associated genes are essential to determine how transferable disease-gene associations are from one population to another. Moreover, disease-association studies in diverse populations are required in order to determine whether different alleles are responsible for disease prevalence in different populations. A strong focus on the genetics of disease in humans worldwide is an important step in addressing large disparities in the quality of health care between human populations.


Disease-associated SNPs do not differ in frequency more between human populations than random SNPs in the genome. This suggests that positive local selection has not had a strong effect on the frequencies of risk alleles in general. Individually, however, several disease-associated SNPs do show evidence of positive local selection. Regardless of whether the observed differences are due to drift or selection, worldwide variation in risk allele frequencies is considerable. Future studies are required to determine the extent to which this variation is responsible for differences in disease prevalence between populations.