Background

Since dispersing from Africa and colonizing the world, humans have evolved substantial phenotypic variation, including variation in skin, hair, and eye color, body mass, height, diet, drug metabolism, and susceptibility and resistance to disease. Efforts to reveal the genetic bases of these variations should provide important insight into the history of human evolution, gene function, and the mechanisms of disease [1, 2]. Indeed, with the advent of large-scale comparative genomic and human polymorphism data, a flood of studies has identified many candidate genes and genomic regions accounting for the observed phenotypic characters [2]. However, the evolutionary forces driving the variation of these phenotypic traits (positive selection, balancing selection, purifying selection, or neutral evolution) remain largely unknown.

In general, population differentiation under neutral evolution is mostly influenced by demographic history; however, adaptation to a local environment, driven by positive selection, will increase the level of population differentiation [3]. In contrast, negative and balancing selection tend to reduce population differentiation [3]. Accordingly, evaluating the level of population differentiation across the human genome should be informative for identifying the genetic basis of the phenotypic differences observed among human populations.

Results and Discussion

Here, we evaluated the level of population differentiation for human genes on autosomal chromosomes among three populations (African, European and East Asian) based on the HapMap data (Phase II) [4], using the parameter FST according to methods described previously [3, 5]. A previous study reported a higher level of population differentiation at genic regions compared to non-genic regions in the genome [6]. However, in our analysis, several chromosomes, including 5, 6, 8, 11, 13, and 20, did not show higher population differentiation at genic compared to non-genic regions; that is, genic regions on these chromosomes did not have an excess of SNPs with higher FST (≥0.6) (Figure S1 in Additional file 1).

Functional significance of genes with higher levels of population differentiation

Since an analysis of categories that contain only a few genes will have low statistical power, here we only summarize categories that contain at least 10 genes. Figure 1 summarizes the biological processes that are enriched with higher FST SNPs with a significant P value of 1E-10 or lower (see Methods), and their λ values, with λ being the ratio of the proportion of higher FST SNPs (≥0.6) in the analyzed category to the proportion of higher FST SNPs in genome-wide genes (which is 0.0049). The categories listed in Figure 1 include a large number involved in organ development, such as pancreas, lung, and heart development. For example, GO: 0021983, pituitary gland development, is enriched with high FST SNPs and has the highest λ value, 19.37. The pituitary gland produces and secretes many hormones, some of which stimulate other glands to produce other types of hormones; this organ thus controls many biological processes, e.g. growth, homeostasis, stress response, reproduction, and metabolism [7, 8]. Categories related to several of these processes similarly demonstrate a high level of population differentiation, such as developmental growth (GO: 0048589), reproduction-related processes (GO: 0030317, GO: 0007286, GO: 0007276), and several metabolic pathways (GO: 0006641, GO: 0042593, GO: 0042632) (see following text and Figure 1).

Figure 1

λ values of GO categories in biological processes enriched for higher FST SNPs with P-value lower than 1E-10.

An intriguing observation is that osteoblast development is significantly enriched in high FST SNPs (λ = 12.28, P = 4.92E-88 after multiple testing). Osteoblasts are mononucleate cells that are responsible for bone formation. Modern humans demonstrate substantial phenotypic variation, much of which is reflected in the skeletal system, such as variation in height, body mass, body mineral density, and craniofacial features. Indeed, evidence indicates that the human skeletal system has evolved rapidly since the advent of agriculture [9], and our recent study concluded that the high levels of population differentiation of skeletal genes among human populations were driven by positive selection [10].

Another interesting category is hair follicle development, which also showed a higher level of population differentiation (GO: 0001942, λ = 4.09, P = 2.07E-08 after multiple testing). Hair is produced by hair follicles. As with the skeletal system, hair morphology, including water swelling diameter and section, shape of fiber, mechanical properties, combability and hair moisture, differs distinctively among human populations [11]. Previous studies have identified some genes involved in hair follicle development, such as EDAR and EDA2R, that have undergone recent positive selection, as detected by the long range haplotype homozygosity test [12, 13]. These studies, together with our evidence of higher population differentiation in the genes involved in hair follicle development, support a hypothesis of adaptive evolution accounting for the diversification of human hair.

Consistent with previous observations [12, 14], genes involved in pigmentation, including the GO processes pigmentation during development, pigmentation, and melanocyte differentiation, demonstrated significantly higher population differentiation. In a similar manner, reproduction-associated processes, e.g. sperm motility, spermatid development, and gamete generation, have higher levels of population differentiation (Figure 1). Among the categories with a significant enrichment of higher FST SNPs, many are involved in the nervous system, e.g. dorsoventral neural tube patterning (GO: 0021904, λ = 15.67), hindbrain development (GO: 0030902, λ = 11.08), positive regulation of neuron differentiation (GO: 0045666, λ = 8.50), and neuron development (GO: 0048666, λ = 5.27) (Figure 1). Other categories include metabolic processes, such as the triglyceride metabolic process (GO: 0006641, λ = 6.69), glucose homeostasis (GO: 0042593, λ = 4.64), and cholesterol homeostasis (GO: 0042632, λ = 4.35), possibly reflecting variation in metabolism among human populations.

Immunity-related genes, however, which are a common target of positive selection [2, 15, 16], are involved in only a small list of categories with a higher proportion of higher FST SNPs. This observation is probably attributable to the fact that many genes of the immune system evolve under balancing selection in human populations owing to a heterozygote advantage, which would reduce the level of population differentiation [17, 18].

Table S1 in Additional file 2 and Table S2 in Additional file 3 summarize the GO categories in cellular component and molecular function, respectively, with an enrichment of higher FST SNPs.

In addition, to discern which population(s) contribute more to the pattern, we generated three pairwise sets of FST values: FST (CEU-YRI), FST (EA-YRI) and FST (CEU-EA). For the genes in the biological processes described in Figure 1, all three data sets demonstrate a consistent pattern of a significantly higher proportion of higher FST SNPs compared with genome-wide genes (Figure 2), which suggests that the population differentiation is common to all pairs of populations.

Figure 2

The FST (≥0.6) distribution of SNPs in the biological processes in Figure 1 and of genome-wide genes. (A) FST(all) values among the three populations. (B) FST(CEU-EA) values between Europeans and East Asians. (C) FST(EA-YRI) values between East Asians and Africans. (D) FST(CEU-YRI) values between Europeans and Africans.

Population differentiation under neutral evolution is mostly influenced by demographic history (that is, genetic drift and gene flow), which can generate patterns similar to those produced by biological factors such as natural selection. However, demographic history tends to influence all loci in the genome equally, whereas natural selection acts only on a single gene or a group of functionally related genes. Here, we present groups of functionally related genes that are enriched with high FST SNPs relative to genome-wide genes; this enrichment is most likely driven by positive natural selection, although the confounding effect of demographic history cannot be completely excluded.

Population differentiation in disease-related genes

Studies of the pattern of molecular evolution of human disease-related genes will provide insight into the origin, maintenance and mechanism of disease [19]. Previous reports suggested that disease-related genes tend to evolve under purifying selection, based on comparisons of nonsynonymous to synonymous substitution rates [19-21]. Here, as expected, we found that disease-related genes (including Mendelian disease genes and complex disease genes) demonstrate a significant excess of SNPs with lower FST (≤0.05) relative to other genes (χ2 = 23.16, P = 1.49E-06 for the OMIM gene panel; χ2 = 193.78, P = 4.76E-44 for the complex-disease gene panel; Figure S2 in Additional file 1). These disease genes demonstrate an excess of lower FST SNPs in the lower frequency bins but not in the high frequency bins (Figure 3), suggesting that negative selection, rather than balancing selection, operated on these genes.

Figure 3

Proportions of SNPs with FST ≤ 0.05 at each global MAF (minor allele frequency) bin in complex disease genes (A) and OMIM genes (B), compared to that of other genes. The black nodes indicate a significantly higher proportion in disease genes with P < 0.01.

Surprisingly, higher FST (≥0.6) SNPs are significantly enriched at Mendelian disease genes (OMIM) relative to other genes (χ2 = 30.47, P = 3.39E-08), with three MAF bins demonstrating statistical significance (Figure 4). These higher FST SNPs are probably under positive selection. This pattern, however, was not observed in complex disease genes and appears inconsistent with the previous study by Blekhman et al. (2008) [20]. Blekhman et al. (2008) found that Mendelian-disease genes appear to be under widespread purifying selection but that genes that influence complex disease risk show lower levels of evolutionary conservation, as assessed by the ratio of nonsynonymous to synonymous substitutions (Dn/Ds), possibly because they were targeted by both purifying and positive selection. The difference in results is probably attributable to the different methods used to assess sequence evolution: Dn/Ds measures changes over a long time scale (i.e., between human and other species), while FST measures recent evolution (i.e., since the separation of modern human populations). The incidence of, and susceptibility to, some Mendelian diseases might differ more strongly among modern human populations.

Figure 4

Proportions of SNPs with FST ≥ 0.60 at each global MAF (minor allele frequency) bin in OMIM genes and non-OMIM genes. Statistically significant P-values are presented above each bin.

Lower levels of population differentiation in microRNA targeted genes

The regulation of gene expression is crucial to the development of an organism, and it has been increasingly recognized that a remarkable fraction of this regulation is mediated by microRNAs (miRNAs) [22, 23]. miRNAs are a group of ~23 nt endogenous RNAs, important for a diverse range of biological functions, that direct the posttranscriptional regulation of mRNAs by cleavage or translational repression [22, 23]. Evidence has shown that negative selection operates on miRNA-regulated genes [24]. Here, we observed that microRNA-targeted genes present a significant excess of lower FST (≤0.05) SNPs (χ2 = 29.76, P = 4.90E-08) and significantly fewer high FST (≥0.6) SNPs (χ2 = 37.61, P = 8.63E-10) relative to other genes (Figure S3 in Additional file 1). The lower FST SNPs are mainly restricted to the lower minor allele frequency bins and are not found in excess in the intermediate frequency bin (Figure 5), suggesting that widespread purifying selection operated on miRNA-targeted genes.

Figure 5

Proportions of SNPs with FST ≤ 0.05 at each global MAF (minor allele frequency) bin for microRNA targeted genes compared with other genes. The P-value is presented above each bin.

Conclusions

In this study, we find that genes involved in osteoblast development, hair follicle development, pigmentation, spermatid development, nervous system and organ development, and some metabolic pathways have higher levels of population differentiation. Surprisingly, we find that Mendelian-disease genes appear to have a significant excess of SNPs with high levels of population differentiation, possibly because the incidence of, and susceptibility to, these diseases differ among populations. As expected, microRNA-regulated genes show lower levels of population differentiation, consistent with purifying selection. Our analysis demonstrates that different gene groups show different levels of population differentiation among human populations.

Methods

Since genes on the sex chromosomes show higher population differentiation than those on the autosomes [3], we analyzed data from the autosomes only. Allele frequency data for autosomal SNPs were retrieved from HapMap Phase II (release 24, NCBI36) [4] for three populations: African (YRI panel, 60 Yoruba individuals from Ibadan), European (CEU panel, 60 Utah residents with ancestry from northern and western Europe) and East Asian (EA panel, 45 Han Chinese (CHB) and 45 Japanese from Tokyo (JPT)). To evaluate the degree of population differentiation, FST values were calculated as previously described [3, 5] for polymorphic SNPs with a minor allele frequency ≥0.01 in at least one population. Since negative FST values have no biological interpretation, they were set to 0.
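The SNP filter and the handling of negative estimates can be illustrated with a minimal Python sketch. The function names are hypothetical, and `simple_fst` is a simplified Wright-style estimator weighted by sample size, used here only for illustration; it is an assumption and not the estimator of refs [3, 5], whose sample-size corrections are what can yield slightly negative values.

```python
from typing import Sequence


def passes_maf_filter(freqs: Sequence[float], threshold: float = 0.01) -> bool:
    """Keep a SNP if its minor allele frequency is >= threshold
    in at least one population."""
    return any(min(p, 1.0 - p) >= threshold for p in freqs)


def simple_fst(freqs: Sequence[float], sizes: Sequence[int]) -> float:
    """Illustrative FST: variance of allele frequency among populations
    divided by p_bar * (1 - p_bar). Assumption: NOT the estimator of [3, 5]."""
    total = sum(sizes)
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / total
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # monomorphic across all populations
    var_p = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / total
    return var_p / (p_bar * (1.0 - p_bar))


def clip_negative(fst: float) -> float:
    """Set negative FST estimates to 0, as described in the text."""
    return max(0.0, fst)


# Sample sizes (individuals) for YRI, CEU and EA as given in the text.
SIZES = (60, 60, 90)

if __name__ == "__main__":
    freqs = (0.10, 0.60, 0.80)  # made-up allele frequencies for one SNP
    if passes_maf_filter(freqs):
        print(clip_negative(simple_fst(freqs, SIZES)))
```

In this sketch the clipping step is kept separate from the estimator so that it can be applied to values from any estimator.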

Protein coding genes on the human autosomal chromosomes, and their corresponding gene ontology (GO) terms in three categories (biological process, cellular component, and molecular function), were downloaded from Ensembl (http://www.ensembl.org version 54) by means of BioMart [25]. Each gene was extended by 500 bp upstream of its 5'-terminus and downstream of its 3'-terminus to include all of its SNPs. χ2 tests with one degree of freedom, based on 2 × 2 contingency tables constructed from SNP counts, were used to test the significance of the enrichment of SNPs with higher (≥0.6) FST values in each category compared with the empirical data for genome-wide genes. For these analyses, Bonferroni correction was used for multiple testing. To better quantify the enrichment, we calculated the parameter λ, the ratio of the proportion of higher FST SNPs in the analyzed category to that in the genome-wide genes. λ values significantly higher than 1 indicate higher population differentiation of the genes in the category among human populations.
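As an illustration of the enrichment test, the following Python sketch computes λ, a Pearson χ2 statistic with one degree of freedom for a 2 × 2 table of SNP counts, and a Bonferroni-adjusted P-value. The function names and the example counts are hypothetical. The sketch assumes no continuity correction and takes the comparison row as supplied by the caller; neither detail is specified in the text.

```python
import math
from typing import Tuple


def chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (1 df, no continuity correction) for the
    table [[a, b], [c, d]]. Returns (chi2, p_value)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def enrichment(cat_high: int, cat_total: int,
               ref_high: int, ref_total: int,
               n_tests: int) -> Tuple[float, float, float]:
    """Return (lambda, chi2, Bonferroni-adjusted P) for one category.
    cat_* are SNP counts in the category, ref_* in the comparison set."""
    lam = (cat_high / cat_total) / (ref_high / ref_total)
    chi2, p = chi2_2x2(cat_high, cat_total - cat_high,
                       ref_high, ref_total - ref_high)
    return lam, chi2, min(1.0, p * n_tests)


if __name__ == "__main__":
    # Made-up counts; the reference proportion 490/100000 = 0.0049 matches
    # the genome-wide value quoted in the Results.
    print(enrichment(49, 1000, 490, 100000, n_tests=500))
```

The Bonferroni adjustment here multiplies each raw P-value by the number of categories tested and caps the result at 1.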

Complex disease genes were obtained from the Genetic Association Database (GAD) [26]. Human Mendelian disease genes (OMIM) were obtained from the study by Blekhman et al. (2008) [20]. Genes targeted by microRNAs were obtained from TargetScan (http://www.targetscan.org, release 5.1) [27-29]. For these gene sets, χ2 tests with one degree of freedom, based on 2 × 2 contingency tables constructed from SNP counts, were used to test the significance of an enrichment of SNPs with higher (≥0.6) FST values and with lower (≤0.05) FST values, respectively, compared with other genes.
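The comparisons by global MAF bin shown in Figures 3 to 5 can be sketched as follows. This is a minimal Python illustration with hypothetical names; the bin edges are an assumption, since the binning scheme is not stated in the text, and the input records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# Assumed bin edges for global minor allele frequency (not from the text).
BIN_EDGES = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)


def maf_bin(maf: float, edges: Sequence[float] = BIN_EDGES) -> int:
    """Index of the bin (edges[i], edges[i + 1]] containing maf;
    a MAF equal to the lowest edge falls in bin 0."""
    if not edges[0] <= maf <= edges[-1]:
        raise ValueError(f"MAF {maf} outside [{edges[0]}, {edges[-1]}]")
    for i in range(len(edges) - 1):
        if maf <= edges[i + 1]:
            return i
    raise ValueError("unreachable for sorted edges")


def low_fst_proportions(
    snps: Iterable[Tuple[float, float, bool]],
    cutoff: float = 0.05,
    edges: Sequence[float] = BIN_EDGES,
) -> Dict[Tuple[int, bool], float]:
    """For each (MAF bin, in_panel) group, return the proportion of SNPs
    with FST <= cutoff. Each record is (global_maf, fst, in_panel)."""
    counts = defaultdict(lambda: [0, 0])  # [low-FST count, total count]
    for maf, fst, in_panel in snps:
        cell = counts[(maf_bin(maf, edges), in_panel)]
        cell[1] += 1
        if fst <= cutoff:
            cell[0] += 1
    return {key: low / total for key, (low, total) in sorted(counts.items())}


if __name__ == "__main__":
    records = [(0.05, 0.01, True), (0.05, 0.20, True),
               (0.05, 0.30, False), (0.45, 0.02, False)]
    print(low_fst_proportions(records))
```

Within each bin, the two proportions (gene panel versus other genes) could then be compared with the same kind of 2 × 2 χ2 test described above.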