Under the predefined inclusion criteria, 845 hemostatic SNPs in 34 genes were eligible for downstream analyses (Additional file 1: Table S1). PCA was performed to infer differentiation of these variants in the populations studied (Fig. 1). Ranked from most to least distant from Europeans, the populations were ordered YRI > EAS > SOM > SAS > AMR > EUR. The EAS and YRI populations contributed most to the first two principal components (PC1 and PC2), together accounting for nearly three-quarters of the total contributions (Fig. 2). To verify the genetic distances inferred from the 845 hemostatic SNPs for the populations shown in Fig. 1, we generated another PCA with the 849,267 SNPs included in the PMRA. The new PCA confirmed the genetic distances inferred from the hemostatic SNPs (Additional file 2: Fig. S1).
Population allele frequencies and correlations
To investigate whether gene frequencies differed between the populations, allele frequencies for the 845 SNPs were compared across the 6 populations (AMR, EAS, EUR, SAS, SOM, YRI) using the Kruskal–Wallis test. The Kruskal–Wallis test is a nonparametric test that examines whether k groups (here, populations) are identical or whether at least one of them deviates from the others. Post-hoc analyses with Bonferroni corrections were then performed to investigate pairwise relationships in gene frequency among the 6 populations. Six pairs of populations displayed statistically significant differences in allele frequencies: EAS‒EUR (P < 0.05), YRI‒EUR (P < 0.05), SOM‒YRI (P < 0.05), EAS‒AMR (P < 0.001), EAS‒YRI (P < 0.001), and SAS‒YRI (P < 0.001). No other pairs showed statistically significant differences in gene frequencies. The EAS and YRI populations showed the lowest correlation in gene frequencies of any pair of populations compared, with a Pearson's correlation coefficient (r) of 0.72 (Additional file 3: Table S3, Additional file 4: Fig. S2). The highest correlation was obtained between EUR and AMR (r = 0.96), indicating substantial European admixture in the latter population.
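As an illustration of the test described above, the following is a minimal sketch of the Kruskal–Wallis H statistic in plain Python, together with the number of pairwise comparisons that a Bonferroni correction over 6 populations has to account for. The helper names (`average_ranks`, `kruskal_wallis_h`) and the toy input are hypothetical. The exact post-hoc procedure used in the study is not specified here, so only the comparison count and the adjusted significance level are shown.

```python
from itertools import combinations

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction.

    Assumes the pooled values are not all identical (otherwise the
    tie correction divides by zero).
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n ** 3 - n))

# Pairwise comparisons among the 6 populations named in the text
populations = ["AMR", "EAS", "EUR", "SAS", "SOM", "YRI"]
pairs = list(combinations(populations, 2))
bonferroni_alpha = 0.05 / len(pairs)  # per-comparison significance level

# Hypothetical toy groups, not data from the study
h_toy = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

With 6 populations there are 15 possible pairs, so the 6 significant pairs reported above are a subset of 15 comparisons.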
To investigate dispersion of the 845 SNPs over populations, Pearson correlations between the allele frequencies of each population and the expected mean allele frequencies for the 6 populations were examined (graphs not shown). Allele frequencies in Europeans and in populations with genetic proximity to Europeans had the best correlations with the expected mean allele frequencies. The Pearson r values decreased as the genetic distance from Europeans increased, in the following order: AMR (r = 0.98) > EUR and SAS (r = 0.97) > SOM (r = 0.96) > EAS (r = 0.92) > YRI (r = 0.89).
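The correlation against the mean can be sketched as follows. This is a minimal illustration that assumes the "expected mean allele frequency" of a SNP is the unweighted mean of its frequency across populations; the population labels and frequencies below are hypothetical toy values, not data from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical allele frequencies for 4 SNPs in 3 toy populations
freqs = {
    "POP_A": [0.10, 0.40, 0.70, 0.90],
    "POP_B": [0.20, 0.35, 0.60, 0.95],
    "POP_C": [0.50, 0.10, 0.80, 0.30],
}
n_snps = 4

# Assumed definition: per-SNP mean frequency over all populations
mean_freq = [
    sum(f[i] for f in freqs.values()) / len(freqs) for i in range(n_snps)
]

# Correlation of each population's frequencies with the cross-population mean
r_to_mean = {pop: pearson_r(f, mean_freq) for pop, f in freqs.items()}
```

In this toy input, `POP_C` deviates most from the shared pattern and therefore has the lowest correlation with the mean.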
The violin plots in Fig. 3 show the distribution of allele frequencies for the 845 common hemostatic SNPs. In general, the two African populations (SOM and YRI) had higher median allele frequencies than the other populations. The EAS population showed the lowest median allele frequency among the 6 populations investigated.
Of the 845 hemostatic SNPs included in the analyses, 779 were not recorded in ClinVar and were designated as unknowns. The remaining 66 SNPs were found in ClinVar, of which 10 were annotated as benign variants, 8 as drug response related, 26 as likely benign, 9 as pathogenic, and 13 as variants of uncertain significance (VUS). These 66 ClinVar SNPs are shown in a heatmap in Fig. 4. The heatmap suggests that fewer drug response alleles reach frequencies greater than 0.10 in the East Asian and West African (YRI) populations (upper segments of the heatmap). Other SNPs, such as the Ala15Thr variant (rs6092) in SERPINE1, whose pathogenicity status is disputed, have alleles that are prevalent in non-African populations but not in the 2 African populations investigated (see Additional file 1: Table S1).
Multi-SNP haplotype phasing
To investigate whether multi-SNP haplotype phasing provided better information than individual SNPs, we employed the Shapeit v2 method and constructed haplotypes for the 34 genes included in the analyses. We only included SNPs that were represented in both the Affymetrix genotype array and the phase-determined population data obtained from the 1000 Genomes Project phase 3. The resulting haplotypes are shown in a multi-tab Excel file in Additional file 5: Table S2. All reference alleles shown in the table are ancestral alleles. Table 1 shows the multi-SNP haplotype phasing for one of those 34 genes, vitamin K epoxide reductase complex subunit 1 (VKORC1), with regard to 8 different SNPs in 10 different populations: SOM, ASW, CEU, CHB, GIH, JPT, LWK, MXL, TSI, YRI. The VKORC1 gene is shown as an illustrative example partly because of its pharmacogenetic importance (warfarin resistance/sensitivity) and partly because a high proportion of the selected SNPs could, given their allele frequencies, be expected to be informative as regards ethnicity. Surprisingly, a mere 8 different multi-SNP haplotypes of VKORC1 represent the whole spectrum. Haplotypes with a frequency of less than one percent were excluded (indicated by an empty space in Table 1). The ‘C G C C G G A A’ haplotype (haplotype row 6 in Table 1), which had a frequency of 8 percent, was unique to the SOM population. In summary, Table 1 shows that the multi-SNP patterns of SNPs with high but not complete allelic association (i.e. strong LD plus evidence for historical recombination) in the VKORC1 gene are more informative as regards ethnicity across populations than the individual SNPs are.
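The tallying step that follows phasing can be sketched as below. This is a minimal illustration, not the Shapeit workflow itself: it assumes the phased haplotypes of one population are already available as strings of space-separated alleles, and it applies the one percent cut-off mentioned above. The function name and the toy haplotypes are hypothetical.

```python
from collections import Counter

def haplotype_frequencies(haplotypes, min_freq=0.01):
    """Return haplotype -> frequency, dropping haplotypes below min_freq."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {
        hap: count / total
        for hap, count in counts.items()
        if count / total >= min_freq
    }

# Hypothetical phased chromosomes for one toy population (200 in total)
toy_haplotypes = ["A C G"] * 150 + ["A T G"] * 49 + ["G T A"] * 1

# "G T A" occurs once in 200 chromosomes (0.5%) and is therefore excluded
toy_freqs = haplotype_frequencies(toy_haplotypes)
```

Running the same tally per population and comparing the resulting dictionaries is one way to spot a haplotype that passes the cut-off in a single population only.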
To examine the diversity within continental populations, we used the Human Genome Diversity Project browser (http://hgdp.uchicago.edu/cgi-bin/gbrowse/HGDP/) and examined SNPs in a region of 1 Mbp encompassing the PROC gene, one of the 34 genes studied. As expected, the resulting haplotype chart (Fig. 5) shows greater diversity in Africa than in other continents or subcontinents. This greater diversity is characterized by complex patterns of African haplotypes, shown as shorter multicolor haplotypes compared with the longer haplotypes found in other world populations (Fig. 5), which is a result of a lower degree of linkage disequilibrium in African haplotypes. Haplotypes for individual populations included in this analysis are shown in Additional file 6: Fig. S3.
Clinically important common hemostatic gene variants in ClinVar database
To examine the representation of populations with respect to the clinically relevant variants, we combined the drug response pharmacogenetic variants involved in thrombosis therapy and the common pathogenic SNPs that alter the function of hemostasis genes (n = 17), and investigated them with Friedman's two-way analysis of variance by ranks. Friedman's mean rank was largest in Europe and decreased with increasing genetic distance from Europe: EUR (4.4) > AMR (3.9) > SAS (3.8) > SOM (3.6) > YRI and EAS (2.7 each). This suggested that EUR was the population best represented by the DNA variants investigated and that YRI and EAS were the least represented. There was a statistically significant difference in allele frequencies of the clinically relevant SNPs between the 6 populations, chi-squared (χ²) = 14.6, P = 0.012. Post hoc analysis with Wilcoxon signed-rank tests and a Bonferroni correction showed a significant difference between EUR and EAS (P < 0.05) and between YRI and EAS (P = 0.008).
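The mean ranks and the test statistic reported above can be illustrated with a minimal sketch of Friedman's test in plain Python. It assumes that each SNP is a block (row) and each population a treatment (column), and it omits the tie correction that statistical packages usually apply. The function names and the toy table are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_test(table):
    """Friedman mean ranks and chi-squared statistic (no tie correction).

    table: list of rows, one row per block (SNP), one column per
    treatment (population).
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        # Rank the populations within each SNP
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    mean_ranks = [s / n for s in rank_sums]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s ** 2 for s in rank_sums) \
        - 3 * n * (k + 1)
    return mean_ranks, chi2

# Hypothetical toy table: 3 SNPs (rows) x 3 populations (columns)
toy_table = [
    [0.10, 0.20, 0.30],
    [0.05, 0.15, 0.40],
    [0.01, 0.02, 0.03],
]
toy_mean_ranks, toy_chi2 = friedman_test(toy_table)
```

Because ranks are assigned within each SNP, the mean ranks of k populations always average (k + 1)/2, which is 3.5 for 6 populations. The reported mean ranks sum to 21.1, which agrees with the expected 6 × 3.5 = 21 up to rounding.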
Nine pathogenic hemostatic SNPs and eight drug response DNA variants with at least 1% AAF were identified in the populations studied (Fig. 6). In the battleship diagram shown in Fig. 6, the width of each rectangle, or square, is proportional to the allele frequency: the wider the rectangle, the higher the allele frequency of a specific SNP in a population. The eight common hemostatic drug response SNPs shown in Fig. 6 (left) were identified in the VKORC1 (rs2359612, rs7294, rs9934438, rs17708472, rs2884737, rs61742245) and CYP2C9 (rs1799853, rs1057910) genes. These eight variants are known to be associated with warfarin dose requirement. One of these, rs61742245 (VKORC1 Asp36Tyr), was prevalent only in the SOM population. This variant is included in the 8% ‘C G C C G G A A’ haplotype in Table 1, as mentioned above, and is associated with warfarin resistance in the Horn of Africa [19,20,21] and, to a lesser extent, in the Middle East [22, 23]. The two populations genetically most distinct from Europe (YRI and EAS) also had the fewest drug response variants that reached the threshold of 1% AAF.
The EUR population appeared to have the most pathogenic SNPs reaching 1% AAF (Fig. 6, right). In contrast, the EAS and YRI populations had only one pathogenic variant each that reached the 1% AAF threshold. The variant rs118203905 (FV R306T), also called Factor V Cambridge and associated with APC resistance, was common only in the EAS population. Another SNP, rs41276738 (VWF R854Q), linked to type 2 N von Willebrand disease, reached 1% AAF only in the AMR population. Further, SNPs rs73015965 (PLG K38E) and rs148685782 (FGG A108G), associated with type I plasminogen deficiency and reduced levels of fibrinogen, respectively, were common only in the EUR population. Two other pathogenic SNPs were common only in the SOM population. One of these, rs121918478 (AAF 0.0053; 1%), is a single nucleotide variant in the factor II gene creating an amino acid substitution at position 461 of the protein (FII R461W), which causes hereditary factor II deficiency. The other SNP common only in the SOM population is the hemophilia A variant rs137852380 (AAF 0.0105; 1%) in the F8 gene (FVIII G89D). Finally, a variant (rs72547529) previously associated with loss of VKORC1 enzyme activity was common only in the YRI population.