Background

Genetic variations constitute an important part of the factors that contribute to many complex diseases. To identify genetic variations that are associated with specific complex diseases, Genome-wide and whole-genome association studies have been widely performed in recent years. These studies have identified many germline variants (in particular, single nucleotide polymorphisms or SNPs) associated with complex diseases in many different populations. Most of these identified variants exhibit subtle disease associations, and these variants usually have limited predictive power in risk assessment [1].

While the effect of a single genetic variation, like a SNP, could be small or large, the collective effect of many variations, and their relationships, provide valuable information into the mechanisms of complex diseases. To this end, there is an eminent need for studying the collective effect of many SNPs and their interrelationships as we try to characterize the functional underpinnings of the relationship between genotype and phenotype. One useful source of information in this regard is the disease associations identified by GWAS on different populations.

Since many genomic variants can be population-specific and their distribution can follow geographical patterns, GWAS on different populations offer different [2] and potentially complementary disease associations, which may enrich our understanding of disease mechanisms. Furthermore, rare variants, which are often thought to be the causal variants for many complex diseases, are usually population-specific [1]. Therefore, elucidation of the functional relationship between rare variants identified in different populations can be useful for the design of personalized treatment strategies in precision medicine. In the future, the comprehensive knowledge we collect from the use of multiple populations in whole-genome association studies can be utilized by healthcare systems in smoothing out and gradually eliminating health disparities [3, 4].

Systems biology uses computational and mathematical approaches to model the complex interactions among genetic variations as related to a phenotype. Such a holistic approach of characterizing and integrating the widespread genetic variations and the interplay between them yields a more intuitive understanding of complex diseases. Type II Diabetes (T2D) is one of the complex diseases, where most disease-associated variants identified by GWAS are different across different populations. Since variants identified on different populations can be linked to T2D through similar functional mechanisms, trying to annotate the significant SNPs from each population separately will not fully utilize the information obtained from different populations. Furthermore, since there are distinct underlying disease processes and treatment regimens for T2D, analyzing the functional overlap between variants identified on different populations can shed light into the differences between different populations in terms of disease mechanisms.

In this paper, we propose a computational pipeline to systematically assess the functional overlap between genomic markers of complex diseases that are identified on different populations. For this purpose, we use T2D as a suitable model disease, and we compile results from GWAS that have been performed on 9 different populations across the globe. These results mainly represent genomic loci that are found to be associated with T2D on samples collected from these nine populations. Interestingly, there is little overlap between disease associated loci identified on different populations.

To systematically assess the functional overlap between these different sets of SNPs, we develop computational algorithms and statistical frameworks, expecting that the variants identified in certain populations correspond to similar biological processes. For this purpose, we develop a multi-layered framework, where genomic loci, protein-coding genes, biological pathways in which these proteins are active, and networks of physical and functional interactions between these proteins are systematically evaluated for potential overlap. Figure 1 illustrates the framework. Our results show that the overlap between different populations grow as the level of abstraction coarsens from genomic location to biological function. More interestingly, our results also show that differences in the biological processes that are implicated in different populations align with the targets of T2D first-line therapy in each population.

Fig. 1
figure 1

Workflow of multi-level functional overlap analysis for disease-associated genomic variants identified in different populations. The computational pipeline takes as input a set of genomic loci that are found to be significantly associated with a disease of interest in each of the populations that are considered. It then maps these loci to functional elements with coarser granularity, first considering linkage disequilibrium, then mapping loci to protein-coding genes, subsequently mapping these genes to biological pathways, and finally identifying functionally associated subsets of these genes via network analysis. At each level, the resulting functional elements that are found to be associated with the disease are compared across populations, to systematically characterize the overlap between different populations

To further assess the generalizability of our methods, we repeat the same pipeline of analyses on datasets for another complex disease, prostate cancer as a second test case. Our results on prostate cancer also show that more functional overlap can be detected among populations as the level of abstraction coarsens.

Results

Populations and datasets for type II diabetes

We use nine T2D Genome Wide Association datasets representing nine populations of Asian and European ethnic origins. The basic statistics of these studies and their results are shown in Table 1. The first two datasets are case-control studies for which the genotypes for all samples and genotyped loci are available. The other seven datasets are published results of case-control studies. These datasets consist of the list of significant loci that are identified by the study and the associated statistical significance figures.

Table 1 Descriptive statistics of the T2D studies, datasets, and results used in this study. The letter code shows the population from which the samples were obtained, Significant tagSNPs shows the number of tagSNPs that were found to be significantly associated with T2D at the significance threshold applied by the corresponding study (also shown in the table). Significant tagSNPs+LD shows the total number of these significant SNPs and the number of SNPs that are in linkage disequilibrium with these SNPs, but were not screened by the corresponding study

The nine datasets are the following:

  1. 1.

    Wellcome Trust Case Control Consortium –WTCCC (W) T2D SNP data which has genotyped 495,477 SNPs genotyped for 1999 case and 1504 control samples obtained from the UK population [5]. The control samples represent individuals who were born in 1958.

  2. 2.

    The database of Genotypes and Phenotypes –dbGAP (D) which has genotyped 561,656 SNPs for 1007 case and 983 control samples obtained from other parts of Europe and US [6].

  3. 3.

    A Finnish (F) population case-control study which genotyped 329,091 SNPs for 1161 case and 1174 control samples collected in Finland [7, 8].

  4. 4.

    A French (Fr) population case-control study which has genotyped 392,365 SNPs for 679 case and 697 control samples collected in France [9].

  5. 5.

    Saudi (S) population studies that pooled 48 Saudi T2D SNPs implicated in previous experiments for a total of 2207 case samples and 2276 control samples collected in Saudi Arabia [10,11,12].

  6. 6.

    A Korean (K) population case-control study which has genotyped 2,188,613 SNPs for 462 case samples and 456 control samples collected in South Korea [13, 14].

  7. 7.

    A Japanese (J) population case-control study which has genotyped 7,521,072 SNPs for 23,399 case and 31,722 control samples collected in Japan [15, 16].

  8. 8.

    A Chinese (C) population case-control study which has genotyped 2,900,000 SNPs for 684 case and 955 control samples collected in China [17, 18].

  9. 9.

    A Lebanese (L) population case-control study which has genotyped 5,891,794 SNPs for 1902 case and 1384 control samples collected in Lebanon [19].

Functional overlap among genomic variants found to be associated with T2D in different populations

In this section, we present the overlap between T2D-associated SNPs at five different functional levels. For each functional level, we compute (i) an overlap matrix and (ii) a cumulative overlap function. Each overlap matrix is a k × k matrix that represents the pairwise overlap between the disease-associated loci in pairs of populations based on a certain notion of functional overlap, where k is the number of populations. Each cumulative overlap function is a function in the form f:{1, …, k} → [0,1], assessing the fraction of biological entities (individual loci, loci in LD, genes, functions, subnetworks) that are found to be associated with the disease in at least a given number of the populations.

We hierarchically cluster the populations using each of the five overlap matrices and visualize the overlap matrices as heatmaps with hierarchical clustering. To assess the statistical significance of the overlap functions, we report the results of permutation tests obtained through 1000 permutations (the procedure we use for the permutation tests is described in Methods). We compare the overlap function computed on the original dataset against the distribution of overlap functions computed using permutation tests, representing one thousand simulated runs.

The SNP overlap matrix and the SNP overlap function for T2D-associated SNPs in the nine populations are shown in Fig. 2(a) and Fig. 3(a), respectively. As seen in the figures, the overlap between any pair of populations is considerably low, but there is some overlap between pairs of populations (Chinese and Saudi Arabian, French and Lebanese, Finnish and Korean, Lebanese and Japanese). Although the pairwise overlap between T2D-associated SNPs is considerably low, the permutation test for the overlap function for k = 2 (two populations) suggests that the pairwise overlap is statistically significant (z-score = 23.8, p < 3.82E-125). In other words, a SNP that is found to be associated with T2D in one population is likely to be found to be associated with T2D in at least one other population. However, for values of k larger than 2, the overlap between T2D-associated SNPs is not statistically significant, i.e. the T2D-associated SNPs do not tend to be shared across 3 or more populations.

Fig. 2
figure 2

Overlap matrices at five functional levels for the genomic variants that are found to be associated with Type II Diabetes in 9 different populations. The plots from (a) to (e) show the overlap matrices for SNP, LD SNP, gene, functional and network overlap matrices respectively, depicted as heatmaps. Each heatmap has hierarchical clustering of the 9 populations (D = dbGAP, W=WTCCC, J = Japanese, F=Finish, K=Korean, Fr = French, L = Lebanese, C=Chinese, S=Saudi). The color intensity from white to brick, shows the degree of overlap, with brick being the highest

Fig. 3
figure 3

Cumulative overlap functions at five functional levels for the genomic variants that are found to be associated with Type II Diabetes in 9 different populations The plots from (a) to (e) show the SNP, LD SNP, gene, functional and network cumulative overlap functions respectively. Each plot also provides a comparison to the corresponding cumulative overlap function computed over 1000 random permutations

The linkage disequilibrium (LD) SNP overlap matrix and the LD-SNP overlap function for T2D-associated SNPs in the nine populations are shown in Fig. 2(b) and Fig. 3(b), respectively. When we take linkage disequilibrium into account (thereby considering loci with correlated genotypes as overlapping), we observe considerably larger overlap between pairs of populations in the corresponding heatmap. In particular, the Lebanese and French populations cluster together with very high overlap. Another cluster is formed by the Chinese, Japanese, Finnish, and Saudi Arabian populations. This cluster also exhibits considerable overlap with the French-Lebanese cluster. The permutation test for the LD-SNP overlap function suggests statistically significant overlap (p < 4.27E-17) for up to 7 populations – thus we can conclude that, at such genomic level, there is some overlap between the T2D-associated loci among most of the populations that are considered.

The gene overlap matrix and the gene overlap function for T2D-associated genes in the nine populations are shown in Fig. 2(c), Fig. 3(c), respectively. Here, a gene is considered T2D-associated in a population if at least one SNP in the genes’ region of interest (ROI) is found to be significantly associated with T2D in that population. As seen in the heatmap, when genes are considered, more overlap is detected between pairs of populations, as compared to LD Analysis, e.g., there is solid pairwise overlap between populations like the Lebanese and French as well as the Chinese and Saudi Arabian. However, when more than two populations are considered, less overlap is detected in genes than in LD-SNPs – and no gene that is common to 6 populations is identified. This is shown by the distribution of the values of the overlap function for the permutation test, but the gene overlap is statistically significant for k = 2,3,4,5 populations (p < 1.1E-245).

The functional overlap matrix and the functional overlap function for T2D-associated functions in the 9 populations are shown in Fig. 2(d) and Fig. 3(d), respectively. As seen in the heatmap, there are no functions that are enriched in SNPs found to be associated with T2D in the French and Lebanese populations, so the two populations are eliminated from further analysis. There is an improved statistically significant pairwise overlap (z-score = 433.8, which corresponds to a very small p-value), between the Finnish and Korean populations, however the extent of overlap decreases to less than five populations as shown by the distribution of the values of the overlap function for the permutation test. In other words, a function that is found to be associated with T2D in one population is likely to be found to be associated with T2D in another three populations. However, for values of k larger than 4, the overlap between T2D-associated functions is not statistically significant i.e. the T2D-associated functions do not tend to be shared across 5 or more populations.

The network overlap matrix for T2D-associated subnetworks and the network overlap function for the seven populations (excluding Fr and L) are shown in Fig. 2(e) and Fig. 3(e), respectively. As seen in the heatmap, the amount and extent of overlap between populations is considerably higher than all previous overlap analyses. There is very high overlap between the Saudi and Chinese populations, populations of European ethnic origin represented by UK, US and other parts of Europe, and between the Finish and Korean populations. Moreover, there is highly statistically significant overlap (z-score > 1294, which corresponds to a very small p-value) between up to 6 populations. This is suggested by the distribution of the values of the overlap function for the permutation test. In other words, a subnetwork that is found to be associated with T2D in one population is likely to be found associated with T2D in almost all of the populations that are considered.

In order to visualize the functional overlap across the populations from a different perspective, we input the entire set of loci from all populations to the Prix-Fixe network analysis tool [20]. This tool outputs the most significant functionally coherent subnetwork across all seven populations. Subsequently, we plot the resulting protein-protein interaction (PPI) network, color coded for different populations. The T2D PPI network across seven populations is shown in Fig. 4. As shown in the figure, the resulting most functionally coherent subnetwork has some strong inter-population interactions between T2D-associated proteins and the strength of interaction, represented by the width of the edges, suggests a strong overlap between the Asian populations; specifically Chinese, Saudi, Korean, Japanese and Finnish populations and some overlap between the Finnish and British as well as between the American and other European and the Korean populations. This conforms to the previous results from network analysis.

Fig. 4
figure 4

Cross-population protein-protein interaction network associated with Type II Diabetes. The proteins in the network are color-coded based on the population on which the corresponding gene is found to be associated with T2D. Blue:UK, Red:US+Other parts of Europe, Cyan:Finnish, Magenta: Korea, Purple:Japanese, Citric:Chinese, Dark Grey:Saudi, Light Grey is used for interconnection nodes. The edge width represents the interaction weight as calculated using Prix-Fixe network algorithm [20]

Correspondence between genes identified in different populations and T2D subtypes

According to Cantley J and Ashcroft [21], there are two main molecular mechanisms that underlie the etiology of T2D; insulin secretion deficiency and insulin resistance. Prasad and Groop [22] classify T2D associated genes according to their roles in these two subtypes. To investigate the correspondence between the genes that are found to be associated with T2D in each population and their association with these subtypes, we refined our analysis to the gene-based and network-based levels. The results of the analysis is shown in Fig. 5. In the figure, the number of genes that are found to be associated with T2D (based on the mapping of the variants identified in each population to the genes’ regions of interest) is stratified according to T2D subtypes. Since analysis at the network level provides system-level information, we also repeat this analysis at the network level and we observe increased overlap with network-based analysis. At the network level, we consider a gene to be identified in a population if the most significant subnetwork identified in that population contains the gene. Therefore, a gene that does not have a significant variant in a population can be found to be associated with T2D in that population, if it is functionally associated with other genes that harbor significant loci. Similarly, a gene that harbors a significant locus in a population may not be considered as associated with T2D at the network level, if its network neighborhood is not enriched in genes that are significant in that population.

Fig. 5
figure 5

The correspondence between genes identified as associated with Type II Diabetes in different populations and their role in T2D subtypes. The top panel shows the distribution of the genes that are found to be associated with T2D in each population with respect to their association with two common T2D subtypes. For each population, among the genes that are found to be associated with T2D in that population, yellow bars represent the number of genes that are associated with Insulin Secretion Deficiency, red bars represent the number of genes associated with Insulin Resistance subtype and the orange bars represent the number of genes that are associated with both subtypes. The bottom panel shows the corresponding distribution for network-based overlap, where a gene is considered to be identified in a population if it is in the most significant subnetwork identified by network analysis on that population. The color codes used in the right panel are identical to that in the left panel. The rightmost bar shows the number of all other T2D genes that were identified in at least one population, color-coded for the population

As seen in Fig. 5, the genes identified in Asian populations are mostly associated with insulin secretion deficiency at both levels of genomic granularities, as opposed to European and American populations, in which insulin resistance seems to be the prevailing predisposing genetic factor for T2D, at both individual gene and network levels. For example, the US, UK, and other European populations have no insulin secretion-implicated genes at the genetic-level of the analysis, while the Finnish population shows an equal share of both at the genetic-level. However, at the network-level, all European populations including the Finnish population, are dominated by insulin resistance-implicated genes, while having some share of insulin secretion deficiency-related genes interacting with them. On the other hand, the Asian populations have more insulin secretion deficiency-implicated genes at the level of individual genes, except for the Korean population, which shows an equal share of insulin secretion deficiency and insulin resistance-implicated genes. At the network level, all Asian populations show more insulin secretion deficiency as opposed to insulin resistance, including the Korean population. Interestingly, none of the insulin resistance associated genes are identified to be associated with T2D in the Japanese population, according to the network level analysis.

Application to prostate Cancer

In this section, we present the results obtained by applying the proposed framework to prostate cancer data. We use published results of Prostate Cancer Genome Wide Association Studies for seven populations representing seven ethnic origins. The basic statistics of these studies and their results are shown in Table 2. The seven datasets are the following:

Table 2 Descriptive Statistics for the genomic variants associated with Prostate Cancer. Significant tagSNPs shows the number of tagSNPs that were found to be significantly associated with Prostate Cancer at the significance threshold applied by the corresponding study (also shown in the table). Significant tagSNPs+LD shows the total number of these significant SNPs and the number of SNPs that are in linkage disequilibrium with these SNPs, but were not screened by the corresponding study
  1. 1.

    A European population represented by pooled results of European Prostate Cancer GWAS for more than 10.5 million genotyped SNPs and for a total of 90,843 case samples and 72,487 control samples of European ethnic origin [23,24,25,26,27].

  2. 2.

    A Hispanic population represented by the study of 49 haplotyped-tagged SNPs genotyped from 196 case and 472 control samples of Hispanic ethnic origin [28].

  3. 3.

    A Japanese population represented by the results of case-control studies which pooled a total of > 6,779,114 genotyped SNPs for a total of 6167 case and 12,187 control samples of Japanese ethnic origin [29, 30].

  4. 4.

    A Chinese population represented by the results of case-control studies which pooled a total of 942,613 genotyped SNPs for a total of 1912 case and 1648 control samples of Chinese ethnic origin [31,32,33,34].

  5. 5.

    A Korean population study that genotyped 60,276 SNPs from 1515 case and 3189 control samples of Korean ethnic origin [35].

  6. 6.

    An African American population study that genotyped 199 SNPs from 5869 case and 5615 control samples of African American ethnic origin [36].

  7. 7.

    A Tunisian population study that genotyped 534,781 SNPs from 90 case and 131 control samples of Tunisian men [37].

We report the results of the five cumulative overlap functions assessing the fraction of biological entities (individual loci, loci in LD, genes, functions, subnetworks) that are found to be associated with prostate cancer in at least a given number of the populations. We compare the overlap function computed on the original dataset against the distribution of overlap functions computed using permutation tests, representing one thousand simulated runs (the procedure we use for the permutation tests is described in Methods). The results in Fig. 6(a-e) show that more significant overlap between populations is realized as the level of abstraction coarsens, from genomic location to biological function.

Fig. 6
figure 6

Cumulative overlap functions at five functional levels for the genomic variants that are found to be associated with Prostate Cancer in 7 different populations. The five cumulative overlap functions for prostate cancer, with increasingly statistically significant overlap as we go down the pipeline of analyses from SNP to Network Overlap analysis

The permutation test for the overlap function for k = 2 (two populations) suggests that the pairwise overlap is statistically significant (Fig. 6(a), z-score = 43.151). In other words, a SNP that is found to be associated with prostate cancer in one population is likely be associated with prostate cancer in at least one other population. However, for values of k larger than 2, the overlap between prostate cancer associated SNPs is not statistically significant, i.e. the prostate cancer associated SNPs do not tend to be shared across 3 or more populations. In Fig. 6(b) the permutation test for the LD-SNP overlap function suggests statistically significant overlap for up to 3 populations (z-score = 281.63). In Fig. 6(c) the distribution of the values of the gene overlap function for the permutation test, suggests that the overlap is statistically significant for up to 3 populations (z-score = 489.521). In Fig. 6(d) the distribution of the values of the functional overlap function for the permutation test suggests that the overlap is statistically significant for up to 4 populations (z-score = 489.031) and in Fig. 6(e) the distribution of the values of the network overlap function for the permutation test, suggests that the overlap is statistically significant for up to 5 populations (z-score = 361.762).

Discussion

According to literature [38, 39], most T2D treatment regimens are based on one of two groups of medications. The first of these two groups is Sulfonylureas; used to improve insulin secretion, by targeting the ABBC8 and KCNJ11 genes and the other is Metformin, which is used to improve insulin sensitivity and targets the PRKAB1 [40,41,42,43]. The drug targets for Sulfonylureas are implicated in the Asian populations of this study, as well as the Finnish population. The drug target for Metformin and the genes that interact with it (PRKAG2 and PRKAG1) are implicated in the European, American and the Korean populations.

In Fig. 7, we show the interrelationship between T2D subtypes, genes identified in each population in this study, first line T2D treatments in these populations, the drug targets of these first line treatments, and their implications in T2D subtypes. It is interesting that the first line treatment for T2D in each of the populations in this study conform to the population’s unique T2D mechanisms. For example, literature confirms that decreased insulin secretion capacity takes a bigger role in the development of T2D in the Japanese population than insulin resistance. Furthermore, Sulfonylureas have been the most prescribed class of drugs, and has been the first line treatment in Japan until recently when it started to be supplemented with glucose lowering medications as well [44,45,46]. In China, the majority of oral anti-diabetic drugs belong to the Sulfonylureas class. This is the oldest of the anti-diabetic drug classes and the majority of hypoglycemic medicines on China’s 2012 EDL are within this category in spite of the availability of new classes [47,48,49,50,51]. Sulfonylureas has consistently been the first line treatment for T2D in Korea, with no competition until 2010, when Metformin started getting popular in the Korean market and its consumption and sales increased by 2013 [52].

Fig. 7
figure 7

Interrelationship between subtypes, populations, first line treatments and drug targets for Type II Diabetes. The figure shows the triangular interrelationships between T2D subtypes, genes identified in each population in this study, first line T2D treatments in these populations, the drug targets of these first line treatments, and their implications in T2D subtypes. For example, in the US, insulin resistance is the most common T2D subtype and Metformin is the first-line T2D treatment. The drug target of Metformin is PRKAB1 which is implicated in insulin resistance and is harbored by the US population in this study

In contrast to the Asian populations, we found that the genes identified in the Finnish, European, and American populations are mostly related to insulin resistance. The Saudi population is also characterized, as well as many Arab countries, by Adipocyte dysfunction which associates obesity to insulin resistance and diabetes [53]. Research shows that Metformin is the first line treatment for T2D in USA, UK, Finland and Saudi Arabia [54,55,56,57]. In fact Metformin has been on the list of the top ten prescription drugs in the USA for years, ranking four in 2018 and 2019 [58] and on the list of top 100 most prescribed drugs in the UK [59] and ranks fifth on the list of most prescribed drugs in Saudi Arabia from 2010 to 2015 [60].

In 2017, Metformin had the highest average antidiabetic drug prescriptions per physician in Finland which also falls under the top 10 most commonly prescribed medicine categories [61]. The Finnish Medicine Agency – Fimea [62] estimates what proportion of the population theoretically receives Metformin; in terms of drug daily dose per 1000 inhabitants per day. The Fimea 2017 report shows that Metformin consumption is highest among all antidiabetic drugs between 2014 and 2017 (31.65, 31.65,30.95 and 31.61 respectively with an increasing gap with Sulfonylureas; the latter shows a decreasing daily consumption of 3.94, 3.15, 2.38 and 1.76 for the same 4 years respectively). Metformin has also been associated with a good change in the gut microbiota, which improves insulin sensitivity [63].

T2D has a heterogeneous and multifactorial etiology, with many associated factors including gut microbiome, and possibly genetic subtypes that are yet to be uncovered. Although T2D treatments work at different degrees of efficiency from one person to another, the above analysis confirms previous research [64, 65] indicating that, of the currently known T2D subtypes, certain subtypes seem to be most common in certain ethnicities, and that Asian populations are more characterized by decrease in insulin secretion capacity as opposed to American, European, and other Caucasian populations which have insulin resistance as the most common reason for T2D [66, 67]. Also, the network-based T2D subtype analysis (Fig. 5) shows more overlap between the two subtypes of T2D than the gene-based analysis, which supports our results and previous findings [68].

The experimental results we obtained using prostate cancer data show consistency to T2D results in the sense that more statistically significant overlap is realized as we go through the pipeline from SNP to network overlap analysis, which supports our hypothesis and demonstrates the generalizability of the methods.

Conclusions

In this work, we developed computational algorithms and statistical frameworks to assess the functional overlap between disease-associated variants in different populations, expecting that the variants identified in certain populations correspond to similar biological processes. For this purpose, we developed a multi-layered framework, where genomic loci, protein-coding genes, biological pathways in which these proteins are active, and networks of physical and functional interactions between these proteins are systematically evaluated for potential overlap.

Our results, show that the overlap between different populations grow as the level of abstraction coarsens from genomic location to biological function. More interestingly, we were able to show that differences in the biological processes that are implicated in different populations align with the targets of first-line treatments of T2D in each population. We were also able to assess the generalizability of our method by testing its applicability to another complex disease. To this end, our results represent an innovative and potentially significant tool for preventing, curing, and treating disease, in that population-specific functional annotation of disease-associated genes can be used to design personalized treatment strategies in precision medicine.

It is important to note that the results presented here do not conclusively show a causal link between the genomic markers identified via GWAS and the first-line T2D treatments in these populations. Establishment of such a link would require further quantitative analysis to understand whether specific types of diabetes II are over represented in specific populations, and whether the medicine that was prescribed for each patient was appropriate. Furthermore, the prevalence of a specific kind of medicine in a country may not be related to the etiology of disease, but can rather be due to historical or political reasons. Further research is required to answer these questions.

Methods

Overlap matrices and cumulative overlap functions

The objective of this study is to characterize the functional overlap between loci that are identified to be significantly associated with a complex disease based on samples from different populations. To address this problem, we assume that we are given a collection L = {L1, L2, …, Lk} of sets of genomic loci identified to be associated with the disease across k populations (in this study, we have k = 9), such that the set Li contains the loci that are found to be significant based on the samples obtained in the ith population. Based on this information, we compute five overlap matrices and five cumulative overlap functions. Each overlap matrix is a k × k matrix that represents the pairwise overlap between the disease-associated loci in pairs of populations based on a certain notion of functional overlap. Each cumulative overlap function is a function in the form f:{1, …, k} → [0,1], assessing the fraction of biological entities (individual loci, loci in LD, genes, functions, subnetworks) that are found to be associated with the disease in at least a given number of the populations.

We compute the following overlap matrices:

  1. 1.

    SNP Overlap Matrix [Σij]kxk assesses the overlap between the loci that are found to be associated with the disease in populations i and j, where k is the number of populations.

  2. 2.

    LD SNP Overlap Matrix [Λij]kxk assesses the overlap between the loci that are found to be associated with the disease in populations i and j, such that two loci are considered to be overlapping if they are in linkage disequilibrium.

  3. 3.

    Gene Overlap Matrix [Γij]kxk assesses the overlap between genes that harbor loci that are found to be associated with the disease in populations i and j.

  4. 4.

    Functional Overlap Matrix [Φij]kxk assesses the overlap between the biological processes that are enriched in genes harboring loci that are found to be associated with the disease in populations i and j.

  5. 5.

    Network Overlap Matrix [Νij]kxk assesses the overlap between the subnetworks of protein-protein interaction networks that are enriched in genes harboring loci that are found to be associated with the disease in populations i and j.

In the following discussion, we explain how we compute each of these overlap matrices. The notation used in this section is provided in Table 3.

Table 3 Notation Used in this Paper

SNP overlap matrix

We define the SNP Overlap Ratio Σij between two populations i and j as the Jaccard coefficient of the overlapping SNPs between the populations, i.e., the fraction of common significant tagSNPs in the two populations among the number of significant tagSNPs in the two populations:

$$ {\varSigma}_{ij}=\mid {S}_i\cap \kern0.5em {S}_j\mid \kern0.5em /\kern0.5em \mid {S}_i\kern0.5em \cup {S}_j\mid $$

Where Si and Sj denote the sets of tagSNPs that are found to be significantly associated with the disxeases in populations i and j, respectively.

In order to quantify the overall overlap between the k populations, we define the cumulative SNP overlap function σ(l) for 1 ≤ l ≤ k as follows:

$$ \sigma\ (l)=\mid {S}_l\mid /\mid S\mid $$

Where Sl denotes the set of tagSNPs that are found to be significantly associated with the disease in at least l populations. Observe that 0 ≤ σ(l) ≤ 1, σ(1) = 1, and σ(l) is a monotonically non-increasing function of l. All cumulative overlap functions we define below also exhibit these properties.

Since several probes map to the same SNP, while computing the sets Si for 1 ≤ i ≤ k, we first remove duplicate SNP lists in every population. For the same reason, we also compute the overlap ratio for a pair of populations as the number of common tagSNP over the number of all unique tagSNPs in the populations combined.

LD SNP overlap matrix

We define the linkage disequilibrium (LD) SNP overlap ratio Λij between two populations i and j as the fraction of common significant tagSNPs in the two populations that have significant LD partners in the other population among the number of significant  tagSNPs in the two populations :

$$ {\varLambda}_{ij}=\mid {L}_{ij}\cup {L}_{ji}\mid \kern0.5em /\kern0.5em \mid {S}_i\cup {S}_{\mathrm{j}}\mid $$

Where Lij denotes the set of tagSNPs that are found to be significantly associated with the diseases in population i and have significant LD partners in the other population j, and Lji denotes the set of tagSNPs that are found to be significantly associated with the diseases in population j and have significant LD partners in the other population i.

In order to quantify the overall overlap between the k populations, while using LD to expand the definition of tagSNPs, we define the cumulative LD SNP overlap function λ(l) for 1 ≤ l ≤ k as follows:

$$ \lambda (1)=\mid {\mathrm{L}}_{\mathrm{l}}\mid /\mid \mathrm{S}\mid $$

Where Ll denotes the set of tagSNPs that have LD partners found to be significantly associated with the disease in at least l populations.

To find SNPs that are in linkage disequilibrium (LD), we input each population’s tagSNPs into SNPsnap [69] for LD search using HapMap3 release 2 [70] dataset. For this purpose, we use the European panel (CEU) for the European and American populations (W and D), European and Toscani in Italia (TSI) for the French and Lebanese populations, and the Japanese and Chinese panels (JPT + CHP) for Finland and the Asian populations. We consider two SNPs to be in LD if they have an r2 of at least 0.5 and they are within 500Kbs of each other on the genome.

Gene overlap matrix

We define the Gene Overlap Ratio Γ ij between two populations i and j as the fraction of common genes that harbor significant tagSNPs (in a region of interest of 100 kb upstream and 10 kb downstream) in the two populations among the number of genes in the two populations, i.e.:

$$ {\varGamma}_{ij}=\mid {G}_i\cap {G}_j\mid /\mid {G}_i\cup {G}_j\mid $$

Where Gi and Gj denote the sets of genes that harbor significant tagSNPs that are found to be significantly associated with the diseases in populations i and j, respectively.

In order to quantify the overall overlap between the k populations, we define the cumulative gene overlap function γ(l) for 1 ≤ l ≤ k as follows:

$$ \gamma (l)=\mid {G}_l\mid \kern0.5em /\mid G\mid $$

Where Gl denotes the set of genes that harbor significant tagSNPs that are found to be significantly associated with the disease in at least l populations.

In order to map SNPs to Genes, all Refseq transcripts [71] for hg38 assembly are downloaded from UCSC Table browser and extended 100 kb upstream and 10 kb downstream. Refseq IDs are translated to HGNC symbols. SNPs are mapped to their positions on hg38 assembly through biomaRt [72] and are intersected with gene coordinates. Such intersections result in gene lists matching each population.

Functional overlap matrix

We define the Functional Overlap Ratio Φij between two populations i and j as the fraction of common biological processes that are enriched in genes harboring significant tagSNPs (+ 100 kb/− 10 kb) in populations i and j among the number of biological processes that are enriched in genes harboring significant tagSNPs (+ 100 kb/− 10 kb) in the two populations, i.e.:

$$ {\varPhi}_{ij}=\mid {F}_{\mathrm{i}}\cap {F}_{\mathrm{j}}\mid /\mid {F}_{\mathrm{i}}\cup {F}_{\mathrm{j}}\mid $$

Where Fi and Fj denote the sets of significant biological processes in populations i and j, respectively.

In order to quantify the overall overlap between the k populations, we define the cumulative functional overlap function ϕ(l) for 1 ≤ l ≤ k as follows:

$$ \phi (l)=\mid {F}_l\mid /\mid F\mid $$

Where Fl denotes the set of biological processes that are found to be significantly associated with the disease in at least l populations.

To identify biological processes that are enriched in T2D-associated SNPs for each population, we use enrichment analysis of gene sets using the WebGestalt R package [73, 74]. Biological Process terms are used and False Discovery Rate (FDR) threshold is set to 5%.

Network overlap matrix

For each population, we feed the loci that are found to be associated with the disease to the Prix-Fixe [20] network analysis tool. The Prix-Fixe tool uses a network based disease-associated subnetwork identification algorithm that uses genome-scale shared function networks to identify the most functionally coherent subnetwork of genes spanning the disease associated loci. Using shared function networks as a reference, the algorithm evaluates gene combinations, constraining the choice to one gene from each disease associated locus, for their shared function, hence similar to choosing compatible food items from a prix fixe restaurant menu, with one dish from each course. The algorithm outputs a list of genes, with their disease association scores, such that the top scored genes constitute the most coherent disease associated subnetwork in each population.

We define the Network Overlap Ratio Νij between two populations i and j as the fraction of common genes constituting subnetworks of protein-protein interaction networks that are enriched in genes harboring loci that are found to be associated with the disease in populations i and j, among all genes in populations i and j, i.e.:

$$ {N}_{\boldsymbol{i}\boldsymbol{j}}=\mid {\boldsymbol{N}}_{\boldsymbol{i}}\mathbf{\cap}{\boldsymbol{N}}_{\boldsymbol{j}}\mid /\mid {\boldsymbol{N}}_{\boldsymbol{i}}\cup {\boldsymbol{N}}_{\boldsymbol{j}}\mid $$

Where Ni and Nj denote the sets of genes constituting the most significant subnetworks in populations i and j respectively.

In order to quantify the overall overlap between the k populations, we define the cumulative network overlap function ν (l) for 1 ≤ l ≤ k as follows:

$$ \nu\ (l)=\mid {N}_l\mid /\mid N\mid $$

Where Nl denotes the set of genes constituting the subnetworks that are found to be most significantly associated with the disease in at least l populations.

Permutation tests

To test the statistical significance of our results, we create a thousand simulated data sets at each level of genomic granularity (SNPs, LD SNPs, Genes, Functions, Networks) with a simulated set size for each population matching that of the original data set. A thousand permutations are done and in each permutation, permuted overlap functions are calculated as described above. This results in 1000 overlap function vectors for k = 1 to 9 corresponding to each permuted data set. For each k, in each permuted set, the minimum, maximum and mean values of the overlap function across all permutations are calculated. These calculations are then used to visualize distribution of the overlap function in permuted datasets and assess the significance of the observed overlap based on this distribution.

Permutation tests for SNP overlap

In order to perform the permutation tests for SNP overlap, we prepare a SNP pool. For each study, the genotyping array model is identified. Array annotation files are downloaded from UCSC genome annotation database [75]. Further, SNPs from WTCCC and dbGAP populations are extracted and pooled together with SNPs from all arrays, this resulted in 2,028,276 unique SNPs. (Note: Imputation data is not used, since these studies do not provide the final imputation results after quality control). All the populations together with their SNPs constitute a SNP set, which is used as a pool for permutation tests. SNP sets are then randomly sampled, with replacement across populations, such that the size of each set matches that of the observed SNP set for each population.

Permutation tests for LD-SNP overlap

The SNP sets generated in each iteration of permutation tests for SNP overlap are provided to SNPsnap [69] tool for LD SNP extraction using the parameters used for the original data set.

Permutation tests for gene overlap

A gene pool is created from all human genes; all Refseq transcripts [71] for hg38 assembly are downloaded from UCSC Table browser and extended 100 kb upstream and 10 kb downstream and Refseq IDs are translated to HGNC symbols. All HGNC symbols matching Refseq IDs are combined and duplicates removed. The final HGNC list served as the pool for random gene sampling. In each of the thousand permutations, gene sets are randomly sampled from the gene pool, with replacement across populations, matching the size of the observed gene set.

Permutation tests for functional overlap

Enrichment analysis is done for one thousand randomly sampled gene sets. In each permutation enrichment analysis is performed with the same parameters as the actual enrichment analysis. At this point, two populations (L and Fr) were not enriched in any biological processes, so the two populations were dropped from the analysis and further analysis is done on k = 7 populations only.

Permutation tests for network overlap

The SNP sets generated in each of the permutation tests for SNP overlap are provided to Prix-Fixe [20] network analysis tool, which outputs the most coherent disease associated subnetwork for each population. This results in 1000 most coherent disease associated subnetworks for each population.