Background

It is estimated that by the year 2030, there will be ~366 million people affected by Type 2 diabetes (T2D) worldwide [1], with many of those affected lying in the middle to late adult years group [2]. T2D is genetically heterogeneous disease by the complex interplay of several environmental factors and susceptibility genes [3]. Single-nucleotide polymorphism (SNP) exhibits an abundant form of genetic variations. SNPs can be distinguished from other rare variations by more than 1% frequency in the human population when a single nucleotide replaces one of the three nucleotides. The human genome contains about 10~30 million SNPs with an average SNP every 100~300 bases. More than 5 million human SNPs have been identified and the information is publicly available (NCBI dbSNP Build 129). A SNP in a protein coding sequence (CDS) can induce amino acid changes, resulting in functional changes in the protein. Some SNPs in a promoter region can effect transcriptional regulation, and a SNP in an intron region can affect the splicing or expression of the gene.

In recent years, genome-wide association studies (GWAS) have identified a large number of robust associations between genetic variation and complex human disease, such as Type 2 diabetes and rheumatoid arthritis [4]. These approaches have identified common genetic variants that are associated with the risk of more than 40 diseases and human phenotypes [5]. In the T2D studies, candidate gene or genome-wide association approaches have suggested various putative T2D susceptibility SNP variants in various genes including TCF7L2, PPARG, KCNJ11, CDKN2A/B, FTO, CDKAL1 and so on [610]. But individual susceptibility of SNP variants may be disappointingly small or nowhere near enough to explain estimates of heritability [11]. One possible explanation for these weak relative risks and low attributable risks is that the risk may vary across different groups of clinically and biologically distinct T2D; further, analyzing T2D as a single disease may obscure the association with these risk factors. Another possible explanation is the effects of gene-gene (SNP-SNP) interactions. Most complex diseases result from the poorly understood interaction of genetic-genetic and genetic-environmental factors. The biological phenomenon associated with T2D that are modestly affected by a single SNP might be much greatly affected by a SNP in combination with additional SNPs in genes derived from the same or distinct biological pathways. In other words, it is difficult to identify disease-linked variants that are too rare to be picked up by association methods and yet have risk alleles of sufficient effects to allow detection with the use of existing statistical strategies [12]. A marker strongly related to risk does not guarantee effective discrimination between cases and controls [13].

A goal of this research is to assess the rules for classifying the case (T2D) and control (non-T2D) groups along with considering the potential gene-gene (SNP-SNP) interactions. Since it is considered that the SNPs are less influential toward the onset or development of T2D than combinations of SNPs, our interest is specially focused on the classification of SNPs that can detect the putative effects of genetic interactions. Small effects that could, when combined, have a significant impact on someone's health including onset of T2D, thus to get overall view of risk, the effects of the individual SNPs have to be combined [11]. There are several researches designed to examine the effect of combined SNPs to disease risks. Some methods have used the multifactor dimensionality reduction (MDR) algorithm, which identifies all the possible combinations of SNPs from a set of given SNPs, and the combination of SNPs that optimally predicts the risk by minimizing the classification error of cases and controls is finally selected [14]. Goodman and colleagues formulated a polymorphism interaction analysis (PIA) method, which examines all the possible SNP combinations (similar to MDR) among 94 SNPs in 63 genes studied in 216 male colon cancer cases and 255 male controls. They employed two separate functions that cross-validate and minimize the false-positive results in the evaluation of SNP combinations to predict the risk of colon cancer [15].

In the present research, we analyzed the associations between the combination of SNPs and T2D using a Support Vector Machine (SVM) - a machine learning algorithm. Classification based on the SVM has several applications in bioinformatics and computational biology [1620]. It has been widely used to predict protein secondary structures [21], solvent accessibility [22, 23], protein-protein binding sites [24], remote protein homology detection [25], detection of non-coding RNA [26], protein domains [27], protein subcellular localization [2830], discrimination of outer membrane protein [31], and gene and tissue classification from the microarray expression data [32].

Several researchers have recently applied this powerful machine-learning algorithm--SVM--to the problem of identifying combinations of SNPs that can predict the susceptibility toward diseases. Listgarten and colleagues [33] considered the SNPs from 45 genes of potential relevance to breast cancer etiology in 174 patients as compared to the matched normal controls. They obtained an accuracy of 69% when using SVMs as the learning algorithm. They concluded that multiple SNPs from different genes over distant parts of the genome are better at identifying breast cancer patients than any single SNP alone. Waddell et al. (2005) have applied SVMs to predict the susceptibility to multiple myeloma. Their work provided 71% accuracy on a dataset containing 40 cases and 40 controls. Very recently, Uhmn et al. (2009) applied several machine learning techniques including SVM to predict patients' susceptibility to chronic hepatitis from SNPs [34].

In this research, we analyzed the importance of gene-gene interactions on T2D risk by investigating 408 SNPs from 87 genes involved in major T2D-related pathways in a sample of 462 T2D cases and 456 healthy population controls. We applied the SVM to discriminate cases and controls with SNP combination information by means of a 10-fold cross-validation test. From the target population, we achieved 65.3% prediction rate with a combination of 14 SNPs from 12 genes using the RBF (Radial Basis Function) kernel SVM. We also investigated men and women sub-population datasets using the same method, and identified some different combinations of SNPs with prediction rates of 70.9% and 70.6%, respectively. For more precise identification of gene-gene interaction information in a biological manner, we may need more precise well-characterized sub-population datasets. In order to refine the genetic-environment relationship, more information is required in an epidemiological investigation. Besides existing statistical methods, we demonstrated the feasibility of incorporating SVM - a machine learning algorithm into case-control study.

Results

Case-Control Association Study

For each SNP, the p-value was calculated based on a chi-squared test. Based on the test results of 408 SNPs, 27 SNPs showed a significant genotype- or allele-based p-value (< 0.05) (Table 1). The -log10 p-value from association result of SNPs was plotted in each chromosome and the significant SNPs are circle shape (Figure 1).

Figure 1
figure 1

Chromosome distribution for association of SNPs with Type 2 Diabetes. In this panel, Manhattan plot shows distributed variables that were generated by genome-wide significance (p-values). Triangle markers at each locus indicate the significant SNPs based on a chi-squared test (p value < 0.05). Boxed SNPs represent Top 3 ranked lists.

This candidate-gene based analysis may have some limitations to detect association from the small population size (462 cases and 456 controls) and the limited number of candidate genes (87 putative T2D-related genes). The result of this classical case-control association study may need a further replication study with a large independent target population of cases and controls for establishing the credibility of a genotype-phenotype association. We used this classical association study result in the process of sub-dataset filtering based on the genotype-based p-value range (Table 2).

Table 1 Summary of the association study (genotype- or allele-based p < 0.05)
Table 2 Prediction rate of combinations of SNPs with genotype-based p-value filtering

Combination of SNPs

We performed SVM training and test analysis to find the best combination of SNPs. The prediction rates were determined by the SVM classifier that discriminated the case-control SNP genotype vectors. At first, we acquired 63.6% of the overall accuracy with the entire 408 SNP dataset, but we found that the p-value-based filtering method is useful for obtaining a better prediction rate. The prediction rate of a higher p-value SNP dataset (Table 1) did not show the best result (57.6%). This effect might be attributed to the different effects between a single SNP and within a combination of SNPs. This p-value-based filtering can reduce the search space for gene-gene interactions from a very large number of all possible combinations of SNPs to a manageable dataset.

Another reason is the limitation of the forward selection method to find the best combination of SNPs. The entire set of 408 SNPs may contain noise SNPs for forward selection, and some useful SNPs in the ideal combination may be removed from the very restricted p-value-based filtered SNP dataset (e.g., 24 SNPs with p < 0.05).

The best prediction rate of the SVM classifier with a RBF kernel function was 65.3% with 14 SNPs including a combination from the 240 SNPs with p < 0.6 (Table 2 and Table 3). In table 3, rs343 was reported the association with T2D [35], and two of SNPs (rs2070011 and rs2243250) were reported with not T2D but myocardial infarction [36, 37]. Furthermore, sub-population datasets of men and women with the RBF kernel, which were designed to discriminate case and control, yielded slightly better prediction rates of 70.9% and 70.6%, respectively, than that of the total population dataset (Table 4, Table 5 and Table 6). These prediction rates are almost similar with other previous studies using SVM, for example 69% of Listgarten and colleagues [33], 67.5% of Uhmn et al. [34], or 53% of Schwender et al. [38]. But, these previous works used different disease samples and different cross-validation test, thus it is difficult to compare these prediction rates directly. Considering other environmental and genetic factors involved in the development of T2D, the prediction performance was reasonably acceptable. It may be presumed that including other important genes and clinical factors including family medical history, we would obtain more improved prediction rate in the future. Different results between the entire target population and men or women sub-population may arise from the effect of the dataset's size or the well-characterized sub-population grouping.

Table 3 List of 14 SNPs for the best combination of SNPs
Table 4 Prediction rates of the SVM classifiers with different target populations
Table 5 List of 12 SNPs for the best combination of SNPs (men)
Table 6 List of 19 SNPs for the best combination of SNPs (women)

We could not find better prediction results by the above p-value-based filtering as that in Table 2 with men and women sub-population datasets. This result with a slightly improved prediction rate may arise from the effect of a smaller size of sub-datasets (n = 405 and 513) or the effect of well-characterized (gender-distinguished) sub-population datasets.

Protein-Protein Interaction Information

On the basis of the results of the combinations of SNPs, we attempted to find any biological information; one of the results is the protein-protein interaction (PPI) network (Figure 2), which was constructed from the results of the combinations of SNPs. Each set of the SNP genotype data was not acquired from the fine mapping association study; therefore, direct SNP-SNP interaction or SNP analysis focused on each promoter SNP, intron SNP, or exon SNP is difficult. This is the reason why we carried out the analysis at the protein (gene) level in this research (not the SNP level).

Figure 2
figure 2

PPI network from the SNP combination of (a) total population set, (b) men sub-population set, and (c) women sub-population set. The largest PPI network at each population set was constructed using PPI information database http://genomenetwork.nig.ac.jp. Circles are included in the best combination of SNPs from SVM results and square proteins are included to construct circle-circle connected proteins network.

The genomenetwork platform http://genomenetwork.nig.ac.jp provides protein-protein interaction network from the Y2H experimental data and the public databases (BIND, MINT and HPRD). Also, it has interaction property and gene annotation information. We obtained gene interaction information from PPI database. Circles (proteins) are included in the results of the combination of SNPs, and circles are collected from the entire PPI information database to connect with the squares. The construction of an indirect PPI network of two proteins is unnecessary from the biological viewpoint; therefore, we permit only two or fewer proteins (squares) between two proteins (circles) in Figure 2. We could easily find the same proteins among the target population datasets and the target-population-specific proteins in these three PPI networks. PPI network of Figure 2a contains 7 genes from the SNP combination result of 12 genes (14 SNPs) in Table 3. Other PPI network of Figure 2b and 2c contains 4 genes and 6 genes from the SNP combination results of men and women sub-data sets in Table 5 and 6, respectively. IL4 (interleukin 4) gene is the common gene among these three PPI networks and IL4, INSR, and IRS1 genes are common between Figure 2a (total population set) and 2c (women sub-population set).

Discussion

It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations instead of a single genetic variation. The gene-gene (epistatic) interactions of SNPs are believed to be very important in determining individual susceptibility to complex diseases. Thus, it is desirable to develop an effective method to search gene-gene interactions in human genome data. Recently, some computational methods have been proposed to address this issue using Multifactor Dimensional Reduction (MDR), or machine learning algorithms [39]. To study complex disease such as T2D, it is possible that many genes contribute to a T2D by their interaction with other genes, while main effects of the individual gene may be small or absent. Therefore, we developed the method that specifically designed to detect multiple disease SNPs, possibly on different chromosomes using SVM. This approach could be useful for identification of potential disease markers which genotype patterns are significantly associated with a high susceptibility.

This analysis includes the SNPs information of 87 T2D-related genes from fatty acid binding/translocation, GLUT4 translocation, and insulin signal pathways. A primary function of insulin is to stimulate the transport of glucose into target tissue, prominent among which are skeletal muscle, cardiac muscle, and adipose tissue. Insulin achieves this effect by inducing the translocation of GLUT4 glucose transporters from an intracellular vesicular compartment to the plasma membrane. Under basal condition, GLUT4 cycles between this intracellular compartment and the plasma membrane. SNAP23 is required for insulin-induced GLUT4 translocation to the plasma membrane and that it mediates the formation of a complex between syntaxin4 and VAMP2 [40].

T2D results from impairment in both insulin sensitivity and insulin secretion. Several genes have been implicated that might contribute significantly to the risk of T2D, including TCF7L2, PPARG, KCNJ11, CDKN2A/B and so on [6, 810]. T2D is one of the typical complex disease (polygenic disorder), which likely associated with the effects of multiple genes (SNPs) in combination with lifestyle and other environmental factors. In this research, we analyzed the candidate genes data set, thus the result does not contain these known significant SNP markers, including TCF7L2, PPARG, KCNJ11 and so on.

The result from Table 1 (all cases, combined versus controls) indicated weak associations with the risk factors investigated. This led us to stratify by sub-grouping by gender to see whether some potential associations may have been obscured by considering T2D as one disease. In this research, we first made two subpopulation data sets by gender (Table 4, and Figure 2). Epidemiological evidence suggests that sex differences exist in T2D. The prevalence of T2D is higher in men than women. Globally, diabetes prevalence is similar in men and women but it is slightly higher in men < 60 years of age and in women at older ages [1]. This difference may possibly result from the differences in insulin sensitivity and regional body fat deposition [41, 42].

Yeh et al. [43] used a conditional knockout strategy to generate androgen receptor (AR) knockout mice to study the relationship between androgen-AR and insulin sensitivity, and Lin et al. reported the influences of loss of AR on insulin and leptin resistance. Loss of AR may contribute to an increase of leptin levels and leptin resistance, which may play important roles for the development of obesity and insulin resistance. Important factors such as age at onset of T2D can also be incorporated in the modeling to further partition phenotypic variation or for defining subtypes of the phenotype.

As high-throughput technology for genome-wide SNP genotyping (500 K or 1 mega) improves and as more SNPs are identified, it is likely that much higher prediction rate will be achieved and a useful clinical system developed. For the biologically more precise identification of gene-gene interaction's effect for T2D, we may need more precise well characterized subpopulation data sets and more powerful computational power and method and so on. Besides existing statistical methods, we demonstrated the feasibility of incorporating SVM - a machine learning algorithm into case-control study. We plan to develop the method using machine learning algorithm in the future to search gene-gene interactions for our new Genome-Wide Association Study (GWAS) data [44].

Conclusions

We have found novel association between combinations of SNPs and T2D in a Korean population. We proposed gene-gene interaction considering candidate genes association study using SVM based feature selection method in this research.

Methods

Data and Data Preprocessing

Our dataset consists of 408 SNP data distributed over putative 87 T2D-related genes in 462 cases (patients) and 456 normal controls. The T2D cases, confirmed and diagnosed in the Ansan and Ansung cohort study area, were identified from the Korean Health and Genome Study (KHGS). The Ansan area primary represents an urban community, whereas the Ansung area represents a rural community in Korea. These two cohort studies include information on 87 T2D-related genes from fatty acid binding/translocation, GLUT4 (insulin-responsive glucose transporter 4) translocation, and insulin signal pathways. We selected these three pathways with reference survey [4548]. Among identified polymorphisms, 408 SNPs were selected based on location (CDS (protein coding sequence), intron, UTR (Untranslated region), promoter etc), frequency, linkage disequilibrium (LD) status and so on. The number of SNPs occurring in the CDS, intron, UTR, near gene region (promoter) and intergenic region was 77, 169, 53, 73 and 36, respectively (additional file 1). In this research, 462 cases were defined from two cohort studies as T2D subjects according to the World Health Organization (WHO) criteria. The 456 unrelated normal control people have no history of T2D, no first-degree relatives with T2D, fasting plasma glucose level less than 126 mg/dL, plasma glucose level 120 min after glucose ingestion of less than 140 mg/dL, and HbA1C level (glycosylated hemoglobin) of less than 5.8%. Further, the normal control people do not have a history of diabetes, hypertension, and dyslipidemia. In this study, all the people of case and control were more than 60 years of age.

For each SNP, the p-value was calculated based on a chi-squared test without adjustment for other confounding variables (Table 1). In this paper, we applied SVM to predict the susceptibility to T2D using SNP genotype data. From the view point of binary classification, we treated T2D cases as positive samples and controls as negative samples, and we used SNP variants as categorical features that have three possible genotype values at a locus. Usually, a SNP genotype is represented by a number that matches 1, 2, or 3, where 1 represents the homozygous site with a major allele, 2 represents a heterozygous site, and 3 represents a homozygous site with a minor allele [33]. Waddell et al. (2005) have applied SVMs to predict the susceptibility to multiple myeloma using -1, 0, 1, where 0 represents a heterozygous site and -1 and 1 arbitrarily represent homozygous sites. The preprocessing method used in this research was the same as that used by Listgarten et al. (2004) [33].

Support Vector Machine

A SVM is a learning algorithm that learns a classifier from a set of positively and negatively labeled training vectors, which can be used to classify new unlabelled test samples. The SVM learns the classifier by mapping the input training samples into a possibly high-dimensional feature space, and seeking a hyperplane in this space that separates the two types of examples with the largest possible margin, i.e., the distance to the nearest points. If the training set is not linearly separable, the SVM finds a hyperplane that optimizes the trade-off between good classification and large margin with a slack variable and kernel trick. For an actual implementation, we used the freely downloadable SVM-light package [49]. We tested linear, polynomial, and radial basis function (RBF) kernels with various parameters, and the final results were acquired with the RBF kernel and parameter gamma 1 that yielded the best prediction rate. We treated T2D cases as positive samples and controls as negative samples, and used SNP genotypes as categorical features. We adopted SVM to discriminate T2D cases against controls in this research.

Feature Selection

For large datasets, an exhaustive consideration of all the possible SNP combinations can become computationally infeasible. Therefore, we employed a feature selection procedure to find the best putative combination of SNPs according to forward selection for handling datasets with a genotype of SNPs. In population studies, this kind of selection of informative SNPs was usually developed for population identification [50]. In this work, forward selection was started by first selecting the SNP feature that yielded the best fit for the independent test set using SVM training and test at a time. This SNP feature was used to test all the combinations with the remaining 407 (408 - 1) SNPs in order to find the best pair of SNP features. This process continues step by step until increasing the size of the current subset leads to a lower overall accuracy. We adopted the 10-fold cross-validated classification accuracy for the selection criteria in this work. The requirement of the best prediction rate at each step yields the highest overall accuracy with regard to both sensitivity and specificity (≥ 0.45). The purpose of this requirement is to avoid the extremely low sensitivity or specificity with the highest overall accuracy.

Since we have a relatively small number of people (462 cases and 456 controls) in our dataset, it is expected that training with the complete set of 408 SNP features may cause overfitting. Hence, we performed forward selection with SNP genotype features to find a good smaller feature set (a combination of SNPs). Note that forward selection does not necessarily find the best combination of SNPs. However, it usually results in a combination that comes close to the optimum solution, and it needs relatively less computational complexity. If we have a smaller datasets and a more powerful computer, step-wise feature selection may be a better method than the forward selection method in this study.

Cross-Validation Test

The prediction rates of the SVM classifiers were examined by the 10-fold cross-validation test, where each case and control dataset is randomly divided into 10 subsets of approximately the same size. The SVM classifiers were trained 10 times, leaving out one of the subsets from the training each time. This single subset was used to estimate the prediction rate of the trained SVM classifier. The prediction rate of the SVM classifiers was evaluated using three measures, namely, sensitivity, specificity, and overall accuracy.

where TP, FP, TN, and FN refer to the number of true positives, false positives, true negatives, and false negatives statuses (case or control), respectively. Sensitivity measures the ability to correctly predict T2D cases, while specificity measures the ability for correctly reject controls. The kernel functions and parameters for the classification algorithms were optimized during the 10-fold cross-validation tests, while avoiding overfitting problems.