Our findings show that inclusion of genetic information from loci previously associated with quantitative risk factors for type 2 diabetes, but not primarily with diabetes, significantly increases the power to discriminate between people with and without clinically manifest type 2 diabetes. This emphasises the multi-factorial nature of type 2 diabetes and highlights the important potential role in disease development played by loci that do not reach a level of genome-wide significance in type 2 diabetes scans.
Our study was based on the premise that some loci capable of influencing diabetes risk, and thus of contributing to the discriminative power of type 2 diabetes GRSs, individually have weak effects on type 2 diabetes. As a result, they fall below the stringent significance thresholds used in genome-wide scans and have not previously been identified as diabetes-predisposing loci. We hypothesised that some of the loci reliably associated with traits that predispose to type 2 diabetes might, by virtue of this association, also raise the risk of type 2 diabetes. Hyperglycaemia is the cardinal feature of type 2 diabetes, providing sufficient justification for including glucose-raising alleles in a GRS for type 2 diabetes. However, it is worth noting that not all glucose-raising loci appear to influence type 2 diabetes risk [11], possibly because some loci may cause modest elevations in glucose concentrations that do not worsen over time, as observed in maturity-onset diabetes of the young [17]. Obesity is also a well-established risk factor for diabetes, as illustrated in clinical trials where weight loss interventions have substantially reduced the incidence of the disease in high-risk individuals [18, 19]. For dyslipidaemia, the mechanisms of association with type 2 diabetes primarily involve insulin resistance caused by the infiltration of insulin-sensitive tissues by triacylglycerol and other lipid metabolites [20, 21]. Two important organs in this regard are muscle and liver, the former because of its predominance as a site for glucose uptake and metabolism, and the latter because of its major role in glucose production. Studies in non-obese individuals with a strong family history of type 2 diabetes have provided experimental evidence that elevations in NEFA directly impair muscle glycogen synthesis and glucose uptake, and induce muscle, hepatic and adipose tissue insulin resistance in a genetically determined manner [3].
Prospective epidemiological studies indicate that dyslipidaemia early in life [22, 23] or during adulthood [24] raises the risk of developing type 2 diabetes later in life, but such associations may be driven by obesity [22] rather than a lipid-specific genetic defect. Nevertheless, animal and human studies suggest a shared genetic basis for diabetes and dyslipidaemia. For example, expression of the HDL-associated apolipoprotein M is completely abolished in the liver of mice lacking the HNF1A gene [25]; mutations in HNF1A also cause maturity-onset diabetes of the young class 3 [26]. Epidemiological studies have also identified genetic loci that influence dyslipidaemia and glucose homeostasis or type 2 diabetes [25, 27–30]. Although these joint relationships are unlikely to result from confounding, it remains unclear whether they reflect causal relationships between dyslipidaemia and diabetes, or pure genetic pleiotropy. Similarly, one cannot easily determine whether the cumulative association between lipid loci and diabetes in the present study is attributable to (1) dyslipidaemia mediating the effects of the genotypes on diabetes risk; (2) purely pleiotropic effects; or (3) a combination of these explanations. Notwithstanding these limitations of interpretation, the use of a priori biological information to help filter genome-wide scan results minimises the multiple testing burden inherent in hypothesis-free whole-genome genetic association studies and may raise the prior probability of association, hence helping to preserve statistical power.
To minimise over-fitting of our models, prior evidence of association from the DIAGRAM dataset [12] was used to code the effect alleles in the ROC analyses presented here. Fitting the alleles in this way did not yield markedly different ROC AUCs from those obtained when alleles were fitted directly to the current dataset, indicating that our data are unlikely to be markedly over- or under-fitted. We were unable to include all currently identified risk alleles for the traits of interest, partly because the rate at which new risk variants have been discovered outpaced our study and partly because resources were limited. Although initially presumed otherwise [31], LOC387761 is unlikely to be a true diabetes locus and could thus have been excluded from our models without diminishing the discriminative power. It is also important to highlight that there are many other antecedent traits for type 2 diabetes beyond those studied here, e.g. HbA1c, fibrinogen and adiponectin; if variants associated with such traits were to be included in a GRS, the discriminative power would probably increase further.
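The allele-coding step described above can be illustrated with a minimal sketch. It assumes an unweighted allele-count score in which the direction of each effect allele is taken from the sign of an externally estimated effect, and it computes the ROC AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. The function names, the toy genotypes and the external effect values are hypothetical and are not taken from the study or from DIAGRAM.

```python
from typing import List, Sequence


def risk_allele_count(dosage: float, external_beta: float) -> float:
    """Count (0-2) of the allele that raises risk according to the external dataset.

    `dosage` counts the reference-coded allele; if the external effect is
    negative, the other allele is the risk allele, so the count is flipped.
    """
    return float(dosage) if external_beta >= 0 else 2.0 - float(dosage)


def genetic_risk_score(dosages: Sequence[float], external_betas: Sequence[float]) -> float:
    """Unweighted GRS: sum of externally oriented risk-allele counts."""
    return sum(risk_allele_count(d, b) for d, b in zip(dosages, external_betas))


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as P(case score > control score), counting ties as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical external effect estimates for three SNPs (sign gives direction only).
external_betas = [0.10, -0.05, 0.08]
# Hypothetical allele dosages for four participants (rows) at the three SNPs.
genotypes: List[List[int]] = [[2, 0, 1], [1, 1, 1], [0, 2, 0], [1, 0, 2]]
labels = [1, 1, 0, 0]  # 1 = case, 0 = control (toy data)

scores = [genetic_risk_score(g, external_betas) for g in genotypes]
auc = roc_auc(scores, labels)
```

Because the allele orientation is fixed before the outcome data are examined, the score cannot adapt to chance patterns in the analysed sample, which is the sense in which external coding guards against over-fitting.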
The derivation of GRSs using the approach applied here requires complete genotype data in the population in which the score is computed. Because the genotype success rates were less than perfect in our study (as in virtually all studies) and genotyping failures were randomly distributed across the selection of SNPs in this cohort, it was necessary and appropriate to impute missing genotypes. The alternative would have been to use a sample set in which directly genotyped data were available for all SNPs. However, because missing genotypes were randomly distributed across the study sample, around half of all participants were missing data on at least one of the 73 SNPs. Thus, restricting the analysis to the subgroup with complete directly genotyped data would have resulted in a considerable loss of statistical power and could have led to biased conclusions about the magnitude of association for the GRSs.
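The scale of this loss follows from simple arithmetic, sketched below under two simplifying assumptions that are not stated in the study: genotyping failures are independent across SNPs, and every SNP has the same call rate. The 99% call rate used in the example is hypothetical; only the figure of 73 SNPs comes from the text.

```python
def fraction_with_any_missing(call_rate: float, n_snps: int) -> float:
    """Expected fraction of participants missing at least one genotype.

    Assumes independent failures and an identical per-SNP call rate, so the
    probability of being complete on all SNPs is call_rate ** n_snps.
    """
    return 1.0 - call_rate ** n_snps


def call_rate_for_fraction(fraction_missing: float, n_snps: int) -> float:
    """Per-SNP call rate that would leave `fraction_missing` of participants incomplete."""
    return (1.0 - fraction_missing) ** (1.0 / n_snps)


N_SNPS = 73  # number of SNPs in the score, as stated in the text

# Hypothetical per-SNP call rate of 99% (an assumption, not a reported value).
incomplete_at_99 = fraction_with_any_missing(0.99, N_SNPS)

# Call rate at which exactly half of participants would have incomplete data.
call_rate_for_half = call_rate_for_fraction(0.5, N_SNPS)
```

Under these assumptions, even a per-SNP call rate of about 99% leaves roughly half of participants with at least one missing genotype, which is consistent with the proportion reported above and shows why a complete-case analysis would discard so much of the sample.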
A further consideration is whether our findings are likely to be attributable to confounding. With the exception of linkage disequilibrium between observed non-functional loci and unobserved functional loci, statistical associations between germline genetic variants such as SNPs and phenotypes are generally robust to confounding in ethnically homogeneous cohorts such as that studied here. Therefore, the associations reported here are unlikely to be confounded.
Our study is clearly a hypothesis-generating effort, and robust type 2 diabetes effect sizes for most of the GRSs of interest in this report are absent from the published literature. As such, meaningful a priori power calculations could not be performed for this study, and post hoc power calculations would be inappropriate, as discussed at length elsewhere [32–34]. The fact that most of the associations reported for the GRS models are highly statistically significant indicates that our study was well powered to detect the observed effects, although this is a circular argument and one important reason why post hoc power calculations are often discouraged.
Finally, owing to the cross-sectional study design, we were unable to calculate the reclassification index attributable to the different genetic models, which would be valuable when considering a possible clinical application. One should also consider that in cross-sectional studies, in which cases and controls are phenotypically highly distinct, estimates of discriminative power may exceed estimates of predictive power derived from prospective studies.
In conclusion, polymorphisms that affect diabetogenic traits, but which are not conventionally considered to be diabetes-predisposing loci, significantly improve the discriminative power of a conventional GRS for type 2 diabetes. This is the case even though, on an individual basis, most variants have weak effects that were not statistically significantly associated with type 2 diabetes in our study. Nevertheless, the discriminative power of the GRS remains below a level many would consider clinically useful; thus, validated non-genetic prediction algorithms remain the most appropriate tools for predicting type 2 diabetes in the clinical setting.