Characteristics of the meta-analysis cohorts
Baseline characteristics of the cohorts participating in the discovery meta-analysis and replication are presented in Table 2. The mean age at baseline ranged from 50.3 to 62.7 years across cohorts, and the proportion of men ranged from 42% to 68.1% for both incident type 2 diabetes cases and controls. The mean follow-up time between DNA methylation measurements in blood and type 2 diabetes diagnosis ranged from 6.25 to 10.5 years across cohorts. Already at baseline, we observed a higher mean BMI in incident type 2 diabetes cases compared with controls in all cohorts. Similarly, baseline indicators of hyperglycaemia (i.e. fasting glucose and/or HbA1c) were higher in incident type 2 diabetes cases compared with controls in ESTHER, KORA1, EPIC-Norfolk and LOLIPOP. We observed differences in smoking status between incident type 2 diabetes cases across cohorts, with the proportion of current smokers ranging from 9.4% in LOLIPOP to 36.4% in the Doetinchem cohort (Table 2).
Discovery meta-analysis results
Combining the results of the five discovery EWASs, we identified 76 genome-wide significant DMS using model 1 (λ = 1.189; QQ plots per cohort and for the whole meta-analysis for all models are presented in ESM Fig. 1). Of these, 63 DMS have not previously been reported to be associated with incident type 2 diabetes. The 76 DMS were annotated to 65 genes. Some of these genes had multiple CpG sites annotated to them: LGALS3BP (5); ABCG1 (3); SYNGR1 (3); SLC9A1 (2); PFKFB3 (2); and NAP1L4 (2) (Table 3). The results are summarised in a Manhattan plot (Fig. 1), showing the distribution of CpG sites across the genome. Based on principal component analysis (PCA) performed in the Doetinchem dataset, 32 out of the 76 CpG sites were considered independent signals (90% of variance explained). CpG site cg11800635 was listed as a probe with potential cross-hybridisation and 11 CpG sites were listed as polymorphic CpGs (Table 3). However, for the eight of those 11 CpG sites that were available in the Doetinchem dataset, we found no evidence of multimodal methylation distributions, suggesting a lack of confounding by the underlying SNP (dip-test p values 0.5–0.99). Of the 76 DMS identified, 20 (26%) showed I² > 60%, suggesting considerable heterogeneity between studies (p<0.05; Table 3); for each of these 20 CpG sites, we made forest plots (ESM Fig. 2). Despite the high, statistically significant heterogeneity estimates, only one site showed a difference in the direction of the association between cohorts (cg19169154 in KORA1; I² = 66.2%). KORA1 also showed large differences in effect size for cg19693031 (I² = 89.2%) and cg11269166 (I² = 79.7%). For some sites, two clusters of cohorts with similar effect sizes seemed to be present (e.g. cg24678869 [I² = 71.4%]). Otherwise, effect estimates were broadly consistent between cohorts.
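The heterogeneity statistics quoted above can be illustrated with a minimal sketch. It assumes a fixed-effect, inverse-variance weighted meta-analysis, which this section does not state explicitly. The function name `fixed_effect_meta` and the input values are hypothetical and are not study data.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted pooling with Cochran's Q and I^2 (in %).

    Assumption: a fixed-effect model. Inputs are per-cohort effect
    estimates and standard errors for a single CpG site.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of total variation attributed to between-cohort heterogeneity,
    # truncated at zero when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Two-sided p value from the normal distribution; erfc stays accurate
    # for the very small p values typical of genome-wide thresholds.
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {"beta": pooled, "se": pooled_se, "p": p_value, "Q": q, "I2": i2}


# Hypothetical per-cohort estimates for one CpG site (not study data).
result = fixed_effect_meta([0.0, 0.2], [0.1, 0.1])
print(result)
```

Under this definition, I² above 60% means that more than 60% of the observed variation in per-cohort estimates exceeds what sampling error alone would be expected to produce.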
Table 3 The 76 genome-wide significant DMS for incident type 2 diabetes from meta-analysis based on five European discovery cohorts
As a sensitivity analysis, we evaluated the impact of smoking and follow-up time from sample collection until type 2 diabetes diagnosis. This additional adjustment (model 1.1) reduced the number of significant DMS from 76 to 47 (ESM Table 1; follow-up time was not available for EPIC-Norfolk non-cases and LOLIPOP). Adjustment for baseline BMI (model 2) and for BMI, smoking and follow-up time (model 2.1) reduced the number of significant DMS associated with incident type 2 diabetes from 76 to 4 and 3, respectively (still including the two top CpG sites at the TXNIP and ABCG1 genes; ESM Tables 2 and 3). The attenuation of effect sizes across all models per CpG site is presented in ESM Table 4. Mean attenuation across all 76 CpG sites was 3% in model 1.1, whereas in models 2 and 2.1 the mean attenuation of effects was 22% and 26%, respectively. The correlation of effect sizes between models for all 76 DMS was very high, varying between 0.98 and 0.99 (ESM Fig. 3).
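As a rough illustration of how attenuation and between-model correlation of effect sizes might be computed, the sketch below assumes that attenuation is the percentage reduction in effect size relative to model 1. This definition is an assumption, since the section does not give the formula. All function names and numbers are hypothetical.

```python
import math


def percent_attenuation(beta_base, beta_adjusted):
    """Percentage reduction in effect size relative to the base model.

    Assumed definition: (1 - beta_adjusted / beta_base) * 100.
    """
    return (1.0 - beta_adjusted / beta_base) * 100.0


def mean_attenuation(base_effects, adjusted_effects):
    """Average the per-site attenuation over all CpG sites."""
    values = [
        percent_attenuation(base, adjusted)
        for base, adjusted in zip(base_effects, adjusted_effects)
    ]
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of effect sizes."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical effect sizes for two CpG sites (not study data).
base = [1.0, -2.0]
adjusted = [0.5, -1.5]
print(mean_attenuation(base, adjusted))
print(pearson_r(base, adjusted))
```

A high correlation together with a non-zero mean attenuation would indicate that adjustment shrinks the effects fairly uniformly without changing their ranking.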
Comparison with previous EWASs of incident and prevalent type 2 diabetes, lipids, BMI and BP
Previously, 13 of the 76 DMS had been reported to be associated with incident type 2 diabetes [8, 12] and nine with prevalent type 2 diabetes [11, 24], all with consistent directions of effect (ESM Table 5). Furthermore, 33 of the 76 DMS (43%) overlapped with BMI EWAS results [21, 27–30], with consistent direction of the effects, and 12 DMS (16%) overlapped with blood lipid EWAS results, including triacylglycerols, total cholesterol, LDL-cholesterol and HDL-cholesterol [25, 26]. Additionally, five DMS (7%) had previously been reported in EWASs on BP [22, 23] (ESM Table 5).
Replication
Out of the 76 genome-wide significant DMS, 64 (84.2%) showed significant, directionally consistent association with incident type 2 diabetes in Indian Asians in model 1 (p<0.05; ESM Table 6). Using models 1.1, 2 and 2.1, 40 out of 47 (85%), three out of four (75%) and two out of three (67%) DMS, respectively, were replicated in the LOLIPOP cohort (ESM Tables 1–3). Although we observed a substantial attenuation of effect sizes of 47% in our replication (ESM Table 4), the correlation of effect sizes between discovery and replication stages was high (r = 0.91; ESM Fig. 3). Next, we combined the effects from the discovery and replication cohorts for the 76 DMS in a meta-analysis. In model 1, 63 DMS showed genome-wide significant associations with incident type 2 diabetes (p<1.1 × 10⁻⁷), whereas in models 1.1, 2 and 2.1 the number of genome-wide significant DMS increased, respectively, from 47, 4 and 3 in discovery only to 59, 18 and 10 in discovery and replication combined (ESM Table 6). Despite the high replication rate of 84.2%, we did observe considerable heterogeneity between discovery and replication, greater than that seen between discovery cohorts alone (in model 1, 53% of DMS showed significant [p<0.05] heterogeneity in combined analysis compared with 26% in discovery cohorts only).
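The replication criterion described above (nominal significance at p<0.05 with the same direction of effect as in discovery) can be sketched as follows. The function names and the example records are hypothetical and are not study data.

```python
def is_replicated(beta_discovery, beta_replication, p_replication, alpha=0.05):
    """A site replicates if the effect has the same sign and p < alpha."""
    same_direction = beta_discovery * beta_replication > 0
    return same_direction and p_replication < alpha


def replication_rate(sites, alpha=0.05):
    """Percentage of sites meeting the replication criterion.

    `sites` is a list of (beta_discovery, beta_replication, p_replication).
    """
    replicated = sum(
        1 for beta_d, beta_r, p_r in sites if is_replicated(beta_d, beta_r, p_r, alpha)
    )
    return 100.0 * replicated / len(sites)


# Hypothetical records for four CpG sites (not study data).
example_sites = [
    (0.30, 0.15, 0.001),   # same direction, significant: replicated
    (-0.20, -0.10, 0.03),  # same direction, significant: replicated
    (0.25, -0.05, 0.01),   # opposite direction: not replicated
    (0.10, 0.04, 0.20),    # same direction, not significant: not replicated
]
print(replication_rate(example_sites))

# The reported rate for model 1 follows from the counts given in the text.
print(round(100.0 * 64 / 76, 1))
```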
The MRS based on 76 CpG sites showed limited predictive ability for incident type 2 diabetes (model M1, AUC = 0.591) in the LOLIPOP cohort (ESM Fig. 4). Moreover, the addition of the MRS to a prediction model including established predictors of type 2 diabetes (age, sex, BMI and HbA1c) showed no improvement (model M2, AUC = 0.753 vs model M3, AUC = 0.757). Additional adjustment for cell type distributions in these models did not change these conclusions (models M4, M5, M6). In the Doetinchem cohort we observed a slight improvement in AUC after adding an MRS based on genome-wide significant CpG sites (model M1 [age, sex, BMI, cell types, batch], AUC = 0.735; model M2 [age, sex, BMI, cell types, batch and MRS], AUC = 0.755; ESM Fig. 5). However, adding additional CpG sites based on less-stringent p value thresholds did not improve the AUC, indicating the limited predictive capacity of CpG sites that did not achieve genome-wide significance in the current meta-analysis (ESM Fig. 6).
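To make the MRS and AUC figures concrete, the sketch below assumes that the MRS is a weighted sum of methylation values, with discovery effect sizes as weights. This construction is an assumption, as the section does not define it. The AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen control. The CpG identifiers, weights and scores are hypothetical.

```python
def methylation_risk_score(methylation, weights):
    """Weighted sum of methylation values over the CpG sites in `weights`.

    Assumption: weights are per-site effect sizes from the discovery stage.
    """
    return sum(weights[cpg] * methylation[cpg] for cpg in weights)


def auc(scores, labels):
    """Area under the ROC curve via pairwise case-control comparison.

    Labels are 1 for incident cases and 0 for controls; ties count as 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical weights and methylation values (not study data).
weights = {"cg_a": 0.5, "cg_b": -1.0}
participant = {"cg_a": 0.8, "cg_b": 0.2}
print(methylation_risk_score(participant, weights))

# Hypothetical scores for two controls and two cases.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

An AUC of 0.5 corresponds to no discrimination, so a value such as 0.591 indicates only a modest separation of cases from controls.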
Gene set enrichment analysis and associations with gene expression and SNPs
The results of gene set enrichment analyses based on genome-wide DNA methylation results from model 1 are presented in ESM Tables 7–9. The insulin signalling pathway was enriched in KEGG analysis, although the association did not survive FDR correction (FDR = 0.12). Furthermore, fatty acid and lipid homeostasis appear to be perturbed in future type 2 diabetes cases, since pathways such as phospholipid metabolism and metabolism of steroids were found to be enriched (Reactome analysis, FDR = 0.04; GO terms, FDR < 0.05). As a sensitivity analysis, we repeated the gene set enrichment analyses on the fully adjusted model 2.1 (adjusted for BMI, smoking and follow-up time). As expected, similar pathways emerged; however, the FDR significance level was not reached, owing to the higher p values of individual CpG sites in this model (ESM Tables 7–9).
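The FDR values reported for the enrichment analyses can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure. This is an assumption: the section does not state which FDR method the enrichment tools applied. The pathway p values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical enrichment p values for four pathways (not study data).
pathway_p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pathway_p))
```

A pathway would then be called enriched if its adjusted value falls below the chosen FDR level, for example 0.05.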
TF enrichment analysis of the 65 genes annotated to the 76 DMS, performed with the ChEA3 online tool, identified 48 TFs (p<0.01; ESM Table 10).
Further, we queried the list of 65 annotated gene names in the GWAS catalog to find previously reported associations of phenotypes/diseases with genetic variants at those loci. Seventeen out of 65 (26%) genes harboured genetic variant associations with at least one metabolic trait or disease, such as lipid traits, BP and obesity (Table 3; ESM Table 11).
Next, we queried the list of 76 genome-wide significant CpG sites in the EWAS catalog to find previously reported associations with phenotypes/diseases. Fifty-three out of 76 (70%) CpG sites had been identified in EWASs of at least one metabolic trait and 24 (31.6%) CpG sites had previously been reported to be associated with smoking (ESM Table 12).
We investigated whether DNA methylation levels of the 76 CpG sites were significantly associated with gene expression levels in blood. Of the 76 DMS identified, 21 CpG sites (28%) were associated with expression levels of 23 genes, including top signals at genes such as TXNIP, ABCG1, SREBF1 and CPT1A (Table 3; ESM Table 13). Additionally, we performed a look-up of known meQTL. Of the 76 DMS, DNA methylation at 59 CpG sites (78%) showed significant association with at least one SNP and, in total, 14,813 cis associations were found with 13,121 SNPs (p<5 × 10⁻⁸). Of these, 80 meQTL were identified after clumping (ESM Table 14).