Introduction

Recent advances in metabolite profiling technology have enabled discovery of novel biomarkers of type 2 diabetes development. It is worthwhile to better characterise these metabolic alterations since they could be of pathogenic importance. Elevated concentrations of branched-chain and aromatic amino acids and lower concentrations of glycine and various lipid species, such as lysophosphatidylcholine (LysoPC) 18:2, are reported to be associated with incident type 2 diabetes, but the causal role of these early aberrations in diabetes pathophysiology is not clear [1–3]. It has been proposed that identifying genetic determinants of metabolite concentrations would aid functional understanding of associations between metabolite concentrations and clinical endpoints [4]. So far, more than 150 associations between genetic variants and various metabolite concentrations have been reported from large genome-wide association studies (GWAS), often with large effect sizes [5]. Reported variants affecting metabolite concentrations are often located within genes encoding enzymes or transporters, with a function related to the biochemical nature of the associated metabolites [6]. Some of these genetic variants have recently been used as instrumental variables to study the causal effect of lipid metabolites on cardiovascular risk [7, 8]. The underlying idea of this approach is that a genetic variant determining metabolite concentration could be used as an unbiased proxy to predict the effect of metabolite perturbation on clinical phenotypes of interest.

The primary aim of the present study was to identify metabolites associated with incident type 2 diabetes, using a non-targeted metabolomics approach in four population-based cohort studies, and to investigate whether such metabolites share a common genetic background with type 2 diabetes. A secondary aim was to explore whether the addition of metabolites to the Framingham diabetes risk score [9] would improve prediction of type 2 diabetes.

Methods

Study population

We used data that had been generated previously from non-targeted metabolomics analysis [10] in combination with phenotypic information from fasting individuals from four population-based studies. These studies have all been described in detail previously—the Uppsala Longitudinal Study of Adult Men (ULSAM) [11], the Prospective Investigation of the Vasculature in Uppsala Seniors (PIVUS) [12], a case-cohort subset of the TwinGene study [13] and the Cooperative Health Research in the Region of Augsburg (KORA) [14, 15]. Informed consent was obtained from all participants in the four studies. Details of the cohorts can be found in the ESM Methods.

Outcome definition

Impaired fasting glucose (IFG) at baseline was defined according to the American Diabetes Association criteria as fasting glucose ≥5.6 and <7.0 mmol/l [16]. Type 2 diabetes diagnosis at baseline and during follow-up could be based on biochemical measurement (fasting glucose ≥7.0 mmol/l, HbA1c ≥6.5% (48 mmol/mmol) and/or 2 h post-oral glucose tolerance test glucose ≥11.1 mmol/l) within the study, in addition to health registries and validated medical records. Details of diabetes definitions and analytical methods for glucose for each cohort are given in the ESM Methods. Individuals were censored at date of death or end of study.
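The biochemical thresholds above can be written as a small decision rule. The following is a minimal sketch covering only the laboratory criteria; the function name and argument names are hypothetical, and the registry- and record-based diagnoses and the cohort-specific details given in the ESM Methods are not represented.

```python
from typing import Optional


def classify_glycaemia(fasting_glucose: float,
                       hba1c_percent: Optional[float] = None,
                       ogtt_2h_glucose: Optional[float] = None) -> str:
    """Classify a participant from biochemical measurements only.

    Glucose values are in mmol/l and HbA1c in percent. Optional
    measurements that are not available are passed as None.
    """
    # Diabetes if any available measurement reaches its threshold
    if (fasting_glucose >= 7.0
            or (hba1c_percent is not None and hba1c_percent >= 6.5)
            or (ogtt_2h_glucose is not None and ogtt_2h_glucose >= 11.1)):
        return "diabetes"
    # Impaired fasting glucose: >=5.6 and <7.0 mmol/l
    if 5.6 <= fasting_glucose < 7.0:
        return "IFG"
    return "normal"
```

For instance, a fasting glucose of 6.0 mmol/l with no other measurement would be classified as IFG under this rule.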

Metabolomics analysis

Briefly, plasma samples from the age of 71 years in ULSAM and serum samples from the baseline of PIVUS and TwinGene were treated with methanol to precipitate proteins and dissolve lipids. Non-targeted metabolite profiling was performed using ultra-performance liquid chromatography (UPLC; Acquity UPLC) directly coupled to a quadrupole time-of-flight mass spectrometer (Xevo G2 Q-TOF MS) (Waters Corporation, Milford, MA, USA) fitted with an electrospray source operating in positive ion mode. Non-consecutive randomised duplicate samples of 1 μl were injected and separation was performed on a BEH C8 analytical column. Mass analysis was performed in the full scan mode (mass-to-charge ratio, 50–1200).

Data were processed using the open source XCMS package in the R statistical environment [17]. Metabolic feature detection, alignment, grouping, imputation and normalisation were performed separately for each study as previously described [10]. In total, 9755, 10,162 and 7522 metabolic features were detected in the TwinGene, ULSAM and PIVUS cohorts, respectively. A metabolic feature is characterised by a unique mass-to-charge ratio and retention time, meaning that a single metabolite can be represented by many metabolic features due to phenomena such as in-source fragmentation, neutral losses, adduct formation and multimer formation. For the present study, only metabolic features present in TwinGene and PIVUS and/or ULSAM were included in the analysis. Since small polar metabolites such as sugars are not well retained by reverse-phase chromatography, all metabolic features with a retention time <35 s were excluded.

Annotation of IFG- and diabetes-associated metabolic features was based on spectral matching against an in-house spectral library of authentic standards as well as public databases. The level of confidence was categorised in agreement with the Metabolomics Standards Initiative [18] as level 1–4: 1, match with accurate mass (±5 ppm), overall fragmentation pattern and retention time with the in-house spectral library; 2, match based on accurate mass (±5 ppm) and fragmentation pattern using available spectra in public databases; 3, match based on a combination of mass spectra and fragmentation pattern knowledge, with accurate mass and retention time window used to assign the metabolite to a chemical class; 4, unknown.

In KORA, metabolites were extracted from baseline serum samples using methods similar to those for the Swedish cohorts, and a non-targeted metabolomics analysis was performed by Metabolon (Durham, NC, USA) using three separate analytical methods: GC–mass spectrometry (MS), UPLC–MS positive mode and UPLC–MS negative mode. The UPLC–MS platform utilised a Waters Acquity UPLC and a ThermoFisher LTQ mass spectrometer. The methods are described in detail elsewhere [19].

For all metabolite features included in the analysis, peak intensity was transformed to the log2 scale and then SD-transformed within each of the four cohorts prior to statistical analysis.
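The within-cohort transformation can be illustrated as follows. This is a sketch only: it assumes that SD transformation means centring on the cohort mean and scaling to unit SD, and the cohort labels and intensities are hypothetical.

```python
import math
from statistics import mean, stdev


def log2_standardise(intensities):
    """Log2-transform peak intensities, then centre and scale to unit SD."""
    logged = [math.log2(x) for x in intensities]
    mu, sd = mean(logged), stdev(logged)
    return [(value - mu) / sd for value in logged]


# Hypothetical peak intensities of one metabolic feature in two cohorts
cohorts = {
    "cohort_a": [1000.0, 2000.0, 4000.0],
    "cohort_b": [50.0, 100.0, 400.0],
}

# The transformation is applied separately within each cohort
standardised = {name: log2_standardise(values)
                for name, values in cohorts.items()}
```

Standardising within each cohort puts effect estimates on a common per-SD scale, which is what allows results from different platforms and sample types to be pooled.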

Statistical analysis

The overall workflow of the study is depicted in Fig. 1. The study was designed assuming that early markers of type 2 diabetes are also altered in individuals with IFG and overt type 2 diabetes. All statistical analysis was done using STATA13 (Stata, College Station, TX, USA) and R v. 3.1.3 (https://www.r-project.org/).

Fig. 1

Overall workflow of the study. The coloured squares indicate which studies are being used in the different steps in the workflow. T2D, type 2 diabetes

Non-targeted metabolomics of prevalent type 2 diabetes, IFG and incident diabetes

In PIVUS and ULSAM, the association of each metabolic feature was assessed separately with normal fasting glucose vs IFG and normal fasting glucose vs prevalent type 2 diabetes using logistic regression modelling with feature intensity, age, sex, BMI and waist circumference as independent variables. In total, 3276 metabolite features were detected in both PIVUS and ULSAM and here fixed-effects inverse-variance-weighted meta-analysis was performed to pool results; 1622 features were detected only in PIVUS samples and 1063 features were detected only in ULSAM samples. The Benjamini–Hochberg procedure [20] was used to correct for multiple testing (5961 tests) at a false discovery rate (FDR) of 5%. Metabolic features that were identified as being associated with IFG or type 2 diabetes underwent annotation to metabolites and were re-assessed for their association with IFG and prevalent type 2 diabetes in the TwinGene subcohort (n = 1549). Metabolites were excluded if only fragments, but not the parent ion, were found associated with the outcome. A nominal p value cut-off of 0.05 and consistent direction of effect estimates were considered as evidence of replication.
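For readers unfamiliar with the Benjamini–Hochberg step-up procedure, a minimal sketch is given here. The function name and the example p values are hypothetical; in the analysis described above the procedure would be applied to all 5961 tests (3276 + 1622 + 1063).

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where a test is declared significant."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Every test ranked at or below max_rank is declared significant
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical p values for ten metabolic features
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example_p)
```

With these ten hypothetical p values only the two smallest pass the 5% FDR threshold, although five are below the nominal 0.05 level.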

Cox proportional hazards models, adjusted for age and sex, were used to assess the association of IFG- and prevalent type 2 diabetes-associated metabolites with time to incident type 2 diabetes in each of the three Swedish cohorts. In TwinGene, models were fitted and re-weighted for the inverse of the sampling probability using the Borgan ‘Estimator II’ [21]. Fixed-effects meta-analysis was used to pool the results and a 5% FDR was applied. We further adjusted models for BMI, waist circumference and fasting glucose concentrations. We assessed the association of metabolites available on the KORA platform with incident type 2 diabetes using the same model specifications and applied fixed-effects meta-analysis of all four cohorts. We tested for directional replication using the binomial probability test (bitest in STATA).
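Fixed-effects inverse-variance-weighted pooling of cohort-specific estimates can be sketched as below. The log hazard ratios and standard errors are hypothetical and are not results from the study; the function name is likewise an illustration.

```python
import math


def fixed_effects_meta(estimates, standard_errors):
    """Pool per-cohort estimates (e.g. log hazard ratios) by inverse variance."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    # Two-sided p value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value


# Hypothetical log hazard ratios per SD of one metabolite in three cohorts
log_hr = [0.30, 0.20, 0.25]
se = [0.10, 0.10, 0.10]

pooled, pooled_se, p = fixed_effects_meta(log_hr, se)
hazard_ratio = math.exp(pooled)  # back-transform to the hazard ratio scale
```

Cohorts with smaller standard errors receive larger weights, so the pooled estimate is dominated by the most precise studies.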

Genetic association of metabolic loci with type 2 diabetes

To identify genetic variants regulating the metabolites identified as being associated with incident type 2 diabetes, we extracted results from the GWAS of metabolomics based on the KORA and TwinsUK cohorts with up to 7824 adults [19]. We meta-analysed GWAS results from ULSAM, PIVUS and TwinGene for those metabolites that were not identified or did not have a GWAS signal in KORA and TwinsUK data. A cut-off of p < 5 × 10⁻⁸ was used to denote genome-wide significance. To assess the association of these variants with type 2 diabetes, the publicly available data from the GWAS and Metabochip results for type 2 diabetes, including up to 34,840 cases and 114,981 controls from the DIAbetes Genetics Replication and Meta-analysis consortium [22], were accessed. For five genetic variants we used a proxy in linkage disequilibrium with r² > 0.8. In additional analysis, we addressed the association of the bile acid-regulating variant within CYP7A1 with other metabolic traits using the MR catalogue (www.mrcatalogue.medschl.cam.ac.uk, accessed 03/03/2016).

Prediction of type 2 diabetes

To determine whether metabolites associated with prevalent type 2 diabetes and IFG could improve type 2 diabetes prediction, we used Lasso penalised Cox regression implemented via the glmnet package in R, setting the elastic-net mixing parameter α to 1 (the Lasso penalty), to select those with the highest predictive value. Cohort identity and Framingham diabetes risk score [9] were forced into the model. Model choice was based on tenfold internal cross-validation and the minimum λ, which was achieved by adding exactly five of the 54 metabolite biomarkers that were available in all three cohorts. We used the combined ULSAM/PIVUS cohorts as a training set to derive an additive β coefficient-weighted 5-metabolite risk score. For validation in TwinGene, Cox proportional hazards regression re-weighted for the inverse of the sampling probability was used to assess incremental improvement of adding the metabolite score to the Framingham risk score by likelihood ratio test and C indices [23]. In TwinGene, information on parental history of diabetes was not available to include in the Framingham diabetes risk score; thus, this variable was set to ‘none’.
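An additive β coefficient-weighted score of this kind can be computed as in the sketch below. The metabolite labels, coefficients and standardised levels are hypothetical placeholders and are not the estimates derived in ULSAM and PIVUS.

```python
def metabolite_risk_score(levels, betas):
    """Sum of beta_j * standardised level_j over the selected metabolites."""
    return sum(betas[name] * levels[name] for name in betas)


# Hypothetical Lasso coefficients (log hazard ratio per SD) for five metabolites
betas = {"m1": 0.20, "m2": -0.15, "m3": 0.10, "m4": -0.05, "m5": -0.10}

# Hypothetical standardised metabolite levels for one participant
person = {"m1": 1.0, "m2": -1.0, "m3": 0.5, "m4": 0.0, "m5": 2.0}

score = metabolite_risk_score(person, betas)
```

Because the coefficients are fixed in the training set, the score can be calculated in the validation cohort without refitting the selection step.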

Results

Non-targeted metabolomics of prevalent type 2 diabetes, IFG and incident diabetes

Baseline characteristics of the included cohorts and the number of individuals with prevalent diabetes and IFG are shown in Table 1. We found 338 metabolite features to be associated with IFG and 975 features to be associated with prevalent type 2 diabetes in models adjusted for age, sex, BMI and waist circumference in PIVUS and ULSAM combined. In the annotation step, these 1120 features were determined to originate from at least 115 metabolites, of which 69 could be annotated to key adducts of a single unique metabolite and were taken forward to replication in TwinGene. Further, 17 additional metabolites had high-quality spectra but no matching metabolite in our databases and were labelled as ‘missing retention time’ and taken forward to replication. Of the 86 metabolites taken forward to replication, 70 were associated with at least one of the two outcomes in TwinGene: 13 with both IFG and prevalent type 2 diabetes, 53 with type 2 diabetes only and four with IFG only (ESM Tables 1 and 2).

Table 1 Baseline characteristics of the four cohorts used in this study

There were 78 incident events of type 2 diabetes in the ULSAM cohort, 70 in the PIVUS cohort, 122 in the TwinGene cohort and 88 in the KORA cohort. Of the 70 metabolites found to be associated with prevalent type 2 diabetes and IFG, 36 were also associated with incident type 2 diabetes in the meta-analysis of the three Swedish cohorts in crude models adjusted for age and sex at a 5% FDR and 15 metabolites in ‘fully adjusted models’ additionally adjusted for waist circumference, BMI and fasting glucose (p < 0.05) (ESM Table 3). Of those 15, deoxycholic acid, monoacylglyceride 18:2 and cortisol represent a novel finding with the highest level of annotation confidence. The comparison of analytical spectra to standard spectra is shown in ESM Figs 1 and 2.

Five of these 15 compounds (cortisol, γ-glutamyl-leucine, 2-methylbutyroylcarnitine, l-tyrosine and deoxycholic acid) were part of the panel tested in the KORA cohort. The association of 2-methylbutyroylcarnitine and tyrosine with incident type 2 diabetes in the age- and sex-adjusted models was confirmed, although none of the five metabolites were associated in the fully adjusted models (ESM Table 4). For all five metabolites, the directions of effect estimates were the same in KORA as in the Swedish cohorts and, when formally tested, the probability for this distribution was significantly different from the null (binomial probability test for 10/10 to be in the same direction, p = 0.002). A post hoc power calculation for replication at an α of 0.05 is shown in ESM Fig. 3 and ESM Table 4.
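The reported p value for directional consistency can be reproduced with an exact binomial calculation. The sketch below assumes a two-sided test against a null probability of 0.5 for each effect estimate; the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes, trials, prob=0.5):
    """Exact two-sided p value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed number of successes under the null.
    """
    pmf = [comb(trials, k) * prob ** k * (1 - prob) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties
    return sum(p for p in pmf if p <= observed * (1 + 1e-12))


# 10 of 10 effect estimates in the same direction as in the Swedish cohorts
p_direction = binomial_two_sided_p(10, 10)
```

Under these assumptions the result is 2 × 0.5¹⁰ ≈ 0.00195, which rounds to the reported p = 0.002.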

All five metabolites assessed in KORA showed p < 0.05 in the combined meta-analysis (Table 2). In a sensitivity analysis, we re-ran the meta-analysis excluding the male sex-only cohort ULSAM (ESM Table 5) and obtained similar results.

Table 2 Metabolites associated with incident diabetes mellitus in the combined analysis with TwinGene, ULSAM, PIVUS and KORA S4

Genetic association of metabolic loci with type 2 diabetes

Using published GWAS from KORA and TwinsUK [19], as well as from a meta-analysis from ULSAM, PIVUS and TwinGene, we identified a total of 12 metabolite-regulating genetic variants for eight of the 15 metabolites at a genome-wide significance level (p < 5 × 10⁻⁸). The association of these genetic variants with type 2 diabetes was assessed using published summary statistics from a large meta-analysis of GWAS for type 2 diabetes [22]. Four of the 12 genetic variants were found to be associated with type 2 diabetes at a nominal p value threshold (Table 3). First, a variant in the gene encoding cholesterol 7α-hydroxylase (CYP7A1) was found to be associated with both decreased concentrations of the bile acid deoxycholic acid and decreased risk of type 2 diabetes. We further investigated the association of CYP7A1 with other metabolic traits using the largest available GWAS results and found associations with higher LDL-cholesterol and higher triacylglycerol levels (Table 4). Second, genetic variants associated with lower concentrations of sphingomyelin (SM) 33:1 (a variant within SYNE2 [upstream of SGPP1]) and ceramide phosphoethanolamine (CerPE) 38:2 (a variant within GCKR), both identified in ULSAM, PIVUS and TwinGene, were found to be associated with lower risk of type 2 diabetes. Third, a variant in MYRF (upstream of FADS2) identified in ULSAM, PIVUS and TwinGene was found to be associated with lower LysoPC 20:2 and increased risk of type 2 diabetes.

Table 3 Genetic variants associated with candidate metabolites and their association with type 2 diabetes
Table 4 Association of the T allele of the CYP7A1 variant rs8192870 or its corresponding C allele of the proxy rs2326077 (r² = 0.881) with metabolic traits

Prediction of type 2 diabetes

In 1763 individuals comprising the PIVUS and ULSAM cohorts (70 and 78 incident events, respectively), a LASSO predictor selection adjusted for cohort and Framingham diabetes risk score resulted in a five-metabolite score that included tyrosine, barogenin, LysoPC/phosphatidylcholine (PC)(O-16:1/0:0), PC(O-18:1/0:0)/PC(P-18:0/0:0) and LysoPC(20:2). In the validation sample of 1394 fasting individuals without prevalent diabetes and 122 incident events in TwinGene, the metabolite score improved the fit of the Framingham diabetes risk model (χ² = 7.371, p = 0.007) and marginally improved discrimination of incident diabetes events (C index for the Framingham diabetes risk score of 0.848 [95% CI 0.793, 0.903] improved to 0.855 [95% CI 0.800, 0.910]). One SD increase in the five-metabolite score, when added to the Framingham diabetes model, increased the 10 year risk of type 2 diabetes by 29% (HR 1.294, 95% CI 1.071, 1.564).
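The reported test statistic and hazard ratio can be checked with a few lines of arithmetic. The sketch below assumes that the likelihood ratio test has 1 degree of freedom (one added score term) and that the confidence interval is a Wald interval on the log scale; both are assumptions made for illustration.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Likelihood ratio statistic reported for adding the metabolite score
lrt_p = chi2_sf_1df(7.371)

# Reported hazard ratio per SD of the score and its 95% CI
hr, ci_low, ci_high = 1.294, 1.071, 1.564

# Relative increase in hazard per SD, expressed as a percentage
percent_increase = (hr - 1.0) * 100.0

# Standard error of log(HR) recovered from the width of the 95% CI
se_log_hr = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
```

Under these assumptions the p value rounds to 0.007 and the hazard ratio corresponds to an increase of about 29% per SD, in line with the figures in the text.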

Discussion

Using a non-targeted metabolomics approach, our study confirmed several known metabolites to be associated with incident type 2 diabetes and also identified novel associations for three compounds annotated with the highest level of confidence—deoxycholic acid, monoacylglyceride 18:2 and the steroid hormone cortisol. For four metabolites, we identified genetic variants associated with both metabolite concentrations (at a genome-wide significance level) and type 2 diabetes (at a nominal level).

Bile acid synthesis

The main finding of our study is the phenotypic and genetic correlation of bile acid concentrations with type 2 diabetes. In the present study, increased concentrations of three 12α-hydroxylated bile acids (deoxycholic acid, glycocholic acid and glycodeoxycholic acid) were associated with incident diabetes in the age- and sex-adjusted models. One of these, deoxycholic acid, remained significant in the model adjusted for BMI, waist circumference, age, sex and concentration of fasting glucose. In a previous study, increased 12α-hydroxylated bile acid concentrations were linked to worse insulin resistance [24]. Another study found elevated concentrations of deoxycholic acid, but lower concentrations of cholic acid, when persons with prevalent diabetes were compared with healthy controls [25]. We note that out of four 12α-hydroxylated bile acids captured on our metabolomics platform, three were associated with prevalent and incident diabetes. The results from the current study highlight the complex interactions between lipid metabolism, type 2 diabetes and bile acid concentrations. In the liver, the enzyme cholesterol 7α-hydroxylase (encoded by CYP7A1) is the rate-limiting enzyme in the conversion of cholesterol to primary bile acids (Fig. 2). Using a genome-wide approach, we found that a genetic variant within CYP7A1 was associated with decreased deoxycholic acid concentrations, decreased risk of type 2 diabetes and increased concentrations of LDL-cholesterol and triacylglycerols, which supports our observational findings.

Fig. 2

Overview of bile acid metabolism. Metabolites with names in bold were measured on the platform. *p < 0.05 for incident type 2 diabetes in sex- and age-adjusted models. CA, cholic acid; CDCA, chenodeoxycholic acid; LCA, lithocholic acid

The higher level of LDL-cholesterol in carriers of the bile acid-decreasing variant is likely due to a lower activity of the cholesterol 7α-hydroxylase, which will clear less cholesterol from the circulation. The effect of the CYP7A1 variant on LDL-cholesterol and type 2 diabetes is consistent with recent findings that LDL-increasing variants in the gene encoding 3-hydroxy-3-methylglutaryl-CoA reductase (HMGCR) and a polygenic LDL-cholesterol risk score are both associated with lower risk of diabetes [26, 27]. A variant in CYP7A1 decreasing LDL-cholesterol has previously been linked to lower fasting glucose [28]. The direction of effects, with higher levels of bile acids in the circulation linked to increased risk of diabetes, seems however counterintuitive, as bile acids are increasingly being recognised as hormones that regulate various metabolic processes in beneficial ways, including increasing incretin secretion in the gut [29], although different classes of bile acids affect downstream receptor signalling in different ways, not all of which may promote glucose homeostasis [30]. However, with regard to pharmaceutical applications, bile acid sequestrants such as colesevelam (approved for lipid-lowering purposes) bind to bile acids in the gut and thus increase CYP7A1 expression through feedback systems. The drug results in lowered LDL-cholesterol through increased bile acid production and has been approved for glucose-lowering treatment in hyperglycaemia [31], although the underlying mechanism for this effect is little explored and stands in contrast to our results. To our knowledge, we present the largest human sample establishing a possible common genetic origin between dyslipidaemia, reduced 12α-hydroxylated bile acid synthesis and lower risk of type 2 diabetes.

Phospholipid metabolism

Circulating concentrations of different LysoPC species have been found to be reduced in diabetes, impaired glucose tolerance and coronary heart disease [2, 3, 8, 32, 33]. In the present study, lower LysoPC(20:2) and its associated genetic variant near FADS1/2 were found to be associated with higher risk of type 2 diabetes. Fatty acid desaturases (encoded by the fatty acid desaturase gene family) introduce double bonds into saturated fatty acids, and variants in this locus have previously been linked to blood lipid concentrations [34], fatty acid concentrations [35] and fasting glucose [36]. In our genetic analysis, the direction of the effect was consistent with the observational analysis, where an increased level of LysoPC(20:2) was associated with a lower risk of type 2 diabetes. We speculate that decreased expression of FADS genes likely increases the concentrations of saturated fatty acids in different lipids, which may affect insulin sensitivity and insulin secretion and hence diabetes risk.

SMs have also previously been linked to type 2 diabetes [2]; however, to the best of our knowledge, their analogues, CerPEs, have not. In our study, SM d18:2/18:2, SM 34:2, SM (33:1) and CerPE 38:2 were all found to be inversely associated with incident type 2 diabetes. CerPEs are produced in trace amounts together with SMs and are located in the plasma membrane, but their functions are largely unknown [37]. We found a genetic variant within SYNE2 just upstream of the sphingosine-1-phosphate phosphatase 1 gene (SGPP1) that was associated with SM(33:1) and type 2 diabetes, but in a direction different from that revealed by the observational results. The sphingosine-1-phosphate phosphatase 1 protein regulates sphingosine and long-chain ceramide metabolism [38] and has previously been associated with SM concentrations [39] and may play a role in insulin secretion [40]. We further found that a variant in the glucokinase regulator gene (GCKR) was associated with lower CerPE 38:2 levels and lower risk of type 2 diabetes. The encoded protein regulates the activity of glucokinase (a key enzyme in glucose homeostasis) in the liver. Variants within this locus are well-known markers for diabetes and lipid traits.

Prediction

Addition of five metabolites to the established Framingham risk score for diabetes significantly increased model fit but added very little (less than 1%) to discrimination. Future studies that also include the monosaccharides and polar amino acids detectable by GC–MS would have the potential to define a larger set of metabolites that might also increase discrimination.

Strengths and limitations

Strengths of the present study include the use of a non-targeted metabolomics approach in four prospective cohorts and its integration with genetics data to provide evidence for shared causal pathways between several metabolites and type 2 diabetes. However, since some of the genetic variants (e.g. GCKR, FADS2) were commonly associated with several metabolites, a basic assumption for a Mendelian randomisation study (non-pleiotropic effects of genetic instruments) was violated, precluding analysis for causal directions. For CYP7A1 and its association with our main findings on bile acids, although its encoded enzyme is specific to bile acid biosynthesis, it is not suitable to disentangle the effect of bile acids from those of their immediate precursor, cholesterol.

Only five of the 15 candidate metabolites could be analysed in KORA due to different analysis methods. The KORA sample had limited power to detect true effect sizes, especially in the fully adjusted models. Nevertheless, the magnitudes and directions of the associations found in the Swedish meta-analysis and in KORA were similar, supporting the validity of the results. The KORA S4 cohort with targeted metabolite profiles analysed on a different metabolomics platform from that used in the present study was previously used to assess the association of a limited number of metabolites (3 and 14, respectively) with incident type 2 diabetes [2, 3], but the metabolites did not overlap with the five assessed in KORA in the current study.

A limitation concerning generalisability is the inclusion of mostly elderly white persons. Another limitation is that ULSAM is a male sex-only cohort and this could have biased the results if there were different concentrations of metabolites in men and women. However, our sensitivity analysis, in which we excluded ULSAM from the meta-analysis, showed similar results. Degradation of analytes is likely to reduce the power to detect differences between groups but as long as there are no differences in degradation between controls and those with events, there will be no bias causing false-positive findings. Again, results from the meta-analysis without ULSAM (which was the study with the longest freezer storage time) were similar to those of the full meta-analysis. Further, only liquid chromatography was used for separation of metabolites in the three Swedish cohorts; this limits the correct detection and identification of monosaccharides and polar amino acids, which have been highlighted in type 2 diabetes [1, 3]. It is therefore likely that a combination with other methods such as GC–MS would have increased the number of metabolites discovered, especially from glucose-related pathways, which indeed are of great interest for the present research topic. Finally, we were not able to include family history of type 2 diabetes in the Framingham diabetes risk score, which may have led to overestimation of the contribution of the metabolite risk score.

Conclusions

We identified novel metabolites that were associated with incident type 2 diabetes. A genetic variant linked to bile acid metabolism was associated with type 2 diabetes and LDL-cholesterol, suggesting shared causal pathways. Non-targeted metabolomics linked with genetic data is a powerful approach to discover new pathophysiological mechanisms linked to type 2 diabetes development.