Background

Type 2 diabetes is one of the most common causes of mortality, disability, and health expenditure worldwide [1, 2]. During recent decades, the incidence of type 2 diabetes has increased or, at best, remained stable, while its prevalence and overall burden continue to increase [3].

Heterogeneity in the pathogenesis of type 2 diabetes represents a challenge for disease prevention. For example, although overweight and obesity are relatively strong risk factors for type 2 diabetes, most individuals with a high BMI do not develop type 2 diabetes [4]. Similarly, although prediabetes is commonly used to identify individuals at high risk of type 2 diabetes, its practical utility is still questioned. Some studies estimate that around 70% of people with prediabetes will develop type 2 diabetes during their lifetime [5], while a recent systematic review reported that regression to normal glucose levels ranged between 17 and 59% even after more than 10 years, making it difficult to accurately stratify an individual’s risk of type 2 diabetes [6].

Furthermore, risk stratification based on a single factor, such as glucose levels, ignores the complex pathophysiology of type 2 diabetes [7]. Recently, the World Health Organization (WHO) highlighted the need to better understand the pathophysiological heterogeneity of type 2 diabetes to improve surveillance, prevention, and treatment.

Precision medicine aims to characterize more homogeneous subpopulations based on the individuals’ biological, environmental, and social characteristics [8,9,10]. When extended to public health, precision prevention or precision public health aims to identify homogeneous subgroups of individuals that could lead to the development of targeted interventions for disease prevention [11, 12].

Recent studies have employed data-driven analytical methods to identify distinctive subgroups of individuals with type 2 diabetes or prediabetes and reported associations with disease progression and complications [13,14,15]. Given that common risk factors for type 2 diabetes vary on the individual level, we hypothesized that such methods may also be useful in identifying characteristic risk profiles before the onset of type 2 diabetes among the general population [16, 17].

The aim of this study was to determine whether cluster analysis could be used to identify homogeneous subgroups of diabetes-free adults based on the heterogeneity of known risk factors for type 2 diabetes, and to assess the clinical utility of these subgroups for stratifying the risk of type 2 diabetes compared to that of prediabetes.

Methods

Study samples

In this study, two independent cohorts were used. Data from the Stockholm Diabetes Prevention Program (SDPP) were used in the main analysis, while data from the Metabolic Syndrome Cohort (MSC) in Mexico were used as a validation dataset (see Additional file 1: Table S1).

The Stockholm Diabetes Prevention Program

Detailed information on the SDPP cohort has been published previously [18]. In summary, all inhabitants of five municipalities of Stockholm who were born in Sweden between 1938 and 1961 were invited by letter to participate in SDPP. The clinical subsample included in this study was created by inviting diabetes-free individuals with a family history of type 2 diabetes, women with a history of gestational diabetes, and matched controls to take part in a clinical examination including questionnaires, anthropometric and blood pressure measurements, and blood sample collection.

In total, 7948 individuals participated in the baseline examination of the clinical subsample between 1992 and 1998. The second follow-up, between 2002 and 2006, consisted of 5612 participants, and the third follow-up, between 2014 and 2017, included 4297 individuals. All baseline participants were also followed up regarding a diagnosis of type 2 diabetes by individual linkage to the clinical inpatient and outpatient registers of Region Stockholm (VAL) and the National Diabetes Registry of Sweden (NDR).

For this study, 516 (6.5%) individuals were excluded due to missing information or extreme values of any of the variables used to perform the cluster analysis and 115 (1.4%) due to a diagnosis of diabetes during the baseline examination. The resulting study sample included 7317 participants (Additional file 2: Fig. S1).

The Metabolic Syndrome Cohort

Detailed information on the MSC has been published previously [19]. In summary, this was a prospective cohort of 9637 diabetes-free individuals, resident in central Mexico, born in Mexico, and aged 20 years or older. Baseline examinations were performed between 2006 and 2009 at the participants’ workplace, home, or during a visit to a primary health center and included a comprehensive medical history, anthropometric and blood pressure measurements, blood sample collection, and standardized questionnaires. After 3 years (± 6 months), all participants were contacted and invited to a follow-up examination, which 6144 individuals attended (80.7%).

For this study, 1839 (19.1%) individuals younger than 30 years of age or older than 60 years at baseline were removed to ensure comparability with SDPP and to minimize the risk of including cases of type 1 diabetes or other forms of diabetes. A further 3966 (40.5%) were removed due to missing information or extreme values of variables used in the cluster analysis and 1500 (15.6%) due to loss at follow-up. The final MSC study sample thus included 2332 participants (Additional file 2: Fig. S2).

Variables

Type 2 diabetes

The incidence of type 2 diabetes in the SDPP study sample was determined using oral glucose tolerance tests (OGTT) (fasting plasma glucose ≥ 7.0 mmol/L or 2-h post-load plasma glucose ≥ 11.1 mmol/L) [20], data from the VAL or NDR registers (ICD-10 code E11), or self-reported type 2 diabetes in the study questionnaires. Cases of type 1 diabetes or latent autoimmune diabetes in adults (LADA) were distinguished and excluded using the VAL and NDR registers. The OGTT was performed at each follow-up visit in participants with no new diagnosis of diabetes, using a 75-g bolus of glucose dissolved in 0.25–0.3 L water. Blood samples were collected after fasting and 2 h after administering the glucose bolus. Glucose (mmol/L) and insulin levels (μU/mL) were measured in each blood sample.

The incidence of type 2 diabetes in the MSC study sample was defined as a new diagnosis during the study follow-up for participants with a fasting plasma glucose ≥7 mmol/L, or as a self-reported new diagnosis made by a health care professional, or starting a new treatment with glucose-lowering drugs between the baseline and follow-up examinations.

Covariates

Stockholm Diabetes Prevention Program

Age and sex were obtained from the Swedish general population register. The baseline questionnaires included information on family history of type 2 diabetes (defined as at least one first- or two second-degree family members with type 2 diabetes); other chronic comorbidities (dichotomized); general health (reported as very good, good, neither good nor bad, bad, or very bad); physical activity level in comparison to others of the same age (categorized as much lower, lower, average, higher, and much higher); level of education (categorized as primary education, upper-secondary level, and university or higher); and smoking status (categorized as current smoker, previous smoker, or never smoked).

Anthropometric measurements were made by trained study nurses and included height (m), weight (kg), and waist circumference (cm). Body mass index (BMI) was calculated as BMI = weight (kg)/height (m)². Systolic and diastolic blood pressures (mmHg) were measured using an aneroid sphygmomanometer. Insulin resistance (HOMA2-IR) and β-cell function (HOMA2-B) were estimated using fasting glucose and insulin levels, according to the homeostatic model assessment (HOMA2) [21].

Prediabetes was categorized as impaired fasting glucose (IFG), impaired glucose tolerance (IGT), or both, according to the American Diabetes Association (ADA) and WHO definitions. The ADA defines IFG as a fasting glucose of 5.6–6.9 mmol/L, while the WHO defines it as 6.1–6.9 mmol/L. Both the ADA and the WHO define IGT as a 2-h glucose of 7.8–11.0 mmol/L.
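
The glucose thresholds above can be expressed as a small classifier. The following Python sketch is illustrative only and is not taken from the study code; the function name `classify_prediabetes` and its arguments are hypothetical, and it assumes that individuals with diabetes-range values have already been excluded.

```python
# Illustrative sketch (not the study code): classify prediabetes from
# fasting and 2-h plasma glucose (mmol/L) using the cut-offs given in the text.

# Lower bound of impaired fasting glucose (IFG) by definition, in mmol/L.
IFG_LOWER = {"ADA": 5.6, "WHO": 6.1}


def classify_prediabetes(fasting, two_hour=None, definition="ADA"):
    """Return 'normal', 'IFG', 'IGT', or 'IFG+IGT'.

    Assumes diabetes-range values have been excluded beforehand.
    `two_hour` may be None when no OGTT result is available.
    """
    ifg = IFG_LOWER[definition] <= fasting <= 6.9
    igt = two_hour is not None and 7.8 <= two_hour <= 11.0
    if ifg and igt:
        return "IFG+IGT"
    if ifg:
        return "IFG"
    if igt:
        return "IGT"
    return "normal"
```

Under these assumptions, a fasting glucose of 5.8 mmol/L with a normal 2-h value is classified as IFG by the ADA definition but as normal by the WHO definition, which is why the two definitions identify groups of different size.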

The Metabolic Syndrome Cohort

Self-reported baseline covariates were age, family history of type 2 diabetes (at least one first-degree relative with a diagnosis of type 2 diabetes), physical activity (using the short version of the International Physical Activity Questionnaire, categorized as low, moderate, or vigorous physical activity), level of education (categorized as primary education (<9 years), upper-secondary education (9–12 years), or university and higher (>12 years)), and comorbidities (a previous diagnosis of hypertension, elevated blood cholesterol, or endocrine disease).

Baseline BMI was determined from measured height and weight as above, and systolic and diastolic blood pressures were measured by a trained health worker using an aneroid sphygmomanometer. Fasting blood glucose (mg/dL) and insulin (μU/mL) levels were measured. Fasting glucose was converted to mmol/L according to the formula: mg/dL × 0.0555 = mmol/L. Prediabetes was determined based on fasting plasma glucose level and categorized as IFG according to the WHO and ADA criteria, as described above.
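
The two transformations described above are simple enough to state directly. The Python sketch below is illustrative; the function and constant names are hypothetical, and only the BMI formula and the conversion factor of 0.0555 come from the text.

```python
# Illustrative sketch: the BMI formula and glucose unit conversion
# described in the text.

MGDL_TO_MMOLL = 0.0555  # conversion factor for glucose stated in the text


def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def glucose_mgdl_to_mmoll(glucose_mgdl):
    """Convert plasma glucose from mg/dL to mmol/L."""
    return glucose_mgdl * MGDL_TO_MMOLL
```

For example, a fasting glucose of 126 mg/dL converts to approximately 6.99 mmol/L, which corresponds to the 7.0 mmol/L diagnostic threshold after rounding.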

Cluster analysis

Cluster analysis was performed using k-prototypes, an unsupervised partitioning method that divides the dataset based on the variability of categorical and continuous variables [22]. The variables used for the identification of clusters were chosen based on previous studies [14, 16] and included fasting plasma glucose and insulin levels, HOMA2-IR, HOMA2-B, BMI, systolic and diastolic blood pressure, sex, family history of type 2 diabetes, and level of education. Prior to analysis, continuous variables were standardized (mean=0, standard deviation (SD)=1), and extreme outliers, defined as values 5 SD or more from the mean, were removed.
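
To make the preprocessing and the assignment step concrete, the following Python sketch shows z-score standardization, the outlier rule, and the mixed-type dissimilarity on which k-prototypes is based (squared Euclidean distance for continuous variables plus a weighted count of categorical mismatches). It is a minimal illustration, not the customized package used in the study: all function names are hypothetical, the weight `gamma` is an assumed parameter whose value is not reported in the text, and the use of the sample SD is likewise an assumption.

```python
# Illustrative sketch (not the study code): z-score standardization and the
# mixed-type dissimilarity that k-prototypes uses to assign observations.
from statistics import mean, stdev


def standardize(values):
    """Return z-scores (mean 0, SD 1) for a list of numbers."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def is_extreme(z, limit=5.0):
    """Flag a standardized value lying `limit` SD or more from the mean."""
    return abs(z) >= limit


def mixed_dissimilarity(num_x, cat_x, num_proto, cat_proto, gamma=1.0):
    """Squared Euclidean distance on the continuous variables plus `gamma`
    times the number of mismatching categorical variables."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_x, num_proto))
    categorical = sum(a != b for a, b in zip(cat_x, cat_proto))
    return numeric + gamma * categorical


def nearest_prototype(num_x, cat_x, prototypes, gamma=1.0):
    """Index of the closest prototype; each prototype is (numeric, categorical)."""
    distances = [
        mixed_dissimilarity(num_x, cat_x, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return distances.index(min(distances))
```

In a full k-prototypes run, assignment with `nearest_prototype` would alternate with updating each prototype (means for continuous variables, modes for categorical ones) until assignments stop changing. The choice of `gamma` controls how much weight sex, family history, and education carry relative to the standardized continuous variables.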

The analysis was performed independently in the main data (SDPP) and the validation data (MSC). We assessed validity based on measures of internal and external validity, internal stability, and visual validation [23]. The gap statistic was used to determine the optimal number of clusters [24] and, together with the within-cluster sum of squares, as a measure of internal validity. External validity was assessed using Cox proportional hazards models to determine whether cluster membership predicts the incidence of type 2 diabetes. The internal stability of the clusters was assessed with the Jaccard similarity index, estimated using 1000 bootstrapped samples. A value greater than 0.75 for each cluster was considered stable [25]. Finally, visual validation was done using box plots and bar charts, as well as UMAP and heatmaps to compare the patterns of the variables used for clustering in each cohort. A detailed description is available as Additional file 3 [22].
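
The stability assessment can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not the procedure in Additional file 3: the function names are hypothetical, cluster assignments are assumed to be dictionaries mapping participant IDs to cluster labels, and each bootstrap sample is assumed to have been re-clustered beforehand.

```python
# Illustrative sketch: Jaccard similarity between an original cluster and its
# best-matching cluster in a re-clustered bootstrap sample.


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def members_by_cluster(labels):
    """Map each cluster label to the set of participant IDs assigned to it."""
    clusters = {}
    for participant_id, label in labels.items():
        clusters.setdefault(label, set()).add(participant_id)
    return clusters


def stability_one_resample(original_labels, bootstrap_labels):
    """For each original cluster, the highest Jaccard similarity with any
    cluster found in the bootstrap sample (restricted to resampled IDs)."""
    resampled = set(bootstrap_labels)
    boot_clusters = members_by_cluster(bootstrap_labels).values()
    result = {}
    for label, members in members_by_cluster(original_labels).items():
        kept = members & resampled
        result[label] = max(jaccard(kept, c) for c in boot_clusters)
    return result
```

Averaging the per-cluster values returned by `stability_one_resample` over all resamples (1000 in this study) gives one index per cluster, which can then be compared with the 0.75 threshold.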

Statistical analysis

The baseline characteristics of the participants, grouped by cluster, are presented as the mean and SD for continuous variables, and as proportions for binary and categorical data (Tables 1 and 2). The risk of type 2 diabetes associated with the resulting clusters was estimated using multivariable Cox proportional hazards models with age as the underlying time variable [26]. The participants in the SDPP study sample were followed from the date of baseline examination to the first recorded date of a new diagnosis of type 2 diabetes, date of death, diagnosis of type 1 diabetes obtained from registers, or until March 31, 2021. The participants in the MSC study sample were followed from the date of baseline examination until the self-reported date of diagnosis of type 2 diabetes, self-reported starting date of glucose-lowering therapy, date of follow-up examination when diagnosed with type 2 diabetes, date of death or until February 28, 2014.
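
The follow-up definition for SDPP amounts to taking the earliest of the event and censoring dates and expressing entry and exit as ages. The Python sketch below illustrates this logic only; the function names and argument names are hypothetical, and the use of 365.25 days per year is an assumption.

```python
# Illustrative sketch: end of follow-up as the earliest of the event and
# censoring dates, and age as the time scale for a Cox model.
from datetime import date

ADMIN_CENSORING = date(2021, 3, 31)  # SDPP end of follow-up stated in the text


def end_of_follow_up(t2d_date=None, death_date=None, t1d_date=None,
                     admin_end=ADMIN_CENSORING):
    """Return (exit_date, event), where event is True only if follow-up
    ended with a type 2 diabetes diagnosis."""
    candidates = [d for d in (t2d_date, death_date, t1d_date, admin_end) if d]
    exit_date = min(candidates)
    event = t2d_date is not None and t2d_date == exit_date
    return exit_date, event


def age_in_years(birth_date, on_date):
    """Age in years (fractional), assuming 365.25 days per year."""
    return (on_date - birth_date).days / 365.25
```

With age as the underlying time variable, each participant enters the risk set at the age at baseline examination and leaves it at the age on the exit date, so participants are compared with others of the same age rather than with others at the same time since baseline.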

Table 1 Baseline characteristics of the clusters derived from SDPP data
Table 2 Baseline characteristics of the clusters derived from MSC data

Proportional hazards were assessed visually using log-log plots of survival and predicted Kaplan–Meier survival plots, as well as statistically. Analysis in SDPP was stratified by year of birth using 5-year intervals [26]. Crude and adjusted hazard ratios (HR) are reported. Possible confounders included in the models were sex, self-reported general health status, presence of chronic comorbidities, physical activity level, smoking status, and history of gestational diabetes among women [27].

To evaluate the clinical utility of the clusters, the predictive accuracy and long-term stability of the clusters were assessed and compared to prediabetes in the SDPP data. Predictive accuracy was evaluated using sensitivity, specificity, area under the curve (AUC), and concordance statistics. The long-term stability of the clusters was assessed in the SDPP data from participants who had attended all the study follow-ups using intra-rater agreement measured as Cohen’s kappa and Gwet’s AC1 index [26].
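
For reference, the two agreement measures can be computed as in the following Python sketch, which compares two classifications of the same individuals (for example, risk group at baseline and at a follow-up). It is a minimal illustration with hypothetical function names, not the Stata routines used in the study, and it assumes that at least two categories are observed.

```python
# Illustrative sketch: agreement between two classifications of the same
# individuals (e.g. risk group at baseline and at a follow-up).
from collections import Counter


def observed_agreement(first, second):
    """Proportion of individuals given the same category both times."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Cohen's kappa: chance agreement is the product of the marginals."""
    n = len(first)
    p_first, p_second = Counter(first), Counter(second)
    categories = set(first) | set(second)
    p_e = sum(p_first[c] * p_second[c] for c in categories) / n ** 2
    p_a = observed_agreement(first, second)
    return (p_a - p_e) / (1 - p_e)


def gwets_ac1(first, second):
    """Gwet's AC1 for two ratings; chance agreement uses the mean marginals."""
    n = len(first)
    p_first, p_second = Counter(first), Counter(second)
    categories = set(first) | set(second)
    pi = [(p_first[c] + p_second[c]) / (2 * n) for c in categories]
    p_e = sum(p * (1 - p) for p in pi) / (len(categories) - 1)
    p_a = observed_agreement(first, second)
    return (p_a - p_e) / (1 - p_e)
```

Reporting both indices is useful because kappa can be low despite high observed agreement when one category is rare, whereas AC1 is less sensitive to such imbalance.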

Statistical analyses were performed using Stata version 17 [28]. The clustering algorithm was implemented using a customized k-prototype package in Python 3.7 [29, 30]. The code used can be found as Additional file 3.

Results

The participants in the SDPP study sample were followed for an average of 23 years, representing 169031.1 person-years. A total of 1226 incident cases of type 2 diabetes were identified, giving an overall incidence rate of 7.25 (95% CI: 6.86, 7.67) cases per 1000 person-years. The incidence rate in men was 9.01 (95% CI: 8.42, 9.82) cases per 1000 person-years, and in women, 5.95 (95% CI: 5.49, 6.45) cases per 1000 person-years. Self-reporting was the only source of diagnosis for 44 (<5%) individuals. In total, 17 (0.2%) individuals were diagnosed with type 1 or LADA diabetes, and 402 (5.5%) had died during the study.
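
As a worked check of the incidence rates, the Python sketch below computes a rate per 1000 person-years with an approximate 95% confidence interval on the log scale. This is one common approximation and is not necessarily the method used by the study software; the function name is hypothetical, and only the case counts and person-years are taken from the text.

```python
# Illustrative sketch: incidence rate per 1000 person-years with an
# approximate 95% CI computed on the log scale.
from math import exp, sqrt


def incidence_rate(cases, person_years, per=1000, z=1.96):
    """Return (rate, lower, upper) per `per` person-years."""
    rate = cases / person_years * per
    # The standard error of the log rate is approximated by 1/sqrt(cases).
    factor = exp(z / sqrt(cases))
    return rate, rate / factor, rate * factor
```

Under these assumptions, `incidence_rate(1226, 169031.1)` gives approximately 7.25 (6.86, 7.67) cases per 1000 person-years, in line with the overall SDPP estimate reported above.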

Participants in the MSC study sample were followed up for a mean of 3.8 years, representing 5679.9 person-years. One hundred thirty-one incident cases of type 2 diabetes were identified. The overall incidence rate was 23.06 (95% CI: 19.43, 27.37) cases per 1000 person-years. The incidence rate was slightly higher for men than for women: 23.48 (95% CI: 17.08, 32.27) vs. 22.89 (95% CI: 18.69, 28.06) cases per 1000 person-years. No deaths were registered during the follow-up period.

Cluster analysis of the SDPP and the MSC data resulted in six distinctive clusters, as presented in Figs. 1 and 2. All clusters had good stability (Jaccard similarity index >85% for all clusters in both study samples). Detailed results regarding the determination of the number of clusters, internal validity, and cluster stability are given in Additional file 1: Tables S2-S3, and Additional file 2: Fig. S3. Visually, the two study samples showed similar overall distributions of risk factors in each cluster, leading to comparable phenotypes (see Additional file 2: Figs. S4-S5). We found small yet significant differences in some of the values of the variables used for cluster analysis between the equivalent clusters in SDPP and MSC, reflecting the underlying differences between the two populations (Additional file 2: Fig. S6).

Fig. 1 Box plot and bar charts of baseline characteristics among clusters in SDPP. FHD family history of diabetes, VLR very low-risk, LRLB low-risk low β-cell function, LRHB low-risk high β-cell function, HRHBP high-risk high blood pressure, HRBF high-risk with predominance of β-cell failure, HRIR high-risk insulin resistance

Fig. 2 Box plot and bar charts of baseline characteristics among clusters in MSC. FHD family history of diabetes, VLR very low-risk, LRLB low-risk low β-cell function, LRHB low-risk high β-cell function, HRHBP high-risk high blood pressure, HRBF high-risk with predominance of β-cell failure, HRIR high-risk insulin resistance

Compared to the average incidence of type 2 diabetes in each study sample, the clusters clearly divided the population into three low-risk clusters: very low-risk (VLR), low-risk low β-cell function (LRLB), low-risk high β-cell function (LRHB), and three high-risk clusters: high-risk high blood pressure (HRHBP), high-risk β-cell failure (HRBF), and high-risk insulin-resistant (HRIR), as shown in Fig. 3.

Fig. 3 Incidence rates of type 2 diabetes in the SDPP and MSC studies. Compared to the average incidence rate in each study, the clusters divided the populations into three low-risk and three high-risk groups. SDPP Stockholm Diabetes Prevention Program, MSC Metabolic Syndrome Cohort, VLR very low-risk, LRLB low-risk low beta cell function, LRHB low-risk high beta cell function, HRHBP high-risk high blood pressure, HRBF high-risk beta cell failure, HRIR high-risk insulin resistance

Of the low-risk clusters, the VLR cluster was characterized by predominantly highly educated, young women without metabolic risk factors in both the SDPP and MSC study samples. The LRLB cluster included mostly women who were older than the population mean but had lower levels of fasting insulin, HOMA2-IR, and HOMA2-B than average in each study sample. The LRHB cluster was characterized by low risk, despite dysregulation of insulin production and sensitivity, and a higher proportion of participants with a high BMI. This cluster included participants with the second-highest values of fasting insulin and HOMA2-B, a higher proportion of participants with BMI ≥ 35 kg/m², and the lowest level of education among the low-risk clusters in both the SDPP and MSC study samples.

Among the high-risk clusters, the HRHBP cluster had an equal distribution of the sexes and included participants with the highest mean age and the highest levels of systolic and diastolic blood pressure in both study samples. The HRBF cluster was characterized by a predominance of women, the highest fasting glucose levels, and the second-lowest HOMA2-B, together with the highest proportion of individuals with a family history of type 2 diabetes, in both study samples. Finally, the HRIR cluster consisted mostly of men in the SDPP and women in the MSC study sample, and had the second-highest proportion of individuals with a family history of diabetes and the highest values of HOMA2-B and HOMA2-IR.

Survival analysis showed similar trends in the association between cluster membership and incidence of type 2 diabetes in both cohorts. Detailed results are presented in Table 3 and Fig. 4. In the SDPP sample, compared to the LRHB cluster, a statistically significant inverse association was found between the VLR (HR: 0.38, 95% CI: 0.28, 0.50) and the LRLB (HR: 0.71, 95% CI: 0.55, 0.90) clusters and incidence of type 2 diabetes, while a statistically significant positive association was found between the HRHBP (HR: 2.34, 95% CI: 1.85, 2.96), the HRBF (HR: 3.22, 95% CI: 2.62, 3.96), and the HRIR (HR: 5.39, 95% CI: 4.30, 6.75) clusters and incidence of type 2 diabetes.

Table 3 Results of the Cox proportional hazards models
Fig. 4 Kaplan–Meier estimates of the risk of type 2 diabetes. SDPP Stockholm Diabetes Prevention Program, MSC Metabolic Syndrome Cohort, VLR very low-risk, LRLB low-risk low beta cell function, LRHB low-risk high beta cell function, HRHBP high-risk high blood pressure, HRBF high-risk beta cell failure, HRIR high-risk insulin resistance

In the MSC study sample, a statistically significant inverse association was found between the VLR cluster (HR: 0.58, 95% CI: 0.51, 0.66) and the incidence of type 2 diabetes. No association was found for the LRLB cluster (HR: 1.24, 95% CI: 0.50, 3.12), whereas a statistically significant positive association was found between the HRHBP (HR: 3.26, 95% CI: 1.49, 7.15), the HRBF (HR: 4.00, 95% CI: 2.05, 7.82), and the HRIR (HR: 4.52, 95% CI: 1.66, 12.32) clusters and incidence of type 2 diabetes. Results of the pairwise comparisons can be found in Additional file 1: Table S4.

In the SDPP data, of the 2508 (34.3%) participants categorized in a high-risk cluster at baseline, 859 (34.3%) progressed to type 2 diabetes during the study follow-up. Of the 650 (8.9%) participants with ADA-defined prediabetes at baseline, 239 (56%) progressed to type 2 diabetes, and of the 374 (5.1%) with WHO-defined prediabetes, 239 (63.9%) progressed to type 2 diabetes (see also Additional file 2: Fig. S7).

Complete data from all follow-ups were available for a subsample of 3379 (46.1%) participants. Of the 1033 (30.6%) participants in a high-risk cluster at baseline, 407 (39.4%) remained in a high-risk cluster, 280 (27.1%) regressed to a low-risk cluster, and 346 (33.4%) progressed to type 2 diabetes. Of the 226 (6.7%) with ADA-defined prediabetes at baseline, 65 (28.8%) remained stable, 28 (12.4%) regressed to normal glycemia, and 133 (58.9%) progressed to type 2 diabetes. Of the 124 (3.7%) with WHO-defined prediabetes at baseline, 17 (13.7%) remained stable, 21 (16.9%) regressed to normal glycemia, and 86 (69.4%) were diagnosed with type 2 diabetes. Transitions from the different clusters to type 2 diabetes are presented in Fig. 5 and from prediabetes to type 2 diabetes in Additional file 2: Fig. S8.

Fig. 5 Transition plot of the clinical clusters in the SDPP cohort. Patterns of transition between the baseline, 10-year, and 20-year follow-ups. The thickness of the line represents the proportion of individuals at each time point. Low-risk clusters are marked in light blue and high-risk clusters in red. VLR very low-risk, LRLB low-risk low beta cell function, LRHB low-risk high beta cell function, HRHBP high-risk high blood pressure, HRBF high-risk beta cell failure, HRIR high-risk insulin resistance

The predictive accuracy of the high-risk clusters as a group was higher than that of both definitions of prediabetes, with an AUC of 0.71 (95% CI: 0.70, 0.73) for the high-risk clusters, 0.63 (95% CI: 0.61, 0.64) for the ADA definition, and 0.59 (95% CI: 0.58, 0.60) for the WHO definition of prediabetes. Sensitivity (70.1%, 95% CI: 67.4%, 72.6%) and specificity (72.9%, 95% CI: 71.8%, 74.0%) were both high for the high-risk clusters. In contrast, prediabetes showed a low sensitivity: 29.9% (95% CI: 27.4%, 32.6%) for the ADA and 19.5% (95% CI: 17.3%, 21.8%) for the WHO definitions, and high specificity ranging from 95.3% (95% CI: 94.8%, 95.9%) for the ADA to 97.8% (95% CI: 97.4%, 98.2%) for the WHO definitions, as summarized in Additional file 1: Tables S5-S7.
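
The relationship between these measures can be illustrated with a 2×2 table built from counts given in the text. The Python sketch below is a simplified illustration with a hypothetical function name; it treats every participant without a recorded diagnosis as a non-case and ignores censoring, so it is not necessarily how the reported estimates were obtained.

```python
# Illustrative sketch: sensitivity, specificity and the AUC of a single
# binary risk marker (e.g. high-risk cluster: yes/no) from a 2x2 table.


def binary_accuracy(true_pos, flagged, cases, total):
    """Return (sensitivity, specificity, auc) for a yes/no risk marker.

    true_pos: flagged individuals who developed type 2 diabetes
    flagged:  all individuals flagged as high risk at baseline
    cases:    all individuals who developed type 2 diabetes
    total:    all individuals in the sample
    """
    false_pos = flagged - true_pos
    non_cases = total - cases
    sensitivity = true_pos / cases
    specificity = (non_cases - false_pos) / non_cases
    # A binary marker has a single operating point on the ROC curve, so the
    # area under it equals the mean of sensitivity and specificity.
    auc = (sensitivity + specificity) / 2
    return sensitivity, specificity, auc
```

With the SDPP counts (859 of 2508 high-risk participants among 1226 cases in a sample of 7317), this gives a sensitivity of about 70.1%, a specificity of about 72.9%, and an AUC of about 0.71. The same calculation with the WHO-defined prediabetes counts (239 of 374) gives roughly 19.5%, 97.8%, and 0.59, which shows how a highly specific but insensitive marker yields a lower AUC.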

In general, agreement of high-risk clusters ranged from fair to moderate, while both the ADA and WHO definitions of prediabetes showed a fair to high agreement (see Additional file 1: Table S8).

Discussion

We explored the utility of cluster analysis based on common risk factors for type 2 diabetes. Using data from two independent longitudinal studies, we found six characteristic clusters (three low-risk and three high-risk) that were useful to stratify the risk of type 2 diabetes in both cohorts. Compared to different definitions of prediabetes, the high-risk clusters had a better predictive accuracy and were stable after over 20 years of follow-up.

Comparison with previous studies

Previous studies have used similar methods to investigate the heterogeneity of type 2 diabetes and its complications. Li and collaborators reported three distinct phenotypes of type 2 diabetes using data from digital medical records [13], and Ahlqvist et al. more recently described five clusters of type 2 diabetes using cluster analysis [14]. However, while the clusters reported by Ahlqvist et al. have been replicated in some studies [31, 32], others have failed to reproduce them or found their clinical utility to be limited [33, 34].

Few studies describe clusters before the diagnosis of type 2 diabetes. A study in South Korea by Cho et al. identified six different clusters associated with differences in the prevalence of type 2 diabetes [17], and another by Wagner and collaborators applied cluster analysis to detect phenotypes among individuals at high risk for type 2 diabetes [35].

In all the studies mentioned above, apart from that by Cho et al., analysis required complex data (such as the presence of antibodies to glutamic acid decarboxylase) which are not readily available in most settings. While such data are very valuable, if the aim is to implement cluster analysis widely for the surveillance and prevention of type 2 diabetes, data should be accessible.

On the other hand, important risk factors such as sex and socioeconomic position have not consistently been included, perhaps due to the methodological challenge of using categorical values in cluster analysis. Only the study by Cho et al. included sex, while no previous studies have used indicators of socioeconomic position. In our study, female sex and higher education were overrepresented in the low-risk clusters, indicating the importance of sex and socioeconomic status in the pathophysiological process leading to type 2 diabetes.

The degree of replicability in previous studies is unclear. In the study by Wagner et al., the clusters were replicated using a larger cohort, although they used different variables, assuming that they were conceptually similar, while Ahlqvist et al. and Cho et al. used replication samples derived from the same target population. In contrast, we used comparable data from two independent populations and found comparable patterns.

Additionally, it remains unclear whether different clusters represent different stages in the natural history of type 2 diabetes or etiologically different phenotypes. Although studies have reported short-term transitions between clusters over time [35, 36], our study is, to the best of our knowledge, the first to describe long-term stability.

Our clusters also show important similarities to those reported previously. For example, the phenotypes defined by Ahlqvist et al., severe insulin-deficient diabetes and severe insulin-resistant diabetes, and by Wagner et al. with a prominence of β-cell failure and insulin resistance, resemble our HRBF and HRIR clusters. The studies by Cho et al. and Li et al. describe a subgroup characterized by high blood pressure, similar to the HRHBP cluster described in this study. The mild obesity-related diabetes cluster described by Ahlqvist et al. and the low-risk obese subgroup described by Wagner et al. are similar to the LRHB cluster, which includes individuals with higher BMI than the population average who had a relatively low risk of type 2 diabetes. However, this was not the most notable characteristic of this group in our study. This difference might be explained by selection bias introduced by clustering individuals with type 2 diabetes or at high risk of type 2 diabetes, among whom high BMI is overrepresented.

Strengths and limitations

The strengths of this study include the use of independent study samples from different countries, which allowed us to assess the replicability of our findings. Furthermore, a large and representative sample with a follow-up period of over 20 years (SDPP) was used for the main analysis, which allowed us to estimate the incidence of type 2 diabetes with limited attrition, thanks to the availability of data from regional and national registers.

The replication sample (MSC), in contrast, had a shorter follow-up (3 to 6 years), had a relatively low number of new cases of type 2 diabetes, and was more prone to bias due to sample selection and attrition. These factors might affect external validation, as the statistical power of the survival analysis to detect an association between cluster assignment and risk of type 2 diabetes might be limited. However, the data demonstrated trends consistent with those of SDPP. Replication using data with longer follow-ups is needed to ensure the external validity of our findings. Likewise, the secondary analysis of long-term stability in a subset of SDPP might also be biased by loss to follow-up. These results should thus be interpreted with caution.

The selection of risk factors used in the cluster analysis was based on previous literature and included well-known risk factors for type 2 diabetes. Biochemical and genetic data were not available, and it is unclear whether they would have affected the results of the cluster analysis. Nevertheless, such information is not usually available in most settings.

Incorrect classification of type 2 diabetes as type 1 or other types of diabetes is a common problem in epidemiological studies [37]. In SDPP, data from clinical registries allowed us to identify individuals with other types of diabetes directly. In the MSC study, we limited the sample to individuals with a diagnosis after 30 years of age to minimize the risk of including individuals with type 1 diabetes. However, neither method completely eliminates the risk of misclassification bias.

Implications for future research and public health

From a public health perspective, data-driven risk stratification could lead to better-targeted interventions and resource utilization for the prevention of type 2 diabetes. The high-risk clusters described in this study define a group with high sensitivity and specificity that contains about one-third of the study population and captures a large majority of the cases of type 2 diabetes, which might have important implications for public health and clinical practice. However, questions remain regarding the practicality of using cluster analysis, its utility for guiding public health policy and clinical decisions, and the underlying physiological mechanisms driving the different phenotypes.

The first step towards the practical implementation of data-driven stratification is assigning individuals to the most appropriate cluster. This can be accomplished using data available in health registers or electronic medical records. The observed differences between the Swedish and Mexican cohorts indicate that clustering should be population specific, although our results show that this is likely to result in very similar phenotypes.

Clusters could also be used to investigate the effect of different interventions to prevent type 2 diabetes. For example, the relatively high blood pressure in the HRHBP cluster suggests that blood pressure control could be an important part of the intervention for this group of individuals. The high fasting glucose and insulin levels of individuals in the HRBF cluster resemble the β-cell dysfunction characteristic of type 2 diabetes; thus, pharmacological interventions to regulate β-cell function might be effective. On the other hand, intensive lifestyle interventions might have the greatest effects in individuals in the HRIR cluster, who were characterized by the highest BMI and tendencies towards insulin resistance.

Likewise, studies looking into the association between different clusters and complications of type 2 diabetes may aid clinical decision-making in patients with type 2 diabetes and lead to new insights into the natural history and pathophysiology of the disease.

Finally, to better understand the physiological mechanisms underlying the differences between clusters, studies of the environmental, social, genetic, and biochemical determinants of the different clusters are needed.

Conclusions

Phenotypes derived using cluster analysis on readily available risk factors in two independent longitudinal studies from Sweden and Mexico were useful to stratify the risk of type 2 diabetes among diabetes-free adults. The validity and reliability of the clusters described in this study, compared to those of prediabetes, indicate their potential clinical utility. These results could be used to develop more precise interventions to prevent type 2 diabetes.