figure b

Introduction

Diabetes mellitus is generally classified into type 1 and type 2 based on its aetiology [1]. Type 1 diabetes is mainly caused by beta cell dysfunction due to autoimmune mechanisms, whereas type 2 diabetes is caused by the heterogeneous influence of insulin resistance and beta cell dysfunction [2]. When choosing a glucose-lowering drug, the decision has recently shifted from being based on side effects and cost-effectiveness to being based on evidence for the prevention of diabetes complications, such as CVD, heart failure and chronic kidney disease (CKD) [3, 4]. However, the pathophysiology, genetic risk and involvement of environmental factors such as diet, physical activity and stress vary widely among individuals with type 2 diabetes [5]. Therefore, a personalised approach that comprehensively considers these factors is crucial [6, 7].

Artificial intelligence (AI), including machine learning (ML), is rapidly being applied to diagnosis, treatment and management in diabetes care and research [8]. Using ML techniques, Ahlqvist et al found five diabetes clusters with different clinical phenotypes and outcomes in a Nordic population: Cluster 1, severe autoimmune diabetes (SAID); Cluster 2, severe insulin-deficient diabetes (SIDD); Cluster 3, severe insulin-resistant diabetes (SIRD); Cluster 4, mild obesity-related diabetes (MOD); and Cluster 5, mild age-related diabetes (MARD) [9]. The SAID cluster resembles type 1 diabetes, whereas the other clusters correspond to type 2 diabetes. These diabetes subtypes have been replicated in cohorts including various ethnic groups in terms of genetic predisposition, glycaemic control, diabetes complications and treatment outcomes [10,11,12,13,14,15]. This suggests the effectiveness of a personalised approach using the diabetes subtypes [16,17,18].

However, there are several limitations when applying Ahlqvist’s diabetes clustering in clinical settings and other research. First, the diabetes clustering cannot classify new individuals that are not included in their mother dataset because it depends on the relative positioning of individuals in an entire dataset map [19]. Second, the diabetes clustering cannot be applicable when there are missing fixed variables, HOMA2-B and HOMA2-IR, which represent two key pathogenic mechanisms but are not routinely available in clinical practice and standard cohort studies [20]. For instance, an attempt to replicate the clustering using nine clinical variables, excluding HOMA2 indices, failed to identify Ahlqvist’s clusters [21]. Another study employing C-peptide and HDL-cholesterol instead of HOMA2 indices was unsuccessful in classifying individuals to Ahlqvist’s subtypes [22]. Third, although the diabetes subtypes are theoretically stable over time, a proportion of individuals migrate between subtypes over time [13, 15, 23], limiting the use of this subtyping approach for estimating long-term treatment response and prognosis. As an example, Bello-Chavolla et al reported an AI approach using a self-normalising neural network (SNNN) model [15], showing that proportions of type 2 diabetes clusters were largely different at baseline vs 2 years of follow-up: SIDD 34% vs 16%; SIRD 7% vs 7%; MOD 41% vs 54%; and MARD 18% vs 23% [15].

In this study, an interdisciplinary team of diabetologists and ML specialists aimed to develop an ML model to classify individuals with type 2 diabetes consistently over time into Ahlqvist’s subtypes by minimising the above limitations [9].

Methods

Study design and participants

We included participants from two distinct geographical areas in Japan, Fukushima (Cohort 1) and Okinawa (Cohort 2), to target a wide range of genetic backgrounds [24]. The study protocol was approved by the Ethics Committee of the Fukushima Medical University (approval no. REC 2022-028). The sex of participants was determined by self-report.

Cohort 1

The Fukushima Diabetes, Endocrinology, and Metabolism (Fukushima-DEM) cohort was a retrospective and prospective survey of participants with impaired glucose tolerance and diabetes at the Fukushima Medical University to clarify the risk factors for the onset and progression of diabetes and its complications [10]. The flow from registration to dataset construction is shown in electronic supplementary material (ESM) Fig. 1. The participants were recruited between January 2018 and March 2023 and followed up until December 2023. Of the 897 participants, 619 were diagnosed with type 2 diabetes based on the diagnostic criteria described below. Participants without diabetes (n=153), with type 1 diabetes (n=70), with secondary diabetes (n=49) or who had missing clustering variables (n=6) were excluded. After labelling with k-means clustering, 70% of the total sample was randomly selected for training and the remaining 30% was used for testing.

Cohort 2

The Shimajiri Kinsermae Diabetes Care Clinic cohort was a prospective study of individuals with impaired glucose tolerance and diabetes recruited from Okinawa, Japan. The participants were recruited between January 2020 and January 2021. Of the 1253 participants, 597 were diagnosed with type 2 diabetes based on the diagnostic criteria described below (ESM Fig. 1). Participants without diabetes (n=248), with type 1 diabetes (n=31), with secondary diabetes (n=5) or who had missing clustering variables (n=372) were excluded. After labelling with k-means clustering, the data were used as external validation data for the trained model. A subset with completely missing insulin-related variables (HOMA2-B, HOMA2-IR and C-peptide) was separately created and used as validation data after missing imputation. The need for informed consent in Cohort 2 was waived by the ethics committee because the research did not use identifiable private information and involved no more than minimal risk to the participants. Participants were given the option to decline the use of their personal data based on documents posted on bulletin boards or clinic websites.

Measurements

Variables such as height, weight, waist circumference and BP of participants in both cohorts were measured during study enrolment and the participants visited the clinic at intervals of 1–3 months. Waist circumference was measured at the level of the umbilicus (cm) in the standing position. Blood samples were collected at baseline in the morning after overnight fasting for ≥10 h and assayed within 1 h using automatic clinical chemical analysers. HOMA2-B and HOMA2-IR were calculated using a HOMA2 calculator (University of Oxford, Oxford, UK) based on fasting plasma glucose and fasting serum C-peptide concentrations measured at baseline [25]. Outliers in the HOMA2 calculator for fasting plasma glucose level (<3 mmol/l or >25 mmol/l) and C-peptide level (<0.2 nmol/l or >3.5 nmol/l) were capped to lower or upper limit values. We calculated the eGFR using the Japanese formula [26].

Definitions

The criteria for diagnosing diabetes were as follows: fasting plasma glucose level ≥7.0 mmol/l; random plasma glucose level ≥11.1 mmol/l; HbA1c level ≥48 mmol/mol (6.5%); or regular use of glucose-lowering drugs. At least one previously confirmed positive result for an islet-associated autoantibody is indicative of type 1 diabetes. The severity of diabetic retinopathy was determined based on fundus photography by qualified ophthalmologists. According to the modified international clinical diabetic retinopathy severity scales [27], we classified participants into the following three groups: no diabetic retinopathy; non-proliferative diabetic retinopathy; and proliferative diabetic retinopathy. Where severity in the right or left eye was different, more severe staging was performed. If either non-proliferative or proliferative diabetic retinopathy was present, diabetic retinopathy was diagnosed. CKD was defined as an eGFR <60 ml/min per 1.73 m2 for more than 90 days, and proteinuria was defined as albuminuria ≥30 mg/g creatinine. Coronary artery disease was defined using the ICD-10 codes I20–21, I24, I251 or I253–259 (https://icd.who.int/browse10/2019/en).

ML algorithm

The k-means clustering and random forest classifier

The k-means clustering was applied to create the true labels (type 2 diabetes subtypes pre-labelled by k-means clustering [T2Dkmeans]) for an ML model in the two cohorts. Using the fpc R package (version 2.2-11, https://cran.r-project.org/web/packages/fpc/index.html), k-means clustering was performed 1000 times (k=4), following the method of Ahlqvist et al [9]. Ahlqvist’s variables (age at diagnosis, BMI, HbA1c, HOMA2-B and HOMA2-IR) were used for the cluster analysis. To minimise the effects of sex, men and women were clustered separately. The stability of clustering was assessed using the Jaccard index after 2000× resampling of the dataset [28].

An ML model was then constructed to predict type 2 diabetes subtypes from new data using random forest (RF), a supervised approach. The RF classifier is an efficient algorithm that uses a subset of randomly selected training samples and variables to generate multiple decision trees [29] and has consistently outperformed other classifiers [30]. Furthermore, the RF classifier is less affected by multicollinearity in high-dimensional data, is faster and less susceptible to overtraining, and can calculate the importance of features [31]. Cohort 1 was used to train an RF multiclass classification model that predicted type 2 diabetes subtypes (randomForest R package version 4.7-1.1, https://cran.r-project.org/web/packages/randomForest/index.html). The parameters of the RF algorithm, such as the random sample size, number of trees, minimum number of termination nodes and maximum number of termination nodes, were tuned to improve the prediction performance [32].

We trained an RF model (type 2 diabetes subtypes predicted by RF algorithm based on five variables [T2DRF5]), based on Ahlqvist’s variables age at diagnosis, BMI, HbA1c, HOMA2-B and HOMA2-IR, to assess its accuracy for estimating the true labels (T2Dkmeans). To address potential missing Ahlqvist’s variables, especially insulin-related ones, an extended RF model (type 2 diabetes subtypes predicted by RF algorithm based on 15 variables [T2DRF15]) was constructed to predict type 2 diabetes subtypes based on 15 variables. We made T2DRF15 by applying the Boruta algorithm to select 15 important features out of an initial 25, which were chosen based on their availability in clinical settings. The importance of the features and the predictive metrics of T2DRF5 and T2DRF15 for T2Dkmeans subtypes were calculated.

The RF algorithm creates a proximity matrix as a byproduct. The proximity matrix is defined as the frequency with which two cases are classified into the same leaf node in the decision tree of the established model and represents the degree of similarity between samples [33]. Uniform manifold approximation and Projection (UMAP) was used to embed this matrix in two dimensions for visualisation of individual prediction probabilities calculated by T2DRF15.

RF prediction in a dataset with missing variables

We aimed to make the T2DRF15 model applicable to individuals who are missing insulin-related variables. First, we intentionally deleted insulin-related variables in Cohort 2 and then imputed these missing values using an RF regression analysis (ESM Fig. 1). Second, the Cohort 2 individuals imputed were classified by T2DRF15. Third, to evaluate the importance of variables, we determined the prediction accuracy of T2DRF15 for labelling by T2Dkmeans when variables were omitted step-wise for three insulin-related variables and the others. Proportions of undecidable individuals were also determined. Fourth, the performance of T2DRF15 was further evaluated using precision (% of data that actually belonged to the predicted clusters), recall (% of data that each RF model correctly predicts belongs to that cluster: sensitivity), F1-score (an indicator calculated by harmonic mean from precision and recall) and AUC for the receiver operating characteristic (ROC) curve for each subtype.

Kaplan–Meier curves for the cumulative incidence of retinopathy, CKD (eGFR <60 ml/min per 1.73 m2) and coronary artery disease in the type 2 diabetes subtypes were predicted by T2DRF15 on the putative dataset in Cohort 1.

Consistency over time

The consistency over time of subtype classification in four models, T2Dkmeans, SNNN model [15], T2DRF15 and T2DRF15 with missing insulin-related variables, was assessed by migration patterns at baseline and 5 year follow-up in Sankey diagrams. The consistency over time was assessed by the percentage of participants whose subtype classification did not change between baseline and 5 year follow-up.

Statistical analysis

Continuous and parametric values are presented as mean ± SD, and non-parametric values are presented as median (first quartile–third quartile). Group differences were analysed using one-way ANOVA or the Kruskal–Wallis test. Categorical values are presented as percentages, and group differences were analysed using the χ2 test.

Survival analysis for the cumulative incidence of diabetes complications in Cohort 1 was performed using the Kaplan–Meier method for T2DRF15 clusters. HRs and 95% CIs were subsequently calculated using the Cox proportional hazards model. Missing values in the training data (rate is shown in Table 1) were imputed using the Multivariate Imputation by Chained Equations (MICE) algorithm [34]. Ten complete datasets were generated through this imputation process. The estimated values from each imputed dataset were integrated using Rubin’s rule [35].

Table 1 Clinical characteristics of study participants at baseline in Cohort 1 stratified by type 2 diabetes subtypes predicted by RF classifier trained using 15 selected features (T2DRF15)

A p value of <0.05 indicated statistical significance. All statistical analyses were performed using R version 4.3.1 (https://www.r-project.org/).

Results

k-means cluster distribution and characteristics

In Cohort 1, the training dataset was pre-labelled (T2Dkmeans) for the type 2 diabetes subtype (SIDD, SIRD, MOD or MARD) using unsupervised k-means clustering. The cluster centre coordinates stratified by sex are shown in ESM Table 1. The Jaccard index (min–max) was 0.76–0.90 for women and 0.79–0.93 for men. As shown in ESM Table 2, the following characteristics were noted: the SIDD cluster had low HOMA2-B and high HbA1c levels; the SIRD cluster had high BMI, HOMA2-B and HOMA-IR; the MOD cluster had a younger age at diagnosis and high BMI; and MARD was the most common cluster and had the oldest age at diagnosis. The characteristics of T2Dkmeans were similar to those described by Ahlqvist et al [9].

Type 2 diabetes subtypes using RF algorithm

The model performance in T2DRF5, T2DRF15 and T2DRF25 was assessed by metrics for predicting T2Dkmeans (ESM Table 3). For T2DRF5, the overall prediction performance was 94.0%, and AUC values for subtypes are 99.5% for SIDD, 98.4% for SIRD, 99.1% for MOD and 99.0% for MARD. For T2DRF15, the overall prediction performance was robust, achieving 94.1% of AUC (Fig. 1a), and the prediction accuracy for all subtypes was validated with high precision, recall values and F1 scores≥0.9 (ESM Table 3). Among the 15 variables, C-peptide level, age and waist circumference, besides Ahlqvist’s five variables, were the most important for T2DRF15 subtype prediction (Fig. 1b). The order of importance of variables varied considerably between subtypes (ESM Fig. 2).

Fig. 1
figure 1

Predictive performance of type 2 diabetes subtypes using an RF algorithm based on 15 features (T2DRF15) for estimating T2Dkmeans in the test dataset of Cohort 1. (a) ROC curve showing the diagnostic performance of T2DRF15, the RF model using Boruta-selected 15 features, to predict the T2Dkmeans. (b) Feature importance of Boruta-selected 15 variables fed into T2DRF15. ALT, aspartate aminotransferase; FPG, fasting plasma glucose; γGT, γ-glutamyl transferase; HDL-C, HDL-cholesterol; TG, triacylglycerols

External validation of the predicting model

The validity of the RF multiclass classification model trained with the 15 features was evaluated in Cohort 2 to confirm its applicability to external data. The ROC curves comparing T2Dkmeans and T2DRF15 are shown in Fig. 2a. The overall accuracy was 86.3%, and the model performance was retained when applied to the external cohort. The detailed consistency indices are shown in ESM Table 3.

Fig. 2
figure 2

Predictive performance of type 2 diabetes subtypes using an RF algorithm based on 15 features (T2DRF15) for estimating T2Dkmeans in the external validation dataset of Cohort 2. ROC curve showing the diagnostic ability of T2DRF15 to predict the subtypes pre-labelled by k-means clustering (T2Dkmeans) were calculated in original Cohort 2 dataset (a) and in a putative Cohort 2 dataset with missing insulin-related variables (b)

Classification approach for individuals with missing clustering variables

Correlations of the insulin-related variables, C-peptide, HOMA2-B and HOMA2-IR, between observed and predicted values showed strong correlations in Cohort 2 with missing insulin-related variables (R2=0.83–0.92) (ESM Fig. 3 a–c). The mean absolute differences of these variables were small and normally distributed, suggesting a relatively small impact of imputing the insulin-related variables on subtype predictions (ESM Fig. 3 d–f). The predictive performance (ROC) by T2DRF15, including imputed insulin-related variables, is shown in Fig. 2b. The overall prediction performance of T2DRF15 was 82.9%, and AUC values for the diabetes subtypes were 97.4% for SIDD, 96.4% for SIRD, 93.7% for MOD and 97.6% for MARD (ESM Table 3). The impact of missing variables on classification metrics of T2DRF15 is shown in ESM Fig. 4. When omitting variables, the prediction accuracy of T2DRF15 did not change in individuals until a decrease was seen when age and BMI were omitted from the insulin-related variables (ESM Fig. 4a). Similarly, the proportion of undecidable individuals did not alter age and BMI were omitted (ESM Fig. 4b). The classification metrics per cluster also did not change until age and BMI were omitted (Fig. 4c, numbers 12 and 13 on x-axis) but the declines of values was more rapid in SIRD and MOD than in SIDD and MARD (ESM Fig. 4c).

Evaluating consistency over time and clarity of type 2 diabetes subtype classification

The similarities between participants was visualised by UMAP, using the proximity matrix calculated by RF, and colour-coded with T2Dkmeans (Fig. 3a) and T2DRF15 (Fig. 3b). When the individual predictive probabilities computed in the RF were embedded in the proximity matrix, participants with low predictive probabilities were located in the boundary regions of the subtypes (ESM Fig. 5). The data with a predictive probability of less than 0.6 were defined and relabelled as an ‘undecidable cluster’ to minimise uncertainty in the T2DRF15 model (Fig. 3b). This group of data (accounting for 14.2% of all participants) was located in the boundary region; after excluding them, the data were clearly divided into four clusters, showing high predictive reliability (Fig. 3c). After excluding the undecidable cluster, the clinical characteristics of T2DRF15 subtypes for SIDD, SIRD, MOD and MARD (Table 1) were almost identical to those of T2Dkmeans reported previously [10]. In contrast, the undecidable cluster showed no distinctive clinical characteristics. For example, in this type, the percentage of female sex was as low as in SIDD; age was higher than in SIRD and MOD but lower than in MARD; BMI was higher than in SIDD and MARD but lower than in SIRD and MOD; and HOMA2-IR was higher than in SIDD and MARD but lower than in SIDD and MARD.

Fig. 3
figure 3

Proximity matrix representing the similarity between participants calculated using the RF. (a) Two-dimensional visualisation of the proximity matrix between all participants included in the training and test data. Colours indicate differences in subtype assignment using k-means clustering (T2Dkmeans). (b) Proximity matrix with embedded labels for type 2 diabetes subtypes predicted by the RF algorithm based on 15 variables (T2DRF15). Participants with low predictive probability were newly defined as the undecidable cluster. (c) Proximity matrix with T2DRF15 labels embedded after excluding participants in the undecidable cluster; the remaining participants could be clearly divided into four clusters

We tested the consistency of subtype classification at baseline and after 5 years in T2Dkmeans, SNNN, T2DRF15 and T2DRF15 with missing insulin-related variables. T2Dkmeans showed low consistency (Fig. 4a; 58.9% for SIDD, 53.8% for SIRD, 70.6% for MOD and 77.8% for MARD). SNNN also showed low consistency (ESM Fig. 6). In contrast, T2DRF15, after excluding the undecidable cluster, showed higher consistency (Fig. 4b,c; 100% for SIDD, 68.6% for SIRD, 94.4% for MOD and 97.9% for MARD) than those of T2Dkmeans. The mean consistency for four type 2 diabetes subtypes between baseline and 5 years of follow-up was 96.2%, compared with 49.5% in the undecidable cluster. T2DRF15 with missing insulin-related variables also showed a high consistency (mean 94.1%, except for the undecidable cluster, Fig. 4d).

Fig. 4
figure 4

Sankey diagram showing the subtype redistribution and migration pattern of the study participants in Cohort 1 from baseline to 5 year follow-up. (a) Type 2 diabetes subtypes labelled by k-means clustering (T2Dkmeans). (b) Type 2 diabetes subtypes predicted by an RF algorithm based on 15 variables (T2DRF15). (c) Migration pattern of T2DRF15 excluding the undecidable cluster. (d) Type 2 diabetes subtypes predicted by an RF algorithm based on 15 variables from the dataset where insulin-related variables have been imputed (T2DRF15)

Survival analysis of diabetes complications

To test whether T2DRF15 could predict clinical outcomes, Kaplan–Meier analysis of diabetes complications was performed in a putative dataset in Cohort 2 with missing insulin-related variables (Fig. 5). The median observation period was 11.6 (IQR 4.5–18.3) years. The cumulative incidence of diabetic retinopathy and CKD differed among the diabetes subtypes. After adjusting for baseline age and sex, the risk for diabetic retinopathy was higher in the SIDD cluster than in the MARD cluster (HR 2.08 [95% CI 1.36, 3.18], p<0.001). Similarly, the risk of CKD was higher in the SIRD cluster than in MARD (HR 1.58 [95% CI 1.01, 2.46], p<0.001). These findings were consistent with those of previous reports [9, 10] that had determined the subtypes using k-means clustering (T2Dkmeans). Namely, the risk of CKD was higher in the SIRD cluster of T2Dkmeans than in MARD (the age- and sex-adjusted HR 2.41 [95% CI 2.08, 2.79], p<0.0001 in the Nordic population [9]; HR 1.60 [95% CI 1.03, 2.47], p=0.035 in our Japanese population [10]). The risk of diabetic retinopathy was higher in the SIDD cluster of T2Dkmeans than in MARD (the age- and sex-adjusted HR 1.33 [1.15, 1.54], p<0.0001 in the Nordic population [9]; HR 1.78 [95% CI 1.30, 2.43], p<0.001 in our Japanese population [10]). Meanwhile, the undecidable cluster had an intermediate risk for all complications (Fig. 5). Namely, the Kaplan–Meier curves for the cumulative incidence of retinopathy, CKD and coronary artery disease in the undecidable cluster lay between the highest and lowest curves (Fig. 5).

Fig. 5
figure 5

Kaplan–Meier curves for the cumulative incidence of retinopathy (a), CKD (eGFR <60 ml/min per 1.73 m2) (b), proteinuria (c) and coronary artery disease (d) in type 2 diabetes subtypes predicted by RF based on 15 variables (T2DRF15) in the putative dataset in Cohort 1 with missing insulin-related variables

Discussion

We developed an ML model that easily and consistently classifies individuals with type 2 diabetes into Ahlqvist’s subtypes by minimising the disadvantages. Three main improvements were achieved: (1) our ML model employed RF classifiers instead of original k-means [19], which enabled us to predict Ahlqvist subtypes for new individuals that were not included in the mother dataset; (2) by integrating imputation algorithms, the RF classifier was able to accurately predict type 2 diabetes subtypes even for individuals with missing HOMA2-B and HOMA2-IR [20]; and (3) by defining an undecidable cluster, the RF classifier achieved high consistency during 5 years of follow-up in the subtype classification. This new ML model has great potential for clinical practice and cohort studies because it can classify individuals newly diagnosed with type 2 diabetes into Ahlqvist’s subtypes using readily available variables.

Our ML model enables us to classify individuals into Ahlqvist’s subtypes by employing an RF classifier. Owing to its ease of implementation and low computational complexity, k-means clustering, an unsupervised ML algorithm, is most frequently used among several methods for AI subtyping [36]. Actually, Ahlqvist’s k-means clustering based on five fixed variables [37], including age at onset, BMI, HbA1c, HOMA2-B and HOMA2-IR, is the most extensively studied in diabetes research [9,10,11,12, 14]. However, the k-means clustering cannot classify new individual cases not included in their mother dataset because it depends on the positioning of cases in an entire dataset map [19]. Previously, one of our team found that RF-based ML algorithms are useful for risk stratification beyond conventional classifications and are applicable case by case in people with ovarian cancer [38] or heart failure [39]. In this study, we similarly created a novel ML model based on RF and developed a method to determine Ahlqvist’s subtype on a case-by-case basis.

By integrating imputation algorithms, the RF classifier was able to accurately predict type 2 diabetes subtypes even for individuals with missing insulin-related variables. As discussed above, the diabetes clustering cannot be applied when the fixed variables HOMA2-B and HOMA2-IR are missing [21, 22]. C-peptide levels, which are used to calculate the HOMA2 indices, are not routinely measured in people with diabetes in clinical practice and in standard cohort studies, usually due to the cost. Our RF classifier could predict diabetes subtypes, even when C-peptide was missing, by imputing with high consistency. To our knowledge, this study for the first time shows that the RF classifier can predict diabetes subtypes even when insulin-related variables are missing.

Our ML model showed long-term consistency in all four diabetes clusters. Consistency over time of previous AI models in determining type 2 diabetes subtypes has been limited. Bello-Chavolla et al reported an approach for classifying diabetes subtypes using an SNNN model [15]. Since subtype consistency during follow-up with this approach was low, they considered that diabetes subtypes are changeable and should be reassessed periodically to understand the trajectories and risks of diabetes complications [15]. However, when applying their SNNN model to our participants in Cohort 1, the consistency of the subtypes was also low (Fig. 3b): the SNNN model demonstrated an overall accuracy of 69% but was particularly low for the SIDD (36.4%) and SIRD (16.3%) clusters. The difference in consistency over time between RF classification and SNNN in the same population suggests that the diabetes subtype is simply not correctly determined rather than changeable. The diabetes subtype should be consistent in an individual over years of long clinical course in terms of genetic risk [40], molecular mechanisms [41] and complication risk [9, 10]. We achieved excellent long-term consistency in subtype classification by excluding an undecidable cluster in all four diabetes subtypes. Given that previous studies on diabetes subtypes have used ‘hard’ clustering methods such as k-means, which forcefully assigns samples at boundaries of clusters to either cluster, we a priori hypothesised that ‘hard’ clustering leads to lower consistency in diabetes subtyping. Therefore, we employed the idea of grouping samples with low prediction probability by the RF classifier (i.e. populations with uncertainty about which subtype they belong to) as a single ‘undecidable’ cluster rather than forcing their assignation to a subtype. This is a clinically acceptable approach, given that BMI and HbA1c often fluctuate during treatment and are inappropriate for inclusion in the subtype prediction. Considering this undecidable cluster, little migration among subtypes occurred after the 5 year follow-up; thereby high consistency was achieved (well differentiated). Individuals in the undecidable cluster had unclear diabetes characteristics and a non-typical course of diabetes complications for the diabetes subtypes, and approximately half of them moved to different subtypes after 5 years (Table 1, Fig. 4b, c).

This study had several limitations. First, the sample size of the training dataset is relatively small. Second, because this study was conducted only in the Japanese population the results cannot be generalised, thereby limiting applicability to other ancestral populations. We tested consistency by recruiting two Japanese cohorts with diverse genetic predispositions. However, future studies are further needed to assess whether our approach is applicable to multiethnic populations. Additionally, whilst the study sample is broadly representative of general demographic distribution of the Japanese population with diabetes in terms of sex, age and socioeconomic factors, the potential limitations and biases of these factors should still be considered when interpreting the results. Third, because some study participants were enrolled after the start of diabetes treatment rather than at the onset of diabetes, the variables used for clustering and prediction could have been affected at least partly by lifestyle interventions and medications the participants received before study enrolment. Fourth, the reasons for group migration and changes in clinical variables in the undecidable cluster are yet to be determined. This undecidable cluster was atypical, with no clear clinical features (Table 1). In the future, the respective characteristics (i.e. clinical features and genetic predisposition) of individuals moving between clusters and of undecidable groups need to be clarified.

In conclusion, we developed a novel ML model for type 2 diabetes subtypes. The new RF-based model for predicting Ahlqvist’s subtypes of type 2 diabetes has great potential for application in a wide range of research, including large-scale cohorts and clinical studies, because it can classify individuals with missing HOMA2 indices and predict glycaemic control, diabetes complications and treatment outcomes with long-term consistency by using readily available variables. Future studies are needed to assess whether our approach is applicable to research and/or clinical practice in multiethnic populations.