
Introduction

Diabetes mellitus is a heterogeneous disease. Thus, there is currently growing interest in exploiting analytical techniques, particularly unsupervised machine learning, to identify subgroups of patients with diabetes that could be the basis for defining more targeted therapeutic strategies [1, 2]. A well-known example of such an approach is the study by Ahlqvist et al, which introduced a classification of adult-onset diabetes based on cluster analysis [3]. Studies with a similar approach have been performed subsequently [4, 5].

Gestational diabetes mellitus (GDM) is similarly characterised by considerable ‘phenotypic heterogeneity’ [6], which was recently identified as an important research gap by the National Institute of Diabetes and Digestive and Kidney Diseases Workshop [7]. As such, women with GDM may differ in terms of pregnancy outcomes and required treatment strategy. Some women achieve good glycaemic control through diet and lifestyle modification, whereas other women require long-acting insulin (to improve fasting glucose) or short-acting insulin (to achieve acceptable postprandial glucose), and some women will need both short- and long-acting insulin, possibly with further treatment in addition (e.g. metformin). However, to our knowledge, no previous studies have identified GDM subgroups through cluster analysis using only variables commonly measured in the clinical routine of GDM treatment. Therefore, this study aimed to identify clusters of GDM using basic clinical variables and investigate whether the identified clusters are associated with specific treatment needs/modalities or clinical outcomes, including the occurrence of pregnancy complications.

Methods

Participants and experimental procedures

A prospectively compiled dataset was analysed that comprised all singleton pregnancies with GDM diagnosis attending the Pregnancy Outpatient Department at Charité University Hospital (Berlin, Germany) between 2015 and 2022. Another cohort was used for external validation, and comprised women attending the Pregnancy Outpatient Department at the Medical University of Vienna (Vienna, Austria) in the same years. In both cohorts, women with pre-gestational diabetes or with multiple pregnancies were excluded. If a woman had multiple deliveries during the study period, only the first pregnancy was included. Hence, the final sample consisted of 1865 and 817 women for Berlin and Vienna, respectively. A flow chart with detailed information on included and excluded patients is provided in the electronic supplementary material (ESM) Fig. 1. A list of all variables in the Berlin and Vienna datasets is provided in ESM Table 1. Due to the nature of the present study, detailed information about ethnicity and socioeconomic status of the included patients was not available. GDM diagnosis was established by performing a 75 g OGTT in the late second or early third trimester (i.e. between 24 and 28 weeks of gestation), with fasting, 60 and 120 min glucose concentrations equal to or exceeding 5.1, 10 and 8.5 mmol/l (or 92, 180 and 153 mg/dl), respectively [8, 9]. In participants at high risk of GDM or those with elevated fasting glucose concentrations in early pregnancy, the presence of GDM was verified by early OGTT testing before 24 weeks, according to local guidelines [10]. All participants with GDM received lifestyle advice and medical nutrition therapy, and guidance on capillary blood glucose measurement.
Glucose-lowering medication was initiated if the fasting or 1 h postprandial capillary blood glucose exceeded 5.3 or 7.8 mmol/l (95 or 140 mg/dl), respectively, with intermediate (or long-acting) insulin being prescribed for elevated fasting glucose and short-acting insulin being prescribed for postprandial hyperglycaemia [11]. The therapeutic approaches and treatment goals in Berlin and Vienna were comparable, as the same guidelines and recommendations are valid in these countries [10]. Metformin was used in some participants, especially insulin-resistant women, in addition to insulin and/or lifestyle modification. A more detailed description of the indications for and use of glucose-lowering medication is provided in ESM Methods. The study was approved by the local ethics committees (Berlin: EA2/097/23, Vienna: 1542/2019), and performed in accordance with the Declaration of Helsinki. Further details are reported in ESM Methods.
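The OGTT diagnostic thresholds quoted above can be written as a small check. The following is a minimal sketch, not the clinical workflow used in the study: the function names are hypothetical, the mg/dl conversion factor of 18 is an approximation, and the rule that a single value at or above its threshold is sufficient is an assumption taken from the commonly used IADPSG/WHO criteria rather than stated explicitly in the text.

```python
# 75 g OGTT thresholds quoted in the text (mmol/l).
OGTT_THRESHOLDS_MMOL = {"fasting": 5.1, "60min": 10.0, "120min": 8.5}

# Approximate conversion factor for glucose (mg/dl per mmol/l).
MGDL_PER_MMOL = 18.0


def mgdl_to_mmol(value_mgdl):
    """Convert a glucose concentration from mg/dl to mmol/l (approximate)."""
    return value_mgdl / MGDL_PER_MMOL


def exceeded_thresholds(fasting, g60, g120):
    """Return the OGTT time points whose value is at or above threshold."""
    values = {"fasting": fasting, "60min": g60, "120min": g120}
    return [name for name, v in values.items() if v >= OGTT_THRESHOLDS_MMOL[name]]


def meets_gdm_criteria(fasting, g60, g120):
    """Assumption: one value at or above its threshold is sufficient."""
    return len(exceeded_thresholds(fasting, g60, g120)) > 0
```

For example, `exceeded_thresholds(4.8, 10.2, 8.5)` flags the 60 and 120 min values, which corresponds to isolated post-load hyperglycaemia.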

Identification and validation of clusters

Definition of training and test datasets, input and outcome variables

The pre-processing workflow for deriving training and test sets is described in ESM Methods. The Berlin dataset was randomly partitioned into a training set (70%) and a test set (30%). The Vienna dataset was used as an independent test set for validation of identified clustering models [3]. Different sets of input variables were considered to identify potential clusters. The variable sets were selected with the aim of identifying meaningful clusters while ensuring applicability in routine GDM clinical practice. One set comprised age, pre-pregnancy BMI (BMIPG), and the glucose values from the OGTT (fasting, 60 and 120 min: OGTT0, OGTT60 and OGTT120, respectively). Another set included the same variables excluding age. Further sets included age, BMIPG and either mean OGTT glucose or simply fasting glucose. Outcome variables were defined in relation to clinical outcomes of interest (see ESM Table 1). Table 1 reports summary statistics for the training set, Berlin test set and Vienna test set.
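The random 70%/30% partition described above can be sketched as follows. This is an illustration only: the study was performed in R, and the function name, the seed and the use of participant identifiers are assumptions introduced here.

```python
import random


def split_train_test(ids, train_fraction=0.7, seed=0):
    """Randomly partition participant identifiers into training and test sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]
```

Fixing the seed keeps the partition reproducible, which matters because every later validation step depends on which participants fall into the training set.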

Table 1 Summary statistics for the training set, Berlin test set and Vienna test set

Clustering models implementation and validation

Various clustering algorithms were tested: k-means using Euclidean distances, and k-medoids and agglomerative hierarchical clustering [12] using both Euclidean and Manhattan distances. Hierarchical clustering was implemented with three agglomeration methods: complete, average and ward.D2 linkages [12]. We also explored the use of two density-based clustering methods: DBSCAN (density-based spatial clustering of applications with noise) [12] and HDBSCAN (hierarchical density-based spatial clustering of applications with noise) [13].

To determine the possible number of clusters for each algorithm and set of input variables, several methods were employed: the Gap statistic method [12], silhouette maximisation [12] and the function NbClust from the ‘NbClust’ R package [14], whereby various indices are calculated and the number of clusters proposed by the majority of indices is selected. Lastly, the elbow method [12] was applied for further heuristic assessment of the potential number of clusters through visualisation of cluster compactness.

We then applied a range of internal validation techniques on the training set to select valid clustering solutions and identify the optimal one. First, we ensured the solution stability (indicated by a Jaccard index above 0.75) and its compactness (indicated by a non-negative mean silhouette) for each cluster [15, 16]. The significance of each input variable was evaluated as discussed below [16]. Moreover, a twofold cross-validation was implemented on two random subsets from the training set [16]. These ‘partial models’ were then compared with the ‘original model’ (i.e. derived from the entire training set) across three key aspects: input similarity, result similarity and outcome significance consistency. With regard to result similarity, several metrics were employed, including the adjusted Rand index and classification metrics such as sensitivity, specificity and the F1 score (see ESM Methods for details) [17].
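The Jaccard index used as the stability criterion compares two clusters as sets of participants. The sketch below shows only this building block, with hypothetical function names. Stability assessments of this kind are usually obtained by repeating the clustering on resampled data and averaging the best-match similarity per cluster; that resampling loop is omitted here, and whether the study used exactly this procedure is an assumption (see ESM Methods in the original).

```python
def jaccard(a, b):
    """Jaccard similarity between two clusters given as collections of member IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention: two empty clusters are treated as identical
    return len(a & b) / len(union)


def best_match_jaccard(original_clusters, resampled_clusters):
    """For each original cluster, the highest Jaccard similarity
    to any cluster of the resampled partition."""
    return [
        max(jaccard(orig, res) for res in resampled_clusters)
        for orig in original_clusters
    ]
```

Under the criterion stated above, a cluster would be regarded as stable when its similarity exceeds 0.75.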

Following identification and selection of the optimal model, its clustering results underwent an external validation phase on the two test sets to assess generalisability. For a given clustering algorithm and input variables, we defined the training cluster \(C_{\mathrm{training}}\) (clustering results from the training set) and the estimated test cluster \(\widehat{C}_{\mathrm{test}}\) (clustering results for the test set using the model determined from the training set). In \(\widehat{C}_{\mathrm{test}}\), each participant was assigned to the cluster whose centroid was nearest. We also defined the reference test cluster \(C_{\mathrm{test}}\), representing the clustering results on the test set obtained by applying the same clustering algorithm directly to the test set itself. Comparison between \(C_{\mathrm{training}}\) and \(\widehat{C}_{\mathrm{test}}\) allowed assessment of possible differences in each cluster between the training set and the test sets. Finally, comparison between \(\widehat{C}_{\mathrm{test}}\) and \(C_{\mathrm{test}}\) allowed validation of the clustering results on the test sets, evaluating the consistency of cluster assignments [18] (see ESM Methods and ESM Table 2). All procedures were performed in R, version 4.2.2 (https://cran.rstudio.com/). ESM Fig. 2 shows the implemented clustering analysis procedure.
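The nearest-centroid rule used to build the estimated test clusters can be sketched as below. This is a minimal illustration, not the study's implementation: the function names are hypothetical, Euclidean distance is assumed because the selected model is k-means, and standardising each input with training-set means and standard deviations is an assumption about the pre-processing (the actual centroids are reported in ESM Table 3 and are not reproduced here).

```python
import math


def standardise(values, means, sds):
    """Scale raw inputs with training-set means and SDs (assumed pre-processing)."""
    return [(v - m) / s for v, m, s in zip(values, means, sds)]


def nearest_centroid(point, centroids):
    """Return the label of the centroid closest to `point` (Euclidean distance).

    `centroids` maps a cluster label to its coordinate list; the values
    passed in must be on the same scale as `point`.
    """
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))
```

The same two steps would apply to a new patient: scale the five inputs in the same way as the training data, then pick the closest centroid.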

Statistical analysis

Each input variable was evaluated in terms of significant differences across clusters, this being one of the criteria for acceptance of a potential clustering solution [16]. The normality of variable distribution was checked via a Shapiro–Wilk test and a graphical test of normal distribution. ANOVA followed by Fisher’s protected least significant difference post hoc test was used for normally distributed variables to achieve 95% coverage probability, while the Kruskal–Wallis and Dunn post hoc tests were used for non-normally distributed variables [19]. The p values in Tables 2, 3 and 4 were interpreted in an explorative manner.

Table 2 Summary statistics for the training set
Table 3 Summary statistics for the Berlin test set
Table 4 Summary statistics for the Vienna test set

A similar methodology was applied to compare the continuous outcome variables. For binary outcomes, χ² or Fisher’s exact tests and logistic regression analysis were performed [19]. Logistic regression results further underwent ANOVA (likelihood ratio test), and, if significant, subsequent pairwise comparisons. These analyses on the outcome variables were performed separately on the training and test sets, and on partial datasets derived from twofold cross-validation. Statistical analysis was performed using R, version 4.2.2. For all tests, the two-sided significance level was set to 0.05. All p values were interpreted in an explorative manner, with the aim of generating new hypotheses. Therefore, no further adjustment for multiplicity was performed in this study, unless otherwise indicated in the text.
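As an illustration of the test applied to a binary outcome across three clusters, the Pearson χ² statistic for a contingency table can be computed as in the sketch below. The function names are hypothetical and this is not the study's R code. The closed-form p value is valid only for 2 degrees of freedom, which is the case for a 3 × 2 table (three clusters, outcome present or absent), and all expected counts are assumed to be positive.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every expected count is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail p value of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Hypothetical counts (not study data): rows are clusters,
# columns are "medication needed" and "medication not needed".
example_table = [[40, 60], [15, 85], [10, 90]]
example_stat = chi_square_statistic(example_table)
example_p = p_value_df2(example_stat)
```

With small expected counts, Fisher's exact test would be the more appropriate choice, in line with the methods described above.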

Results

Selection of the optimal clustering model configuration

The selected clustering model employed the k-means algorithm and identified k=3 clusters from age, BMIPG, OGTT0, OGTT60 and OGTT120 as input variables. This configuration was selected based on the performance of the tested models against our implemented internal validation criteria. Specifically, both k-medoids and all hierarchical clustering models failed to reach the required threshold of 0.75 for the Jaccard index for each cluster, one of the internal validation criteria that was implemented to deem a proposed solution as acceptable. The reduced stability of these solutions was also confirmed by the unsatisfactory metrics achieved in twofold cross-validation, such as the adjusted Rand index. Use of DBSCAN and HDBSCAN was not pursued further as they yielded unsatisfactory results (see ESM Results for details).

Of note, when using k-means, the Gap statistic method suggested k=1 for the clustering model, while the silhouette method and the majority of the NbClust indices proposed k=2. However, upon further analysis, the k=2 clustering solution did not meet our internal validation criteria, which require that all variables within a proposed clustering solution differ significantly across the identified clusters (p<0.05). Specifically, age did not exhibit significant differences across the two clusters, leading us to discard this partitioning. Thus, k=3, the second option identified by NbClust (with modest differentiation compared with the first option), was considered the best choice.

Main characteristics of identified clusters

The cluster centroids are listed in ESM Table 3. Figure 1 shows the cluster plot, with the proportion of participants assigned to each cluster (Fig. 1a–c) and the variable distribution within each cluster (Fig. 1d–f).

Fig. 1 (a–c) Representation of identified clusters on the principal components space (first two principal components, PC1 and PC2) for the training set (a), and of estimated clusters for the Berlin test set (b) and Vienna test set (c). Each point corresponds to a participant, and different colours represent the assigned cluster; for test sets, filled circles indicate participants correctly assigned to a cluster, whereas empty circles indicate participants assigned to a different cluster than that in the related reference clusters. The percentage of participants assigned to each cluster is also reported. (d–f) Distribution of input variables in the identified clusters for the training set (d), Berlin test set (e) and Vienna test set (f)

Cluster 1, which included 246 out of 1154 participants (21.3%), exhibited the highest values across all variables, with the majority of participants being obese before pregnancy and having elevated glucose levels throughout the OGTT. Cluster 2 included 407 participants (35.3%) with lower median age and intermediate BMI, but elevated fasting glucose often exceeding the GDM threshold. Cluster 3 included 501 participants (43.4%) whose age was similar to that of cluster 1, with typically normal BMI and elevated post-load glucose levels (OGTT60 and OGTT120).

Internal validation results

The three clusters identified showed a high Jaccard index (0.88, 0.86 and 0.86 for clusters 1, 2 and 3, respectively) and non-negative silhouette values. These clusters were validated through implementation of twofold cross-validation, showing robustness and reproducibility for cluster characteristics. Differences in input variables were found in almost all pairs of clusters, confirming the importance of the input variables. Further details of the internal validation results are reported in ESM Results, and goodness-of-fit statistics are reported in ESM Table 4.

External validation results

Assigning the participants in the test sets to clusters

Assignment to the clusters was performed for each participant in each of the two test sets as described below (see text box).

Comparison between training set and test sets

The distribution of participants among clusters in the Berlin test set (Fig. 1b) closely mirrored that of the training set (application of a χ² test on the contingency table of cluster assignment proportions led to a p value of 0.84). Furthermore, no differences were found in the values of variables between these sets, except for BMIPG and OGTT120 in cluster 1. The Vienna test set (the external validation cohort) showed some differences in cluster proportions (Fig. 1c) compared with the training set, but, despite this, showed interesting results in terms of differences among clusters for some outcome variables, as discussed below.

Further information on distribution of variables across clusters is shown in Fig. 1e,f. Other details of external validation results are reported in ESM Results.

Comparison between estimated and reference clusters in test sets (\(\widehat{C}_{\mathrm{test}}\) and \(C_{\mathrm{test}}\))

On the test sets, k-means clustering with k=3 was repeated as described above, providing the \(C_{\mathrm{test}}\) clusters that served as reference for comparison with the estimated clusters (\(\widehat{C}_{\mathrm{test}}\)). In the Berlin test set, the Jaccard indices for comparison between \(\widehat{C}_{\mathrm{test}}\) and \(C_{\mathrm{test}}\) were 0.87, 0.78 and 0.77, for clusters 1, 2 and 3, respectively, while the Vienna test set (the external cohort) showed indices of 0.80, 0.85 and 0.88, respectively. Figure 1b,c show the cluster assignments for both sets. Overall, the adjusted Rand indices were 0.81 and 0.74 for the Berlin and Vienna test sets, with total accuracies of 93.3% and 90.5%, respectively. Specific goodness-of-fit metrics are reported in ESM Table 5.
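The agreement metrics reported above can be computed from two label vectors, one for the estimated clusters and one for the reference clusters. The sketch below uses the standard pair-counting definition of the adjusted Rand index. The function names are hypothetical, and the accuracy function assumes that cluster labels have already been matched between the two partitions, a step not shown here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same participants."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))   # contingency table cells
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)            # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                                   # degenerate case, e.g. one cluster
    return (sum_cells - expected) / (max_index - expected)


def accuracy(estimated, reference):
    """Share of participants with the same label (labels assumed already matched)."""
    return sum(e == r for e, r in zip(estimated, reference)) / len(estimated)
```

Unlike accuracy, the adjusted Rand index does not depend on how clusters are numbered, so it can be computed before any label matching.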

Comparison of clinical outcomes across clusters

Clinical outcomes were first analysed on the training set (Table 2). The need for glucose-lowering medications was higher in cluster 1 (39.6%) compared with clusters 2 (12.9%) and 3 (10.0%) (p<0.0001). Likewise, the proportions of infants with birthweight >4000 g and of infants who were large for gestational age (LGA) were higher in cluster 1 (19.7% and 30.5%, respectively) compared with cluster 2 (12.1% and 22.4%, respectively) and cluster 3 (13.1% and 22.0%, respectively) (p<0.05). Finally, the proportions of low and in-range base excess differed significantly between cluster 1 and cluster 3. The significant differences identified in the training set were further explored in the two partial models obtained through twofold cross-validation (see ESM Tables 6 and 7).

The difference in drug prescription rates between cluster 1 vs clusters 2 and 3 was confirmed in the Berlin test set (p<0.0001) (Table 3). The increased drug prescription rates in cluster 1 (p<0.0001), the higher proportion of babies who were LGA, the higher values for birthweight percentiles, and the difference in low and in-range base excess in cluster 1 vs cluster 3 were confirmed in the Vienna test set (Table 4).

Some outcomes were available for the external validation cohort (Vienna test set) but not for the Berlin sets. Interestingly, all pharmacological treatment outcomes, except use of metformin only, showed differences across clusters (Table 4). Insulin use was higher in cluster 1 (68.1%) than in clusters 2 and 3 (32.9% and 29.0%, respectively; p<0.0001). Use of rapid-acting insulin was higher in cluster 1 (34.5%) than in clusters 2 and 3 (6.6% and 14.7%, respectively; p<0.0001), and lower in cluster 2 vs cluster 3 (p<0.0001). Use of neutral protamine Hagedorn (NPH) insulin was again higher in cluster 1 (54.9%) than in clusters 2 and 3 (32.1% and 24.7%, respectively; p<0.0001), and higher in cluster 2 vs cluster 3 (p<0.05). Use of long-acting insulin was also higher in cluster 1 (10.0%) than in clusters 2 and 3 (0.4% and 1.0%, respectively; p<0.002).

The ORs for the training set, the partial models from twofold cross-validation, and the two test sets were calculated as described in ESM Results and are reported in ESM Tables 8–12.

Assigning a new patient to a cluster

Having defined the clustering model, a new patient may be assigned to the appropriate cluster using the same procedure used to assign the patients in the test sets. The steps for patient assignment are summarised in the text box.

Discussion

This study aimed to assess clusters of GDM through an unsupervised machine learning technique known as data-driven clustering, using easily accessible clinical variables. We identified three clusters, one of which, cluster 1, exhibited a greater need for glucose-lowering medications, indicating its potential for targeted intervention strategies. This cluster, characterised by a higher BMI and hyperglycaemia at fasting as well as after oral glucose load, may represent a GDM subgroup with severe glucometabolic impairment and a higher need for pharmacological treatment (required by approximately 40% of patients). Moreover, the identified clusters showed differences in treatment modalities, even between clusters 2 and 3. Basal insulin was typically prescribed to participants in cluster 2, while rapid-acting insulin was more often prescribed in cluster 3. Our finding concerning differences in the need for glucose-lowering medications was robust, as comparable differences between the clusters were observed in both test sets, including the external validation cohort. Furthermore, our analysis revealed additional differences among the three clusters for various outcome variables associated with pregnancy disorders, maternal delivery and fetal complications, although with a lower degree of evidence (in some cases, mostly seen in the external validation cohort). In particular, infants of participants in cluster 3 had lower birthweight percentiles and a lower risk of being LGA, suggesting more favourable pregnancy outcomes. These observations underline the possible clinical importance of our study. On the other hand, it should be acknowledged that the differences in neonatal outcomes among clusters, such as the LGA prevalence, were not fully consistent between the Berlin and Vienna datasets, which was unexpected given that all participants were treated according to the same guidelines and recommendations [10].
Consequently, we consider the identified cluster differences in terms of the need for glucose-lowering medications as a more robust and relevant result, suggesting a more severe phenotype of the disease.

One may question the advantages of identifying clusters rather than developing predictors of the clinical outcomes of interest. In fact, developing a predictor is typically easier than identifying clusters because of the lower risk of methodological flaws. However, predictors usually focus on one outcome (such as pharmacological treatment risk), whereas clustering defines subgroups aiming to identify meaningful phenotypes of the disease, which may comprise several physiological/pathophysiological and clinical characteristics. Thus, the cluster definition may be progressively improved by specification of further characteristics. An example of this is seen in the present study, in which specific information about the type of insulin prescription was available in the external validation cohort (Vienna), and we identified differences among clusters for this important clinical aspect. Conversely, this is difficult to assess with a predictor-based approach. With the predictor-based approach, if an investigator focuses on a new outcome (such as predicting the risk for an event or condition that was not addressed before), it is likely that a completely new predictor must be developed, unrelated to the previous predictors. With the cluster-based approach, future studies may instead add characteristics (including risk for events/conditions) that were not addressed originally. Thus, it is possible to rely on the previously identified clusters without the need to perform the necessary methodological steps for new cluster definition (including collection of sufficiently large datasets). Indeed, compared with predictors, cluster-based approaches have the potential to provide a more holistic view of the disease under investigation [20].
It is also worth noting that, when we investigated the performance of logistic regression analysis (a typical predictor-based approach) using the same datasets (the training set and the Berlin and Vienna test sets) and the same input variables (age, BMI and OGTT glucose levels, either separately or together) for prediction of pharmacological treatment requirements, we obtained unsatisfactory results (details not shown). This suggests that, even for prediction of a single clinical outcome of interest, the prediction-based approach may not be adequate or superior to the cluster-based approach.

Our approach has the advantage of being simple and easily applicable in clinical practice, as the defined GDM clusters relied on only five input variables that are consistently available for patients with GDM (age, BMI and three OGTT glucose levels). Based on this approach, every clinician in charge of patients with GDM will be able to easily categorise each patient into a cluster. This has several clinical implications. For patients assigned to cluster 1, the clinician will be aware of the elevated likelihood of needing glucose-lowering medications, and especially of high insulin requirements, which may indicate a need for aggressive titration. On the other hand, patients in cluster 1 whom the clinician considers to require intensive lifestyle intervention may be offered specific educational programmes to potentially avoid pharmacological treatment. Likewise, for patients in cluster 2 rather than cluster 3, and vice versa, the clinician will be aware that specific treatment modalities may be preferable, as patients in cluster 3 more often required a treatment approach based on fast-acting insulin formulations, while intermediate-acting insulin formulations were more often required for patients in cluster 2. Future prospective studies may clarify whether specific interventions (e.g. medical nutrition therapy vs pharmacotherapy) are more or less effective in a specific cluster. On the other hand, further studies to generate different cluster definitions may also be pertinent. In our approach, we aimed to use a minimum number of input variables to ensure wide clinical applicability, but future studies may define clusters based on more input variables, with more restricted clinical applicability but probably an improved ability to predict specific clinical outcomes.

Clinicians may wonder whether the assignment of their patients with GDM to the clusters identified in our study can be deemed reliable. We tested several alternative algorithms and input variable sets before selecting the final model. Furthermore, we validated our findings using two independent datasets, one of which originated from a different clinic (Vienna) than the training set (Berlin). It is worth noting that, on average, the values of the five input variables exploited for clustering were typically different between the Vienna and Berlin datasets. This may be the common situation for clinicians applying our methodology to assign clusters for their patients. Despite this, the validation results were satisfactory. Thus, it is reasonable to expect correct assignment of new patients to our identified clusters, at least for women in a European setting.

Comparing our findings to prior studies is challenging, as, to our knowledge, our study is the first to identify GDM clusters based on routine clinical variables. In a previous study, we found that fasting hyperglycaemia, either isolated or in combination with post-load hyperglycaemia, was associated with a more frequent need for glucose-lowering medications [21]. In another study, we observed different treatment modalities in participants with GDM and higher BMI, demonstrating increased requirement of insulin [22]. We also applied a supervised learning technique to build a predictor of pharmacological requirements on a subset of the Vienna cohort analysed in the present study, in which the relevant prediction variables were glucose levels at fasting and at 60 min after the OGTT, plus age [23]. These previous findings are essentially consistent with those of the present study. Other investigators developed similar approaches [24,25,26,27,28], again with findings that are typically consistent with ours. However, none of those previous studies used unsupervised learning such as cluster analysis.

The number of observations needed per input variable to ensure accurate results in cluster analysis has previously been reported to be 70 [29]. This value is higher than that required for supervised machine learning [30, 31], suggesting more challenging requirements for unsupervised vs supervised approaches. As the number of participants per input variable was much higher in our study (230 participants per variable in the training set), and we observed consistent results in training and test datasets (both of which respected the necessary observation/input ratio), we believe that our sample size was adequate. One limitation may be that, among all investigated outcomes, only some showed differences among clusters. On the other hand, we considered only five variables for cluster definition: given that such limited data were exploited for GDM cluster definition, it is plausible that the identified clusters may not yield all the clinical information of possible interest. In fact, the extent of the clinical information provided by our cluster definition aligns reasonably with the limited data required.

In summary, our study identifies novel GDM subgroups through unsupervised machine learning using routine clinical variables. The subgroups derived by cluster analysis showed marked differences in terms of glucose-lowering medication needs and treatment modalities (e.g. rapid-acting vs intermediate- or long-acting insulin), which is of major clinical relevance. Our methodology therefore holds promise for guiding personalised treatment strategies and enhancing our understanding of GDM heterogeneity.