Introduction

Background

Diabetes has traditionally been categorized into Gestational (GDM), Type 1 (T1DM), and Type 2 (T2DM). GDM occurs during pregnancy and increases the chances of developing T2DM later in life. T1DM usually appears at early ages, when the pancreas stops producing insulin due to an autoimmune response; the reasons why this occurs are still not well understood. Monitoring the glucose levels of these patients is critical, as sudden changes can be life threatening, and patients with this type often need a daily dose of insulin to lower their blood glucose levels. T2DM is the most common type of diabetes, encompassing 95% of diabetic patients, who are commonly adults with a sedentary lifestyle and a poor-quality diet. Although it can be easily controlled in early stages, comorbidities may appear years later. Stages of T2DM are related to parameters such as glucose concentration, insulin sensitivity, insulin secretion, overweight, and aging. However, recent studies have found that not all patients present the same manifestations.

According to the International Diabetes Federation (IDF) [1], diagnostic guidelines for diabetes include two measures obtained from blood tests: the glycated hemoglobin test (HbA\(_\text {1C}\)) and the plasma glucose (PG) test. The latter can be obtained in three different manners: in a fasting state, called Fasting Plasma Glucose (FPG); from an oral glucose tolerance test (OGTT), which consists of administering an oral dose of glucose and measuring PG after two hours; and from a sample taken at a random time (normally carried out when symptoms are present), called Random Plasma Glucose (RPG). A positive diagnosis is reached when any one of the following conditions holds (the IDF recommends two conditions in the absence of symptoms): (1) FPG \(\ge\) 7.0 mmol/L (126 mg/dL), (2) PG after OGTT \(\ge\) 11.1 mmol/L (200 mg/dL), (3) HbA\(_\text {1C}\) \(\ge\) 6.5%, or (4) RPG \(\ge\) 11.1 mmol/L (200 mg/dL). These parameters make it possible to readily identify diabetic patients and, when combined with risk factors such as demographics, family history, and diet, may help to predict the tendency to develop the disease or its related complications. Understanding the relation of distinct parameters to the pathology of the disease also helps scientists develop new ways to treat it. In this regard, data-driven analysis provides powerful means to discover such relations.
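For illustration only, these thresholds can be expressed as a simple decision rule. The following is a minimal sketch (the function name is ours; glucose values in mmol/L, HbA\(_\text {1C}\) in %; it encodes the single-condition rule, whereas the IDF recommends confirming with two conditions in the absence of symptoms):

```python
# Hedged sketch of the IDF diagnostic thresholds listed above; not medical software.
def meets_diabetes_criteria(fpg=None, ogtt_2h_pg=None, hba1c=None, rpg=None):
    """Each argument is a measurement, or None if unavailable."""
    conditions = [
        fpg is not None and fpg >= 7.0,                  # FPG >= 7.0 mmol/L (126 mg/dL)
        ogtt_2h_pg is not None and ogtt_2h_pg >= 11.1,   # 2-h OGTT PG >= 11.1 mmol/L
        hba1c is not None and hba1c >= 6.5,              # HbA1c >= 6.5 %
        rpg is not None and rpg >= 11.1,                 # RPG >= 11.1 mmol/L (200 mg/dL)
    ]
    return any(conditions)
```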

With the relatively recent advent of big data supporting precision medicine [2], the understanding of diabetes has changed from the classical division into T1DM, T2DM, and other minority subtypes to the notion of a highly heterogeneous disease [3]. Research efforts have been directed towards exploiting available big data analysis – particularly from electronic health records – in search of refined classification schemes for diabetes [4]. Indeed, recent diabetes research has stressed the importance of the underlying etiological processes associated with the development of important adverse outcomes of the disease, along with response to treatment [5,6,7]. Exploring the disease heterogeneity, a recent data-driven unsupervised analysis [8] found that T2DM might have different manifestations, including five subtypes related to varying risks of developing typical diabetes complications such as kidney disease, retinopathy, and neuropathy. Building on the data-driven analysis of Ahlqvist et al. [8], in this paper we tackle the development of methods for classifying T2DM subtypes through machine learning approaches, with the aim of providing a comparison and new insights on the matter. In short, the goals of the present study were:

  • Construct a dataset using publicly available databases comprising a majority of Mexican and other Hispanic population.

  • Obtain a characterized dataset with different T2DM subgroups by means of clustering algorithms and evaluate different clustering strategies through clustering validation indices.

  • Train and validate classification models using different algorithms and data schemes.

  • Test developed models with a hold-out dataset.

Compared to previous work, our study introduces the following contributions and main results:

  • The development of classification models for T2DM subgroups. To the best of our knowledge, there is only one preceding study that tackled this issue [9].

  • Validation of T2DM subtypes in a relatively large dataset predominantly composed of Mexican and other Hispanic population.

  • An evaluation of clustering algorithms and strategies including indices to measure clustering quality.

  • An assessment of performance of classification models for T2DM subtypes. This assessment included four algorithms, seven data schemes, two datasets, and two validation methods.

  • Our models reached accuracies of up to 98.8% and 98.9% on the two datasets. Simpler and faster algorithms such as SVM and MLP performed better. Models fitted notably better to Dset B data, and performance was more consistent across the schemes on this dataset. Both validation settings, bootstrap and 10-fold cross validation, yielded similar results.

  • Finally, the simple majority vote implemented in the testing stage showed a great amount of consensus, providing class proportions akin to those previously reported for other populations.

In the remainder of this Introduction, we briefly review artificial intelligence work related to general diabetes and to diabetes subgroup classification.

Related work

Artificial intelligence – and particularly, machine learning – methods have been extensively applied within the biomedical field, mainly for the development of computational tools to aid in the diagnosis of diabetes or its complications [10]. Data analysis has been applied in several diabetes studies, covering five main fields: risk factors, diagnosis, pathology, progression, and management [11]. A number of studies deal with the identification of diabetes biomarkers, generally by means of feature selection methods, such as evaluating filter/wrapper strategies [12], combining feature ranking with regression models to predict short-term subcutaneous glucose [13], and proposing new methods for feature extraction [14, 15] and generation [16]. Another subfield of research on machine learning applied to diabetes mellitus is devoted to the detection/prediction of complications. With the rise of deep learning within the last decade, much of this work aims at predicting diabetic retinopathy through convolutional architectures, primarily analyzing retinal fundus images [17, 18], even deploying tools that are commercially available [19, 20]. Predictive tools for diabetic nephropathy were developed by integrating genetic features with clinical parameters [21] and by comparing the performance of various models for the detection of diabetic kidney disease [22]. Other major diabetic complications tackled with machine learning algorithms are cardiovascular disease [23], peripheral neuropathy [24], diabetic foot [25], and episodes of hypoglycemia [26, 27]. All of these classification/regression tasks are approached with varying machine learning methods, most of which are reviewed in [28, 29].

Until recently, diabetes mellitus was thought of as a two-class disease, divided into the general Types I and II with some uncommon manifestations within them, such as monogenic types (e.g. Maturity Onset Diabetes of the Young - MODY, and neonatal diabetes) and secondary types (e.g. due to steroid use, cystic fibrosis, and hemochromatosis) [30]. As mentioned earlier, Ahlqvist et al. [8] introduced a novel subclassification of diabetes with a data-driven (clustering) approach. Using six variables (glutamate decarboxylase (GAD) antibodies, age at diabetes onset, body mass index, glycated hemoglobin, and homeostatic model assessment values for \(\beta\) cell function and insulin resistance), they discovered five clusters (T2DM subtypes) that were dubbed as follows:

  1. Severe Autoimmune Diabetes (SAID): probably the same condition as T1DM, but here classified as a T2DM subtype; the pancreas stops producing natural insulin due to an autoimmune response, identified by the presence of GAD antibodies.

  2. Severe Insulin-Deficient Diabetes (SIDD): similar to SAID, but the antibodies responsible for the autoimmune response are absent.

  3. Severe Insulin-Resistant Diabetes (SIRD): patients seem to produce a normal amount of insulin, but their bodies do not respond as expected, maintaining high blood glucose levels.

  4. Mild Obesity Related Diabetes (MORD): related to a high body mass index; in moderate cases, it can be treated with an improved diet and exercise.

  5. Mild Age Related Diabetes (MARD): mostly present in elderly patients, corresponding to natural body ageing.

For this subgroup identification, they used a cohort comprising 8,980 patients for the initial clustering; the resulting centroids were then used to cluster three more cohorts and replicate the results. Importantly, these groups were associated with different disease progression and risks of developing particular complications.

Soon after this pioneering study, a number of works based on the proposed cluster analysis method emerged to replicate diabetes subgroup assessment in different cohorts (see Table 1). The subject was systematically reviewed in [31]. The ADOPT and RECORD trial databases, with international and multicenter clinical data comprising 4,351 and 4,447 observations, respectively, were analyzed in [32] to investigate glycaemic and renal progression. The authors found cluster results similar to those reported by Ahlqvist et al., but also that simpler models based on single clinical features were more informative for the same purposes. In a 5-year follow-up study of a German cohort with 1,105 patients [33], the authors evaluated the prevalence of complications such as non-alcoholic fatty liver disease and diabetic neuropathy within diabetes subgroups after follow-up. Later, using this same German cohort, the authors assessed inflammatory pathways within the diabetes subgroups by analyzing pairwise differences in the levels of 74 inflammation biomarkers [34]. In another study by the same German group [35], the prevalence of erectile dysfunction among the five diabetes subgroups was researched. This complication presented a higher prevalence in SIRD and SIDD patients, suggesting that insulin resistance and deficiency play an important role in developing the dysfunction.

Table 1 Datasets and found proportions of diabetes subgroups reported in the literature

A couple of studies were carried out to validate the data-driven approach for diabetes subgroups in Chinese population [36, 37]. The former consisted of a multicenter national survey with cross-sectional data comprising 14,624 records. These data showed distributions similar to those found by Ahlqvist et al., with a higher prevalence of the SIDD class. The latter recruited 1,152 inpatients of a tertiary care hospital. After performing clustering on the data, the proportions were similar for SIDD and SIRD, but in this case MORD assembled the majority of records instead of MARD. A team of researchers [9] verified the reproducibility of diabetes subgroups by introducing classification models based on trained Self-Normalized Neural Networks (SNNN). They clustered NHANES data to obtain a labeled dataset on which four input data models were fitted. These models were later used to classify data from four different Mexican cohorts to assess the risk of complications, risk factors of incidence, and treatment response within subgroups. In a subsequent work [39], with the purpose of assessing the prevalence of diabetes subtypes in different ethnic groups of the US population, the research team applied their SNNN models to classify an extended NHANES dataset comprising cycles up to 2018.

A replication and cross-validation study was performed in [38], where the authors used an alternative input data scheme replacing the HOMA2 values – originally used for clustering – with C-peptide and high-density lipoprotein cholesterol. Five clusters were produced with the proposed scheme, three of them showing good matching with MORD, SIDD, and SIRD, whereas the combination of the remaining two showed good correspondence to MARD. Cross-validation among three different cohorts exhibited fair to good cluster correspondence. Pigeyre et al. [40] also replicated the clustering results of the original Swedish cohort using data from an international trial named ORIGIN. In this cohort, they investigated differences in cardiovascular and renal outcomes within the subgroups, as well as the varied effect of glargine insulin therapy compared to standard care in hyperglycemia. Finally, the risk of developing sarcopenia was evaluated in a Japanese cohort previously characterized using cluster analysis [41]. Among diabetes subtypes, SAID and SIDD patients exhibited a higher risk for the onset of this ailment.

Methods

Our interest was to explore different ways of obtaining classification model variations for assigning T2DM subtypes to patients according to a set of attributes. This required us to characterize T2DM subtypes from existing databases, train these models, and apply them to unseen patient records. The study followed a procedure with three main sequential stages, shown in Fig. 1:

  1. Dataset construction, where the tasks of acquiring, cleansing, merging, and preprocessing the data are performed to obtain a tidy subset of the databases. This subset is used for training, validating, and testing the clustering and classification models in the subsequent stages.

  2. Data characterization, where diabetes patients (instances of the dataset) are segmented, yielding diabetes groups that are labeled according to feature distribution patterns.

  3. Classification model training, where different classification models are trained and validated using the datasets from the previous characterization; the obtained classification models are then used and evaluated by assigning T2DM subtypes to unseen patient records.

Fig. 1 Overview of the general procedure applied in the study

The best classification models were obtained according to different strategies varying correlated attributes. In the following subsections, we describe these stages and steps in more detail.

Dataset construction

The study was performed on real data (the NHANES and ENSANUT databases). These data come from health surveys but were curated in several ways to obtain a better fit of the classification models.

  • The National Health and Nutrition Examination Survey (NHANES) database [42], as its name suggests, is a U.S. national survey performed by the National Center for Health Statistics (NCHS), which in turn is part of the Centers for Disease Control and Prevention (CDC). It gathers information from interviews where people answer questionnaires covering demographic, nutritional, socioeconomic, and health-related aspects. For some of the participants, physical examination and laboratory information are included. The database is divided into cycles, which after NHANES III (1988 to 1998) are biennial. Several datasets (views) can be obtained from NHANES for a vast number of works, depending on the research interests. The NHANES dataset assembled in the present work consists of the merging of cycle III (1988-1998) with all continuous NHANES cycles from 1999-2000 to 2017-2020. This latter cycle was the 2017-2018 cycle joined with the incomplete “pre-pandemic” cycle from 2019 to March 2020.

  • The Encuesta Nacional de Salud y Nutrición (ENSANUT) [43] is the Mexican analogue of the NHANES database. The ENSANUT survey methodology, data gathering, and curation are carried out by the Center for Research on Evaluation and Surveys, which is part of the National Institute of Public Health (Mexican Ministry of Health). The database is the product of a systematic effort to provide a trustworthy resource for assessing the status and tendencies of the population's health condition, along with the utilization and perception of health services. Starting in 1988 as the National Nutrition Survey, it was not until 2000 that it became a six-year survey (with some special issues) including health information such as anthropometric measures, dietary habits, clinical history, vaccination, common diseases, and laboratory analysis (in some issues). Similarly to NHANES, several views can be obtained focusing on specific attributes. The ENSANUT dataset used here includes the 2006, 2016, and 2018 cycles.

From both databases, we selected a subset of demographic, medical history, anthropometric, and laboratory variables (see Table 9). Importantly, C-peptide and Glucose2 were available in NHANES for only some cycles: C-peptide only in NHANES cycles III and 1999-2004, and Glucose2 only in NHANES cycles III, 2001-2002, and 2005-2016.

After merging the versions of each database, we obtained an initial raw dataset with N = 224,807 patients. From this, we selected only adult patients (Age \(\ge\) 20 years, N = 172,909). We then performed a data wrangling workflow including the following tasks (see Appendix B for a detailed description):

  1. Data cleansing, consisting in replacing invalid values with zeroes to represent absent values.

  2. Imputation, to assign values to missing but required variable inputs in records that would otherwise be dismissed. Missing values are common in survey data: participants may not answer every question, or laboratory samples may not be analysed. We imputed missing values using the Multivariate Feature Imputation procedure, which infers absent values from the values available in other attributes. The considered variables were Weight, Height, Waist, HbA1c, Glucose1, Glucose2, Insulin, and Age at Diabetes Onset, taking the median value returned by four regression techniques (see Appendix B for details and the sketch after this list).

  3. Selection, to keep only those records meeting the inclusion criteria: a) being a diagnosed patient, b) having OGTT glucose \(\ge\) 200 mg/dL, or c) having HbA\(_{1C}\) \(\ge\) 6.5%. Extreme values, i.e. values lying more than five standard deviations from the attribute mean, were removed for each attribute.

  4. Scaling. Because the selected attributes span different ranges, computations over them are generally biased toward attributes with larger values; scaling is therefore required. We transformed the selected attributes by means of min-max normalization and z-score standardization (see the sketch after this list).
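As a hedged illustration of steps 2 and 4 above, the following sketch combines scikit-learn's multivariate feature imputation with the two scaling options. The four regressors shown are assumed placeholders, since the exact techniques are deferred to Appendix B:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy stand-in for the selected attributes, with ~10% missing values.
rng = np.random.RandomState(0)
X = rng.rand(200, 8) * 100
X[rng.rand(200, 8) < 0.1] = np.nan

# Step 2 (sketch): impute with four regression techniques (assumed choices)
# and take the element-wise median of the four imputed matrices.
estimators = [BayesianRidge(), DecisionTreeRegressor(max_depth=8),
              ExtraTreesRegressor(n_estimators=50, random_state=0),
              KNeighborsRegressor()]
imputations = [IterativeImputer(estimator=e, random_state=0).fit_transform(X)
               for e in estimators]
X_imputed = np.median(imputations, axis=0)

# Step 4: the two scaling alternatives used in the study.
X_minmax = MinMaxScaler().fit_transform(X_imputed)    # min-max normalization
X_zscore = StandardScaler().fit_transform(X_imputed)  # z-score standardization
```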

As a result of the whole dataset construction process, a curated dataset combining NHANES and ENSANUT records was obtained. The process is illustrated in the left panel of Fig. 2. The dataset was fully preprocessed according to the requirements of the study and, at this point, was ready for use in data analysis algorithms. The final dataset comprised a total of 10,077 patient records that were split into a training/validation dataset termed \(D_1\) (N = 2,768) and a hold-out dataset termed Test Dset (N = 7,309). \(D_1\) consisted of the records including values for the C-peptide variable, whereas Test Dset did not include these values.
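A minimal sketch of this split follows, assuming the curated data sits in a pandas DataFrame with a C-peptide column (the frame and column name are ours, for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the curated dataset.
df = pd.DataFrame({"c_peptide": [1.2, np.nan, 0.8, np.nan]})

has_cp = df["c_peptide"].notna()
D1 = df[has_cp]          # records with C-peptide -> training/validation (N = 2,768 in the study)
test_dset = df[~has_cp]  # records without C-peptide -> hold-out (N = 7,309 in the study)
```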

Fig. 2 Main stages of the implemented procedure

Data characterization

The objective of this stage was to characterize the selected instances in the curated dataset. The overall flow is depicted in Fig. 2 (central panel). Since this dataset was not labeled with any group or T2DM subtype, we applied clustering algorithms over selected attributes with the purpose of finding groups of instances in the dataset according to similarities in attribute values. In a preliminary analysis, we explored three algorithms with different clustering approaches: partitional (K-means [44]), hierarchical (agglomerative clustering [45]), and density-based (DBSCAN [46]). Since these preliminary results (not included in this paper) showed meaningful dissimilarities between the DBSCAN and agglomerative clusters and those obtained with K-means, we decided to focus on the latter.

Thus, we applied K-means to group T2DM patients into clusters, relying on the principle that similar patients in a cluster denote a T2DM subtype. We used a fixed number of groups (K = 4) corresponding to the previously found diabetes subtypes [8], with the exception of SAID; we did not take this class into account, considering all patients as GADA negative. The five clinical features previously reported in the literature [8, 30] were taken into account: Age at Diabetes Onset (ADO), Body Mass Index (BMI), Glycated Haemoglobin (HbA\(_{1C}\)), and Homeostasis Model Assessment 2 [47] estimates of beta cell function and insulin resistance (HOMA2-%\(\beta\) and HOMA2-IR, respectively). HOMA2 values are defined by computationally solving a system of empirical equations with software provided by the authors [48]. There are two types of HOMA2 values, one derived from FPG plus C-peptide and the other derived from FPG plus Insulin. We used both types of HOMA2 values, as will be explained later. Hereafter, we will refer to them as CP-HOMA2 and IN-HOMA2, respectively.

As mentioned earlier, dataset \(D_1\) included only those records with C-peptide values (N = 2,768); thus, CP-HOMA2 measures can be computed for those records. Dataset \(D_2\) (N = 680), in turn, consists of the subset of \(D_1\) that only includes patients with fewer than five years since diabetes onset (i.e. AGE − ADO < 5). We carried out a two-stage clustering: first on \(D_2\), and then, in the second stage, the obtained centroids were used to cluster the remaining instances of \(D_1\), i.e. those with five or more years since diabetes onset (the difference set \(D_1-D_2\)). In total, we tested four clustering strategies in the first stage (numbered 1.1 to 1.4) and six in the second stage (numbered 2.1 to 2.6). In both stages we aimed to contrast two overall clustering alternatives: (1) centroid initialization versus de novo clustering; and (2) taking each gender separately versus both genders at once. In the first stage, we also tested the alternative of only assigning instances to the initial centroids (i.e. no iteration) versus assigning and iterating until centroid convergence. Strategies 1.1 to 1.4 are thus defined as follows:

  • Centroid initialization using the centroids provided by Ahlqvist et al. [8]:

    (1.1) Only assigning instances to the initial centroids.

    (1.2) Iterating until reaching centroid convergence.

  • De novo clustering with the repeated K-means procedure:

    (1.3) Each gender separately.

    (1.4) Both genders at once.

For 1.1 and 1.2, we took the centroids reported by Ahlqvist et al. [8]. These centroids are defined by gender; therefore, centroid assignment in 1.1 and 1.2 is performed in this manner. De novo strategies 1.3 and 1.4 used a repeated K-means procedure, which consisted of 51 executions of K-means. This procedure yielded a string with 51 positions, where each position holds one of {0, 1, 2, 3} (the four groups). Hence, each string corresponds to a group assignment pattern for each instance. Then, similarity among strings was compared to constitute the final four groups. In this way, two identical strings mean that those instances were assigned the same groups across the 51 executions. Non-identical strings were grouped with their most similar instances. In all these executions, we used the K-means scikit-learn function with K = 4, 100 randomized centroid initializations (with the k-means++ method), and 300 maximum iterations. A sketch of this procedure is shown below.
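The following is a minimal sketch of the repeated K-means procedure under two simplifying assumptions of ours: the labels of each run are aligned to the first run by Hungarian matching on centroid distances, and the final group of each instance is taken as the majority label across the 51 runs (a simplification of the string-similarity grouping described above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def repeated_kmeans(X, k=4, n_runs=51, seed=0):
    """Run K-means n_runs times and derive a consensus grouping."""
    rng = np.random.RandomState(seed)
    runs, ref_centroids = [], None
    for _ in range(n_runs):
        km = KMeans(n_clusters=k, n_init=100, max_iter=300,
                    random_state=rng.randint(2**31 - 1)).fit(X)
        labels = km.labels_
        if ref_centroids is None:
            ref_centroids = km.cluster_centers_  # first run fixes the label order
        else:
            # Align this run's labels to the reference centroids
            # (assumption: Hungarian matching on centroid distances).
            cost = np.linalg.norm(
                ref_centroids[:, None] - km.cluster_centers_[None, :], axis=2)
            _, col = linear_sum_assignment(cost)
            remap = np.empty(k, dtype=int)
            remap[col] = np.arange(k)
            labels = remap[labels]
        runs.append(labels)
    patterns = np.stack(runs, axis=1)  # one 51-position label string per instance
    # Simplified consensus: most frequent group across the runs.
    final = np.array([np.bincount(p, minlength=k).argmax() for p in patterns])
    return final, patterns
```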

After analyzing the results of the first stage strategies, we selected strategies 1.2 and 1.4 according to intrinsic and extrinsic clustering validation indices (Appendix A). We then moved on to the second stage, computing centroids from strategies 1.2 and 1.4: for both genders, denoted by \(\text {C}_{1.2}\) and \(\text {C}_{1.4}\), and separated by gender, (W)omen and (M)en, denoted by \(\text {C}_{1.2(W)}\), \(\text {C}_{1.2(M)}\), \(\text {C}_{1.4(W)}\), and \(\text {C}_{1.4(M)}\). In the second stage, we also carried out de novo clusterings with the repeated K-means procedure. Here, we included two forms of de novo clustering: in addition to using the CP-HOMA2 parameters, we also tested a clustering using the IN-HOMA2 parameters and scaling the data with Min-Max normalization instead of z-score; importantly, this latter strategy was the only one that implemented these changes. In this manner, the six strategies in the second stage were:

  • Centroid initialization using centroids from the first stage:

    (2.1) With centroids \(\text {C}_{1.2}\).

    (2.2) With centroids \(\text {C}_{1.2(W)}\) and \(\text {C}_{1.2(M)}\).

    (2.3) With centroids \(\text {C}_{1.4}\).

    (2.4) With centroids \(\text {C}_{1.4(W)}\) and \(\text {C}_{1.4(M)}\).

  • De novo clustering with the repeated K-means procedure:

    (2.5) With CP-HOMA2 values.

    (2.6) With IN-HOMA2 values and Min-Max normalization.

Strategies 2.1 to 2.4 used the centroids found in the first stage for dataset \(D_2\) and thus only clustered the remaining instances of \(D_1\). Strategies 2.5 and 2.6 clustered the whole dataset \(D_1\) without taking the first stage results into account. Again, we evaluated the results by means of intrinsic and extrinsic validation indices (a sketch of their computation is shown below), selecting strategies 2.5 and 2.6 as the best performing ones. At the end of the second stage clustering, we obtained two labeled datasets from \(D_1\), named Dset A and Dset B, from the groups obtained with clusterings 2.5 and 2.6, respectively. The matching of groups with T2DM subtype labels was performed by comparing the obtained attribute distribution patterns against those reported in the literature [8, 9, 30], as will be further explained in the Results section.
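As a sketch, the intrinsic and extrinsic indices used here and in the Results (silhouette, Davies-Bouldin, Calinski-Harabasz; ARI, AMI, Fowlkes-Mallows) can be computed with their scikit-learn implementations:

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             calinski_harabasz_score, davies_bouldin_score,
                             fowlkes_mallows_score, silhouette_score)

def intrinsic_indices(X, labels):
    """Cluster quality from the data alone (no reference labels)."""
    return {"SIL": silhouette_score(X, labels),
            "DB": davies_bouldin_score(X, labels),
            "CH": calinski_harabasz_score(X, labels)}

def extrinsic_indices(labels_a, labels_b):
    """Agreement between two clusterings of the same instances."""
    return {"ARI": adjusted_rand_score(labels_a, labels_b),
            "AMI": adjusted_mutual_info_score(labels_a, labels_b),
            "FM": fowlkes_mallows_score(labels_a, labels_b)}
```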

Model development and evaluation

The clustering in the previous stage helped us find out how patients can be grouped into T2DM subtypes; each patient was labeled according to its corresponding T2DM subtype. In this stage, a subset of the dataset was used to train classification algorithms to identify unseen patients from the same dataset, not used for training. We developed classification models along two pathways, one for each annotated dataset (Dset A and Dset B; see the upper-right panel in Fig. 2). On both pathways, we considered seven classification schemes, according to different selections of attributes in the input data. First, we used bootstrapping to validate models on both pathways, and then performed a second validation of the best performing algorithms using stratified 10-fold cross validation. Four classification algorithms were explored: Support Vector Machine, K-Nearest Neighbors, Multilayer Perceptron, and Self-Normalized Neural Networks (see Appendix A for a description). Finally, we used the models obtained in the validation stage to classify subjects from the hold-out dataset.

Classification schemes

We explored how the classification algorithms behave when fed with different input data. The seven classification schemes, denoted S1 to S7, are the following:

  • S1. ADO, BMI, HbA\(_\text {1C}\), CP-HOMA2-%\(\beta\), and CP-HOMA2-IR.

  • S2. ADO, BMI, HbA\(_\text {1C}\), IN-HOMA2-%\(\beta\), and IN-HOMA2-IR.

  • S3. ADO, BMI, FPG, IN-HOMA2-%\(\beta\), and IN-HOMA2-IR.

  • S4. ADO, BMI, HbA\(_\text {1C}\), FPG, and C-peptide.

  • S5. ADO, BMI, HbA\(_\text {1C}\), FPG, and insulin.

  • S6. ADO, BMI, HbA\(_\text {1C}\), HOMA-%\(\beta\), and HOMA-IR.

  • S7. ADO, BMI, HbA\(_\text {1C}\), METS-IR, and METS-VF.

Note that all schemes include ADO and BMI and, with the exception of scheme S3, all also include HbA\(_\text {1C}\). The attributes interchanged among schemes are those related to pancreatic beta cell function and insulin resistance (i.e. HOMA measures and their related input variables: glucose and C-peptide/insulin). Notice that schemes S1 and S2 consist of the same attributes on which Dset A and Dset B were respectively clustered. Scheme S3 is the same as S2 with HbA\(_\text {1C}\) replaced by FPG. Schemes S4 and S5 substitute the HOMA2 measures in schemes S1 and S2 with their respective input attributes. Scheme S6 makes use of the earlier HOMA model [49], which uses simple formulas to approximate beta cell function and insulin resistance (see the sketch below). Finally, scheme S7 applies the Metabolic Scores for Insulin Resistance (METS-IR) [50] and Visceral Fat (METS-VF) [51], which are proposed measures of insulin resistance and intra-abdominal fat content, respectively. Schemes S1, S2, S3, and S7 were implemented elsewhere [9]; here we added schemes S4, S5, and S6.
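For reference, a sketch of the classical HOMA approximations used in scheme S6 (the well-known Matthews et al. formulas in their mg/dL variant); HOMA2 values, in contrast, come from the dedicated HOMA2 software:

```python
def homa_ir(fpg_mgdl: float, insulin_uU_ml: float) -> float:
    """Classical HOMA insulin-resistance index (glucose in mg/dL, insulin in uU/mL)."""
    return fpg_mgdl * insulin_uU_ml / 405.0

def homa_beta(fpg_mgdl: float, insulin_uU_ml: float) -> float:
    """Classical HOMA beta-cell function (%); only meaningful for glucose > 63 mg/dL."""
    return 360.0 * insulin_uU_ml / (fpg_mgdl - 63.0)
```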

Training and validating models

In the validation stage, several models are trained in order to compare them, obtain average metrics, and choose the best ones. This task was carried out using two independent validation processes: bootstrapping and stratified 10-fold cross validation. The former is recommended for obtaining classification models that circumvent overfitting, a common undesired effect that occurs when a model memorizes the training dataset instead of learning to classify; in that case, the statistics obtained during training might not represent the actual performance of the model on unseen data in real scenarios. Bootstrapping helps to evaluate the model by randomly sampling the dataset with replacement to obtain the training data, using the remaining non-sampled data, called out-of-bag data, to test the results. The process is repeated several times, selecting a different random sample each time. We chose to extract 1,000 bootstrap samples, obtaining a distribution of metric values for each of the models. A sketch of this loop is shown below.
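The sketch below illustrates the bootstrap loop under our own naming; `model` is any scikit-learn-style estimator, and `X`, `y` are NumPy arrays of features and labels:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, f1_score

def bootstrap_validate(model, X, y, n_boot=1000, seed=0):
    """Train on resamples with replacement; score on the out-of-bag records."""
    rng = np.random.RandomState(seed)
    n = len(y)
    accs, f1s = [], []
    for _ in range(n_boot):
        idx = rng.randint(0, n, size=n)          # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)    # out-of-bag indices
        if oob.size == 0:
            continue
        fitted = clone(model).fit(X[idx], y[idx])
        pred = fitted.predict(X[oob])
        accs.append(accuracy_score(y[oob], pred))
        f1s.append(f1_score(y[oob], pred, average="weighted"))
    return {"ACC_median": np.median(accs),
            "ACC_95CI": np.percentile(accs, [2.5, 97.5]),
            "F1_median": np.median(f1s),
            "F1_95CI": np.percentile(f1s, [2.5, 97.5])}
```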

The results obtained from bootstrap validation were evaluated by means of classification metrics (see Appendix A) to select the best performing algorithm for each classification scheme. We then performed a stratified 10-fold cross validation process only on the selected algorithms. This process consists in randomly splitting the dataset into ten equal partitions while maintaining a proportional number of records per class. At each iteration of the cross validation process, one partition is selected as the testing set and the remaining nine partitions are combined as the training set. Unlike the bootstrapping procedure, where random sampling is performed at each iteration, in cross validation every model is validated on the exact same patient records, as the splitting is effected only at the beginning.
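A corresponding sketch of the stratified 10-fold setup with scikit-learn (the SVC estimator is illustrative, not prescribed by the source):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def cross_validate_model(X, y):
    """Stratified 10-fold CV: folds preserve per-class proportions."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv,
                             scoring="accuracy")
    return scores.mean(), scores.std()
```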

Final evaluation

After the validation stage, we saved the trained models of the best performing algorithms in terms of accuracy for each of the seven classification schemes. These were obtained from the bootstrapping procedure and thus achieved the best accuracy among the 1,000 runs in each case. Since the hold-out dataset (N = 7,309) did not contain C-peptide values, we classified it with the five trained models from schemes S2, S3, S5, S6, and S7, which did not use this attribute. To obtain a final classification, we applied the majority vote approach, breaking ties (i.e. two pairs of schemes, each pair voting for a different class) by selecting the option of the model that achieved the highest accuracy during validation, as sketched below.
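A sketch of this vote with the stated tie-break; `preds` holds the class predicted by each of the five models for every record and `val_acc` their validation accuracies (array names are ours):

```python
import numpy as np

def majority_vote(preds, val_acc):
    """preds: (n_models, n_samples) integer class predictions."""
    best_model = int(np.argmax(val_acc))     # tie-breaking predictor
    n_samples = preds.shape[1]
    final = np.empty(n_samples, dtype=int)
    for j in range(n_samples):
        counts = np.bincount(preds[:, j])
        winners = np.flatnonzero(counts == counts.max())
        # A tie (e.g. a 2-2-1 vote) defers to the model with the
        # highest validation accuracy.
        final[j] = winners[0] if winners.size == 1 else preds[best_model, j]
    return final
```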

Results

This section describes the results corresponding to the data characterization following the different clustering strategies previously defined, the classification models obtained from validation on Dset A and Dset B using bootstrapping and cross-validation, and the final classification on the test dataset.

Data characterization

For the first stage clustering, Table 2 shows the number of patients assigned to each group and the intrinsic validation values of the four clustering strategies applied to dataset \(D_2\). Overall, strategies 1.1, 1.2, and 1.4 obtained comparable scores and fairly similar distributions of patients among the groups, while clustering 1.3 produced considerably lower values on the validation indices. As might be intuitively expected, allowing K-means to iterate until convergence after assigning the initial centroids performed slightly better than the assign-only counterpart. In terms of the validation values obtained, performing a repeated K-means clustering without initial centroids and without gender separation outperformed the rest of the strategies.

Table 2 Results for first stage clustering. Dataset \(D_2\) (N = 680). SIL silhouette, DB Davies-Bouldin, CH Calinski-Harabasz. Best metric value achieved appears in bold

The comparison among the first stage clustering strategies is provided in Table 3. It can be observed how the similarities among clusterings provide further means to evaluate them. The best validated clustering, 1.4, attained good similarities with clustering strategies 1.1 and 1.2. On the contrary, strategy 1.3 yielded a rather dissimilar grouping with respect to its counterparts, even on this relatively small dataset. In addition to these results, Fig. 3 contains box plots showing the distribution of attributes per group for each of the four implemented clustering strategies. The groups of the four strategies were identified and relabeled to match by observing the corresponding patterns in the plots. The order of attributes per group is the same: ADO, BMI, HbA\(_\text {1C}\), HOMA2-B, and HOMA2-IR. As is apparent from these plots, strategies 1.1, 1.2, and 1.4 also yielded similar clusters. It is also noticeable that the distribution of attributes of clustering 1.3 did not match the rest, particularly in Groups 1 and 3.

Table 3 Comparison metrics for first stage clustering. Dataset \(D_2\) (N = 680). ARI adjusted Rand index, AMI adjusted mutual information, FM Fowlkes-Mallows index. Best metric value achieved appears in bold
Fig. 3 Box plots of the four implemented clustering strategies in the first stage clustering. (A) to (D) correspond to strategies 1.1 to 1.4, in that order

Based on these first stage clustering results, we chose strategies 1.2 and 1.4 and computed centroids both for the whole clustering (strategies 2.1 and 2.3, respectively) and for the clusterings separated by gender (strategies 2.2 and 2.4, respectively). Additionally, we performed repeated K-means procedures for the CP-HOMA2 and IN-HOMA2 attributes, the latter using Min-Max normalization instead of z-score (strategies 2.5 and 2.6, respectively). Table 4 summarizes the results of the second stage clustering. Overall, group proportions were similar across all strategies, with Group 0 being the majority group, with proportions ranging from 39.4 to 43.2%. Groups 1, 2, and 3 showed almost identical proportions in strategies 2.1 to 2.5, ranging from 18.9 to 21.1%, 18.5 to 19.9%, and 19.2 to 20.7%, respectively. On the other hand, clustering 2.6 generated slightly differently populated clusters, with proportions of 17.6, 16.3, and 22.8% in Groups 1, 2, and 3, respectively. In terms of clustering validation indices, both strategies implemented with the repeated K-means procedure outperformed those with centroid initialization. Moreover, clustering 2.6 achieved notably better metric scores than its nearest competitor, strategy 2.5. Also, comparing the strategies with initial centroids, those without gender separation (2.1 and 2.3) obtained better scores than their gender-separated counterparts.

Table 4 Results for second stage clustering. Dataset \(D_1\) (N = 2,768). SIL silhouette, DB Davies-Bouldin, CH Calinski-Harabasz. Best metric value achieved appears in bold

Table 5 shows the comparison metrics obtained for all pairs of the six clustering strategies implemented in the second stage. Interestingly, the pair of strategies (2.1, 2.3) attained the highest similarity scores, despite originating from different first stage centroids. These scores were substantially higher even than those of the pairs (2.1, 2.2) and (2.3, 2.4), which originated from the same first stage clusterings 1.2 and 1.4, respectively. Moreover, the second most similar pair was (2.2, 2.4), which also came from different centroid initializations. Pairs of strategies coming from the same first stage clusterings (i.e. (2.1, 2.2) and (2.3, 2.4)) obtained the third and fourth places in terms of these clustering validity metrics. The remaining clustering pairs that used z-score normalization and CP-HOMA2 values (2.1 to 2.5) reached scores ranging from 0.6538 to 0.7512 (ARI), 0.6122 to 0.6853 (AMI), and 0.7517 to 0.8221 (FM). Finally, all comparison pairs involving clustering 2.6, which used IN-HOMA2 values with Min-Max normalization, obtained lower score ranges: 0.3289-0.3784 (ARI), 0.3380-0.3828 (AMI), and 0.5250-0.5533 (FM).

Table 5 Comparison metrics for second stage clustering. Dataset \(D_1\) (N = 2,768). ARI adjusted Rand index, AMI adjusted mutual information, FM Fowlkes-Mallows index. Best metric value achieved appears in bold

Figure 4 shows the distribution patterns of the involved attributes for the six clustering strategies applied to dataset \(\text {D}_1\). The order of attributes per group is the same: ADO, BMI, HbA\(_\text {1C}\), CP-HOMA2-%\(\beta\), and CP-HOMA2-IR. Importantly, these distribution plots allowed us to assign a T2DM subtype to each cluster by means of visual inspection and direct comparison of the patterns against previous results in T2DM sub-classifications [8, 9, 30]. Indeed, the patterns of attributes obtained within the different clusters matched the distributions previously reported for MARD, MORD, SIDD, and SIRD. In general, the patterns from all six clustering strategies matched those previously reported in the literature closely enough to distinguish and assign a T2DM subtype to each group. Nevertheless, as observable in the plots, there are some slight differences in ranges, interquartile ranges, and outliers when comparing the distributions of attributes in the T2DM subtypes. Among these minor discrepancies, the most appreciable were (see Fig. 4): both HOMA2 values in MARD (Panels A-E compared to F); BMI in MORD (Panels A-D compared to E and F); HbA\(_\text {1C}\) in SIDD (Panels A-E compared to F); and ADO and HbA\(_\text {1C}\) in SIRD (Panels A-D compared to E and F).

From the second stage clustering on dataset \(\text {D}_1\), and considering the validation and comparison metrics, we selected the groups produced by two clustering strategies to constitute two labeled datasets: Dset A and Dset B, from strategies 2.5 and 2.6, respectively. In these datasets, T2DM subtype labels were assigned to patients by matching the patterns of the groups identified by clusterings 2.5 and 2.6 (Panels (E) and (F) in Fig. 4). Both datasets were used in the next stage for developing classification models.

Fig. 4 Box plots of the six implemented clustering strategies on dataset \(D_1\) (N = 2,768). Panels (A) to (F) correspond to strategies 2.1 to 2.6, in that order

Obtaining classification models

Based on the labeled datasets Dset A and Dset B, the four classification algorithms were trained to learn these T2DM subtypes. The models trained with both datasets are presented in Table 6; for brevity, we will refer to them as models A and B. Each entry in Table 6 displays global results that include median accuracies (ACC) and median weighted-averaged F1-scores (F1), with respective 95% CIs computed from bootstrap validation (1,000 samples), for each of the seven data schemes and four algorithms implemented. In our discussion, we will use the term “best performing” algorithm/model to refer to the one that achieved the highest median/mean metric value, regardless of overlapping confidence intervals.

Table 6 Bootstrap validation results. Global classification metrics obtained for models A and B. Median accuracies (ACC) and F1-scores (F1) are presented with respective 95% CI. Best performing model on each scheme appears in bold

Naturally, as it consists of the same attributes on which Dset A was clustered, the best performance among models A was attained by scheme S1, with identical ACC and F1 values of 98.8% (98.1–99.4% CI). Nonetheless, the next best performing scheme (S4) was not far from these metrics, reaching up to 97.8% (96.8–98.6% CI) for both ACC and F1. The remaining (best performing) models A produced ACCs ranging from 75.8 to 82.7% and F1s ranging from 73.9 to 82.3%. The algorithms that yielded the highest ACC were SVM (schemes S2, S3, S5, and S7) and MLP (schemes S1, S4, and S6). Moreover, these algorithms obtained the best and second-best performance in all schemes except S3 and S7, where KNN and SNNN attained the second-best performance, respectively. The SVM kernels that performed best were linear (schemes S1, S3, S4, S6, and S7) and rbf (schemes S2 and S5). There were marginal differences among the K values tested in KNN, with K = 54 and K = 55 achieving the best results in most of the schemes. Evaluating the mean performance of the best models across all seven schemes, the mean ACC and F1 were 85.3% (± 9.2%) and 84.8% (± 9.7%), respectively.

Among models B, the best performing were also those trained on the scheme from which the input dataset was labeled (in this case, scheme S2), with 98.9% (97.9–99.5% CI) for both best ACC and F1. However, in this case, the rest of the models offered considerably closer performance with respect to S2 in all schemes except S3. Indeed, the second to sixth performing models (schemes S5, S4, S1, S6, and S7) achieved ACCs and F1s ranging from 97.9 to 98.5% (i.e. only 1.0 to 0.4% lower than S2), while S3 attained a lower ACC of 89.3% and F1 of 89.2%. In this case, and within all schemes, MLP outperformed the rest of the algorithms, closely followed by SVM, particularly in schemes S6 and S7. Interestingly, the SVM kernel that produced the best results within these models was the polynomial one. Again, the tested K values did not yield substantial differences in performance for models B. The mean performance of the best models over all schemes is given by ACC and F1 values of 97.1% (± 3.4%) and 97.0% (± 3.5%), respectively.

Supplementary Tables S1 and S2 show the corresponding per-class results of models A and B, respectively, in terms of F1-score, Sensitivity, and Specificity. In these tables, each entry displays the metrics for the best performing model (i.e. best ACC) out of the 1,000 bootstrap samples. The corresponding confusion matrices from which these metrics were computed are included in Supplementary Figs. S1 and S2. By observing Table S1 and the corresponding Fig. S1, it can be noticed that the lower performance of models A within schemes S2, S3, S5, S6, and S7 is mainly due to poor Sensitivity for Class 3 (SIRD). This metric was drastically low in schemes S6 and S7, where some algorithms reached values even lower than 40%. This effect is evidenced in the confusion matrices, where most errors come from Class 3 cases being misclassified as Class 0, and vice versa. Interestingly, that was not the case for models B (Table S2). In these models, abnormally low sensitivities occurred only in Class 1 (MORD) and only for SNNN. This result is also explained by observing that many Class 1 records are misclassified as Class 0, 2, or 3 (Fig. S2) in most of the schemes.

The number of records of each class left in the out-of-bag (validation) set is also shown in Tables S1 and S2. It can be observed that the proportion of validation records from the input dataset is \(\sim\) 35–38% in these samples, close to the expected out-of-bag fraction \((1-1/N)^N \approx e^{-1} \approx 36.8\%\). This means that the models were trained using \(\sim\) 62–65% of the distinct records of the input dataset; in other words, 35 to 38% of the training records are repeated in the bootstrap process.

For this reason, and with the purpose of contrasting our results with those reported by [9], we also assessed the performance of classification models A and B using stratified 10-fold cross validation. We selected the best performing algorithm for each scheme from the bootstrap validation stage, as reviewed above (i.e. those appearing in bold in Table 6). Table 7 shows these classification results, computed as the mean values across the 10 folds, for global Accuracy and per-class Precision, Sensitivity, Specificity, and Area Under the Curve (AUC). The overall performance of all models was consistent with the bootstrap results, with minor increases and decreases in ACC. For models A, the same behavior observed in bootstrap is noticeable regarding the low sensitivity for Class 3 in schemes S2, S3, S5, S6, and S7. With respect to the schemes S1, S2, S3, and S7 implemented in [9], our models A achieved comparable performance in S1 but yielded lower metric values in the rest. Conversely, models B produced remarkably competitive performance in all compared schemes. Lastly, Fig. 5 compares the macro-averaged Receiver Operating Characteristic (ROC) curves and displays the corresponding AUCs for both models A and B and for each of the seven implemented schemes. In the case of models A (upper panel), these plots show that schemes S1 and S4 attained the best performance, with considerably higher AUCs than the rest of the schemes. For models B (lower panel), it can be observed that, except for S3, all schemes obtained closely similar curves and AUC values.

Table 7 Stratified 10-fold cross-validation results. Global accuracy (ACC) with per-class precision (PRE), sensitivity (SEN), specificity (SPE), and area under the curve (AUC) are shown for models A and B; and contrasted with those reported by [9]. Each entry of our results corresponds to the mean value obtained across the 10 folds. Only best performing algorithms from bootstrap validation were included (i.e. those appearing in bold from Table 6)
Fig. 5 Macro-averaged Receiver Operating Characteristic curves for each scheme. (A) Models A. (B) Models B

As a final step in the classification stage of our data analysis flow, we tested our trained models on unseen data. The hold-out dataset comprised N = 7,309 patient records that did not include C-peptide values and was thus a disjoint set with respect to the training/validation dataset. As previously explained, we applied a majority vote approach using the best performing models A, considering the five schemes that did not make use of the C-peptide parameter (i.e. S2, S3, S5, S6, and S7). Table 8 shows the number of records classified in each class by the five predictors. Despite some disparities in these amounts (i.e. predictor S5), in general there was consensus among the five predictors. In 77.3% of the observations, all five, or four, of the predictors agreed on the resulting class. Moreover, the cases where three or more predictors agreed amounted to 97% of the observations. The total number of ties (cases where two pairs of predictors voted for two different classes) was 175 (2.4%); these were resolved by simply assigning the class predicted by the predictor that achieved the best performance during the bootstrap validation stage.

Table 8 Number of records classified per class in the hold-out dataset for each of the five predictors considered

Figure 6 depicts our final classification results (Panel A) on the test set in terms of the proportions of each class, separated by gender or including both. For comparison purposes, we also include proportions obtained by the landmark studies [8, 9]. The proportions in Panel B were acquired by classifying our test set using the web tool of Bello-Chavolla et al. [9] with attributes corresponding to our scheme S2. Panel C shows the results reported by Ahlqvist et al. [8] on a dataset of their own (ANDIS, Swedish population, N = 8,980); for these results, we recalculated the number of observations accordingly after eliminating those belonging to the SAID class, which we did not consider. The proportions of classes from our majority vote approach were similar to those of [8], in spite of the fact that the two were obtained from different populations. On the other hand, although the charts in Fig. 6 display different proportions with respect to [9], there was still an overall matching of 57.2%, with 1,152, 938, 1,510, and 578 equally classified observations for MARD, MORD, SIDD, and SIRD, respectively. Of the discrepancies, 90.3% came from observations that were respectively classified by our method/the web tool as MARD/MORD (1,228), MARD/SIDD (790), MORD/SIDD (411), and SIRD/MORD (398).

Fig. 6 Proportions of observations of T2DM classes. (A) Our majority vote scheme with models trained on the labeled \(D_1\) dataset. (B) Classification of the test dataset using the insulin-based HOMA2 model developed by Bello-Chavolla et al. [9]. (C) Clustering results reported by Ahlqvist et al. [8] on their ANDIS dataset

Lastly, Fig. S3 compares the per-class distribution patterns of ADO, BMI, HbA\(_\text {1C}\), IN-HOMA2-%\(\beta\), and IN-HOMA2-IR for the results obtained on the test set in our study (Panel A) and with the aforementioned web classifier (Panel B). Overall, the resemblance of patterns is appreciable for all variables, although there was some variation derived from the disparities in the number of observations per class. Due to the MARD/MORD and MARD/SIDD mismatched classifications, the web classifier yielded a narrower distribution and a higher median for ADO in the MARD class, as this class has fewer instances. Conversely, having more instances classified within them, classes MORD and SIDD present less defined distributions of BMI and HbA\(_\text {1C}\), respectively.

Discussion

In the present study, we have focused on developing and testing classification models for T2DM subtypes. Our methodology consisted of three main stages: dataset construction, data characterization, and classification model development. In view of our results, we consider the following to be our main findings.

First, we produced an enriched, large dataset by fusing information from two representative health databases, NHANES and ENSANUT. Although NHANES includes multi-ethnic information, our dataset predominantly comprised Mexican-American, other Hispanic, and Mexican patients, with approximately 60% of the total records. Thus, we consider that this dataset comprises a fairly representative sample of this population. Our dataset was amongst the largest of those related to the application of unsupervised learning to diabetes [31].

Second, we experimented with additional clustering algorithms, such as density-based and hierarchical methods, and evaluated cluster quality in terms of clustering validation indices. We verified that the tested DBSCAN and agglomerative algorithms did not yield good clusterings compared with K-means according to the intrinsic metrics, which, to our knowledge, has not been reported in previous works. Also, the numbers of observations within groups differed markedly from those obtained with K-means, as corroborated by the extrinsic metrics. Thus, in the reported experiments we adhered to previously proven methodologies based on K-means to characterize T2DM groups, on the basis that this unsupervised method provides the best means to find better defined and more distinctive class boundaries. Additionally, we tested different clustering strategies contrasting centroid initialization, clustering by gender, and a repeated K-means procedure. The latter simple procedure allowed us to deal with cluster variance across executions, which occurs for some observations lying on an inter-cluster boundary. The results obtained in this stage suggest that better defined clusters are obtained by executing de novo K-means clustering without gender separation.

And third, we provided further insights into model performance in the classification of T2DM subtypes. In this regard, we carried out an exhaustive evaluation of four machine learning algorithms using two validation settings. Bootstrap is considered a more statistically robust way of assessing the performance of machine learning models [52]. Nevertheless, both validation modes yielded similar results in terms of the classification metrics applied. Interestingly, models fitted remarkably better to the data that was clustered using Min-Max normalization and IN-HOMA2 measures, obtaining accuracies of 97.1 ± 3.4% (bootstrap) and 97.2 ± 3.2% (cross-validation), averaged over the seven implemented data schemes. These averaged accuracies were 85.3 ± 9.2% (bootstrap) and 85.1 ± 9.8% (cross-validation) for the models trained with z-score standardized data and CP-HOMA2. The SVM and MLP machine learning techniques attained the best performances. Above all, among the seven data schemes we assayed, we found that the HOMA2 constituent variables (used in schemes S4 and S5) provided excellent performance. From our point of view, this result is interesting because it indicates that the HOMA2 variables used for clustering can be replaced with surrogates to train classification models. Indeed, the importance of this finding lies in the fact that parameters such as fasting glucose and C-peptide/insulin are readily available from public databases or health records, whereas HOMA2 measures require licensed software when deploying tools in online production environments (although the authors provide free offline converters). To the best of our knowledge, with the exception of the SNNN models [9, 39], the development and testing of classification models for T2DM subtypes has not been previously reported in the literature.

Finally, our majority vote approach demonstrated a great deal of consensus among the classifiers used on the hold-out dataset. Class proportions were similar to those found in the pioneering study of Ahlqvist et al. [8]. On the other hand, we believe that the disparity between our results and those of the web classifier of Bello-Chavolla et al. [9] is mainly attributable to the standardization step. Indeed, during experimentation we found that this step, which depends on the distribution of the variables in the dataset, greatly impacts the classification results.

Conclusion

We have introduced a new pipeline for the analysis of datasets with the goal of obtaining classifiers for T2DM subtypes. To this end, we described detailed data curation and characterization processes to obtain labeled datasets. Unlike previous work, our analysis included a clustering validation step based on well-known indices that allowed us to evaluate the quality of the clusters. We obtained results consistent with most previous work in terms of subgroup proportions (see Table 1). Among the classifiers we trained, it is remarkable that simpler and faster algorithms such as SVM and MLP fitted the clustered data better than the more involved neural architectures. Also, the results showed that the classifiers learned better from normalized (Min-Max) data than from standardized (z-score) data. The performances obtained with this scaling approach were consistent across the seven data schemes, since the normalized data produced better defined clusters according to the validation indices.

The present work was based on cross-sectional data; thus, we have limited the scope of our analysis to the development of classification tools for T2DM subtypes, without further association with risks of complications, incidence, prevalence, or treatment response. We leave such analyses as future work, with the hope of establishing data-sharing collaborations. Nevertheless, we believe that the study offers valuable insights into the process of developing classification models for T2DM subtypes. Further limitations of the present study are those inherent to the population (i.e. dataset) used for the analysis, the preprocessing steps applied, and the fact that we considered all patients in the dataset as GADA negative (i.e. not considering the SAID class), since this variable was not available in most NHANES and ENSANUT records.