1 Introduction 

Gestational diabetes (GD) is a disease characterized by carbohydrate intolerance that develops under the influence of placental hormones during pregnancy. Recent evidence has demonstrated the importance of appropriately identifying and managing all pregnancies with GD. However, there is no consensus on which pregnant women should be screened to identify these patients. Some international and national authorities [1, 2] suggest that the new criteria will substantially increase the number of GD diagnoses, which may create economic and emotional burdens. For this reason, different approaches have been developed by the WHO (World Health Organization) and other authorities. Currently, the oral glucose tolerance test (OGTT) is the diagnostic method for GD most widely recommended by guidelines.

Many risk factors for GD have been identified. Commonly accepted ones are maternal age, increased BMI, ethnicity, family history of type 2 diabetes, and a history of GD in a previous pregnancy. Additional risk factors include giving birth to a macrosomic baby in a previous pregnancy, poor pregnancy outcomes, glucosuria, polyhydramnios or a fetus large for gestational age, polycystic ovary syndrome, and a history of cardiovascular disease. Despite the negative effects of GD on maternal and perinatal mortality and morbidity and its increasing frequency, no clear consensus has yet been reached on selective screening (only pregnant women in the risk group) versus universal screening (all pregnant women), which remains a current issue in perinatology and neonatology research. For the one-step or two-step oral glucose tolerance test methods used in screening and diagnosis between 24 and 28 weeks, data on the ideal threshold value for improving pregnancy outcomes are still insufficient. A 2015 Cochrane review showed that no specific screening test is optimal [3].

Universal screening brings an unnecessary test load, while in selective screening, because the screening threshold is not standardized and the relationship between threshold values and pregnancy outcomes is unclear, GD is diagnosed more often than necessary [4].

Although GD is estimated to occur in 6–9% of pregnant women, its reported incidence varies between 1 and 22% depending on the population examined and the diagnostic methods used [5]. In addition, an estimated 70% of these women will develop type 2 diabetes within 22–28 years after pregnancy [6, 7].

In a study published in Nature in 2017 [8], reported to be the first in the literature to predict GD with machine learning methods, an accuracy of 62.16% on positive samples was obtained with 438 records. In another study, the data of 650 patients diagnosed with diabetes were divided into three clusters (GD, type 1 diabetes, and type 2 diabetes) based on 14 parameters using the K-means algorithm [9]. Artzi et al. [10] used a machine learning approach to predict GD based on retrospective data from 588,622 pregnant women in Israel (AUC = 0.80). In another study [11], fasting blood glucose for the following year was predicted with AUC = 0.82 by a model developed on the retrospective EHR of 1000 patients from a hospital in China. Wu et al. [12] developed a GD prediction model from first-trimester patient records with machine learning methods (AUC = 0.70–0.77), using training and test sets of 16,819 and 14,992 records, respectively, while another study [13] developed a mobile application for the diagnosis of GD using traditional machine learning algorithms on data from 12,304 pregnant women.

Especially in recent years, studies on machine learning-based diagnosis of GD have gained popularity. However, the biggest reported limitations of these studies are the use of retrospective data and low sensitivity values. Accordingly, the present study was designed prospectively, and the dataset was collected between 2019 and 2021; about 75% of it was used for development and the remainder for validation of the model. The present study aimed to prevent unnecessary OGTT (the gold standard test for GD diagnosis) for patients who are not in the risk group by identifying the patients in the GD risk group.

Within the scope of the study, a glossary of professional terms and their explanations is given in the supplementary file as Table S1.

2 Materials and methods

2.1 Data analysis

The study was supported by The Scientific and Technological Research Council of Turkey (TUBITAK), and the necessary ethics committee approval was obtained. The parameters were determined by three specialist physicians based on the literature. Data were collected prospectively by three specialist physicians, with signed patient consent forms, for each patient who had not previously been diagnosed with diabetes. The criteria for patients included in the study are listed below:

Inclusion criteria:

  • Women

  • Being pregnant

  • Volunteering

Exclusion criteria:

  • Previous diagnosis of diabetes

  • First examination occurring after the end of the 1st trimester

Before the adoption of the IADPSG (International Association of the Diabetes and Pregnancy Study Groups) criteria [14] in Turkey, GD was for a time diagnosed in two steps per TEMD (Turkish Society of Endocrinology and Metabolism) recommendations [15]: pregnant women first received an OGTT with 50 g of glucose, and if the 1-h PG (plasma glucose) was above 140 mg/dl, a 75 g OGTT was then recommended. Pregnant women meeting at least two of FPG (fasting plasma glucose) ≥ 95 mg/dl, 1-h PG ≥ 180 mg/dl, or 2-h PG ≥ 155 mg/dl in this test were considered to have GD [2]. Considering the financial burden of universal screening and the emotional stress it imposes on patients, physicians think that determining the risk of developing GD during pregnancy based on individual risk factors and performing a screening test only for high-risk patients would increase the quality of health care [1, 16].
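
The two-step rule described above can be sketched as a simple decision function (a minimal illustration; the function and variable names are ours, not from the study):

```python
def two_step_gd_screen(pg_1h_50g, fpg=None, pg_1h_75g=None, pg_2h_75g=None):
    """Two-step screening per the TEMD thresholds described above.

    Step 1: 50 g load; a 1-h plasma glucose above 140 mg/dl indicates
    a 75 g OGTT.  Step 2: GD if at least two of FPG >= 95,
    1-h PG >= 180, 2-h PG >= 155 (all in mg/dl).
    """
    if pg_1h_50g <= 140:
        return "negative screen"
    if fpg is None:                      # 75 g OGTT not performed yet
        return "OGTT indicated"
    abnormal = sum([fpg >= 95, pg_1h_75g >= 180, pg_2h_75g >= 155])
    return "GD" if abnormal >= 2 else "no GD"
```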

GD diagnostic test: a 75 g oral glucose tolerance test was requested after an 8–14 h fast, preceded by three days of unrestricted diet and physical activity.

The presence of a single abnormal value in the 75 g OGTT results was considered sufficient to diagnose gestational diabetes. Values were considered abnormal if fasting plasma glucose was 92 mg/dl or above, the 1-h value after glucose loading was 180 mg/dl or above, or the 2-h value was 153 mg/dl or above. In this way, patients with and without a diagnosis of GD were determined [2, 17]; this diagnosis was the outcome of the prediction model.
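
The one-step diagnostic rule used as the model outcome can be expressed directly (an illustrative sketch; the function name is ours):

```python
def iadpsg_gd_diagnosis(fpg, pg_1h, pg_2h):
    """One-step 75 g OGTT rule described above: a single abnormal value
    (FPG >= 92, 1-h >= 180, or 2-h >= 153 mg/dl) diagnoses GD."""
    return fpg >= 92 or pg_1h >= 180 or pg_2h >= 153
```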

The ethical approval is given in Fig. A1 in the Appendix.

In the study, the summary of the follow-up diagram for the participants is given in Fig. 1.

Fig. 1
figure 1

The summary of the follow-up diagram for the participants

The data were collected prospectively, with informed consent, between January 2019 and March 2021 from patients attending Karadeniz Technical University Medical Faculty Farabi Hospital (Trabzon), Recep Tayyip Erdoğan University Medical Faculty Rize Hospital (Rize), and Ordu University Medical Faculty Ordu Hospital (Ordu) in Turkey. Despite the difficulties of the COVID-19 pandemic, the dataset comprises 489 patient records taken in two steps, i.e., at the 1st- and 2nd-trimester visits. Of the 489 patients, 359 were used for development and 130 for validation of the model. Furthermore, resampling and cross-validation methods were also used, and all results were analyzed and compared.

Our dataset contains 489 patient records and 73 variables, with measurements taken at the patients' 1st- and 2nd-trimester visits. Based on the literature, the study size was above average and sufficient for prediction models using machine learning methods.

In the preprocessing step, missing data imputation was implemented for the limited missing values in the dataset. To this end, k-nearest neighbor (KNN) and logistic regression methods were used for continuous and categorical variables, respectively. KNN is a popular method [18, 19] for missing data imputation that uses the similarity between records to infer missing values of continuous variables.
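
A minimal sketch of KNN imputation for continuous variables, using scikit-learn's `KNNImputer` (the study does not specify its implementation or the number of neighbors; the values below are purely illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy continuous data with one missing value (np.nan).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

# Each missing entry is replaced by the mean of that feature over the
# k nearest donor rows, with distances computed on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```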

Logistic regression models are effective imputation models that capture two-way associations between variables well and perform well on individual variables, so they are often used [20, 21]. In the proposed study, multinomial and proportional-odds logistic imputation models were used for nominal and ordinal variables, respectively.
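
For categorical variables, logistic imputation can be sketched by fitting a classifier on the complete rows and predicting the missing categories (illustrative only; scikit-learn's `LogisticRegression` stands in for the multinomial case here, and the data are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two continuous predictors; y is a nominal variable with one gap (-1).
X = np.array([[1.0, 0.2], [1.1, 0.1], [5.0, 3.0], [5.2, 2.9], [1.05, 0.15]])
y = np.array([0, 0, 1, 1, -1])          # -1 marks the missing category

observed = y != -1
model = LogisticRegression().fit(X[observed], y[observed])

# Impute the missing category from the fitted model.
y_imputed = y.copy()
y_imputed[~observed] = model.predict(X[~observed])
```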

2.2 Prediction models for diagnosis of GD

The collected dataset comprises 489 patients, 71 of whom have a GD diagnosis while the remaining 418 do not, making it an imbalanced dataset. In this context, different decision models for the diagnosis of GD were developed using different approaches to increase performance, as shown in Table 1, and all obtained results were evaluated and compared. The schema of the process steps is given in Fig. 2.

Table 1 The explanations of the different approaches used in the development of models
Fig. 2
figure 2

The schema of the study

For the original dataset, 5 models were developed: in 2 of them, SVM and random forest algorithms were used as traditional machine learning methods, while in the remaining 3, a recurrent neural network with long short-term memory (RNN-LSTM), fivefold cross-validation with RNN-LSTM, and weighted fivefold cross-validation with RNN-LSTM were used as deep learning methods.

Cross-validation (CV) is one of the most popular techniques in machine learning for risk estimation, giving a nearly unbiased estimate of the risk from a limited number of samples [22]. Weighted CV (WCV), which has been studied in statistics [23], assigns smaller weights to outliers so as not to over-emphasize their effect on the CV score. It provides a much better estimate than CV, remaining almost unbiased even under covariate shift.
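
As an illustration of fivefold CV on imbalanced labels, stratified folds preserve the GD/non-GD ratio in each split (a sketch with scikit-learn; the study does not state whether stratification was used):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels mimicking the class imbalance (1 = GD, 0 = no GD).
y = np.array([1] * 10 + [0] * 40)
X = np.zeros((len(y), 3))               # placeholder features

# Stratification puts the same positive fraction into every test fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos_counts = [int(y[test].sum()) for _, test in skf.split(X, y)]
```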

As seen in Table 1, models were developed for the decision support system using several approaches to improve performance and allow a comprehensive evaluation. In scenarios 1, 2, and 3, respectively, the original dataset was used with traditional machine learning (SVM, random forest), deep learning (RNN-LSTM) with fivefold cross-validation, and deep learning (RNN-LSTM) with weighted fivefold cross-validation. In scenarios 4, 5, 6, and 7, the resampled dataset obtained by bootstrap methods (random oversampling, resampling) was used for training and validation, the original test data were retained, and traditional machine learning methods (SVM, random forest), deep learning (RNN-LSTM), and Bayesian optimization were applied. Finally, in scenario 8, the original dataset was used with deep learning (RNN-LSTM) and Bayesian optimization.
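
The random oversampling used to balance the training data in scenarios 4–7 can be sketched with bootstrap resampling of the minority class (illustrative only; the class sizes mirror the 71/418 split reported for the full dataset, and the features are random placeholders):

```python
import numpy as np
from sklearn.utils import resample

seed = 42
X = np.random.default_rng(seed).normal(size=(489, 5))
y = np.array([1] * 71 + [0] * 418)      # 71 GD vs. 418 non-GD

X_min, X_maj = X[y == 1], X[y == 0]
# Bootstrap (sampling with replacement) the minority class up to the
# majority-class size, then stack into a balanced training set.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                    random_state=seed)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
```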

For scenarios 2 and 3, hyperparameter tuning was implemented and many models were developed to obtain the best model, while Bayesian optimization (BO) was used to obtain the best models for scenarios 6, 7, and 8. In addition, to detect underfitting and overfitting, AUC-ROC, accuracy, and loss curves were analyzed for all developed models.

In summary, we developed many models for all approaches and compared the results using statistical metrics such as sensitivity, specificity, AUC (area under the curve), and F1-score; all details are given in Sect. 3. The models were developed on the training set; for validation, a test set was used to evaluate model performance by comparing the model predictions with the actual results of the test set obtained by OGTT.

$$\mathrm{Sensitivity\;}\left(\mathrm{Recall}\right)=\frac{TP}{TP+FN}$$
(1)
$$\mathrm{Specificity}=\frac{TN}{TN+FP}$$
(2)
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
(3)
$$\mathrm{F1}\text{-}\mathrm{score}=\frac{2TP}{2TP+FP+FN}$$
(4)
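
Formulas (1)–(4) computed from confusion-matrix counts, as a quick reference implementation:

```python
def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and F1-score per Eqs. (1)-(4)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```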

In these models, binary cross-entropy was used for the computation of loss given in the formula below.

$$\mathrm{Loss}=-\frac{1}{output\;size}\sum\nolimits_{i=1}^{output\;size}\left[{y}_{i}\,\mathrm{log}\,\widehat{{y}_{i}}+\left(1-{y}_{i}\right)\mathrm{log}\left(1-\widehat{{y}_{i}}\right)\right]$$
(5)

where \(\widehat{{y}_{i}}\) is the i-th scalar value in the model output, \({y}_{i}\) is the corresponding target value, and the output size is the number of scalar values in the model output.
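Equation (5) written as a direct NumPy computation (a sketch; the clipping added here, as in framework implementations, keeps the logarithms finite and is our addition):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, Eq. (5); predictions are clipped
    away from 0 and 1 for numerical stability."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
```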

All methods and computations were implemented in Python, an open-source, interpreted, interactive, object-oriented language suitable for a wide variety of applications.

3 Results

The descriptive statistics for continuous and categorical variables are given in Tables 2 and 3, respectively. Furthermore, the list of all variables, box plots for the continuous variables, and histograms for the categorical variables of the dataset are given in the supplementary file as Table S2, Fig. S1, and Fig. S2, respectively.

Table 2 The descriptive statistics for continuous variables
Table 3 The descriptive statistics for categorical variables

In this section, we present the results and the comparison of the different models developed using the approaches described in Sect. 2. The performance results of the best model of each scenario (defined in Table 1) are given in Table 4.

Table 4 The performance results of the best model of each scenario

As seen in Table 4, the best performance was obtained by the prediction models of scenarios 6, 7, and 8, where RNN-LSTM was used with BO. These three prediction models are all highly successful and can be used to diagnose GD at the end of a patient's 2nd-trimester visit without an OGTT, assisting the physicians. However, scenario 8 was proposed as the best model because it has the highest specificity for identifying patients not in the risk group, thereby preventing unnecessary OGTTs.

For scenario 7, the loss curve and AUC_ROC graphs are given in Fig. 3.

Fig. 3
figure 3

(Left) loss function, (right) AUC_ROC graphs of the proposed model for diagnosis of GD

4 Discussion

Many studies have noted that one of the most useful ways to handle a class-imbalanced dataset is bagging (bootstrap aggregating), provided that each base classifier is trained independently so that specific imbalance problems do not affect all base classifiers [24]. Bootstrap is a statistical random resampling method with replacement; its use has grown since the 1990s with the development of computers, given its need for intensive computation [25]. In the simulation study by Özdemir et al. [26], three bootstrap repetition numbers, B = 600, B = 1000, and B = 2000, were tried for a sample size of n = 20, and the resulting type 1 error values were compared. The 600-repetition results fell within the range (0.045, 0.055) determined by Bradley [27], so 600 repetitions were suggested for brevity of the procedure. In the second part of that study, real datasets were used, and 600 or 2000 repetitions were recommended depending on the trimming percentage. Therefore, in the proposed study, we also developed diagnosis models for our imbalanced dataset using bootstrap methods with traditional machine learning methods (SVM, random forest) and with RNN-LSTM as a deep learning method.

Deep learning methods can automatically learn multiple levels of representation from raw input data, without requiring domain knowledge or manually coded rules, in contrast to traditional machine learning methods. RNN is one of the most popular deep learning methods; it can map the entire history of previous inputs to each output, which is useful for sequential data. It is therefore well suited to sequential health data taken at different visit times, using the sequence of clinical data for each patient and predicting the output from the relationships between time points [28, 29]. Long short-term memory (LSTM) is an enhanced variant of RNN consisting of connected subnetworks acting as memory blocks, designed to deal with the long-term dependency problem, also known as the vanishing or exploding gradient problem.
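
The memory-block mechanics described above can be seen in a single LSTM cell step, written out in NumPy (purely illustrative of the gate structure; the study's actual RNN-LSTM architecture is not specified at this level):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) biases; rows ordered as [input, forget, cell, output]."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])                 # input gate
    f = sigmoid(z[H:2 * H])             # forget gate
    g = np.tanh(z[2 * H:3 * H])         # candidate cell state
    o = sigmoid(z[3 * H:4 * H])         # output gate
    c = f * c_prev + i * g              # memory cell carries long-term state
    h = o * np.tanh(c)                  # hidden state passed to the next step
    return h, c

# One step on random data: input dimension D = 3, hidden size H = 4.
rng = np.random.default_rng(0)
D, H = 3, 4
h, c = lstm_cell_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                      rng.normal(size=(4 * H, D)),
                      rng.normal(size=(4 * H, H)),
                      np.zeros(4 * H))
```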

Hyperparameter optimization plays a primary role in training a deep neural network and developing a model. To this end, we used Bayesian optimization (BO), which can automatically optimize the hyperparameters of the prediction models to develop the best models. BO learns and selects the best hyperparameter sets from distributions fitted to the fitness scores of previous iterations, requiring fewer function evaluations than classical optimization methods [30, 31].
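
A toy sketch of the BO loop (a Gaussian-process surrogate with an expected-improvement acquisition, minimizing a 1-D function; the study's actual BO implementation and search space are not specified, so everything below is illustrative):

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel between 1-D point sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def bayes_opt(f, lo, hi, n_init=4, n_iter=10, seed=0):
    """Minimize f on [lo, hi]: fit a GP to the evaluations so far, then
    pick the next point by expected improvement over a candidate grid."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, n_init)
    y = np.array([f(x) for x in X])
    grid = np.linspace(lo, hi, 201)
    for _ in range(n_iter):
        yn = (y - y.mean()) / (y.std() + 1e-12)      # standardized targets
        Kinv = np.linalg.inv(rbf(X, X) + 1e-8 * np.eye(len(X)))
        ks = rbf(grid, X)
        mu = ks @ Kinv @ yn                          # GP posterior mean
        var = np.clip(1.0 - np.sum((ks @ Kinv) * ks, axis=1), 1e-12, None)
        sd = np.sqrt(var)
        z = (yn.min() - mu) / sd                     # improvement in std units
        Phi = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])
        phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
        ei = sd * (z * Phi + phi)                    # expected improvement
        x_next = grid[np.argmax(ei)]
        if np.any(np.isclose(X, x_next)):            # avoid duplicate points
            x_next = rng.uniform(lo, hi)
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    b = int(np.argmin(y))
    return X[b], y[b]

# Toy objective with its minimum at x = 2.
best_x, best_f = bayes_opt(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```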

In this section, we discuss and compare the results of the models for each scenario. The best performance was obtained using deep learning with Bayesian optimization. For the original dataset, the best model was the one defined in scenario 8, using RNN-LSTM with Bayesian optimization, giving 95% sensitivity and 99% specificity.

Among the traditional machine learning methods, the best model is that of scenario 5, using the random forest method, with 69% sensitivity and 92% specificity (Table 4). Using bootstrap resampling to balance the training data, the best models achieved approximately 97% sensitivity, 98% specificity, and 98% AUC for scenarios 6 and 7, using RNN-LSTM with Bayesian optimization (Table 4).

Similar studies that used machine learning methods for the diagnosis of GD reported that, despite large sample sizes, the main limitation of their prediction models was reliance on retrospective electronic medical data with inherent biases; the obtained AUC values were 0.80, 0.82, and 0.70–0.77, respectively [10,11,12]. Furthermore, another study on GD diagnosis [13] noted that deep learning algorithms are very popular, especially for large datasets, but used 9 traditional machine learning algorithms; as a result, although high specificity values were obtained, the sensitivity values were very low (36–43%).

Thus, we can say that deep learning and Bayesian optimization contribute significantly to improving performance. The proposed study stands out as a novel effort to overcome the limitations reported in the literature for the diagnosis of GD, namely retrospective data and low sensitivity, by creating the dataset prospectively and using deep learning algorithms, and it obtained very satisfactory results.

In addition, on prospectively collected data unprecedented in the literature, we present a comprehensive study covering many machine learning approaches for developing a GD diagnosis model.

5 Conclusion

This study performed a comparative analysis of machine learning and deep learning algorithms using prospective data for the prediction of GD. Eight scenarios were presented and compared, based on SVM, RF, and RNN-LSTM methods applied to the original and resampled datasets. The results showed that the most effective prediction model for GD was developed using RNN-LSTM with Bayesian optimization, which gave 95% sensitivity and 99% specificity on the original dataset; this model has the highest specificity for identifying patients not in the risk group, thereby preventing unnecessary OGTTs.

The developed model can be used at the end of the 2nd-trimester visit to determine whether a pregnant woman is at risk of GD, avoiding unnecessary OGTTs. In a future study, the proposed prediction model can be deployed as a prototype in the clinic to increase the validation sample size and be tested on more patients; as a result, the model can be updated if required.