Introduction

Gestational diabetes mellitus (GDM) is a condition where women without a previous diagnosis of diabetes exhibit abnormal blood glucose levels during pregnancy [1, 2]. It is one of the most common pregnancy complications worldwide [3, 4], affecting up to 14 million women annually [5, 6]. The prevalence of GDM is increasing globally due to changes in lifestyle, increasing rates of maternal obesity [7,8,9], and evolving diagnostic criteria. In 2021, according to the International Diabetes Federation, the estimated pooled standardized prevalence of GDM globally was 14.0%, and regionally was 27.6% in the Middle East and North Africa, 20.8% in South-East Asia, 14.7% in Western Pacific, 14.2% in Africa, 10.4% in South America and Central America, 7.8% in Europe, and 7.1% in North America and the Caribbean [2].

Often, blood glucose levels associated with GDM will become normal after delivery; however, these women remain at high risk of developing postpartum metabolic abnormalities such as glucose intolerance and type 2 diabetes mellitus (T2DM). According to recent literature, 12.3 to 60.0% of pregnant women who had GDM will develop some form of glucose intolerance up to 15 years postpartum [10,11,12,13,14] which increases to 70.0% 28 years after pregnancy [11, 15], although this varies in different populations and ethnic groups. Hence, women with a history of GDM have a greater than sevenfold risk of developing postpartum glucose intolerance than those who were normoglycemic [5, 16]. For this reason, the American College of Obstetricians and Gynecologists (ACOG) and the American Diabetes Association (ADA) recommend postpartum screening of all mothers who had GDM from 4 to 12 weeks postpartum for timely intervention [17, 18].

Previous studies have reported a range of prognostic factors associated with risk of developing postpartum glucose intolerance after GDM, which include demographic and clinical factors, antepartum laboratory results, and metabolic factors. For example, factors including age, increased parity, higher pre-conception body mass index (BMI), family history of diabetes, insulin therapy during pregnancy, degree of hyperglycemia during pregnancy (higher area under the curve (AUC) of glucose, higher fasting plasma glucose (FPG)), and impaired pancreatic β-cell function were consistently found to be associated with postpartum glucose intolerance [11, 13, 19,20,21,22,23]. Abnormal findings on a variety of antepartum oral glucose tolerance tests (OGTTs) [24,25,26,27,28] were also reported to be associated with a high risk of postpartum glucose intolerance (e.g., low insulinogenic index II levels on the antepartum 75-g OGTT (42)). More than 60 genetic factors have been identified in association with T2DM. Given that women with GDM often have a family history of T2DM, it has been found that some genetic variants of T2DM are also associated with early or late postpartum glucose intolerance among women who had GDM [29,30,31,32,33]. In addition, a range of specific metabolic biomarkers including amino acids (branched-chain amino acids), hexose, lipids (linoleic acid, phospholipids, lysophosphatidylcholines, acylcarnitines, sphingomyelins (i.e., SM (OH) C14:1)), p-cresol sulfate, and glycocholic acid have also been reported as predictive for postpartum glucose intolerance among women who had GDM [34,35,36,37,38].

Risk prediction models have the advantage of identifying women who are at high risk of developing glucose intolerance after GDM with greater accuracy than single markers and in a timely manner (e.g., years before development). This enables women and their healthcare providers to ensure ongoing screening and implementation of early prevention strategies to optimize health outcomes. Risk prediction models can be applied to each woman at any time, which can be important as it has been shown that the majority of women (even those at high risk of postpartum T2DM) did not attend postpartum screening for glucose intolerance [39, 40]. Since glucose intolerance is well known to be effectively managed with lifestyle modification [41, 42], early identification of these at-risk women and more focused ongoing screening may prevent T2DM.

Prediction models that have strong predictive ability and are validated, generalizable, and based on easily accessible variables are required to effectively reduce postpartum T2DM risk. A systematic review is needed to aid clinicians in selecting postpartum T2DM risk prediction tools and to summarize all available prediction models for researchers. Therefore, this systematic review aims to summarize and critically evaluate the reporting quality, methodological characteristics, and risk of bias of studies reporting prediction models for postpartum glucose intolerance developing after GDM.

Methods

This study was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [43] and the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) [44]. The protocol for this systematic review is registered at the international prospective register of systematic reviews (PROSPERO; CRD42022327239).

Formulating the Review Question and Protocol

The review question was formulated based on the PICOTS framework (population, intervention, comparison, outcome, time, and settings) as recommended by the CHARMS checklist. The study protocol was developed by considering the rationale, objectives, design, methodology, and statistical considerations of the systematic review (Table 1).

Table 1 PICOTS of the review question

Main outcome(s)

The main outcome of interest for this study is the development of glucose intolerance in women with a history of GDM. This encompasses metabolic conditions such as T2DM, pre-diabetes (impaired fasting glucose (IFG) and impaired glucose tolerance (IGT)) which developed within 20 years postpartum. These are defined and identified by fasting plasma glucose concentrations and/or OGTT results according to the World Health Organization criteria [45], national or regional Diabetes Association diagnostic criteria, or specified local criteria. Patients with IFG and/or IGT are now referred to as having “pre-diabetes” indicating the relatively high risk for the development of T2DM in these patients.

Eligibility Criteria

Studies that developed prediction models for the risk of postpartum glucose intolerance among women who had GDM were eligible, worldwide and in all settings, including hospitals; primary, secondary, and tertiary care; and community-based settings. We included both prospective and retrospective cohort prognostic model studies. We did not restrict studies by ethnic origin or parity.

Studies with no original data (meetings, editorials, letters, narrative reviews, and commentaries), studies that were performed with a cohort of women with T2DM before pregnancy, and studies that were published in languages other than English were excluded.

Search Strategy and Screening

The search was conducted on May 21, 2022 across eight databases: Ovid Medline, Ovid Embase, Ovid Emcare, Scopus, Web of Science, CINAHL, Maternity & Infant Care Database (MIDIRS), and Global Health (1910 to 2022 Week 18); the search strategy is attached in Appendix S1. We also manually searched the references of the selected articles to identify additional eligible studies.

Studies identified through database searching were imported into Covidence, a web-based software platform developed by an Australian not-for-profit SaaS enterprise, for title and abstract screening. Title and abstract screening and full-text reviews were done by two independent reviewers (YB and DH) based on the aforementioned eligibility criteria, and disagreement was resolved by discussion.

Critical Appraisal

The CHARMS tool was applied to assess the methodological quality and relevance of studies. The source of data used for prediction model development was assessed. Participants’ selection (method, setting, inclusion, and exclusion criteria); definition, clarity, consistency of outcome of interest, and candidate predictors’ assessment used; sample size used for prediction model development; missing data handling; methodologies used for model development, performance measurement, and model evaluation; and interpretation of the results were assessed by the checklist.

Risk of Bias (Quality) Assessment

Two researchers (YBM, DWH) assessed the risk of bias. In abstract and full-text screening, discrepancies were resolved by discussion, and consensus was reached on all discrepancies. Assessment of risk of bias and model applicability was conducted using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). This involves the assessment of four domains (participants, predictors, outcome, and analysis) to cover key aspects of prediction model studies. Under these four domains, there are 20 signaling questions overall. Domains were rated as “Low,” “High,” or “Unclear.” A low rating indicates a low risk of bias, a high rating indicates the presence of bias, and unclear was used when there was insufficient reported information to decide on risk. The overall risk of bias was graded as low risk when all domains were considered low risk, and high risk when at least one of the domains was considered high risk. The first three domains (participants, predictors, outcome) were also rated for concern regarding applicability (low, high, or unclear) to the systematic review question. Concerns regarding applicability were rated similarly to risk of bias, but without signaling questions (Table S1).
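The overall grading rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the reviewers' actual workflow: the function name `overall_risk_of_bias` and the dictionary layout are hypothetical, and the handling of a study with no high-risk domain but at least one unclear domain (returned here as "unclear") is an assumption, since the text defines only the low and high cases.

```python
from typing import Dict

# The four PROBAST domains named in the text.
DOMAINS = ("participants", "predictors", "outcome", "analysis")
ALLOWED = {"low", "high", "unclear"}


def overall_risk_of_bias(ratings: Dict[str, str]) -> str:
    """Combine four domain-level ratings into one overall judgement.

    Rule taken from the text: all domains low -> "low";
    at least one domain high -> "high".
    Assumption (not stated in the text): anything else -> "unclear".
    """
    values = [ratings[domain].lower() for domain in DOMAINS]
    for value in values:
        if value not in ALLOWED:
            raise ValueError(f"unexpected rating: {value!r}")
    if "high" in values:
        return "high"
    if all(value == "low" for value in values):
        return "low"
    return "unclear"  # assumed fallback for mixed low/unclear ratings


# Hypothetical study: only the analysis domain is rated high.
example = {
    "participants": "low",
    "predictors": "low",
    "outcome": "low",
    "analysis": "high",
}
print(overall_risk_of_bias(example))
```

Because a single high-risk domain is enough to grade the whole study as high risk, weaknesses in the analysis domain alone can dominate the overall picture.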

Data Extraction (Selection and Coding)

A data extraction grid was created including all relevant variables and key elements in the Transparent Reporting of a multivariable prediction model of Individual Prognosis Or Diagnosis (TRIPOD) checklist; it was pilot tested using sample articles and modified accordingly. The following variables were extracted by two authors independently: country, source of data, participants (ethnicity, maternal characteristics), outcomes to be predicted, candidate predictors (index tests), sample size, missing data, model development, model performance measurements (calibration, discrimination), model evaluation, and results (Appendix S2).

Strategy for Data Synthesis

Data synthesis was performed using thematic and context analysis to summarize the methodologies used to develop the prediction studies, participant selection, predictor variable selection and collection, outcome determination, analysis used for model development, variables included in the final model, and performance measures used. Appropriate data was presented in the form of summary tables and, where relevant, graphical representations of the data. Where there was a lack of homogeneity in methods used to develop the prediction models and different sets of predictors used to develop different prediction models, meta-analysis was not performed as merging these models may lead to highly correlated data and inflated estimates [46, 47]. If meta-analysis was not indicated, then qualitative evaluation and synthesis of estimates were applied to summarize and appraise the available model estimates.

Results

Main Characteristics of Included Studies

The systematic review process is presented in the flow chart in Fig. 1. The electronic search method yielded 3455 unique articles, of which 3402 articles were excluded on title and abstract screening, leaving 53 studies to be assessed by full text. Following full-text review, a further 38 articles were excluded. Finally, 15 studies reporting 15 risk prediction models were identified and included in this review. All included studies were model development studies, with no external validation studies found. Included models were developed in six different countries or regions: four in the USA [35, 48••, 49••, 50], six in Europe [24, 36, 51, 52••, 53, 54], two in Australia [37, 55••], and one each in Asia [56], Canada [57], and Ethiopia [58••] between 1995 and 2022.

Fig. 1 PRISMA flow diagram showing the systematic review process

The primary outcomes of included studies were reported as follows: T2DM (n = 10) [35,36,37, 49••, 50, 51, 53, 55••, 56, 57], glucose intolerance (n = 4) [24, 52••, 54, 58••], and IGT (n = 1) [48••]. Muche et al. [58••] diagnosed glucose intolerance as postpartum pre-diabetes (IFG: FPG 100–125 mg/dL; IGT: 2-h plasma glucose in 75 g OGTT 140–199 mg/dL) or diabetes (FPG > 126 mg/dL, or 2-h plasma glucose > 200 mg/dL in OGTT or random plasma glucose > 200 mg/dL) (Table 2). Bartakova et al. [59] diagnosed postpartum T2DM/prediabetes based on the WHO criteria: FPG ≥ 7 mmol/L alone or 2-h glucose after a 75-g glucose load ≥ 11.1 mmol/L for T2DM, and FPG 5.6–6.9 mmol/L or 2-h glucose after a 75-g glucose load 7.8–11.0 mmol/L for prediabetes. Bengtson et al. [48••] diagnosed impaired glucose tolerance as HbA1c ≥ 5.7%. Kondo et al. [24] diagnosed glucose intolerance with 75-g oral glucose tolerance tests. Among the ten studies that reported T2DM as a primary outcome, more than half used ADA criteria for diagnosis [60] (Table S2).
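Because the included studies report cut-offs in both mg/dL and mmol/L, a short sketch may help show how the two sets of thresholds line up. The Python below converts glucose from mg/dL to mmol/L (using the molar mass of glucose, about 180.16 g/mol) and applies the mmol/L cut-offs quoted above for the WHO-based criteria. The function names are hypothetical, and rounding to one decimal place before applying the cut-offs is an assumption made here so that 126 mg/dL maps onto the 7.0 mmol/L threshold; this is an illustration only, not a clinical tool.

```python
from typing import Optional

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass ~180.16 g/mol).
MGDL_PER_MMOLL = 18.016


def mgdl_to_mmoll(mgdl: float) -> float:
    """Convert a glucose concentration from mg/dL to mmol/L."""
    return mgdl / MGDL_PER_MMOLL


def classify_glucose(fpg_mmoll: float,
                     two_hour_mmoll: Optional[float] = None) -> str:
    """Classify glucose status with the mmol/L cut-offs quoted in the text.

    T2DM: FPG >= 7.0 or 2-h OGTT glucose >= 11.1.
    Prediabetes: FPG 5.6-6.9 or 2-h OGTT glucose 7.8-11.0.
    Values are rounded to one decimal first (an assumption of this sketch).
    """
    fpg = round(fpg_mmoll, 1)
    two_hour = None if two_hour_mmoll is None else round(two_hour_mmoll, 1)
    if fpg >= 7.0 or (two_hour is not None and two_hour >= 11.1):
        return "diabetes"
    if fpg >= 5.6 or (two_hour is not None and two_hour >= 7.8):
        return "prediabetes"
    return "normal"


# The mg/dL thresholds quoted in the text, expressed in mmol/L.
for mgdl in (100, 126, 140, 200):
    print(mgdl, "mg/dL is about", round(mgdl_to_mmoll(mgdl), 1), "mmol/L")
```

After rounding, the mg/dL thresholds of 100, 126, 140, and 200 correspond to 5.6, 7.0, 7.8, and 11.1 mmol/L, which matches the mmol/L cut-offs listed above.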

Table 2 Prediction temporality and sample size with respective events in included models

Predictors in the Final Model

A list of predictors included in the final model is presented in Table 3. The number of risk predictor variables included in the models ranged from three to seven. Age (n = 6), FPG level during pregnancy (n = 8), and BMI (n = 11) were the three most common predictor variables included in the final model to predict postpartum glucose intolerance. Four models included biochemical variables such as branched-chain amino acids (BCAAs; Val, Leu, Ile), hexoses, and lipid metabolites including sphingomyelin (SM (OH) C14:1), cholesteryl palmitic acid (CE(16:0)), non-esterified fatty acids (NEFA(22:4)), triglycerides and their fatty acid combination (TAG 48:2 FA 16:1, TAG 54:0 FA 16:0, TAG 50:1 FA 16:0), cholesteryl icosatetraenoate (CE(20:4)), phosphatidylethanolamine (PE(P-18:0/18:1), PE(P-36:2)), phosphatidylcholine (PC ae C40:5), and phosphatidylserine (PS 38:4) [35,36,37, 55••]. Five models included a family history of diabetes mellitus, and four models included a 2-h plasma glucose level during pregnancy. Postnatal fasting glucose level, postnatal 2-h plasma glucose level, insulin therapy during pregnancy, and genetic factors were other common prognostic determinants considered for model building. GDM history in a prior pregnancy, GDM diagnosis at < 24 weeks gestation, personal history of hypothyroidism, instrumental delivery, lactation, ethnicity, antenatal depression, blood pressure, genetic risk factors, and insulinogenic index/fasting immunoreactive insulin were each included in only one model (Fig. 2).

Table 3 Prognostic determinants included in the final model and their respective predictive performance

Fig. 2 Predictors commonly utilized for developing prediction models to predict postpartum glucose intolerance in studies included in this systematic review

Predictive Performance

Traditional statistical models were common, with only three studies applying machine learning (Table 4). The predictive performance of each study model is summarized in Table 3. Among the 13 studies that reported the area under the curve, values ranged from 0.66 to 0.92. However, none were externally validated. Only a few models were validated internally [35,36,37, 49••, 51, 55••]. Calibration was reported only for some models, using the Hosmer–Lemeshow test, a calibration plot, or the calibration slope [24, 51, 53].
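For readers less familiar with the discrimination measure reported above, the area under the curve can be read as the probability that a randomly chosen woman who developed the outcome received a higher predicted risk than a randomly chosen woman who did not. The sketch below computes it directly from that definition by comparing every event/non-event pair. The function name `auc` and the toy data are hypothetical and are not taken from any included study.

```python
from typing import Sequence


def auc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    scores: predicted risks; outcomes: 1 = event, 0 = no event.
    Each event/non-event pair counts 1 if the event has the higher
    score, 0.5 if the scores tie, and 0 otherwise.
    """
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy example (hypothetical predicted risks for four women).
predicted = [0.8, 0.4, 0.6, 0.2]
observed = [1, 1, 0, 0]
print(auc(predicted, observed))  # 3 of 4 pairs are concordant -> 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. Discrimination says nothing about calibration, which is why the two are reported separately.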

Table 4 Data analysis method and modes of model presentation of the studies

Risk of Bias Assessment and Meta-analysis

The risk of bias and applicability assessment results are shown in Table S1. Overall low risk of bias was present in only two (2/15) studies. Three domains, namely participant selection, predictor assessment, and outcome assessment, resulted in a low risk of bias for most studies. A high risk of bias for participant selection was mainly due to controversial inclusion or exclusion criteria (Table S3). In addition, two studies selected participants for inclusion based on a single question, “Have you ever been told that you had a high sugar level during pregnancy?”, which is a less sensitive approach that potentially introduces serious bias and may compromise the transportability of the model [49••, 57]. The answer to the question “Were predictor assessments made without knowledge of outcome data?” resulted in a high risk of bias because the assessment of predictors was performed in retrospect after the outcome was known and/or there was no statement showing whether assessors were blinded or not. As most models had fewer than 10 events per variable (EPV) or the EPV was unable to be extracted (Table 2), this was scored as a high risk of bias for most models.

However, in the analysis domain, most studies had a high risk of bias. In some studies, continuous and categorical variables were not handled appropriately. For instance, although keeping continuous variables continuous is recommended in prediction model development, Man (2021) categorized age into six categories and BMI into three categories. In some studies, the analysis did not include all enrolled participants and/or did not report on those who were excluded. Additionally, participants with missing data were not adequately addressed and/or were not reported. For example, Ignell (2016), which relied on prospective data collection, experienced a significant loss to follow-up. The selection of predictors based on univariable analysis was applied and/or not reported [48••, 58••], and complexities in the data were not accounted for appropriately and/or not reported. Relevant model performance measures were not evaluated appropriately and/or not reported. Furthermore, model overfitting and optimism in model performance were not accounted for and/or not reported. Predictors and their assigned weights in the final model did not correspond to the results from the reported multivariable analysis and/or were not reported.

Meta-analysis was not possible due to the lack of homogeneity in methods used to develop the prediction models, different sets of predictors used to develop different prediction models, and heterogeneity in the prediction time interval ranging from 1 to 20 years postpartum.

Discussion

This systematic review of risk models predicting postpartum glucose intolerance among women who had GDM identified 15 models; however, none were externally validated and fewer than half were internally validated. No models had the same set of prognostic factors, and factors included a range of demographic, clinical, and biomarker factors. The most frequent factors were BMI (measured pre-pregnancy or early pregnancy), fasting glucose concentration during pregnancy, maternal age, and family history of T2DM. Some studies included only traditional clinical risk factors (e.g., age, BMI, pregnancy fasting OGTT, and postnatal fasting OGTT), while others included biochemical variables and genetic factors. Among traditional risk factors, the most common potentially modifiable factor was BMI (pre-pregnancy/early pregnancy). Predictive performances were suggested to be above chance (with AUC ≥ 0.66); however, performance was difficult to evaluate as all included studies had a high risk of bias with various methodological shortcomings.

The type of prognostic factors used in the models depended on the time when risk for postpartum glucose intolerance was assessed. Some studies used clinical and biochemical factors collected during and/or before GDM diagnosis, thus making the GDM diagnosis the starting point (baseline) for the prediction. However, other studies had baseline risk assessments after delivery at 2 days [48••], 6–9 weeks [35, 36], 12 weeks [55••], 4–16 weeks [50], and 12 months [37]. In these latter studies, additional prognostic factors included postnatal fasting and 2-h plasma glucose [37, 51, 55••], mode of delivery [52••], lactation [51], and circulating miR-369-3p measured at 12 weeks postpartum [55••]. Future studies are warranted to examine which baseline time point and prognostic factors are associated with the most accurate prediction.

This review has highlighted that, although study participants were defined as women with GDM, the inclusion criteria applied were not always rigorous; for example, two studies selected participants based on a single question, “Have you ever been told that you had a high sugar level during pregnancy?” [49••, 57]. Instead, wherever possible, a diagnostic test or robust selection criteria should be applied to distinguish the target population to be included in the study [61, 62]. Otherwise, ambiguous population groups can lead to excess variability in the study data, making prediction difficult and therefore limiting the usefulness of the models and their eventual inclusion in subsequent meta-analyses.

Many of these studies used routinely collected health data, which are more generalizable at the population level. However, missing data are not uncommon in routinely collected health data and retrospective cohort studies, which may reduce the available evidence to build the model [63]. Instead of ignoring variables with missing data, which can introduce serious bias, it is suggested that missing data be replaced based on the available information by using advanced methods such as multiple imputation [64]. However, only a few models discussed missing data [48••, 53, 55••, 56], and only Bengtson et al. [48••] and Kwak et al. [56] applied multiple imputation to handle missing data. Where other studies instead excluded participants due to missing clinical data, this may aggravate the problem of small sample size and discard the information in nearly complete records [55••].

Comparing the predictive performance of the included studies is not a straightforward task, as the predictors utilized for each model’s development vary. Nonetheless, it can be observed that, on average, machine learning algorithms outperform traditional models in terms of sensitivity and specificity. This can be attributed to their capacity to identify complex patterns and relationships in the data that may not be apparent to the naked eye. It is worth noting, however, that traditional models may have advantages in terms of being more interpretable and simpler. Although some models show performance measures suggesting excellent predictive capabilities, our review found that none were externally validated and only a few were internally validated. This lack of validation calls their reproducibility into question. Therefore, testing model performance in a new population, in a different geographic region, or in a different time period is required to advance this field and to assess the practical utility of the models.

Women who have experienced GDM have an 8 to 10 times higher risk of developing type 2 diabetes and a 2 times higher risk of cardiovascular disease (CVD) [5, 65,66,67,68]. This emphasizes the pressing need for timely and continuous proactive monitoring, as well as efficient preventative measures, for type 2 diabetes and CVD. Among these strategies, developing a well-designed clinical prediction model based on historical, antepartum, and even early postpartum variables is essential for early identification of at-risk women and early initiation of intervention. In particular, the screening and prevention of T2DM related to gestational diabetes is a challenging and controversial subject [69] that would benefit greatly from the development of a thoroughly planned and validated prediction model.

Strengths and Limitations

This is the first systematic review of risk prediction studies of postpartum glucose intolerance among women who have a history of GDM. Strengths of this review are that the search strategy was built based on a validated search strategy for prediction models, and the quality of risk prediction models was assessed by CHARMS guidelines. Limitations for deriving information from this review mostly arise from the low quality of the identified eligible studies. However, examining the overall quality and characteristics of existing models is important to understand the flaws and strengths of developed models, using these as stepping stones to build novel models in future.

A major limitation of the studies identified is that very few followed the reporting guidelines for prognostic risk prediction modeling. Researchers examining this area are strongly recommended to follow the appropriate guidelines so that this area can be advanced. Another major limitation of the models identified was that there was a high risk of bias evident in all included studies. The various methodological shortcomings included the use of inadequate sample sizes, uncertain inclusion or exclusion criteria, lack of missing data reporting and/or handling, inappropriate management of continuous and categorical variables, use of univariable analysis for selection of predictors, failure to evaluate/report relevant model performance measures, failure to consider model overfitting and optimism in model performance, lack of internal and external validation, infrequent reporting of model performance measures, and lack of model presentation.

Furthermore, only a fraction of models considered overfitting. Overfitting is especially prevalent when there are too few outcome events relative to the number of candidate prognostic determinants. Additionally, overfitting is expected when the model is developed in a small dataset, when continuous variables are categorized inappropriately, and when stepwise predictor selection methods based on significance criteria are applied [70, 71]. In the included studies, some had very small sample sizes: (n = 103, event = 21) [55••], (n = 104, event = 21) [37], (n = 112, event = 24) [58••], (n = 123, event = 45) [24], (n = 140, event = 55) [36], and (n = 203, event = 71) [48••]. If the number of predictors considered for prediction is larger than the number of events of interest, the predictive performance will be overestimated. Preferably, prediction model studies should include a minimum of several hundred outcome events [72]. Small samples and few events relative to the number of predictors will lead to overfitting and compromise the transportability of the model to a similar or a different population. This is especially important in regions with increasing migration, given the propensity for some groups to “adopt” a higher risk in the new home; therefore, external validation and ultimately generalisable models are needed more than ever.
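To make the scale of the problem concrete, the event counts listed above can be combined with the rule of thumb of at least 10 events per variable (EPV) referred to in the Results. The sketch below is illustrative only: the helper names are hypothetical, the event counts are copied from the paragraph above, and the numbers of candidate predictors actually screened in each study are not given here, so the code computes how many candidate predictors each study could support rather than each study's true EPV.

```python
from typing import List, Tuple

# (sample size, outcome events, reference) as listed in the text above.
SMALL_STUDIES: List[Tuple[int, int, str]] = [
    (103, 21, "[55]"),
    (104, 21, "[37]"),
    (112, 24, "[58]"),
    (123, 45, "[24]"),
    (140, 55, "[36]"),
    (203, 71, "[48]"),
]


def events_per_variable(events: int, n_candidate_predictors: int) -> float:
    """EPV = outcome events divided by the number of candidate predictors."""
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return events / n_candidate_predictors


def max_predictors(events: int, min_epv: int = 10) -> int:
    """Largest number of candidate predictors that keeps EPV >= min_epv."""
    return events // min_epv


for n, events, ref in SMALL_STUDIES:
    print(f"{ref}: n = {n}, events = {events}, "
          f"supports at most {max_predictors(events)} candidate predictors")
```

With 21 events, for example, only two candidate predictors can be examined before the EPV falls below 10, whereas the final models in this review contained three to seven predictors.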

Conclusions

GDM is common, and rates are rising globally. Women with this condition have a high risk of conversion to glucose intolerance postpartum. Identification of those at risk can facilitate targeted screening and prevention strategies. Despite this, our systematic review identified that existing prognostic models for glucose intolerance following GDM were not externally validated, and only a few were internally validated. In addition, there was a high risk of bias, unreported model calibration, and low use of model presentation methods. In light of the summarized results of this review, future research should focus on the development of robust, high-quality risk prediction models that incorporate easily accessible prognostic determinants to enhance the practical application and accuracy of risk prediction for glucose intolerance and T2DM following GDM. External validation is also required before implementing these prediction models into clinical practice.