Machine Learning Algorithms Versus Classical Regression Models in Pre-Eclampsia Prediction: A Systematic Review

Purpose of Review Machine learning (ML) approaches are an emerging alternative for healthcare risk prediction. We aimed to synthesise the literature on ML and classical regression studies exploring potential prognostic factors and to compare prediction performance for pre-eclampsia.

Recent Findings From 9382 studies retrieved, 82 were included. Sixty-six publications exclusively reported classical regression models (84 in total) predicting pre-eclampsia with varying timing of onset. Another six publications reported purely ML algorithms, whilst a further 10 publications reported ML algorithms and classical regression models in the same sample, with 8 of the 10 finding that ML algorithms outperformed classical regression models. The most frequent prognostic factors were age, pre-pregnancy body mass index, chronic medical conditions, parity, prior history of pre-eclampsia, mean arterial pressure, uterine artery pulsatility index, placental growth factor, and pregnancy-associated plasma protein A. Top performing ML algorithms were random forest (area under the curve (AUC) = 0.94, 95% confidence interval (CI) 0.91–0.96) and extreme gradient boosting (AUC = 0.92, 95% CI 0.90–0.94). The competing risk model had similar performance (AUC = 0.92, 95% CI 0.91–0.92) compared with a neural network. Calibration performance was not reported in the majority of publications.

Summary ML algorithms had better performance than classical regression models in pre-eclampsia prediction. Random forest and boosting-type algorithms had the best prediction performance. Further research should focus on comparing ML algorithms to classical regression models using the same samples and evaluation metrics to gain insight into their performance. External validation of ML algorithms is warranted to gain insight into their generalisability.

Supplementary Information The online version contains supplementary material available at 10.1007/s11906-024-01297-1.


Introduction
Pre-eclampsia is a multisystem disorder of pregnancy characterised by new onset of elevated blood pressure and proteinuria, or hypertension and significant end-organ dysfunction with or without proteinuria, after 20 weeks of gestation or postpartum in previously normotensive women [1,2]. Pre-eclampsia affects 2-8% of pregnancies worldwide and causes 76,000 maternal and 500,000 perinatal deaths each year [3][4][5].
Administration of low-dose aspirin before 16 weeks of gestation in women at high risk of pre-eclampsia has been shown to reduce the risk of pre-eclampsia and adverse perinatal health outcomes [6][7][8][9]. Clinical risk prediction models are used in healthcare to identify those at risk and to guide diagnosis, prevention, and prognosis [10]. These use readily available data, such as demographic information, clinical characteristics [11][12][13], and specialised biomarkers [14,15].
Maternal medical and clinical characteristics are the most used prognostic factors [11][12][13] and have the advantage of being widely available in non-specialised and low-resource settings. The addition of specialised biomarkers can improve prediction performance but might limit implementation in low-resource settings [16].
Risk prediction models can be developed and validated either by applying classical regression models (for example, logistic regression, competing risk models) or machine learning (ML) algorithms (for example, decision tree, random forest, gradient boosting, and neural networks) [10,17]. Classical regression prediction models are abundantly reported in the medical literature [18][19][20][21], whilst ML prediction algorithms are gaining in popularity in the field [22][23][24]. Differences between classical regression prediction model and ML algorithm approaches have been extensively discussed in the literature [25,26]. Classical regression models are based on theory and assumptions [17]. In contrast, ML algorithms learn from the data, with the ability to analyse non-linear data structures using fewer assumptions and to model high-dimensional data [27,28]. Some studies report that ML algorithms manage more predictors and outperform classical regression models [29][30][31]; yet, others report no prediction performance advantage of ML algorithms [32,33] in healthcare prediction models.
Previous systematic reviews and meta-analyses have investigated prediction performance based on classical regression models in pre-eclampsia prediction [34][35][36]. Currently, no systematic review has been conducted comparing the prediction performance of ML algorithms to classical regression models in pre-eclampsia prediction. This review aims to (1) explore the existing ML algorithms, classical regression prediction models, and potential prognostic factors in pre-eclampsia prediction and (2) compare the prediction performance of ML algorithms to that of classical regression models in pre-eclampsia prediction.

Search Strategies
This systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guideline [37]. We framed the review question using the PICOTS framework [38]: Population (pregnant women), Index prognostic model (developed prognostic models), Comparator (machine learning algorithms with classical regression models), Outcome (pre-eclampsia), Timing (prediction of pre-eclampsia after 20 weeks of gestation), and Setting (individualised risk stratification). Pre-eclampsia is classified based on the gestational age at clinical presentation as any-onset (delivery at any gestation), preterm (delivery < 37 weeks of gestation), late-onset (delivery ≥ 34 weeks of gestation), and early-onset (delivery < 34 weeks of gestation) [39]. This review was registered with the International Prospective Register of Systematic Reviews (PROSPERO CRD42023445732).
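The onset classification above can be written as a small lookup. The following is a minimal sketch, assuming gestational age at delivery is available in completed weeks; the function name `onset_categories` is hypothetical and not taken from any included study. Because the definitions overlap (preterm spans all early-onset and part of late-onset), the function returns every category that applies.

```python
def onset_categories(gestational_age_weeks):
    """Return every onset category that applies to a pre-eclampsia case,
    using the cut-offs quoted in the text: early-onset < 34 weeks,
    late-onset >= 34 weeks, preterm < 37 weeks, any-onset for all cases."""
    if gestational_age_weeks < 0:
        raise ValueError("gestational age must be non-negative")
    categories = ["any-onset"]
    # Early- and late-onset are mutually exclusive and split at 34 weeks.
    categories.append("early-onset" if gestational_age_weeks < 34 else "late-onset")
    # Preterm overlaps with both of the categories above.
    if gestational_age_weeks < 37:
        categories.append("preterm")
    return categories


for weeks in (32, 35, 38):
    print(weeks, onset_categories(weeks))
```

A delivery at 35 weeks, for example, counts towards any-onset, late-onset, and preterm models at the same time, which is why the model counts per category in this review are not mutually exclusive case definitions.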
A literature search was conducted on the Ovid platform (MEDLINE, Embase, Emcare, and Maternity & Infant Database (MIDIRS)) and the CINAHL database. The search covered publications up to 20 May 2023, with no restriction on publication year. In addition, a Google Scholar grey literature search was conducted as per Enticott et al. (2018) [40]. We included studies from previously published systematic reviews which considered only classical regression models [34,36]. The search strategies were developed following search filters for prediction and diagnostic studies [41] and in consultation with a university librarian. Medical Subject Heading (MeSH) terms and free text words were used to locate potential prediction models. Boolean operators (AND, OR, and NOT) and truncation were used to combine the search key terms. A detailed description of search combinations and strategies is given in Supplementary File 1.

Eligibility Identification
Prediction model studies for pre-eclampsia (any-onset, early-onset, late-onset, and preterm) that used cohort/follow-up, nested case-control, case-control, case-cohort, randomised controlled trial, or routinely collected health records data were included in this review. We excluded prediction model studies focused exclusively on hypertensive disorders of pregnancy or gestational hypertension unless they also provided a distinct model for pre-eclampsia. Studies conducted on selected populations (only twin pregnancies, only high-risk/low-risk women), studies in languages other than English, and prognostic studies that used only a single prognostic factor were excluded from this review. Furthermore, external validation prediction studies were excluded from the comparison.

Screening and Methodological Quality Appraisal
The included studies were screened using the Covidence platform [42]. After duplicates were removed, two authors (SAT and TV) independently assessed the title and abstract, followed by full-text screening. Discrepancies between the two authors were resolved through discussion.

Assessment of Methodological Quality for Classical Regression Models
The risk of bias (ROB) and concern for applicability [43] were assessed by two authors (SAT and TV) using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). The tool has four domains (participants, predictors, outcomes, and analysis) structured into 20 signalling questions. Each included study was rated as high, low, or unclear for both ROB and concern for applicability.

Data Extraction
The CHecklist for critical Appraisal and data extraction in systematic Reviews of clinical prediction Modelling Studies (CHARMS) tool was used to extract the data [44]. Authors, publication year, country, data sources, outcome(s) to be predicted, candidate prognostic factors, sample size, type of models or algorithms, internal validation methods, discrimination performance, and calibration measures were extracted. The algorithm's discrimination and calibration performance were extracted from the test dataset for studies that specifically conducted internal validation; otherwise, they were extracted from the development dataset. The model/algorithm deployment strategy was also extracted. Deployment strategies, such as regression formulae, nomograms, and score chart rules, are methods used to embed an algorithm/model into a system, enabling it to predict outcomes for new clients. Two authors (SAT and TV) independently extracted the data. Disagreements were managed through discussion and, if necessary, by another author (JE).

Data Analysis
A descriptive synthesis was performed for both ML and classical regression studies. Prognostic factors were identified. Algorithm/model discrimination and calibration performance were narratively described and compared. ML algorithm and classical regression model prediction performance were primarily compared in studies that used the same sample. Furthermore, prediction performance was compared across all ML algorithms and classical regression models. The discrimination performance for studies reporting on both ML and classical regression models was visualised in a forest plot so that readers can easily compare the performances.

Model discrimination refers to the model's ability to correctly classify and discriminate between participants who had the outcome of interest and those who did not, often measured by the area under the receiver operating characteristic (ROC) curve. An area under the curve (AUC) value = 0.5 suggests no discrimination ability, 0.5 < AUC < 0.7 is considered poor discrimination, 0.7 ≤ AUC < 0.8 is good/acceptable discrimination, 0.8 ≤ AUC < 0.9 is excellent discrimination, and AUC ≥ 0.9 is considered outstanding discrimination performance [45]. Calibration reflects how well the predicted risks match the observed risks of an outcome of interest. This is often measured by comparing the mean predicted probability and the observed outcome rates within risk groups and by the Hosmer-Lemeshow statistic. A model is considered well-calibrated when the Hosmer-Lemeshow p value is not significant and/or the calibration slope approaches one and/or calibration-in-the-large is close to zero [46,47].
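The two performance concepts defined above can be illustrated with a short standard-library sketch. It is not the analysis code of this review: the function names and the toy data are hypothetical, AUC is computed by direct pairwise comparison rather than by an optimised routine, and values at or below 0.5 are all labelled "no discrimination", which is an assumption since the quoted bands only define AUC = 0.5.

```python
from statistics import mean


def auc(y_true, y_score):
    """AUC as the proportion of (case, non-case) pairs in which the case
    has the higher predicted score; ties count as half a win."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def auc_band(value):
    """Descriptive band for an AUC value, following the cut-offs in the text [45]."""
    if value >= 0.9:
        return "outstanding"
    if value >= 0.8:
        return "excellent"
    if value >= 0.7:
        return "good/acceptable"
    if value > 0.5:
        return "poor"
    return "no discrimination"  # assumption: also used for values below 0.5


def calibration_by_group(y_true, y_prob, n_groups=2):
    """Mean predicted risk and observed event rate within ordered risk groups."""
    if not 0 < n_groups <= len(y_true):
        raise ValueError("n_groups must be between 1 and the sample size")
    pairs = sorted(zip(y_prob, y_true))
    size = len(pairs) // n_groups
    rows = []
    for g in range(n_groups):
        start = g * size
        end = len(pairs) if g == n_groups - 1 else start + size
        chunk = pairs[start:end]
        rows.append((mean(p for p, _ in chunk), mean(y for _, y in chunk)))
    return rows


# Hypothetical toy data: 1 = developed pre-eclampsia, 0 = did not.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.35, 0.20, 0.80, 0.70, 0.40, 0.90]

value = auc(y_true, y_prob)
print(value, auc_band(value))
for predicted, observed in calibration_by_group(y_true, y_prob):
    print(f"mean predicted {predicted:.3f} vs observed {observed:.3f}")
```

The grouped comparison is the simplest of the calibration checks mentioned above; the calibration slope and calibration-in-the-large require fitting a recalibration model and are not shown here.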

Study Selection and Search Strategies
We retrieved 9376 records from five electronic databases and an additional six studies from previously published systematic reviews which considered only classical regression models [34,36]. After 2343 duplicates were removed, 7033 articles were screened by title and abstract, leaving 241 articles eligible for full-text review. In the full-text screening, 76 records met the inclusion criteria. Finally, based on the database search and previously published systematic reviews of classical regression models, we included 82 model development studies (ten with both ML algorithms and classical regression models, six with ML only, and 66 with classical regression only) (Fig. 1).
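The study flow can be checked with simple arithmetic. The sketch below uses only the counts reported in this section and reads 7033 as the number of records remaining after deduplication (9376 minus 2343); the number excluded at title and abstract screening is derived here and is not stated in the text.

```python
# Counts reported in the text.
records_from_databases = 9376
studies_from_previous_reviews = 6
duplicates_removed = 2343
eligible_for_full_text = 241
included_from_database_search = 76
included_by_type = {"ml_and_classical": 10, "ml_only": 6, "classical_only": 66}

# Derived quantities.
total_retrieved = records_from_databases + studies_from_previous_reviews
screened = records_from_databases - duplicates_removed
excluded_at_title_abstract = screened - eligible_for_full_text  # derived, not reported
total_included = included_from_database_search + studies_from_previous_reviews

print(total_retrieved, screened, excluded_at_title_abstract, total_included)
```

Under this reading the totals agree with the abstract (9382 retrieved, 82 included) and with the breakdown by study type.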

Distribution of Prognostic Factors
Maternal demographic, medical, and clinical factors and a variety of biomarkers were commonly included in ML and classical regression studies to predict pre-eclampsia. Figure 2a shows the distribution of prognostic factors used in ML studies. Maternal age, chronic hypertension and diabetes mellitus, parity/gravidity, pre-pregnancy body mass index (BMI), blood pressure measurements, weight, prior history of pre-eclampsia, and ethnicity were the most frequently used maternal medical and clinical prognostic factors in ML studies. Uterine artery pulsatility index (UtA-PI) was the most frequently used biomarker in ML studies. Figure 2b shows the distribution of prognostic factors used in classical regression models. Family history of pre-eclampsia, prior history of pre-eclampsia, pre-pregnancy BMI, parity, chronic hypertension, and ethnicity were the most frequently used prognostic factors in classical regression models. UtA-PI, mean arterial pressure (MAP), pregnancy-associated plasma protein A (PAPP-A), and placental growth factor (PlGF) were the most frequently used biomarkers in classical regression models (Fig. 2).

ML Algorithm Performance and Comparison with Classical Regression Models
Figure 3 shows the model discrimination performance of thirteen ML studies. Three ML studies [53,62,63] did not report model discrimination performance through AUC values. Ten studies [48•, 49•, 51, 54, 57, 58, 60•, 61-63] reported both ML algorithm and classical regression model performance; eight studies [48•, 49•, 51, 54, 57, 58, 61-63] reported that ML algorithms had better prediction performance than classical regression models. Another study [60•] showed no difference in prediction performance between the competing risks preterm pre-eclampsia model and ML algorithms. Only one preterm pre-eclampsia prediction model [58], which used logistic regression, showed better prediction performance than a random forest algorithm. The minimum AUC of ML algorithms was 0.60 (95% CI 0.57-0.62) and the maximum AUC was 0.94 (95% CI 0.91-0.96). Two studies [62,63] did not report algorithm/model discrimination (AUC) performance but did report prediction accuracy. Overall, random forest and boosting-type algorithms (gradient boosting and XGBoost) showed better prediction performance than other ML algorithms (Fig. 3). Three [48•, 49•, 55] models were well-calibrated, one

Fig. 1 PRISMA flow diagram for the inclusion and exclusion criteria [37]. Included: ML-based studies (n = 16), of which 10 used the same sample to compare ML algorithms with classical regression models; exclusively classical regression-based studies (n = 66), which reported 84 models (41 any-onset, 20 early-onset, 16 late-onset, and 7 preterm pre-eclampsia prediction models). Studies included from previously published reviews purely on classical regression models (n = 6).

Any-Onset Pre-Eclampsia Models
Forty-one any-onset pre-eclampsia prediction models [65-105] were included. The maximum sample size was 120,492, and the minimum sample size was 104 for any-onset pre-eclampsia prediction models. Almost all (40/41) any-onset pre-eclampsia prediction models were from middle- or high-income countries. Nine pre-eclampsia prediction models were reported from the United Kingdom (UK) [...], three from Canada [73,74,84], two from the SCOPE study [88,94], and one each from Australia [71], Thailand [65], Turkey [77], India [83], Iran [93], Greece [102], France [100], Norway [92], and Portugal [91]. Twenty-four percent (10/42) of any-onset pre-eclampsia prediction models were developed with fewer than ten events per prognostic factor (Table 2). Only fifteen any-onset pre-eclampsia prognostic models reported the model equation. Among any-onset pre-eclampsia prediction models, ten studies reported regression formulae and eleven reported nomograms or score chart rules as deployment strategies to estimate individualised risks. The remaining models did not report deployment strategies (Supplementary Table 2).

Fig. 3 Machine learning algorithm performance (reported in 13/16 studies). Among the ten studies that reported both ML algorithms and classical regression models, the top eight reported discrimination performance (AUC), and the remaining did not. NB: The red vertical line highlights algorithms/models with AUC cut-off values above 0.7, which indicates good discrimination performance. *This classical regression model used the same sample as the ML algorithm above it and was reported in a separate publication [19]
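The events-per-prognostic-factor criterion mentioned above is a simple ratio. A minimal sketch follows, assuming the count of outcome events and the number of candidate prognostic factors are known for a model; the function names and the example numbers are hypothetical and do not come from any included study.

```python
def events_per_factor(n_events, n_candidate_factors):
    """Number of outcome events available per candidate prognostic factor."""
    if n_candidate_factors <= 0:
        raise ValueError("at least one candidate prognostic factor is required")
    return n_events / n_candidate_factors


def meets_ten_events_rule(n_events, n_candidate_factors, threshold=10):
    """True when the model has at least `threshold` events per factor."""
    return events_per_factor(n_events, n_candidate_factors) >= threshold


# Hypothetical model: 120 pre-eclampsia cases and 15 candidate factors.
print(events_per_factor(120, 15))      # 8.0
print(meets_ten_events_rule(120, 15))  # False
```

A model like this hypothetical one would fall into the group flagged in the text as developed with fewer than ten events per prognostic factor.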

Classical Regression Studies Prediction Performance
Almost all any-onset pre-eclampsia models reported discrimination performance but not model calibration. Ninety percent of any-onset pre-eclampsia models (36/40) reported good discrimination performance (AUC > 0.70). The minimum AUC reported was 0.62 (0.58-0.66) [74] and the maximum AUC reported was 0.96 (0.92-1) [80]. Calibration performance was not reported in most studies. Only four models [68,71,94,100] reported calibration performance, and three were well-calibrated. Fifteen models reported deployment strategies.

Ninety percent (18/20) of the early-onset pre-eclampsia prediction models reported model discrimination performance, whereas only one study [116] reported calibration performance. Ninety-four percent of the studies showed excellent to perfect discrimination performance, with a minimum AUC of 0.78 [67] and a maximum AUC of 0.99 (0.99-1) [122]. Deployment strategies for individualised risk stratification were reported for fourteen of the twenty early-onset pre-eclampsia models.

Eleven of the sixteen late-onset pre-eclampsia prediction models reported model discrimination performance, and none reported model calibration performance. Only five of the seven preterm pre-eclampsia prediction studies reported model discrimination performance, and none reported calibration performance. Most of the classical regression models failed to report internal validation, and about one-third (30/84) of the models were externally validated (Supplementary Table 6).

Fig. 4 Risk of bias graph: review authors' judgements about each risk of bias item presented as percentages across all included studies

Methodological Quality of ML and Classical Regression Studies
Figure 4a shows the assessment of risk of bias (ROB) and concerns for applicability of ML studies. Overall, more than 40% of the ML studies had a high risk of bias. Among the four domains, the analysis domain had a high risk of bias. Among studies at low risk of bias, discrimination performance (AUC) ranged from 0.77 to 0.92. Ninety-five percent of ML studies had low concern for applicability. Figure 4b shows the ROB and concern for applicability of classical regression studies. Sixty percent of classical regression studies exhibited a high risk of bias, with the analysis domain being the primary contributor. Among studies at low risk of bias, the AUC ranged from 0.66 to 0.89. More than 90% of classical regression studies had low concern for applicability (Fig. 4).

Discussion
Machine learning approaches are increasingly common in risk prediction [129,130]; however, their prediction performance compared with classical regression models remains unclear, including in pre-eclampsia prediction. This review identified 16 ML algorithms and 84 classical regression models for pre-eclampsia prediction, and overall, the ML approaches had better prediction performance than the classical regression approaches. In the 10 studies reporting both ML algorithms and classical regression models in the same sample, eight [47, 48•, 50, 53, 56, 60

Medical and clinical characteristics of the mother are the most cited risk factors for pre-eclampsia [11,13]; similarly, we found these to be the most used prognostic factors in both ML and classical regression models. In addition, biomarker prognostic factors such as UtA-PI, MAP, PAPP-A, and PlGF were most frequently used in classical regression models whilst UtA-PI was most frequently used in ML algorithms, which aligns with previous studies [115,120]. The risk of pre-eclampsia can increase eight-fold with a prior history of pre-eclampsia, seven-fold with obese pre-pregnancy BMI, five-fold with chronic hypertension, four-fold with chronic diabetes, and three-fold in nulliparous women and those with a first-degree relative with pre-eclampsia [13]. Hence, the most frequently used prognostic factors in our review are in line with the existing literature, but when combined in ML and classical regression models they have stronger predictive performance than when used in isolation. Models that consider only maternal medical and clinical characteristics have the advantages of being readily attainable, easy to implement in all clinical settings, and cost-effective; however, the addition of biomarkers could improve prediction performance [15]. The ML prediction approach has the advantage of using raw biomarker data without the need for conversion into multiples of the median (MoMs), which would simplify the implementation of a screening tool [60•].

To our knowledge, no previous review has compared the prediction performance of ML to that of classical regression studies in pre-eclampsia prediction. We have captured previous studies that compared ML with classical regression studies in pre-eclampsia [131][132][133][134]. Similar to our review, a recent systematic review compared ML and classical regression studies in cardiovascular risk prediction and found that ML algorithms outperformed classical regression models [132,135]. Other comparison reviews in hypertension [133] and acute kidney injury [33] found that ML algorithms had similar prediction performance to classical regression models, consistent with findings for other clinical prediction models [32,136]. However, a recent study reported that ML algorithms are a more powerful tool for prediction modelling than classical regression models in terms of higher flexibility and automatic data-dependent complexity optimisation [137]. Machine learning prediction can address challenges with rare event (class imbalance) prediction by oversampling the minority class and/or undersampling the majority class [138][139][140]. Classical regression models may struggle to predict rare events, potentially yielding unstable prediction metric values [141]. Consequently, advanced ML algorithms such as random forest and boosting-type algorithms might be beneficial for predicting rare events such as pre-eclampsia.
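The resampling idea mentioned above (oversampling the minority class and/or undersampling the majority class) can be sketched with the standard library alone. This is an illustrative sketch, not a method taken from the included studies: the function names, the seed, and the toy data are hypothetical, and simple random resampling is used rather than synthetic methods. Resampling is normally applied to the training data only, and it changes the outcome prevalence the model sees, so predicted risks may need recalibration afterwards.

```python
import random


def split_indices(labels, minority=1):
    """Return (minority indices, majority indices) for a label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx or not majority_idx:
        raise ValueError("both classes must be present")
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are balanced."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    shortfall = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    idx = majority_idx + minority_idx + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def random_undersample(rows, labels, minority=1, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    kept = rng.sample(majority_idx, min(len(minority_idx), len(majority_idx)))
    idx = minority_idx + kept
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical training set: 3 cases among 50 women (6% prevalence).
labels = [1] * 3 + [0] * 47
rows = [[i] for i in range(50)]  # placeholder feature rows

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(sum(over_labels), len(over_labels))
print(sum(under_labels), len(under_labels))
```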
In this systematic review, we observed a lack of direct comparison between ML algorithms and classical regression models using harmonised data sources and evaluation metrics. Further research may focus on head-to-head comparisons using harmonised data sources and the same evaluation metrics, ideally measured on test rather than development data to minimise overfitting and, consequently, optimism. To gain a comprehensive understanding of true performance in other healthcare settings, research applying these prediction models in low- and middle-income countries is encouraged.
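A head-to-head comparison of the kind recommended above can be sketched as a paired bootstrap of the AUC difference between two models scored on the same held-out test set. This is one possible approach under stated assumptions, not the method of any included study: the function names, the number of resamples, and the toy scores are hypothetical, and a percentile interval is used for simplicity.

```python
import random


def auc(y_true, y_score):
    """Pairwise AUC; ties between a case and a non-case count as half."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def paired_bootstrap_auc_diff(y_true, score_a, score_b, n_boot=1000, seed=0):
    """AUC(a) - AUC(b) on the same test set, with a 95% percentile interval.
    Both models are evaluated on identical resampled participants."""
    n = len(y_true)
    if not 0 < sum(y_true) < n:
        raise ValueError("test set must contain both cases and non-cases")
    rng = random.Random(seed)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if not 0 < sum(y) < n:
            continue  # skip resamples that lost one of the classes
        a = [score_a[i] for i in idx]
        b = [score_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return auc(y_true, score_a) - auc(y_true, score_b), (lower, upper)


# Hypothetical test-set predictions from two models on the same ten women.
y_test = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
ml_scores = [0.10, 0.30, 0.70, 0.20, 0.90, 0.60, 0.65, 0.80, 0.15, 0.55]
lr_scores = [0.20, 0.40, 0.50, 0.30, 0.80, 0.35, 0.60, 0.70, 0.25, 0.45]

difference, interval = paired_bootstrap_auc_diff(y_test, ml_scores, lr_scores)
print(difference, interval)
```

Resampling participants rather than models keeps the comparison paired, which is the point of using the same sample and the same metric for both approaches.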
In terms of ML methods, similar to this review, some studies have shown that random forest and boosting-type algorithms (gradient boosting and extreme gradient boosting) achieve better prediction performance [33] compared with other ML approaches. Random forest and boosting-type algorithms are potentially among the most powerful algorithms, especially for structured and tabular data. Random forest is an ensemble learning algorithm that combines multiple decision trees based on bagging and random feature selection to make a prediction. Compared with other algorithms, random forests reduce overfitting, handle missing data, are robust to outliers, and can work out-of-the-box because they are less sensitive to hyperparameter selection [142]. Boosting-type algorithms such as gradient boosting and extreme gradient boosting are another class of ensemble learning that starts with a weak algorithm (often a decision tree) and sequentially boosts its performance to create a stronger algorithm [143,144]. As a result, boosting-type algorithms can handle imbalanced datasets and missing values and allow for fine-grained control over hyperparameters for optimisation [145]. However, further algorithm development might be needed to differentiate the best algorithm for pre-eclampsia prediction; if this is confirmed, it would be advantageous (1) to externally validate the best-fit ML algorithm and (2) to facilitate clinical implementation in healthcare settings.
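The sequential idea behind boosting described above can be shown with a deliberately small sketch: each round fits a one-split decision stump to the current residuals and adds a damped version of it to the ensemble. This is plain least-squares gradient boosting written for illustration, with hypothetical function names and toy data; it is not XGBoost and omits the regularisation, subsampling, and missing-value handling of production libraries.

```python
def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that minimises squared error
    and return (feature, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    if best is None:
        raise ValueError("no split possible: every feature is constant")
    return best[1:]


def stump_predict(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_boosting(X, y, n_rounds=20, learning_rate=0.3):
    """Start from the mean outcome, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Hypothetical single-feature data: outcome occurs only at higher feature values.
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]
model = fit_boosting(X, y)
print(predict(model, [2]), predict(model, [5]))
```

Each stump alone is a weak learner; the accumulated, damped corrections are what produce the stronger ensemble. A random forest differs in that its trees are grown independently on bootstrap samples with random feature subsets and then averaged.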
This study faces some limitations. Firstly, a high or unclear methodological risk of bias, yet low concern for applicability, was seen in both ML and classical regression studies. Some studies reported insufficient sample sizes, which might increase the risk of overfitting and can yield inaccurate and unstable predictions. Deployment strategies were seen in some classical regression models, but not in ML algorithms. ML algorithms lack interpretability, making it difficult to present equations and explicit mathematical relationships. In addition, the majority of the studies did not report the model's calibration performance, which made it challenging to judge the accuracy of the risk estimates. Secondly, none of the ML studies reported external validation; hence, it remains unclear how well the models would perform among diverse populations and settings. Therefore, further studies are warranted for temporal and external validation. Furthermore, prediction performance can be influenced and underestimated by the treatment paradox, wherein high-risk women who would otherwise develop pre-eclampsia are treated with aspirin and do not develop the disease, effectively converting true-positive into false-positive results from predictive tests.
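The treatment paradox described above can be made concrete with a small arithmetic sketch. All numbers are hypothetical, including the fraction of cases prevented, which is chosen for illustration and is not an estimate of the effect of aspirin; the function names are likewise assumptions.

```python
def apply_treatment_paradox(tp, fp, fn, tn, prevented_fraction):
    """Move a fraction of true positives to false positives: women correctly
    flagged as high risk who are treated and no longer develop the outcome."""
    prevented = round(tp * prevented_fraction)
    return tp - prevented, fp + prevented, fn, tn


def positive_predictive_value(tp, fp):
    return tp / (tp + fp)


def sensitivity(tp, fn):
    return tp / (tp + fn)


# Hypothetical screening results in 1000 women, before any preventive treatment.
before = (80, 120, 20, 780)  # tp, fp, fn, tn
after = apply_treatment_paradox(*before, prevented_fraction=0.25)

for label, (tp, fp, fn, tn) in (("before", before), ("after", after)):
    print(label,
          round(positive_predictive_value(tp, fp), 3),
          round(sensitivity(tp, fn), 3))
```

In this hypothetical case the test itself is unchanged, yet its apparent positive predictive value and sensitivity both fall once treated women are counted as false positives.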
This review also has strengths. It identified the common prognostic factors for pre-eclampsia prediction, which were shown to be consistent across studies; this may improve the practicality of future prediction studies. The two prediction approaches were compared primarily in studies that used the same sample and similar prognostic factors, which may help in evaluating their performance in predicting the outcome of interest.

Conclusion
This systematic review has explored prognostic factors and compared ML algorithms and classical regression models for pre-eclampsia prediction. Maternal demographic and clinical characteristics, MAP, UtA-PI, PAPP-A, and PlGF are the most used prognostic factors. Pre-eclampsia prediction performance appears better with ML algorithms, yet varies among ML approaches. Advanced ML algorithms such as random forest, gradient boosting, and extreme gradient boosting outperformed classical regression models in discrimination. To gain further insight into the performance of ML algorithms, research should focus on comparing ML algorithms to classical regression models using similar samples and evaluation metrics, comparing calibration, and conducting external validation of ML algorithms to provide insight into generalisability to other populations and settings. Ultimately, for optimal models, effective deployment and implementation strategies are needed.

Fig. 2 Distribution of prognostic factors across ML algorithms (a) and classical regression-based models (b). For the 16 ML algorithms, the number of feature variables ranged from 3 to 17, with a median of 7. For the 41 any-onset pre-eclampsia classical regression models, the number of predictor variables ranged from 2 to 13, with a median of

• Forty-one any-onset pre-eclampsia models (Table 2)
• Twenty early-onset pre-eclampsia models (Supplementary Table 3)
• Sixteen late-onset pre-eclampsia models (Supplementary Table

ML studies did not report model discrimination performance through AUC values. Ten studies [48•, 49•, 51, 54, 57, 58, 60•, 61-63] reported both ML algorithms and classical regression model performance; eight studies [48•, 49•, 51, 54, 57, 58, 61-63] reported that ML algorithms have better prediction performance than classical regression models.

Table 1
Characteristics of ML algorithm prediction studies
NB: XGBoost, extreme gradient boosting; CVR, classification via regression; SVM, support vector machine; NR, not reported
* The studies that used the same sample to compare ML algorithms with classical regression models

Table 2
Characteristics of classical regression models for any-onset pre-eclampsia prediction