Introduction

Pre-eclampsia is a multisystem disorder of pregnancy characterised by new-onset hypertension with proteinuria, or hypertension with significant end-organ dysfunction (with or without proteinuria), arising after 20 weeks of gestation or postpartum in previously normotensive women [1, 2]. Pre-eclampsia affects 2–8% of pregnancies worldwide and causes 76,000 maternal and 500,000 perinatal deaths each year [3,4,5].

Administration of low-dose aspirin before 16 weeks of gestation in women at high risk of pre-eclampsia has been shown to reduce the risk of pre-eclampsia and adverse perinatal health outcomes [6,7,8,9]. Clinical risk prediction models are used in healthcare to identify those at risk and to guide diagnosis, prevention, and prognosis [10]. These models use readily available data, such as demographic information, clinical characteristics [11,12,13], and specialised biomarkers [14, 15]. Maternal medical and clinical characteristics are the most commonly used prognostic factors [11,12,13] and have the advantage of being widely available in non-specialised and low-resource settings; the addition of specialised biomarkers can improve prediction performance but might limit implementation in low-resource settings [16].

Risk prediction models can be developed and validated either by applying classical regression models (for example, logistic regression, competing risk models) or machine learning (ML) algorithms (for example, decision trees, random forests, gradient boosting, and neural networks) [10, 17]. Classical regression prediction models are abundantly reported in the medical literature [18,19,20,21], whilst ML prediction algorithms are gaining popularity in the field [22,23,24]. Differences between classical regression and ML approaches have been extensively discussed in the literature [25, 26]. Classical regression models are based on theory and assumptions [17]. In contrast, ML algorithms learn from the data, with the ability to analyse non-linear data structures using fewer assumptions and to model high-dimensional data [27, 28]. Some studies report that ML algorithms manage more predictors and outperform classical regression models [29,30,31]; yet others report no prediction performance advantage of ML algorithms in healthcare prediction models [32, 33].
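
To make this contrast concrete, the following minimal sketch (scikit-learn on synthetic, non-clinical data; not any model from this review) fits both a classical logistic regression and a boosting-type ML algorithm to the same sample and compares their discrimination:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary outcome with a low (~5%) event rate, loosely echoing
# the prevalence of pre-eclampsia; no clinical data are used here.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

# Fit a classical regression model and an ML algorithm on the same data
# and compare discrimination (AUC) on the held-out test split.
for model in (LogisticRegression(max_iter=1000),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(type(model).__name__, "AUC:", round(auc, 3))
```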

Previous systematic reviews and meta-analyses have investigated prediction performance based on classical regression models in pre-eclampsia prediction [34,35,36]. Currently, no systematic review has been conducted comparing the prediction performance of ML algorithms to classical regression models in pre-eclampsia prediction. This review aims to (1) explore the existing ML algorithms, classical regression prediction models, and potential prognostic factors in pre-eclampsia prediction and (2) compare the prediction performance of ML algorithms to that of classical regression models in pre-eclampsia prediction.

Methods

Search Strategies

This systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guideline [37]. We used the PICOTS framework [38]: Population (pregnant women), Index prognostic model (developed prognostic models), Comparator (machine learning algorithms with classical regression models), Outcome (pre-eclampsia), Timing (prediction of pre-eclampsia after 20 weeks of gestation), and Setting (individualised risk stratification). Pre-eclampsia is classified based on the gestational age at clinical presentation as any-onset (delivery at any gestation), preterm (delivery < 37 weeks of gestation), late-onset (delivery ≥ 34 weeks of gestation), and early-onset (delivery < 34 weeks of gestation) [39]. This review was registered with the International Prospective Register of Systematic Reviews (PROSPERO CRD42023445732).

The literature search was conducted on the Ovid platform (MEDLINE, Embase, Emcare, and Maternity & Infant Database (MIDIRS)) and the CINAHL database. The search was conducted until 20 May 2023 without restriction on publication year. In addition, a Google Scholar grey literature search was conducted as per Enticott et al. (2018) [40]. We included studies from previously published systematic reviews which considered only classical regression models [34,35,36]. The search strategies were developed following search filters for prediction and diagnostic studies [41] and in consultation with a university librarian. Medical Subject Heading (MeSH) terms and free-text words were used to locate potential prediction models. Boolean operators (AND, OR, and NOT) and truncation were used to combine the key search terms. A detailed description of search combinations and strategies is given in Supplementary File 1.

Eligibility Identification

Prediction models for pre-eclampsia (any-, early-, and late-onset and preterm) conducted using cohort/follow-up, nested case–control, case–control, case-cohort, randomised controlled trial, and routinely collected health records data sources were included in this review. We excluded prediction model studies focused exclusively on hypertensive disorders of pregnancy or gestational hypertension unless they also provided a distinct model for pre-eclampsia. Studies conducted on selected populations (only twin pregnancies, only high-risk/low-risk women), studies in languages other than English, and prognostic studies conducted with only single prognostic factors were excluded from this review. Furthermore, external validation prediction studies were excluded from the comparison.

Screening and Methodological Quality Appraisal

The included studies were screened using the Covidence platform [42]. After duplicates were removed, two authors (SAT and TV) independently assessed the title and abstract followed by full-text screening. Discrepancies between the two authors were resolved through discussion.

Assessment of Methodological Quality for Classical Regression Models

The risk of bias (ROB) and concern for applicability [43] were assessed by two authors (SAT and TV) using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). The tool has four domains (participants, predictors, outcomes, and analysis) structured into 20 signalling questions. Each included study was rated as high, low, or unclear for both ROB and concern for applicability.

Data Extraction

The CHecklist for critical Appraisal and data extraction in systematic Reviews of clinical prediction Modelling Studies (CHARMS) tool was used to extract the data [44]. Authors, publication year, country, data sources, outcome(s) to be predicted, candidate prognostic factors, sample size, type of models or algorithms, internal validation methods, discrimination performance, and calibration measures were extracted. The algorithm's discrimination and calibration performance were extracted from the test dataset for studies that specifically conducted internal validation; otherwise, from the development dataset. The model/algorithm deployment strategy was also extracted. Deployment strategies, such as regression formulae, nomograms, and score chart rules, are methods used to deploy an algorithm/model into a system, enabling it to predict outcomes for new clients. Two authors (SAT and TV) independently extracted the data. Disagreements were managed through discussion and, if necessary, by another author (JE).
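
To illustrate what a regression-formula deployment means in practice, here is a hypothetical sketch; the predictors and coefficients are invented for illustration and are not taken from any included model:

```python
import math

# Hypothetical "regression formula" deployment: a fitted logistic model is
# published as coefficients, so a new patient's risk can be computed directly.
# All coefficients below are invented for illustration only.
def pre_eclampsia_risk(age_years, bmi, chronic_htn, prior_pe):
    lp = (-6.2                      # intercept
          + 0.03 * age_years        # maternal age
          + 0.08 * bmi              # pre-pregnancy BMI
          + 1.10 * chronic_htn      # chronic hypertension (0/1)
          + 1.60 * prior_pe)        # prior pre-eclampsia (0/1)
    return 1 / (1 + math.exp(-lp))  # inverse-logit -> predicted probability

# Individualised risk for a hypothetical 32-year-old with BMI 31 and
# chronic hypertension but no prior pre-eclampsia.
print(round(pre_eclampsia_risk(32, 31, chronic_htn=1, prior_pe=0), 3))
```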

Data Analysis

The descriptive synthesis was performed for both ML and classical regression studies. Prognostic factors were identified. Algorithm/model discrimination and calibration performance were narratively described and compared. ML algorithm and classical regression model prediction performance were primarily compared in studies that used the same sample. Furthermore, prediction performance was compared across ML algorithms and classical regression models overall. The discrimination performance for studies reporting on both ML and classical regression models was visualised in a forest plot so that readers can easily compare the performances. Model discrimination refers to the model's ability to correctly classify and discriminate between participants who had the outcome of interest and those who did not, often measured by the area under the receiver-operating characteristic (ROC) curve. An area under the curve (AUC) value of 0.5 suggests no discrimination ability, 0.5 < AUC < 0.7 is considered poor discrimination, 0.7 ≤ AUC < 0.8 good/acceptable discrimination, 0.8 ≤ AUC < 0.9 excellent discrimination, and AUC ≥ 0.9 outstanding discrimination performance [45]. Calibration reflects how well the predicted risks match the observed risks of the outcome of interest. This is often measured by comparing the mean predicted probability with the observed outcome rates within risk groups and by the Hosmer–Lemeshow statistic. A model is considered well calibrated when the Hosmer–Lemeshow p value is not significant and/or the calibration slope approaches one and/or the calibration-in-the-large is close to zero [46, 47].
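
As a minimal sketch of how these measures can be computed (using scikit-learn and statsmodels on simulated predictions; variable names and values are illustrative only, not from any included study):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# Simulated predicted risks (y_prob) and observed outcomes (y_true);
# in practice these would come from a fitted model's test set.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.01, 0.60, size=1000)
y_true = rng.binomial(1, y_prob)

# Discrimination: area under the ROC curve.
print("AUC:", round(roc_auc_score(y_true, y_prob), 3))

# Calibration slope: logistic regression of outcomes on the logit of the
# predicted risks; a slope near 1 indicates good calibration.
logit = np.log(y_prob / (1 - y_prob))
slope_fit = sm.Logit(y_true, sm.add_constant(logit)).fit(disp=0)
print("calibration slope:", round(slope_fit.params[1], 2))

# Calibration-in-the-large: intercept of a logistic model with the logit
# as a fixed offset; a value near 0 indicates predictions match observed risk.
citl_fit = sm.GLM(y_true, np.ones(len(y_true)),
                  family=sm.families.Binomial(), offset=logit).fit()
print("calibration-in-the-large:", round(citl_fit.params[0], 2))
```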

Results

Study Selection and Search Strategies

We retrieved 9376 records from five electronic databases and an additional six studies from previously published systematic reviews which considered only classical regression models [34,35,36]. After 2343 duplicates were removed, 7033 articles were screened by title and abstract, leaving 241 articles eligible for full-text review. In the full-text screening, 76 records met the inclusion criteria. Finally, based on the database search and previously published systematic reviews of classical regression models, we included 82 model development studies (ten with both ML algorithms and classical regression models, six with ML only, and 66 with classical regression only) (Fig. 1).

Fig. 1 PRISMA flow diagram for the inclusion and exclusion criteria [37]

Characteristics of ML-Based Prediction Studies

Table 1 shows that sixteen ML studies [48•, 49•, 50,51,52,53,54,55,56,57,58, 59•, 60•, 61,62,63] (fourteen any-onset and two preterm pre-eclampsia) were included, reported from 2019 to 2023. Ten studies reported both ML algorithms and classical regression models. Four ML studies were developed in China [48•, 49•, 50, 61], two in the United States of America (USA) [51, 52], two in Romania [53, 59•], and the rest were from the United Kingdom [60•], Indonesia [54], New Zealand [55], Slovenia [56], South Korea [57], Sweden [58], Kenya [62], and Iran [63]. Case–control, retrospective/prospective cohort, and medical record data sources were used in the included studies. The maximum sample size was 60,789, the minimum was 95, and one study [53] did not report the sample size and/or event rate. Decision tree, naïve Bayes, support vector machine, random forest, gradient boosting machine, extreme gradient boosting (XGBoost), light gradient boosting, neural network, Viterbi ML, and classification-via-regression algorithms were reported (Table 1).

Table 1 Characteristics of ML algorithm prediction studies

Distribution of Prognostic Factors

Maternal demographic, medical, and clinical factors and a variety of biomarkers were commonly included in ML and classical regression studies to predict pre-eclampsia. Figure 2a shows the distribution of prognostic factors used in ML studies. Maternal age, chronic hypertension and diabetes mellitus, parity/gravidity, pre-pregnancy body mass index (BMI), blood pressure measurements, weight, prior history of pre-eclampsia, and ethnicity were the most frequently used maternal medical and clinical prognostic factors in ML studies. Uterine artery pulsatility index (UtA-PI) was the most frequently used biomarker in ML studies. Figure 2b shows the distribution of prognostic factors used in classical regression models. Family history of pre-eclampsia, prior history of pre-eclampsia, pre-pregnancy BMI, parity, chronic hypertension, and ethnicity were the most frequently used prognostic factors in classical regression models. Uterine artery pulsatility index (UtA-PI), mean arterial pressure (MAP), pregnancy-associated plasma protein A (PAPP-A), and placental growth factor (PIGF) were the most frequently used biomarkers in classical regression models (Fig. 2).

Fig. 2 Distribution of prognostic factors across ML algorithms (a) and classical regression-based models (b). For the 16 ML algorithms, the number of feature variables ranged from 3 to 17, with a median of 7. For the 41 any-onset pre-eclampsia classical regression models, the number of predictor variables ranged from 2 to 13, with a median of 5. In the ten studies with both an ML algorithm and a classical regression model, the number of feature variables ranged from 3 to 17, with a median of 8. NB: Others = alcohol consumption in the first trimester, family history of chronic heart disease, and single miscarriage

ML Algorithm Performance and Comparison with Classical Regression Models

Figure 3 shows the model discrimination performance of the thirteen ML studies that reported it. Three ML studies [53, 62, 63] did not report discrimination performance as AUC values; two of these [62, 63] instead reported prediction accuracy. Ten studies [48•, 49•, 51, 54, 57, 58, 60•, 61,62,63] reported both ML algorithm and classical regression model performance; eight of these [48•, 49•, 51, 54, 57, 61,62,63] reported that ML algorithms had better prediction performance than classical regression models. Another study [60•] showed no difference in prediction performance between a competing risks preterm pre-eclampsia model and ML algorithms. In only one study [58] did a classical model, a preterm pre-eclampsia logistic regression, show better prediction performance than a random forest algorithm. The minimum AUC among ML algorithms was 0.60 (95% CI 0.57–0.62) and the maximum was 0.94 (95% CI 0.91–0.96). Overall, random forest and boosting-type algorithms (gradient boosting and XGBoost) showed better prediction performance than other ML algorithms (Fig. 3). Three models [48•, 49•, 55] were well calibrated, one [54] was not, and the remaining studies did not report calibration performance. For internal validation, ten ML studies reported split-sample and four reported cross-validation; the remaining two [53, 59•] reported neither, and none reported external validation. None of the ML studies provided deployment strategies for individualised risk prediction (Supplementary Table 1).

Fig. 3 Machine learning algorithm performance (reported in 13/16 studies). Among the ten studies that reported both ML algorithms and classical regression models, the top eight reported discrimination performance (AUC), and the remaining did not. NB: The red vertical line highlights algorithms/models with AUC values above the 0.7 cut-off, which indicates good discrimination performance. *This classical regression model used the same sample as the ML algorithm above it and was reported in a separate publication [19]

Characteristics of Classical Regression-Based Prediction Studies

Sixty-six publications [14, 64, 73,74,75,76,77,78,79,80,81,82, 65, 83,84,85,86,87,88,89,90,91,92, 66, 93,94,95,96,97,98,99,100,101,102, 67, 103,104,105,106,107,108,109,110,111,112, 68, 113,114,115,116,117,118,119,120,121,122, 69, 123,124,125,126,127,128, 70,71,72] reporting on 84 models were included:

  • Forty-one any-onset pre-eclampsia models (Table 2)

  • Twenty early-onset pre-eclampsia models (Supplementary Table 3)

  • Sixteen late-onset pre-eclampsia models (Supplementary Table 4)

  • Seven preterm pre-eclampsia models (Supplementary Table 5)

Table 2 Characteristics of classical regression models for any-onset pre-eclampsia prediction

Any-Onset Pre-Eclampsia Models

Forty-one any-onset pre-eclampsia prediction models [65, 66, 75,76,77,78,79,80,81,82,83,84, 67, 85,86,87,88,89,90,91,92,93,94, 68, 95,96,97,98,99,100,101,102,103,104, 69, 105, 70,71,72,73,74] were included. The maximum sample size was 120,492, and the minimum was 104. Almost all (40/41) any-onset pre-eclampsia prediction models were from middle- or high-income countries. Nine pre-eclampsia prediction models were reported from the United Kingdom (UK) [72, 81, 85, 98, 99, 101, 103,104,105], six from the USA [66, 82, 86, 87, 89, 97], six from China [67,68,69,70, 79, 96], three from Brazil [76, 78, 90], three from Canada [73, 74, 84], two from the SCOPE study [88, 94], and one each from Australia [71], Thailand [65], Turkey [77], India [83], Iran [93], Greece [102], France [100], Norway [92], and Portugal [91]. Twenty-four percent (10/41) of any-onset pre-eclampsia prediction models were developed with less than ten events per prognostic factor (Table 2). Only fifteen any-onset pre-eclampsia prognostic models reported the model equation. Among any-onset pre-eclampsia prediction models, ten studies reported regression formulae and eleven reported nomograms or score chart rules as deployment strategies to estimate individualised risks. The remaining models did not report deployment strategies (Supplementary Table 2).

Early-Onset Pre-Eclampsia Models

Twenty early-onset pre-eclampsia models [67, 83, 113,114,115,116,117,118,119,120,121,122, 87, 106,107,108,109,110,111,112] were included in this review. The maximum reported sample size was 33,602 [115], and the minimum was 359 [111]. Ninety percent of the studies were from middle- and high-income countries. Six studies were from the UK [115, 117,118,119,120, 122], three from France [107, 109, 110], two from the Netherlands [112, 116], two from Chile [111, 113], and one each from China [67], Spain [106], India [83], Finland [108], the USA [87], Italy [114], and Denmark [121]. Only five developed models [106, 107, 112, 115, 121] had more than ten events per prognostic factor (Supplementary Table 3).

Late-Onset Pre-Eclampsia Models

Sixteen late-onset pre-eclampsia prediction models [64, 83, 118,119,120,121,122,123, 107, 109,110,111,112, 114, 115, 117] were included. The maximum reported sample size was 33,602, and the minimum was 359. Eighty-eight percent (14/16) of the models were reported from high-income countries. Six models were developed in the UK [115, 117,118,119,120, 122], three in France [107, 109, 110], two in Italy [114, 123], and one each from India [83], Thailand [64], Chile [111], and Denmark [121]. Sixty-nine percent of the models used more than ten events per predictor (Supplementary Table 4).

Preterm Pre-Eclampsia Models

Seven preterm pre-eclampsia prediction models [14, 67, 124,125,126,127,128] were included. Two models were from the UK [14, 128], one was a multicentre international study (SCOPE [127]), and the remainder were one each from Sweden [124], China [67], Denmark [125], and Chile [126]. Only one model was developed with less than ten events per predictor variable (Supplementary Table 5).

Classical Regression Studies Prediction Performance

Almost all any-onset pre-eclampsia models reported discrimination performance but not model calibration. Ninety percent (36/40) of any-onset pre-eclampsia models reported good discrimination performance (AUC > 0.70). The minimum AUC reported was 0.62 (0.58–0.66) [74] and the maximum was 0.96 (0.92–1) [80]. Only four models [68, 71, 94, 100] reported calibration performance, of which three were well calibrated. Fifteen models reported deployment strategies. Ninety percent (18/20) of the early-onset pre-eclampsia prediction models reported model discrimination performance, whereas only one study [116] reported calibration performance. Ninety-four percent of these models showed good-to-outstanding discrimination performance, with a minimum AUC of 0.78 [67] and a maximum of 0.99 (0.99–1) [122]. Deployment for individualised risk stratification was reported in fourteen of the twenty early-onset pre-eclampsia models. Moreover, eleven of the sixteen late-onset pre-eclampsia prediction models reported model discrimination performance, and none reported model calibration. Only five of the seven preterm pre-eclampsia prediction studies reported model discrimination performance, with none reporting calibration. Most of the classical regression models failed to report internal validation, and just over one-third (30/84) of the models were externally validated (Supplementary Table 6).

Methodological Quality of ML and Classical Regression Studies

Figure 4a shows the assessment of risk of bias (ROB) and concerns for applicability in the ML studies. Overall, more than 40% of the ML studies had a high risk of bias, with the analysis domain contributing most among the four domains. Among studies at low risk of bias, discrimination performance (AUC) ranged from 0.77 to 0.92. Ninety-five percent of ML studies had a low concern for applicability. Figure 4b shows the ROB and concern for applicability of the classical regression studies. Sixty percent of classical regression studies exhibited a high risk of bias, with the analysis domain again the primary contributor. Among studies at low risk of bias, the AUC ranged from 0.66 to 0.89. More than 90% of classical regression studies had a low concern for applicability (Fig. 4).

Fig. 4 Risk of bias graph: review authors’ judgements about each risk of bias item presented as percentages across all included studies

Discussion

Machine learning algorithm approaches are increasingly common in risk prediction [129, 130]; however, their prediction performance compared with classical regression models remains unclear, including in pre-eclampsia prediction. This review identified 16 ML algorithm studies and 84 classical regression models for pre-eclampsia prediction, and overall, the ML approaches had better prediction performance than the classical regression approaches. Of the ten studies reporting both ML algorithms and classical regression models in the same sample, eight [48•, 49•, 51, 54, 57, 61,62,63] reported superior prediction performance for the ML algorithms. The most frequent prognostic factors in all models were maternal demographic and clinical characteristics, with biophysical (UtA-PI, MAP) and biochemical (PAPP-A, PIGF) measurements the most common biomarkers. Almost all ML studies reported internal validation but failed to report external validation. All except three ML studies [53, 62, 63] reported discrimination performance, with AUC ranging from 0.60 (95% CI 0.57–0.62) [57] to 0.94 (95% CI 0.91–0.96) [58]. Random forest, gradient boosting, and extreme gradient boosting were the top-performing ML algorithms. The 66 classical regression studies reported 84 models for any-, early-, and late-onset and preterm pre-eclampsia prediction, showing poor-to-outstanding discrimination performance, but most failed to report model calibration. A high or unclear methodological risk of bias, yet low concern for applicability, was seen in both ML and classical regression studies. Deployment strategies were seen in some classical regression models, but not in ML algorithms.

Medical and clinical characteristics of the mother are the most cited risk factors for pre-eclampsia [11, 13]; similarly, we found these to be the most used prognostic factors in both ML and classical regression models. In addition, biomarker prognostic factors such as UtA-PI, MAP, PAPP-A, and PIGF were most frequently used in classical regression models, whilst UtA-PI was most frequently used in ML algorithms, in line with previous studies [115, 120]. The risk of pre-eclampsia can increase eight-fold with a prior history of pre-eclampsia, seven-fold with an obese pre-pregnancy BMI, five-fold with chronic hypertension, four-fold with chronic diabetes, and three-fold in nulliparous women or those with a first-degree relative with pre-eclampsia [13]. Hence, the most frequently used prognostic factors in our review align with the existing literature; combined in ML and classical regression models, they have stronger predictive performance than when used in isolation. Considering only maternal medical and clinical characteristics has the advantages of being readily attainable, easy to implement in all clinical settings, and cost-effective; however, the addition of biomarkers could improve prediction performance [15]. The machine learning prediction approach has the advantage of using raw biomarker data without the need for conversion into multiples of the median (MoM), which would simplify the implementation of a screening tool [60•].

To our knowledge, no previous review has compared the prediction performance of ML algorithms to that of classical regression models in pre-eclampsia prediction, although we did capture previous reviews making this comparison in other health conditions [131,132,133,134]. Similar to our review, a recent systematic review compared ML and classical regression studies in cardiovascular risk prediction and found that ML algorithms outperformed classical regression models [132, 135]. Other comparison reviews in hypertension [133] and acute kidney injury [33] found that ML algorithms had similar prediction performance to classical regression models, in line with other clinical prediction models [32, 136]. However, a recent study reported that ML algorithms are a more powerful tool for prediction modelling than classical regression models in terms of higher flexibility and automatic data-dependent complexity optimisation [137]. Machine learning prediction can address the challenge of rare-event (class-imbalance) prediction by oversampling the minority class and/or undersampling the majority class [138,139,140], as illustrated in the sketch below. Classical regression models may struggle to predict rare events, potentially yielding unstable prediction metrics [141]. Consequently, advanced ML algorithms such as random forest and boosting-type algorithms might be beneficial for predicting rare events such as pre-eclampsia.
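
The following is an illustrative sketch of one such strategy, random oversampling of the minority class, implemented with scikit-learn utilities on synthetic data; dedicated libraries such as imbalanced-learn offer SMOTE and related methods:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced training data: ~3% of pregnancies are events.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = np.r_[np.ones(60, dtype=int), np.zeros(1940, dtype=int)]

# Random oversampling: resample the minority (event) class with
# replacement until it matches the majority class size. This should be
# applied to the training split only, to avoid leaking into evaluation.
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.r_[np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
print("class counts after oversampling:", np.bincount(y_bal))
```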

In this systematic review, we observed a lack of direct comparison between ML algorithms and classical regression models using harmonised data sources and evaluation metrics. Further research should focus on head-to-head comparisons using harmonised data sources and the same evaluation metrics, ideally measured on test rather than development data to minimise overfitting and, consequently, optimism. To gain a comprehensive understanding of true performance in other healthcare settings, we encourage research applying these prediction models in low- and middle-income countries.

In terms of ML methods, and in line with this review, some studies have shown that random forest and boosting-type algorithms (gradient boosting and extreme gradient boosting) achieve better prediction performance than other ML approaches [33]. Random forest and boosting-type algorithms are potentially among the most powerful algorithms, especially for structured and tabular data. Random forest is an ensemble learning algorithm that combines multiple decision trees, using bagging and random feature selection, to make a prediction. Compared with other algorithms, random forests reduce overfitting, handle missing data, are robust to outliers, and can work out-of-the-box, being less sensitive to hyperparameter selection [142]. Boosting-type algorithms such as gradient boosting and extreme gradient boosting are another class of ensemble learning that starts with a weak learner (often a decision tree) and sequentially boosts its performance to create a stronger algorithm [143, 144]. As a result, boosting-type algorithms can handle imbalanced datasets and missing values and allow fine-grained control over hyperparameters for optimisation [145]. However, further algorithm development might be needed to differentiate the best algorithm for pre-eclampsia prediction; if this is confirmed, it would be advantageous (1) to externally validate the best-fit ML algorithm and (2) to facilitate clinical implementation in healthcare settings.
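
The bagging-versus-boosting distinction can be sketched as follows (a hypothetical scikit-learn example on synthetic data; hyperparameters are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic tabular data with a rare (~8%) outcome.
X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           weights=[0.92], random_state=0)

# Bagging: each tree is grown on a bootstrap sample with random feature
# subsets, and the trees' predictions are averaged.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            random_state=0)

# Boosting: shallow trees are added sequentially, each correcting the
# residual errors of the ensemble built so far.
gb = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.1,
                                    random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, "cross-validated AUC:", round(auc, 3))
```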

This study has some limitations. Firstly, a high or unclear methodological risk of bias, yet low concern for applicability, was seen in both ML and classical regression studies. Some studies reported insufficient sample sizes, which might increase the risk of overfitting and can yield inaccurate and unstable predictions. Deployment strategies were seen in some classical regression models, but not in ML algorithms; ML algorithms can lack interpretability, making it difficult to present equations and explicit mathematical relationships. In addition, the majority of the studies did not report model calibration performance, making it difficult to judge the accuracy of the risk estimates. Secondly, none of the ML studies reported external validation; hence, it remains unclear how well the models would perform in diverse populations and settings, and further studies are warranted for temporal and external validation. Furthermore, prediction performance can be influenced and underestimated by the treatment paradox, wherein high-risk women who would otherwise develop pre-eclampsia are treated with aspirin and do not develop the disease, effectively converting true-positive predictions into apparent false-positives.

This review also has strengths. It reviewed the common prognostic factors for pre-eclampsia prediction, which were shown to be consistent across studies and can inform the practical design of future prediction studies. In addition, the two prediction approaches were compared primarily within studies that used the same sample and similar prognostic factors, which is helpful in evaluating their relative performance in predicting the outcome of interest.

Conclusion

This systematic review has explored prognostic factors and compared ML algorithms and classical regression models for pre-eclampsia prediction. Maternal demographic and clinical characteristics, MAP, UtA-PI, PAPP-A, and PIGF are the most used prognostic factors. Pre-eclampsia prediction performance appears better with ML algorithms, yet varies among ML approaches. Advanced ML algorithms such as random forest, gradient boosting, and extreme gradient boosting outperformed classical regression models in discrimination. To gain further insight into the performance of ML algorithms, research should focus on comparing ML algorithms to classical regression models using similar samples and evaluation metrics, comparing calibration, and conducting external validation of ML algorithms to provide insight into generalisability to other populations and settings. Ultimately, for optimal models, effective deployment and implementation strategies are needed.