Introduction

Five percent of pre-menopausal women have complaints of abnormal uterine bleeding [1]. Endometrial ablation (EA) is one of the treatment options for this common complaint. Because of its low costs and less invasive nature (lower intra-operative complication risk, shorter recovery time, and lower post-operative morbidity), EA appears to be an attractive surgical alternative to hysterectomy for menorrhagia [2,3,4,5,6]. However, long-term follow-up shows a decrease in patient satisfaction and treatment efficacy. Because it provides permanent relief, the more invasive hysterectomy remains the most effective treatment of abnormal uterine bleeding [7,8,9,10,11,12,13,14].

According to the literature, several factors present before endometrial ablation appear to influence the success rate of this procedure. Younger age, complaints of dysmenorrhea, multiparity, a thicker pre-procedural endometrium, a duration of menstruation above 7 days, presence of an intramural leiomyoma on transvaginal sonography, a history of sterilization or caesarean section, and a longer uterine depth are some of the factors that may negatively influence the outcome [1, 2, 8, 9, 11,12,13,14,15,16,17,18].

To optimize the clinical counselling of patients with abnormal uterine bleeding, a prediction model based on the combined influence of the abovementioned predictors could provide better insight into the individual prognosis after endometrial ablation. In times of personalized medicine, this can improve individual care, leading to fewer re-interventions, lower healthcare costs, and higher patient satisfaction. With the use of a prediction model, shared decision-making can be optimized [19].

For this reason, Stevens et al. [16] developed two multivariate prediction models to help counsel patients on the risk of failure of EA and of surgical re-intervention within 2 years after EA. The developed prediction models have clinically acceptable c-indices of 0.68 and 0.71, respectively. In addition, Stevens et al. are performing an external validation of these models; the results will follow.

In the field of gynaecology, many prediction models are developed using multivariate logistic regression as the standard statistical approach. These models are based on a combination of various predictors that are significantly related to the outcome of interest. However, this method cannot automatically estimate the interconnection between predictors and can therefore overestimate the influence of an individual predictor [20, 21].

We were also interested in other techniques for developing a prediction model. In recent years, machine learning (ML) methods have been increasingly used in the development of clinical prediction models. ML is a scientific discipline that focuses on models that directly and automatically learn from data without using pre-identified statistical parameters and without assuming a preconceived relationship between predictors and outcomes [20, 22]. A potential advantage of machine learning methods compared with traditional statistical strategies is the possibility of capturing complex, nonlinear relationships in the data [23, 24]. We chose surgical re-intervention as the most objective outcome measure to compare both prediction models in predicting unsuccessful endometrial ablation.

The aim of the study was to develop a machine learning model to predict the chance of surgical re-intervention (for example re-ablation or hysterectomy) within 2 years after EA. Furthermore, we compared the performance of the ML model with the prediction by the previously published multivariate logistic regression re-intervention model of Stevens et al. [16].

Methods

This study used the same dataset as was used to develop the prediction models in the study from Stevens et al.; the full study protocol can be consulted there [16].

This retrospective two-centre cohort study, performed in two non-university teaching hospitals in the Netherlands (Catharina Hospital, Eindhoven; Elkerliek Hospital, Helmond), included 446 patients who had undergone EA for complaints of abnormal uterine bleeding [16]. Between 2004 and 2013, both hospitals used similar ablation techniques: Cavatherm® (Veldana Medical SA, Morges, Switzerland), Gynecare Thermachoice® (Ethicon, Somerville, USA), and Thermablate® EAS (Idoman, Ireland). Recent publications have shown that these ablation techniques were equally effective [14, 25]. Local medical ethical review boards approved the study. All patients gave informed consent.

Patients were identified in the electronic patient care system by using specified search terms related to endometrial ablation. Exclusion criteria were a postmenopausal status at the time of EA, (suspicion of) endometrial malignancy, or uterine cavity deformations (adenomyosis, anomalies, fibroids, or a polyp). The follow-up period after treatment was at least 2 years. This time interval was chosen because previous literature stated that most re-interventions were done within 2 years. Follow-up ended on the day of hysterectomy, at death, or on April 15, 2015 [9, 17, 18, 25,26,27].

Data were extracted from individual patient files by two researchers (K.S. and D.M.) [16]. Next, patients were asked to fill in a questionnaire regarding follow-up information. In case of non-response, patients were contacted by letter and ultimately by telephone by the authors of Stevens et al. [16]. The questionnaire contained questions based on previously published variables that significantly predicted surgical re-intervention after EA [2, 5, 8, 11,12,13,14,15,16,17, 28, 29].

The entire dataset consists of 446 patients with different categorical and continuous variables. For the machine learning algorithms, all features were extracted from the original dataset of Stevens et al. [16]. A total of five pre-operative variables were used to develop the machine learning model. These were the pre-operative variables that were significant predictors in the final multivariate re-intervention model of Stevens et al. (age, duration of menstruation, dysmenorrhea, parity, and previous caesarean section) [16]. Unlike in the development of the previously published logistic regression model, the continuous data were not discretized into categories [16].

Development of the logistic regression model

Statistical analysis of the data was performed by using SPSS 21.0 for Windows (IBM Corp., Armonk, NY, USA).

To determine which variables were significant, univariable logistic regression analysis was used.

The variables with a p-value < .10 were used in the multivariable analysis. This was followed by a backward stepwise manual selection process, progressively excluding the variable with the highest p-value [16].

As described by Steyerberg et al., the p-value of 0.10 was used to prevent a potentially incorrect exclusion of a predictive factor. Such an exclusion would be far more detrimental to the model than the inclusion of a factor with little discriminative value [28, 29].

Multicollinearity and interaction between the significant variables in the model were tested. Bootstrap resampling was used for internal validation (n = 5000) [29, 30]. To correct for over-optimism of the model, regression coefficients were multiplied by the calculated shrinkage factor. A detailed description of the development of the LR model can be found in the study of Stevens et al. [16].

Development of the machine learning model (random forest model)

For the development of the machine learning model, we used a random forest (RF) technique. This is a machine learning method used for classification and regression, which operates by constructing a large ensemble of decision trees on training data [22, 23, 31]. Each tree in the random forest is built using a bootstrap sample randomly drawn from the training dataset. This reduces variance and corrects for the tendency of a single decision tree to overfit the training set. Each tree in the forest gives an individual prediction of the outcome measure. For a classification problem (in this case, surgical re-intervention or no surgical re-intervention after EA), the final random forest model averages the predictions of all the trees in the forest [21, 23, 31, 32].
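The bagging and averaging steps described above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not the implementation used in the study: each "tree" is reduced to a single split on one feature, the toy data (`durations`, `reintervention`) are invented, and the helper names `fit_stump`, `fit_forest`, and `predict` are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split tree on one feature; return (threshold, p_left, p_right)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        p_l, p_r = sum(left) / len(left), sum(right) / len(right)
        # Weighted Gini impurity of the two child nodes.
        gini = (len(left) * 2 * p_l * (1 - p_l)
                + len(right) * 2 * p_r * (1 - p_r)) / len(xs)
        if best is None or gini < best[0]:
            best = (gini, t, p_l, p_r)
    if best is None:
        # No valid split (all x equal): predict the overall event rate.
        p = sum(ys) / len(ys)
        return (float("inf"), p, p)
    return best[1:]

def fit_forest(xs, ys, n_trees=200, seed=0):
    """Fit each tree on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict(forest, x):
    """Average the individual tree predictions for one case."""
    votes = [p_l if x <= t else p_r for t, p_l, p_r in forest]
    return sum(votes) / len(votes)

# Invented toy data: duration of menstruation (days) and re-intervention (0/1).
durations = [3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 14, 15]
reintervention = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

forest = fit_forest(durations, reintervention)
low, high = predict(forest, 4), predict(forest, 14)
print(low, high)
```

Because every tree sees a different bootstrap sample, the split thresholds vary from tree to tree, and the averaged prediction is smoother than that of any single tree.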

To build the model, we first trained an RF model using the following five pre-operative predictors: age, duration of menstruation, dysmenorrhea, parity, and previous caesarean section. These factors were associated with a higher probability of surgical re-intervention within 2 years after EA in the previously published multivariate logistic regression model [16].

As described above, an RF model is an ensemble of many decision tree models. Figure 1 shows an example of an individual decision tree in the random forest. The decision tree is a flowchart-like binary branch structure. At each “node split” in the tree, the data are divided in two, based on the value of the variable at that decision node. If no more splits are possible, a prediction is calculated for the cases in the final leaf node [23, 31, 33].
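The routing of a single case from the root node to a leaf node can be sketched as follows. The tree below is hypothetical: its features, thresholds, and leaf re-intervention rates (`srr`) are illustrative values and are not taken from Fig. 1 or from the fitted model.

```python
# Hypothetical tree: thresholds and leaf rates are illustrative only.
TREE = {
    "feature": "duration_days", "threshold": 7,
    "left": {"srr": 0.06},
    "right": {
        "feature": "age", "threshold": 40,
        "left": {"srr": 0.30},
        "right": {"srr": 0.12},
    },
}

def predict_tree(node, case):
    """Follow one case from the root to a leaf and return that leaf's rate."""
    while "srr" not in node:
        # Binary split: go left if the value is at or below the threshold.
        branch = "left" if case[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["srr"]

case = {"duration_days": 9, "age": 38}
print(predict_tree(TREE, case))  # 0.3
```

The example case has a duration above 7 days and an age of 40 or below, so it ends in the leaf with the highest illustrative rate.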

Fig. 1 An illustration of a decision tree in the random forest model. The decision tree directs each case from the root node to the leaf nodes, resulting in a prediction. N, number; SRR, surgical re-intervention rate

At each node split, only a random subset of features (such as duration of menstruation and parity) is considered. This avoids over-selection of strongly predictive features, which would otherwise lead to similar splits across the trees. The result is a more robust model that is less prone to overfitting [21, 23, 31,32,33,34].
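A minimal sketch of this feature subsampling is given below. The subset size is an assumption: the floor of the square root of the number of predictors is a common default for classification forests, but the value used in the study is not stated here. The function name `candidate_features` is hypothetical.

```python
import math
import random

PREDICTORS = ["age", "duration_of_menstruation", "dysmenorrhea",
              "parity", "previous_caesarean_section"]

def candidate_features(features, rng, m=None):
    """Draw the random subset of features considered at one node split."""
    if m is None:
        m = max(1, math.isqrt(len(features)))  # assumed default: floor(sqrt(p))
    return rng.sample(features, m)

rng = random.Random(42)
for split in range(3):
    # Each node split sees its own random subset of the five predictors.
    print(split, candidate_features(PREDICTORS, rng))
```

With five predictors this draws two candidates per split, so a dominant predictor is sometimes unavailable and the trees become less correlated.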

Following this process, the classification result of a RF model is produced by computing a large ensemble of those trees and averaging the prediction of each single decision tree on surgical re-intervention. Figure 2 shows a simplified example of the RF model. In practice, the decision trees and the resulting prediction model contain a large number of leaf nodes [31, 35].

Fig. 2 A simplified random forest model for the prediction of surgical re-intervention

The RF was trained in MATLAB (2018b) using the TreeBagger function in the Statistics and Machine Learning Toolbox.

To predict the chance of surgical re-intervention within 2 years after EA, the model was initially trained and internally validated on the 446 cases. To allow a fair comparison between the RF and LR models, the same validation technique was used, namely bootstrap resampling with 5000 resamples. The area under the receiver operating characteristic curve (AUROC) was calculated as the performance measure.
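The AUROC and a bootstrap confidence interval can be sketched as follows. This is an illustration only: the exact bootstrap scheme is not detailed in this section, so the sketch assumes a simple percentile bootstrap of the AUROC on fixed predictions, and the toy `scores` and `labels` are invented.

```python
import random

def auc(scores, labels):
    """AUROC as the probability that an event outranks a non-event (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=5000, alpha=0.05, seed=0):
    """Percentile confidence interval of the AUROC from bootstrap resamples."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # the AUROC needs both outcome classes
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented toy predictions (predicted risk) and observed outcomes.
scores = [0.05, 0.10, 0.20, 0.15, 0.40, 0.30, 0.60, 0.35, 0.70, 0.25]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

print(auc(scores, labels))
print(bootstrap_auc_ci(scores, labels))
```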

Comparison of the prediction models

The performance of the models was tested and compared using the AUROC. Accuracy was not used as a performance measure, since the dataset is unbalanced: 53 of the 446 patients underwent a re-intervention, a ratio of approximately 1:7 between re-intervention and no re-intervention (53:393) [36]. The AUROC was also the performance measure used in the previous study of Stevens et al. [16], which allows a direct comparison.
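A short calculation illustrates why accuracy is misleading for this dataset. Using the counts reported in the study (53 re-interventions among 446 patients), a trivial rule that predicts "no re-intervention" for every patient already reaches a high accuracy while identifying no patient at risk. The variable names are illustrative.

```python
n_total, n_events = 446, 53          # counts reported in the study
n_non_events = n_total - n_events    # 393 patients without re-intervention

# A trivial "model" that predicts no re-intervention for every patient.
true_negatives = n_non_events
true_positives = 0

accuracy = (true_positives + true_negatives) / n_total
sensitivity = true_positives / n_events

print(f"accuracy = {accuracy:.3f}, sensitivity = {sensitivity:.1f}")
# accuracy = 0.881, sensitivity = 0.0
```

The AUROC of such a constant prediction is 0.5, which reflects its lack of discrimination.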

Predictors of surgical re-intervention: variable importance measure (VIM)

To identify important predictors of surgical re-intervention, we used two methods for analysis.

First, a statistical univariate logistic regression analysis was applied to assess the importance of each variable. For each variable, an odds ratio (OR) with a 95% confidence interval (CI) was calculated.
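For a single binary predictor, the odds ratio from a univariable logistic regression equals the cross-product ratio of the 2x2 table, and the Wald confidence interval has a closed form on the log scale. The sketch below shows that calculation; the function name `odds_ratio_ci` and the counts in the usage example are hypothetical and do not come from the study data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: predictor present, event      b: predictor present, no event
    c: predictor absent, event       d: predictor absent, no event
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts for illustration only.
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval that includes 1, as in this hypothetical example, indicates that the predictor is not statistically significant at the 5% level.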

Secondly, a permutation-based variable importance measure was used. This VIM is based on the AUC statistic of the ML model. It is computed by randomly permuting the values of predictor x across patients, which breaks the association between x and the outcome (in effect leaving the predictor out), and comparing the resulting AUC with the AUC obtained without permutation. Permuting an important feature will result in a lower AUC of the ML model, while permuting an unimportant feature will not change the AUC appreciably [23, 35, 37].
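The permutation-based importance can be sketched as the mean drop in AUC after shuffling one predictor. This is a simplified illustration with invented data; `model` is a stand-in scoring function rather than the fitted random forest, and the number of repeats is an assumption.

```python
import random

def auc(scores, labels):
    """AUROC as the probability that an event outranks a non-event (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(model, rows, labels, col, n_repeats=20, seed=0):
    """Mean drop in AUC after shuffling column `col` across patients."""
    rng = random.Random(seed)
    baseline = auc([model(r) for r in rows], labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - auc([model(r) for r in permuted], labels))
    return sum(drops) / n_repeats

# Toy data: column 0 drives the outcome, column 1 is pure noise.
data_rng = random.Random(1)
rows = [(float(i), data_rng.random()) for i in range(40)]
labels = [1 if i >= 20 else 0 for i in range(40)]
model = lambda r: r[0]  # stand-in for a fitted model; it ignores column 1

print(permutation_importance(model, rows, labels, col=0))  # clearly positive
print(permutation_importance(model, rows, labels, col=1))  # 0.0
```

Shuffling the informative column destroys discrimination, whereas shuffling a column the model does not use leaves the AUC unchanged.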

Results

Seven hundred sixty-two patients were identified retrospectively. Thirty-three patients were excluded: thirty did not meet the inclusion criteria and three underwent an incomplete endometrial ablation. The remaining 729 patients were contacted, resulting in a response rate of 61% (N = 446).

A total of 446 patients were available for analysis [16].

Fifty-three (11.9%) of these patients required a surgical re-intervention within 2 years after EA.

Patients’ mean age at the time of EA was 43.8 years (SD ± 5.5, range 20–55, missing values 0). Mean parity was 2.2 (SD ± 1.0, missing values 0). Sixty-one (13.7%) of the patients had undergone a caesarean section. The mean number of previous caesarean sections was 0.2 (SD ± 0.6, missing values 0).

One hundred sixty-nine (39.4%) of the patients had a menstruation period longer than 7 days; the mean number of menstrual days was 9.4 (SD ± 6.0, missing values 17). Two hundred fifty-six (57.4%) of the patients had complaints of dysmenorrhea and four hundred thirty-four (97.3%) of the patients had complaints of abnormal uterine bleeding [16].

Prediction models

Logistic regression model

Univariate analysis showed six significant predictors. Multivariate analysis resulted in a logistic regression model consisting of five significant predictors: age (OR 0.95, 95% CI 0.90–1.00), duration of menstruation > 7 days (OR 2.05, 95% CI 1.10–3.82), dysmenorrhea (OR 2.48, 95% CI 1.21–5.07), parity ≥ 5 (OR 7.63, 95% CI 1.51–38.46), and previous caesarean section (OR 2.21, 95% CI 1.05–4.64). The AUC of the final prediction model after correction by the shrinkage factor was 0.71 (95% CI 0.64–0.78) (Fig. 3).
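The reported odds ratios can be read on the log-odds scale, as in the short calculation below. It rests on two assumptions that are not stated in this section: that the odds ratio for age is expressed per year of age, and that the final model contains no interaction terms, so that odds ratios multiply across predictors.

```python
import math

# Odds ratios reported for the multivariable model.
or_age = 0.95              # assumed here to be per year of age
or_duration = 2.05         # duration of menstruation > 7 days
or_dysmenorrhea = 2.48

# Logistic regression coefficients are the natural logs of the odds ratios.
beta_age = math.log(or_age)

# Odds ratio for a patient 10 years older: exp(10 * beta) = OR ** 10.
or_10_years = math.exp(10 * beta_age)

# Without interaction terms, odds ratios multiply across predictors.
or_combined = or_duration * or_dysmenorrhea

print(round(or_10_years, 2), round(or_combined, 2))  # 0.6 5.08
```

Under these assumptions, a patient who is 10 years older has roughly 40% lower odds of re-intervention, and a patient with both dysmenorrhea and a duration of menstruation above 7 days has about five times the odds of a patient with neither.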

The final model is described in the article of Stevens et al. [16].

Fig. 3 ROC curves of the logistic regression and random forest models. LR AUC 0.71 (95% CI 0.64–0.78), RF NoOp AUC 0.63 (95% CI 0.54–0.71), and RF Op AUC 0.65 (95% CI 0.56–0.74). LR, logistic regression; RF, random forest; Op, after hyperparameter optimization; NoOp, before hyperparameter optimization

Random forest model

The random forest method resulted in a model which predicts the chance of re-intervention within 2 years after EA with an AUC of 0.63 (95% CI 0.54–0.71). An AUC of 0.65 (95% CI 0.56–0.74) was achieved after optimization of this model (Fig. 3).

Predictors of surgical re-intervention: variable importance

The AUC was used to quantify the importance of each predictor. For each RF model, the AUC was calculated. The differences in AUC for the individual clinical predictors (permutation-based VIM) in the optimized model were, in ascending order of importance: 0.005 for parity, 0.017 for previous caesarean section, 0.019 for age, 0.026 for dysmenorrhea, and 0.051 for duration of menstruation. This means that dysmenorrhea and duration of menstruation have the highest impact on the AUC of the RF model (Fig. 4).

Fig. 4 Contribution of predictors of surgical re-intervention within 2 years after endometrial ablation, after hyperparameter optimization

Discussion

Main findings

In this study, an ML model was developed using the random forest technique to predict surgical re-intervention within 2 years after EA. Its predictive performance was compared with that of the existing logistic regression model of Stevens et al. [16].

The existing logistic regression model has a C-index of 0.71 (95% CI 0.64–0.78) [16]. The ML model developed in this study has a C-index of 0.65 (95% CI 0.56–0.74). This suggests that the LR prediction model developed by Stevens et al. [16] probably performs better in predicting surgical re-intervention within 2 years after EA than the newly developed ML model. However, the confidence intervals overlap, so a statistically significant difference in performance cannot be concluded.

Explaining the significant factors in the model

In the LR model, high parity (≥ 5) is a predictive variable for surgical re-intervention. This can be related to the larger uterine cavity of grand multiparous women. However, when considering our ML model, parity does not have a large impact on the AUC. This is in line with previously reported studies that show no significant increased risk of treatment failure with increasing parity [1, 15].

Previous caesarean section is also related to higher rates of surgical re-intervention, which can be explained by irregularity of the uterine wall caused by the uterine scar [38]. This can inhibit complete contact of the ablation device with the uterine wall, leading to residual active endometrium.

In our cohort, pre-operative dysmenorrhea is associated with a higher risk of surgical re-intervention. There is evidence that the gynaecologic pathology causing this dysmenorrhea (adenomyosis and endometriosis) reduces the success of endometrial ablation [8, 17, 39,40,41]. This can be explained by the fact that EA is not an appropriate treatment for these diseases, because the energy only has a superficial effect on the uterine wall. It could help to diagnose these diseases before performing EA. However, the sensitivity and specificity of the diagnostic tools for detecting these diseases in the pre-operative setting are still low [42].

In line with previous studies, we found that younger age was associated with a higher risk of surgical re-intervention [7, 9,10,11,12,13, 43].

A duration of menstruation > 7 days is also associated with a higher risk of surgical re-intervention after EA. This may be caused by a thicker endometrium, which is more difficult for the device to remove completely [7, 10].

Interpretation in light of other evidence

There are several possible reasons why the LR model probably performs better than the ML model.

Firstly, ML tends to work better for variables with strong predictive power [20, 44]. We observed that most of the candidate predictors in this model have low predictive power. The variables parity, age, and previous c-section show low predictive power. On the one hand, the outcome can be unpredictable, meaning these candidate predictors have little influence on the outcome measure. On the other hand, the dataset can be too small to identify the predictive power of a candidate predictor. A larger dataset could possibly identify more predictors [20, 44].

Secondly, some studies demonstrate that ML performs better when a larger set of potential predictors is used. There seems to be an influence of the number of predictors (p) and the ratio of p to the sample size (p:n). ML tends to perform better for increasing p and p:n [20, 24, 45, 46]. In our study, to limit potential bias, the same five predictors as published before [16] were considered for the LR and ML algorithms. We did this to allow a fair comparison between the two models, probably to the disadvantage of the ML model [20, 24, 45, 46].

Another possible reason for the lower AUC of the ML model is that large datasets are needed to reach optimal performance. A dataset with 446 participants might be too small for ML to support robust conclusions. For LR, however, this number of patients can be enough to develop a prediction model.

Finally, we can also consider that for this clinical problem a logistic approach is better than an ML model for modelling the relationship between surgical re-intervention and the explanatory variables. The previously mentioned complex, nonlinear relationships that an ML approach can better capture are probably not present in this dataset.

Strengths and limitations

The predictors obtained by univariate and multivariate logistic regression are in accordance with the existing literature [1, 8, 10,11,12,13,14,15, 17, 47]. However, when we compare the importance of each variable between the LR and ML models, we find a different ranking.

This difference in ranking is a limitation of the study: there is no proper way to compare the importance of each predictor of surgical re-intervention between the ML and LR models, because importance is calculated differently (OR for the LR model and difference in AUC for the ML model).

Dysmenorrhea (OR 2.48) and a parity ≥ 5 (OR 7.63) have the highest odds ratios in the multivariate LR analysis, while for the ML model the duration of menstruation and dysmenorrhea are the most important variables. We consider two possible reasons for the difference in importance. The first reason is that for the LR model all continuous variables (except age) were discretized, while for the ML model continuous variables were handled as continuous. A second reason is that in the LR model the predictors have different units, and these were not standardized. This means that a subjective assessment of variable importance cannot easily be made by simply comparing the raw sizes of the ORs [21, 23, 31, 44]. This can be seen as a strength of our study, since the difference in AUC for each predictor (permuted vs. not permuted) reflects the variable importance in a standardized way.

We used bootstrap resampling for internal validation (n = 5000) for both the LR and ML models. Using the same validation method limits potential bias.

Furthermore, the same predictors were considered for the LR and ML algorithms. This limits potential bias but also limits the potential power of an ML technique.

It could be seen as a limitation of this study that we did not perform an external validation in another cohort. However, we did not expect it to be significantly better in performance, since the internal validation of the ML did not perform better than the logistic regression model. In addition, an external validation for the logistic regression model is being performed at the time of this study.

Finally, LR models are mostly used in clinical practice, since ML models are not easily implemented there: they are often not available in the software packages commonly used in clinical practice. However, structured data registration is increasing, which will make it easier to create large datasets for ML programmes. In this way, clinical practice may benefit from the advantages of ML models.

Conclusion

In conclusion, for the prediction of surgical re-intervention within 2 years after EA, the logistic regression model gives a better prediction than the machine learning model. However, machine learning algorithms should always be considered because of their possible clinical advantages. So far, there is no evidence that one single algorithm outperforms the others in general use. Both the ML and LR models can identify the clinical predictors of surgical re-intervention and contribute to the shared decision-making process in clinical practice. Based on our ML model, a longer duration of menstruation and the presence of dysmenorrhea are important predictive factors for surgical re-intervention.