Background

Patients with an ovarian tumor should be triaged to appropriate management. There is evidence that treatment in oncology centers improves ovarian cancer prognosis [1, 2]. However, benign ovarian cysts are frequent and can be managed conservatively (i.e. non-surgically, with clinical and ultrasound follow-up) or with surgery in a general hospital [3]. Risk prediction models can support optimal patient triage by estimating a patient’s risk of malignancy based on a set of predictors [4, 5]. ADNEX is a multinomial logistic regression (MLR) model that uses nine clinical and ultrasound predictors to estimate the probabilities that a tumor is benign, borderline, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic [6,7,8]. ADNEX differentiates between four types of malignancy because these tumor types require different management [7, 9].

There is increasing interest in the use of flexible machine learning algorithms to develop prediction models [10,11,12]. In contrast to regression models, flexible machine learning algorithms do not require the user to specify the model structure: these algorithms automatically search for nonlinear associations and potential interactions between predictors [10]. This may result in better performing models, but poor design and methodology may yield misleading and overfitted results [10, 11]. A recent systematic review observed better performance for flexible machine learning algorithms versus logistic regression when comparisons were at high risk of bias, but not when comparisons were at low risk of bias [10]. Few of the included studies addressed the accuracy of the risk estimates (calibration), and none assessed clinical utility.

In addition, there is increasing awareness of the uncertainty of predictions [13, 14]. It is known that probability estimates for individuals are unstable, in the sense that fitting the model on a different sample from the same population may lead to very different probability estimates for individual patients [15, 16]. This instability decreases with increasing sample size for model development, but is considerable even when models are based on currently recommended sample sizes [16, 17]. Apart from instability, ‘model uncertainty’ reflects the impact of various decisions made during model development on the estimated probabilities for individual patients. Modeling decisions may relate to issues such as the choice of predictors or the method to handle missing data [18, 19]. All other modeling decisions being equal, the choice of modeling algorithm (e.g. logistic regression versus random forest) may also play a role.

In this study, we (1) compare the performance of multiclass risk models for ovarian cancer diagnosis based on regression and flexible machine learning algorithms in terms of discrimination, calibration, and clinical utility, and (2) assess differences between the models regarding the estimated probabilities for individual patients to study model uncertainty caused by choosing a particular algorithm.

Methods

Study design, setting and participants

This is a secondary analysis of prospectively collected data from multicenter cohort studies conducted by the International Ovarian Tumor Analysis (IOTA) group. For model training, we used data from 5909 consecutively recruited patients at 24 centers across four consecutive cohort studies between 1999 and 2012 [6, 20,21,22,23]. All patients had at least one adnexal (ovarian, para-ovarian, or tubal) mass that was judged not to be a physiological cyst, provided consent for transvaginal ultrasound examination, were not pregnant, and underwent surgical removal of the adnexal mass within 120 days after the ultrasound examination. This dataset was also used to develop the ADNEX model [6]. For external validation, we used data from 3199 consecutively recruited patients at 25 centers between 2012 and 2015 [8]. All patients had at least one adnexal mass that was judged not to be a physiological cyst with a largest diameter below 3 cm, and provided consent for transvaginal ultrasound examination. Although this study recruited patients who subsequently underwent surgery or were managed conservatively, the current work only used data from patients who underwent surgery within 120 days after the ultrasound examination without additional preoperative ultrasound visits. The external validation dataset was therefore comparable to the training dataset.

Participating centers were ultrasound units in gynecological oncology centers (labeled ‘oncology centers’) or gynecological ultrasound units not linked to an oncology center. All mother studies received ethics approval from the Research Ethics Committee of the University Hospitals Leuven and from each local ethics committee. All participants provided informed consent. We obtained approval from the Ethics Committee in Leuven (S64709) for secondary use of the data for methodological purposes. We report this study using the TRIPOD checklist [4, 24].

Data collection

A standardized history was taken from each patient at the inclusion visit to obtain clinical information, and all patients underwent a standardized transvaginal ultrasound examination [25]. Transabdominal sonography was added if necessary, e.g. for large masses. Information on a set of predefined gray scale and color or power Doppler ultrasound variables was collected following the research protocol. When more than one mass was present, examiners included the mass with the most complex ultrasound morphology. If multiple masses had similar morphology, the largest mass or the mass best seen on ultrasound was included. Measurement of CA125 was neither mandatory nor standardized but was done according to local protocols regarding kits and timing.

Outcome

The outcome was the classification of the mass into one of five categories based on the histological diagnosis of the mass following laparotomy or laparoscopic surgery, with staging of malignant tumors according to the classification of the International Federation of Gynecology and Obstetrics (FIGO): benign, borderline, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic [26, 27]. Stage I invasive tumors have not spread outside the ovary, and hence have the best prognosis of the primary invasive tumors. The histological assessment was performed without knowledge of the detailed results of the ultrasound examination, but pathologists may have received clinically relevant information as per local procedures.

Statistical analysis

Predictors and sample size

We used the following nine clinical and ultrasound predictors: type of center (oncology center vs. other), patient age (years), serum CA125 level (U/ml), proportion of solid tissue (maximum diameter of the largest solid component divided by the maximum diameter of the lesion), maximum diameter of the lesion (mm), presence of shadows (yes/no), presence of ascites (yes/no), presence of more than ten cyst locules (yes/no), and number of papillary projections (0, 1, 2, 3, > 3). These predictors were selected for the ADNEX model based on expert domain knowledge regarding likely diagnostic importance, objectivity, and measurement difficulty, and based on stability between centers (see Supplementary Material 1 for more information) [6]. We also developed models without CA125: not all centers routinely measure CA125, and including CA125 implies that predictions can only be made when the laboratory result becomes available. We discuss the adequacy of our study sample size in Supplementary Material 2.
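To illustrate how the predictor set could be assembled, a minimal R sketch is given below; all column names and the data frame `dev` are hypothetical, not the study’s actual code (the study’s preprocessing is described in [6] and Supplementary Material 1):

```r
# Minimal sketch of the nine-predictor set; all column names and the data
# frame 'dev' are illustrative assumptions. The proportion of solid tissue
# is derived from two diameter measurements.
dev$prop_solid <- dev$solid_diameter_mm / dev$lesion_diameter_mm  # in [0, 1]

predictors <- c("oncocenter",          # type of center (oncology vs. other)
                "age",                 # years
                "ca125",               # U/ml (dropped in models without CA125)
                "prop_solid",          # derived above
                "lesion_diameter_mm",  # maximum diameter of the lesion
                "shadows", "ascites",  # yes/no
                "over10_locules",      # more than ten cyst locules (yes/no)
                "papillations")        # papillary projections (0, 1, 2, 3, > 3)
```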

Algorithms

We developed models using standard MLR, ridge MLR, random forest (RF), extreme gradient boosting (XGBoost), neural networks (NN), and support vector machines (SVM) [28,29,30,31]. For the MLR models, continuous variables were modeled with restricted cubic splines (using 3 knots) to allow for nonlinear associations [32]. Hyperparameters were tuned with 10-fold cross-validation on the development data (Supplementary Material 3). Using the selected hyperparameters, each model was then trained on the full development data.
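A minimal R sketch of this setup follows; the formula, variable names, data objects (`dev`, `val`), and tuning grid size are illustrative assumptions, not the study code (the actual hyperparameter grids are in Supplementary Material 3):

```r
# Sketch of model fitting and tuning; 'dev' (development) and 'val'
# (validation) are assumed data frames holding the outcome and predictors.
library(nnet)     # multinom(): standard MLR
library(splines)  # ns(): natural cubic splines (restricted cubic splines)
library(caret)    # train(): hyperparameter tuning with cross-validation

# Standard MLR; ns(x, df = 2) gives a natural cubic spline with 3 knots
mlr_fit <- multinom(
  outcome ~ ns(age, df = 2) + ns(ca125, df = 2) + ns(prop_solid, df = 2) +
    ns(lesion_diameter_mm, df = 2) + oncocenter + shadows + ascites +
    over10_locules + papillations,
  data = dev, maxit = 1000
)

# Flexible algorithm (here RF via ranger); 10-fold CV selects hyperparameters,
# after which caret refits on the full development data
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)
rf_fit <- train(outcome ~ ., data = dev,
                method = "ranger", trControl = ctrl, tuneLength = 5)

# Estimated probabilities for the five outcome categories
p_mlr <- predict(mlr_fit, newdata = val, type = "probs")
p_rf  <- predict(rf_fit,  newdata = val, type = "prob")
```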

Model performance on external validation data

Discrimination was assessed with the Polytomous Discrimination Index (PDI), a multiclass extension of the binary c-statistic (or area under the receiver operating characteristic curve, AUROC) [33]. In this study, the PDI equals 0.2 (one divided by five outcome categories) for useless models and 1 for perfect discrimination: the PDI estimates the probability that the model correctly identifies the patient from a randomly chosen category within a set of five patients (one from each outcome category). We also calculated pairwise c-statistics for each pair of outcome categories using the conditional risk method [34]. Finally, we calculated the binary c-statistic to discriminate benign from any type of malignant tumor. The estimated risk of any type of malignancy equals one minus the estimated probability of a benign tumor. PDI and c-statistics were analyzed through meta-analysis of center-specific results. We calculated 95% prediction intervals (PI) from the meta-analysis to indicate what performance to expect in a new center.
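A Monte Carlo approximation of the PDI can be sketched as below; the matrix `probs` and factor `y` are assumed inputs, and ties are ignored for simplicity, so this is an illustration rather than the exact estimator of [33]:

```r
# Monte Carlo sketch of the Polytomous Discrimination Index (PDI).
# 'probs': n x 5 matrix of estimated probabilities (columns ordered as
# levels(y)); 'y': factor with the five outcome categories (both assumed).
pdi_mc <- function(probs, y, n_sets = 1e5) {
  M   <- nlevels(y)
  idx <- split(seq_along(y), y)  # patient indices per outcome category
  correct <- numeric(M)
  for (s in seq_len(n_sets)) {
    # draw one random patient from each outcome category
    set <- vapply(idx, function(v) v[sample.int(length(v), 1)], integer(1))
    for (i in seq_len(M)) {
      # category i is correctly identified if its patient has the highest
      # estimated probability for category i within the set of five
      correct[i] <- correct[i] + (which.max(probs[set, i]) == i)
    }
  }
  mean(correct / n_sets)  # 0.2 = useless (five categories), 1 = perfect
}
```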

Calibration was assessed using flexible (loess-based) calibration curves per outcome category; center-specific curves were averaged, weighted by the square root of the center’s sample size [35]. Calibration curves were summarized by the rescaled Estimated Calibration Index (ECI) [36]. The rescaled ECI equals 0 if the calibration curve fully coincides with the diagonal line, and 1 if the calibration curve is horizontal (i.e. the model has no predictive ability).
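For one outcome category, a flexible calibration curve can be obtained roughly as in the R sketch below; `p` and `y01` are assumed inputs, and the raw ECI computation shown omits the rescaling step of [36]:

```r
# Loess-based calibration curve for one outcome category (sketch).
# 'p': estimated probability of the category; 'y01': 1 if the patient has
# that outcome, 0 otherwise (both are assumed, illustrative objects).
cal  <- loess(y01 ~ p)                      # flexible smoother
grid <- seq(min(p), max(p), length.out = 100)
obs  <- predict(cal, newdata = data.frame(p = grid))
plot(grid, obs, type = "l",
     xlab = "Estimated probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)                       # ideal: curve on the diagonal

# Unscaled ECI: average squared distance between the calibration curve and
# the estimated probabilities (the rescaled version maps this onto [0, 1])
eci_raw <- mean((predict(cal, newdata = data.frame(p = p)) - p)^2)
```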

We calculated the Net Benefit to assess the utility of the models for selecting patients for referral to a gynecologic oncology center [37, 38]. A consensus statement suggests referring patients when the risk of malignancy is ≥ 10% [9]. We plotted Net Benefit for malignancy risk thresholds between 5% and 40% in a decision curve, but we focus on the 10% risk threshold. At each threshold, the Net Benefit of the models was compared with default strategies: select everyone (‘treat all’) or select no-one (‘treat none’) for referral [37, 38]. Net Benefit was calculated using meta-analysis of center-specific results [39]. For each pair of models, we calculated decision reversal as the percentage of patients for which one model estimated a risk ≥ 10% and the other < 10%.
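At a threshold t, the Net Benefit weighs true positives (correct referrals) against false positives with weight t/(1 − t). A minimal R sketch, with `risk` and `malignant` as assumed inputs:

```r
# Net Benefit at risk threshold t (sketch; 'risk' = estimated risk of any
# malignancy, 'malignant' = observed outcome coded 1/0, both assumed objects)
net_benefit <- function(risk, malignant, t = 0.10) {
  n  <- length(risk)
  tp <- sum(risk >= t & malignant == 1)  # referred and malignant
  fp <- sum(risk >= t & malignant == 0)  # referred but benign
  tp / n - (fp / n) * t / (1 - t)        # benefit minus weighted harm
}

# Default comparators: refer everyone ('treat all') or no-one ('treat none')
nb_all <- function(malignant, t = 0.10) {
  mean(malignant) - (1 - mean(malignant)) * t / (1 - t)
}
nb_none <- 0  # selecting no-one yields zero true and false positives
```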

Missing values for CA125

CA125 was missing for 1805 (31%) patients in the development data and for 966 (30%) patients in the validation data. Patients with tumors that looked suspicious for malignancy more often had CA125 measured. We used multiple imputation by chained equations to deal with missing CA125 values. The imputations, performed separately for the development and validation data, were taken from the original publications; see Supplementary Material 4 for details [6, 8].
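A minimal sketch of such an imputation step with the R package mice follows; the number of imputations, the imputation method, and the data object shown here are assumptions, as the actual setup followed the original publications:

```r
# Multiple imputation by chained equations for missing CA125 (sketch);
# m, the method, and the data frame 'dev' are illustrative assumptions.
library(mice)
imp <- mice(dev, m = 10, method = "pmm", seed = 2022)  # predictive mean matching
dev_imputed <- complete(imp, action = "all")           # list of completed datasets
```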

Modeling procedure and software

Supplementary Material 5 presents the modeling and validation procedure for models with and without CA125. The analysis was performed with R version 4.1.2, using package nnet (MLR), and caret together with packages glmnet (ridge MLR), ranger (RF), xgboost (XGBoost), nnet (NN), and kernlab (SVM) [29]. Meta-analysis for Net Benefit was performed using WinBUGS.

Results

Descriptive statistics for the development and validation datasets are shown in Tables 1 and S1. A list of centers with the distribution of the five tumor types is shown in Table S2. The median age of the patients was 47 years (interquartile range 35–60) in the development dataset and 49 years (interquartile range 36–62) in the validation dataset. Most tumors were benign: 3980 (67%) in the development dataset and 1988 (62%) in the validation dataset. Secondary metastatic tumors were least common: 246 (4%) in the development dataset and 172 (5%) in the validation dataset.

Table 1 Descriptive statistics of predictors and outcome in the development and validation datasets

Discrimination performance

For models with CA125, PDI ranged from 0.41 (95% CI 0.39–0.43) for SVM to 0.55 (0.51–0.60) for XGBoost (Table 2, Figure S1). In line with these results, the pairwise c-statistics were generally lower for SVM than for the other models (Table S3). For the best models, pairwise c-statistics were above 0.90 for benign versus stage II-IV tumors, benign versus secondary metastatic tumors, benign versus stage I tumors, and borderline versus stage II-IV tumors. For all models, pairwise c-statistics were below 0.80 for borderline versus stage I tumors, stage I versus secondary metastatic tumors, and stage II-IV versus secondary metastatic tumors. The binary c-statistic (or AUROC) for any malignancy was 0.92 for all algorithms except ridge MLR (0.90) and SVM (0.89) (Figure S2).

Table 2 Overview of discrimination, calibration, and utility performance on external validation data

For models without CA125, PDI ranged from 0.42 (95% CI 0.39–0.45) for SVM to 0.51 (0.47–0.54) for standard MLR (Table 2, Figure S3). Including CA125 mainly improved c-statistics for stage II-IV primary invasive vs. secondary metastatic tumors and for stage I vs. stage II-IV primary invasive tumors (Table S4). The binary c-statistic for any malignancy was less affected by excluding CA125, with values up to 0.91 (Figure S4).

Calibration performance

For models with CA125, the estimated probability of a benign tumor was too high on average for all algorithms, in particular for SVM (Fig. 1). The risks of a stage I tumor and of a secondary metastatic tumor were fairly well calibrated. The risk of a borderline tumor was slightly too low on average for all algorithms. The risk of a stage II-IV tumor was too low on average for standard MLR and ridge MLR, and in particular for SVM. Based on the ECI, RF and XGBoost had the best calibration performance and SVM the worst (Table 2). Box plots of the estimated probabilities for each algorithm are presented in Figures S5-S10. For models without CA125, calibration results were roughly similar (Table 2, Figures S11-S17).

Fig. 1

Flexible calibration curves for models with CA125 on external validation data. Abbreviations: MLR, multinomial logistic regression; XGBoost, extreme gradient boosting

Clinical utility

All models with CA125 were superior to the default strategies (treat all, treat none) at any threshold (Figure S18). At the 10% threshold for the risk of any malignancy, all algorithms had similar Net Benefit (Table 2). At higher thresholds, RF and XGBoost had the best results, SVM the worst. For models without CA125, results were roughly similar (Table 2, Figure S19, Table S4).

Comparing estimated probabilities between algorithms

For an individual patient, the six models could generate very different probabilities. For example, depending on the model, the estimated probability of a benign tumor differed by at least 0.20 (20 percentage points) for 29% (models with CA125) and 31% (models without CA125) of the validation patients (Table 3, Figure S20). These absolute differences were related to the prevalences of the outcome categories: the differences were largest for the most common category (benign) and smallest for the least common category (secondary metastatic). Scatter plots of estimated probabilities for each pair of models are provided in Figs. 2, 3, 4, 5 and 6 for models with CA125 and in Figures S21-S25 for models without CA125. When comparing two models at the 10% threshold for the estimated risk of any malignancy, between 3% (XGBoost vs. NN, with CA125) and 30% (NN vs. SVM, without CA125) of patients fell on opposite sides of the threshold (Table S5).
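Decision reversal for a pair of models can be computed directly from their estimated risks, as in this minimal R sketch (`risk_a` and `risk_b` are assumed vectors of two models’ estimated risks of any malignancy for the same validation patients):

```r
# Decision reversal at the 10% referral threshold (sketch; 'risk_a' and
# 'risk_b' are assumed vectors of two models' estimated malignancy risks)
decision_reversal <- function(risk_a, risk_b, t = 0.10) {
  # percentage of patients falling on opposite sides of the threshold
  100 * mean((risk_a >= t) != (risk_b >= t))
}
```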

Table 3 Descriptive statistics of the probability range across the six models on external validation data
Fig. 2

Scatter plots of the estimated risk of a benign tumor on validation data, for each pair of models with CA125. Abbreviations: MLR, multinomial logistic regression; RF, random forest; XGBoost, extreme gradient boosting; NN, neural network; SVM, support vector machine

Fig. 3

Scatter plots of the estimated risk of a borderline tumor on validation data, for each pair of models with CA125. Abbreviations: MLR, multinomial logistic regression; RF, random forest; XGBoost, extreme gradient boosting; NN, neural network; SVM, support vector machine

Fig. 4

Scatter plots of the estimated risk of a stage I tumor on validation data, for each pair of models with CA125. Abbreviations: MLR, multinomial logistic regression; RF, random forest; XGBoost, extreme gradient boosting; NN, neural network; SVM, support vector machine

Fig. 5

Scatter plots of the estimated risk of a stage II-IV tumor on validation data, for each pair of models with CA125. Abbreviations: MLR, multinomial logistic regression; RF, random forest; XGBoost, extreme gradient boosting; NN, neural network; SVM, support vector machine

Fig. 6

Scatter plots of the estimated risk of a secondary metastatic tumor on validation data, for each pair of models with CA125. Abbreviations: MLR, multinomial logistic regression; RF, random forest; XGBoost, extreme gradient boosting; NN, neural network; SVM, support vector machine

Discussion

We compared six algorithms for developing multinomial risk prediction models for ovarian cancer diagnosis. No algorithm clearly outperformed the others: XGBoost, RF, NN, and MLR had similar performance, while SVM performed worst. CA125 mainly improved discrimination between stage II-IV primary invasive tumors and the other two types of invasive tumors. Despite similar performance for several algorithms, the choice of algorithm had a clear impact on the estimated probabilities for individual patients. Choosing a different algorithm could lead to different clinical decisions in a substantial percentage of patients.

Strengths of the study include (1) the use of large international multicenter datasets, (2) data collection according to a standardized ultrasound examination technique, measurement technique, and terminology [25], (3) evaluation of risk calibration and clinical utility, and (4) appropriate modeling practices, i.e. addressing nonlinearity for continuous predictors in the regression models and tuning hyperparameters for the machine learning algorithms. Such modeling practices are often lacking in comparative studies [10]. A limitation could be that we included only patients who received surgery, thereby excluding patients managed conservatively. This limitation affects most studies on ovarian malignancy diagnosis, because surgery allows using histopathology to determine the outcome. The use of a fixed set of predictors could also be perceived as a limitation. However, these predictors were carefully selected, largely based on expert domain knowledge, for the development of the ADNEX model, which is perhaps the best performing ultrasound-based prediction model to date [6, 8, 40]. Including more predictors, or using a data-driven selection procedure per algorithm, would likely increase the observed differences in estimated probabilities between algorithms.

Previous studies that developed machine learning models using sonographic and clinical variables to estimate the risk of malignancy in adnexal masses relied on smaller datasets (median sample size 357, range 35–3004) [41,42,43,44,45,46,47,48,49,50,51,52]. Calibration was not assessed, and the outcome was binary (usually benign vs. malignant) in all but two studies. One study distinguished between benign, borderline, and invasive tumors [45], and another distinguished between benign, borderline, primary invasive, and secondary metastatic tumors [44]. However, sample size was small in these two studies (the smallest outcome category had 16 and 30 cases, respectively, in the development set). All studies focused exclusively on neural networks, support vector machines, or related kernel-based methods. All but one of these studies implicitly or explicitly supported the use of machine learning algorithms over logistic regression.

Our results illustrate that probability estimates for individual patients can vary substantially by algorithm. There are different types of uncertainty in individual predictions [53]. ‘Aleatory uncertainty’ implies that two patients with the same predictor measurements (same age, same maximum lesion diameter, and so on) may have different outcomes. ‘Epistemic uncertainty’ refers to lack of knowledge about the best final model and is divided into ‘approximation uncertainty’ and ‘model uncertainty’ [53]. ‘Approximation uncertainty’ reflects sample size: the smaller the sample size, the more uncertain the developed model. This means that very different models can be obtained when fitting the same algorithm to different training datasets of the same size, and that these differences become smaller with increasing sample size. ‘Model uncertainty’ reflects the impact of various decisions made during model development. Our study illustrates that the choice of algorithm is an important component of model uncertainty.

A first implication of our work is that there is no important advantage of using flexible machine learning over multinomial logistic regression for developing ultrasound-based risk models to support clinical decisions in ovarian cancer diagnosis. An MLR-based model is easier to implement, update, and explain than a flexible machine learning model. We would like to emphasize that the ADNEX model mentioned in the introduction, although based on MLR, includes random intercepts by center [6]. This is an advantage because it acknowledges that the prevalences of the outcome categories vary between centers [54]. We did not use random intercepts in the current study, because they do not generalize directly to flexible algorithms.

A second implication is that the choice of algorithm matters for individual predictions, even when discrimination, calibration, and clinical utility are similar. Different models with equal clinical utility in the population may yield very different risk estimates for an individual patient, and this may lead to different management decisions for the same individual. Although, in our opinion, the crux of clinical risk prediction models is that their use should lead to improved clinical decisions for a specific population as a whole, differences in risk estimates for the same individual are an important finding. More research is needed to better understand uncertainty in predictions caused by the choice of algorithm, or by other decisions made by the modeler such as the predictor selection method.

The observation that different algorithms may make different predictions emphasizes the need for sufficiently large databases when developing prediction models. The recently established guidance on minimum sample size to develop a regression-based prediction model is a crucial step forward [55, 56]. However, it is based on general performance measures related to discrimination and calibration, and does not cover uncertainty of risk estimates for individual patients. Hence, if possible, the sample size should be larger than what the guidance would suggest. Flexible machine learning algorithms may require even more data than regression algorithms [57]. We should also consider providing an indication of the uncertainty around a risk estimate. Confidence intervals around the estimated probabilities may be provided, although these may be confusing for patients [58]. Moreover, standard confidence intervals do not capture all sources of uncertainty. The biostatistics and machine learning communities are currently researching methods to quantify the confidence of predictions [13, 14, 17, 59, 60]. Related options may be explored, such as models that abstain from making predictions when uncertainty is too large [14].

Conclusion

Several algorithms had similar performance and good clinical utility to estimate the probability of five tumor types in women with an adnexal (ovarian, para-ovarian, or tubal) mass treated with surgery. However, different algorithms could yield very different probabilities for individual patients.