Key points

  • Different combinations of feature selection methods and classifiers result in models that are not significantly different from the best model in approximately 57% of cases.

  • The features selected by the best-performing model and by statistically similar models are largely different (mean similarity 0.19; range 0.02–0.42), although their correlation is higher (mean 0.71; range 0.40–0.92).

  • The relevance of features often cannot be decided based on the single best-performing model alone.

Background

Radiomics is an emergent technique used for diagnostic and predictive purposes and is based on machine learning [1, 2]. It promises non-invasive, personalized medicine and is applied primarily in an oncological context for diagnosis, survival prediction, and other purposes [3]. Radiomics is often performed using a well-established machine learning pipeline: generic features are first extracted from the images, after which feature selection methods and classifiers are employed for modeling [4, 5]. One of the main concerns related to radiomics is whether the extracted features have biological meaning [6].

However, in the absence of a direct link between features and the underlying biology of various pathologies, radiomic studies extract many generic features in the hope that some will be associated with the biology and thus be predictive [7]. These features depend strongly on several choices, such as the acquisition parameters, the preprocessing, and the segmentation of the volume of interest, so the resulting feature sets necessarily contain both correlated and irrelevant features [8, 9]. Radiomics then proceeds by using a feature selection method to identify relevant features and a machine learning model for prediction. It seems natural to consider the relevant features of the best-performing model as surrogates for biomarkers: because such features contribute to the model’s predictive performance, they should be considered informative and are at least good candidates for biomarkers [10,11,12,13,14].

Unfortunately, from a statistical standpoint, there is often not a single best-performing model. Different choices of feature selection methods and classifiers can lead to models that perform only slightly worse than the best-performing model; in these cases, the null hypothesis that their performances are equal cannot be rejected. If the best-performing model’s features are associated with the underlying biology, the question arises whether the same features can be considered relevant in statistically similar models.

Therefore, using 14 publicly available radiomic datasets, we analyzed whether the features selected by statistically similar models are similar. We employed 8 different feature selection methods and 8 classifiers and measured the predictive performance of the models by the area under the receiver operating characteristic curve (AUC-ROC). Among the best-performing and statistically similar models, we compared the stability of the feature selection as well as the similarity and correlation of the selected features.

Methods

To ensure reproducibility, only publicly available datasets were employed for this study. Ethical approval for this study was therefore waived by the local ethics committee (Ethik-Kommission, Medizinische Fakultät der Universität Duisburg-Essen, Germany). All methods and procedures were performed following the relevant guidelines and regulations.

Datasets

We identified publicly available datasets by reviewing open-access journals. A total of 14 radiomic datasets were included in this study (Table 1). As is common for radiomic datasets, these datasets were all high-dimensional; in other words, they contained more features than samples, except for the dataset Carvalho2018. Since the focus of this study is on radiomic features, only features derived from imaging data were used; other features, e.g., clinical or genetic features, were removed. All available data were merged to reduce the effects of non-identically distributed data.

Table 1 Overview of the datasets used for the study

Features

All datasets were available in preprocessed form; that is, they contained already extracted features. Extraction and acquisition parameters differed for each dataset, and because in-house software was used, compliance with the Image Biomarker Standardisation Initiative (IBSI) was not always ensured [15]. Texture and histogram features were available for all datasets, but shape features were available only for some. Further information on the feature extraction and preprocessing methods used for each dataset can be found in the corresponding study.

Preprocessing

Additional simple preprocessing was performed to harmonize the data; specifically, missing values were imputed using column-wise mean, and the datasets were normalized using z-scores.
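A minimal sketch of this preprocessing in Python with scikit-learn is shown below; the function name and the toy matrix are illustrative and not taken from the study.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def preprocess(X):
    """Impute missing values with the column-wise mean and z-score all features."""
    X = SimpleImputer(strategy="mean").fit_transform(X)  # column-wise mean imputation
    return StandardScaler().fit_transform(X)             # z-score normalization


# Toy feature matrix with one missing value
X_toy = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
print(preprocess(X_toy))
```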

Feature selection methods

Eight often-used feature selection methods were employed (Table 2), including LASSO, MRMRe, and MIM. These feature selection algorithms do not directly identify the most relevant features but instead calculate a score for each feature. Thus, a choice had to be made as to how many of the highest-scoring features should be passed to the subsequent classifier. The number of selected features N was chosen from 1, 2, 4, …, 64.

Table 2 Overview of all feature selection methods used
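As an illustration of scoring features and keeping only the N highest-scoring ones, the sketch below uses the ANOVA F-score from scikit-learn (Anova being one of the methods used in this study); the synthetic data and the choice of scoring function are only for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic high-dimensional dataset standing in for a radiomic dataset
X, y = make_classification(n_samples=100, n_features=200, random_state=0)

# Score every feature, then keep the N highest-scoring ones for N = 1, 2, 4, ..., 64
for n in [1, 2, 4, 8, 16, 32, 64]:
    selector = SelectKBest(score_func=f_classif, k=n).fit(X, y)
    selected = np.flatnonzero(selector.get_support())
    print(f"N={n}: features {selected[:5].tolist()} ...")
```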

Classifiers

After the choice of the feature selection methods, the choice of classifier is most important because suboptimal classifiers yield inferior results. Eight often-used classifiers were selected (Table 3).

Table 3 Overview of all classifiers used during training

Training

Training proceeded following the standard radiomics workflow; combinations of feature selection methods and classifiers were used. In the absence of explicit validation and test sets, tenfold stratified cross-validation was employed.
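A sketch of one such combination is shown below, assuming scikit-learn components; the pairing of Anova feature selection with a logistic regression classifier, the fixed N = 8, and the synthetic data are illustrative and not necessarily the combinations used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=120, n_features=300, random_state=0)

# One feature-selection/classifier combination; wrapping both in a pipeline ensures
# that feature selection is refitted on each training fold only.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=8)),
    ("classify", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Out-of-fold predicted probabilities of every sample, pooled over the 10 test folds
probabilities = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
print(probabilities[:5])
```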

Evaluation

Predictive performance is the most important metric in radiomics. Therefore, AUC-ROCs were employed for evaluation. Model predictions over all 10 test folds of the cross-validation were pooled into a single receiver operating characteristic (ROC) curve. A DeLong test was used to determine whether models were statistically different.
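The sketch below illustrates the pooled evaluation; the labels and probabilities are toy values standing in for the pooled out-of-fold predictions of one model, and the DeLong comparison itself (performed with the R package pROC, see Statistics) is not reproduced here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Labels and predicted probabilities of all samples, pooled over the 10 test folds
y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.4, 0.7, 0.9, 0.1, 0.6, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_prob)     # single pooled ROC curve
pooled_auc = roc_auc_score(y_true, y_prob)  # AUC-ROC of the pooled predictions
print(f"Pooled AUC-ROC: {pooled_auc:.2f}")
```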

For the evaluation, we focused on the best-performing model for each dataset as well as the models that could not be shown to be statistically different from it. A histogram of the Pearson correlations between all features was plotted to visually depict the correlations present in the datasets.
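Such a histogram can be produced along the following lines; the random matrix merely stands in for a dataset’s feature matrix, and matplotlib is used only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                 # placeholder feature matrix (samples x features)

corr = np.corrcoef(X, rowvar=False)            # feature-by-feature Pearson correlations
pairs = corr[np.triu_indices_from(corr, k=1)]  # unique off-diagonal feature pairs

plt.hist(pairs, bins=50)
plt.xlabel("Pearson correlation between feature pairs")
plt.ylabel("Count")
plt.show()
```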

Stability of the feature selection methods

A critical property for the interpretation of features is the stability of the feature selection, that is, whether a given feature selection method will select similar features when presented with data from the same distribution. Using the data from the tenfold cross-validation, we calculated the stability using the Pearson correlation method. Because features are selected prior to the training of the classifier, the stability does not depend on the classifier; it does, however, depend on the number of selected features, so the stability can differ between two models that use the same feature selection method but a different number of features. The stability is 1.0 if exactly the same features are selected in every cross-validation fold.
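One plausible way to compute such a Pearson-correlation-based stability is sketched below: the selection of each fold is encoded as a binary indicator vector, and the correlations of all pairs of folds are averaged. The exact formula used in the study may differ in detail.

```python
from itertools import combinations

import numpy as np


def stability(selected_per_fold, n_features):
    """Mean Pearson correlation between the binary selection-indicator
    vectors of all pairs of cross-validation folds."""
    indicators = np.zeros((len(selected_per_fold), n_features))
    for i, selected in enumerate(selected_per_fold):
        indicators[i, selected] = 1.0
    return float(np.mean([np.corrcoef(a, b)[0, 1]
                          for a, b in combinations(indicators, 2)]))


# Example: features selected in 3 folds out of 10 candidate features
print(stability([[0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 5, 6]], n_features=10))
```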

Similarity among the feature selection methods

The similarity of the selected features was computed using the Pearson correlation method to determine the discrepancy between the features selected by the best model and those selected by a statistically similar model. A similarity of 1.0 would indicate that the two models selected exactly the same features in every cross-validation fold.
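A corresponding sketch for the similarity between the best model and a statistically similar model is given below; as with the stability above, the exact averaging is an assumption.

```python
import numpy as np


def similarity(selected_a, selected_b, n_features):
    """Pearson correlation between the selection-indicator vectors of two models,
    averaged over the cross-validation folds."""
    correlations = []
    for fold_a, fold_b in zip(selected_a, selected_b):
        a = np.zeros(n_features)
        b = np.zeros(n_features)
        a[fold_a] = 1.0
        b[fold_b] = 1.0
        correlations.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(correlations))


# Features selected by the best model and by a statistically similar model in two folds
print(similarity([[0, 1, 2], [0, 1, 3]], [[0, 4, 5], [0, 4, 6]], n_features=10))
```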

Correlation among the selected features

Because the features in radiomic datasets are known to be highly correlated, two feature selection methods could select features that are different but nonetheless highly correlated. In this case, their similarity would be low; however, this is misleading because the different features would carry similar information. Therefore, the correlation among the selected features themselves was measured by computing, for each feature selected by one method, its highest correlation with the features selected by the other method, and averaging these values. This measure is 1.0 if, for each feature selected by one method, a perfectly correlated feature selected by the other method can be found. More details on the measure can be found in Additional file 1.
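A sketch of such a measure is given below, assuming absolute Pearson correlations and a symmetrized average; the precise definition used in the study is the one given in Additional file 1.

```python
import numpy as np


def cross_correlation(X, features_a, features_b):
    """For every feature selected by one method, take its highest absolute Pearson
    correlation with any feature selected by the other method; average and symmetrize."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    a_to_b = corr[np.ix_(features_a, features_b)].max(axis=1).mean()
    b_to_a = corr[np.ix_(features_b, features_a)].max(axis=1).mean()
    return (a_to_b + b_to_a) / 2.0


rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))  # placeholder feature matrix
print(cross_correlation(X, features_a=[0, 1, 2], features_b=[3, 4, 5]))
```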

Predictive performance

The stability, similarity, and correlation of statistically similar models could be associated with their predictive performance. For example, if a few features in a dataset were highly correlated with the outcome, many statistically similar models would likely identify these features, and a larger correlation among the models would be observed. Conversely, if the dataset contains many relatively uninformative features, the selection of one feature over another likely depends on the model, and lower correlations among the models would be observed. Hence, a linear regression was performed to relate the AUC-ROCs to stability, similarity, and correlation.
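A sketch of this regression with SciPy is shown below; the per-dataset values are made-up placeholders, not results from the study.

```python
import numpy as np
from scipy.stats import linregress

# Placeholder per-dataset values: mean AUC-ROC of the statistically similar models
# and their mean feature-selection stability (illustrative numbers only)
mean_auc = np.array([0.62, 0.70, 0.75, 0.81, 0.88, 0.93, 0.98])
mean_stability = np.array([0.21, 0.30, 0.28, 0.41, 0.47, 0.55, 0.63])

fit = linregress(mean_auc, mean_stability)
print(f"slope = {fit.slope:.2f}, p = {fit.pvalue:.3f}")
```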

Software

Python 3.6 was used for all experiments. Feature selection methods and classifiers from scikit-learn 0.24.2 [16] and ITMO_FS 0.3.2 [17] were utilized.

Statistics

Descriptive statistics were reported as means and ranges, computed using Python 3.6. p values less than 0.05 were considered significant. No adjustments for multiple testing were performed. AUC-ROCs were compared with a DeLong test using the R library pROC.

Results

Overall, 3640 models were fitted for each of the 14 datasets, each with stratified tenfold cross-validation. For evaluation, the model with the highest AUC-ROC was selected for each combination of the 8 feature selection methods and 8 classifiers, yielding 64 models per dataset. A DeLong test was used to compare these models to the overall best-performing model to assess statistical difference (Fig. 1). For 508 of the 882 models, the hypothesis that their AUC-ROC was equal to that of the best-performing model could not be rejected (Table 4). This corresponds to 58% of all models (N = 508/882) and to roughly 36 of the 64 models per dataset.

Fig. 1
figure 1

Graphical overview of the predictive performance of all models. The AUC-ROCs of all computed models are plotted for all datasets. Models that could not be statistically shown to be different from the best model are marked in cyan, while those that performed significantly worse are marked in orange

Table 4 Counts of how many models were not statistically different from the best model for each dataset, sorted by the AUC-ROC of the best model

Datasets

Plotting the Pearson correlations among all features revealed that some datasets had many highly correlated features (Fig. 2). All datasets deviated markedly from the histogram of a reference dataset with independent, normally distributed features. Although Toivonen2019 came closest to this reference, it still contained many highly correlated features, as can be seen in the fat right tail.

Fig. 2
figure 2

Histogram of the Pearson correlation between all features. The “Normal” histogram was obtained by creating a dummy dataset that only contained independent and normally distributed features and serves as a reference

Stability of the feature selection method

In general, selecting more features resulted in higher stability (Fig. 3). The three simpler methods (Anova, Bhattacharyya and Kendall) yielded higher stability than the more complex methods (including LASSO and Extra Trees). Overall, the stability of the methods was moderate (Fig. 4). Results for each dataset can be found in Additional file 2 and Additional file 3.

Fig. 3
figure 3

Relation of feature stability with the number of selected features

Fig. 4
figure 4

Analyzed measures for all statistically similar models. The range is given in parentheses; the color of each cell corresponds to the stated measure

Similarity among the feature selection methods

The average similarity between the features selected by statistically similar models was rather low (Fig. 4). There were almost always models with selected features that were not similar to those of the best model. In all cases, the average similarity among the feature selection methods was lower than their stability (Fig. 4). Details can be found in Additional file 4.

Correlation among the selected features

The average correlation among the features was much higher than their similarity. For Song2020, on average, the models had a correlation of 0.92, while for Veeraraghavan2020, this figure was only 0.40. Again, there were often models with features that were only moderately correlated with the features of the best model. Results for each dataset can be found in Additional file 5.

Predictive performance and feature stability

Comparing the mean AUC-ROC to the number of statistically similar models showed a significant negative association (p = 0.029; Fig. 5a); that is, the better the best model performed, the fewer statistically similar models were observed. The associations of the AUC-ROCs with stability, similarity, and correlation were stronger and, in all cases, positive: a strong positive association (p = 0.007; Fig. 5b) was observed for stability; in other words, models that reached higher AUC-ROCs were more stable. An equally strong association between the mean AUC-ROC and similarity was found (p = 0.004; Fig. 5c); better models seemed to be more similar. For correlation, on the other hand, a weaker association was observed (p = 0.012; Fig. 5d).

Fig. 5
figure 5

Association of the mean AUC-ROC for all statistically similar models with (a) the number of equivalent models, (b) stability, (c) similarity and (d) correlation. Each point represents one dataset

Discussion

Radiomics has been intensively studied in pursuit of non-invasive, precise, and personalized medicine, promising better and more easily obtained diagnoses and prognoses. However, a critical problem of radiomics is that the features are generically defined and strongly depend on the acquisition parameters. Therefore, they often lack biological meaning. This reduces reproducibility; that is, the features, and thus the models, often cannot be faithfully recreated when the study is conducted at other sites [18]. Feature selection is used to identify relevant features that could represent the underlying biology and thus be considered biomarkers [19]. In radiomic studies, often only a single combination of feature selection method and classifier is considered, without a rationale for why a given method was chosen over others [20,21,22,23]. Even if multiple methods are considered, the best-performing model is compared to the other models only in terms of its predictive performance [24,25,26,27,28]. While this approach is justifiable, worse-performing models need not be different from a statistical viewpoint, and there is little reason to ignore them.

Therefore, in this study, we considered all models that were statistically similar to the best model and compared their selected features. First and foremost, our study demonstrates that several statistically similar models exist for each dataset. This finding is not surprising, given that the radiomic datasets considered have small sample sizes and that the null hypothesis can only be rejected in the case of a large difference in predictive performance. Nonetheless, approximately 58% of the considered models were statistically similar to the best model, which is much higher than anticipated.

Regarding the stability of the feature selection methods, in general, the more features were selected, the more stable the selection was. This is expected because, if all possible features were selected, the correlation coefficient would be equal to 1. Nonetheless, it is surprising that the association is not U-shaped (Fig. 3), because if the datasets contained a small set of relevant features, it could be expected that most feature selection methods would identify them. In this case, the stability would be quite high for a small number of features; however, this was not observed (Fig. 3). There are two possible reasons for this: either the feature selection algorithms were not able to identify those relevant features, or the datasets, in general, did not contain such sets. In the latter case, considering the low stability of the feature selection methods, this could mean that the interpretation of relevant features as biomarkers is doubtful at best.

Consequently, the similarity among the models was also very low. Almost no similarity could be seen for some datasets, and the largest similarity was only moderate (0.42). Therefore, even if the best model was relatively stable in terms of the selected features, thus hinting at a set of relevant features, in most cases there was a similarly performing model that yielded completely different features.

Stability and similarity might be misleading because features are highly correlated in radiomic datasets, and different features could still express related information. Therefore, in addition to the similarity, the correlation of the selected features was compared. Indeed, the correlation was higher than the similarity and at least moderate (≥ 0.40) on average for all datasets.

Taking these observations together, it seems clear that an interpretation of relevant features as biomarkers cannot be obtained en passant during modeling with machine learning, because such results are based on correlation rather than on causality. Since radiomic datasets are high-dimensional, such correlations are abundant and partly random.

Intuitively, a higher overall AUC-ROC should be associated with higher stability, similarity, and correlation among the statistically similar models, because a higher AUC-ROC could indicate that there exists a set of relevant features that the models can identify. Indeed, the regression results indicate that models with higher AUC-ROCs had higher stability and similarity as well as slightly higher correlation. This means that the relevance of features in models that do not perform well must be interpreted with caution. Indeed, for all datasets, the best model was significantly different from the constant model (predicting a probability of 0.5 for each sample; p < 0.05), but for some, the predictive performance was too low (< 0.70) to be clinically useful. However, the results largely held for the datasets with higher AUC-ROCs, for example, for Song2020, where the AUC-ROC of 0.98 is high enough for possible clinical use.

When performing feature selection, the expectation is that there is a single set of relevant features and that it can be precisely determined. While this intuition may be correct for low-dimensional datasets, radiomic datasets are high-dimensional and highly correlated. In theory, both problems could be mitigated by acquiring larger samples and more specific features. Regrettably, increasing the sample size is problematic, if only because of the need for segmentations, which currently are often still delineated manually. Furthermore, performing decorrelation in a high-dimensional setting is unlikely to be useful because correlated features might complement one another [29]. As decorrelation can be regarded as an unsupervised feature selection method, it might not perform any better than a supervised feature selection method. Principal component analysis, on the other hand, could be more suitable; however, because the features are recombined, no association with the underlying biology can be made.

Although the current radiomics pipeline is generally accepted as state-of-the-art, deep learning methods have also been considered; these forgo the tedious tasks of feature generation and feature selection [30, 31]. Radiomic models with generic features are often regarded as being more interpretable than deep learning models, but our results demonstrate that this presumed superiority does not necessarily hold.

While we obtained our results using cross-validation, other studies have used a train-test split, with the resulting model then evaluated on an independent validation set. Such a training scheme might give the impression that the selected features are relevant and stable, but this is misleading because a simple split considers neither the stability of the model nor the fact that a disjoint set of features could produce a model with statistically similar predictive performance. Nonetheless, having an explicit validation set available would provide a more precise picture, because it is conceivable that statistically similar models would produce different results if an external dataset were used.

We focused only on a few often-used feature selection algorithms and classifiers, but we believe that adding more methods to the study would only strengthen our results. The same would be true if the hyperparameters of the classifiers were tuned more extensively. We also did not account for multiple testing; doing so would increase the p values, making even more models statistically similar. Our results can thus be considered a lower limit.

Conclusion

Our study demonstrated that the relevance of features in radiomic models depends on the model used. The features selected by the best-performing model were often different from those of similarly performing models. Thus, it is not always possible to directly determine potential biomarkers using machine learning methods.