The records of 91 patients with newly diagnosed GBM who underwent preoperative MRI between August 2008 and December 2013 and were treated with resection and temozolomide-based chemoradiation were reviewed retrospectively. Patient inclusion criteria were 1) pathologically confirmed primary GBM, 2) known OS, and 3) pre-operative MRI with postcontrast T1-weighted (T1c), T1-weighted (T1), T2-weighted (T2), and T2 fluid-attenuated inversion recovery (FLAIR) images. Patients with unknown OS (n = 3) and missing or low-quality pre-operative MRI sequences (n = 25) were excluded, leaving a study population of 63 patients (mean age: 62.75 years, standard deviation: 9.96 years; mean OS: 22.77 months, standard deviation: 14.74 months). We selected a subset of 19 patients with homogeneous acquisition parameters for robustness testing. After robustness testing, the full dataset of 63 patients was used to assess the performance of feature selector and ML model combinations.
The ethics committee approved the study and waived written informed consent.
Public multi-center data (BraTS TCIA)
To date, the only publicly available high-grade glioma dataset with survival information is the BraTS dataset , which consists of pre-treatment MRI data with patient age, survival, and extent-of-resection information. Due to our interest in the acquisition parameters, we consider the subset originating from the TCIA database [1, 19], where this information is available. The data for the survival prediction task includes MRI data from seven different centers, two different vendors, and eight MRI models, comprising 76 patients (mean age: 59.46 years, standard deviation: 13.19 years, mean OS: 14.78 months, standard deviation: 11.98 months). These images have already been skull-stripped, resampled to 1 mm voxel size, and all MRI sequences co-registered to the T1c sequence, according to .
The OS and age distributions of both datasets are visualized in the supplementary material, Figure S1.
Pre-processing and automated tumor segmentation
All single-center images were skull-stripped and resampled to match the BraTS data. Automated tumor segmentation was performed using BraTuMIA [20, 21].
BraTuMIA outputs labels for contrast-enhancement, necrosis, non-enhancing tumor, and edema. Since previous studies use different tumor labels (e.g., ), we combined the four labels to yield eight single and combined labels: contrast-enhancement (cet), non-enhancing tumor (net), necrosis (nec), edema (ed), whole tumor (wt, all labels combined), core (all labels except edema), necrosis and non-enhancement combined (net_ncr), and non-enhancement combined with the edema (net_ed).
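These label combinations can be sketched as boolean operations on the segmentation masks. The integer label codes below are assumptions for illustration; BraTuMIA's actual output encoding may differ:

```python
import numpy as np

# Hypothetical integer label codes (assumed for illustration only)
NET, NEC, CET, ED = 1, 2, 3, 4

def combined_masks(labelmap):
    """Derive the eight single and combined binary masks from a labelmap."""
    m = {name: labelmap == code for name, code in
         [("net", NET), ("nec", NEC), ("cet", CET), ("ed", ED)]}
    m["wt"] = labelmap > 0                       # whole tumor: all labels
    m["core"] = m["net"] | m["nec"] | m["cet"]   # all labels except edema
    m["net_ncr"] = m["net"] | m["nec"]           # non-enhancement + necrosis
    m["net_ed"] = m["net"] | m["ed"]             # non-enhancement + edema
    return m

labels = np.array([[0, 1, 2], [3, 4, 0]])
masks = combined_masks(labels)
```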
Class boundaries for OS survival prediction
Previous studies use a variety of OS class boundary definitions: either a data-driven boundary derived from the OS distribution of a given dataset, or a clinically motivated definition based on the median survival. Accordingly, we tested classification into two and three OS classes to ensure comparability with previous research.
To keep the analysis concise, we report the experiments for three OS classes in the supplementary material.
To test classification into two OS classes, we tested four different class boundaries: 304.2 days (10 months), 365 days (1 year), 425.8 days (14 months), and 540 days (18 months). The 10- and 18-month class boundaries are used in the BraTS OS prediction challenge , and 1- and 2-year OS rates are often reported in risk stratification studies and clinical reports (e.g., [10, 13]).
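As a minimal sketch, binary class assignment from OS in days could look as follows (whether a patient exactly at the boundary counts as a short or long survivor is our assumption, not specified above):

```python
# Class boundaries in days, as listed above
BOUNDARIES = {"10mo": 304.2, "1y": 365.0, "14mo": 425.8, "18mo": 540.0}

def os_class(os_days, boundary_days):
    """Binary OS class: 0 = short survivor (OS <= boundary), 1 = long survivor.

    Treating OS exactly at the boundary as 'short' is an assumption here.
    """
    return int(os_days > boundary_days)

labels_1y = [os_class(os, BOUNDARIES["1y"]) for os in (300.0, 400.0, 700.0)]
```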
We selected imaging features that cover widely applied types in previous studies. We analyzed all 120 features provided by PyRadiomics , extracted from the pre-processed MRI images. This set includes shape (n = 26), first-order (n = 19), gray level co-occurrence matrix (GLCM, n = 24), gray level size zone matrix (GLSZM, n = 16), gray level run length matrix (GLRLM, n = 16), neighborhood gray-tone difference matrix (NGTDM, n = 5), and gray level dependence matrix (GLDM, n = 14) features .
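A PyRadiomics parameter file along these lines would enable the listed feature classes; the specific setting values below are placeholders for illustration, not the exact configuration of this study (the actual settings files are available in the linked repository):

```yaml
# Illustrative PyRadiomics parameter file; values are assumptions.
imageType:
  Original: {}
featureClass:
  shape:
  firstorder:
  glcm:
  glszm:
  glrlm:
  ngtdm:
  gldm:
setting:
  binWidth: 25                    # gray-value bin width (higher-order features)
  resampledPixelSpacing: [1, 1, 1]
  interpolator: sitkBSpline
```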
Tumor location is known to affect the survival time of patients (e.g., ). In order to include this information, we registered each case to an atlas image , and computed the centroids for each segmentation label, resulting in n = 8 features per case.
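Given an atlas-registered label mask, a centroid can be computed as the mean voxel coordinate of the label, e.g.:

```python
import numpy as np

def label_centroid(mask):
    """Centroid (center of mass) of a binary label mask in voxel coordinates.

    Returns NaNs if the label is absent in this case.
    """
    coords = np.argwhere(mask)
    if coords.size == 0:
        return np.full(mask.ndim, np.nan)
    return coords.mean(axis=0)

mask = np.zeros((3, 3), dtype=bool)
mask[0, 0] = mask[2, 2] = True
centroid = label_centroid(mask)
```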
End-to-end deep learning (DL) has been attempted for OS prediction in patients with GBM but showed unstable results . We included deep features proposed by Lao et al. , where a convolutional neural network (CNN) pre-trained on the ILSVRC-2012 dataset  is used to extract features from the two fully-connected layers, resulting in n = 8192 deep features.
The last feature type considered in our study characterizes the shape of the contrast-enhancing tumor. Pérez-Beteta et al.  demonstrated the predictive performance of pre-treatment tumor geometry. This class of shape features (n = 7) is hereafter referred to as enhancement geometry.
All previously described radiomic features were extracted from all four MRI sequences and all eight segmentation labels.
We evaluated a wide range of perturbations that affect the MRI image quality to an extent expected in a multi-center setting. To define the range of perturbations, we rely on the imaging guidelines in  and visual inspection by a neuroradiologist:
Voxel size and axial slice spacing, with variations generated according to a reference MRI imaging protocol, as presented in  for GBM patients.
K-space subsampling: Randomly masking the image in the frequency domain using 80 to 100% of the k-space information, with the range selected by visual assessment.
Inter-rater manual segmentation variability: Elastic deformation of all labels, such that the inter-rater Dice coefficient  matches the reported variability in  (supplementary material, Table S1 and Figure S2).
Additive Gaussian noise, with its level set such that the signal-to-noise ratio (SNR) does not exceed the mean SNR of the single-center data plus one standard deviation (supplementary material, Figure S3).
Quantization / binning of gray values: High-order radiomics features require histogram quantization/binning. We varied the bin width for higher-order PyRadiomics features within the recommended range in the PyRadiomics package documentation. Since consistent binning is straightforward in an image processing pipeline, no feature was excluded based on this perturbation.
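As an example of how such perturbations can be generated, below is a simplified sketch of the k-space subsampling step (uniform random masking on a 2D slice; the actual perturbation code is in the linked repository):

```python
import numpy as np

rng = np.random.default_rng(0)

def kspace_subsample(image, keep_fraction):
    """Randomly mask k-space and reconstruct the magnitude image.

    keep_fraction in [0.8, 1.0] mirrors the 80-100% range used above.
    Simplified sketch: 2D slice, uniform random mask, magnitude reconstruction.
    """
    k = np.fft.fftshift(np.fft.fft2(image))
    mask = rng.random(k.shape) < keep_fraction
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k * mask)))

img = rng.random((16, 16))
perturbed = kspace_subsample(img, 0.9)
```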
These perturbations are visualized in Fig. 2 and detailed in Table 1. To ensure reproducibility, we provide all PyRadiomics feature extraction settings files and Python code used to generate perturbations (https://github.com/ysuter/gbm-robustradiomics).
Since the radiomic features were extracted on all four MRI sequences and eight tumor labels, the robustness evaluation amounted to more than 16.4 × 10⁶ tests.
Ensuring absolute agreement and not only consistency across perturbations is key for a robust feature set, therefore the Intraclass Correlation Coefficient ICC(2,1) was chosen for robustness evaluation. The cut-off for the lower bound of the 95% confidence interval of the ICC(2,1) was set at 0.85, indicating good reliability according to , and following the publication of Lao et al. . We consider a feature robust if it reaches this threshold for all tested perturbations.
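ICC(2,1) can be computed from the two-way ANOVA mean squares. Below is a minimal sketch for one feature measured under k perturbation settings (the 95% confidence interval used for the cut-off above is omitted for brevity):

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    x: array of shape (n subjects, k repeated/perturbed measurements).
    """
    n, k = x.shape
    grand = x.mean()
    row_m = x.mean(axis=1)   # per-subject means
    col_m = x.mean(axis=0)   # per-measurement means
    msr = k * ((row_m - grand) ** 2).sum() / (n - 1)            # rows (subjects)
    msc = n * ((col_m - grand) ** 2).sum() / (k - 1)            # columns (raters)
    mse = (((x - row_m[:, None] - col_m[None, :] + grand) ** 2).sum()
           / ((n - 1) * (k - 1)))                               # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# A feature unchanged by the perturbations yields ICC(2,1) = 1
x = np.tile(np.arange(1.0, 5.0)[:, None], (1, 3))
```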
Feature selectors and ML methods
Machine learning models with high-dimensional feature spaces and only a few training samples suffer from the curse of dimensionality , which considerably increases the likelihood of poor performance when the ML model is used in practice. We chose thirteen feature selection and twelve ML methods from the literature (see supplementary material, sections S5, S6, and Table S2). The feature selection methods tested include ReliefF (RELF), Fisher score (FSCR), Gini index (GINI), Chi-square score (CHSQ), joint mutual information (JMI), conditional infomax feature extraction (CIFE), double input symmetric relevance (DISR), mutual information maximization (MIM), conditional mutual information maximization (CMIM), interaction capping (ICAP), t-test score (TSCR, only for binary classification), minimum redundancy maximum relevance (MRMR), and mutual information feature selection (MIFS). The ML methods Nearest Neighbors, Support Vector Classifiers (SVC) with linear and radial basis function (RBF) kernels, Gaussian processes, decision trees, random forests, multilayer perceptrons, AdaBoost, naïve Bayes, quadratic discriminant analysis (QDA), XGBoost, and logistic regression were included. We remark that we tested all combinations of feature selectors and ML models (i.e., 13 × 12 = 156 combinations, 144 for three OS class experiments, since the t-test score is only applicable for binary classification).
The feature selection step was included in the cross-validation to avoid data leakage and overestimating the single-center performance.
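In scikit-learn terms, this corresponds to placing the selector inside a pipeline so it is re-fitted on each training fold. The sketch below uses sklearn stand-ins for a few of the selectors and models named above and synthetic data; it is not the study's exact implementation:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=50, random_state=0)
X = X - X.min()  # chi2 requires non-negative features

# sklearn stand-ins for FSCR-, CHSQ-, and MIM-style univariate selectors
selectors = {"FSCR-like": f_classif, "CHSQ": chi2, "MIM-like": mutual_info_classif}
models = {"SVC-linear": SVC(kernel="linear"),
          "LogReg": LogisticRegression(max_iter=1000)}

results = {}
for (s_name, score), (m_name, model) in product(selectors.items(), models.items()):
    # Selector inside the pipeline -> re-fitted per fold, no data leakage
    pipe = make_pipeline(SelectKBest(score_func=score, k=10), model)
    results[(s_name, m_name)] = cross_val_score(pipe, X, y, cv=5).mean()
```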
Following , we further excluded features with zero median absolute deviation (MAD) and a concordance index (C-index) of 0.55 or lower, regarded as non-predictive features. This threshold setting was chosen considering a tradeoff between only retaining the most predictive features and reducing the curse of dimensionality (see supplementary material, Figure S4).
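A sketch of this screening step follows, assuming no censoring and a direction-agnostic C-index (both simplifications relative to the cited approach):

```python
import numpy as np

def mad(x):
    """Median absolute deviation."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def c_index(feature, os_time):
    """Concordance between a single feature and survival time.

    Simplified sketch: no censoring handled; ties in the feature count 0.5.
    """
    conc = ties = total = 0
    n = len(feature)
    for i in range(n):
        for j in range(i + 1, n):
            if os_time[i] == os_time[j]:
                continue
            total += 1
            hi, lo = (i, j) if os_time[i] > os_time[j] else (j, i)
            if feature[hi] == feature[lo]:
                ties += 1
            elif feature[hi] > feature[lo]:
                conc += 1
    c = (conc + 0.5 * ties) / total
    return max(c, 1 - c)  # direction-agnostic screen (our assumption)

def keep_feature(values, os_time):
    """Drop features with zero MAD or C-index <= 0.55, as described above."""
    return mad(values) > 0 and c_index(values, os_time) > 0.55
```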
Detailed information regarding the image acquisition parameters, image pre-processing, and perturbations is available in the supplementary material.
Clinical and data-driven prior knowledge for feature set reduction
We tested two priors to decrease the feature set size further: a sequence prior, using only the T1c and FLAIR MRI sequences, since these two are predominantly considered by neuroradiologists when assessing pre-operative data; and a second prior, hereafter referred to as hand-picked, which limits the features to the most robust feature types observed during the robustness analysis: PyRadiomics-derived tumor shape, enhancement geometry, centroids, and patient age.
The performance of the ML approaches was measured by the area under the receiver operating characteristics curve (AUC), balanced and unbalanced accuracy, sensitivity, specificity, F1 score, and precision. The best performing model for every OS class boundary was selected based on the AUC. All metrics were recorded during a 10-fold stratified cross-validation  on the single-center dataset. All performance metrics are reported as the mean across all splits.
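Such an evaluation can be sketched with scikit-learn as follows (synthetic data, a single illustrative model; specificity has no built-in scorer string and would need a custom scorer, e.g. recall with `pos_label=0`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the 63-patient single-center dataset
X, y = make_classification(n_samples=63, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = {"auc": "roc_auc", "bal_acc": "balanced_accuracy",
           "acc": "accuracy", "sens": "recall",
           "prec": "precision", "f1": "f1"}

scores = cross_validate(RandomForestClassifier(random_state=0),
                        X, y, cv=cv, scoring=scoring)
mean_auc = scores["test_auc"].mean()  # metrics reported as mean across splits
```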