Deep learning for glioblastoma segmentation using preoperative magnetic resonance imaging identifies volumetric features associated with survival

Background Measurement of volumetric features is challenging in glioblastoma. We investigate whether volumetric features derived from preoperative MRI using convolutional neural network-assisted segmentation are correlated with survival.
Methods Preoperative MRI scans of 120 patients were scored using the Visually Accessible Rembrandt Images (VASARI) feature set. We trained and tested a multilayer, multi-scale convolutional neural network on data from the multimodal brain tumour segmentation challenge (BRATS) before testing on our dataset. The automated labels were manually edited to generate ground truth segmentations. Network performance on our data and on BRATS data was compared. Multivariable Cox regression analysis, corrected for multiple testing using the false discovery rate, was performed to correlate clinical and imaging variables with overall survival.
Results Median Dice coefficients in our sample were (1) whole tumour 0.94 (IQR, 0.82–0.98) compared to 0.91 (IQR, 0.83–0.94; p = 0.012), (2) FLAIR region 0.84 (IQR, 0.63–0.95) compared to 0.81 (IQR, 0.69–0.8; p = 0.170), (3) contrast-enhancing region 0.91 (IQR, 0.74–0.98) compared to 0.83 (IQR, 0.78–0.89; p = 0.003) and (4) necrosis region 0.82 (IQR, 0.47–0.97) compared to 0.67 (IQR, 0.42–0.81; p = 0.005). Contrast-enhancing region/tumour core ratio (HR 4.73 [95% CI, 1.67–13.40]; corrected p = 0.017) and necrotic core/tumour core ratio (HR 8.13 [95% CI, 2.06–32.12]; corrected p = 0.011) were independently associated with overall survival.
Conclusion Semi-automated segmentation of glioblastoma using a convolutional neural network trained on independent data is robust when applied to routine clinical data, and the segmented volumes have prognostic significance.
Electronic supplementary material The online version of this article (10.1007/s00701-020-04483-7) contains supplementary material, which is available to authorized users.


Introduction
The median survival for glioblastoma remains approximately 12–15 months despite surgery and chemoradiation [1]. Extent of resection (EOR) is the only modifiable prognostic factor [1][2][3]. Other prognostic variables such as age, O6-methylguanine-DNA methyltransferase (MGMT) methylation and Karnofsky performance status (KPS) are not modifiable, and mutation status cannot be assessed preoperatively [1,4]. Combining clinical variables with MRI features may allow better prognostication and stratification of patients for clinical trials, especially trials at the time of surgery, when molecular markers are unknown [5,6].
Radiogenomic analysis has linked MRI-based tumour subregions such as fluid-attenuated inversion recovery (FLAIR) volumes with genetic signatures of invasiveness and reduced survival [7]. Reproducible and accurate descriptors of imaging features are needed to identify prognostic biomarkers [8]. The Visually Accessible Rembrandt Images (VASARI) feature set is the largest standardised set of imaging variables based on preoperative MRI [8,9]. Qualitative VASARI descriptors, when combined with manually segmented volumetric variables, are independently associated with survival [9].
Manual segmentation of tumour volumes is time-consuming and suffers from high interobserver and intraobserver variability [10]. This limits its use in clinical practice. Computer-assisted methods reduce segmentation time and demonstrate good agreement with manual ground truth segmentations [10][11][12].
Convolutional neural networks (CNNs) are the state-of-the-art computer-assisted method for glioblastoma segmentation [13]. They outperform alternative methods that use independent decision classifiers to extract texture and intensity features [14]. A pre-annotated dataset is used to train the CNN, which learns the relationship between input images and output segmentations through a series of convolutions across multiple interdependent layers. The trained CNN can then be validated on different test datasets.
DeepMedic is a 3-dimensional (3D) CNN ranked highly in the Brain Tumour Segmentation Challenge (BRATS) [14]. DeepMedic assigns a class to each voxel independently using intensity and local feature information across image planes through two 11-layer convolutional pathways [14]. Each pathway samples the image at a different resolution to lower computational cost [14]. DeepMedic has shown robust segmentation accuracy for tumour subregions when trained and tested on MRI acquired at multiple institutions with different protocols [15].
Automated segmentation of glioblastoma subregions has moderate agreement with the corresponding VASARI-derived semi-quantitative measures. However, some subregions are measured less accurately than others; for example, necrosis is segmented less accurately than the contrast-enhancing region (CER) [8]. In clinical practice, manual correction of segmentations generated by deep learning is more accurate than using the automated segmentations as a full replacement for manual segmentation [15]. Generating curated training data for each dataset is also labour-intensive and time-consuming, requiring manual segmentations for images in each cohort. Using publicly available data such as BRATS to train CNNs may make these automated methods applicable across different datasets and institutions. We therefore aim to investigate whether DeepMedic trained on BRATS can be used for transfer learning, utilising automated segmentations as priors for manual correction. Previous studies have not examined the prognostic value of CNN-assisted volumetric measurements in combination with known prognostic variables. We test the validity of our semi-automated segmentation approach by correlating volumetric features of the resulting segmentations with survival.

Patient characteristics
All patients (≥ 18 years) diagnosed with histologically confirmed primary glioblastoma were identified from July 2016 to January 2018. Patients with previous glioma or cranial surgery were excluded.
Clinical characteristics were collected from electronic patient records. Motor deficit was defined as reduced power in any modality, sensory deficit as reduced sensation or paraesthesia in any modality, and speech deficit as receptive or expressive dysphasia. The operative records were used to obtain the following factors: American Society of Anesthesiologists (ASA) grade, use of 5-aminolevulinic acid (5-ALA) and/or neurostimulation/awake surgery. Postoperative neurological deficit was recorded within 1 week of surgery.
Presence of isocitrate dehydrogenase (IDH) mutation and MGMT promoter methylation were recorded. MGMT promoter methylation was determined by pyrosequencing of the differentially methylated region 2 using a 10% cutoff value [16].
Patients were treated with either adjuvant chemoradiation (Stupp regimen), radiotherapy for symptom stabilisation or supportive care. The date of death was obtained from national patient records. The date of last follow-up was the time of query of NHS Spine (22/01/2019). Patient demographics and imaging were anonymised prior to recording of research data. This study was approved by the local research ethics committee (Study ID: PRE.2017.040).

Image preprocessing
MRI protocols are shown in Supplementary Table A1. Each patient's images were resampled to 1 mm³ isotropic resolution. T1-weighted, T2-weighted and T2-FLAIR images were co-registered to the post-contrast T1-weighted (T1C) image using the FSL linear image registration tool (FLIRT) with a mutual information cost function and 6 degrees of freedom [17].
Brain extraction was performed on the T1C image using the brain extraction tool (BET) [18]. The resulting masks were edited and applied to the other sequences by voxel-wise multiplication.
Processed images were registered to the same atlas (SRI24) used for BRATS using a 12-degree-of-freedom affine registration with a mutual information cost function in FSL FLIRT [17]. Noise reduction was performed using the smallest univalue segment assimilating nucleus (SUSAN) filter [19]. Denoised images were normalised to zero mean and unit variance.
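The final normalisation step amounts to a z-score transform over brain voxels. A minimal sketch (the function name and mask handling are illustrative assumptions, not the pipeline's actual implementation):

```python
import numpy as np

def zscore_normalise(volume, mask):
    """Normalise an MR volume to zero mean and unit variance.

    Statistics are computed only over `mask` (e.g. the BET brain mask);
    background voxels are left at zero.
    """
    volume = volume.astype(np.float64)
    voxels = volume[mask]
    normalised = np.zeros_like(volume)
    normalised[mask] = (voxels - voxels.mean()) / voxels.std()
    return normalised
```

Restricting the statistics to the brain mask avoids the large zero background skewing the mean and variance.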

Tumour segmentation
The BRATS 2017 dataset contains 285 mixed-grade gliomas with expert-annotated manual segmentations of the (1) necrotic core (NC), (2) CER, (3) non-enhancing tumour (NET) and (4) peritumoural oedema (PTE) [20]. The definitions used for manual segmentation are provided in the Supplementary Data. DeepMedic can detect three tumour subregions: (1) FLAIR region, (2) CER and (3) NC region. Following the procedure from BRATS, automated segmentations of the whole-tumour (WT) region were created from T2-weighted and FLAIR sequences, whilst the CER and NC regions were created from T1-weighted images. PTE and NET were manually delineated from the FLAIR region [15].
The primary architecture of DeepMedic consists of two main parallel pathways, each consisting of four feature-extraction layers with 53 kernels, as well as two fully connected layers and one final classification layer. The dual pathway handles multi-scale processing of the input channels, achieving a large receptive field for the final classification whilst keeping computational cost low. The first pathway operates on the original image and the second on a down-sampled version [14,21].
To apply transfer learning to our dataset, DeepMedic was first trained on randomly chosen scans from the BRATS dataset, divided into 107 patients for training (n = 66,340 images) and 50 patients for validation (n = 31,000 images). The trained model was then tested on a further 50 BRATS patients prior to application to our test dataset. As we are evaluating the validity of applying a CNN trained on a different dataset to our test dataset, one author (YW) manually segmented our test images by correcting the automated segmentation labels derived from DeepMedic using 3D Slicer (Harvard Medical School) (Fig. 1) [22].
DeepMedic incorporates data augmentation to increase the volume and complexity of the input data via reflection with respect to the mid-sagittal plane. Data are shuffled at the start of each epoch to avoid overfitting [23]. The hyperparameters remained the same as in the originally proposed DeepMedic network [14]. The network is regularised using dropout and was trained for 35 epochs with a batch size of 5, using 5-fold cross-validation. The loss function was the negative log-likelihood.
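The sagittal-reflection augmentation amounts to flipping each image and its label map together about the left-right axis, which preserves voxel-wise correspondence. A minimal sketch (function name and axis convention are our own assumptions; at training time the flip is typically applied at random):

```python
import numpy as np

def reflect_sagittal(image, label, axis=0):
    """Reflect an image/label pair about the mid-sagittal plane.

    Flipping image and label together preserves voxel-wise
    correspondence; `axis=0` assumes the first array axis runs
    left-right.
    """
    return np.flip(image, axis=axis).copy(), np.flip(label, axis=axis).copy()
```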
Training was performed using an implementation of DeepMedic in TensorFlow on an NVIDIA Titan Xp graphics card [24].
The delineation of residual contrast-enhancing tumour was performed using T1 subtraction maps (ΔT1 maps). This method improves tumour delineation by distinguishing true contrast enhancement from intrinsically T1-bright blood products [25]. Their use has validated residual tumour volume (RTV) as a predictor of survival [25]. ΔT1 maps were created by voxel-by-voxel subtraction of the pre-processed T1 image from the T1C image (Fig. 2).
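The ΔT1 computation is a plain voxel-wise subtraction. A minimal sketch, assuming the two volumes are already co-registered and intensity-normalised (function name hypothetical):

```python
import numpy as np

def delta_t1(t1c, t1):
    """Voxel-by-voxel ΔT1 map: post-contrast T1C minus pre-contrast T1.

    With co-registered, intensity-normalised inputs, positive values
    reflect true contrast enhancement rather than blood products that
    are bright on both sequences.
    """
    return t1c.astype(np.float64) - t1.astype(np.float64)
```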
The comparative metrics used to assess the quality of the segmentations were the Dice coefficient and the volumes of the segmentations. The Dice coefficient is a value between zero and one which represents the degree of overlap between two segmentations (one represents perfect overlap) [11]. For our test dataset, the manually edited tumour subregion segmentations were considered the ground truth for comparison with the automated labels. Differences between the automated and manually corrected segmentation volumes of each tumour subregion were also compared. For the BRATS test data (n = 50), we also computed the Dice coefficient between the automated labels generated from our model and the expert-annotated manual ground truths.
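The Dice coefficient can be computed directly from two binary masks as 2|A∩B| / (|A| + |B|). A minimal sketch (the convention for two empty masks is our own assumption):

```python
import numpy as np

def dice_coefficient(a, b):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom
```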

Statistical analysis
VASARI features were scored on preoperative imaging (Supplementary Table A2). Spearman rank correlation was used to compare the proportions of each subregion. EOR was calculated from the volumetric segmentations as EOR = (CER − RTV)/CER × 100%.
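The EOR formula translates to a one-line calculation. A sketch with hypothetical parameter names, assuming a non-zero preoperative CER volume:

```python
def extent_of_resection(cer_volume, rtv_volume):
    """EOR (%) = (CER - RTV) / CER x 100, from volumetric segmentations.

    `cer_volume` is the preoperative contrast-enhancing volume and
    `rtv_volume` the residual tumour volume; both in the same units.
    """
    return (cer_volume - rtv_volume) / cer_volume * 100.0
```

For example, a 40 cm³ enhancing tumour with 4 cm³ residual gives an EOR of 90%.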
The primary outcome was overall survival (OS), defined as the time from the date of surgery to the date of death. Cox proportional hazards models were used to identify significant factors associated with OS. Median follow-up time was calculated using the reverse Kaplan-Meier method [26]. The Student t test and Wilcoxon rank-sum test were used to compare baseline characteristics between groups for continuous variables. The chi-square test was used for categorical variables.
Variables with p < 0.2 in the bivariable regression models were included in the multivariable models. The Akaike information criterion (AIC) was used to compare the quality of the models. AIC scores compare the relative performance of models based on the number of parameters and goodness of fit; the model with the lower AIC score explains the greatest variation using the fewest independent variables [27]. Multiple testing was controlled for using the false discovery rate. A false discovery rate adjusted p value (q value) represents the expected proportion of significant tests that are false positives. Given the exploratory nature of the study, a threshold of 0.1 was chosen (a sensitivity analysis showed that lowering the threshold to 0.05 did not change the significant variables). Statistical analysis used Stata version 14 (StataCorp, College Station, Texas).
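Assuming the standard Benjamini-Hochberg step-up procedure for the false discovery rate (the paper does not name the specific method, and this does not reproduce the Stata implementation), q values can be sketched as follows:

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (q values).

    For sorted p values p_(1) <= ... <= p_(m), the adjusted value is
    q_(i) = min over j >= i of p_(j) * m / j, clipped to [0, 1].
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity with running minima from the largest p value
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

Tests are then declared significant when their q value falls below the chosen threshold (0.1 here).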
Due to differences in baseline characteristics, the resection and biopsy groups were analysed separately. Resection patients had larger CER and NC volumes whilst biopsy patients had a larger NET volume (Supplementary Table B1).

Comparison between automated and manually corrected segmentations
The network was able to detect WT, CER and NC in all patients in the BRATS test data and in 118 patients (98%) of our sample. The training and test times were approximately 41 h (2475 min) and 36 min, respectively, in our data. Therefore, considering training, testing and correction time, our semi-automated segmentation method can reduce segmentation time by approximately 38 h (2289 min) compared to manual segmentation.
Comparison of segmentation outcomes between our BRATS test data (n = 50) and our sample data (n = 120) showed similar network performance in the two datasets. Median Dice coefficients in our sample were (1) whole tumour 0.94 (IQR, 0.82–0.98) compared to 0.91 (IQR, 0.83–0.94; p = 0.012), (2) FLAIR region 0.84 (IQR, 0.63–0.95) compared to 0.81 (IQR, 0.69–0.8; p = 0.170), (3) contrast-enhancing region 0.91 (IQR, 0.74–0.98) compared to 0.83 (IQR, 0.78–0.89; p = 0.003) and (4) necrosis region 0.82 (IQR, 0.47–0.97) compared to 0.67 (IQR, 0.42–0.81; p = 0.005).

Relationship between volumetric subregions
There was a positive correlation between the volume of the tumour core (TC) and the individual tumour subregions (p < 0.001), shown in Fig. 5.
PTE volume positively correlated with CER volume (r = 0.42, p < 0.001) and NC volume (r = 0.37, p < 0.001) but not NET volume (r = 0.06, p = 0.508).
Subregion volumes were normalised by dividing by TC volume. TC volume positively correlated with NET/TC (r = 0.22, p < 0.05). There was a negative correlation between TC and CER/TC (r = −0.35, p < 0.001) and between TC and PTE/TC (r = −0.48, p < 0.001). NC/TC did not correlate with TC volume (r = 0.13, p = 0.158).

Volumetric features associated with overall survival
Cox regression models were constructed for volumetric variables that were not significantly correlated with each other (Supplementary Tables C1, C3 and C5). Analysis was performed for the entire cohort and for biopsy and resection patients separately. MGMT methylation and IDH mutation were not significantly associated with OS in bivariable or multivariable analysis.

Discussion
In this study, we integrated a deep learning network into a clinically applicable processing pipeline for semi-automated measurement of glioblastoma volumetric features from preoperative MRI. A CNN was trained on publicly available BRATS data before testing on our routine clinical dataset. Final segmentation labels were generated by manual correction of the automated segmentations. Performance of the network was comparable between our clinical dataset and BRATS testing dataset when measured using the Dice coefficient. We further assessed the validity of our segmentation approach by evaluating the prognostic performance of volumetric features. Higher CER/TC and NC/TC were independently associated with higher risk of death overall. NET/WT was associated with lower risk of death in biopsy patients.
We demonstrate the possibility of transfer learning by segmenting tumour regions on a heterogeneous clinical dataset using a network trained and tested on an independent multimodal imaging dataset. A major advantage of our method is that it is applicable to data from different scanners and institutions. Furthermore, curating a large sample of uniform images for deep learning training is time- and labour-intensive [28]. Even when training time is considered, our computer-assisted segmentation method can reduce annotation time by 20 min per patient compared to fully manual segmentation. Training on an external dataset also increases the generalizability of the method and means the data do not need to be split into training and validation cohorts, preserving sample sizes.
Deep learning methods can provide quantitative imaging-based prognostic biomarkers that outperform semi-quantitative estimates. In previous studies, tumour size has been investigated as a potential prognostic marker [29]. Tumour dimensions can be estimated using long-axis diameter, cuboid, spheroid and ellipsoid formulas [30]. These methods are subjective and difficult to reproduce, leading to conflicting associations with survival [31,32]. Formula-based estimates also have poor accordance with volumetric measurements [30].
We did not perform full manual segmentation as an approximation of the ground truth to compare with our automated segmentations, as the goal of our study was to determine whether deep-learning-generated automated segmentations could be used to aid manual segmentation rather than replace it. Despite being trained on BRATS data, our network was able to detect the tumour in nearly all our patients and showed comparable segmentation accuracy in our dataset when evaluated against the BRATS test data.
Our final segmentations were derived from corrections performed on the automated labels, so it is expected that the label overlap for each subregion is higher in our dataset than in the BRATS data, where the manual segmentations were performed independently and no such bias exists. Nonetheless, the relative accuracy of the segmentation labels across different tumour subregions was similar across the two test datasets. For example, the necrosis subregion had the highest proportion of misclassified voxels whilst the CER had the lowest. This indicates that our manual corrections were dependent on the accuracy of the automated labels.
Few studies have compared accuracy of automated and manual segmentations of tumour subregions such as necrosis [15]. It is important to measure these subregions as they may have prognostic significance. Our data highlights the importance of choosing clinically relevant metrics to compare automated and manual segmentations. Two segmentations may have high Dice overlap but significantly different volumes if a smaller volume is entirely within the larger volume. There were significant differences between the automated and corrected segmentations for all tumour subregions apart from necrosis. Necrosis may be particularly challenging to segment using either automated or manual methods due to its heterogeneous and dispersed nature within the tumour core [15].
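The point about Dice overlap versus volume can be illustrated with a toy example using synthetic masks (not patient data): a mask nested entirely inside a larger one retains a high Dice coefficient even though the volumes differ substantially.

```python
import numpy as np

# A smaller mask entirely contained within a larger one.
large = np.zeros((10, 10, 10), dtype=bool)
large[1:9, 1:9, 1:9] = True          # 512 voxels
small = np.zeros_like(large)
small[1:9, 1:9, 1:8] = True          # 448 voxels, fully inside `large`

dice = 2.0 * np.logical_and(large, small).sum() / (large.sum() + small.sum())
volume_ratio = small.sum() / large.sum()
# dice = 896/960 ≈ 0.93 despite a 12.5% volume difference
```

This is why both overlap and volume difference were reported when comparing segmentations.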
Necrosis was one of the earliest imaging markers found to have prognostic value [32]. Descriptive studies classifying tumours by estimating necrotic proportions have not been consistently prognostic [1]. In concordance with previous studies, we do not find a consistent association between necrosis volume measured by VASARI score and OS [33]. Instead, a higher relative proportion of necrosis is associated with worse prognosis. NC/TC has previously been associated with worse survival [30,34]. However, that study only included patients suitable for complete resection of enhancing tumour (CRET), raising the possibility of selection bias. Furthermore, it did not control for the qualitative imaging features described in the VASARI feature set in its multivariable analysis [34]. Hypoxia may select for quiescent stem-like cells around the NC which resist apoptosis and undergo proliferation [35]. Established prognostic variables, including EOR calculated using our segmentations, are associated with survival. The median OS in our cohort is only 9 months because, unlike previous cohorts, we did not exclude patients undergoing biopsy [1,30]. The median OS of 13 months in our resection patients is comparable to previous studies [1]. By including both resection and biopsy patients, our sample is representative of the heterogeneous cohorts of glioblastoma encountered in routine clinical practice.
[Table footnote: Unless otherwise specified, the multivariable hazard ratio for the clinical variables is derived from the overall model with the lowest Akaike information criterion. Where a volumetric variable appears in more than one multivariable model, the hazard ratio displayed is derived from the model with the lowest Akaike information criterion (Appendix Tables C2, C4 and C6). b Model 5 from Appendix Table C2. c Model 8 from Appendix Table C4. d Model 2 from Appendix Table C6. Significant p values are emphasised in italics.]
IDH mutation was not found to be associated with improved survival, but this finding may be limited by the small number of IDH-mutant patients in our cohort. Our finding that MGMT methylation was not associated with improved survival is difficult to interpret. Not all studies demonstrate an association between MGMT methylation and improved OS, with several studies, including a randomised controlled trial, showing no association with survival [4,36]. This may be due to assay variability and difficulties correlating promoter methylation with protein expression [37].
The association between FLAIR proportion relative to other subregions and survival is unclear [38]. PTE/TC was not found to be associated with OS in a previous study [34]. Multiparametric biopsies have shown high levels of viable tumour cells within the non-enhancing region [39]. However, unlike our study, these studies did not differentiate oedema from non-enhancing tumour. The Dice coefficient between the automated and corrected FLAIR region segmentations was not significantly higher for our sample compared to BRATS as we were able to manually delineate NET from the FLAIR region.
Differentiating non-enhancing tumour from peritumoural oedema is important as each subregion may yield different prognostic information. The non-enhancing tumour within the FLAIR abnormality may represent lower grade disease [40]. We show that non-enhancing tumour proportion relative to the whole-tumour volume, rather than FLAIR volume or peritumoural oedema volume, was associated with improved survival in the biopsy group. This suggests that non-enhancing tumour was differentiated from peritumoural oedema by manual segmentation. Although our network was not able to detect non-enhancing tumour, we have shown high segmentation accuracy for the FLAIR region, which can aid the manual delineation of the non-enhancing tumour.
Non-enhancing tumour variables were not prognostic in resection patients. This may be because the resection patients had significantly smaller volumes of non-enhancing tumour compared to the biopsy patients. The difficulty in delineating oedema from non-enhancing tumour may result in overlap between the two in resection patients. We have shown that diffusion MRI signatures have higher sensitivity for invasive tumour and can also be segmented using CNNs [21,41]. These biomarkers may be correlated with the FLAIR region to improve identification of infiltrative tumour.
CER/TC was associated with survival, independent of RTV and other prognostic factors. In radiogenomic studies, the CER correlates with genes involved in angiogenesis and hypoxia [42]. The relationship between the NC and the CER may not be linear; we show that NC/TC is independent of CER/TC and core tumour volume. This suggests that some tumours have relatively greater proportions of necrosis for a given tumour volume.
CER volume was negatively associated with survival in a cohort of resected glioblastomas but the accuracy of the volumetric measurements was limited by digitised hard-copy imaging [43]. CER volume has also been associated with worse survival when adjusted for other VASARI variables but in this study adjuvant treatment received was not controlled for [9]. We found that CER/TC rather than CER volume was significantly associated with survival. CER/TC may be a more accurate prognostic measure because it relates CER to the core tumour volume rather than whole-tumour volume as the oedema component may be affected by factors such as steroid use.
Limitations of our study include its retrospective nature. Nevertheless, it is one of the largest studies investigating volumetric features and prognosis incorporating quantitative measurement of postoperative imaging and clinical variables. Future work to evaluate our approach should quantitatively compare segmentation measurements with manual segmentations from multiple observers to assess inter-rater variability [28]. In addition, an independent dataset is necessary to compare the relative prognostic performance of automated segmentations against manual segmentations and to determine the reproducibility of our segmentations. Finally, volumetric features may be combined with texture- and shape-based analysis of tumour subregions to develop improved prognostic models for glioblastoma patients [44,45].

Conclusions
Using a CNN with a transfer learning approach, we have shown that volumetric measurements of glioblastoma tumour subregions can be measured from preoperative MRI with high accuracy. The CNN can be integrated into a radiological workflow to significantly shorten segmentation time.
Tumours with greater proportions of necrosis and contrast enhancement are independently associated with worse survival whilst non-enhancing tumour proportion is associated with improved survival. With further validation, we may be able to use volumetric features from routine clinical imaging for patient prognostication and stratification into clinical trials.
Author's contribution Yizhou Wan contributed to study design, and performed the data collection and analysis. He also performed the statistical analysis.
Roushanak Rahmat contributed to study implementation and performed the convolutional neural network training and testing.
Stephen J. Price performed the surgeries and provided clinical and imaging data.
All authors contributed to writing, revisions and editing.
Funding information Stephen J. Price is funded by a National Institute for Health Research (NIHR) Career Development Fellowship for this research project. This report presents independent research funded by the NIHR and supported by the NIHR Brain Injury MedTech Cooperative based at Cambridge University Hospitals NHS Foundation Trust and the University of Cambridge.
Data availability Code can be made available on request to the authors.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethics approval This study was approved by the local research ethics committee (Study ID: PRE.2017.040). The study was performed in accordance with the ethical standards as laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.
Consent to participate Patient demographics and imaging were anonymised prior to recording of research data for this retrospective study. Informed consent from individual patients was not required.
Consent to publish This retrospective anonymised study did not require consent to publish from individual participants.
Disclaimer The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.