Key points

  • On highly heterogenous, multi-centre data, machine learning on DSC-MRI features can correctly predict glioma IDH subtyping in 71% of cases and glioma grade II-IV in 53% of the cases (87% < 1 grade difference).

  • Shape features distinguish best grade II from grade III gliomas.

  • Texture and distribution features distinguish best grade III from grade IV tumours.

Importance of study

This work illustrates the diagnostic value of combining machine learning and dynamic susceptibility contrast-enhanced MRI (DSC-MRI) radiomics in classifying gliomas into WHO grades II-IV as well as across their isocitrate dehydrogenase (IDH) mutation status.

Despite the data heterogeneity inherent to the multi-centre design of the studied cohort (333 subjects, 6 centres) that greatly increases the theoretical challenges of machine learning frameworks, good classification performance (accuracy of 53% across grades (87% < 1 grade difference) and 71% across mutation status) was obtained. Therefore, our results provide a proof-of-concept for this emerging precision medicine field that has good generalisability and scalability properties. Introspection on the classification errors highlighted mostly borderline cases and helped underline the challenges of a categorical classification in a pathological continuum.

With its strong generalisability property, its ability to further incorporate participating centres and its possible use to identify borderline cases, the proposed machine learning framework has the potential to contribute to the clinical translation of machine-learning assisted diagnostic tools in neuro-oncology.

Background

The most recent grading system of gliomas has integrated the mutation status of key encoding genes prompting a ground-breaking change in the context of glioma diagnosis and treatment [1, 2] with further molecular stratification achieved by determining CDKN2A/B status [3]. Tumour anatomical features and gadolinium enhancement properties are not sufficient for this task and are extremely variable across different types and grades of gliomas with low specificity rates [4, 5]. To overcome this limitation, advanced imaging techniques such as diffusion weighted imaging (DWI), MR spectroscopy, and dynamic susceptibility contrast (DSC)-MRI (also known as perfusion-weighted MRI) have been tested. Perfusion-weighted MRI provides biomarkers tightly linked to the tumour vascularity and thus indices of the biochemical and genetic substrates related to neo-angiogenesis [6, 7]. ‘Gain-of-function’ isocitrate dehydrogenase (IDH) mutations result in the accumulation of the “oncometabolite” 2-hydroxy glutarate (HG) and cause an increase in the activity of the hypoxia-inducible factor-1, which has a pivotal role in the energy metabolism, angiogenesis and apoptosis [8]. Therefore, perfusion MRI biomarkers carry the potential to act as diagnostic surrogates of IDH mutation status and further for tumour grading.

DSC-MRI is often favoured, compared to other MRI-based perfusion techniques, for its higher temporal resolution and sensitivity for detecting abnormal angiogenesis resulting in high accuracy rates for baseline tumour grading and tumour surveillance [7]. To date, these encouraging findings have been mostly obtained in single-institution settings utilising different DSC-MRI acquisition techniques and post-processing software packages for quantitative analysis. This can result in significant variation [7] and, hence, limit the diagnostic value of the technique impeding its large-scale application for non-invasive tumour phenotyping [9,10,11]. Recently, a consensus paper for DSC-MRI acquisition guidelines aimed to act as a blueprint and provide a framework for achieving routine success with this technique [12].

The vast majority of the published DSC-MRI studies have focused on establishing the value of the tumour region of interest rather than encompassing the whole tumour. They have made use of histogram analyses, which nonetheless reflect numeric values of the voxels and only help to calculate first-order statistics. By contrast to first-order statistics, second- and higher-order statistical analyses, which allow measures of not only local voxel-wise values but also incorporate neighbouring information, have been suggested as more reliable and elaborative methods for characterisation of the region-of-interest [13]. Texture analysis, as a contextual quantification method, has already embarked in the medical imaging literature as a method that can detect tissue heterogeneity and complexity [13, 14]. Radiomics, applied in the clinical context of glioblastoma, have generally been performed on anatomical sequences and applied to distinguish between GBM subtypes [15], for prediction of survival rates [16] and prognosis [17], prediction of response to treatment [18], as well as risk stratification [19]. Deep-learning based techniques have recently received a sustained attention due to their high performance in such classification tasks [20, 21]. Nonetheless, there is limited biological insight in their decision process making failure cases difficult to interpret. An alternative is to use the classical feature-based machine learning methods as diagnostic tool and to evaluate the relevance of acquisition sequences. In such a framework, interaction between clinical interpretation and machine learning is essential. Indeed, an iterative loop between clinical interpretation and machine learning analysis based on advanced features can help to define most distinguishing images properties and exclude non-relevant ones. Such process would result in the definition of disease specific imaging signatures.

While the high heterogeneity among DSC-MRI acquisition protocols may contribute to lowering the diagnostic value of any single first-order statistic, this work sought to examine the role of higher-order analysis and machine-learning assisted radiomics in mitigating these limitations and augment the diagnostic accuracy of DSC-MRI for tumour staging with emphasis on IDH molecular subtyping.

Specifically, this study aims to assess the accuracy of combining DSC-MRI radiomics with supervised machine learning algorithms in classifying gliomas by grade and mutation status using multi-centre dataset. Our hypothesis is that different combination of MRI features are to be used according to the distinctions to be looked after encompassing either tissue heterogeneity between high grade GBM or measures of dissemination (using shape assessment) at lower grades. Compared to existing work, including the most recent one by Lu et al. [22], where a publicly available database was used focusing solely on structural imaging, this work focuses for first time on the use of DSC-MRI acquisition in a multicentre setting. The heterogeneity inherent to the multi-centre quality of the data, makes the task of tumour and molecular phenotyping even more challenging. If successful, the proposed platform is readily expandable to include more participating institutions or utilising different machine learning approaches – all aiming to increase the diagnostic capacity and power to determine the most appropriate parameters for conducting DSC-MRI experiments. With this in mind, this work has potential to become a first step for a widespread application of machine-learning assisted diagnostic aid in neuro-oncology.

Methods

Different components of the methodology in this study are shown in Fig. 1a with the post-processing of the tumour segmentation and rCBV maps leading to feature extraction depicted in Fig. 1b.

Fig. 1
figure 1

a: Flow chart of the methodology of this analysis in this paper. b: Graphical illustration of the customised pipeline shows the cascade of processing starting from tumour segmentation on FLAIR series and ending in feature extraction over tumour mask. c: Decision tree illustrating the different cohorts of gliomas from the multicentre dataset stratified per WHO grades II, III and IV and also per IDH mutation status

Study design and ethics

This multi-center study was based on retrospectively identified or prospectively acquired neuro-oncology MRI data collection in the participating centres, according to the local institutional review boards guidelines and the respective local ethical committees’ approvals. Data sharing agreements between the collaborators and the central reading and post-processing site were also in place.

Subjects

DSC-MRI data of 359 patients from 6 university hospitals with histopathologically and molecularly confirmed primary, treatment naïve gliomas (WHO grades II-IV) between January 2010 and May 2018 were consecutively collected and screened for this study. The inclusion criteria were as follows: i) existence of preoperative DSC-MRI examination raw data suitable for post-processing and ii) availability of histopathological and molecular testing least for IDH1 and IDH2 mutations [2]. Images with non-correctable motion artefacts (n = 8) and corrupt source data (n = 18) were excluded. Finally, 333 patients were included in the analysis. Figure 1c contains the decision tree of the dataset stratified into different glioma cohorts.

Histopathology

All gliomas in the cohort were diagnosed according to the WHO 2016 classification scheme, complemented with molecular profiling for IDH1 and IDH2 mutations [2, 23,24,25]. The assessment of morphological features and immunostaining in the tumour samples was performed by neuropathologists in each institution. All IDH1R132H immune-negative gliomas were investigated by Sanger sequencing to identify out common tumour driver mutations of IDH1, IDH2, histone H3F3A, TERT promoter and BRAF genes [26, 27]. 1q/19p codeletion status, EGFR amplification and 10q loss were determined by using a qPCR-based copy number assay in a subset of the participating institutions.

MR imaging

The eligible patients were examined on either 1.5 T MR scanners (n = 186, 56%) (Siemens SymphonyVision, Siemens Avanto, Siemens Healthineers, Erlangen, Germany; GE Discovery MR450, GE Healthcare, Chicago, USA; and Philips Achieva, Philips Medical Systems, Eindhoven, The Netherlands) or 3 T MR units (n = 147, 44%) (Siemens Skyra, Siemens Allegra, Siemens TrioTim, Siemens Biograph_mMR, Siemens Healthineers, Erlangen, Germany; Philips Ingenia, Philips Medical Systems, Eindhoven, The Netherlands, 3 T, Signa HDx, GE Healthcare, Waukesha, WI, USA).

Tumour delineation and parametric modelling

Tumour segmentation was performed manually by the same radiologist, who was blind to the immunohistopathological results, on axial T2 (n = 167, 49.7%) or axial FLAIR (n = 169, 51.3%) series using ITK-SNAP Version 3.6.0 [28] excluding large cystic or necrotic tumour components and large vessels in line with previous relevant literature. When available, 3D T2/FLAIR series were preferred to their 2D counterparts.

Relative cerebral blood volume (rCBV) maps were generated from the DSC –MRI raw data using dedicated commercially available postprocessing software (Olea Sphere, Version 3; Olea Medical Solutions, La Ciotat, France). Perfusion source images motion correction, temporal and spatial filtering was applied in all cases. Subsequently, the software automatically defined the arterial input function (AIF) for each individual based on all voxels throughout the time series using global clustering method as it was described by Mouridsen et al. [29]. Instead of the standard deconvolutional mathematical algorithms, a fully adaptive Bayesian scheme, which has been shown to be superior to the widely used classical approaches [30, 31], was chosen to derive the perfusion indices aiming at the most possibly accurate contrast agent leakage correction.

Spatial and intensity normalisation

The customised pipeline for the post-processing of the tumour segmentations and perfusion (rCBV) images is presented in Fig. 1b. The data was prepared for advanced analysis by isotropic resampling of the T1-weighted images followed by registration, normalisation of the parametric rCBV maps, and biomarker extraction through distribution and texture features. Isotropic resampling to 1mm3 voxels was facilitated using the NiftyReg open source software (https://sourceforge.net/projects/niftyreg) to allow for the computation of rotational invariant textural features. The T2-weighted/FLAIR images and rCBV maps were rigidly registered to the T1 isotropic image using NiftyReg [32] and trilinear interpolation. The resulting co-registered maps were visually inspected to evaluate the success of the registration step. At the third stage, automatic normalisation of the parametric maps was performed in order to mitigate inter-subject variability and therefore, enable comparison of statistical and textural features across subjects. At this step, the basal ganglia of both hemispheres were initially segmented using the Geodesic Information Flows framework (GIF) [33], a label fusion framework allowing for a local weighting of the information from a labelled database according to measures of local morphological similarities. The basal ganglia in the tumour-free hemisphere was used as a reference area for normalisation of the rCBV maps instead of healthy white matter in order to mitigate the risk of variability across the subjects due to white matter degeneration and vasculopathy. For tumours extending in the contralateral hemisphere, any tumour tissue was first removed from the reference region. Consequently, z-score maps over the tumour mask were derived with respect to the intensity distribution over the normalisation area.

Feature extraction

Comprehensive biomarker extraction was performed using the normalised parametric maps. To capture tumour shape, four shape features (volume, surface/volume ratio, non-compactness) were extracted for each tumour mass. For distribution mapping, 13 histogram features (mean, skewness, kurtosis, standard deviation, minimum, maximum, 1st, 5th, 25th, 50th (median), 75th, 95th and 99th percentiles) were extracted through all voxels in each segmented tumour. Finally, the 12 Haralick features [34] and including angular secondary moment (ASM), contrast, correlation, sum square, sum average, inverse difference moment (IDM), sum entropy, entropy, difference variance, sum variance, difference entropy, IMC1) for each tumour were also extracted.

Statistical analysis

All extracted intensity-based features were corrected for the influence of MR acquisition parameters on the extracted features was scrutinised by scanner manufacturer, magnetic field strength (1.5 T and 3 T), TR (≤1499 ms and ≥ 1500 ms), TE(25–44 ms and 45–55 ms), FA (90° and < 90°)slice thickness (< 5 mm and ≥ 5 mm), and matrix size (matrix size < 128 × 128 and ≥ 128 × 128) and in plane resolution (<=1, > 1 and < =2, > = 2). Following correction for these covariates, two-sample Wilcoxon test was used to report the differences in all extracted features between IDH mutation status and across pairs of grades. Results were considered to be significant if p-value ≤0.05 and all statistical analysis was carried out using Stata v14 and python 3.7. Effect size difference was calculated using Cliff’s Delta.

Supervised learning from extracted features

All extracted features and acquisition parameters were used as input to a random forest (RF) algorithm [35] to classify gliomas according to their IDH mutation status (IDH-mutant vs IDH-wildtype as a 2-class classifier) or according to their grade II, III and IV, as a 3-class problem. RF was chosen as the classifier based on the fact that it is simple to use as a collection of decision trees, is based on ensemble learning that allows the algorithm to learn accurately and does not require fine-tuning of parameters with the default parametrisation often leading to excellent performance.

For each task (IDH mutation or grade assessment) training was performed in a stratified two-fold cross-validation setting with 250 repetitions to assess the value of rCBV-extracted features. Hyperparameters were optimised in a cross-validation setting leading to two different settings for the mutation task (200 trees, maximum depth of 10, minimum samples per leaf of 4) and the grade separation task (800 trees, minimum samples per leaf = 4, maximum depth of 50). Error, computed as the difference between predicted and true classification was averaged over the 250 iterations and t-test over the error distribution were performed to assess the differences observed across acquisition parameters. Confusion matrices using the average classification were constructed to summarise the classification performance of the machine learning algorithm, from which the overall accuracy of classifying different gliomas cohorts, as well as overall sensitivity and specificity rates were calculated.

Error introspection

In order to have more insight into the cases of erroneous classification, features were compared for each category of misclassification (wild type classified as mutant, mutant as wild type, grade II as grade III etc) and their rightly classified counterpart to investigate if there was a consistency in the features leading to the misclassification and whether or not it was consisted demographic findings between said classes. Wilcoxon 2 samples tests were performed on the features corrected for acquisition parameters, age and gender as per the demographic analysis.

Results

Demographic findings

Overall, data included 333 glioma patients comprising 198 males and 135 females with a mean age 48.9 years (age range 20–81 years). Of these, 101 were classified as grade II, 74 were grade III and 158 were grade IV while 151 were IDH-mutant and 182 were IDH-wildtype gliomas (Fig. 1c). Details on the demographics of the patient population, tumour location, histological grades and the molecular mutation status are contained in table S1 of Appendix A.

Effect size of acquisition parameters are presented with the value of the Cliff Delta in supplementary table S2. Shape, histogram and texture features for comparing gliomas by mutation status and across three different grades are shown in Table 1, with Cliff’s Delta effects sizes presented pictorially in Fig. 2.

Table 1 Median IQR and P-Value of features corrected for all acquisition parameters, age and gender P -values of the Mann Whitney 2 samples test across IDH status and Grade are reported
Fig. 2
figure 2

Comparisons of IDH wild type and IDH mutant gliomas within WHO grades II, III and IV gliomas. Typical examples of gliomas categories are shown for illustrator purposes on the top panel. Bar charts on the bottom panel present the mean z-score transformed difference between groups for each feature after correction for acquisition parameters for the non-shape features. Bars encoded in red represent features presenting significant statistical significance between groups, whereas bars in blue are those not statistically significant (significance threshold p = 0.05)

Glioma stratification for IDH mutation status

When comparing gliomas by mutation type, 23 out of the 29 extracted features were significantly different between IDH-wildtype and IDH-mutants (column 6 on Table 1 and Fig. 2). Tumour surface to volume ratio (SAV) and measure of non-compactness were significantly lower in IDH-mutants compared to IDH-wildtype, as were six rCBV histogram features, including mean, standard deviation,5th,25th,50th,75th, 95th and 99th percentiles. Skewness and kurtosis were higher in IDH-mutants compered to IDH-wildtype. Among the texture features, correlation and sum entropy were significantly higher in IDH-mutant tumours than in the IDH-wildtype (p < 0.0001), while sum square, sum average, sum variance, entropy and difference variance features were significantly lower (p ≤ 0.0001) in IDH-mutant gliomas compared to IDH wildtype gliomas.

Glioma stratification for WHO-grade

The results of comparing gliomas by grade across WHO grades II, III and IV gliomas are shown in Fig. 3. When comparing gliomas grade III vs grade II, the surface to volume ratio shape feature, was significantly (p = 0.021) higher in grade III gliomas compared to grade II gliomas as was the non-compactness feature (p = 0.042) (Table 1, column 13). The IMC1 measure was also found to be lower in grade III compared to grade II (p = 0.040).

Fig. 3
figure 3

Comparisons between different WHO grade gliomas grades across shape, histogram and texture features. Typical examples of gliomas grades are shown for illustrator purposes on the middle panel. Bar charts on the top and bottom panel present the mean z-score transformed difference across all groups and between groups for each feature after correction for acquisition parameters for the non-shape features. Bars encoded in red represent features presenting significant statistical significance between groups as per Wilcoxon non parametric testing, whereas bars in blue are those not statistically significant (significance threshold p = 0.05)

A total of 20 features were different between grades III and IV (Fig. 3 and Table 1).The shape features did not differ between grades III and IV. Among the histogram features, 9 out of the 13 features were different with 6 (mean, STD, 25th, 50th, 75th, 95th and 99tth percentile) higher and skewness and kurtosis lower in grade IV compared to grade III gliomas. Finally, among the texture features 11 out of 12 features were different. Notably texture features related to appearance heterogeneity such as the entropy, sum square and sum variance were larger in grade IV than III, while measures of correlation and inverse difference momentum were lower.

Comparing gliomas grades IV with II, significant differences could be observed in most of the extracted features (20 out of 29). Of the shape features, SAV and non-compactness were higher, with non-compactness of greater degree, in gliomas grade IV compared to grade II. Five histogram features (mean, STD, 25th, 50th, 75th,95th percentile) were higher while skewness and kurtosis were lower in grade IV compared to grade II gliomas. Finally, among the texture features, 10 out of 12 features, with only the contrast and the difference variance not presenting any statistically significant difference. Overall, comparisons between grade II and IV combined the differences observed between grade II and III and grade III and IV.

Classification results

The confusion matrices describing the accuracy of the random forest algorithm are contained in the supplementary table S5 and are depicted in Fig. 4a-b. The random forest algorithm correctly classified gliomas by IDH mutation status in 71% of the cases. This was slightly higher when we considered data from 1.5 T scanner only (77% accuracy) while the accuracy dropped to 66% when we considered data from 3 T scanner only. Stratifying gliomas by IDH mutation status had an overall specificity of 77% and sensitivity of 65%. When magnetic field strength was taken into consideration, at 1.5 T the specificity rate increased up to 83% with a sensitivity at 70% while the specificity was of 72% and the sensitivity of 60% at 3 T. Overall, we obtained significant (p = 0.008) lower error rates of stratification at 1.5 T (0.25) in comparison to 3 T (0.35) (Table S3 and confusion matrices). Resolution appeared also to be associated with error, higher resolution leading to lower error rates (0.196 vs 0.320 p = 0.024 for <=1 mm vs between 1 and 2 mm) (see Supplementary Table S2).

Fig. 4
figure 4

a: Pictorial representation of Cliff’s delta values when comparing features of shape, histogram and textures across mutation status and grades. Exact Cliff delta values are given in Table S1. Positive values are shown in darkening red while negative ones are in darkening blue. b: Outcomes of the confusion (or error) matrix aiding visualisation of the performance of our machine-learning algorithm. The first row shows the confusion matrices for the classification of gliomas by IDH mutation status. The second row shows the confusion matrices for the classification across the three WHO grades II, III and IV. In both classification scenarios we show three cases: using all data and separately using only data obtained from scanners with 1.5 T or 3 T magnetic field. Within each matrix, the matrix row represent the instances in the actual ground truth class while each column represents the instances in the predicted class. Darkening red correspond to higher percentages of the overall population to be classified. The exact values of the confusion matrices are given in supplementary table S4 of the Appendix.

When classifying the gliomas by grade, 53% of included gliomas were classified correctly and 87% of the cases received grade classification with a distance less or equal to 1. The accuracy of classifying gliomas by grade was higher using perfusion data obtained at 1.5 T magnetic field rather than at 3 T: 59 and 46% respectively, p = 0.024, with grade distance of less than 1 in 91 and 83% of the cases, respectively. Acquisition parameters had different effect with a larger error for lower resolution (p = 0.026), TR value (p = 0.009) with a larger error for TR below 1500 ms; a flip angle value of 90 degrees was associated with higher error rates (p = 0.006).

Introspection results

When investigating in more details the misclassified cases (see Fig. 5), it appeared that the characteristics of these gliomas reflected the texture properties of the classes they were misclassified into both with respect to the mutation status or the grading. For instance, similarly to what was observed between grade II and IV, the cases wrongly classified as grade IV instead of grade II had a significantly lower compactness, with higher mean and kurtosis and lower skewness as well as lower correlation and higher entropy among others. Numerical mean of z-score differences for each of the possible error cases are reported in Supplementary table 4.

Fig. 5
figure 5

Comparison of features Z-Scores between erroneously classified and rightly classified elements for each possible error type. The bar length represents the Mean Z-Score difference between error cases and true cases. Bars in which difference was significant (Wilcoxon 2 sample test P < 0.05) are coloured in red while others are in blue

Discussion

Our results suggest that the application of machine learning strategies to DSC-MRI extracted features, has the potential to correctly classify gliomas from multi-center patients pool by IDH mutation status and WHO grades II, III and IV. Applying a random-forest algorithm, we accurately predicted IDH mutation status in 72% of the cases with an overall specificity of 77% and sensitivity of 65%. On average, the algorithm classified 53% of the gliomas correctly into three histological grades II-IV, and with 87% the classification with an error distance of less than 1.

The novelty of our work is that we show that even simple shape features can help in the distinction between grade II and higher grades. The measure of non-compactness was significantly higher in WHO grade III or grade IV compared to grade II, reflecting possibly higher infiltrative features in these tumours. Interestingly, this shape feature was not found to be significantly different between WHO grade IV and grade III groups. It must be noted that in the existing literature, histogram features have generally been favoured against shape descriptors [36,37,38]. Falk et al. have found that 90th percentile rCBV values were significantly lower in grade II gliomas than grade III (AUC = 0.77, p = 0.04), but they have also found the standard deviation of rCBF was the best discriminative feature with the highest AUC in the same setting (AUC = 0.80, p = 0.02). Our results do not confirm these findings, with histogram features only relevant markers of grade difference between grades IV and II or IV and III. Therefore, we suggest that shape descriptors, such as non-compactness and surface to volume ratio, have to be considered in the mapping of the composition of gliomas as they correlate well with histopathological findings. Perfusion texture features measuring the degree of intensity heterogeneity (such as the correlation or the sum of variance) appeared to be significantly different across the IDH mutation groups. Similar large number of markers of tissue heterogeneity were found to be different across a paired comparison of grades IV vs III and grades IV vs II. These observations are biologically relevant as they reflect heterogeneous microenvironment in high-grade gliomas, which is also consistent with diffusion studies [14, 39]. It can be postulated that texture features reflect the continuum of appearance evolution between grades but the results highlight the challenge of strictly separating grades III from the two extremes of the spectrum (grade II and grade IV).

The prediction rate in our study performed better (65% of specificity, 79% sensitivity) than the one reported by Kickingereder et al. [8] in a single-centre trial despite the overall heterogeneous nature of our large cohort in terms of molecular subgroups and DSC acquisition protocols. Similar machine-assisted techniques for tumour grade prediction demonstrated accuracy between 73 and 85% [40,41,42,43]. While our algorithm performance appears slightly inferior to the aforementioned results, we note that in all these studies patients are recruited from one centre, while we considered patients data from six centres. With multi-center data we expect higher variability and heterogeneity and hence anticipated lower accuracy. Using a different simulation approach might however improve accuracy. Citak-Er et al. [43] have proposed the use of a linear kernel support vector machine (SVM) with ten-fold cross-validation based on features extracted from conventional and advanced MRI data (DTI, DWI, DSC perfusion and MR- Spectroscopy), and reported 73.3% of overall sensitivity from their multiparametric MR imaging. For the purpose of our study we chose random-forest algorithm as an established method for three classes or more classification. Future work would probably involve combining radiomics features from additional advanced MRI sequences so as to better identify required sequences for an optimised performance. With the recent rise of deep-learning techniques that do not rely on hand-crafted features, further exploration on this end would be required notably to assess the behaviour of such solutions in the identification of borderline cases, the associated uncertainty calibration as well as the assessment of robustness to acquisition changes.

Our error analysis was performed to better characterise the existing continuum between ordinal categories such as glioma grades. Analysis of the relationship between misclassifications and used features was strongly in agreement with previously well-known MRI shortcomings in tumour staging. For instance, some tumours incorrectly classified as WHO grade IV instead as grade III had significantly higher mean intensity in rCBV images. This highlights the inadequacy of first-order statistics to establish clear cut-off values regardless the sophistication of the DSC technique. This paradigm exemplifies the biological interpretability of machine-learning based radiomics compared to non-feature based methods and illustrates how it can be used to re-evaluate categorical classifications and improve characterisation and understanding of the pathological continuum in glioma grades, in particular with emerging novel, validated molecular biomarkers, such as CDKN2A/B deletion in IDH-mutant astrocytomas, which have been suggested to form a new clinical risk group [3]. Such use of radiomics techniques parallels the findings of new biological definition of GBM subgroups described in the work from Rathore et al. [44].

Overall, this proof-of-concept study aimed to assess the diagnostic value of the DSC-MRI radiomics in treatment-naïve gliomas using multi-centre study design without a priori consensus on the acquisition parameters and settings (incl. MR field strength, sequence type, resolution, acquisition parameters). Different suggestions for sequence optimisation in DSC-PWI, especially across TE and FA, are available in the literature [30]. The large protocol variability observed in our cohort allowed us to investigate the effect of these key characteristics. Acquisition metrics were dichotomized based on existing literature [30, 45], and subsequently analysed prediction results across defined groups. Higher acquisition resolution was significantly associated with lower error rates. Error in grade prediction was more sensitive to acquisition parameters than the distinction between IDH wild-type and mutant. Additionally, lower field strength associated with lower variability in scanning protocols led to lower absolute errors (p = 0.02 and p = 0.01 for IDH and WHO grade respectively). Higher sensitivity and specificity rates with lower errors were achieved in the estimation of IDH mutation (82, 70%, 0.26 at 1.5 T, and 72, 60%, 0.37 at 3 T, respectively), and better WHO grade predictions, exact or with a distance, at 1.5 T systems when compared to 3 T. In fact, while higher spatial resolution and signal-to-noise ratio from higher magnetic fields may turn into an advantage in DSC imaging, which eventually leads decrease in the dose of contrast material and acquisitions time, protocol variability may increase image appearance heterogeneity. This result cannot be generalised in favour of the 1.5 T field strength scanners as this would need a controlled randomised study but it stresses the need for imaging acquisition guidelines to harness the potential of technological advances, such as higher MRI fields. Last but not least, partial volume effect caused by surrounding tissues in the point of arterial input function may also result in erroneous estimation of rCBV.

The main strength of this work lies in both the multi-centre design and, though limited for optimised learning approaches, still the largest cohort in the literature to utilise machine-learning based radiomics as a predictive diagnostic tool. Since the analyses were purposefully performed across perfusion data only, further studies are needed to assess a combined impact of a multimodal approach (e.g. DWI, T1-weighted MR perfusion, MR spectroscopy, PET) in the imaging phenotyping in gliomas, though this carries the risk of introducing further considerable variability.

This study has some limitations. Specifically, the small number of patients in some groups did not allow to form integrated molecular subgroups, akin to existing previous work. Therefore we could not undertake multi-class optimised learning classification according to the integrated histomolecular WHO classification scheme. Further data is continuously being gathered from the existing and new participating centres to extend the work in this direction and assess the generalisability of the proposed framework, while simultaneously exploring the optimal machine learning algorithm to use. Future work will explore the accuracy of contrasting different algorithms, such as supervised V unsupervised algorithms, when faced with heterogeneous dataset.

Conclusions

In conclusion, the findings underscore the satisfactory diagnostic contribution of DSC-MRI extracted higher-order features combined with machine-learning in the automated classification of grading and IDH mutation status of gliomas mitigating the high imaging heterogeneity. The promising results obtained through the proposed random forest radiomics-based framework, are an additional step towards the clinical translation of machine learning tools as diagnostic aid aiming at facilitating the implementation of tailored treatment based on precision imaging.