Quantitative CT radiomics-based models for prediction of haematoma expansion and poor functional outcome in primary intracerebral haemorrhage

Objectives To test radiomics-based features extracted from noncontrast CT of patients with spontaneous intracerebral haemorrhage for prediction of haematoma expansion and poor functional outcome and compare them with radiological signs and clinical factors. Materials and methods Seven hundred fifty-four radiomics-based features were extracted from 1732 scans derived from the TICH-2 multicentre clinical trial. Features were harmonised and a correlation-based feature selection was applied. Different elastic-net parameterisations were tested to assess the predictive performance of the selected radiomics-based features using grid optimisation. For comparison, the same procedure was run using radiological signs and clinical factors separately. Models trained with radiomics-based features combined with radiological signs or clinical factors were tested. Predictive performance was evaluated using the area under the receiver operating characteristic curve (AUC) score. Results The optimal radiomics-based model showed an AUC of 0.693 for haematoma expansion and an AUC of 0.783 for poor functional outcome. Models with radiological signs alone yielded substantial reductions in sensitivity. Combining radiomics-based features and radiological signs did not provide any improvement over radiomics-based features alone. Models with clinical factors had similar performance compared to using radiomics-based features, albeit with low sensitivity for haematoma expansion. Performance of radiomics-based features was boosted by incorporating clinical factors, with time from onset to scan and age being the most important contributors for haematoma expansion and poor functional outcome prediction, respectively. Conclusion Radiomics-based features perform better than radiological signs and similarly to clinical factors on the prediction of haematoma expansion and poor functional outcome. Moreover, combining radiomics-based features with clinical factors improves their performance. 
Key Points
• Linear models based on CT radiomics-based features perform better than radiological signs on the prediction of haematoma expansion and poor functional outcome in the context of intracerebral haemorrhage.
• Linear models based on CT radiomics-based features perform similarly to clinical factors known to be good predictors. However, combining these clinical factors with radiomics-based features increases their predictive performance.
Supplementary Information The online version contains supplementary material available at 10.1007/s00330-021-07826-9.


Introduction
Haematoma expansion (HE) is a complication that affects around one in five people with spontaneous intracerebral haemorrhage (ICH) during the first 24 h after symptom onset [1]. Additionally, HE is associated with poor functional outcome and is a therapeutic target for improving outcome [2,3]. Identifying those with a high risk of HE may allow selective targeting of patients with high chance of benefit in clinical trials testing haemostatic therapies.
Computed tomography angiography (CTA) spot sign has been validated as a good predictor of HE in ICH [4][5][6]. However, noncontrast computed tomography (NCCT) is still the standard of care for acute stroke in most care settings worldwide; thus, CTA is not routinely performed. NCCT signs such as blend sign [7], black hole sign [8], hypodensities [9,10], island sign [11], and swirl sign [12] have been proposed as alternative predictors of haematoma expansion and/or poor functional outcome. Nonetheless, those predictors suffer from intra- and interobserver variability, have variable definitions, and present modest sensitivity. For example, Law et al [13] reported sensitivities of 11.4-39.5% for haematoma expansion and 14.3-39.2% for poor functional outcome. These limitations highlight the need for alternative quantitative approaches that are performed automatically and may show better predictive performance.
A recent meta-analysis found that time from symptom onset to baseline CT imaging, baseline intracerebral haemorrhage volume, antiplatelet use, and anticoagulant use were important factors for the prediction of HE in the context of primary ICH [14]. Despite their importance, these clinical factors may not cover the full spectrum of predictive information that can be obtained from patients. Therefore, providing an approach that can automatically extract features from images may provide valuable complementary information to aid in the prediction.
Radiomics is a relatively recent quantitative approach in which a large number of features, such as intensity statistics, shape descriptors, or texture measurements, are extracted from radiological images to then be tested as predictors of outcomes [15,16]. We hypothesise that radiomics-based features predict haematoma expansion and poor functional outcome, since they may capture not only characteristics of the haematoma known to relate to instability (such as heterogeneity or complex shape) but also subtle characteristics not readily appreciable to the naked eye. Recent studies have applied a radiomics-based approach to prediction of HE [17][18][19]; however, these studies were relatively small (the largest included just over 250 subjects) and are based on either a single centre [17,19] or just 4 centres [18]. While useful to demonstrate the feasibility of the approach, the generalisability of the results is unclear. In this paper, we investigate the use of NCCT radiomics-based features and generalised linear models for prediction of both HE and poor functional outcome, in a retrospective analysis of data acquired prospectively in a large international multicentre randomised controlled trial in ICH. We also explore the predictive relation of our radiomics-based model with both radiological signs and clinical factors.

Intracerebral haemorrhage subjects
We retrospectively included participants recruited prospectively to the TICH-2 international randomised, placebo-controlled clinical trial (ISRCTN93732214) [20]. This trial tested the efficacy and safety of intravenous tranexamic acid in people with acute spontaneous intracerebral haemorrhage presenting within 8 h of symptom onset. Primary outcome was functional status at day 90 measured by modified Rankin scale. Ethical approval for TICH-2 was obtained from the local institutional review board and informed consent was obtained before enrolment, either from the participant or one of their relatives. The rationale, protocol, and inclusion/exclusion criteria for the TICH-2 trial have been reported elsewhere [21]. All 2077 TICH-2 trial participants who had valid baseline and follow-up scans and have been previously reported [13] were eligible for inclusion in this analysis. We excluded 345 participants from our analysis for clinical and technical reasons (Fig. 1), yielding a total of 1732 participants. Finally, all analyses were performed on a stratified semi-random split of the participants into a training set (N = 1211, 70%) and testing set (N = 521, 30%), forcing both sets to be age- and gender-matched.
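The text does not describe how the semi-random split was generated. As a minimal sketch, assuming strata defined by gender and decade of age (the strata, the seed, and the field names `gender` and `age` are illustrative assumptions, not the authors' procedure), a matched 70/30 split could be drawn like this:

```python
import random
from collections import defaultdict


def stratified_split(participants, train_fraction=0.7, seed=0):
    """Split participants 70/30 within (gender, age-decade) strata.

    `participants` is a list of dicts with hypothetical keys
    'gender' and 'age'; the strata definition is an assumption.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in participants:
        strata[(p["gender"], p["age"] // 10)].append(p)

    train, test = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within each stratum
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test
```

Because the same fraction is drawn from every stratum, the age and gender distributions of the two sets match up to rounding.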

Image acquisition
Noncontrast CT (NCCT) brain scans were acquired as part of routine clinical care at each of the 124 centres participating in TICH-2 following their local protocol. Baseline scans were acquired before randomisation and follow-up scans were acquired after 24 ± 12 h [21]. There were no restrictions on scanner manufacturer, scanner settings, or slice thickness. Nevertheless, only axial scans were accepted.

Feature extraction
The proposed feature extraction process follows the image processing guidelines of the image biomarker standardisation initiative (IBSI) [22,23]. Firstly, semi-automated volumetric segmentation of intracerebral haemorrhage, perihaematomal oedema, and intraventricular haemorrhage was performed on each baseline and follow-up NCCT scan by one of three independent experienced stroke imaging researchers (Z.K.L., K.K., and A.A.), who were blinded to clinical data, using ITK-SNAP version 3.6.0 (http://www.itksnap.org), with manual editing as required. The raters also classified each baseline NCCT scan as positive or negative for the presence of radiological markers (blend sign [7], black hole sign [8], hypodensities [9,10], and island signs [11]). Reliability assessments for the haematoma volumetric measurement and radiological marker interpretation for these raters have been published previously [13].
Secondly, images were resampled to 1-mm isotropic voxel size and additional filtered versions were computed using Laplacian of Gaussian and wavelet filters (see Supplementary Materials). A total of 754 NCCT radiomics-based features were subsequently extracted from the original and filtered scans (see Supplementary Table S1) using MATLAB R2019b. Seven hundred fifty-two of them were taken from the area defined as intracerebral haemorrhage on the baseline scan, using a third-party package (https://github.com/mvallieres/radiomics) [24].
Fig. 3 Training and testing procedure. The training UK data is split into 10 non-overlapping folds and 10 different models are trained for each value of the hyperparameter α, using each fold as validation data once. The model that shows the greatest AUC is selected for testing using the non-UK holdout data

Feature processing
Each feature vector was harmonised using the MATLAB version of the ComBat harmonisation package (https://github.com/Jfortin1/ComBatHarmonization) [25][26][27] with parametric adjustments to remove possible batch effects of slice thickness in the radiomics computation. This method has previously been tested for harmonisation of NCCT radiomics [28]. We utilised three batches: (1) slice thickness < 2 mm; (2) slice thickness ≥ 2 mm and < 4 mm; and (3) slice thickness ≥ 4 mm. Additionally, age and gender were utilised as biological covariates. Harmonisation was performed first on the training set and the same parameters were then applied to harmonise the testing set.
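To illustrate the batch definition and the "fit on training, apply to testing" order, the sketch below assigns the three slice-thickness batches and aligns one feature with a plain per-batch location and scale adjustment. This is a deliberately simplified stand-in and not ComBat itself: it omits the empirical Bayes shrinkage and the age and gender covariates, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def slice_thickness_batch(thickness_mm):
    """Return the harmonisation batch (1, 2 or 3) for a slice thickness in mm."""
    if thickness_mm < 2:
        return 1
    if thickness_mm < 4:
        return 2
    return 3


def fit_location_scale(values, batches):
    """Estimate pooled and per-batch mean/SD of one feature on the training set."""
    params = {"pooled": (mean(values), pstdev(values))}
    for b in set(batches):
        in_batch = [v for v, vb in zip(values, batches) if vb == b]
        params[b] = (mean(in_batch), pstdev(in_batch))
    return params


def apply_location_scale(values, batches, params):
    """Map each batch onto the pooled training distribution of the feature.

    Every batch seen here must also have been present when fitting.
    """
    mu, sd = params["pooled"]
    out = []
    for v, b in zip(values, batches):
        mu_b, sd_b = params[b]
        # A batch with zero spread carries no scale information.
        out.append(mu + sd * (v - mu_b) / sd_b if sd_b > 0 else mu)
    return out
```

The parameters returned by `fit_location_scale` on the training set would be passed unchanged to `apply_location_scale` for the testing set, so that no information from the testing set leaks into the adjustment.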
Also, an iterative feature selection procedure was run in which, for every pair of variables with absolute correlation > 0.9, the one with the largest mean absolute correlation was removed. After removing each variable, the average correlations were recomputed for the next iteration. This procedure was performed on the training data only and reduced the feature set to 218 features with no pairwise absolute correlation above 0.9 (see Supplementary Table S2). Figure 2 depicts the feature extraction process.
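The selection rule can be sketched in a few lines. The version below assumes Pearson correlation and processes the most strongly correlated pair first; neither detail is stated in the text, and the function names are hypothetical. Constant features are assumed to have been removed beforehand, since their correlation is undefined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_uncorrelated(features, cutoff=0.9):
    """Iteratively drop features until no pair has |r| > cutoff.

    `features` maps a feature name to its training-set values.
    Returns the sorted names of the retained features.
    """
    names = list(features)
    corr = {a: {b: abs(pearson(features[a], features[b]))
                for b in names if b != a} for a in names}
    kept = set(names)

    def mean_abs(f):
        # Recomputed over the features still retained at this iteration.
        others = [corr[f][g] for g in kept if g != f]
        return sum(others) / len(others)

    while True:
        pairs = [(corr[a][b], a, b) for a in kept for b in kept
                 if a < b and corr[a][b] > cutoff]
        if not pairs:
            return sorted(kept)
        _, a, b = max(pairs)
        kept.remove(a if mean_abs(a) >= mean_abs(b) else b)
```

Running this on the training data only, and then keeping the same columns in the testing data, mirrors the order of operations described above.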

Generalised linear model construction
We computed a generalised linear model (GLM) via elastic-net regularisation [29] with standardisation and the log-loss score as the objective function. To this end, we performed an exhaustive grid-search optimisation procedure with stratified 10-fold cross-validation (Fig. 3) using the H2O platform v3.26.0.2 (www.h2o.ai) and R software v3.6.3 (www.r-project.org). This search was carried out over the α blending hyperparameter of elastic-net regularisation, with values ranging from 0 (Ridge regression) to 1 (LASSO regression) in increments of 0.1 (see Supplementary Materials). Outcomes of interest were haematoma expansion, defined as volumetric growth of > 6 mL or > 33% on the follow-up scan, and poor functional outcome defined as modified Rankin scale of 4 to 6 at day 90 [20]. Finally, the optimal GLMs were chosen based on their cross-validation area under the receiver operating characteristic curve (AUC) score.
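The two outcome definitions and the α grid translate directly into code. The sketch below assumes that the 33% criterion is relative to baseline volume, and it writes the elastic-net penalty in the usual glmnet-style form, λ(α‖β‖₁ + (1 − α)/2 ‖β‖₂²); the strength parameter `lam` is not discussed in the text and is included only for illustration, as are all function names.

```python
def haematoma_expansion(baseline_ml, followup_ml):
    """True if the haematoma grew by > 6 mL or by > 33% of baseline volume."""
    growth = followup_ml - baseline_ml
    relative = baseline_ml > 0 and growth / baseline_ml > 0.33
    return growth > 6 or relative


def poor_functional_outcome(mrs_day90):
    """True for a modified Rankin scale score of 4 to 6 at day 90."""
    return 4 <= mrs_day90 <= 6


# Blending values searched: 0 (ridge) to 1 (LASSO) in steps of 0.1.
ALPHA_GRID = [i / 10 for i in range(11)]


def elastic_net_penalty(coefs, alpha, lam=1.0):
    """Elastic-net penalty for a coefficient vector (intercept excluded)."""
    l1 = sum(abs(c) for c in coefs)
    l2 = sum(c * c for c in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)
```

With α = 1 only the L1 term remains, which is what allows coefficients to be shrunk exactly to zero.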
For comparison purposes, the same hyperparameter optimisation was performed using the presence of radiological signs alone as a binary feature set and also using demographic and clinical information previously found to be predictive [14,30]. The radiological markers used were the blend sign, black hole sign, hypodensities, and island signs. Demographic and clinical factors were age (years), gender (male/female), time from onset to baseline scan (hours), baseline haematoma volume (mL), antiplatelet use (yes/no), and ultra-early haematoma growth (baseline haemorrhage volume over time from onset to baseline scan). Anticoagulant use was also found to be predictive by Al-Shahi Salman et al [14]; however, this was an exclusion criterion of TICH-2 and hence not included. We also carried out the analysis using radiomics-based features combined with radiological signs and with clinical factors independently.
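As a small illustration of the clinical feature set, the sketch below derives ultra-early haematoma growth (mL/h) from the two baseline measurements and assembles the six factors listed above. The dictionary keys and the 0/1 encodings are hypothetical choices made for this example.

```python
def ultra_early_growth(baseline_volume_ml, onset_to_scan_h):
    """Baseline haemorrhage volume divided by time from onset to baseline scan."""
    if onset_to_scan_h <= 0:
        raise ValueError("time from onset to scan must be positive")
    return baseline_volume_ml / onset_to_scan_h


def clinical_features(age_years, is_male, onset_to_scan_h,
                      baseline_volume_ml, antiplatelet_use):
    """Assemble the demographic and clinical factors as a flat record."""
    return {
        "age_years": age_years,
        "gender_male": int(is_male),              # assumed 0/1 encoding
        "onset_to_scan_h": onset_to_scan_h,
        "baseline_volume_ml": baseline_volume_ml,
        "antiplatelet_use": int(antiplatelet_use),
        "ultra_early_growth_ml_per_h": ultra_early_growth(
            baseline_volume_ml, onset_to_scan_h),
    }
```

For instance, a 30 mL haematoma imaged 2 h after onset gives an ultra-early growth of 15 mL/h.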
We also assessed our study using the radiomics quality score [31] (see Supplementary Materials for details). The score was 13 (36.11%), which is higher than the median score obtained in a recent systematic review of 51 cancer studies [32]. One of the main criteria affecting our score was the fact that despite TICH-2 being a prospective trial, it was not initially devised with radiomics in mind.

Participant characteristics analysis
Participant demographics are summarised in Table 1 for haematoma expansion and Table 2 for poor functional outcome. Of the 1732 participants, 13 had no functional outcome information available and were excluded from the corresponding analysis. This accounts for the difference in the number of subjects between Tables 1 and 2. There was no significant difference in age, gender, or any of the included variables between training and testing sets (all p > .05). Finally, no statistical difference (all p > .05) in treatment allocation proportions was found between both sets.
Regarding differences within the training and testing sets, we observed that baseline haemorrhage volume was significantly higher (p < .001) for participants who developed HE and for those with poor functional outcome in both sets. This difference remains significant if we test on the whole population sample (p < .001). For treatment allocation, we observed a difference that is not statistically significant between participants with and without HE on both sets. However, this difference became significant when tested on the whole population sample (p = .018). We observed no statistical difference in poor functional outcome in the training and testing sets, nor when tested on the whole population sample (all p > .05).

Effect of feature harmonisation
To show the effect of feature harmonisation, we computed a 2-dimensional t-SNE manifold [33] over the standardised feature vectors of each subject for both the training and testing sets (Fig. 4). We observed that prior to harmonisation, there were three clusters corresponding to each of the harmonisation batches on both sets. This demonstrates the strong impact of slice thickness on the values of computed radiomics-based features. After harmonisation, this influence was eliminated.

Performance of generalised linear models
Threshold analysis and ROC curves summarising the prediction performance of the optimal models on the training set using NCCT radiomics-based features, radiological signs, clinical factors, and combined models are depicted in Fig. 5 for HE and Fig. 6 for poor functional outcome. Individual AUC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and prevalence for both the training and testing sets are reported in Table 3. For each optimal model, we chose the threshold probability in the training set such that Youden's index, defined as sensitivity + specificity − 1, is maximised. The same threshold was then used on the testing set.
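The threshold selection can be sketched as follows: every predicted probability in the training set is tried as a candidate cut-off and the one with the largest Youden's index is kept. The function names are hypothetical, and ties are resolved in favour of the lowest threshold, which is an assumption of this sketch.

```python
def sens_spec(y_true, probs, threshold):
    """Sensitivity and specificity when predicting positive for p >= threshold."""
    pairs = list(zip(y_true, probs))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(y_true, probs):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(probs)):
        sens, spec = sens_spec(y_true, probs, t)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

The threshold returned for the training set would then be passed unchanged to `sens_spec` together with the testing-set labels and probabilities.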
For prediction of haematoma expansion, a moderate performance was observed when using NCCT radiomics-based features, with testing set AUC of 0.693 (95% CI, 0.638-0.747), sensitivity of 0.635 (95% CI, 0.554-0.716), specificity   (Table 3) due to sharp reductions in sensitivity and AUC on both sets; however, specificities were mildly higher. Moreover, the optimal model for prediction using radiomics-based features and radiological signs combined was identical to the optimal model using radiomics-based features alone, as all the coefficients for radiological signs were shrunk to zero by elastic-net training. Also, the optimal model using demographic and clinical factors yielded slightly lower performance than the one using radiomics-based features, especially due to a sharp reduction in sensitivity to 0.35 (95% CI, 0.27-0.43). However, combining these factors with radiomics-based features boosted their performance, achieving an AUC of 0.704 (95% CI, 0.651-0.758) and an increased sensitivity of 0.65 (95% CI, 0.57-0.73). We found that time from onset to CT scan was the factor responsible for this improvement. Finally, treatment allocation was discarded by elastic net in all radiomics-based models, showing it had no effect on predictions.
Fig. 4 Subject radiomics feature vectors pre-harmonisation and post-harmonisation (right column)
Table 3 Performance table
Fig. 6 Threshold analysis for sensitivity, specificity, Youden's index, F1 score, F0.5 score, and F2 score (left column) and ROC curve (right column) for the prediction models of poor functional outcome (radiomics, radiological signs, radiomics and signs combined, clinical factors, and radiomics and clinical factors combined). Optimal threshold criterion was maximal Youden's index
Table 4 Ranking of radiomics-based features selected in elastic-net training in terms of their importance. Only features with an importance greater than 1% are shown. Standardised model coefficients are also provided in boldface (positive) and italics (negative)
In the case of prediction of poor functional outcome using NCCT radiomics-based features, we observed a good performance, with testing set AUC of 0.783 (95% CI, 0.744-0.823), sensitivity of 0.698 (95% CI, 0.642-0.754), specificity of 0.741 (95% CI, 0.688-0.795), PPV of 0.729 (95% CI, 0.673-0.784), and NPV of 0.711 (95% CI, 0.657-0.765). Here, the balanced prevalence of 49.9% (95% CI, 45.6-54.2%) is reflected in balanced predictive values. Perihaematomal oedema volume was found to be the most important factor for prediction, suggesting that inflammation may have an important effect on clinical outcome. As in the case of haematoma expansion, both sensitivity and AUC suffered a strong reduction using radiological signs compared to the model trained with NCCT radiomics-based features. Again, for the case of combined NCCT radiomics-based features and radiological signs, we obtained a model which was identical to the one trained using NCCT radiomics-based features alone. Moreover, demographic and clinical factors showed slightly better AUC than NCCT radiomics-based features, explained by an improved specificity. Again, by combining these factors and NCCT radiomics-based features, the best performance is achieved, with an AUC of 0.818 (95% CI, 0.781-0.854), mainly due to increased PPV and NPV of 0.799 (95% CI, 0.747-0.852) and 0.73 (95% CI, 0.68-0.781), respectively. The features accountable for this improvement were age and ultra-early haematoma growth. Lastly, treatment allocation had again no influence on the radiomics-based models and was discarded by elastic net.
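The link between prevalence and predictive values noted above for the radiomics-only model follows from Bayes' theorem. As a worked check (the function name is hypothetical), the reported sensitivity, specificity, and prevalence for poor functional outcome reproduce the reported PPV and NPV:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Testing-set values reported for the radiomics-only model of
# poor functional outcome.
ppv, npv = predictive_values(0.698, 0.741, 0.499)
print(round(ppv, 3), round(npv, 3))  # 0.729 0.711
```

With a prevalence close to 50%, PPV and NPV stay close to specificity and sensitivity in magnitude; at the lower prevalence of haematoma expansion, the same sensitivity and specificity would give a lower PPV and a higher NPV.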

Discussion
We evaluated the predictive performance of NCCT radiomics-based features for haematoma expansion and poor functional outcome using generalised linear models on a large and heterogeneous sample. We also investigated the predictive performance of radiological signs and clinical factors, and their combination with radiomics-based features with the same type of models. This analysis showed that radiomics-based features have higher predictive performance compared to radiological signs and perform similarly to clinical factors. Moreover, this work showed that prediction performance is not improved by incorporating radiological signs, but it is boosted by the inclusion of demographic and clinical factors. Since our analyses are based on binary outcomes, haematoma expansion and the modified Rankin scale were dichotomised. Our choice of dichotomisation for haematoma expansion was identical to that of the TICH-2 trial [20] and has also been used in several retrospective analyses [10,[34][35][36]. Our choice for the modified Rankin scale is based on the observation that intracerebral haemorrhage is in general a severe type of stroke and a score of 3 or less would still be considered a "good" outcome. This choice has also been previously made for the primary analysis of large clinical trials of ICH, such as STICH [37] and CLEAR III [38].
Our reported diagnostic performance of radiological signs was not identical to a previous work using data from the TICH-2 clinical trial [13]. The main sources of difference were that our study included fewer participants and that we performed the analysis using the combined set of signs as predictor features simultaneously, rather than testing each one individually. Nevertheless, our results are consistent with this previous report in terms of the relatively low sensitivity shown by these signs.
In addition to their poor sensitivity, radiological signs are qualitative measures with subjective definitions and are therefore susceptible to inter- and intra-rater variability. On the other hand, the proposed radiomics-based generalised linear model provides a quantitative and consistent way of predicting HE and poor functional outcome, yielding better sensitivity and AUC. However, despite the performance of our radiomics approach, it is not currently available for real-time application and would require significant further development and validation for clinical use.
Many important variables selected by our radiomics-based model are consistent with previous findings. Amongst them are features based on Laplacian of Gaussian filters, which are indicative of haematoma heterogeneity and have good predictive value for its expansion [39], and extent of perihaematomal oedema, which has a predictive association with poor functional outcome [40,41]. Also, our observation that the addition of the time from symptom onset to baseline scan to our radiomics-based model improves HE prediction is consistent with the meta-analysis by Al-Shahi Salman et al [14]. Finally, the conclusion that incorporating age and ultra-early haematoma growth as factors substantially improves functional outcome prediction has strong support in prior studies [30,42,43].
An important aspect of constructing an NCCT radiomics-based model for prediction is the harmonisation of the extracted raw features. This is especially crucial in the context of data coming from multicentre studies such as TICH-2, where the stability of the radiomics features can be severely affected by differences in imaging protocols [44]. Our investigation suggests that slice thickness is a very important factor to consider for harmonisation, which is in agreement with previous findings [45]. However, there are additional relevant sources of difference (for which we had no exhaustive information) that could have been considered such as scan type (helical/axial), slice gap, reconstruction kernel, field of view, tube voltage, and milliamperage [46]. Future radiomics studies should take all these sources of variability into account to achieve optimal harmonisation.
A relevant observation is that our model does not reach the same predictive performance for HE as recent studies using radiomics-based linear models [18,19] that show substantially better performance. This may be due to those studies being performed on significantly smaller samples from 4 or fewer centres which may have much lower variability in terms of CT parameters. This highlights the challenges that radiomics-based models face when dealing with data that better reflects the "real-world" variability present in multicentre studies. In order to provide a reliable tool for research and clinical application, it is also essential to tackle the current challenge of reproducibility in advanced clinical imaging. In radiomics, potential sources of variability arise from image pre- and post-processing, discrepancies in the way the segmentation of regions of interest is performed, and differences in the number and type of radiomics features computed. Standardisation efforts such as IBSI [22] are a step in the right direction and should be adopted in future radiomics studies. Furthermore, we have found it beneficial to perform post-harmonisation of discrepant features when protocol homologation amongst many centres is not feasible or practical, as in the case of our study.
Finally, this is a first approach using elastic net, where the linear contributions of each feature can be assessed in a straightforward manner. Other machine learning approaches, such as random forest or deep learning, can be investigated to improve the detection of non-linear interactions between variables and potentially improve prediction performance. For example, a comparison of different classifiers similar to the one performed by Li et al [17] may be pursued, but in a much larger sample.
In conclusion, we showed that models using NCCT radiomics-based features outperform models using radiological signs and perform similarly to models using clinical factors. Moreover, we found that incorporating demographic and clinical factors into a radiomics-based model improves the prediction. These results suggest that radiomics-based models, with added demographic and clinical factors if available, may be of prognostic value in people with ICH. Hence, they could be incorporated into therapeutic trials to aid selection of those at risk of haematoma expansion or poor functional outcome once further work has been performed to address the challenges around predictive accuracy and variability of radiomics features in multicentre studies.