Introduction

Haematoma expansion (HE) is a complication that affects around one in five people with spontaneous intracerebral haemorrhage (ICH) during the first 24 h after symptom onset [1]. HE is associated with poor functional outcome and is a therapeutic target for improving outcome [2, 3]. Identifying those at high risk of HE may allow selective targeting of patients with a high chance of benefit in clinical trials testing haemostatic therapies.

The computed tomography angiography (CTA) spot sign has been validated as a good predictor of HE in ICH [4,5,6]. However, noncontrast computed tomography (NCCT) is still the standard of care for acute stroke in most care settings worldwide; thus, CTA is not routinely performed. NCCT signs such as blend sign [7], black hole sign [8], hypodensities [9, 10], island sign [11], and swirl sign [12] have been proposed as alternative predictors of haematoma expansion and/or poor functional outcome. Nonetheless, those predictors suffer from intra- and interobserver variability, have variable definitions, and show modest sensitivity. For example, Law et al [13] reported sensitivities of 11.4–39.5% for haematoma expansion and 14.3–39.2% for poor functional outcome. These limitations highlight the need for alternative quantitative approaches that are performed automatically and may show better predictive performance.

A recent meta-analysis found that time from symptom onset to baseline CT imaging, baseline intracerebral haemorrhage volume, antiplatelet use, and anticoagulant use were important factors for the prediction of HE in the context of primary ICH [14]. Despite their importance, these clinical factors may not cover the full spectrum of predictive information that can be obtained from patients. Therefore, an approach that automatically extracts features from images may provide valuable complementary information to aid prediction.

Radiomics is a relatively recent quantitative approach in which a large number of features, such as intensity statistics, shape descriptors, or texture measurements, are extracted from radiological images to then be tested as predictors of outcomes [15, 16]. We hypothesise that radiomics-based features predict haematoma expansion and poor functional outcome, since they may capture not only characteristics of the haematoma known to relate to instability (such as heterogeneity or complex shape) but also subtle characteristics not readily appreciable to the naked eye. Recent studies have applied a radiomics-based approach to prediction of HE [17,18,19]; however, these studies were relatively small (the largest included just over 250 subjects) and are based on either a single centre [17, 19] or just 4 centres [18]. While useful to demonstrate the feasibility of the approach, the generalisability of the results is unclear. In this paper, we investigate the use of NCCT radiomics-based features and generalised linear models for prediction of both HE and poor functional outcome, in a retrospective analysis of data acquired prospectively in a large international multicentre randomised controlled trial in ICH. We also explore the predictive relation of our radiomics-based model with both radiological signs and clinical factors.

Materials and methods

Intracerebral haemorrhage subjects

We retrospectively included participants recruited prospectively to the TICH-2 international randomised, placebo-controlled clinical trial (ISRCTN93732214) [20]. This trial tested the efficacy and safety of intravenous tranexamic acid in people with acute spontaneous intracerebral haemorrhage presenting within 8 h of symptom onset. The primary outcome was functional status at day 90 measured by the modified Rankin scale. Ethical approval for TICH-2 was obtained from the local institutional review board and informed consent was obtained before enrolment, either from the participant or one of their relatives. The rationale, protocol, and inclusion/exclusion criteria for the TICH-2 trial have been reported elsewhere [21]. All 2077 TICH-2 trial participants who had valid baseline and follow-up scans and have been previously reported [13] were eligible for inclusion in this analysis. We excluded 345 participants from our analysis for clinical and technical reasons (Fig. 1), yielding a total of 1732 participants. Finally, all analyses were performed on a stratified semi-random split of the participants into a training set (N = 1211, 70%) and testing set (N = 521, 30%), forcing both sets to be age- and gender-matched.
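The exact matching procedure behind the split is not described here, so the following is only a minimal sketch of one way a stratified 70/30 split could be obtained. The stratum definition (gender combined with a 10-year age band), the record layout, and the function names are all illustrative assumptions rather than the procedure used in the study.

```python
import random
from collections import defaultdict


def age_gender_stratum(record, band_years=10):
    """Hypothetical stratum: gender plus an age band (assumed, not from the study)."""
    return (record["gender"], record["age"] // band_years)


def stratified_split(records, key=age_gender_stratum, train_fraction=0.7, seed=0):
    """Split records so that each stratum contributes ~train_fraction to training."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    train, test = [], []
    for stratum in sorted(strata):
        members = strata[stratum]
        rng.shuffle(members)  # random assignment within the stratum
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```

Splitting within strata keeps the age and gender distributions of the two sets close to each other, which is the property the text asks for; any residual imbalance would still need to be checked statistically.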

Fig. 1 Study inclusion flowchart

Image acquisition

Noncontrast CT (NCCT) brain scans were acquired as part of routine clinical care at each of the 124 centres participating in TICH-2 following their local protocol. Baseline scans were acquired before randomisation and follow-up scans were acquired after 24 ± 12 h [21]. There were no restrictions on scanner manufacturer, scanner settings, or slice thickness. Nevertheless, only axial scans were accepted.

Feature extraction

The proposed feature extraction process follows the image processing guidelines of the image biomarker standardisation initiative (IBSI) [22, 23]. Firstly, semi-automated volumetric segmentation of intracerebral haemorrhage, perihaematomal oedema, and intraventricular haemorrhage was performed on each baseline and follow-up NCCT scan by one of three independent experienced stroke imaging researchers (Z.K.L., K.K., and A.A.), who were blinded to clinical data, using ITK-SNAP version 3.6.0 (http://www.itksnap.org), with manual editing as required. The raters also classified each baseline NCCT scan as positive or negative for the presence of radiological markers (blend sign [7], black hole sign [8], hypodensities [9, 10], and island signs [11]). Reliability assessments for the haematoma volumetric measurement and radiological marker interpretation for these raters have been published previously [13].

Secondly, images were resampled to 1-mm isotropic voxel size and additional filtered versions were computed using Laplacian of Gaussian and wavelet filters (see Supplementary Materials). A total of 754 NCCT radiomics-based features were subsequently extracted from the original and filtered scans (see Supplementary Table S1) using MATLAB R2019b. Of these, 752 were taken from the area defined as intracerebral haemorrhage on the baseline scan, using a third-party package (https://github.com/mvallieres/radiomics) [24] for textural features, and in-house code for first-order and shape-based features (see Supplementary Materials for code snippets). The remaining two features correspond to baseline perihaematomal oedema volume (mL) and baseline intraventricular haemorrhage volume (mL). Additionally, we added the TICH-2 treatment allocation (either tranexamic acid or placebo) to the radiomics feature set as a covariate of no interest.

Feature processing

Each feature vector was harmonised using the MATLAB version of the ComBat harmonisation package (https://github.com/Jfortin1/ComBatHarmonization) [25,26,27] with parametric adjustments to remove possible batch effects of slice thickness in the radiomics computation. This method has previously been tested for harmonisation of NCCT radiomics [28]. We utilised three batches: (1) slice thickness < 2 mm; (2) slice thickness ≥ 2 mm and < 4 mm; and (3) slice thickness ≥ 4 mm. Additionally, age and gender were utilised as biological covariates. Harmonisation was performed first on the training set and the same parameters were then applied to harmonise the testing set.
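As an illustration of the batch definition only (not of ComBat itself), the three slice-thickness batches can be written as a small lookup. The function name and the validation of non-positive values are assumptions added for the sketch.

```python
def slice_thickness_batch(thickness_mm):
    """Return the harmonisation batch (1, 2 or 3) for a slice thickness in mm.

    Batch 1: < 2 mm; batch 2: >= 2 mm and < 4 mm; batch 3: >= 4 mm.
    """
    if thickness_mm <= 0:
        # Assumed sanity check; not part of the batch definition in the text.
        raise ValueError("slice thickness must be positive")
    if thickness_mm < 2:
        return 1
    if thickness_mm < 4:
        return 2
    return 3


# Hypothetical slice thicknesses for four scans.
batches = [slice_thickness_batch(t) for t in (1.25, 2.5, 4.0, 5.0)]
```

The resulting batch vector, together with age and gender as covariates, is the kind of input a ComBat implementation would expect; the harmonisation parameters themselves would be estimated on the training set and reused for the testing set.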

In addition, an iterative feature selection procedure was run in which, for every pair of variables with absolute correlation > 0.9, the one with the largest mean absolute correlation was removed. After each removal, the mean correlations were recomputed for the next iteration. This procedure was performed on the training data only and reduced the feature set to 218 features without high pairwise correlation (see Supplementary Table S2). Figure 2 depicts the feature extraction process.
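The correlation-based filtering can be sketched as follows. This is a minimal pure-Python illustration, not the code used in the study: the order in which offending pairs are visited (most correlated pair first) and the tie-breaking rule are assumptions, and features are assumed to be non-constant so that the Pearson correlation is defined.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_correlation(name, kept, corr):
    """Mean absolute correlation of one feature with the other kept features."""
    others = [g for g in kept if g != name]
    return sum(corr[name, g] for g in others) / len(others)


def correlation_filter(features, cutoff=0.9):
    """features: dict mapping feature name -> list of values. Returns kept names."""
    names = list(features)
    corr = {(a, b): abs(pearson(features[a], features[b]))
            for a in names for b in names if a != b}
    kept = list(names)
    while True:
        # All remaining pairs whose absolute correlation exceeds the cutoff.
        pairs = [(corr[a, b], a, b)
                 for i, a in enumerate(kept) for b in kept[i + 1:]
                 if corr[a, b] > cutoff]
        if not pairs:
            return kept
        _, a, b = max(pairs)  # assumed order: most correlated pair first
        # Drop whichever member is, on average, more correlated with the rest;
        # the means are recomputed on the reduced set at the next iteration.
        if mean_abs_correlation(a, kept, corr) >= mean_abs_correlation(b, kept, corr):
            kept.remove(a)
        else:
            kept.remove(b)
```

Running the filter on the training data only, and then keeping the same feature names for the testing data, avoids leaking information from the testing set into feature selection.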

Fig. 2 Feature extraction process flowchart. NCCT scans and their annotations are resampled to 1-mm isotropic voxel size. Shape features are extracted from the resampled annotations and intensity and texture features are extracted from the resampled original and filtered images. This set of features, together with ultra-early haematoma growth, is harmonised and the final set of uncorrelated features is then computed using a correlation-based filtering method. NCCT, noncontrast computed tomography; LoG, Laplacian of Gaussian

Generalised linear model construction

We fitted a generalised linear model (GLM) with elastic-net regularisation [29], with standardisation and log-loss as the loss function. To this end, we performed an exhaustive grid search with stratified 10-fold cross-validation (Fig. 3) using the H2O platform v3.26.0.2 (www.h2o.ai) and R software v3.6.3 (www.r-project.org). This search was carried out over the α blending hyperparameter of elastic-net regularisation, with values ranging from 0 (Ridge regression) to 1 (LASSO regression) in increments of 0.1 (see Supplementary Materials). Outcomes of interest were haematoma expansion, defined as volumetric growth of > 6 mL or > 33% on the follow-up scan, and poor functional outcome, defined as a modified Rankin scale score of 4 to 6 at day 90 [20]. Finally, the optimal GLMs were chosen based on their cross-validation area under the receiver operating characteristic curve (AUC) score.
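To make the role of α concrete, the sketch below writes out the log-loss, the elastic-net penalty, and the 11-point α grid. The penalty uses the common glmnet-style parametrisation, λ(α‖β‖₁ + (1 − α)/2 ‖β‖₂²), which is assumed here to match the software used; the function names and the regularisation strength λ (`lam`) are illustrative, and the intercept is assumed to be left unpenalised.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative Bernoulli log-likelihood of predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def elastic_net_penalty(coefficients, lam, alpha):
    """Assumed glmnet-style penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Grid over the blending hyperparameter: 0 (ridge) to 1 (LASSO) in steps of 0.1.
ALPHA_GRID = [i / 10 for i in range(11)]
```

With α = 1 only the L1 term remains, which is what allows coefficients to be shrunk exactly to zero; with α = 0 only the squared L2 term remains, which shrinks coefficients without removing them.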

Fig. 3 Training and testing procedure. The training UK data is split into 10 non-overlapping folds and 10 different models are trained for each value of the hyperparameter α, using each fold as validation data once. The model that shows the greatest AUC is selected for testing using the non-UK holdout data

For comparison purposes, the same hyperparameter optimisation was performed using the presence of radiological signs alone as a binary feature set and also using demographic and clinical information previously found to be predictive [14, 30]. The radiological markers used were the blend sign, black hole sign, hypodensities, and island signs. Demographic and clinical factors were age (years), gender (male/female), time from onset to baseline scan (hours), baseline haematoma volume (mL), antiplatelet use (yes/no), and ultra-early haematoma growth (baseline haemorrhage volume over time from onset to baseline scan). Anticoagulant use was also found to be predictive by Al-Shahi Salman et al [14]; however, this was an exclusion criterion of TICH-2 and hence not included. We also carried out the analysis using radiomics-based features combined with radiological signs and with clinical factors independently.
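Ultra-early haematoma growth, as defined above, is a simple ratio. A minimal sketch follows; the function name and the check for a non-positive onset-to-scan time are assumptions added for illustration.

```python
def ultra_early_haematoma_growth(baseline_volume_ml, onset_to_scan_hours):
    """Baseline haemorrhage volume (mL) divided by onset-to-scan time (h), in mL/h."""
    if onset_to_scan_hours <= 0:
        # Assumed guard: the ratio is undefined without a positive elapsed time.
        raise ValueError("onset-to-scan time must be positive")
    return baseline_volume_ml / onset_to_scan_hours


# Hypothetical example: a 24 mL haematoma imaged 3 h after onset gives 8 mL/h.
example_uhg = ultra_early_haematoma_growth(24.0, 3.0)
```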

We also assessed our study using the radiomics quality score [31] (see Supplementary Materials for details). The score was 13 (36.11%), which is higher than the median score obtained in a recent systematic review of 51 cancer studies [32]. One of the main criteria affecting our score was the fact that, despite TICH-2 being a prospective trial, it was not initially devised with radiomics in mind.

Results

Participant characteristics analysis

Participant demographics are summarised in Table 1 for haematoma expansion and Table 2 for poor functional outcome. Of the 1732 participants, 13 had no functional outcome information available and were excluded from the corresponding analysis. This accounts for the difference in the number of subjects between Tables 1 and 2. There was no significant difference in age, gender, or any of the included variables between training and testing sets (all p > .05). Finally, no statistically significant difference (all p > .05) in treatment allocation proportions was found between the two sets.

Table 1 Patient characteristics for the training and testing datasets with respect to haematoma expansion. Data are number (%), mean (SD), or median (IQR). p value between testing and training datasets
Table 2 Patient characteristics for the training and testing datasets with respect to functional outcome. Data are number (%), mean (SD), or median (IQR). p value between testing and training datasets

Regarding differences within the training and testing sets, we observed that baseline haemorrhage volume was significantly higher (p < .001) for participants who developed HE and for those with poor functional outcome in both sets. This difference remained significant when tested on the whole population sample (p < .001). For treatment allocation, the difference between participants with and without HE was not statistically significant in either set; however, it became significant when tested on the whole population sample (p = .018). For poor functional outcome, we observed no statistically significant difference in treatment allocation in the training and testing sets, nor when tested on the whole population sample (all p > .05).

Effect of feature harmonisation

To show the effect of feature harmonisation, we computed a 2-dimensional t-SNE manifold [33] over the standardised feature vectors of each subject for both the training and testing sets (Fig. 4). We observed that prior to harmonisation, there were three clusters corresponding to each of the harmonisation batches in both sets. This demonstrates the strong impact of slice thickness on the values of computed radiomics-based features. After harmonisation, these batch-specific clusters were no longer apparent.

Fig. 4 t-SNE visualisations of standardised training and testing radiomics feature vectors for each of the 3 harmonisation batches. Each point represents a feature vector for one subject. The left column corresponds to subject radiomics feature vectors pre-harmonisation and the right column corresponds to subject radiomics feature vectors post-harmonisation

Performance of generalised linear models

Threshold analysis and ROC curves summarising the prediction performance of the optimal models on the training set using NCCT radiomics-based features, radiological signs, clinical factors, and combined models are depicted in Fig. 5 for HE and Fig. 6 for poor functional outcome. Individual AUC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and prevalence for both the training and testing sets are reported in Table 3. For each optimal model, we chose the threshold probability in the training set that maximised Youden’s index, defined as sensitivity + specificity − 1. The same threshold was then used on the testing set.
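The threshold selection can be sketched as a search over candidate probabilities. This is a minimal illustration under stated assumptions: candidate thresholds are taken to be the distinct predicted probabilities, a case is called positive when its probability is at or above the threshold, ties in Youden's index are resolved in favour of the lowest threshold, and both classes are assumed to be present. The function names are hypothetical.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting positive for prob >= threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if y == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    return tp / (tp + fn), tn / (tn + fp)


def youden_optimal_threshold(y_true, y_prob):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_prob)):
        sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
        j = sens + spec - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j
```

In keeping with the procedure described above, the threshold would be estimated once on the training predictions and then passed unchanged to `sensitivity_specificity` for the testing predictions.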

Fig. 5 Threshold analysis for sensitivity, specificity, Youden’s index, F1 score, F0.5 score, and F2 score (left column) and ROC curve (right column) for the five prediction models of haematoma expansion (radiomics, radiological signs, radiomics and signs combined, clinical factors, and radiomics and clinical factors combined). Optimal threshold criterion was maximal Youden’s index

Fig. 6 Threshold analysis for sensitivity, specificity, Youden’s index, F1 score, F0.5 score, and F2 score (left column) and ROC curve (right column) for the five prediction models of poor functional outcome (radiomics, radiological signs, radiomics and signs combined, clinical factors, and radiomics and clinical factors combined). Optimal threshold criterion was maximal Youden’s index

Table 3 Performance table of all models, both on the testing and training sets. NCCT, noncontrast computed tomography; AUC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value

For prediction of haematoma expansion, a moderate performance was observed when using NCCT radiomics-based features, with testing set AUC of 0.693 (95% CI, 0.638–0.747), sensitivity of 0.635 (95% CI, 0.554–0.716), specificity of 0.690 (95% CI, 0.644–0.736), PPV of 0.422 (95% CI, 0.355–0.49), and NPV of 0.841 (95% CI, 0.801–0.882). The rather low PPV can be partially explained by an imbalance of the positive and negative classes, as prevalence was 26.3% (95% CI, 22.5–30.1%). The most important feature in the GLM was “LoG-35 interquartile range” (Table 4), which highlights that haematoma heterogeneity (as measured by Laplacian of Gaussian-based features) is relevant for prediction. The performance of radiological signs was considerably lower (Table 3) due to sharp reductions in sensitivity and AUC on both sets; however, specificities were mildly higher. Moreover, the optimal model for prediction using radiomics-based features and radiological signs combined was identical to the optimal model using radiomics-based features alone, as all the coefficients for radiological signs were shrunk to zero by elastic-net training. Also, the optimal model using demographic and clinical factors yielded slightly lower performance than the one using radiomics-based features, especially due to a sharp reduction in sensitivity to 0.35 (95% CI, 0.27–0.43). However, combining these factors with radiomics-based features boosted their performance, achieving an AUC of 0.704 (95% CI, 0.651–0.758) and an increased sensitivity of 0.65 (95% CI, 0.57–0.73). We found that time from onset to CT scan was the factor responsible for this improvement. Finally, treatment allocation was discarded by elastic net in all radiomics-based models, showing it had no effect on predictions.

Table 4 Ranking of radiomics-based features selected in elastic-net training in terms of their importance. Only features with an importance greater than 1% are shown. Standardised model coefficients are also provided in boldface (positive) and italics (negative)

In the case of prediction of poor functional outcome using NCCT radiomics-based features, we observed a good performance, with testing set AUC of 0.783 (95% CI, 0.744–0.823), sensitivity of 0.698 (95% CI, 0.642–0.754), specificity of 0.741 (95% CI, 0.688–0.795), PPV of 0.729 (95% CI, 0.673–0.784), and NPV of 0.711 (95% CI, 0.657–0.765). Here, the balanced prevalence of 49.9% (95% CI, 45.6–54.2%) is reflected in balanced predictive values. Perihaematomal oedema volume was found to be the most important factor for prediction, suggesting that inflammation may have an important effect on clinical outcome. As in the case of haematoma expansion, both sensitivity and AUC suffered a strong reduction using radiological signs compared to the model trained with NCCT radiomics-based features. Again, for the case of combined NCCT radiomics-based features and radiological signs, we obtained a model which was identical to the one trained using NCCT radiomics-based features alone. Moreover, demographic and clinical factors showed slightly better AUC than NCCT radiomics-based features, explained by an improved specificity. Again, by combining these factors and NCCT radiomics-based features, the best performance was achieved, with an AUC of 0.818 (95% CI, 0.781–0.854), mainly due to increased PPV and NPV of 0.799 (95% CI, 0.747–0.852) and 0.73 (95% CI, 0.68–0.781), respectively. The features accountable for this improvement were age and ultra-early haematoma growth. Lastly, treatment allocation again had no influence on the radiomics-based models and was discarded by elastic net.
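The link between prevalence and predictive values noted for both outcomes follows from Bayes' rule. The sketch below recomputes PPV and NPV from the reported testing-set sensitivity, specificity, and prevalence of the radiomics models; the function name is illustrative, and small differences from the reported values are expected because the inputs are rounded.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Haematoma expansion: prevalence 26.3% gives PPV ~0.422 and NPV ~0.841.
he_ppv, he_npv = predictive_values(0.635, 0.690, 0.263)

# Poor functional outcome: prevalence 49.9% gives PPV ~0.729 and NPV ~0.711.
fo_ppv, fo_npv = predictive_values(0.698, 0.741, 0.499)
```

With similar sensitivity and specificity in both models, the lower prevalence of haematoma expansion is what pulls its PPV down and pushes its NPV up, whereas the near-50% prevalence of poor functional outcome yields roughly equal predictive values.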

Discussion

We evaluated the predictive performance of NCCT radiomics-based features for haematoma expansion and poor functional outcome using generalised linear models on a large and heterogeneous sample. We also investigated the predictive performance of radiological signs and clinical factors, and their combination with radiomics-based features with the same type of models. This analysis showed that radiomics-based features have higher predictive performance than radiological signs and perform similarly to clinical factors. Moreover, this work showed that prediction performance is not improved by incorporating radiological signs, but it is boosted by the inclusion of demographic and clinical factors.

Since our analyses are based on binary outcomes, haematoma expansion and the modified Rankin scale were dichotomised. Our choice of dichotomisation for haematoma expansion was identical to that of the TICH-2 trial [20] and has also been used in several retrospective analyses [10, 34,35,36]. Our choice for the modified Rankin scale is based on the observation that intracerebral haemorrhage is in general a severe type of stroke and a score of 3 or less would still be considered a “good” outcome. This choice has also been previously made for the primary analysis of large clinical trials of ICH, such as STICH [37] and CLEAR III [38].
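The two dichotomisations can be written down directly from the definitions given in the Methods (volumetric growth of > 6 mL or > 33%, and a modified Rankin scale score of 4 to 6). In this sketch the function names are illustrative, and the relative criterion is assumed to be measured against the baseline volume.

```python
def haematoma_expansion(baseline_volume_ml, followup_volume_ml):
    """True if growth exceeds 6 mL or 33% of the baseline volume (assumed reference)."""
    growth = followup_volume_ml - baseline_volume_ml
    # Multiplying instead of dividing avoids a division by zero for tiny baselines.
    return growth > 6.0 or growth > 0.33 * baseline_volume_ml


def poor_functional_outcome(mrs_day90):
    """True for a day-90 modified Rankin scale score of 4, 5 or 6."""
    if mrs_day90 not in range(7):
        raise ValueError("modified Rankin scale must be an integer from 0 to 6")
    return mrs_day90 >= 4
```

For instance, growth from 10 mL to 14 mL counts as expansion through the relative criterion (40%), while growth from 30 mL to 35 mL meets neither criterion.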

Our reported diagnostic performance of radiological signs was not identical to that of a previous study using data from the TICH-2 clinical trial [13]. The main sources of difference were that our study included fewer participants and that we performed the analysis using the combined set of signs as predictor features simultaneously, rather than testing each one individually. Nevertheless, our results are consistent with this previous report in terms of the relatively low sensitivity shown by these signs.

In addition to their poor sensitivity, radiological signs are qualitative measures with subjective definitions and are therefore susceptible to inter- and intra-rater variability. On the other hand, the proposed radiomics-based generalised linear model provides a quantitative and consistent way of predicting HE and poor functional outcome, yielding better sensitivity and AUC. However, despite the performance of our radiomics approach, it is not currently available for real-time application and would require significant further development and validation for clinical use.

Many important variables selected by our radiomics-based model are consistent with previous findings. Amongst them are features based on Laplacian of Gaussian filters, which are indicative of haematoma heterogeneity and have good predictive value for its expansion [39], and extent of perihaematomal oedema, which has a predictive association with poor functional outcome [40, 41]. Also, our observation that the addition of the time from symptom onset to baseline scan to our radiomics-based model improves HE prediction is consistent with the meta-analysis by Al-Shahi Salman et al [14]. Finally, the conclusion that incorporating age and ultra-early haematoma growth as factors substantially improves functional outcome prediction has strong support in prior studies [30, 42, 43].

An important aspect of constructing a NCCT radiomics-based model for prediction is the harmonisation of the extracted raw features. This is especially crucial in the context of data coming from multicentre studies such as TICH-2, where the stability of the radiomics features can be severely affected by differences in imaging protocols [44]. Our investigation suggests that slice thickness is a very important factor to consider for harmonisation, which is in agreement with previous findings [45]. However, there are additional relevant sources of difference (for which we had no exhaustive information) that could have been considered such as scan type (helical/axial), slice gap, reconstruction kernel, field of view, tube voltage, and milliamperage [46]. Future radiomics studies should take all these sources of variability into account to achieve optimal harmonisation.

A relevant observation is that our model does not reach the same predictive performance for HE as recent studies using radiomics-based linear models [18, 19], which show substantially better performance. This may be because those studies were performed on significantly smaller samples from 4 or fewer centres, which may have much lower variability in terms of CT parameters. This highlights the challenges that radiomics-based models face when dealing with data that better reflect the “real-world” variability present in multicentre studies. In order to provide a reliable tool for research and clinical application, it is also essential to tackle the current challenge of reproducibility in advanced clinical imaging. In radiomics, potential sources of variability arise from image pre- and post-processing, discrepancies in the way the segmentation of regions of interest is performed, and differences in the number and type of radiomics features computed. Standardisation efforts such as IBSI [22] are a step in the right direction and should be adopted in future radiomics studies. Furthermore, we have found it beneficial to perform post hoc harmonisation of discrepant features when aligning acquisition protocols amongst many centres is not feasible or practical, as in the case of our study.

Finally, this is a first approach using elastic net, where the linear contributions of each feature can be assessed in a straightforward manner. Other machine learning approaches, such as random forest or deep learning, can be investigated to improve the detection of non-linear interactions between variables and potentially improve prediction performance. For example, a comparison of different classifiers similar to the one performed by Li et al [17] may be pursued, but in a much larger sample.

In conclusion, we showed that models using NCCT radiomics-based features outperform models using radiological signs and perform similarly to models using clinical factors. Moreover, we found that incorporating demographic and clinical factors into a radiomics-based model improves the prediction. These results suggest that radiomics-based models, with added demographic and clinical factors if available, may be of prognostic value in people with ICH. Hence, they could be incorporated into therapeutic trials to aid selection of those at risk of haematoma expansion or poor functional outcome once further work has been performed to address the challenges around predictive accuracy and variability of radiomics features in multicentre studies.