Introduction

Intracerebral hemorrhage (ICH) is the most severe form of stroke, with 1-month morbidity and mortality approaching 50% and death or severe disability exceeding 75% [1,2,3]. In contrast to recent advances in interventional treatment of ischemic stroke, recent trials did not show a beneficial effect of medical treatment or surgical intervention on mortality or functional outcome in ICH patients [4, 5]. Accurate stratification of ICH prognosis is highly desirable regardless of the available therapeutic options and remains a clinical research priority [6]. Several prognostic tools have therefore been proposed for predicting mortality and functional outcome in spontaneous ICH [7]. Although these tools are potentially useful for ascertaining prognosis and facilitating communication between clinicians, numerous methodological and reporting deficiencies have been described for the majority of them [7]. Interest in machine learning (ML) for augmenting diagnosis and prognosis is growing in the medical field, owing to the wide range of applications of these algorithms and the increasing availability of computational power. ML is a type of artificial intelligence that learns patterns and rules from given information [8]. Recent studies have applied ML to severity and outcome prediction models for neurological disorders such as ischemic stroke [8], aneurysmal subarachnoid hemorrhage [9], and traumatic brain injury [10]. In the field of ICH, however, ML approaches have mainly focused on prompt diagnosis and automated volume quantification [11, 12], and algorithms for predicting clinical outcome are lacking. Recently, Wang et al. were among the first to develop an ML-based outcome prediction model incorporating initial clinical presentation, laboratory data, and imaging findings [13].
In that model, imaging findings were limited to ICH volume and location, presence of intraventricular hemorrhage, ventricle compression, and midline structure shift [13]. Further integration of quantitative imaging characteristics may hold additional prognostic value [9]. Specific CT markers and histogram-based analyses of ICH heterogeneity have previously been linked to poor clinical outcome, which reinforces this notion [14,15,16]. The goal of this study was twofold. First, we hypothesized that quantitative radiomic filter- and texture-derived high-end image features extracted from non-enhanced computed tomography (NECT) brain scans can be used to predict the clinical outcome of ICH patients. To test this hypothesis, we employed a radiomics-based ML approach on NECT brain scans of patients presenting with acute primary ICH [17]. Second, we hypothesized that the diagnostic power of the presented algorithm using high-end image features is equivalent to that of the ICH Score, the most widely used prognostic model for predicting mortality [18].

Materials and Methods

Study Population

We retrospectively analyzed the databases of three university hospitals (University Medical Center Hamburg-Eppendorf, Charité University Medical Center Berlin, University Medical Center Münster), each with a high-volume tertiary stroke center, for patients aged ≥18 years with ICH between January 2010 and April 2019. The inclusion criterion was spontaneous ICH confirmed on admission NECT. Patients were excluded if they had a secondary ICH from head trauma, hemorrhagic transformation of ischemic infarction, brain tumor, cerebral aneurysm, or vascular malformation. Baseline patient characteristics were retrieved from medical records, including Glasgow Coma Scale (GCS) at admission and modified Rankin Scale (mRS) at discharge. Additionally, we obtained vascular risk factors, blood pressure parameters, antiplatelet and oral anticoagulation (OAC) medication, and follow-up procedures such as craniectomy or intraventricular drainage placement from patients’ clinical records and follow-up CT. A binary clinical outcome was defined based on mRS at discharge, with mRS ≤3 as good outcome and mRS >3 as poor outcome [19]. According to the inclusion criteria, 520 patients were included, of whom 151 (29%) had a good outcome (mRS 0–3) and 369 (71%) had a poor outcome (mRS 4–6). Details are listed in Table 1. This multicenter retrospective study was approved by the ethics committees (Ethik-Kommission der Ärztekammer Hamburg, Ethik-Kommission der Charité Berlin), and written informed consent was waived by the institutional review boards. All study protocols and procedures were conducted in accordance with the Declaration of Helsinki. The deidentified data and analytic code are available from the corresponding author upon reasonable request.

Table 1 Baseline demographic, clinical, and radiological characteristics of study cohort

Image Acquisitions

The NECT scans were performed using standard clinical parameters with axial < 5 mm section thickness. All datasets were inspected for quality and excluded in case of severe motion artifacts. In detail, the images were acquired on the following scanners: 256 slice scanner (Philips iCT 256) with 120 kV, 280–320 mA, < 5.0 mm slice reconstruction; 80 slice scanner (Toshiba Aquilion Prime) with 120 kV, 280 mA, < 5.0 mm slice reconstruction and < 0.5 mm in-plane resolution; and 2 × 128 slice scanner (SOMATOM Definition Flash) with 120 kV, 280 mA, < 5.0 mm slice reconstruction and < 0.5 mm in-plane resolution.

Post-procedure Evaluations

NECT scans were obtained and stored for further evaluation. Two experienced neuroradiologists (JN and SE) assessed and documented the following imaging features on NECT scans: (1) intraventricular hemorrhage; (2) ICH location; (3) craniectomy in the follow-up NECT scans. ICH locations were classified as basal ganglia, thalamus, lobe, brain stem, pons, and cerebellum. Subsequently, ICH volumes were segmented semi-automatically on the basis of the original NECT images [20]. Regions of interest (ROIs) were delineated using Analyze 11.0 Software (Biomedical Imaging Resource, Mayo Clinic, Rochester, MN). Consensus ROIs were derived from the overlapping segmentations of both readers. Both readers were blinded to all clinical information and bleeding location. Discrepancies were settled by joint discussion between the two readers and a third reader (UH). JN and SE each had 3 years of clinical experience in diagnostic neuroradiology in an academic full-service hospital, and UH had 8 years of clinical experience in diagnostic neuroradiology; the research of all three readers focuses on clinical applications of image processing and predictive modelling.

ICH Score

ICH Scores were obtained for every patient included according to the definition of Hemphill et al. based on five independent and multidimensional predictors (ICH volume, infratentorial location, GCS, age, and intraventricular extension) [18]. ICH volumes were obtained from ICH delineations. Oral anticoagulants (OAC) were not included as their addition does not increase the prognostic performance of the ICH Score [21]. As the ICH Score is a prognostic model for 30-day mortality in ICH patients (equivalent to mRS 6), a binary mortality outcome was defined based on mRS at discharge with mRS ≤ 5 (survival) and mRS = 6 (death).
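
The scoring and outcome definition described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the point values and thresholds are recalled from the original ICH Score publication [18] and are not stated in this text, and the function and parameter names (`ich_score`, `died_at_discharge`, `volume_cm3`) are hypothetical.

```python
def ich_score(gcs, volume_cm3, ivh, infratentorial, age):
    """Sum the five ICH Score components into a total between 0 and 6.

    Point values follow the original ICH Score definition (assumption:
    recalled from the literature, not given in the text above).
    """
    if not 3 <= gcs <= 15:
        raise ValueError("GCS must lie between 3 and 15")
    score = 0
    if gcs <= 4:            # GCS 3-4: 2 points
        score += 2
    elif gcs <= 12:         # GCS 5-12: 1 point
        score += 1
    if volume_cm3 >= 30:    # hematoma volume of 30 cm3 or more
        score += 1
    if ivh:                 # intraventricular extension
        score += 1
    if infratentorial:      # infratentorial origin
        score += 1
    if age >= 80:           # age of 80 years or more
        score += 1
    return score


def died_at_discharge(mrs):
    """Binary mortality outcome: mRS 6 is death, mRS 0-5 is survival."""
    if not 0 <= mrs <= 6:
        raise ValueError("mRS must lie between 0 and 6")
    return mrs == 6


# Hypothetical patient: GCS 9, 35.2 cm3 hematoma with intraventricular extension
example_score = ich_score(gcs=9, volume_cm3=35.2, ivh=True,
                          infratentorial=False, age=70)
```

The hematoma volume passed to such a function would come from the segmented ROI, as described above.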

Imaging-Based Outcome Prediction

Radiomic features were defined according to the PyRadiomics Python package v2.1.0. Features were extracted from consensus ROIs after resampling to 0.5 mm × 0.5 mm × 2 mm resolution using SimpleITK (sitk) B-spline interpolators. Resampling was performed to ensure comparability of texture analysis. Extracted features comprised 252 first-order features (of which 18 were based on unfiltered images, 144 on wavelet decompositions, and 90 on Laplacian of Gaussian (log-sigma) filters), 952 texture features (of which 68 were based on unfiltered images, 544 on wavelet decompositions, and 340 on Laplacian of Gaussian (log-sigma) filters), and 14 shape features. In total, 1218 quantitative image features were extracted from the ICH ROIs. To adjust for effects of therapeutic interventions that cannot be detected on admission NECT scans, we included decompressive craniectomy as the sole clinical parameter in the machine learning models.
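
To make the distinction between first-order and texture features concrete, the following toy sketch compares two tiny images that share the same gray level histogram but differ in spatial arrangement. It is a deliberately simplified illustration, not the PyRadiomics implementation: the co-occurrence matrix here uses a single horizontal offset and is not symmetrized, and the images and function names are made up for this example.

```python
import math
from collections import Counter


def first_order(image):
    """Mean and Shannon entropy (bits) of the gray level histogram."""
    voxels = [v for row in image for v in row]
    n = len(voxels)
    mean = sum(voxels) / n
    counts = Counter(voxels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return mean, entropy


def glcm_contrast(image):
    """Contrast of a gray level co-occurrence matrix for horizontal neighbors.

    Simplification: one offset (right-hand neighbor), no symmetrization,
    no averaging over directions.
    """
    pairs = Counter()
    for row in image:
        for left, right in zip(row, row[1:]):
            pairs[(left, right)] += 1
    total = sum(pairs.values())
    return sum(c / total * (a - b) ** 2 for (a, b), c in pairs.items())


# Same histogram (four 0s, four 1s), different spatial pattern
blocks = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
checkerboard = [[0, 1, 0, 1],
                [1, 0, 1, 0]]

same_first_order = first_order(blocks) == first_order(checkerboard)
contrast_blocks = glcm_contrast(blocks)
contrast_checkerboard = glcm_contrast(checkerboard)
```

Both images yield identical first-order values, whereas the co-occurrence contrast is three times higher for the checkerboard, which is the kind of spatial information that texture features add.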

ML-based classification was performed using random forest algorithms (Python scikit-learn environment v0.20.3 [22]). Random forest is an ML technique that utilizes multiple decision trees trained on random sub-selections of samples in order to improve stability and reduce overfitting of the algorithm [23]. Decision trees learn decision rules according to predictor values of the training data samples. With increasing depth of nodes, decision trees can represent more complex decision rules, resulting in a closer fit of the model to the training data [23, 24]. Hyperparameter tuning (total number of features, number of trees, maximum depth of the tree, minimum number of samples to split an internal node, number of features considered for splitting (mtry), minimum number of samples at leaf node, bootstrapping yes/no) was performed in a nested 5-fold cross-validation approach for each training set using grid search algorithms. Initial parameter values were set to the scikit-learn defaults.

Features with the highest predictive value were selected separately for each training dataset of the outer 5-fold cross-validation loop according to Gini impurity measures [25]. Classifier models were then trained and tested on each set’s unique training and testing samples (outer loop), utilizing the optimized hyperparameters and feature importances of the respective training data (inner loop).
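
The nested sample split can be sketched as follows. This is a minimal illustration of the index bookkeeping only, under stated assumptions: the folds are shuffled but not stratified by outcome (the text does not say whether stratification was used), the seed and the function names are hypothetical, and no model is fitted here.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle the given indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=5, k_inner=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits holds (inner_train, inner_val) pairs drawn only from
    outer_train, so tuning and feature selection never see outer_test.
    """
    outer_folds = k_fold_indices(range(n_samples), k_outer, seed)
    for o, outer_test in enumerate(outer_folds):
        outer_train = [i for j, fold in enumerate(outer_folds)
                       if j != o for i in fold]
        inner_folds = k_fold_indices(outer_train, k_inner, seed + 1 + o)
        inner_splits = []
        for v, inner_val in enumerate(inner_folds):
            inner_train = [i for j, fold in enumerate(inner_folds)
                           if j != v for i in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# 520 patients, 5 outer folds, 5 inner folds per outer training set
splits = list(nested_cv_splits(n_samples=520))
```

In such a scheme, grid search and feature ranking would run on the inner splits, and the resulting configuration would be refitted on `outer_train` and evaluated once on `outer_test`.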

Integration of ICH Score and Imaging-Based Outcome Prediction

Combinations of classification models trained on heterogeneous predictors have been shown to yield greater synergistic effects when their information is merged at a very late stage of the data evaluation process. Therefore, survival probabilities were extracted from both the ICH Score and the imaging-based classifier, and the arithmetic average of the two probabilities was used for outcome prediction.
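
The averaging step can be written down directly. The sketch below assumes that both models already provide a survival probability between 0 and 1; how the ICH Score was mapped to a probability is not described in this text, and the 0.5 decision threshold and the function names are assumptions made for illustration.

```python
def fuse_probabilities(p_ich_score, p_imaging):
    """Arithmetic mean of two survival probabilities (late fusion)."""
    for p in (p_ich_score, p_imaging):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return (p_ich_score + p_imaging) / 2.0


def predict_survival(p_ich_score, p_imaging, threshold=0.5):
    """Classify as survival when the fused probability reaches the threshold.

    The threshold of 0.5 is an assumption, not a value from the study.
    """
    return fuse_probabilities(p_ich_score, p_imaging) >= threshold


# Hypothetical case: the two models disagree, the fused estimate decides
fused = fuse_probabilities(0.9, 0.3)
survives = predict_survival(0.9, 0.3)
```

Because each model contributes only its output probability, either one can be retrained or replaced without touching the other.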

Statistics

Model validation and testing of all classifiers were conducted in a nested 5-fold cross-validation with independent training and validation sets in a model-external approach [26]. Accordingly, model selection and hyperparameter tuning were performed with grid search algorithms on each training data set using a second cross-validation layer. Model stability was examined through comparative analysis of 10 randomly permuted cross-validation sets.

Receiver-operating characteristic (ROC) curves were generated from prediction results of all cross-validation sets. Confidence intervals (CI) for sensitivities and specificities were bootstrapped (2000 replicates, pROC v1.15 [27] R-package). Bonferroni adjustments were applied to control for alpha error inflation.
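
For readers who prefer code to prose, the sketch below shows one way to compute a ROC AUC, a percentile bootstrap confidence interval, and a Bonferroni-adjusted significance level using only the Python standard library. It is not the pROC implementation used in the study: resampling within each outcome class and the percentile method are assumptions, and the toy labels and scores are invented.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases within each outcome class."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    replicates = []
    for _ in range(n_boot):
        sample = [rng.choice(pos_idx) for _ in pos_idx]
        sample += [rng.choice(neg_idx) for _ in neg_idx]
        replicates.append(statistic([labels[i] for i in sample],
                                    [scores[i] for i in sample]))
    replicates.sort()
    lower = replicates[int(alpha / 2 * n_boot)]
    upper = replicates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni adjustment."""
    return alpha / n_comparisons


# Invented toy data: 1 = event, 0 = no event
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 0.75: three of four positive-negative pairs are ranked correctly
ci_low, ci_high = bootstrap_ci(labels, scores, roc_auc, n_boot=2000, seed=0)
```

Any statistic with the same `(labels, scores)` signature, such as sensitivity at a fixed cut-off, can be passed to `bootstrap_ci` in place of `roc_auc`.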

Furthermore, the classifiers were analyzed using ROC areas under the curve (AUC), sensitivity, specificity, accuracy, Youden Index, positive predictive value, negative predictive value (ThresholdROC v2.8 R-package), and Matthews correlation coefficient (MCC) [28] metrics (psychometric v2.2 R-package). MCC takes all four fields of the confusion matrix into account and is considered a favorable measure for unbiased comparisons of binary classifiers [29]. With TP: true positives, TN: true negatives, FP: false positives, and FN: false negatives, MCC is defined as:

$$ \mathrm{MCC}=\frac{TP\times TN- FP\times FN}{\sqrt{\left( TP+ FP\right)\left( TP+ FN\right)\left( TN+ FP\right)\left( TN+ FN\right)}} $$
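
As a worked illustration of the definition above, the following sketch derives MCC together with the other confusion matrix metrics named in this section. The counts are invented for illustration and are not study data, and the function names are hypothetical; undefined ratios are returned as NaN, which is a design choice of this sketch.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Confusion matrix metrics, including Youden Index and MCC."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "youden_index": sensitivity + specificity - 1,
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Invented counts: 100 positive and 100 negative cases
metrics = classification_metrics(tp=70, tn=60, fp=40, fn=30)
```

With these counts, sensitivity is 0.70, specificity is 0.60, and the Youden Index is therefore 0.30; MCC evaluates to roughly 0.30 as well, although the two measures generally differ.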

A flow chart of the proposed ML-based prediction of the clinical outcome is depicted in Fig. 1.

Fig. 1

Conceptual overview of the proposed machine learning approach for intracerebral hemorrhage outcome prediction showing the major processing steps: CT based image acquisition and segmentation, feature extraction (n = 1218), and statistical learning (random forest algorithm). NECT, non-contrast-enhanced computed tomography; ICH, intracerebral hemorrhage; CT, computed tomography; mRS, modified Rankin Scale; CV, cross-validation set with i: inner loop and o: outer loop

Results

Our analysis included NECT images of 520 patients with acute ICH. One hundred fifty-one patients (29%) had an mRS of 0–3 and 369 (71%) had an mRS of 4–6. There were no statistically significant differences in the clinical parameters age (P value = 0.85), sex (P value = 0.85), hypertension (P value = 0.25), diabetes mellitus (P value = 0.62), antiplatelet or anticoagulant medication (P value = 0.5 and P value = 0.78, respectively), and systolic blood pressure at admission (P value = 0.75). Neither time from symptom onset to admission CT nor time from CT to hospital discharge differed significantly (P value = 0.92 and P value = 0.13, respectively). However, patients with mRS 4–6 had a significantly lower GCS (GCS 9 versus GCS 14; P value <0.001), a higher percentage of intraventricular hemorrhage (59% versus 33.1%; P value <0.001), higher ICH volumes (35.2 cm³ versus 8.4 cm³; P value <0.001), and a higher rate of supratentorial craniectomies (27.4% versus 10.6%; P value <0.02). There were no significant differences in ICH locations. The ICH Score was significantly higher in patients with mRS 4–6 (median 3 versus 1; P value <0.001).

Imaging-Based Outcome Prediction

Machine learning–based ROC AUCs of the validation sets for predicting functional clinical outcome were 0.80 (95% CI [0.77; 0.82]) for mRS ≤ 2, 0.80 (95% CI [0.78; 0.81]) for mRS ≤ 3, and 0.79 (95% CI [0.77; 0.80]) for mRS ≤ 4. Trained on survival prediction (mRS ≤ 5), the classifier reached a ROC AUC of 0.80 (95% CI [0.78; 0.82]), which was equivalent to that of the ICH Score with a ROC AUC of 0.80 (95% CI [0.79; 0.82]) (Fig. 2, Table 2). Excluding the parameter craniectomy (yes/no) had no effect on classification performance. Model selection and hyperparameter tuning within the nested cross-validation process resulted in the following median settings (medians over cross-validation sets) for mRS ≤ 2, ≤ 3, ≤ 4, and ≤ 5, respectively: number of features considered: 25, 100, 200, 100; number of trees: 750, 1000, 500, 1000; maximum depth of trees: 10 for all cut-off values; number of features considered for splitting (mtry), minimum number of samples to split an internal node, and minimum number of samples at leaf node: 1 for all cut-off values. Feature importance analyses of the mean top 100 predictors of all training data sets suggest that the features with the highest predictive power are mainly derived from wavelet (43%) and log-sigma (30%) filtered images. Unfiltered original images contributed 27% to total predictive power. Within feature classes, texture metrics dominated predictions (58%) (Fig. 3). The predictive power of the 15 most important features demonstrates the dominance of texture and shape features over first-order metrics (basic statistical measures of the gray level distribution). To also assess the predictive value of ICH volume alone, an additional ROC analysis was performed (Supplementary Figure 1). The ROC AUC for ICH volume as sole predictor was 0.72, with a Youden Index of 0.30 at 60% specificity and 70% sensitivity.

Fig. 2

Receiver-operating characteristics (ROC) curves for (a) functional outcome prediction of the proposed machine learning classifier based on quantitative image features and (b) prediction of survival using the ICH Score, the proposed machine learning classifier based on quantitative image features, and a classifier integrating ICH Score metrics and quantitative image features. AUC, area under the curve; CI, confidence interval; mRS, modified Rankin Scale

Table 2 Classification performance of imaging-based outcome prediction

Fig. 3

Predictive value of quantitative image features. Bar charts show the mean Gini impurity feature importance of the top-15 high-end image features across all cross-validation training sets. Pie charts show the distribution of feature classes and applied filters among the top-100 predictors used. First-order metrics: basic statistical metrics of the voxel gray level distribution; glcm: gray level co-occurrence matrix; gldm: gray level dependence matrix; glrlm: gray level run length matrix; glszm: gray level size zone matrix; H: high-pass wavelet decomposition; L: low-pass wavelet decomposition

Integration of ICH Score and Imaging-Based Outcome Prediction

ICH Score metrics reached a ROC AUC of 0.80 (95% CI [0.79; 0.82]), which was equivalent to the purely imaging-based classifier with ROC AUC of 0.80 (95% CI [0.78; 0.82]). If combined, the integrated model showed a significantly higher ROC AUC of 0.84 (95% CI [0.83; 0.86], P value <0.05). Sensitivities of the integrated model were significantly higher at Youden Index maximum cut-offs with 77% vs. 74% sensitivity at 76% specificity, P value <0.05 (Fig. 2, Table 2).

Discussion

In this study, we developed an imaging-based ML model for predicting the functional outcome of ICH patients. The proposed approach employing quantitative image features derived from NECT scans provided high discriminatory accuracy between good and poor functional outcome of ICH patients at different mRS cut-off values. This study is based on a large multicenter and heterogeneous imaging dataset of 520 patients that was acquired in clinical routine over almost a decade. The proposed classification is solely based on high-end image features without a priori information about the location of the hemorrhage and without controlling for factors such as patient conditions, image acquisition parameters, or scanner type. Observed classification performance and model stability across all nested cross-validation runs suggest sufficient generalizability of our results.

It is well established that ICH volume profoundly impacts functional clinical outcome. Initially used by Broderick et al. to predict 30-day mortality after ICH, ICH volume was later validated and included in the ICH Score [3, 18]. In line with these findings, we have shown that ML-based outcome assessment using ICH volume as sole predictor already achieves a ROC AUC of >0.70 (Supplementary Figure 1). Similarly, surrogate parameters of ICH volume such as maximum 2D diameter or minor axis length had comparatively high predictive importance in the imaging-based ML model. However, the total contribution of shape-based metrics to predictive power in the comprehensive model was only 19% at a ROC AUC of 0.80. It thus stands to reason that the ICH formation on NECT holds additional, relevant information that cannot be assessed by the human eye but can be evaluated by imaging-based ML algorithms. Accordingly, analyses of the 100 most powerful features demonstrate the importance of second-order features (e.g., texture metrics) in comparison to first-order features. In contrast to first-order measures, second-order metrics also capture information on the spatial distribution of gray levels, which is often difficult for the human visual system to evaluate. The predictive value of second-order features is particularly apparent in the high predictive power of the gray level non-uniformity (Fig. 3). This finding could be related to the heterogeneous appearance of hematomas that are still actively bleeding, with evidence of a spot sign, or of hematomas in anticoagulated patients that are at risk of further expansion. It is equally conceivable that gray level non-uniformity captures areas of hyperacute ICH such as the blend sign (blending of a hypoattenuating area and a hyperattenuating region relative to the surrounding brain parenchyma), which suggests hematoma expansion and, in turn, poor clinical outcome.

Hence, the proposed approach can be used as a supportive tool to augment conventional image analysis and to improve prognostic decision-making for both radiologists and clinicians. As precision medicine is an emerging concept [30], combining the ICH Score with high-end imaging features may be useful in this respect. In line with this, the ICH Score seems to be of limited applicability in critical care patients. In a prospective multicenter cohort study of patients with spontaneous ICH admitted to the intensive care unit (ICU), the ICH Score had only acceptable discriminatory power [31]. Although this remains speculative at this stage and is a subject for future studies, the proposed ML classifier may provide complementary information in this setting. In anticoagulation-associated ICH, the ICH Score may not be as reliable [21, 32, 33], and clinical outcomes in these patients are often substantially worse than in patients without oral anticoagulation (OAC) [34, 35]. Assuming that OAC therapy alters the morphology and intensity of ICH, radiomic features are likely to be affected by OAC therapy. As we trained the ML model on acute CT images of both patients receiving OAC and patients without OAC, information on OAC therapy is implicitly incorporated in the model through these differences in ICH imaging characteristics.

Since our quantitative imaging feature analysis performs on par with multidimensional scoring systems (e.g., the ICH Score), the proposed ML approach may be of value for randomized clinical trials. Efforts to optimize clinical research and randomized trials in ICH are ongoing [36]. The ML approach could simplify trial procedures by providing an imaging-based prediction of functional outcome or early mortality. At the same time, the multicenter design of this study accounts for local variations in practice, which need to be considered for successful trial planning. Furthermore, this approach may also be of value for telemedicine and remote prediction of ICH outcome in regions lacking neuroradiological specialists. Taken together, the proposed method combines the merits of quantitative radiomic features and ML algorithms and relates the employed predictors to well-known imaging characteristics.

Despite the promising results, several limitations deserve comment. Our study has general limitations typically associated with quantitative radiomics-based image analysis and classification [17, 37,38,39]. These include differences in image acquisition settings (e.g., size of the field of view, gantry tilt) and under- or overfitting of machine learning algorithms. Bias from these factors was minimized through (a) the use of NECT scans, which offer standardized HU metrics, and (b) the application of random forest algorithms, which are comparatively stable with regard to overfitting. The risk of overfitting was further reduced by evaluating multiple different models in a nested cross-validation approach. Furthermore, we observed study-specific limitations. First, we included a limited number of patients in a retrospective analysis. A larger sample in a prospective study design would certainly further improve the generalizability of our results. However, the observed model stability suggests sufficient robustness for evaluating the feasibility and limitations of the proposed algorithm. The dataset includes imaging data from 520 patients acquired over a relatively long period of almost a decade in three different centers. In such heterogeneous datasets, results of nested cross-validation approaches serve as a valid indicator for confirming feasibility and performance of the proposed classifier in the underlying clinical setting. Because quantitative imaging parameters and signal intensity processing of CT scanners are standardized and calibrated, we assume negligible bias on classifier performance in a generalized setting. Second, the manual definition of ROIs still implies a certain degree of observer dependence within the ML process.
To minimize its influence, we employed consensus segmentations from two independent readers and applied a semi-automated delineation method that was shown to have favorable inter- and intra-observer reliability and a high level of congruence with fully automated delineation [20, 40]. Furthermore, radiomic features have been found to be relatively stable with regard to variations in segmentation [41, 42]. Third, the lack of data on withdrawal and limitation of care is a further limitation [43]. Finally, long-term outcome data (e.g., mRS at 90 days and mortality), which might offer additional information, were not available for this study [44].

Conclusion

Quantitative imaging features of acute NECT evaluated by ML algorithms provide high discriminatory power in predicting functional outcome in patients with spontaneous ICH. Additional integration of the ICH Score increases the predictive power of the ML classifier, providing promising complementary results. The findings support the potential of ML algorithms to augment conventional image analysis, improve prognostic decision-making, and simplify trial procedures. In the near future, such ML techniques may play a pivotal role in determining optimized therapeutic regimes and predicting the prognosis of patients with ICH in an individualized manner.