Study cohort
This retrospective study was approved by the ethics committees of three participating hospitals (hospital A, hospital B, and hospital C). Head CT scans from 3129 subjects were initially collected, with 2102 from hospital A, 511 from hospital B, and 516 from hospital C. All subjects were from the Asian population. The detailed study cohort design is described in Supplementary Material. After careful slice-wise review and annotation by three independent experienced radiologists (with 10, 12, and 16 years’ experience in interpreting head CT scans, respectively), 293 cases were excluded from further analysis due to incomplete information or serious imaging artifacts. The remaining 2836 cases were finally used in our study, including 1836 subjects with ICH and 1000 normal subjects. We intentionally kept such a high ICH prevalence (65%) in this dataset to ensure that there were sufficient positive samples to benefit the learning process of the algorithms as well as to effectively evaluate our algorithms with sufficient positive and negative samples. Table 1 shows the demographic characteristics of these subjects. The differences of patient age and sex distribution between the non-ICH group and ICH group were tested using ANOVA and χ2 test, respectively, with p values reported in Table 1. Statistical significance for both age and sex distributions between these two groups is consistent with previous findings that the incidence ratio of ICH tends to be higher in males and in more aged subjects [25,26,27,28,29]. Subjects in the ICH group were further categorized into five subtypes according to the location of ICH on both the slice-level and the subject-level: CPH, IVH, SDH, EDH, and SAH. It is possible for some subjects with ICH presence to have more than one subtypes (i.e., mixed subtypes). Table 2 shows the inter-rater annotation agreement among the three radiologists. The majority vote of these three senior radiologists’ annotations (slice-level and subject-level bleeding as well as subtypes) was used as the gold standard. Examples of scan slices used in this study are shown in Fig. 1.
Table 1 Demographic information of subjects used in this study Table 2 Subject-level and slice-level scoring variability assessment of three radiologists on the diagnosis of ICH and five subtypes Non-contrast CT imaging protocol
Head CT images used in this study were acquired by scanners from different manufacturers. The scanning parameters were different among these three institutions, with details listed in Supplementary Table 1.
Data pre-processing
To feed the data for training, we first performed pre-processing of the original CT images with the following steps. All image slices were resampled to 512 × 512 pixels if necessary and then downsampled to 256 × 256 pixels to reduce GPU memory usage. The original slice number of each scan was kept. To better account for the high dynamic intensity range while preserving the details for different objects of interest, we chose three different intensity windows to normalize images, with details described in Supplementary Material.
Prediction models and workflow
To reduce redundancy, hereinafter, we refer to the scenario that only subject-level ground truths were used in training as Sub-Lab, and the scenario that subject-level labels together with slice-level labels were used in training as Sli-Lab. Furthermore, we refer to the task of predicting whether a subject and its slices contain bleeding or not as a two-type classification, while the task of predicting the bleeding subtype(s) of an ICH-positive subject and the associated slices as a five-type classification. Our framework can be used for both two-type and five-type classification under both Sub-Lab and Sli-Lab settings. Specifically, this algorithm is composed of a CNN component followed by a RNN component to mimic how radiologists interpret scans. The CNN component focuses on extracting useful features from image slices. The RNN component makes use of these features and generates the probability of ICH or a subtype. The RNN component is particularly useful for capturing sequential information of features from consecutive slices, adding inter-slice dependency context to boost classification performance (please refer to Supplementary Figure 1 for an illustration of our algorithm; more detailed description can be found in Supplementary Material).
In our prediction workflow, we first carried out two-type classification to determine if ICH was present in a subject. If a subject was predicted to be ICH-positive, five-type classification was performed to decide if this subject belonged to any of the five subtypes. This workflow is demonstrated in Fig. 1.
Training procedures
We split the entire subjects randomly into training (80%), validation (10%), and testing set (10%). Data distribution for two-type and five-type classification tasks is shown in Supplementary Table 2. The training set was used to optimize model parameters while the validation set was used to avoid overfitting to the training set. The testing set was reserved for final evaluation of our models. Training and testing schemata are illustrated in Fig. 2. Training for ICH detection (two-type task) and its subtypes (five-type task) was performed under two settings: Sub-Lab and Sli-Lab (more details about the training process are elaborated in Supplemental Material).
Model visualization
A disadvantage of deep learning models is their lack of transparency and explanability [30, 31]. To improve the explainability of our models, we generated a coarse localization map that highlighted important regions in the image leading to the decision of the algorithm using the Grad-CAM method [31]. The localization map on each slice was generated with our fully trained algorithm, which neither affected the algorithm training process nor required manual annotation of bleeding areas for supervised training. This visualization technique might also be adopted by radiologists as a guidance for interpretation (more details are provided in Supplementary Material).
Statistical analysis
All statistical analyses were performed using the python package scikit-learn, while statistical plots were generated with matplotlib. We evaluated the performance of algorithms using statistical metrics including accuracy, sensitivity, specificity, F1 score, and area under the curve (AUC). We used 0.5 as the threshold to convert probabilities into binarized class labels, i.e., a probability no smaller than 0.5 was considered ICH-positive and a probability smaller than 0.5 to be ICH-negative.
Diagnosis from additional radiologists and trainees
We additionally invited three junior radiology trainees and an additional senior radiologist to provide subject-level diagnosis on the 299 CT scans in the testing set for performance comparison with the automated algorithm (more details about these head CT interpreters can be found in Supplementary Material).