1 Introduction

Traumatic Brain Injury (TBI) is a type of head injury that causes high mortality and physical disability worldwide [1]. Severe cases that meet the recommended criteria for surgery require urgent medical and surgical management. For this reason, an accurate and prompt diagnosis is essential for effective treatment by a medical professional.

Currently, computed tomography (CT) is one of the most common techniques for preliminary examination before any operative procedure. It provides a low-cost solution for doctors to diagnose TBI [2]. Doctors can obtain more information about patients from CT during diagnosis, follow-up, and decision-making on surgery [3]. CT is relied on to identify different conditions such as bony defects [4, 5], lung cancer [6, 7], sports-induced injuries [8], and COVID-19 [9,10,11]. CT provides a means of rapid examination for analyzing TBI in patients [12, 13]. It also allows doctors to detect hemorrhagic lesions and determine whether immediate surgery is required [14].

The recommended criteria for surgical consideration are detailed in [15, 16]. For example, patients are considered to need surgery when the thickness of epidural hematoma (EDH) exceeds 15 mm, the thickness of subdural hematoma (SDH) exceeds 10 mm, or the lesion volume of intraparenchymal hematoma (IPH) exceeds 50 mL. It is worth noting that three out of five subtypes of intracranial hemorrhage (ICH), namely EDH, SDH, and IPH, are covered by the surgical consideration criteria. Specifically, the thickness of extra-axial hemorrhage (EDH and SDH) and the volume of IPH are significant to surgical consideration. These three subtypes can be distinguished by their shape and position. Figure 1 shows images of a healthy brain and of these three subtypes of hemorrhage. As shown in Fig. 1b, EDH appears as a biconvex region of bleeding between the dura and the skull. Distinct from EDH, SDH is a collection of blood with a concave shape that lies between the dura and the arachnoid mater, as shown in Fig. 1c. Even though EDH and SDH occupy different layers of potential space outside the brain, both are often adjacent to the skull on CT scans. Figure 1d shows the hemorrhagic regions of IPH, which are observable in the brain parenchyma on CT scans. However, the shape of IPH is irregular. In order to estimate the volume of IPH, it is assumed in this paper that IPH has a spherical shape.
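The example thresholds quoted above can be written as a simple rule check. The following is a minimal sketch only: the function and constant names are hypothetical, and the thresholds are the illustrative values given in the text rather than a complete clinical guideline.

```python
# Example thresholds quoted in the text (illustrative, not a full guideline).
EDH_THICKNESS_MM = 15.0   # epidural hematoma thickness
SDH_THICKNESS_MM = 10.0   # subdural hematoma thickness
IPH_VOLUME_ML = 50.0      # intraparenchymal hematoma volume


def meets_surgical_criteria(subtype, thickness_mm=None, volume_ml=None):
    """Return True when the measurement exceeds the quoted threshold.

    The name and signature are hypothetical; EDH and SDH are judged by
    thickness, IPH by volume, as described in the text.
    """
    if subtype == "EDH":
        return thickness_mm is not None and thickness_mm > EDH_THICKNESS_MM
    if subtype == "SDH":
        return thickness_mm is not None and thickness_mm > SDH_THICKNESS_MM
    if subtype == "IPH":
        return volume_ml is not None and volume_ml > IPH_VOLUME_ML
    raise ValueError(f"unsupported subtype: {subtype}")
```

A call such as `meets_surgical_criteria("SDH", thickness_mm=12.0)` flags the case for surgical consideration, while the same thickness for EDH does not.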

Fig. 1 Sampled CT scan images showing hemorrhagic lesions of different subtypes. (a) A normal brain without hemorrhagic lesions. (b) A biconvex shape between the dura and skull. (c) A concave shape between the dura and arachnoid. (d) An irregular shape in the area of brain parenchyma

The measurement of hemorrhagic lesions for their volume and thickness is often challenging due to the possibility that different types of hemorrhage can appear on the same CT scan. In order to estimate the thickness and volume of hemorrhagic lesions, it is necessary for the radiologists to know the subtype of each hemorrhage lesion. On this basis, the thickness and volume of each hemorrhage lesion are measured separately according to the exact subtype. In this study, the thickness of EDH and SDH is measured while the volume of IPH is calculated. However, it takes plenty of time to perform the manual measurement of thickness and volume as well as the segmentation of lesions.

Furthermore, due to the shortage of radiologists and other medical practitioners in some places such as those small hospitals in rural areas, additional techniques or tools are required to solve these problems. Rapid ICH diagnosis can help significantly reduce the death rate and boost the chances of survival for patients. This is essential for improving clinical outcome for patients. In this sense, it is practically significant to develop an intelligent algorithm that can be applied to detect different lesion types accurately and efficiently and to quantify the size of lesion for the early diagnosis of ICH.

According to the literature review, little attention has been paid to methods that estimate both thickness and volume for various subtypes of brain hemorrhage. A computer cannot follow the same diagnostic process as a human. A possible technique that can help the computer measure the thickness and volume of different hemorrhage subtypes is to evaluate their contour sizes separately. Nonetheless, few approaches to brain hemorrhage segmentation are appropriate for different subtypes of hemorrhagic region. In this study, a framework is proposed to measure both thickness and volume of each subtype of hemorrhagic lesion (EDH, SDH, and IPH) automatically. The main contributions of this paper are summarized as follows. On the one hand, a novel method is proposed to classify and segment different types of traumatic brain injury by integrating the features extracted from a double-branch deep neural network. This network consists of a modified transfer-learning-enhanced multi-label classifier and an optimal multi-class segmentation algorithm. On the other hand, a new quantitative assessment algorithm is put forward to measure the thickness and volume of lesions on three-dimensional (3D) head CT scans, where each 3D scan is a stack of many 2D images called slices. The proposed algorithm can help radiologists with diagnosis and decision-making on emergency surgery.

2 Related works

This section presents a review of the relevant literature. Then, a summary is made of all the previous studies on the methods of classification and segmentation for traumatic brain injury.

2.1 Classification model for traumatic brain injury

Over the past few years, prominent machine learning methods such as decision tree [17] and support vector machine [18, 19] have been proposed to detect hemorrhages. Despite the high accuracy achieved by most of these methods, they are still limited to detecting intracerebral hemorrhage, which is also known as intraparenchymal hemorrhage (IPH). In some recent studies, deep learning methods have been applied to the automatic detection of multiple types of hemorrhages [20]. Jnawali et al. [21] constructed an ensemble of three 3D convolutional neural networks (CNNs) for the detection of intracranial hemorrhage. Lee et al. [22] trained an ensemble model consisting of four deep convolutional neural networks (DCNNs) for small datasets. Burduja et al. [23] put forward a hemorrhage detection system based on a lightweight CNN with long short-term memory (LSTM). Additionally, other CNN-LSTM models [24, 25] have been proposed. He [26] combined the results of SE-ResNetXt50 and EfficientNet-B3 deep neural network architectures to detect intracranial hemorrhage and its subtypes on head CT scans. Though these models can help detect hemorrhage accurately, they are incapable of determining the location and size of ICH.

2.2 Segmentation method for traumatic brain injury

Convolutional Neural Networks (CNNs) can achieve outstanding performance in various computer vision tasks such as vehicle recognition [27, 28], image generation [29, 30], and the automatic segmentation of hemorrhagic lesions on CT scans. Farzaneh et al. [31] proposed an approach to SDH segmentation for TBI using a conventional feature extraction algorithm and a TreeBagger classifier. Remedios et al. [32] created a U-Net with transferred weights as multisite learning models (MSL). Hssayeni et al. [33] constructed a deep fully convolutional network (FCN) to segment the ICH regions on CT scans. Although the prior studies have produced impressive results in segmenting hemorrhagic lesions of different types, a problem remains when different types of hemorrhage appear on the same CT slice. To solve the problem of multi-class segmentation, Kuo et al. [34] proposed a patch-based fully convolutional neural network (PatchFCN) for acute intracranial hemorrhage on head CT. Though PatchFCN provided classification evaluation metrics with pixel-level supervision, the quantitative evaluation of various lesions was ignored. Monteiro et al. [35] designed an automatic segmentation system for head CT lesions with a DeepMedic [36] backbone and data augmentation. DeepMedic is a widely known dual-pathway 3D CNN architecture intended for medical image segmentation. Although PatchFCN and DeepMedic can distinguish between different types of hemorrhagic lesion, experts are still needed to estimate the size of lesions. Monteiro et al. [37] demonstrated the capability of a CNN for multi-class lesion quantification and detection. This study contributed to the multi-class lesion segmentation and volume evaluation of each hemorrhagic subtype. Nonetheless, the method grouped the contours of EDH and SDH together as extra-axial hemorrhage (EAH) and required the involvement of experts in assessing the quantitative information of hemorrhage.

2.3 Automatic quantitative information calculation for multiple subtypes of ICH

The factors that can influence medical diagnosis and surgical consideration include hemorrhage position, hemorrhage volume, surgical timing, and curative effect [38]. In order to determine the volume of acute ICH lesions automatically, Jain et al. [39] proposed an automated image analysis method based on an extension of the U-Net model, called icobrain, which can compute the volume and midline shift of acute intracranial lesions. However, it can identify only one category of hemorrhage per slice. Patel et al. [40] modified a 3D CNN architecture for the automatic segmentation of ICH in non-contrast CT exams. This modified 3D CNN model was applied to a single subtype of hemorrhage for estimating the volume. Chang et al. [41] adopted a custom faster mask R-CNN algorithm to detect and segment hemorrhage. Although mask R-CNN produced excellent segmentation performance with a high correlation score, the model estimated only the volume of IPH. Sharrock et al. [42] released public source code for ICH segmentation, known as DeepBleed. It was trained to detect ICH lesions and the occurrence of intraventricular hemorrhage (IVH) and SDH. In addition to the volume of a lesion, its thickness is another key indicator used for surgical consideration. To the best of our knowledge, however, no existing method addresses both thickness and volume of various hemorrhage subtypes.

The method proposed in this paper differs from the aforementioned approaches. Herein, a framework is put forward to estimate both thickness and volume of hemorrhage subtypes through a quantitative assessment algorithm, with the output from two different deep neural networks used. The implementation of the method will be detailed in the next section.

3 Datasets and proposed method

3.1 Datasets

In this study, there are three datasets of brain hemorrhage used to train and evaluate the proposed method. Both public and private datasets are included, among which two datasets (RSNA 2019 Brain Hemorrhage Challenge and PhysioNet) are public datasets. The CMU-TBI is a private dataset. Each dataset is detailed as follows.

3.1.1 RSNA 2019 brain hemorrhage challenge dataset

The Radiological Society of North America (RSNA) [43] dataset can be found on Kaggle. The objective of this competition is to identify the subtypes of ICH from brain CT scans. The dataset with annotations was collected and compiled by three research institutions located in North and South America. Because the dataset is large, comprising over 25000 CT scans labeled with five different subtypes of ICH, the competition attracted many developers and researchers from around the world. The dataset involves six categories of brain hemorrhage: epidural hemorrhage (EDH), intraparenchymal hemorrhage (IPH), intraventricular hemorrhage (IVH), subarachnoid hemorrhage (SAH), subdural hemorrhage (SDH), and any existing hemorrhage. The raw data was stored in DICOM files. The DICOM format provides not only the 512×512 pixel array but also header metadata. A total of 755948 slices were divided into 740829 slices for the training set and 15119 slices for the test set.

3.1.2 PhysioNet

The PhysioNet [33] dataset was collected at an Iraqi hospital between February and August 2018. Two radiologists annotated the presence of hemorrhage and the ICH subtypes. The dataset comprises 82 CT scans from 46 male and 36 female patients with an average age of 27.8 years. Each CT scan includes approximately 34 slices. A total of 2814 slices were extracted from these CT scans. The slices were split into 2233 slices for the training set and 581 slices for the test set.

3.1.3 CMU-TBI

This research has been granted ethical approval from the Ethics Committee of Faculty of Medicine, Chiang Mai University (CMU) and institutional review protocol. The head CT dataset includes the clinical data of 321 cases. There were about 30000 slices of a 1.5 mm thickness extracted from the Digital Imaging and Communications in Medicine (DICOM) series of CT Scans including 143 normal brains and 178 TBIs. The gender and age of patients are detailed in Table 1. The data of patients was collected from Maharaj Nakorn Chiang Mai Hospital. The slice numbers of detected EDH, SDH, IPH, SAH, and IVH were determined. The thickness of EDH and SDH, as well as the volume of IPH, were included as part of this dataset. Additionally, the data of those patients requiring surgical intervention was provided by the neuro-radiologists.

Table 1 Sex and Age Details for the CMU-TBI Dataset

Each 3D scan with a 1.5 mm slice thickness contains a set of 2D images ranging from 90 to 105 slices. Our investigation found that hemorrhage appeared most commonly in slices 20 to 90. Therefore, the total number of slices was reduced to 19946. Then, the dataset was split into 15956 slices for the training set and 3990 slices for the test set in this study.

The samples of each hemorrhage subtype in the three datasets used for training and testing the model are detailed in Table 2.

Table 2 The training and test samples of ICH subtypes in RSNA, PhysioNet, and CMU-TBI datasets

3.2 Proposed method

The objective of this study is to estimate the thickness and volume of hemorrhage. The quantity to be measured depends on the subtype of hemorrhage: radiologists measure the thickness of EDH and SDH, whereas the volume is calculated for IPH. Herein, an optimal framework is proposed on the basis of a double-branch deep neural network and a quantitative assessment algorithm. With the fine-tuned multi-label classification performed and the pre-trained multi-class segmentation algorithm adopted, the output features of both networks are treated as the input of the quantitative assessment algorithm to calculate the thickness and volume of different types of brain hemorrhage. The flowchart of our method is presented in Fig. 2. The raw CT scans of the CMU-TBI database are in DICOM format, including metadata and pixel data. The first branch refers to the process of training a multi-label classifier, while the second branch represents the task of multi-class segmentation. In order to achieve the final output of predicted thickness and volume, there are five major steps: data pre-processing and augmentation, multi-label classification, DICOM to NIfTI conversion, multi-class segmentation, and quantitative assessment. The details of each step are presented in the following sections.

Fig. 2 Overview of the workflow of our proposed method. The input is the 3D DICOM folder. The outputs include the predicted thickness of EDH or SDH, and the predicted volume of IPH

3.2.1 Data pre-processing and augmentation

The multi-label classifier model was trained on the RSNA 2019 Brain CT Hemorrhage Challenge dataset before its integration into our method. The original pixel values of the images from the RSNA dataset are expressed in Hounsfield units (HU), which represent the physical density of the tissue. Radiologists adjust the intensity window of HU images during diagnosis. Each window is defined by two parameters: the window center (WC) and the window width (WW). According to the method suggested in [23], three HU windows are considered, depending on the type of tissue. The WC and WW values of the three HU windows are:

  • Brain window (WC = 40, WW = 80)

  • Subdural window (WC = 80, WW = 200)

  • Soft tissue window (WC = 40, WW = 380)

The image of an HU window is grayscale. The results obtained from different HU windows were integrated into a single three-channel image as shown in Fig. 3. The original size of CT slices is 512×512 pixels. The CT slices were resized to 256×256 pixels before data augmentation. As a result, the shape of the three-channel input for the classifier is 256×256×3.
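The three-window pre-processing can be illustrated with a minimal plain-Python sketch. It assumes that each window clips HU values to the interval [WC − WW/2, WC + WW/2] and rescales them linearly to [0, 1]; this clipping convention and the function names are assumptions, not details taken from the paper. A practical implementation would vectorize the same arithmetic over the whole pixel array.

```python
# (WC, WW) pairs from the text: brain, subdural, soft tissue.
WINDOWS = ((40, 80), (80, 200), (40, 380))


def apply_window(hu, center, width):
    """Clip one HU value to the window and rescale it to [0, 1].

    Assumed convention: the window spans center - width/2 .. center + width/2.
    """
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


def to_three_channel(hu_slice):
    """Map a 2D list of HU values to a 2D list of 3-tuples.

    Each tuple holds the (brain, subdural, soft tissue) window values,
    i.e. the three channels of the classifier input.
    """
    return [
        [tuple(apply_window(v, c, w) for c, w in WINDOWS) for v in row]
        for row in hu_slice
    ]
```

For example, a pixel of 40 HU sits at the middle of the brain and soft tissue windows but in the lower third of the subdural window, so the three channels carry different contrast for the same tissue.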

Fig. 3 Data pre-processing and augmentation flow. Each DICOM file is pre-processed by extracting three different intensity windows (brain window, subdural window, soft tissue window) taken as three channels for RGB image

For data augmentation, 25 percent of all images were flipped horizontally and 10 percent were flipped vertically, and each image was randomly cropped by 0 to 25 pixels on each side. Finally, the pixel values of all images were normalized into the range of [0,1].
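One way to read the augmentation described above is as a per-image random policy. The sketch below is an interpretation under stated assumptions: the flip percentages are treated as independent per-image probabilities, the crop amount is drawn independently for each of the four sides, and the function name `augment` is hypothetical.

```python
import random


def augment(image, rng):
    """Return a randomly flipped and cropped copy of a 2D image (list of rows).

    Assumptions: horizontal flip with probability 0.25, vertical flip with
    probability 0.10, and an independent crop of 0..25 pixels on each side.
    """
    out = [row[:] for row in image]
    if rng.random() < 0.25:              # horizontal flip (mirror columns)
        out = [row[::-1] for row in out]
    if rng.random() < 0.10:              # vertical flip (mirror rows)
        out = out[::-1]
    top, bottom, left, right = (rng.randint(0, 25) for _ in range(4))
    height, width = len(out), len(out[0])
    # Slice with explicit end indices so that a crop of 0 keeps the full side.
    return [row[left:width - right] for row in out[top:height - bottom]]
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible between runs.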

3.2.2 Multi-label classifier

Recent CNN architectures, namely EfficientNet [44] and EfficientNet with noisy student training [45], were adapted for hemorrhage recognition. The adapted model was trained on the RSNA data and then taken as a pre-trained model. The original model architecture was modified by removing the last network layer and replacing it with a dropout layer (dropout rate of 0.15), followed by a fully connected layer with six output features, which equals the number of categories in the RSNA dataset. The sigmoid activation function was applied after the fully connected layer. In this way, the final output provided the probability of each ICH subtype appearing in each image.

For comparison, the EfficientNet models from B0 to B4 and the EfficientNet with noisy student training models from B0 to B4 were trained on the RSNA dataset for ten epochs. Transfer learning was applied in this study through the following process. Firstly, EfficientNet-B2 was taken as the pre-trained network due to its highest accuracy (97%) and reasonable number of parameters (7.77M). Secondly, the weights of the pre-trained model were transferred by fine-tuning the model on our CMU-TBI dataset. Finally, the multi-label classifier provides the probability of each hemorrhage subtype appearing on each CT slice. The six output features represent the categories of hemorrhage, including EDH, IPH, IVH, SAH, SDH, and an “exist or not” feature. These features form one input of the quantitative assessment algorithm used to identify the types of hemorrhage. The multi-label log loss was computed as the binary cross-entropy (BCE) loss over the six output probabilities (the probabilities of five hemorrhagic subtypes and one probability of any existing hemorrhage). The BCE loss is expressed as:

$$ \mathcal{L}_{\text{multi-BCE}}(y,\hat{y}) = -\sum\limits_{t=1}^{6}\left[y_{t}\cdot\log(\hat{y}_{t})+(1-y_{t})\cdot\log(1-\hat{y}_{t})\right] $$
(1)

where \(y_{t}\in \{0,1\}\) represents the ground truth label for class t, and \(\hat {y}_{t}\in [0,1]\) indicates the predicted probability for class t. To minimize the BCE loss, the Adam [46] optimizer was applied with a learning rate of 0.000125. Batch sizes of 32 and 16 were adopted for the training set and test set, respectively. Each training session ran for approximately two days on an NVIDIA Tesla M10 GPU using the Keras deep learning API. In this study, consideration was given only to the three subtypes (EDH, SDH, IPH) significant to decision-making on emergency surgery.
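As a worked illustration of Eq. (1), the loss for one slice can be computed directly from the six labels and six predicted probabilities, with the sum applied to both logarithmic terms. The sketch below is not the training code used in this study; the clipping constant `eps`, which avoids evaluating log(0), and the function name are assumptions.

```python
import math


def multi_label_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy summed over all labels, following Eq. (1).

    y_true holds 0/1 labels and y_pred holds probabilities in [0, 1].
    Predictions are clipped to [eps, 1 - eps] (an assumption) so that the
    logarithm is always defined.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total
```

With all six predictions equal to 0.5, every label contributes log 2 to the loss regardless of its true value, which gives a convenient sanity check.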

3.2.3 DICOM to NIfTI conversion

A multi-class segmentation method is required to assess the quantitative information of the hemorrhagic lesion. As one of the multi-class segmentation methods, DeepMedic [47] is based on a three-dimensional CNN architecture designed for the purpose of 3D segmentation. In the existing model, the format of NIfTI file is taken as the input data. Since NIfTI as a format of image is simpler than the DICOM format, it has been widely adopted for image processing and analysis [48]. Therefore, the conversion from DICOM to NIfTI is performed to prepare the data required for segmentation.

3.2.4 Multi-class segmentation

As a three-dimensional CNN for the accurate segmentation of brain lesion, DeepMedic [36] is comprised of eleven neural network layers. For the multi-class segmentation task on CT scans, an optimal DeepMedic model [37] was adopted. The model was modified to maintain the network architecture through residual blocks, batch normalization, and pre-activation blocks. Not only does the optimal DeepMedic outperform the existing medical image segmentation models such as U-Net [49] and UNet++ [50], it is also suitable for the tasks of multi-class segmentation. The optimal DeepMedic model was applied to obtain the multi-class segmentation mask for each slice of brain CT. Then, these output masks were used to classify the types and estimate the sizes of hemorrhage through the quantitative assessment algorithm. The samples of the predicted mask are shown in Fig. 4. The contours were separately colored according to each subtype of hemorrhage.

Fig. 4 The upper row represents the original images of different CT scans in gray. The bottom row shows the output images of the optimal DeepMedic segmentation method

3.2.5 Quantitative assessment

Herein, a quantitative assessment algorithm is proposed. In the function, the output probabilities of fine-tuned EfficientNet-B2 are taken from branch #1 and the output mask of optimal DeepMedic is taken from branch #2 of a double-branch deep neural network, with every point on the output mask treated as the input. The network architectures of a double-branch deep neural network are shown in Fig. 5.

Fig. 5 Network structure of a double-branch deep neural network based on fine-tuned EfficientNet-B2 and optimal DeepMedic. (a) Given the pre-processed 2D images and original 3D images, the feature extraction deep neural networks are covered by the gray area. (b-e) The details of each block in the main network

For each slice on a CT scan, the estimator is used to calculate the thickness and volume size of each contour separately. The contours are divided mainly into two groups. One is the contour that overlaps with the brain skull area including EDH and SDH types. The thickness estimator is applied to this group using Euclidean distance and the distance transform methods. The Euclidean distance provides the maximum and minimum lengths between a center point and other points. Based on these two lengths, the thickness ratio can be determined through calculation. The thickness ratio is a key factor to consider for distinguishing between EDH and SDH shapes. The distance transform is then applied to measure the thickness of a particular contour. The other is the contour of IPH inside the brain tissue area. The volume estimator function is applied to this group for the purpose of volume estimation.

The output of the quantitative assessment algorithm includes the thickness (in millimeters) and volume (in milliliters) estimated for each subtype of hemorrhage. The pseudocode of the quantitative assessment algorithm applied to each CT scan is given in Algorithm 1.

Algorithm 1 Quantitative assessment algorithm applied to each CT scan

The probability threshold (PTEDH, PTSDH, PTIPH), thickness ratio (TR), and volume ratio (VR) are chosen based on accuracy and error calculation, respectively. The selection of probability threshold and thickness ratio is detailed in Section 5. The description of each function is explained as follows:

findSkullMask(s) - a function used to find the mask of the brain skull. With the input of each slice s, the function is expressed as Algorithm 2.

Algorithm 2 findSkullMask(s)

euclideanDistance(pz,pc) - a function that generates the Euclidean distance-vector D containing the distances between the center point pc and all of the other points in the contour c. The function is calculated by means of

$$ D = \sqrt{(x_{z}-x_{c})^{2}+(y_{z}-y_{c})^{2}}, \quad \text{where } p_{c}=(x_{c},y_{c}),\ p_{z}=(x_{z},y_{z}),\ p_{c} \ne p_{z} $$
(2)
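The distance vector of Eq. (2) and a thickness ratio derived from it can be sketched as follows. The paper states only that the ratio is computed from the maximum and minimum distances; taking the centroid of the contour points as the center and defining the ratio as minimum over maximum are assumptions made here for illustration, and all function names are hypothetical.

```python
import math


def euclidean_distances(points, center):
    """Distances from the center to every contour point except the center (Eq. 2)."""
    cx, cy = center
    return [math.hypot(x - cx, y - cy) for (x, y) in points if (x, y) != center]


def centroid(points):
    """Mean position of the contour points (assumed choice of center point)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def thickness_ratio(points):
    """ASSUMPTION: ratio = minimum distance / maximum distance from the centroid.

    Values near 1 indicate a compact contour; small values indicate an
    elongated one.
    """
    d = euclidean_distances(points, centroid(points))
    return min(d) / max(d)
```

Under this assumed definition, a thin crescent-like contour yields a much smaller ratio than a compact lens-like contour, which is the kind of shape cue the text uses to separate SDH from EDH.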

distanceTransform(c) - a method used to replace each pixel p of the image with a distance to the nearest background pixel q. This method can be used to build the distance map DM. The output value of distance transformation is approximately half the actual lesion width. The distance map DM is expressed as

$$ DM(p) = \min\{d(p,q)|I(q)=0\} $$
(3)

where I(q) represents the pixel value of q.
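Equation (3) can be evaluated by brute force on a small binary mask, which makes the definition concrete. The sketch below is for illustration only: it is quadratic in the number of pixels, whereas library implementations use much faster algorithms. Doubling the peak of the distance map to obtain a thickness follows the remark that the transform returns roughly half the lesion width; the pixel spacing argument and function names are assumptions.

```python
import math


def distance_transform(mask):
    """Distance map DM of Eq. (3) for a 2D list of 0/1 values.

    Each foreground pixel receives the Euclidean distance to the nearest
    background pixel (value 0); background pixels receive 0.
    """
    background = [
        (r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v == 0
    ]
    if not background:
        raise ValueError("mask has no background pixel")
    return [
        [
            0.0 if v == 0
            else min(math.hypot(r - br, c - bc) for br, bc in background)
            for c, v in enumerate(row)
        ]
        for r, row in enumerate(mask)
    ]


def thickness_mm(mask, pixel_spacing_mm):
    """Approximate thickness as 2 * max(DM) * pixel spacing (an assumption)."""
    peak = max(max(row) for row in distance_transform(mask))
    return 2.0 * peak * pixel_spacing_mm
```

For a horizontal band three pixels thick, the peak of the distance map is 2, so the estimate is four pixels; the result is therefore an approximation of the true width, as noted in the text.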

findVolume(w,h,ps) - a function intended to estimate the volume of hemorrhagic lesions. This function is derived from the ABC/2 method [51]. The ABC/2 is a technique proposed by Kothari et al. to calculate the volume of hemorrhage, where A represents the value of maximum length (in cm), B indicates the width perpendicular to A on the same head CT slice, and C denotes the number of slices multiplied by the thickness of slice. Thus, the findVolume(w,h,ps) function is expressed as (4).

$$ V=\frac{(w \times ps) \times (h \times ps) \times \text{slice thickness}}{2 \times 1000} $$
(4)

In this study, slice thickness is set to 1.5 mm.
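Equation (4) translates directly into a short function. In this sketch, w and h are taken to be the contour width and height in pixels and ps the pixel spacing in millimeters per pixel; these interpretations of the arguments are assumptions, since the text does not define them explicitly. The division by 1000 converts cubic millimeters to milliliters.

```python
def find_volume_ml(w, h, ps, slice_thickness_mm=1.5):
    """Per-slice volume estimate following Eq. (4), returned in mL.

    Assumed meaning of the arguments: w and h are the contour extents in
    pixels, ps is the pixel spacing in mm per pixel. The default slice
    thickness of 1.5 mm is the value used in this study.
    """
    return (w * ps) * (h * ps) * slice_thickness_mm / (2 * 1000)
```

For example, a contour of 40 × 30 pixels at 0.5 mm per pixel measures 20 mm × 15 mm, which gives 20 × 15 × 1.5 / 2000 = 0.225 mL for that slice.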

4 Evaluation

The performance of the proposed method and that of the baseline methods are compared. Then, comparison results are categorized mainly into three sets of evaluations. In the first one, the classification results are compared between our fine-tuned EfficientNet-B2 and baseline methods. In the second one, the performance of the optimal DeepMedic is discussed using segmentation metrics. The last one demonstrates the results of classification and estimation for the size of hemorrhagic lesions based on a double-branch deep neural network built on a private CMU-TBI dataset.

4.1 The performance evaluations between fine-tuned EfficientNet-B2 and baseline methods

Five metrics are used to evaluate the classification performance: precision, sensitivity, specificity, F1-score, and accuracy. Each of them is calculated using the following equations:

Precision

$$ \text{Precision}=\frac{TP}{TP+FP}\times 100\% $$
(5)

Sensitivity or recall

$$ \text{Sensitivity}=\frac{TP}{TP+FN}\times 100\% $$
(6)

Specificity

$$ \text{Specificity}=\frac{TN}{TN+FP}\times 100\% $$
(7)

F1-score

$$ \text{F1-Score}=\frac{2 \times TP}{2 \times TP+FP+FN}\times 100\% $$
(8)

Accuracy

$$ \text{Accuracy}=\frac{TP+TN}{TP+FP+TN+FN}\times 100\% $$
(9)

where TP represents a true positive value, TN refers to a true negative value, FP denotes a false positive value, and FN indicates a false negative value. All metrics are converted into the percentage unit.
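Equations (5) to (9) can be computed together from the four confusion-matrix counts. The following minimal sketch uses a hypothetical function name and does not guard against empty denominators, which would need handling for degenerate cases such as a class with no positive samples.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the five metrics of Eqs. (5)-(9) as percentages."""
    return {
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "f1": 100.0 * 2 * tp / (2 * tp + fp + fn),
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
    }
```

With TP = 90, FP = 10, TN = 80, and FN = 20, the precision is 90%, the sensitivity is about 81.8%, and the accuracy is 85%, which shows how a model can look precise while still missing positive cases.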

The baseline methods include the decision tree-based Projection Profile [17], ICH U-Net [33], and UNet++ [50]. According to Table 3, the fine-tuned EfficientNet-B2 outperforms the baseline methods on the CMU-TBI dataset in terms of classification. The models were trained for 100 epochs to obtain the results. The scores of the fine-tuned EfficientNet-B2 are mostly higher than those of the ICH U-Net and UNet++ methods. The specificity of UNet++ is the lowest due to a large proportion of false-positive diagnoses. In other words, the model wrongly predicts normal brain as hemorrhagic lesion.

Table 3 The comparison in classification performance (%) between fine-tuned EfficientNet-B2 and the baseline methods on CMU-TBI dataset

Figure 6 shows the accuracy and loss charts of fine-tuned EfficientNet-B2 on the CMU-TBI dataset. During the training process, the model achieves higher accuracy and lower loss than in the testing process. Through comparison with the performance during the training process, it can be discovered that the accuracy and loss during the testing process converge and maintain consistency after 40 epochs. The output suggests that the performance of the model during the testing process did not improve with the increase in epoch.

Fig. 6 The accuracy and loss of fine-tuned EfficientNet-B2 on our CMU-TBI dataset

PhysioNet is the public dataset used in this study to evaluate the performance of our model. Even though the sensitivity score of our model is lower than that of U-Net and UNet++, it surpasses the baseline methods on the other metrics, as shown in Table 4.

Table 4 The comparison in classification performance between fine-tuned EfficientNet-B2 and the baseline methods on PhysioNet dataset

4.2 The performance evaluation between optimal DeepMedic algorithm and baseline methods

In order to quantify segmentation performance, two metrics are adopted: the Jaccard Index, also known as Intersection over Union (IoU), and the Dice score (Dice similarity coefficient). The Jaccard Index measures the agreement between the predicted output mask and the ground truth mask by dividing the overlapping area of the two masks by the area of their union. With Y representing the ground truth segmentation and \(\hat {Y}\) referring to the predicted output, the Jaccard Index and Dice score are written as (10) and (11), respectively.

$$ J(Y,\hat{Y}) = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|} $$
(10)

$$ D(Y,\hat{Y}) = \frac{2|Y \cap \hat{Y}|}{|Y|+|\hat{Y}|} $$
(11)

where ∩ denotes the intersection and ∪ represents the union of the two segmentations Y and \(\hat {Y}\), while |⋅| indicates the summation result of the argument. The values of Y and \(\hat {Y}\) range from 0 to 1.
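For binary masks, both metrics reduce to pixel counts, with |⋅| applied to the intersection and union as well as to each mask. The sketch below assumes masks given as 2D lists of 0/1 values and, as a further assumption, returns 1.0 for both scores when both masks are empty; the function name is hypothetical.

```python
def jaccard_and_dice(truth, pred):
    """Return (Jaccard Index, Dice score) for two binary masks of equal shape."""
    inter = sum(t * p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    size_t = sum(sum(row) for row in truth)
    size_p = sum(sum(row) for row in pred)
    union = size_t + size_p - inter
    if union == 0:
        # Both masks are empty: treated as perfect agreement (an assumption).
        return 1.0, 1.0
    return inter / union, 2.0 * inter / (size_t + size_p)
```

For two masks of three pixels each that share two pixels, the Jaccard Index is 2/4 = 0.5 and the Dice score is 4/6 ≈ 0.667, consistent with the identity D = 2J / (1 + J).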

The ICH U-Net [33], DeepBleed [42], and UNet++ [50] were treated as the baseline methods. The Jaccard Index and Dice score of the baseline methods and our method were calculated on the test set of the publicly accessible PhysioNet dataset, which includes 581 slices. As shown in Table 5, the optimal DeepMedic outperforms all of the baseline methods except UNet++ in both Jaccard Index and Dice score. However, UNet++ is incapable of multi-class segmentation and achieves a low sensitivity score.

Table 5 The comparison in the metrics of segmentation evaluation between optimal DeepMedic and the baseline methods on the PhysioNet dataset

Figure 7 shows the segmentation regions and evaluation metrics of the baseline model (UNet++) and our method on the PhysioNet dataset. In the first row, the green line represents the ground truth mask provided with the dataset and the red line represents the mask predicted by the UNet++ model. The regions of the ground truth mask and the predicted mask largely overlap. However, some false-positive regions are detected. The second row compares the ground truth with our method. Unlike the baseline model, our approach excludes many false-positive regions. Moreover, the optimal DeepMedic model and the baseline model were tested on the CMU-TBI dataset, as shown in Fig. 8. According to the segmentation output, our method covers more types of hemorrhagic lesion than the UNet++ model, including the small region (last image).

Fig. 7

Samples obtained from a validation set of the PhysioNet dataset along with the ground truth mask (green lines). The segmentation areas from the baseline (UNet++) are indicated by red lines. The outputs of our method are highlighted by blue lines

Fig. 8

Samples obtained from a validation set of the CMU-TBI dataset along with the ground truth mask (green lines). The segmentation areas from the baseline (UNet++) are indicated by red lines. The outputs of our method are highlighted by blue lines

4.3 The performance evaluation of our double-branch deep learning network with quantitative assessment algorithm on each subtype of hemorrhage in CMU-TBI dataset

In this part, the classification results of our method are compared across the types of hemorrhagic lesion, and the error metrics of the thickness and volume calculations are discussed. The classification task was analyzed by observing the classification metrics of our method on the CMU-TBI dataset. Of the 178 scans in total, the 56 CT scans with clearly differentiated lesion types were used as a validation set, which comprises 3130 slices. Each slice shows only one type of hemorrhage, that is, EDH, SDH, or IPH. Table 6 compares the performance of our method on each type of hemorrhage across different measurements. The hybrid method based on the quantitative assessment algorithm achieves its highest accuracy, 96.54 percent, when classifying SDH. The average accuracy over the three types of hemorrhage is 96.21 percent.

Table 6 The evaluation metrics of our method on CMU-TBI validation set for different subtypes of hemorrhage

4.4 The thickness and volume difference of EDH, SDH, and IPH between true and predicted values from our method

A total of 56 CT scans in the validation set obtained from the CMU-TBI dataset were used to calculate the difference in thickness and volume between the true values provided by the doctor and the values estimated by our method. Figure 9 shows the Bland-Altman plots of agreement between the ground truth and the predicted values. The mean difference in thickness is 2.99 mm (-0.42 to 6.42) for EDH and 0.97 mm (-2.41 to 4.35) for SDH. The mean difference in volume for IPH is 0.43 mL (-4.74 to 5.61).

Fig. 9

The Bland-Altman plots for lesion progression of the validation set derived from the CMU-TBI dataset
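The quantities behind a Bland-Altman plot can be sketched as follows. The sketch assumes that the bracketed ranges reported above are the conventional 95% limits of agreement (mean difference ± 1.96 times the standard deviation of the differences) and that differences are taken as predicted minus true values; neither convention is stated explicitly in the text, and the function name and the measurements are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(truth: Sequence[float], predicted: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean difference, lower limit, upper limit) of agreement."""
    if len(truth) != len(predicted) or len(truth) < 2:
        raise ValueError("need two equally long series with at least 2 values")

    # Assumed sign convention: positive values mean the prediction is larger.
    diffs = [p - t for t, p in zip(truth, predicted)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample standard deviation
    return bias, bias - half_width, bias + half_width


# Hypothetical thickness measurements in mm, not values from the CMU-TBI dataset.
true_mm = [10.0, 12.0, 15.0, 8.0]
pred_mm = [11.0, 12.5, 17.0, 9.5]
bias, lower, upper = bland_altman(true_mm, pred_mm)
print(f"mean difference = {bias:.2f} mm ({lower:.2f} to {upper:.2f})")
```

Under these assumptions the limits are symmetric about the mean difference, which matches the form of the ranges reported for EDH, SDH, and IPH.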

5 Ablation study

5.1 Probability threshold selection

The probability threshold is a parameter that must be chosen so that the model achieves the highest accuracy. Thresholds ranging from 0.1 to 0.9 are evaluated for each subtype of hemorrhage, as shown in Fig. 10. The best probability thresholds for the subtypes EDH (\(PT_{\text{EDH}}\)), SDH (\(PT_{\text{SDH}}\)), and IPH (\(PT_{\text{IPH}}\)) are 0.5, 0.2, and 0.1, respectively.

Fig. 10

Accuracy at varying probability thresholds applied to the predicted output of the fine-tuned EfficientNet-B2
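The threshold search described in this subsection can be sketched as a simple sweep over candidate values for one subtype. The sketch assumes that a slice is labeled positive when its predicted probability is greater than or equal to the threshold and that ties in accuracy are resolved in favor of the lowest threshold; the function names and the probabilities are hypothetical and are not outputs of the fine-tuned EfficientNet-B2.

```python
from typing import Optional, Sequence


def accuracy_at(threshold: float, probs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of slices whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def best_threshold(probs: Sequence[float], labels: Sequence[int],
                   candidates: Optional[Sequence[float]] = None) -> float:
    """Return the candidate threshold with the highest accuracy."""
    if candidates is None:
        # 0.1, 0.2, ..., 0.9, the range evaluated in the text.
        candidates = [round(0.1 * k, 1) for k in range(1, 10)]
    # max() keeps the first maximum, i.e. the lowest threshold on ties.
    return max(candidates, key=lambda t: accuracy_at(t, probs, labels))


# Hypothetical per-slice probabilities and labels for a single subtype.
slice_probs = [0.05, 0.15, 0.30, 0.35, 0.80, 0.90]
slice_labels = [0, 0, 1, 1, 1, 1]
chosen = best_threshold(slice_probs, slice_labels)
print(f"best threshold = {chosen}, accuracy = {accuracy_at(chosen, slice_probs, slice_labels):.2f}")
```

Running the sweep once per subtype would yield one threshold each for EDH, SDH, and IPH.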

5.2 Thickness and volume ratio selection

The optimal thickness ratio (TR) for the EDH and SDH subtypes and the optimal volume ratio (VR) for the IPH subtype are identified as the ratios that give the minimum Mean Absolute Error (MAE) when the model is tested with different ratios. The ratios selected for testing range from 1 to 25. The MAE is given by (12).

$$ MAE=\frac{1}{n}\sum\limits_{i=1}^{n}|q_{i}-\hat{q}_{i}| $$
(12)

where \(q_{i}\) represents the ground truth quantitative information (thickness or volume) provided by experts, \(\hat {q}_{i}\) indicates the predicted quantitative information, and n denotes the number of lesions in each subtype.
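As a brief illustration of (12), the following sketch computes the MAE for a handful of lesions. The function name and the thickness values are hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence


def mean_absolute_error(truth: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE between ground truth q_i and predictions q_hat_i, as in (12)."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("need two non-empty series of equal length")
    return sum(abs(q - q_hat) for q, q_hat in zip(truth, predicted)) / len(truth)


# Hypothetical thickness values in mm for n = 3 lesions.
expert_mm = [12.0, 8.0, 15.0]
model_mm = [13.5, 7.0, 15.5]
mae_mm = mean_absolute_error(expert_mm, model_mm)
print(f"MAE = {mae_mm:.2f} mm")  # (1.5 + 1.0 + 0.5) / 3
```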

To find the optimal TR and VR, the true-positive MAE (\(MAE_{TP}\)) and the false-positive MAE (\(MAE_{FP}\)) of each subtype are obtained. \(MAE_{TP}\) measures the error between the ground truth and the predicted quantitative information within the same category, while \(MAE_{FP}\) measures the error between the ground truth and the predicted quantitative information of different types. The optimal TR is defined as the ratio that gives the minimum average MAE of the thickness values in the EDH and SDH subtypes, as calculated using the following equation.

$$ TR=\text{argmin}(MAE_{\text{AVG}}(\text{EDH},\text{SDH})) $$
(13)

\(MAE_{\text{AVG}}(\text{EDH},\text{SDH})\) represents the average MAE of the thickness values in the EDH and SDH subtypes, defined as

$$ MAE_{\text{AVG}}(\text{EDH},\text{SDH})=\frac{MAE_{TP}(\text{EDH})+MAE_{FP}(\text{EDH})+MAE_{TP}(\text{SDH})+MAE_{FP}(\text{SDH})}{4} $$
(14)

\(MAE_{TP}(\text{EDH})\) is the true-positive MAE for EDH, \(MAE_{FP}(\text{EDH})\) is the false-positive MAE for EDH, \(MAE_{TP}(\text{SDH})\) is the true-positive MAE for SDH, and \(MAE_{FP}(\text{SDH})\) is the false-positive MAE for SDH. Figure 11 shows \(MAE_{TP}\) and \(MAE_{FP}\) for different thickness ratios of our method and the original DeepMedic.

Fig. 11

The true-positive MAE, false-positive MAE, and average MAE of thickness in the EDH and SDH subtypes for our method compared with the DeepMedic network, for ratios between 1 and 25

As shown in Fig. 11a, the true-positive MAE of our method is comparable to that of the traditional DeepMedic method. In addition, our method achieves a lower false-positive MAE and a lower average MAE than the baseline approach, as shown in Fig. 11b and c, respectively. The TR was set to 20, the ratio at which the average MAE is lowest.
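The selection rule in (13) and (14) amounts to averaging the MAE components for each candidate ratio and keeping the ratio with the smallest average. The sketch below assumes the components have already been measured for each ratio; the function names are hypothetical and the MAE numbers are made up for illustration, not read from Fig. 11.

```python
from typing import Dict, Sequence


def average_mae(components: Sequence[float]) -> float:
    """Average of the MAE components, as in (14)."""
    return sum(components) / len(components)


def select_ratio(mae_by_ratio: Dict[int, Sequence[float]]) -> int:
    """Return the ratio with the minimum average MAE, as in (13)."""
    return min(mae_by_ratio, key=lambda r: average_mae(mae_by_ratio[r]))


# Made-up components per ratio, ordered as
# (MAE_TP(EDH), MAE_FP(EDH), MAE_TP(SDH), MAE_FP(SDH)) in mm.
thickness_mae = {
    5: (2.0, 6.0, 1.5, 5.5),
    10: (2.1, 4.0, 1.6, 3.9),
    20: (2.2, 2.0, 1.7, 2.1),
    25: (2.6, 2.0, 2.1, 2.1),
}
tr = select_ratio(thickness_mae)
print(f"TR = {tr}, average MAE = {average_mae(thickness_mae[tr]):.2f} mm")
```

Because `average_mae` divides by the number of components supplied, the same two functions would also cover a two-component average over a true-positive and a false-positive MAE.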

The optimal VR is defined as the ratio that gives the minimum average MAE of the volume values in the IPH subtype. It can be expressed by the following equation.

$$ VR=\text{argmin}(MAE_{\text{AVG}}(\text{IPH})) $$
(15)

\(MAE_{\text{AVG}}(\text{IPH})\) refers to the average MAE of the volume values in the IPH subtype, which is defined as

$$ MAE_{\text{AVG}}(\text{IPH})= \frac{MAE_{TP}(\text{IPH})+MAE_{FP}(\text{IPH})}{2} $$
(16)

\(MAE_{TP}(\text{IPH})\) is the true-positive MAE for IPH, and \(MAE_{FP}(\text{IPH})\) is the false-positive MAE for IPH. Figure 12 shows \(MAE_{TP}\) and \(MAE_{FP}\) for different volume ratios of our method and the original DeepMedic.

Fig. 12

The true-positive MAE, false-positive MAE, and average MAE of the volume values in the IPH subtype for our method compared with the DeepMedic network, for ratios between 1 and 25

The method proposed in this study improved the true-positive MAE as the ratio value increased, as shown in Fig. 12a. The MAE of the volume measurement can also be reduced significantly, as shown in Fig. 12b. The VR was set to 24, the ratio at which the average MAE is lowest.

6 Conclusion

Prior studies on the automated assessment of head CT images after TBI are limited to the undifferentiated detection of different hemorrhagic lesions, with no quantitative assessment conducted for volumetric analysis. Accurate detection and quantification of lesion volumes are nevertheless essential for improving the understanding of the factors that influence lesion progression and for targeting medical treatment. In this study, an optimal deep learning framework is proposed, which can not only identify the subtypes of hemorrhage but also assist the clinically relevant quantitative assessment of thickness and volume. The proposed method integrates a fine-tuned multi-label classifier (EfficientNet-B2), an optimal multi-class segmentation model (DeepMedic), and our quantitative assessment algorithm. The fine-tuned EfficientNet-B2 model achieves the highest accuracy, 98.62 percent, on the CMU-TBI dataset in comparison with two baseline models, namely, ICH U-Net and UNet++.

In addition, the Jaccard Index and Dice Score of our method are calculated using the output from the optimal DeepMedic. The model shows a Jaccard Index and Dice Score comparable to those of the baseline methods on the PhysioNet dataset.

The quantitative assessment algorithm takes as inputs the probabilities of each hemorrhage subtype from the fine-tuned multi-label classifier and the hemorrhage contours from the optimal multi-class segmentation model. To assess how well it differentiates hemorrhagic lesions, our method is also evaluated for each subtype of hemorrhage. The model is tested on classifying EDH, SDH, and IPH separately on a validation set of the CMU-TBI dataset. According to the test results, our method performs best in classifying the SDH type, with an accuracy of 96.54 percent. The average accuracy over the three subtypes of hemorrhage is 96.21 percent.

The thickness and volume of hemorrhagic lesions are computed by means of the distance transform and the commonly applied ABC/2 volume estimation formula. The differences between the ground truth and the predicted lesions (in thickness and volume) are shown by Bland-Altman plots. The predicted EDH thickness, SDH thickness, and IPH volume overestimated the true values by 2.99 mm, 0.97 mm, and 0.43 mL, respectively. Moreover, our method reduces the false-positive mean absolute error of both the thickness and the volume assessments compared with the traditional DeepMedic multi-class segmentation approach.

With this fully automated method applied, the process of decision-making on surgery can be accelerated and the shortage of radiologists in rural medical institutions can be alleviated. In the future, the technique needs to be improved with respect to the aforementioned criteria for surgical consideration. Integrating the research into clinical practice requires additional functions such as skull fracture detection and midline shift measurement. For a better understanding and prognostication of lesions, it is also essential to conduct adequate validation on the other subtypes of hemorrhage.