Background

Intracranial hemorrhage (ICH) is a potentially fatal disorder that accounts for about 2 million strokes globally, with an estimated incidence of 25 per 100,000 per year [1]. ICH has a variety of primary (80–85%) and secondary (15–20%) underlying causes [2]. The most frequent non-traumatic secondary causes include brain tumors, ischemic strokes, and vascular malformations. Hospital admissions for ICH have grown during the past ten years, primarily because of the aging population, insufficient blood pressure (BP) management, and increased use of blood thinners [3, 4]. Accordingly, a rational decrease of BP is an important factor in managing these patients, specifically those with an ICH volume lower than 15 mL [5, 6]. Revascularization in the acute phase of stroke can improve the symptoms and prognosis of these patients [7]. Tissue plasminogen activator (tPA) is the main treatment for ischemic stroke. Moreover, the clot in the blood vessel can be removed by thrombectomy, in which a catheter is inserted in the upper thigh; the blocked artery can then be opened up using angioplasty [8, 9].

Neuroimaging is, therefore, essential for the diagnosis of acute ICH because it can be challenging to differentiate it from other diseases, such as ischemic stroke [10]. Non-contrast computed tomography (CT) of the brain, an accessible and quick technique for diagnosing ICH, is a crucial component of the ICH diagnostic process. Fundamental ICH features such as location, edema, ventricular system expansion, and midline shift are morphologically revealed by a CT-Scan [11]. However, increasing CT-Scan usage could delay the identification of ICH, and a growing burden on radiology departments could lead to job-related stress and burnout. By contrast, artificial intelligence (AI) has been found to improve radiology practice by lowering the amount of effort required [12,13,14].

The efficiency of machine learning (ML) algorithms, especially deep learning (DL) algorithms for computer vision, has advanced significantly. The CT-Scan, one of the most well-known imaging modalities, has seen considerable breakthroughs in ML and its application [15, 16]. Support vector machine (SVM), convolutional neural network (CNN), random forest (RF), and conditional random field (CRF) are the most prominent ML algorithms for recognizing brain bleeding from visual data. Even though a great deal of work has already been accomplished in this field, there is still room for growth. Additional research is required to improve the accuracy, precision, and resilience of ML-based brain segmentation [17, 18]. A previous meta-analysis reported the diagnostic test accuracy (DTA) of AI for the detection of ICH; however, that study did not report sub-groups distinguishing between algorithms or between types of ICH [19]. Therefore, this systematic review and meta-analysis was conducted to objectively evaluate the evidence for ML in the diagnosis of ICH on CT scans.

Results

Study selection & characteristics

Following the primary search, 1,405 studies were identified after removing duplicates. After screening the title, abstract, and full paper, twenty-six retrospective, three prospective, and two retrospective/prospective studies were included [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]. Twenty-nine studies were then included in the final quantitative analysis, and the other studies were excluded because no diagnostic accuracy was reported (Fig. 1) [20,21,22,23,24,25,26, 28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46, 48,49,50,51,52]. The ML models were classified into support vector machine (SVM), random forest (RF), k-nearest neighbors (k-NN), VGG-16, logistic regression (LR), ResNet-18, AlexNet, DenseNet-121, eXtreme Gradient Boosting (XGBoost), decision tree (DT), and deep learning (DL) models, which included convolutional neural networks (CNN) such as ResNet34, ResNet50, ResNet18, ResNet-v2, and GoogleNet (Table 1).

Fig. 1 Study flow diagram showing how articles were extracted

Table 1 Summary of findings for all studies included in the qualitative synthesis

Risk of bias

The validity and risk of bias of the included studies were evaluated with QUADAS-2 (Fig. 2). Among all the included studies, one was rated as having a high risk of bias [20]. When publication bias is very low, the points are distributed symmetrically around the true effect in the shape of an inverted funnel, as shown in Fig. 3.

Fig. 2 A. Risk of bias and applicability concerns graph; review authors' judgments about each domain presented as percentages across included studies. B. Risk of bias and applicability concerns summary; review authors' judgments about each domain for each included study

Fig. 3 Funnel plot showing the low likelihood of publication bias in all included studies

Diagnostic test accuracy (DTA) of all included studies

Retrospective studies

The overall DTA of the 26 retrospective studies, comprising 904,755 scans, was estimated using a univariate meta-analysis [20,21,22,23,24,25,26, 28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43, 45, 46, 48,49,50]. The pooled sensitivity was 0.917 (95% CI 0.88 to 0.943, I2 = 99%) (Fig. 4). The pooled specificity was 0.945 (95% CI 0.918 to 0.964, I2 = 100%) (Fig. 5). The pooled diagnostic odds ratio (DOR) was 219.47 (95% CI 104.78 to 459.66, I2 = 100%) (Additional file 1: Figure S1). The LR+ ranged from 12.639 to 20.784 with a pooled mean of 16.208 (Table 2), and the LR− ranged from 0.072 to 0.123 with a pooled mean of 0.094. An AUC of 0.971 was obtained for the SROC via the bivariate model (Fig. 6). The overall accuracy was 90.3 (range 87.24 to 93.01), the precision was 76.24 (range 66.71 to 86.32), and the F1-score was 79.14 (range 70.9 to 86.48) (Table 2).

Fig. 4 Univariate sub-group analysis of sensitivity with the random-effects model based on retrospective studies

Fig. 5 Univariate sub-group analysis of specificity with the random-effects model based on retrospective studies

Table 2 DTA estimated from all the studies included in the meta-analysis using a 2 × 2 confusion table

Fig. 6 The SROC of the bivariate model for DTA based on retrospective studies

Prospective studies

The overall DTA of the five prospective studies, comprising 104,397 scans, was estimated using a univariate meta-analysis [24, 29, 33, 40, 44]. The pooled sensitivity was 0.886 (95% CI 0.613–0.975, I2 = 100%) (Fig. 7). The pooled specificity was 0.967 (95% CI 0.937–0.983, I2 = 100%) (Fig. 8). The pooled DOR was 227.71 (95% CI 27.82–1863.51, I2 = 100%) (Additional file 1: Figure S2). The LR+ ranged from 6.054 to 87.029 with a pooled mean of 22.953 (Table 2), and the LR− ranged from 0.005 to 1.932 with a pooled mean of 0.101. An AUC of 0.98 was obtained for the SROC via the bivariate model (Fig. 9).

Fig. 7 Univariate sub-group analysis of sensitivity with the random-effects model based on prospective studies

Fig. 8 Univariate sub-group analysis of specificity with the random-effects model based on prospective studies

Fig. 9 The SROC of the bivariate model for DTA based on prospective studies

For the prospective studies, the overall accuracy was 93.69 (range 90.31 to 97.2), the precision was 75.58 (range 55.23 to 91.18), and the F1-score was 77.26 (range 56.23 to 91.32) (Table 2).

DTA based on network architecture

The network architecture analysis was divided into ResNet, RF, and SVM [20,21,22,23,24,25,26, 28, 30,31,32,33,34,35,36,37,38,39, 41,42,43, 45, 46, 48,49,50]. The difference between network architecture models was significant for specificity (p-value = 0.0289). However, the differences in sensitivity (p-value = 0.6417) and DOR (p-value = 0.2187) were not significant (Additional file 1: Figures S3–S5).

DTA based on ICH types

The ICH type analysis was divided into EDH, SDH, IPH, IVH, SAH, and CPH [21, 25, 33, 36, 38, 46, 49, 50]. The difference between ICH types was significant for specificity (p-value < 0.0001) and DOR (p-value = 0.0009). However, the difference in sensitivity between ICH types (p-value = 0.4564) was not significant (Additional file 1: Figures S6–S8).

DTA based on data sources

The data source analysis was divided into single [20, 22, 24, 26, 27, 30, 32,33,34, 36,37,38,39, 41, 42, 47, 48, 50] or multiple [21, 23,24,25, 27, 28, 31, 34, 35, 43, 45,46,47,48,49] sources. The differences were not significant for sensitivity (p-value = 0.6879), specificity (p-value = 0.6494), or DOR (p-value = 0.7272) (Additional file 1: Figures S9–S11).

The data source analysis was also divided into benchmark [26, 28, 31, 32, 36, 38, 42, 46] or real-time data [20,21,22,23,24,25, 27, 30, 33,34,35, 37, 39, 41, 43, 45, 47,48,49,50]. The differences were not significant for sensitivity (p-value = 0.1017), specificity (p-value = 0.5189), or DOR (p-value = 0.1285) (Additional file 1: Figures S12–S14).

Discussion

Detection of ICH by ML may decrease the time to diagnosis, which is clinically crucial because most ICH-related deaths occur within the first hours [53]. This meta-analysis demonstrated that ResNet algorithms could detect ICHs accurately with retrospective and non-randomized data [22, 31, 33, 37, 38, 50].

In the current study, ML was applied to non-contrast CT-Scans for ICH with different architecture models. The resulting pooled sensitivity, specificity, DOR, AUC, accuracy, and precision were 0.917 (95% CI 0.88 to 0.943, I2 = 99%), 0.945 (95% CI 0.918 to 0.964, I2 = 100%), 219.47 (95% CI 104.78 to 459.66, I2 = 100%), 0.971, 90.3 (range 87.24 to 93.01), and 76.24 (range 66.71 to 86.32), respectively.

Practical ML is characterized by high accuracy measures such as AUC, sensitivity, and specificity, which show how accurately it can categorize illness suspects and non-suspects. This meta-analysis revealed a pooled AUC of 0.971. On the other hand, the high AUC of the included studies may not correctly represent the algorithms' clinical benefit [54]. The AUC among studies ranged from 0.608 to 1, and neural network (NN) models such as CNN, ResNet, and RNN had higher AUCs than other ML algorithms [20, 21, 23, 24, 26,27,28,29, 31, 33, 37,38,39, 43, 44, 46, 49]. In other words, this result suggests that NN algorithms trained on big data can improve the AUC, which is a useful measure of how well a model separates the positive and negative target classes.

Liu et al. (2019), who pooled 14 out-of-sample external validation experiments, showed that DL models had a pooled sensitivity of 87.00% (95% confidence interval: 83.00–90.20%) and a specificity of 92.50% (95% confidence interval: 85.10–96.40%) when compared to the gold standard [55].

To interpret the results, a DOR of 219.47 (95% CI 104.78–459.66, I2 = 100%) generally means that using ML in diagnosing ICH is valuable. Because accuracy alone does not show how reliable the positive predictions are, precision is also reported. A precision of 76.24 (range 66.71 to 86.32) indicates moderately reliable positive predictions alongside the accuracy of 90.3 (range 87.24 to 93.01). These results show that ML can distinguish patients with ICH from healthy patients. Likelihood ratios are also important factors that can help improve clinical judgment, because they describe how far a test result shifts the probability of disease: an LR+ greater than 10 produces a large increase from the pretest to the post-test probability, and an LR− less than 0.1 produces a large and often conclusive decrease in the post-test probability [56]. The pooled LR+ ranged from 12.639 to 20.784 with a mean of 16.208, and the pooled LR− ranged from 0.072 to 0.123 with a mean of 0.094. The pooled LR+ of 16.208 means that a positive ML result is 16.208 times more likely in a patient with ICH than in a patient without ICH; likewise, the pooled LR− of 0.094 means that a negative ML result is only 0.094 times as likely in a patient with ICH as in a healthy patient. The pooled F1 score of this study was 79.14 (range 70.9 to 86.48). The F1 score is a numerical score between 0 and 100; the closer this number is to 100, the more valuable the method studied [57]. This score is the harmonic mean of recall and precision, which has a significant place in data interpretation. A high F1 score reflects low numbers of both false negatives and false positives.
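
The effect of a likelihood ratio on clinical judgment can be illustrated with the odds form of Bayes' rule: post-test odds equal pre-test odds multiplied by the likelihood ratio. The following is a minimal sketch, not part of the original analysis. The function name `post_test_probability` and the 10% pre-test probability are assumptions chosen only for illustration; the likelihood ratios are the pooled retrospective values reported above (LR+ = 16.208 and negative likelihood ratio LR− = 0.094).

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability into a post-test probability.

    Uses the odds form of Bayes' rule:
        post-test odds = pre-test odds * likelihood ratio
    """
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


if __name__ == "__main__":
    PRE_TEST = 0.10        # hypothetical pre-test probability of ICH (assumption)
    LR_POSITIVE = 16.208   # pooled LR+ reported for retrospective studies
    LR_NEGATIVE = 0.094    # pooled LR- reported for retrospective studies

    after_positive = post_test_probability(PRE_TEST, LR_POSITIVE)
    after_negative = post_test_probability(PRE_TEST, LR_NEGATIVE)
    print(f"Probability of ICH after a positive ML result: {after_positive:.3f}")
    print(f"Probability of ICH after a negative ML result: {after_negative:.3f}")
```

Under the assumed 10% pre-test probability, a positive result raises the probability of ICH to roughly 64% and a negative result lowers it to roughly 1%. A different pre-test probability would give different post-test values, so these figures illustrate the mechanism rather than a clinical estimate.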

The sub-group analysis based on the ML architecture and algorithms was done to assess these factors' influence on the DTA results. The difference between network architecture models was significant for specificity (p-value = 0.0289). However, the differences in sensitivity (p-value = 0.6417) and DOR (p-value = 0.2187) were not significant. The ResNet algorithm had a higher pooled specificity than the other algorithms, 0.935 (95% CI 0.854 to 0.973, I2 = 93%). Across studies, the CNN architectures included specialized neural networks and ensemble learning [58]. However, this study focuses on CNNs for detecting ICHs in general, and it may not be appropriate to extend the results to other AI projects [25]. To increase the number of fully connected layers from one to five, Lee et al. (2019) combined a final CNN made up of VGG16, ResNet50, Inception-v3, and Inception ResNet-v2, utilizing ResNet18 with only minor alterations [33]. It has been demonstrated that standard ImageNet architectures such as ResNet18 do not significantly outperform smaller and simpler CNNs [59]. However, the performance of an ensemble of transfer models may be enhanced by averaging many transfer models. Chang et al. (2018) used a hybrid 3D/2D CNN pyramid with a proprietary mask R-CNN architecture as its backbone to detect and segment ICHs [60]. Medical imaging can use finely tuned 3D networks, which have shown exceptional performance in a variety of applications; however, 3D networks need a large dataset and several training parameters, with the image depth volume varying from 20 to 400 slices per scan, which makes them more demanding in terms of computational efficiency [25].

In addition, the sub-group analysis based on ICH types was significant for specificity (p-value < 0.0001) and DOR (p-value = 0.0009). However, the difference in sensitivity between ICH types (p-value = 0.4564) was not significant. EDH had a higher pooled specificity, 0.99 (95% CI 0.947–0.998, I2 = 100%), and a higher pooled DOR, 616.79 (95% CI 91.76–4145.99, I2 = 97%), than the other ICH types. There were no significant differences between data sources (single versus multiple or benchmark versus real-time).

Misdetection of ICHs, which are difficult to distinguish from bone or undiscovered microbleeds in trauma imaging, is another clinically significant and relevant issue [61]. In the study by Kuo et al. (2019), the skull and face were removed from non-contrast CT scans (NCCTs) using image processing techniques. They achieved 100% sensitivity in an external test set of 200 NCCTs, which was likely made possible by the simplicity of detecting bleeding when only intracranial structures were considered [62]. Excluding patients because of image artifacts might improve the measured performance of an algorithm. Patient-related imaging artifacts, such as those caused by metallic materials, patient movement, and incomplete projections, are common in NCCTs. In addition, the diversity of CT scanners and image reconstruction methods makes direct comparisons between studies challenging [33].

Limitations

Developing a clinical environment where an ML system supports the radiologist could improve diagnostic efficacy and should be assessed from a socioeconomic and patient standpoint [63]. The deployment of ML in clinical operations necessitates a sophisticated configuration coupled with medical imaging systems. Only one of the included articles assessed midline shift [25]. Therefore, this outcome could not be analyzed. This would be important clinically, as a midline shift > 5 mm may be an indication for urgent neurosurgical review.

Additionally, the findings of the I-squared analysis make it clear that combining the data from these studies may not be appropriate, underscoring the dearth of external validation research. Due to factors like scanning methodology, scanner types, algorithm designs, and reference standards, it is not easy to compare different studies, which reduces the generalizability and validity of the findings. The judgment of articles may have been affected by subjective bias since the authors' degrees of experience varied. In addition to these sources of variability, the reliance on retrospective studies was the most noticeable limitation of this study, so additional prospective studies in this area may significantly advance future research.

Conclusion

This meta-analysis of the DTA of ML algorithms for detecting ICH on non-contrast CT-Scans shows that ML has an acceptable performance in diagnosing ICH. Using ResNet in ICH detection remains promising, and prediction was improved via training in an Architecture Learning Network (ALN). However, further studies with greater homogeneity are needed to draw more accurate conclusions about the DTA of ML in ICH.

Methods

Protocol and registration

This meta-analysis was reported according to the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) guideline [64].

Eligibility criteria

Original studies were eligible if they met all the following predefined inclusion criteria: a) patients undergoing a non-contrast brain computed tomography (CT) scan for the detection of acute or chronic intracranial hemorrhage (ICH), such as intraparenchymal hemorrhage (IPH), subdural hemorrhage (SDH), epidural hemorrhage (EDH), intraventricular hemorrhage (IVH), and subarachnoid hemorrhage (SAH), and b) use of a gold standard (radiologists) to report the ICH.

Information sources

Until May 2023, systematic searches were conducted in ISI Web of Science, PubMed, Scopus, Cochrane Library, IEEE Xplore Digital Library, CINAHL, Science Direct, PROSPERO, and EMBASE for studies that evaluated the diagnostic accuracy of ML model-assisted ICH detection.

Search strategy

One experienced librarian [KSH] established the search strategy and refined it through team discussion. “Deep Learning,” “Machine Learning,” “Artificial Intelligence,” “Intracranial Hemorrhages,” “intraparenchymal hemorrhage,” “epidural hemorrhage,” “subdural hemorrhage,” “subarachnoid hemorrhage,” “intraventricular hemorrhage,” “Diagnosis,” “Meta-Analysis,” and “Computerized Tomography” were among the keywords. Conference papers, editorials, commentaries, reviews, guidelines, book chapters, technical articles, and papers with inadequate citation standards that did not match the conceptual framework of the study were excluded.

Summary measures

For ICH versus healthy controls (HCs), the numbers of true positives (TP, true ICH predicted to be ICH), true negatives (TN, non-ICH predicted to be non-ICH), false positives (FP, non-ICH predicted to be ICH), and false negatives (FN, ICH predicted to be non-ICH) were extracted for meta-analysis purposes. The original studies' inclusion criteria were used to obtain data for the meta-analysis on detecting ICH. In addition, the publication year, the country where the research was conducted, the study methodology, the number of patients, and their ages were extracted. The primary outcomes were diagnostic accuracy = (TP + TN)/(TP + FN + FP + TN), specificity = TN/(FP + TN), sensitivity = TP/(TP + FN), precision = TP/(TP + FP), F1-score = 2 × (precision × recall)/(precision + recall), where recall is the sensitivity, negative likelihood ratio (LR−) = (1 − sensitivity)/specificity, positive likelihood ratio (LR+) = sensitivity/(1 − specificity), DOR = LR+/LR−, and the AUC of ML in detecting ICH in patients with ICH versus HCs [65, 66]. The sub-group analysis compared the accuracy, sensitivity, and specificity of ML and CT-Scan.
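
The summary measures defined above can be computed directly from a single 2 × 2 confusion table. The following is a minimal sketch under stated assumptions, not the code used in this study: the names `ConfusionTable` and `dta_metrics` and the example counts are hypothetical, and no continuity correction is applied, so a table with a zero cell in a denominator raises an error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionTable:
    """Cells of a 2 x 2 table for ICH versus healthy controls."""
    tp: int  # true ICH predicted to be ICH
    fp: int  # non-ICH predicted to be ICH
    fn: int  # ICH predicted to be non-ICH
    tn: int  # non-ICH predicted to be non-ICH


def dta_metrics(t: ConfusionTable) -> dict:
    """Return the diagnostic test accuracy measures for one study."""
    sensitivity = t.tp / (t.tp + t.fn)          # also called recall
    specificity = t.tn / (t.fp + t.tn)
    precision = t.tp / (t.tp + t.fp)
    accuracy = (t.tp + t.tn) / (t.tp + t.fn + t.fp + t.tn)
    f1 = 2 * (precision * sensitivity) / (precision + sensitivity)
    lr_pos = sensitivity / (1 - specificity)    # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity    # negative likelihood ratio
    dor = lr_pos / lr_neg                       # diagnostic odds ratio
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formulas.
    example = ConfusionTable(tp=90, fp=5, fn=10, tn=95)
    for name, value in dta_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Because DOR = LR+/LR− simplifies to (TP × TN)/(FP × FN), the example table can be checked by hand: (90 × 95)/(5 × 10) = 171.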

Risk of bias across studies

Two independent reviewers used the updated Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) instrument to evaluate the quality and potential bias of all studies. Conflicts were resolved through discussion, and a third reviewer independently assessed the initially included papers alongside the two reviewers. Two categories were considered, risk of bias and applicability concerns, each across the patient selection, index test, and reference standard domains. Risk of bias was also evaluated in the flow and timing domain.

Additional analyses

Using the random-effects (RE) model, a univariate meta-analysis was conducted for each modality's sensitivity and specificity to determine its diagnostic accuracy [67]. The RE model was chosen because of the suspected high degree of heterogeneity. The primary endpoints were sensitivity, specificity, the summary receiver operating characteristic (SROC) curve, and the diagnostic odds ratio (DOR). Point estimates and 95% confidence intervals (CIs) of sensitivity and specificity were calculated for each study to assess their consistency. A bivariate meta-analysis of sensitivity and specificity was performed in R version 4.1.2 (R Foundation for Statistical Computing, Vienna, Austria, 2021) and RStudio version 1.4.1717, using the "mada" and "meta" R packages, to obtain the SROC curve. The average AUC of the SROC was then estimated [68, 69]. The secondary outcomes comprised the positive and negative likelihood ratios, precision, and F1 score. Cochran's Q test and the I2 statistic were used to evaluate statistical heterogeneity between studies. For the I2 statistic, 0–40% indicates unimportant heterogeneity, 30–60% indicates moderate heterogeneity, and 75–100% indicates considerable heterogeneity. A funnel plot was used to examine and depict publication bias (32). All p-values were derived from two-sided tests, and p-values below 0.05 were considered statistically significant. Sub-group analyses were performed by ML algorithm, ICH type, retrospective or prospective study design, and acute or chronic ICH. Risk-of-bias and applicability-concern charts were produced using Cochrane Review Manager version 5.4 (RevMan 5.4).
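
The idea behind a univariate random-effects pooling of sensitivity, together with Cochran's Q and I2, can be sketched as follows. This is an illustrative sketch only, not the procedure used in this study, which relied on the "meta" and "mada" R packages. It assumes a logit transformation, inverse-variance weights, and the DerSimonian-Laird estimator of between-study variance; the settings actually used by those packages may differ. The names `pool_random_effects` and `PooledProportion` and the example counts are hypothetical.

```python
import math
from typing import NamedTuple, Sequence


class PooledProportion(NamedTuple):
    estimate: float  # pooled proportion, back-transformed from the logit scale
    ci_low: float    # lower bound of the 95% confidence interval
    ci_high: float   # upper bound of the 95% confidence interval
    i2: float        # I-squared heterogeneity, in percent
    tau2: float      # between-study variance on the logit scale


def pool_random_effects(events: Sequence[float],
                        totals: Sequence[float]) -> PooledProportion:
    """Pool proportions (e.g. TP out of TP + FN) with a random-effects model."""
    if len(events) != len(totals) or len(events) < 2:
        raise ValueError("need at least two studies with matching lengths")

    logits, variances = [], []
    for e, n in zip(events, totals):
        if e == 0 or e == n:
            # Assumed 0.5 continuity correction for studies with a zero cell.
            e, n = e + 0.5, n + 1.0
        logits.append(math.log(e / (n - e)))
        variances.append(1.0 / e + 1.0 / (n - e))

    # Fixed-effect weights, Cochran's Q, and I-squared.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, logits)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, logits))
    df = len(logits) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate on the logit scale.
    w_re = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def expit(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    return PooledProportion(expit(mu), expit(mu - 1.96 * se),
                            expit(mu + 1.96 * se), i2, tau2)


if __name__ == "__main__":
    # Hypothetical per-study counts: true positives and total ICH cases.
    true_positives = [45, 88, 190, 60]
    ich_cases = [50, 100, 200, 80]
    result = pool_random_effects(true_positives, ich_cases)
    print(f"Pooled sensitivity: {result.estimate:.3f} "
          f"(95% CI {result.ci_low:.3f} to {result.ci_high:.3f}), "
          f"I2 = {result.i2:.0f}%")
```

Pooled specificity can be obtained from the same function by passing true negatives as `events` and the number of non-ICH cases as `totals`. The bivariate model used for the SROC curve jointly models sensitivity and specificity and is not reproduced by this univariate sketch.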