Introduction

Vestibular Schwannoma (VS), or Acoustic Schwannoma, is a rare tumor with an incidence rate ranging from 0.3 to 6.1 cases per 100,000 individuals, varying between age groups and ethnicities [1,2,3,4,5]. Mainly affecting individuals in the 60–69 age group, VS causes symptoms such as gradual hearing loss, dizziness, and tinnitus. Symptoms escalate as the tumor increases in size, leading to compression of the cranial nerves, the brainstem, and the cerebellum. This compression results in symptoms such as vomiting, headaches, nausea, or mental confusion [6,7,8].

Diagnosis of VS is crucial for determining optimal interventions, which include surgery, radiotherapy, or gene therapy. Currently MR scans are considered the gold-standard, utilizing T1 weighted (T1W), T2W weighted (T2W), and contrast-enhanced T1 weighted (ceT1W) image sequences for localization and tumor segmentation assessment [9, 10].

However, as the VS follow-ups span years, manual delineation and segmentation of VS becomes an ineffective, labor-intensive process. Automated segmentation could potentially save a lot of time for physicians, improving the effectiveness of the workplace. Additionally, manual segmentation of the neoplasm is relatively subjective process, which can lead to variations in the results. Currently, challenges associated with limitations of manual segmentation include:

  • Clinical decision-making: Accurate volumetric measurements are crucial for monitoring the growth of VS to plan next treatment steps. The time-consuming manual option limits the potential of rapid decisions.

  • Superiority of volumetric analysis: 3D volume analysis allows for more comprehensive and accurate analysis than regular linear measurements of the VS. However, due to the time required for analysis, this option remains less attractive.

  • Safety of protocols: While contrast scans are still commonly used, implementation of high-resolution scans without contrast for automated segmentation could benefit patients by minimizing the risk of gadolinium accumulation in brain tissues. Additionally, cystic parts are hyperintense on these scans, allowing for better decision making.

  • Costs: Automated tools could reduce potential expenses by decreasing the total scanning length and the number of required sequences.

In the era of artificial intelligence (AI), convolutional neural networks (CNNs), a subset of deep learning (DL), demonstrated potential for the automated diagnosis of pathologies. CNNs, which are able to mimic human neurons, have demonstrated promising results in medical applications, including radiology, where DL networks have been applied for detection, tumor volume measurements, and neoplasm segmentation [11,12,13,14,15].

Despite these advancements, currently no work synthesizes information regarding the automated segmentation of VS using DL techniques. Hence, in this study we aim to conduct a systematic review and meta-analysis, describing the current state of algorithms applied to MR image segmentation of this rare neoplasm. By performing such review, we not only aim to fulfill existing gap in synthesis, but also provide insights, limitations, and guidance that may influence the future works related to the topic of automated segmentation of VS.

Methodology

This study adhered to the 2020 PRISMA guidelines for conducting a systematic review and meta-analysis [16]. The identification of eligible articles involved a comprehensive search across four databases: PubMed, Web of Science, Scopus, and Embase. Our search strategy incorporated MeSH terms and encompassed a range of keywords, including Artificial Intelligence, Deep Learning, Machine Learning, Automation, Convolutional Neural Networks, and specific terms related to vestibular schwannoma, acoustic schwannoma, and perineural fibroblastoma. The complete search strategy is provided in Appendix 1. In addition, a manual review of references was conducted to identify any studies not captured in the initial search. Two authors independently screened articles for inclusion and exclusion based on the outlined criteria, with conflicts resolved through mutual discussion with other authors. The screening process was facilitated by ZOTERO software, employed by both authors.

Inclusion and exclusion criteria

We established inclusion criteria for our study, including: (I) no restriction on publication date, encompassing articles from all years; (II) inclusion of only original articles in English; (III) exclusion of reviews, abstract-only works, preprints, and other non-original articles; (IV) focus on deep learning MR image segmentation; (V) exclusion of other artificial intelligence techniques such as conventional machine learning (ML); (VI) inclusion of studies with specified validation methods for the meta-analysis, aiming to mitigate overestimation; (VII) inclusion of studies with extractable data for meta-analysis (standard deviation or error), with the Dice score as the primary performance metric, supplemented by data on relative volume error (RVE) and average symmetric surface distance (ASSD).

Dice score, RVE, ASSD, and other extracted data

The Sørensen-Dice score, or Dice score, quantitatively measures the similarity between two sets, serving as an evaluation metric for segmentation algorithm performance. RVE provides a percentage measure of the deviation between segmented and true volumes, while ASSD gauges the alignment of segmented surfaces with ground truth surfaces (in our study in mm). Additional extracted data from each study included information on country of origin, software and hardware specifications, DL algorithm characteristics, validation types, testing set sizes, MRI inputs, patient demographics, tumor volume, MR imaging specifications, labelling, cystic analysis, training parameters, and dataset source. Independent extraction of data was followed by conflict resolution through discussion with other authors.

Statistical analysis

Each study underwent individual assessment using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool to evaluate the potential risk of bias [17]. Robvis software facilitated the visualization of outcomes [18]. Stata v18 software was employed for statistical analysis, utilizing a random-effects REML model for forest plot construction. Distinct techniques, testing sets, DL algorithms, and MRI inputs were treated as independent results. Subgroup analyses were conducted based on 3-dimensional and 2.5-dimensional architecture algorithms, ceT1W, T2W, T1W and mixed MRI inputs, as well as different testing set sizes (< 50, 50–100, and > 100 patients). Heterogeneity was quantified using the percentage I2 index. Significance threshold was set at p < 0.05. The analysis employed 95% confidence intervals (CI), and bias was assessed through funnel plot analysis.

Results

The detailed process of our literature search is outlined in Fig. 1 [16]. We identified a total of 752 articles across four databases: 241 from PubMed, 147 from Web of Science, 201 from Embase, and 163 from Scopus. Using ZOTERO, we initially identified 371 articles to remove duplicates. During screening, two authors independently reviewed 381 records, resolving disagreements through collaboration.

Following screening, we excluded 347 works unrelated to the meta-analysis topic. In the subsequent full-text screening, 34 works were included. Unfortunately, we were unable to retrieve 2 records. For the meta-analysis, we incorporated 11 out of 34 potentially eligible works. Eleven lacked extractable data for statistical analysis, one was a preprint, one had duplicate results, another was a review, and seven were abstract-only articles. Attempts to contact authors via mail correspondence yielded no replies.

Fig. 1
figure 1

PRISMA flow diagram. Made with PRISMA template [16]

Publication bias

We performed a QUADAS-2 analysis using the robvis tool, individually assessing each of eleven studies [19,20,21,22,23,24,25,26,27,28,29]. For each study, we evaluated the patient selection process, index test, reference standard, and flow & timing of the study, with a detailed description available in the QUADAS-2 statement [17]. One article exhibited low risk of bias, seven had some concerns, while three had a high risk of bias. Major concerns were identified in the patient selection process and some in the index test, with no significant concerns in the reference standard and flow and timing section. We have identified lack (or no reports) of clear inclusion or exclusion criteria, unclear process of randomization of the datasets (patients), lack of information of blinding the index test, and two studies had concerns in the labelling process. The QUADAS-2 charts with bias assessment results are presented in Fig. 2.

Fig. 2
figure 2

QUADAS-2 chart with assessment of bias of included studies. Made with robvis [18]

Included studies

From the 11 studies included in this meta-analysis, 4 were from the UK, 3 from the USA, 1 from Taiwan, and 1 each from the Netherlands and China, with a collaborative effort between China and the UK. The publication dates ranged from 2019 to 2023. Pytorch and Tensorflow were the most commonly used software, along with NVIDIA GPU cards and Intel CPUs. However, not all studies reported details such as software versions or GPU types. All works were based on CNN platforms, with U-Net being the most popular option, featuring various modifications of modules and dimension processing. The majority of works (6) employed split validation for performance assessment and further tuning. Studies varied in MR imaging techniques, with contrast-enhanced T1-weighted imaging being the most popular, appearing in 9 studies. Detailed information about each study is available in Table 1. We have additionally extracted supplementary data from each study, related to demographics, dose radiation, tumor volume, and MRI specification which is available in the Table 2. Table 3 provides additional information about labelling, training process, and datasets [30, 31].

Table 1 Overview of with vestibular schwannoma DL segmentation, included in quantitative review
Table 2 Supplementary information from each study
Table 3 Supplementary information II

Overall performance analysis

Regarding the overall performance of segmentation, fifty-six models from eleven studies reported a mean Dice score for VS MR image segmentation. The pooled mean Dice score was 0.89 (CI: 0.88–0.90, p < 0.001), with a heterogeneity of 95.9% (I2). Forest plot with overall Dice results is provided in Fig. 3. The lowest reported value was 0.80, while the highest was 0.94. A funnel plot revealed asymmetry (Fig. 4).

The pooled ASSD was 0.46 (CI: 0.42–0.51, p < 0.001), with a heterogeneity (I2) of 96.1%, from 43 models. The mean relative volume error was 11.7% (CI: 9.7–13.7, p < 0.001) from 24 models, with a heterogeneity (I2) of 93.8%.

Fig. 3
figure 3

Forest plot with overall mean Dice score pooled from 12 studies

Fig. 4
figure 4

Funnel plot for overall Dice score

Architecture subgroup analysis

In the architecture subgroup analysis, we compared different DL architectures, mainly 2.5D networks (40 models), 3D networks (13 models), and other models that did not fit into these categories (3). Among the 2.5D DL networks, the mean Dice score was 0.89 (CI: 0.88–0.90), with a heterogeneity of 94.4% (I2). Thirty-two models reported ASSD, and the pooled result was 0.45 (CI: 0.40–0.50, I2 = 94.9%). The reported RVE was 10.3% (CI: 8.7–11.9, I2 = 86.6%) from 15 models.

For 3D networks, the pooled Dice score was 0.90 (CI: 0.88–0.91), with a heterogeneity of 94.5%. From 8 models, the average ASSD was 0.45 (CI: 0.33–0.56, I2 = 95.7%), and the pooled RVE was 13.4% (CI: 8.7–18.0, I2 = 95.8%).

For other networks, the mean Dice score was 0.82 (CI: 0.81–0.84), with I2 = 26.2%, and ASSD = 0.72 (CI: 0.55–0.88, I2 = 76.2%). Only one work pooled RVE, 18.0% (CI: 13.1–22.9).

MRI input subgroup analysis

In the MRI input subgroup analysis, models were divided into groups based on their imaging input. The mixed group was for DL networks using two or more different MRI inputs. The mean Dice score for 13 models using mixed MR imaging inputs was 0.90 (CI: 0.89–0.92, I2 = 94.9%), and the mean ASSD from 8 models was 0.37 (CI: 0.28–0.45, I2 = 94.7%). Four models reported RVE, pooling 7.8% (CI: 6.9–8.6, I2 = 0.0%).

For 22 T2W models, the mean dice score was 0.86 (CI: 0.85–0.87, I2 = 86.3%), 19 models reported ASSD 0.59 (CI: 0.54–0.64, I2 = 70.4%). Pooled RVE from 10 models was 14.0% (CI: 12.3–15.6, I2 = 57.6%).

Twenty models used ceT1W input – pooled Dice score was 0.91 (CI: 0.90–0.92, I2 = 93.5%). From 16 models reporting ASSD, mean results pooled was 0.36 (CI: 0.31–0.41, I2 = 94.5%). Ten models reported RVE, yielding 10.9% (CI: 7.1–14.6, I2 = 96.7%).

Only 1 model used T1W input and only reported a Dice score of 0.82 (CI: 0.79–0.85).

Testing set size subgroup analysis

In the final subgroup analysis, we compared models’ performance, grouping them based on their testing set size. DL algorithms were divided into < 50, 50–100, and > 100 patient set sizes. There were 41 models tested on sets smaller than 50. The pooled Dice score was 0.88 (CI: 0.87–0.90, I2 = 95.0%). Thirty-eight models reported ASSD, and the pooled result was 0.48 (CI: 0.43–0.53, I2 = 95.4%), while the mean RVE from 19 models was 11.0% (CI: 9.4–12.6, I2 = 87.1%).

Eleven models were tested on sets of size 50–100 patients. The pooled Dice score was 0.88 (CI: 0.86–0.91, I2 = 97.6%). One work reported ASSD and RVE.

Four models were tested on sets with more than 100 patients. The pooled Dice score was 0.90 (CI: 0.89–0.92, I2 = 88.2%). Pooled ASSD was 0.37 (CI: 0.32–0.41, I2 = 45.1%), while RVE was 14.2% (CI: 5.3–23.1, I2 = 98.3%). A Dice score subgroups analysis summary is available as Fig. 5, for ASSD subgroups as Fig. 6, while for RVE as Fig. 7.

Fig. 5
figure 5

Dice score subgroup analysis results

Fig. 6
figure 6

ASSD (mm) subgroup analysis results

Fig. 7
figure 7

RVE (%) subgroup analysis results

Discussion

In the rapidly evolving landscape of medical imaging, our systematic review and meta-analysis aims to comprehensively summarize the current state of automated segmentation of Vestibular Schwannoma using Deep Learning techniques. As artificial intelligence, particularly Convolutional Neural Networks, continues to redefine diagnostic paradigms, understanding the application and reliability of these technologies is imperative.

The rarity of VS and its potential severity underscore the importance of accurate and timely diagnosis. Our meta-analysis emphasizes the significance of advanced imaging techniques, primarily MR scans, in evaluating tumor localization and size. Diagnosis of rare tumors might be especially challenging for inexperienced residents of radiology. The implementation of DL not only assists younger specialists but also reduces the time required for the manual segmentation process. Traditional gold standard methods, while effective, are being complemented and, in some instances, challenged by the advent of AI. Reduced workload allows for routine application of volumetric analysis [32]. Manual contouring suffers from inter-operator bias [24]. Automated segmentation offers the potential for standardized and reproducible volume measurements. Traditional linear measurements lack sensitivity, while volumetric measurements are more accurate but much more time-consuming [33, 34]. DL automation allows for efficient and accurate 3D volume calculation, facilitating better clinical decision-making. It is estimated, that automation tools could reduce even 30% time for segmentation [35]. Finally, the ability to perform segmentation using solely T2-weighted MRI addresses concerns about gadolinium contrast agents and potentially reduces costs, to an even 10-fold saving per scan [36, 37].

The integration of CNNs, a subset of DL, holds promise in automating the diagnostic process for VS. The substantial number of MRI scans associated with VS diagnosis and follow-up contributes significantly to healthcare costs. AI-based segmentation offers a novel option to address these conditions by optimizing the workflow of radiologists. Our meta-analysis attests to the increasing interest in leveraging DL for condition detection, tumor volume measurements, and neoplasm segmentation.

Despite strides in AI applications in medical imaging, our study reveals a critical knowledge gap concerning automated segmentation of VS using DL techniques. To the best of our knowledge, this is the very first systematic review and meta-analysis on this topic.

Adherence to the 2020 PRISMA guidelines and the use of the QUADAS-2 tool underscore the methodological rigor employed in our study. Comprehensive searches across major databases, incorporation of specific keywords, and manual reviews of references aimed to encompass relevant literature. Our inclusion criteria, while necessary, led to the exclusion of studies lacking extractable data, highlighting the importance of standardized reporting in future research.

The pooled mean Dice score of 0.89 indicates a commendable level of agreement between segmented and true volumes. However, the observed heterogeneity of 95.9% raises questions about methodological consistency across studies. High heterogeneity could be influenced by the different number of patients used in testing, training and validating models, population (patient) differences, different functions applied in the DL architecture, as well as technical differences in the MR images.

While the Dice score provides a comprehensive measure of segmentation accuracy, the inclusion of Relative Volume Error and Average Symmetric Surface Distance adds granularity to our assessment. The pooled RVE of 11.7% indicates the average deviation between segmented and true volumes. The ASSD of 0.46 mm highlights the alignment of segmented surfaces with ground truth surfaces, showcasing the robustness of DL algorithms in preserving anatomical details.

The subgroup analyses further illuminate the nuanced performance variations. 2.5D DL networks, with a mean Dice score of 0.89, demonstrate comparable efficacy to 3D networks (mean Dice score of 0.90) and outperform other models (mean Dice score of 0.82). Imaging input analyses reveal the superiority of contrast-enhanced T1-weighted imaging and mixed MRI inputs, emphasizing the influence of input modality on segmentation outcomes.

Studies varied in the MRI imaging protocols. Scanner field strength ranged from 1.5T to 3T. For the ceT1 scans, MPRAGE sequence was the most common option. T2W weighted scans utilized 3D CISS, TSE, and 2D Spin Echo sequences. However, notable variations in slice thickness and in-plane resolution (Table 2) could affect the effectiveness of the algorithms in other scenarios.

Despite the promising findings, limitations exist. We were unable to retrieve two records, what arises publication bias. Through QUADAS-2 tool, we have identified bias in different aspects of included studies – these risks in included works should be acknowledged, as well as high heterogeneity of the results. Additionally, articles only in the English language were included. Readers should be aware of these limitations, and future systematic reviews should address these gaps.

Regarding external validation, most of the studies did not perform external analysis of the AI, which raises the risk of overfitting to certain MRI setups. Additionally, external datasets mostly relied on a single database provided by Shapey et al. [30]. This raises question how will DL models perform in the randomized trials or prospective studies with different environments. Studies also lacked software (type and version used) and/or hardware information (GPU type), which could guide future researchers. These aspects are useful for software designers, and should be reported in the future studies.

Another aspect is the analysis of the cystic and solid parts of the VS. As the imaging properties differ, not all studies have deeply analysed them, and often only mentioned difficulties with cystic parts analysis [38]. Finally, other parameters such as size (beside volume), and proportion of the cystic cases should be provided. These aspects are crucial to be resolved in the future trials to provide more comprehensive analysis of VS for better clinical decision-making.

While the high Dice score shows promising results, which could be theoretically implemented in clinical practice, currently due to the limitations, more testing is required for full-scale clinical application. Future research should report fully validated models with more standardized designs, in order to increase homogeneity across the studies. Additionally, we believe that using bigger testing sets, would decrease overestimation of the results, as majority of DL models in this study were assessed on sets with 100 or less patients.

Conclusions

To summarize, this meta-analysis explores current knowledge of automated segmentation of VS using DL techniques. While Dice, RVE and ASSD scores show promising results for this task, the study is limited by high heterogeneity, caused by variations in MRI techniques, population differences and DL model differences, as well as bias among studies. Future research in this field should address these limitations for more standardized results.