1 Introduction

Aorta segmentation in computed tomography (CT) images is crucial for making accurate diagnoses and devising treatment strategies for cardiovascular diseases. Precise aorta segmentation can help with evaluating the severity and progression of conditions such as aortic aneurysms and dissections. Segmentation has traditionally been performed manually, which is labor-intensive and prone to operator variability [1]. Semi-automatic segmentation methods aimed at reducing this burden still require considerable user intervention and expertise [2].

The emergence of deep learning (DL) models, particularly convolutional neural networks, has markedly advanced the field of medical image segmentation. These models can automatically learn complex patterns from extensive data sets, which leads to substantial improvements in segmentation accuracy and efficiency [3, 4]. DL models have been increasingly used for aorta segmentation in CT images to overcome the limitations of manual and semi-automatic methods [5, 6].

DL models, such as U-Net and its variants, have been at the forefront of technical advancements in medicine. These models exhibit robust performance across imaging conditions and anatomies and thus facilitate accurate and reproducible aorta segmentation, which is crucial for evaluating disease characteristics and devising treatment plans [7, 8]. However, despite these technological advancements, challenges related to data diversity, model generalization, and clinical integration remain to be addressed before widespread adoption of DL model–based segmentation in clinical practice can occur [9, 10].

Review studies have extensively investigated the application of machine learning and DL models for cardiac [11] and aortic [12] diseases. In the present systematic review and meta-analysis, we focused on DL models for aorta segmentation in CT images. We evaluated the contribution of DL models to comprehensive and quantitative assessments. We examined the methodologies and performance metrics of various DL models and identified their achievements and the obstacles to their implementation. This study clarifies the current status of the literature, offers insights for future research, and may facilitate the clinical integration of DL models to improve the management of aortic diseases.

2 Methods

2.1 Study Guidelines

This systematic review and meta-analysis adhered to the 2020 Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines, which ensured methodological rigor from study conception to result reporting. Supplementary Tables S1 and S2 present information regarding the adherence to PRISMA checklists [13]. The present study was registered with INPLASY and is currently awaiting assignment of a unique ID. Because this study involved synthesizing data from the literature and did not involve direct human participation, the requirements for ethical approval and informed consent were waived.

2.2 Database Search for Eligible Articles

Two reviewers, T-WW (5 years of experience in the application of artificial intelligence in medical imaging) and J-SH (7 years of experience in the application of artificial intelligence in medical imaging), independently searched the literature to identify studies that used DL models for aorta segmentation in CT images. PubMed, Embase, and Web of Science were searched from their inception to March 13, 2024 (Supplementary Table S3). To ensure comprehensive literature coverage, manual searches were performed after the initial screening of titles and abstracts. Between-reviewer disagreements in the study selection process were resolved through discussion with a senior researcher, Y-TW (more than 20 years of experience in medical imaging). Studies were included if they used DL algorithms to segment the aorta, including aortas with conditions such as calcification, in CT images of the adult thorax. We excluded studies on aortic diseases (e.g., dissections and aneurysms), those not using CT or DL models, those not involving human participants, those published as conference articles, those irrelevant to our research focus, and those lacking the data required for meta-analysis (e.g., standard deviation values for mean Dice scores).

2.3 Data Extraction and Management

T-WW and J-SH extracted the following data from the identified studies: study design, patient demographics, anatomical focus on the aorta, validation methods, and external validation criteria. In addition, information was collected on metrics such as the number and dimensions of image series. The annotation quality and annotators’ qualifications were thoroughly reviewed. The following variables were systematically evaluated: image modality, contrast agent use, slice thickness, and hardware and algorithmic specifications.

Our analysis focused on the Sørensen–Dice coefficient because of its pivotal role in CT image segmentation. This coefficient is the most commonly reported metric for the quantitative evaluation of algorithmic accuracy in medical image segmentation.
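For two binary masks A (prediction) and B (reference annotation), the Dice coefficient is 2|A ∩ B| / (|A| + |B|). The following minimal sketch illustrates the calculation on flattened masks; the function name, the toy masks, and the convention for two empty masks are assumptions made for illustration and are not taken from any included study.

```python
def dice_coefficient(pred, truth):
    """Sørensen–Dice coefficient for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as aorta in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of aorta voxels in each mask, added together.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 voxels predicted, 3 voxels annotated, 2 voxels shared.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 1, 1, 0, 1, 0]
print(round(dice_coefficient(predicted, reference), 3))  # 0.667
```

A value of 1 indicates perfect overlap and 0 indicates no overlap; the studies in this review report the coefficient as a percentage.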

2.4 Methodological Quality Appraisal

The methodological quality of the included studies was evaluated using two tools: the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [14, 15]. The assessments were independently conducted by T-WW and J-SH; any disagreement between the reviewers was resolved through discussion with the senior researcher Y-TW. This structured approach, which involved multiple evaluators and required consensus, was intended to ensure a comprehensive and unbiased evaluation of study quality.

2.5 Statistical Analysis

Meta-analyses were performed to evaluate the segmentation performance of DL models in studies reporting Dice scores. For studies using multiple DL models, only the model exhibiting the best performance was considered. When several studies used the same data set for validation, the model exhibiting the best performance was selected. For studies reporting on different aortic segments (ascending, descending, or entire aorta), the results for each segment were individually extracted. Reported median values and interquartile ranges were converted to mean and standard deviation values by using a standard formula [16, 17]. Given the heterogeneity in study populations, a random-effects model was used with the restricted maximum likelihood method [18]. Forest plots were generated to visualize the results. Subgroup analyses were performed by geographic location, publication status, validation method, image modality, contrast agent use, imaging dimensionality, and algorithm type [19]. Meta-regression was performed to examine the correlation of Dice scores with the size of the training data set [20]. The Q test was performed to assess heterogeneity; statistical significance was set at p < 0.05. On the basis of I² values, heterogeneity was classified as trivial (0%–25%), minimal (26%–50%), moderate (51%–75%), or pronounced (76%–100%) [21]. Publication bias was investigated using the Egger test for funnel plot asymmetry [22]. Statistical analyses were performed using Stata/SE (version 18.0) for Mac (StataCorp, College Station, Texas, USA).
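The pooling and heterogeneity steps described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis run in Stata: it uses the DerSimonian-Laird moment estimator in place of restricted maximum likelihood, a Q-based I² value, and one common quartile-based approximation (which assumes roughly normal data and a reasonably large sample) that may differ from the formula cited in [16, 17]. All function names and input values are hypothetical.

```python
import math


def median_iqr_to_mean_sd(q1, median, q3):
    """Approximate the mean and SD from quartiles (assumes roughly normal data)."""
    mean = (q1 + median + q3) / 3.0
    sd = (q3 - q1) / 1.35  # the IQR of a normal distribution spans about 1.35 SDs
    return mean, sd


def random_effects_pool(effects, std_errors):
    """Pool study effects with a DerSimonian-Laird random-effects model (not REML)."""
    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q and its degrees of freedom.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))

    # Q-based I^2, expressed as a percentage.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "q": q, "tau2": tau2, "i2": i2}


def classify_heterogeneity(i2):
    """Map I^2 (percent) to the bands used in this review."""
    if i2 <= 25:
        return "trivial"
    if i2 <= 50:
        return "minimal"
    if i2 <= 75:
        return "moderate"
    return "pronounced"


# Hypothetical Dice scores (as fractions) and standard errors for three studies.
result = random_effects_pool([0.93, 0.96, 0.98], [0.010, 0.008, 0.012])
print(result, classify_heterogeneity(result["i2"]))
```

Each study's standard error would typically be derived from its reported standard deviation and test-set size. An I² value derived from a restricted maximum likelihood estimate of the between-study variance can differ from the Q-based value computed here.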

3 Results

3.1 Study Selection

Figure 1 presents a PRISMA flowchart depicting the article selection process. Supplementary Table S3 presents the numbers of articles retrieved from various databases. After the removal of duplicates and the exclusion of articles on the basis of a title and abstract review, 16 studies [6, 7, 23–36] were included in the final analysis. Supplementary Table S4 presents the reasons for excluding certain articles from the review.

Fig. 1 Flowchart depicting article selection. The present review study and meta-analysis adhered to the PRISMA guidelines

3.2 Basic Characteristics of Included Studies

Table 1 presents the characteristics of the included retrospective studies, which were predominantly published in 2023 and originated from various countries; only one open-source data set was reported [30]. These studies explored different regions of the aorta and primarily used train/test splits for validation [6, 7, 24–33, 35, 36], with two using cross-validation [23, 34]. Notably, none of the included studies reported conducting external validation. The studies varied in terms of annotation methods; manual [23, 24, 27–33, 35, 36], automated [25], and semi-automated [6, 7, 26, 34] methods were reported. The annotators were professionals from relevant fields, such as experienced radiologists and clinicians.

Table 1 Patient and study characteristics

3.3 Characteristics of CT Imaging and DL Models

Table 2 presents the characteristics of the CT imaging and DL models used in the included studies. The studies used a range of CT imaging modalities, including standard CT [26, 28–33, 36], non-contrast-enhanced CT [23, 27], CT angiography [6, 24, 25], and coronary CT angiography [34, 35], with variable use of contrast agents. Slice thickness varied widely, from 0.29 to 6 mm.

Table 2 Characteristics of CT imaging and DL models

Two-dimensional [7, 24, 26, 31, 35, 36], three-dimensional [6, 25, 27–29, 32, 34], or both [23, 30, 33] types of images were analyzed. The DL models used in these studies included U-Net and its variants, such as stacked U-Net, bidimensional Attention U-Net, V-net, Siamese multitask learning network, convolutional neural network, nnU-Net, posterior-conditional random field, and two-stage methods.

3.4 Quality Assessment

Figure S1 presents the results of the quality assessment of the included studies, which was performed using the QUADAS-2 tool. Supplementary Table S5 presents the detailed assessment of risk of bias and applicability concerns. Ambiguities regarding the exclusion of interval derivation from the data sets were identified in 8 (50%) of the studies [6, 7, 23, 24, 30–33], which may compromise the integrity of data representation. Furthermore, 2 (12.5%) studies [25, 31] did not include all patients in their analysis, which may limit the relevance and generalizability of their findings.

Supplementary Table S6 presents the results of the evaluation of the 16 studies against the CLAIM criteria. The mean CLAIM score was 29 (standard deviation: 2.33; range: 24–33) out of a maximum of 42, corresponding to approximately 69%. The mean scores for the CLAIM subsections were as follows: title/abstract, 1.69/2 (84%); introduction, 2.00/2 (100%); methods, 19.69/28 (70%); results, 2.69/5 (54%); discussion, 2.00/2 (100%); and other information, 0.94/3 (31%). These scores highlight the strengths of the 16 studies and indicate where the reporting of DL-based aorta segmentation research can be improved.
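The percentages above follow from dividing each mean score by the maximum score of its subsection. The short check below reproduces this arithmetic from the values quoted in the text; the variable and function names are illustrative only.

```python
# Mean CLAIM subsection scores as (mean score, maximum score), taken from the text above.
claim_sections = {
    "title/abstract": (1.69, 2),
    "introduction": (2.00, 2),
    "methods": (19.69, 28),
    "results": (2.69, 5),
    "discussion": (2.00, 2),
    "other information": (0.94, 3),
}


def percent(score, maximum):
    """Express a score as a percentage of its maximum."""
    return 100.0 * score / maximum


# The subsection maxima should add up to the full CLAIM score.
total_mean = sum(score for score, _ in claim_sections.values())
total_max = sum(maximum for _, maximum in claim_sections.values())

for name, (score, maximum) in claim_sections.items():
    print(f"{name}: {score:.2f}/{maximum} = {percent(score, maximum):.1f}%")
print(f"total: {total_mean:.2f}/{total_max} = {percent(total_mean, total_max):.1f}%")
```

The subsection maxima sum to 42 and the subsection means sum to approximately 29, which agrees with the overall mean score of about 69%.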

3.5 Efficacy of DL Model–Based Aorta Segmentation in CT Images

Data from 19 tables in the 16 articles revealed slight variations in Dice scores (93%–98%). The overall Dice score was 96% (95% confidence interval [CI]: 95%–96%; Fig. 2). The Q test indicated significant heterogeneity among the included studies (Q = 902.83; p < 0.01), a finding corroborated by a high Higgins I² value (98.10%). A sensitivity analysis confirmed the robustness of these results, revealing that aggregated effect sizes remained significant even when individual studies were excluded from the analysis (Fig. S2).

Fig. 2 Forest plot of Dice scores for DL model–based aorta segmentation in chest CT images

A subgroup analysis by geographic location revealed significant between-region differences in model performance (Asia vs. Europe vs. North America, respectively: Q = 58.33 vs. 27.13 vs. 458.46; I² = 93.88% vs. 76.88% vs. 99.03%; p < 0.01 for all). Specifically, seven Asian, eight European, and four North American studies yielded pooled Dice scores of 97% (95% CI: 96%–98%), 94% (95% CI: 94%–95%), and 95% (95% CI: 93%–97%), respectively (Fig. 3). However, subgroup analyses by validation method, annotation, image modality, contrast agent use, image dimension, and algorithm type revealed no significant difference in model performance (Supplementary Fig. S3). Meta-regression revealed no significant correlation between the Dice score and the size of the training data set (coefficient: −9.14e−06; p = 0.570) or the size of the testing data set (coefficient: −1.735e−04; p = 0.181).

Fig. 3 Forest plot for subgroup analysis of Dice scores for DL model–based aorta segmentation in chest CT images. The subgroup analysis was performed by geographic location

The funnel plot analysis of the 16 studies and the Egger regression test (p = 0.2775) indicated no significant publication bias (Supplementary Fig. S4). However, a subgroup analysis by publication year revealed significant differences among the studies (p < 0.01; Supplementary Fig. S5).
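Egger's test regresses each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error); an intercept that differs from zero suggests funnel plot asymmetry. The sketch below computes the intercept and its t statistic by ordinary least squares. It is a simplified illustration rather than the Stata routine used here, the function name and inputs are hypothetical, and the p value is omitted because it requires a t distribution with n − 2 degrees of freedom, which the Python standard library does not provide.

```python
import math


def egger_intercept(effects, std_errors):
    """Return (intercept, standard error, t statistic) of Egger's regression."""
    n = len(effects)
    if n < 3:
        raise ValueError("at least three studies are needed")
    precision = [1.0 / se for se in std_errors]               # predictor
    standardized = [y / se for y, se in zip(effects, std_errors)]  # response

    mean_x = sum(precision) / n
    mean_y = sum(standardized) / n
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(precision, standardized))

    # Ordinary least-squares slope and intercept.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance and the standard error of the intercept.
    residual_ss = sum((y - intercept - slope * x) ** 2
                      for x, y in zip(precision, standardized))
    residual_var = residual_ss / (n - 2)
    se_intercept = math.sqrt(residual_var * (1.0 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat


# Hypothetical Dice scores (as fractions) and standard errors for four studies.
print(egger_intercept([0.97, 0.95, 0.94, 0.96], [0.005, 0.010, 0.020, 0.015]))
```

The t statistic would be compared against a t distribution with n − 2 degrees of freedom, for example through a statistics library such as SciPy.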

4 Discussion

Our meta-analysis revealed that DL models significantly improved aorta segmentation in CT images, achieving an overall pooled Dice score of 96% (95% CI: 95%–97%). Notably, our findings revealed significant geographic variations, with Dice scores varying among Asian, European, and North American studies. These variations suggest that the studies differed in data set characteristics or segmentation methods. The Egger regression test indicated no significant publication bias (p = 0.28), supporting the robustness of our findings. However, significant differences were noted between the published papers and conference articles.

Our findings align with the observed trend of increasing accuracy in aorta segmentation with the use of DL models. Gu et al. (2021) [33] reported an average Dice score of 93% for various DL models, highlighting the rapid progress in the field of DL. The improvements observed in the current review may be attributable to advancements in DL architectures and training strategies, particularly for U-Net variants, which are known for their superior feature extraction abilities [23].

The high accuracy of DL models in aorta segmentation has crucial clinical implications. Precise segmentation facilitates assessment of aortic diseases such as aortic aneurysms [37] and aortic dissections [38], thereby aiding in diagnosis [38], treatment planning [39], disease monitoring [40], and delineation of organs at risk during radiotherapy [32]. DL models with Dice scores of > 95% can reduce the time and operator variability associated with manual segmentation, thus ensuring timely and uniform patient care.

U-Net [4] and its variants [6] have proven particularly effective in aorta segmentation because of their symmetric architecture that captures both contextual information and fine details. However, challenges related to data diversity and model generalization persist, as indicated by the geographic variations in Dice scores noted in the current study. These challenges highlight the need for data sets covering a wide range of anatomical and pathological variations to train robust models.

Our review, like most meta-analyses, has several limitations. First, study selection bias is possible, and the included studies were heterogeneous, particularly in imaging protocols, population demographics, and aortic region. Second, although the Dice coefficient is a valuable metric for assessing segmentation accuracy, it may not adequately reflect performance variability across segmentation tasks of differing complexity and may overlook subtleties in model performance on challenging or atypical image features. Third, including only the best-performing model for each data set may obscure poor performance. Finally, integrating DL models into clinical workflows poses several challenges; technical robustness, user interface design, and system integration must all be addressed for these models to effectively enhance clinical practice.

Future studies on DL models for aorta segmentation should aim to enhance the models’ generalizability and evaluate their clinical benefits across various aortic conditions. Research should also standardize the aortic section analyzed or stratify results by aortic location to better understand the impact on model performance. Furthermore, the clinical effects of automated aorta segmentation, particularly its effects on diagnostic accuracy, treatment planning, and patient outcomes, warrant further study. The benefits of DL models must be evaluated through rigorous research to provide a solid foundation for the clinical integration of these models. Multicenter studies involving a broad spectrum of clinical settings and patient populations should be conducted to assess the real-world applicability of DL models. Such studies, along with external validation efforts, may help confirm the models’ efficacy and reliability beyond the development environment. Additional studies are also needed to give subgroup analyses in future meta-analyses sufficient statistical power. Such analyses are crucial for identifying areas where DL models perform well or require improvement, which can guide future optimization efforts to meet the specific demands of clinical practice.

5 Conclusion

This systematic review and meta-analysis revealed that DL models considerably enhance aorta segmentation in CT images, indicating promising advancements toward accurate, efficient, and standardized diagnosis and treatment planning for cardiovascular diseases.

6 Declaration