Introduction

Esophageal cancer, ranking seventh globally in incidence and sixth in mortality, affects predominantly men, especially in Eastern Asia, with nearly half a million new cases and deaths. The two main subtypes, squamous cell carcinoma (SCC) and adenocarcinoma (AC), are associated with distinct patterns; SCC may be declining in Asia due to economic progress, while AC is on the rise in high-income countries, linked to factors like excess body weight [1]. Lymph node metastasis (LNM) is a pivotal factor in esophageal cancer prognosis, influencing long-term survival. The complex lymphatic network around the esophagus leads to metastases in the abdomen, mediastinum, and neck. The diverse patterns of LNM in both squamous cell carcinoma and adenocarcinoma, regardless of the primary tumor location, suggest a need for a more tailored and perhaps more aggressive approach in both surgical and radiotherapeutic management of esophageal cancer. The phenomenon of skip metastasis and the presence of metastases in non-adjacent lymph node stations underscore the complexity of lymphatic drainage from the esophagus and the limitations of current staging and treatment paradigms [2]. The gold standard for detecting LNM in esophageal cancer is histopathological examination of surgically resected lymph nodes, providing a definitive diagnosis through microscopic analysis of tissue [3]. Although invasive, this method is unmatched in accuracy. For non-invasive pre-surgical assessment, endoscopic ultrasound (EUS) is highly sensitive for local lymph node evaluation [4], while computed tomography (CT) and positron emission tomography (PET) scans are crucial for broader staging, including distant metastases [5]. Magnetic resonance imaging (MRI) may also be utilized, but less frequently for lymph node assessment [6]. 
These imaging techniques, though valuable for initial staging and planning, have limitations in detecting small LNM, particularly micrometastases, as demonstrated by the small median sizes of involved lymph nodes and metastatic nests in our study. The lower sensitivity of imaging modalities such as FDG-PET for detecting small LNMs highlights the challenge of relying solely on preoperative imaging for accurate nodal staging. Consequently, this underscores the need for meticulous surgical assessment and possibly more extensive lymph node dissection in certain cases, even when clinical staging suggests the absence of nodal involvement. The discrepancy between clinical and pathological findings emphasizes the potential for underestimation of disease spread and the critical role of postoperative pathological evaluation in guiding further treatment decisions and improving patient outcomes. Therefore, imaging methods cannot substitute the conclusive nature of histopathological examination post-surgery [7]. Histopathological samples are routinely obtained via surgical route, but surgery is no longer the preferred choice for treating metastatic esophageal cancer due to increased risks and poor prognosis, especially in patients with advanced metastases [8]. Therefore, imaging methods such as endoscopic ultrasound, CT, and 2-[fluorine-18]fluoro-2-deoxy-D-glucose (FDG)-PET for detecting LNM in esophageal cancer are critical since they are less invasive than surgery. Each of these modalities has its limitations, and individual studies suggest they exhibit low to moderate sensitivity and moderate to high specificity when assessing lymph node status [9]. The integration of artificial intelligence (AI) into radiology, propelled by advancements in machine learning (ML) and deep learning (DL), has ushered in a paradigm shift. 
This transformation is marked by the optimization of image acquisition processes, the streamlining of operational workflows, and the enhancement of diagnostic precision. ML algorithms, as evident in tools like Computer-Aided Diagnosis (CAD), significantly contribute to heightened sensitivity and specificity, reducing the time needed for interpreting chest X-rays. Deep learning models, particularly those built on convolutional neural networks (CNNs), demonstrate exceptional proficiency in tasks such as image recognition, proving invaluable for deciphering intricate medical images. Moreover, Radiomics plays a vital role in leveraging data to improve diagnostic insights by extracting quantitative features from medical images. This process is fundamental in enhancing our understanding of medical conditions through the analysis of specific image characteristics [10]. Radiomics, as an analytical method in medical imaging, employs sophisticated mathematical analyses to extract detailed features from medical images (e.g., CT, MRI, and PET), with a primary application focus on oncology [11]. Radiomics seeks to transform medical images into data that can be mined for valuable insights not easily visible to the naked eye. Through the analysis of quantitative features, it aims to offer extra details about the inherent biology, diversity, and traits of tissues and tumors. The extracted information holds potential for diverse medical applications, especially in oncology, serving diagnostic, prognostic, and predictive purposes [12]. Radiomics plays a potential role in improving the staging of esophageal cancer by analyzing texture features from imaging modalities. These features, including tumor heterogeneity and various measurements, offer additional insights beyond conventional staging methods [13]. Radiomics outperforms traditional radiological assessments conducted by radiologists in various aspects of cancer diagnosis and prognosis. 
It excels in tasks such as predicting tumor invasion and differentiating between malignant and benign tumors, offering promising potential for accurate prognosis and treatment planning. Radiomics models also show efficacy in forecasting metastasis, providing valuable insights for personalized patient care. These findings underscore radiomics’ role in improving diagnostic accuracy and guiding clinical decision-making in oncology [14]. Numerous meta-analyses have highlighted the promising results of radiomics methods for predicting LNM in malignancies of organs such as the stomach [15], thyroid [16], breast [17], cervix [18], and pancreas [19]. The pooled area under the curve (AUC) of these studies fell between 0.70 and 0.90, indicating moderate to good diagnostic performance of radiomics methods for predicting LNM. Because the diagnostic performance of radiomics methods for predicting LNM varies between organs, a comprehensive view of the accuracy and quality of radiomics studies in esophageal cancer is needed. This objective can be achieved through a systematic review and meta-analysis. Therefore, this study was designed to investigate the pooled diagnostic performance and quality of the published literature, as well as to provide future perspectives for further studies.

Materials and methods

Study design and reporting guidelines

The present study was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) guidelines [20].

Literature search

A systematic literature search of the electronic databases Embase, PubMed, and Web of Science was performed independently by two reviewers to identify relevant studies that predicted lymph node metastasis in esophageal cancer, using the following terms and their equivalents: (Radiomics) AND (Esophageal Cancer) AND (Lymph Node Metastasis). The search was updated on November 16, 2023. Only English-language studies were considered. The updated search terms and the results are detailed in Supplementary Materials.

Study selection

PICO (population, intervention, comparison, and outcome) questions of the study were: (P) population: patients preoperatively diagnosed with esophageal cancer; (I) intervention: application of radiomics; (C) comparison: assessment of radiomics for prediction of LNM before treatment; and (O) outcome: measurement of diagnostic performance (e.g., sensitivity, specificity, and AUC) for predicting LNM after surgery. Inclusion criteria were (a) application of radiomics to predict LNM in esophageal cancer; (b) postoperative pathological LNM status available for all participants; and (c) sufficient data for calculating 2 × 2 contingency tables consisting of true positive (TP), false positive (FP), false negative (FN), and true negative (TN). The exclusion criteria were as follows: (a) review papers, case reports, meetings, letters, abstracts, editorials, comments, posters, and guidelines; (b) studies that did not use radiomics methods for predicting LNM; (c) articles with no access; (d) literature published in a language other than English; (e) studies not providing enough data for constructing 2 × 2 tables; and (f) studies that did not use separate validation cohorts.
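The diagnostic measures derived from each 2 × 2 table can be sketched as follows. This is a minimal illustration using the standard definitions; the class name `ConfusionTable` and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionTable:
    """2 x 2 contingency table for one validation cohort (hypothetical helper)."""
    tp: int  # LNM present, model positive
    fp: int  # LNM absent, model positive
    fn: int  # LNM present, model negative
    tn: int  # LNM absent, model negative

    def sensitivity(self) -> float:
        # Proportion of LNM-positive patients correctly identified.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of LNM-negative patients correctly identified.
        return self.tn / (self.tn + self.fp)

    def positive_likelihood_ratio(self) -> float:
        return self.sensitivity() / (1.0 - self.specificity())

    def negative_likelihood_ratio(self) -> float:
        return (1.0 - self.sensitivity()) / self.specificity()

    def diagnostic_odds_ratio(self) -> float:
        # Undefined when fp or fn is zero; a continuity correction is
        # commonly applied in that case (not shown here).
        return (self.tp * self.tn) / (self.fp * self.fn)


# Illustrative counts only.
example = ConfusionTable(tp=30, fp=10, fn=10, tn=40)
print(example.sensitivity(), example.specificity())
```

A study that does not report enough information to fill all four cells cannot contribute to these measures, which is why such studies were excluded.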

Data extraction

The citations obtained through database retrieval were imported into EndNote software. After removing duplicate publications, titles and abstracts were screened to eliminate literature that did not meet the specified inclusion criteria. The full texts of the remaining studies were then reviewed to determine final inclusion. Two authors independently carried out data extraction and the evaluation of study quality. The following basic data were extracted: name of the first author with the year of publication, study origin, design of the study (e.g., retrospective or prospective design), number of centers, number of participants in validation cohorts, reference standard, imaging modality, image acquisition phase (for CT scan studies), radiomics approach (texture analysis, ML, or DL), and combined clinicopathological features. In addition, the following technical information was extracted: segmentation method (automatic vs. manual), region of interest (ROI) type (2D vs. 3D), software used for feature extraction, number of imaging features extracted/selected, type of imaging features extracted, modeling algorithm, feature reduction algorithm, intraclass correlation coefficient (ICC) evaluation, and type of cross-validation.

Quality assessment

A modified version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was used to assess the quality of the included studies; the questions for each domain are detailed in Table S4 (Supplementary Materials) [21]. In addition, the Radiomics Quality Score (RQS) tool proposed by Lambin et al. was used to evaluate the methodological quality of radiomics studies [22]. QUADAS-2 questions were implemented in the Review Manager software, and diagrams were drawn subsequently.

Statistical data analysis

The pooled values of sensitivity (SENS), specificity (SPEC), diagnostic odds ratio (DOR), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and area under the curve (AUC), with their 95% confidence intervals (CIs), were generated. Using a random-effects model, we generated the summary receiver operating characteristic (SROC) curve and calculated the AUC to appraise the diagnostic efficacy of the aggregated studies. The AUC values were categorized as indicating low (0.5–0.7), fair (0.7–0.8), good (0.8–0.9), and excellent discriminatory power (> 0.9). Coupled forest plots were generated to show the pooled values of sensitivity and specificity. Cochran’s Q test and Higgins’ I2 statistic were calculated to estimate the heterogeneity among the studies included in this meta-analysis, with I2 values categorized as follows: 0 to 25% indicating very low heterogeneity, 25 to 50% indicating low heterogeneity, 50 to 75% indicating medium heterogeneity, and > 75% indicating high heterogeneity. We used Deeks’ asymmetry test and its funnel plot to investigate publication bias. Fagan plots were employed to evaluate clinical utility by providing post-test probabilities of LNM for given pre-test probabilities. All p-values below 0.05 were considered significant. The statistical analyses in this study were conducted using Stata version 17.0 and Meta-DiSc.
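The heterogeneity and AUC categories described above can be expressed as a short sketch. The formula for I2 is the standard one based on Cochran's Q; the function names are hypothetical, and the handling of values that fall exactly on a category boundary (assigned to the higher category here) is an assumption, since the text does not specify it.

```python
def higgins_i2(q: float, n_studies: int) -> float:
    """Higgins' I2 (in percent) from Cochran's Q and the number of studies."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0


def heterogeneity_label(i2_percent: float) -> str:
    """Map I2 to the categories used in this review."""
    if i2_percent > 75:
        return "high"
    if i2_percent >= 50:
        return "medium"
    if i2_percent >= 25:
        return "low"
    return "very low"


def auc_label(auc: float) -> str:
    """Map an AUC value to the categories used in this review."""
    if auc > 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    return "low"


# Illustrative input: Q = 16 across 9 studies gives I2 = 50%.
print(higgins_i2(16.0, 9), heterogeneity_label(higgins_i2(16.0, 9)))
```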

Results

Literature search

An electronic database search identified 426 citations with 128 duplicate studies. After screening the titles and abstracts of the candidate studies, 261 citations were excluded for not meeting the inclusion criteria. A thorough examination of the full texts resulted in the exclusion of 28 additional articles, leaving 9 for inclusion in the meta-analysis [23,24,25,26,27,28,29,30,31]. Figure 1 illustrates the detailed search process.

Fig. 1 Flowchart of the study based on PRISMA guidelines

Study characteristics

Table 1 shows the full characteristics of the selected studies and predictive models. Nine articles comprising 719 participants were included in the quantitative synthesis, all retrospectively designed, eight conducted in China [23,24,25,26, 28,29,30,31] and one in the United Kingdom [27]. Imaging modalities were CT [23,24,25,26, 28,29,30], PET [27], and MRI [31]. Only one study used multi-center data [27]. Two studies used deep learning-based radiomics (deep-radiomics) for feature extraction [26, 28], and the rest of the studies used conventional (machine learning-based) radiomics [23,24,25, 27, 29,30,31]. Five studies combined radiomics and clinical features [26,27,28,29,30]. Manual ROI segmentation was performed in seven studies [23,24,25,26, 29,30,31], and only two studies [27, 28] used the automatic segmentation method. Only one study used 2D ROI segmentation [28]. MATLAB was the most frequently used software for feature extraction (5/9) [24, 25, 27, 28, 31]. Similarly, the least absolute shrinkage and selection operator (LASSO) algorithm was adopted in two-thirds of the studies for feature selection [23, 25, 27,28,29,30], followed by more modern algorithms such as “elastic net” [24, 26, 31]. Logistic regression (LR) was the most commonly adopted algorithm for building radiomic models, and only one study used more advanced machine learning algorithms such as support vector machine (SVM), AdaBoost (Adaptive Boosting), and random forest (RF) [28].

Quality assessment

QUADAS-2

Fig. 2 QUADAS-2 quality assessment per study (A) and per domain (B)

Figure 2 shows the quality of the selected studies as assessed with the QUADAS-2 tool. Overall quality was acceptable, and the study designs were aligned with the signaling questions. Only the study by Ding et al. had a high risk of bias and a high applicability concern in the patient selection domain, as it included some patients who received treatment before imaging [26].

Table 1 Characteristics of the included studies and predictive models

RQS

The nine studies obtained an average RQS score of 12.78 and a median score of 12, with individual scores ranging from 9 to 14 out of 36 points. The mean score corresponded to 35% of the maximum, and the highest-rated study achieved 38%. All studies reported a detailed image protocol, feature reduction, and discrimination statistics. On the other hand, none of the studies included a phantom study, a prospective design, biological correlation, comparison to the gold standard, cost-effectiveness analysis, or open science and data. Multiple segmentation was not performed in one study [30]. One-third of the studies performed imaging at multiple time points [23, 26, 29]. Multivariable analysis (combined with clinical factors) was performed in two-thirds of the studies [24, 26,27,28,29,30]. Cut-off analysis was performed in only two studies [25, 27]. Eight of the included studies had validation cohorts and received + 2 points in the validation item, and one study [27] received + 3 points because it used data from another center. Four studies assessed potential clinical applicability by conducting decision curve analysis [23, 24, 29, 30]. Detailed RQS scores of each study are provided in Table 2.

Table 2 RQS score of the included studies per item

Diagnostic meta-analysis

Nine studies (validation cohorts) consisting of 334 patients with LNM (+) and 385 patients without LNM (-) were selected for the quantitative synthesis. The pooled diagnostic indicators with their 95% confidence interval (CI) were determined: SENS, 0.72 [95% CI; 0.67–0.77]; SPEC, 0.76 [95% CI; 0.69–0.82]; PLR, 3.1 [95% CI; 2.3–4.1]; NLR, 0.36 [95% CI; 0.30–0.44]; DOR, 9 [95% CI; 6–13]; and AUC, 0.74 [95% CI; 0.70–0.78]. The coupled forest plot, including sensitivity and specificity alongside heterogeneity indicators (Higgins’ I2 and Cochran’s Q) is shown in Fig. 3. Furthermore, Fig. 4 shows the summary ROC curve (SROC) with pooled AUC value.

Fig. 3 Coupled forest plot showing pooled sensitivity and specificity

Fig. 4 Summary ROC curve (SROC) of the radiomic models for predicting LNM in esophageal cancer

Heterogeneity

Heterogeneity existence

The Cochran’s Q and Higgins I2 tests showed that medium heterogeneity (I2 = 57.04%) was present in the pooled specificity values (p-value = 0.02). In contrast, very low heterogeneity (I2 = 0.00%) was observed in the accumulative sensitivity, and Cochran’s Q test did not show a significant heterogeneity (p-value = 0.53). Threshold effect was also ruled out as the possible cause of heterogeneity since Spearman’s correlation coefficient (r) was calculated as 0.1 (p-value = 0.798).
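A common way to check for a threshold effect is to compute Spearman's rank correlation between the logit of sensitivity and the logit of (1 − specificity) across studies; a strong positive correlation suggests that studies differ mainly in their positivity threshold. The sketch below assumes this formulation and implements the rank correlation with the standard library only. All function names and the input values are hypothetical.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def threshold_effect_rho(sensitivities, specificities):
    """Spearman's rho between logit(sens) and logit(1 - spec)."""
    x = [logit(s) for s in sensitivities]
    y = [logit(1.0 - s) for s in specificities]
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values: sensitivity rises as specificity falls,
# the pattern expected under a pure threshold effect.
rho = threshold_effect_rho([0.6, 0.7, 0.8], [0.9, 0.8, 0.7])
print(rho)
```

A coefficient close to zero, as reported here (r = 0.1), is the pattern expected when threshold differences do not explain the observed heterogeneity.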

Causes of heterogeneity

Meta-regression was performed to investigate the causes of heterogeneity (Table 3). Among all of the considered covariates, only the use of combined models significantly contributed to the heterogeneity of the results (p-value = 0.05). Among other covariates, a sample size higher than 75 or the use of the least absolute shrinkage and selection operator (LASSO) for feature selection might be implicated in the inter-study heterogeneity. However, these results were not statistically significant (0.05 < p-value < 0.10).

Subgroup analysis

Different factors were considered for subgroup analysis (Table 3).

Study population

Studies with sample sizes larger than 75 showed higher pooled sensitivity (0.74 vs. 0.68) and pooled specificity (0.80 vs. 0.67); however, the results were not statistically significant (p-values for sensitivity and specificity > 0.05).

Publication year

Studies published before 2020 exhibited slightly higher pooled sensitivity (0.73 vs. 0.72; p-value = 0.01), but pooled specificity was higher in those published after 2020 (0.78 vs. 0.71; p-value > 0.05 with no statistically significant difference).

ROI segmentation method

Studies utilizing manual ROI segmentation exhibited higher pooled sensitivity (0.73 vs. 0.70; p-value = 0.04) and pooled specificity (0.78 vs. 0.71; p-value = 0.38, not statistically significant) compared to those with semi-automatic segmentation.

ROI segmentation dimension

Studies utilizing 2D ROI segmentation demonstrated superior sensitivity (0.76 vs. 0.72; p-value = 0.03) compared to 3D segmentation; however, specificity was quite similar (0.76; p-value = 0.35).

Radiomics methods

Deep learning-based radiomics exhibited superior sensitivity (0.79 vs. 0.70) and specificity (0.81 vs. 0.75) compared to conventional radiomics methods, but the evidence was not statistically significant (p-value > 0.05) due to the small number of studies employing the deep learning approach (n = 2).

Imaging modality

MRI demonstrated the highest sensitivity value (0.81), followed by CT (0.73) and PET (0.63). As for specificity, CT scored the highest (0.79), followed by MRI (0.70) and PET (0.64). Notably, due to the limited number of studies involving MRI or PET, the results lacked statistical significance (p-value > 0.05), underscoring the need for further investigation into MRI and PET radiomics in this area.

Radiomics model construction algorithm

The one study that used AdaBoost for model construction had a significantly higher sensitivity than the studies using LR (0.76 vs. 0.72; p-value = 0.03). However, the specificity was similar (0.76; p-value = 0.35).

Feature selection algorithm

Studies employing elastic net feature selection exhibited significantly higher sensitivity (0.81 vs. 0.69; p-value < 0.01). However, the pooled specificity was higher for studies utilizing the LASSO algorithm (0.78 vs. 0.73; p-value = 0.13), although this difference was not statistically significant.

Combined radiomics models

Studies combining radiomics signature with clinical factors demonstrated a significantly lower pooled sensitivity compared to those utilizing signature-only studies (0.74 vs. 0.72). In contrast, combined models exhibited a higher pooled specificity (0.82 vs. 0.69; p-value = 0.05).

Table 3 Meta-regression and subgroup analysis based on different covariates

Publication bias

No significant publication bias was found in the included studies using Deeks’ asymmetry test (p-value = 0.09) (Fig. 5).
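Deeks' test regresses the log diagnostic odds ratio on the inverse square root of the effective sample size (ESS), weighting each study by its ESS; a slope significantly different from zero suggests funnel plot asymmetry. The sketch below computes only the weighted regression coefficients and omits the p-value. The function names, the default continuity correction of 0.5 added to every cell, and the input counts are assumptions made for illustration.

```python
import math


def effective_sample_size(n_diseased: int, n_healthy: int) -> float:
    return 4.0 * n_diseased * n_healthy / (n_diseased + n_healthy)


def log_dor(tp, fp, fn, tn, correction=0.5):
    # The correction is added to every cell (an assumption of this sketch).
    c = correction
    return math.log(((tp + c) * (tn + c)) / ((fp + c) * (fn + c)))


def deeks_regression(tables, correction=0.5):
    """Weighted least squares of ln(DOR) on 1/sqrt(ESS), weights = ESS.

    `tables` is a list of (tp, fp, fn, tn) tuples.
    Returns (intercept, slope).
    """
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        ess = effective_sample_size(tp + fn, fp + tn)
        xs.append(1.0 / math.sqrt(ess))
        ys.append(log_dor(tp, fp, fn, tn, correction))
        ws.append(ess)
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative tables with identical DOR but different sizes: slope is ~0.
tables = [(30, 10, 10, 40), (60, 20, 20, 80), (15, 5, 5, 20)]
intercept, slope = deeks_regression(tables, correction=0.0)
print(intercept, slope)
```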

Fig. 5 Deeks’ funnel plot for testing publication bias (p-value = 0.09)

Sensitivity analysis

By removing each study one by one, the pooled AUC varied between 0.73 and 0.78, with the latter belonging to the removal of the study by Zhang et al., which used PET radiomics (Table 4). Overall, the pooled values were almost consistent, indicating the robustness of the results.
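The leave-one-out procedure can be sketched as follows. For brevity, the estimator shown is a naive pooled sensitivity (total TP over total TP + FN), which stands in for the random-effects model actually used; the function names and the study data are hypothetical.

```python
def pooled_sensitivity(tables):
    """Naive pooling: a simplified stand-in for the random-effects estimate."""
    tp = sum(t["tp"] for t in tables)
    fn = sum(t["fn"] for t in tables)
    return tp / (tp + fn)


def leave_one_out(tables, estimator):
    """Recompute the pooled estimate with each study omitted in turn."""
    results = {}
    for i, omitted in enumerate(tables):
        remaining = tables[:i] + tables[i + 1:]
        results[omitted["study"]] = estimator(remaining)
    return results


# Illustrative data only.
studies = [
    {"study": "A", "tp": 30, "fn": 10},
    {"study": "B", "tp": 20, "fn": 10},
    {"study": "C", "tp": 10, "fn": 10},
]
loo = leave_one_out(studies, pooled_sensitivity)
for name, value in loo.items():
    print(f"without {name}: {value:.3f}")
```

If the estimates obtained after each omission stay within a narrow range, no single study dominates the pooled result.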

Table 4 Results of the sensitivity analysis

Clinical utility

With a pre-test probability of 20%, a positive radiomics result raised the post-test probability of LNM to 43% (positive likelihood ratio of 3). Conversely, a negative result lowered the post-test probability to 8% (negative likelihood ratio of 0.36) (Fig. 6).
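The arithmetic behind the Fagan plot converts the pre-test probability to odds, multiplies by the likelihood ratio, and converts back to a probability. A minimal sketch, with a hypothetical function name, reproduces the values reported above:

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form, as read off a Fagan nomogram."""
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Pre-test probability of 20% with the likelihood ratios reported in the text.
after_positive = post_test_probability(0.20, 3.0)
after_negative = post_test_probability(0.20, 0.36)
print(f"{after_positive:.0%}, {after_negative:.0%}")
```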

Fig. 6 Fagan plot showing the clinical utility of radiomics models for predicting LNM in esophageal cancer

Discussion

Lymph node metastasis plays a crucial role in esophageal cancer prognosis, particularly impacting early-stage disease due to the anatomical and histological characteristics of esophageal cancer [32]. Esophageal cancer is recognized for its aggressive behavior and frequent lymphatic dissemination, underscoring the pivotal role of lymph node status as a critical factor in predicting patient outcomes. Achieving precise preoperative staging is imperative for informed decision-making and effective management of esophageal cancer. Despite the widespread use of esophageal CT scans in preoperative assessments, their reliability in detecting lymph node (LN) involvement is deemed inadequate. This inadequacy is attributed to disagreements in diagnostic criteria and inherent limitations, including the challenge of identifying metastasis that may not result in noticeable enlargement of the lymph nodes [33]. Although large-scale lymph node (LN) dissection is necessary during surgery, excessive LN dissection is associated with postoperative complications. Therefore, accurate preoperative prediction of LNM can prevent unnecessary lymph node dissection [26]. Recent advances in artificial intelligence in imaging, particularly radiomics, opened up a new horizon in precision medicine [34, 35]. The results of the meta-analysis consisting of nine studies with separate validation cohorts and acceptable overall quality showed that radiomics-based methods have a moderate diagnostic performance (AUC = 0.74) for diagnosing LNM in esophageal cancer. The presence of a geographic bias, with the majority of studies (8 out of 9) originating from China, raises a concern about the representativeness of the evidence. This concentration may introduce regional variations that limit the generalizability of findings to a broader global context. 
The disproportionate focus on a specific geographic region underscores the importance of diversifying study locations to capture a more comprehensive understanding of the subject matter. Future research should strive for a more globally representative sample to ensure the applicability of findings across different populations and settings. In addition, the retrospective design of the included studies is a limitation, as it poses challenges related to data accuracy, potential biases, and establishing causal relationships. Retrospective studies lack prospective data collection and may have incomplete variables. Despite providing insights, their design introduces limitations that should be considered when interpreting findings. Future research could improve validity by incorporating prospective study designs.

Compared to previous meta-analyses in other gastrointestinal cancers, the pooled diagnostic performance was slightly lower in our study. In rectal cancer, a meta-analysis by Bedrikovetski et al. showed that the pooled AUC of radiomics models was 0.808, which is higher than the results of this study [28]. A recently published meta-analysis in gastric cancer showed that CT-based radiomics combined with clinical factors could reach an AUC of 0.90, representing excellent diagnostic accuracy [15]. Another meta-analysis evaluating validation cohorts has shown that radiomics based on MRI and CT might facilitate the diagnosis of LNM in pancreatic ductal adenocarcinoma with a pooled AUC of 0.79 [19]. However, radiomics methods might perform slightly worse in the thoracic and head and neck regions than in the abdominal cavity, as another meta-analysis has shown that CT-based radiomics studies have a pooled AUC of 0.75 for predicting LNM in thyroid cancer [16]. This suggests that the current performance of radiomics studies falls within a fair range of diagnostic accuracy. Such findings highlight the necessity for more refined methodologies and enhanced study designs to improve the diagnostic capabilities of radiomics in identifying LNM in esophageal cancer. Future research should prioritize the standardization of imaging protocols, feature extraction methods, and deep learning algorithms. Additionally, to ensure their generalizability across different populations and clinical settings, it is crucial to train and validate these models using larger and more diverse external datasets.

We drew the following conclusions from the subgroup analysis. First, 2D segmentation appears to perform better than 3D segmentation, at least in terms of sensitivity, a finding previously reported in meta-analyses of thyroid and gastric cancers [15, 16]. This observation can be attributed to several factors. 2D images often offer higher resolution and quality within specific planes, facilitating the detection of subtle features indicative of early disease. The simplicity and focused nature of 2D segmentation enable more precise analysis of certain anatomical features, while the computational efficiency of 2D methods allows for greater optimization during algorithm training. Additionally, the wider availability of annotated 2D data enhances the development of sensitive detection models. Despite the comprehensive spatial insights provided by 3D segmentation, its complexity may hinder the accurate modeling of early-stage disease markers. The choice between 2D and 3D approaches should, therefore, consider the specific clinical needs, the disease in question, and the goals of the imaging analysis [36]. However, only one included study employed 2D segmentation, so this conclusion may be inaccurate and its generalizability is limited. Future research should include more studies utilizing 2D segmentation to better assess its capabilities, limitations, and potential biases.

We also found that manual segmentation outperformed automatic segmentation in terms of sensitivity. However, it should be noted that only one study used automatic segmentation, and further investigation is required, particularly because a previous meta-analysis reported the superiority of automatic segmentation [15]. The scarcity of studies utilizing automatic segmentation limits the available evidence and the generalizability of this finding. Future research should include more studies employing automatic segmentation to clarify its capabilities and limitations.

In addition, the pooled AUC of deep radiomics models was higher than the conventional models. However, the variations were not identified as statistically significant because of the limited number of studies examining this aspect (2 out of 9). The integration of CNNs and deep learning into radiomics has markedly enhanced diagnostic accuracy in medical imaging by automating the extraction of intricate features that may not be visible to the human eye. This advancement allows for the handling of high-dimensional data and the extraction of meaningful patterns, leading to improved disease detection, classification, and prediction capabilities. As these models are trained on large datasets, their diagnostic precision improves, offering potential for personalized medicine through predictive modeling of disease progression and treatment outcomes. Despite challenges such as the need for extensive annotated datasets, potential biases, and the complexity of interpreting deep learning models, this integration represents a significant leap forward in the field of medical imaging, promising more accurate, efficient, and individualized patient care [37,38,39,40,41]. Going forward, it’s crucial to increase the number of deep radiomics studies to get more comprehensive insights and facilitate thorough analyses and meta-analyses.

We also observed that adding clinical factors to radiomics signature can also be considered as a promising method to increase the diagnostic accuracy of the studies. Incorporating clinical factors into radiomics signatures enhances diagnostic accuracy by leveraging a comprehensive patient profile that combines macroscopic clinical data with microscopic imaging features. This integration improves specificity and sensitivity by helping differentiate diseases with similar imaging appearances and supports personalized medicine by accounting for individual variability in disease presentation. Additionally, it aids in accurate risk stratification, allowing for tailored treatment strategies and closer patient monitoring. The approach also enhances the generalizability of models across different populations by incorporating a wider range of predictive variables. Furthermore, aligning radiomics with established clinical practices bolsters the credibility and acceptance of these advanced diagnostic tools within the medical community, ensuring a smoother integration into clinical workflows. The synergy between clinical factors and radiomics signatures thus represents a significant step forward in developing more accurate, personalized, and clinically relevant diagnostic methodologies [42].

We have also shown that PET radiomics methods are not superior to CT and MRI models, although more studies are required before a firm conclusion can be drawn from the comparison with CT-based methods. Few studies used MRI or PET: the majority (7 out of 9) relied on CT, one on PET, and only one on MRI. This imbalance limits the insights that can be gained about MRI and PET radiomics for this application. Considering the potential superiority of MRI in terms of performance [43,44,45], its diagnostic accuracy needs more extensive evaluation in future research. This would provide a clearer picture of the comparative effectiveness of different imaging modalities.

Regarding model construction algorithms, we observed that AdaBoost had a significantly higher sensitivity than LR. A recent meta-analysis suggests that using more advanced machine learning algorithms such as support vector machines and AdaBoost can improve the results significantly, which is supported by our results [21]. AdaBoost, a machine learning algorithm that combines multiple weak classifiers to form a strong classifier, has shown significantly higher sensitivity in detecting specific conditions or characteristics from medical images compared to LR, a more traditional method widely applied in radiomics studies. This difference in performance can be attributed to AdaBoost’s ability to adaptively focus on the most challenging cases in the training dataset, thereby improving its ability to generalize from complex, high-dimensional imaging data. In contrast, LR, although powerful in its simplicity and interpretability, might struggle with the complex and high-dimensional nature of radiomic data. This adaptability of AdaBoost, coupled with its ability to handle a wide range of data distributions and its robustness to overfitting, likely contributes to its superior performance in sensitivity, as supported by both recent meta-analyses and empirical results [46, 47].

Regarding feature selection, we found that elastic net and feature-wise attentional graph neural networks might perform better than LASSO. Both elastic net and LASSO are regularization techniques used in linear regression: LASSO imposes sparsity by shrinking some coefficients to exactly zero, whereas elastic net combines the LASSO and ridge regression penalties to provide a more balanced selection of variables [48, 49].
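The difference between the two penalties can be written out explicitly. The formulation below is one common parameterization for a linear model with a squared-error loss and is given only as an illustration; the symbols (n samples, design matrix X, outcome y, coefficients β, penalty strength λ, mixing parameter α) are notation assumed here, and radiomics studies typically replace the squared-error term with the negative log-likelihood of a logistic model.

```latex
\begin{align*}
\hat{\beta}_{\mathrm{LASSO}} &= \arg\min_{\beta} \; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \,\lVert \beta \rVert_1 \\
\hat{\beta}_{\mathrm{EN}} &= \arg\min_{\beta} \; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \,\lVert \beta \rVert_1 + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right),
  \qquad 0 \le \alpha \le 1
\end{align*}
```

Setting α = 1 recovers LASSO and α = 0 recovers ridge regression; intermediate values retain sparsity while tending to keep groups of correlated features together, which is relevant because radiomic features are often highly correlated.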

To allow comparison with previous meta-analyses, we assessed the overall quality of the included articles using the RQS tool, which is commonly used in systematic reviews of radiomics studies. The included studies received a mean score of 12.78, corresponding to 35% of the total possible score. This score is in line with previous meta-analyses [15, 50] and indicates that the included studies were of acceptable quality, a conclusion also supported by the QUADAS-2 assessment. However, following the development since 2023 of newer quality assessment tools for radiomics and artificial intelligence research, such as CLEAR and METRICS, we strongly recommend adopting these tools instead of RQS in future radiomics meta-analyses. The CLEAR checklist (CheckList for EvaluAtion of Radiomics research) is a structured set of recommendations aimed at enhancing the transparency and quality of reporting in radiomics research. It stresses thorough documentation throughout every phase of a study, from data collection and image processing to feature extraction and statistical analysis. What sets CLEAR apart from RQS is its broader focus on reporting standards rather than solely on methodological quality. By advocating for the transparent sharing of data, scripts, and models, CLEAR addresses the need for reproducibility and validation in radiomics. Additionally, it offers specific guidance on how to report the workflow of radiomics studies, which is often overlooked. This approach facilitates comparison, replication, and extension of radiomics research and aims to strengthen the credibility and impact of findings within the field. 
On the other hand, METRICS (METhodological RadiomICs Score) is a novel scoring tool designed to assess the methodological quality of radiomics research, developed through a collaborative effort involving a large international panel of experts. Unlike existing tools such as the RQS, METRICS offers several advantages. Firstly, it incorporates input from a diverse group of experts through a modified Delphi process, ensuring a comprehensive and consensus-driven approach to evaluating research quality. Secondly, METRICS assigns weights to different categories and items based on expert rankings, providing a nuanced and transparent assessment framework. Thirdly, METRICS covers a wide range of methodological variations, including both traditional radiomics and deep learning-based approaches, making it applicable to diverse research contexts. Finally, METRICS is accompanied by a user-friendly web application and a repository for community feedback, facilitating its adoption and continuous improvement. Overall, METRICS represents a significant advancement in the field, offering a robust and adaptable tool for enhancing the methodological rigor of radiomics research [51, 52].

Although one study was rated at high risk of bias in the reference standard domain, this does not significantly compromise the overall reliability of our meta-analysis findings. The QUADAS-2 tool was applied rigorously, and the majority of included studies demonstrated acceptable quality across the assessed domains. High risk of bias is not uncommon in diagnostic accuracy studies, and variations in study design can contribute to it. Importantly, similar meta-analyses often report multiple instances of high risk of bias across various domains, so having only one study at high risk of bias in a single domain is relatively favorable [21, 53, 54].

However, a moderate degree of heterogeneity was observed for the pooled specificity based on Higgins’ I² statistic. Meta-regression indicated that integrating clinical factors with radiomics signatures might explain part of the interstudy heterogeneity, as the diagnostic performance of combined models was higher. The results for pooled sensitivity were consistent, and Higgins’ I² did not indicate significant heterogeneity. In addition, no significant publication bias was observed based on Deeks’ test. The absence of significant publication bias in a diagnostic test accuracy meta-analysis suggests that studies with positive and negative results were similarly likely to be published. This makes the findings more representative and reliable, reduces the risk of overestimating the test’s accuracy, and supports better-informed clinical decisions with improved generalizability across different populations and settings.
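For readers less familiar with the heterogeneity statistic, the sketch below shows how Higgins’ I² is conventionally derived from Cochran’s Q using inverse-variance weights. It is an illustration only, not the software routine used in this meta-analysis; the function names and the two-study input are hypothetical, and in diagnostic accuracy work the estimates would typically be logit-transformed sensitivities or specificities.

```python
def cochran_q(estimates, variances):
    """Cochran's Q and the fixed-effect pooled estimate (inverse-variance weights)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, pooled

def i_squared(q, n_studies):
    """Percentage of total variation attributed to between-study heterogeneity."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0

# Hypothetical study-level estimates and their variances (not data from this review).
toy_estimates = [0.0, 1.0]
toy_variances = [0.25, 0.25]
q_toy, pooled_toy = cochran_q(toy_estimates, toy_variances)
i2_toy = i_squared(q_toy, len(toy_estimates))
```

When Q is no larger than its degrees of freedom, I² is zero, which corresponds to the situation reported above for pooled sensitivity; larger values indicate that more of the observed variation reflects genuine differences between studies rather than sampling error.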

Although the pooled AUC in this study was 0.74, after removing the study that used PET-based radiomics (Zhang et al.) [27], the pooled AUC of the remaining studies (MRI and CT modalities) increased to 0.78, suggesting that CT- or MRI-based radiomics may offer better diagnostic performance.

Limitations

This study has several limitations. First, our commitment to methodological rigor led us to exclude studies lacking separate validation cohorts from the meta-analysis. Studies relying solely on training cohorts or cross-validation may overestimate diagnostic accuracy, introducing a risk of overfitting and limiting the generalizability of results. This decision underscores the importance of assessing diagnostic models in independent datasets to ensure their applicability across diverse patient populations and clinical settings. Second, we used the RQS tool to assess study quality, which enhances comparability with other studies; however, we recommend that future researchers consider newer tools such as METRICS and CLEAR for more comprehensive assessments. Third, when extracting data, we selected the model with the best diagnostic performance from the available options, which may have led to an overestimation of the pooled sensitivity and specificity of radiomics for LNM in esophageal cancer. Fourth, excluding studies published in Chinese could introduce bias by omitting potentially relevant data and perspectives, particularly given China’s significant research output. This exclusion may skew the overall understanding of the topic and introduce publication bias, as studies with statistically significant results may be more likely to be published in English-language journals. Researchers should therefore carefully consider the implications of language-based exclusion criteria for the robustness and generalizability of their findings. Finally, we acknowledge the limitation of pooling all imaging modalities (MRI, PET, and CT) together. While this approach allows a comprehensive assessment of radiomics across imaging techniques, it can also introduce heterogeneity due to differences in image acquisition protocols, resolution, and contrast. 
However, subgroup analysis was performed to explore the possible sources of heterogeneity.

Conclusion

This meta-analysis consolidates evidence on radiomics for predicting LNM in esophageal cancer, showcasing its potential diagnostic value. Despite identified heterogeneity and specific challenges, radiomics demonstrates promise in enhancing esophageal cancer staging. To integrate radiomics-based predictions into clinical workflows for esophageal cancer management, further research and development should prioritize refining radiomics models tailored to esophageal cancer, alongside standardized imaging protocols and data-sharing initiatives. Validation through external testing on diverse datasets is essential to ensure the reliability and generalizability of radiomics models. Establishing guidelines for integration into clinical practice, developing decision support tools, and fostering interdisciplinary collaboration between radiologists, oncologists, and surgeons are crucial steps. Prospective clinical trials are needed to evaluate the impact of radiomics on patient outcomes, and radiomics models must be continuously evaluated and improved to keep pace with technological advancements and clinical needs. Notably, if its diagnostic performance reaches a level comparable to that of the gold standard, which is histopathological examination of surgically resected lymph nodes, radiomics would have the potential to replace it for predicting LNM in esophageal cancer. However, we acknowledge that the field is not currently at that stage, and further studies are required to achieve this level of performance.