Introduction

Hepatocellular carcinoma (HCC) represents the most prevalent form of primary liver cancer, ranking sixth globally in cancer incidence and third in cancer-related mortality. In 2020, there were approximately 906,000 new cases and 830,000 deaths attributed to HCC worldwide [1]. Recognized for its highly aggressive nature, HCC is associated with a generally poor prognosis. Age-standardized 5-year survival rates exhibit regional variations, typically ranging from 5% to 30%, and falling within 10% to 19% in most countries [2]. Therapeutic advancements have contributed to an enhanced outcome for HCC patients, with observed increases of 5% to 10% in age-standardized 5-year survival rates across different regions from 1995 to 2014 [2, 3]. Current therapeutic strategies highlight optimal management through comprehensive approaches, primarily centered around surgical interventions [4,5,6]. Non-surgical treatments, including transarterial chemoembolization, hepatic arterial infusion chemotherapy [7], ablations [8,9,10], radiotherapy [11], and systemic therapies, among others, have provided opportunities to improve the prognosis, particularly for early-stage patients unsuitable for operations and those with advanced unresectable HCC.

Transarterial chemoembolization (TACE) stands as a common locoregional treatment for liver cancer. Depending on the therapeutic agents employed, it can be broadly classified into conventional TACE (cTACE), utilizing a mixture of iodized oil and chemotherapeutic drugs, and drug-eluting bead TACE (DEB–TACE) [12]. TACE is considered the preferred therapeutic modality for liver-limited diseases deemed unsuitable for surgical resection [4], especially recommended for intermediate-stage tumors according to the Barcelona clinic liver cancer (BCLC) classification [5]. Notably, TACE finds application in various scenarios, such as the management of liver cancer with localized portal vein tumor thrombosis with incomplete obstruction, bridging therapy for liver transplantation, postoperative adjuvant therapy for patients at high risk of recurrence, and conversion therapy for unresectable tumors [4,5,6, 13], among others. In real-world practice, TACE is commonly employed in patients with recurrent or progressed HCC and is often repeated following the initial treatment [12]. This renders the patient selection and efficacy evaluation susceptible to the influences of previous or concurrent anti-tumor treatments. Consequently, the choice and application of TACE heavily depend on the expertise and preferences of physicians, lacking well-recognized supportive informative tools.

The assessment of tumor response and long-term prognosis subsequent to TACE primarily relies on radiological criteria, such as the response evaluation criteria in solid tumors (RECIST) [14] and modified response evaluation criteria in solid tumors (mRECIST) [15, 16]. According to these criteria, the objective response rates (ORR) after TACE exhibit notable diversity across different reports, ranging broadly from 30% to 80% [17,18,19,20,21]. This substantial variability may be linked to factors such as distinct baseline patient characteristics, the intricacy of background therapies before TACE, and variations in TACE techniques. Consequently, predicting prognosis following TACE could enhance the process of individualized patient selection, ultimately improving clinical outcomes [13]. Researchers have explored diverse indicators for post-TACE prognosis, encompassing radiological features [22, 23], clinico-pathological factors [24], serum markers [25], and genomic alterations [26], among others. However, these predictive tools remain hindered by limitations such as insufficient predictive power, lack of validations, and the associated trauma or cost related to these tests.

First systematically defined by Lambin et al. in 2012 [27], radiomics is a technique leveraging computer programs to explore and analyze high-throughput quantitative features from medical images. Since its inception, the influence of radiomics has progressively expanded. Radiomics has primarily been applied to aid in cancer diagnosis and prognostic prediction, offering robust support for precision medicine. By transforming medical images into a high-dimensional space of features for analysis and model construction, radiomics contributes to an enhanced understanding of medical images. In response to the unmet demand for personalized TACE, many researchers have developed radiomics models to predict prognosis after TACE. These studies have analyzed data from various imaging modalities, including magnetic resonance (MR) [21, 28,29,30,31,32,33,34,35,36], computed tomography (CT) [20, 37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53], as well as ultrasound (US) [54], reporting promising results. Recognizing the significance of developing reliable tools for prognostic prediction in TACE, we deem it timely and crucial to summarize the predictive performance and potential for clinical translation of radiomics models in this task. Therefore, this systematic review and meta-analysis aimed to determine the value of radiomics in predicting the prognosis of TACE treatment.

Methods

This systematic review adheres to the preferred reporting items for systematic review and meta-analysis of diagnostic test accuracy studies (PRISMA-DTA) guidelines [55] as well as A MeaSurement Tool to Assess systematic Reviews 2 (AMSTAR 2) [56]. The protocol was prospectively registered on PROSPERO (ID: CRD42023449278).

Literature retrieval

A computerized search was executed on PubMed, Web of Science and Embase databases to identify original studies implementing radiomics analysis of preprocedural images to predict the therapeutic outcomes of TACE. The search strategy was devised following the PICO (participants, intervention, control, outcome) principles [55] to facilitate a sensitive screening for relevant studies. The search terms encompassed four perspectives, including "hepatocellular carcinoma or liver cancer," "radiology or radiomics," "TACE," and "prognosis." The search scope included literature published up to May 16, 2023. Detailed search conditions are provided in Supplementary Table 1.

Study inclusion

After eliminating duplicates, the remaining literature underwent a comprehensive review based on the following inclusion criteria: (1) population: patients diagnosed with HCC through clinical or pathological confirmations; (2) intervention: initial TACE treatment (without prior history of TACE treatment); (3) index test: radiomics analysis/high-throughput analysis/artificial intelligence analysis based on preprocedural medical imaging; (4) outcomes: prognosis after TACE, including tumor response, overall survival, recurrence, progression, etc. Studies were then excluded based on the following criteria: (1) studies involving non-human subjects; (2) conference abstracts, reviews, case reports, commentaries, and other non-original publications; (3) studies not published in English; (4) studies that did not report a standalone radiomics model, i.e., one built without clinical features; (5) studies that did not provide sufficient raw data for qualitative or quantitative review.

The identified articles underwent an initial screening based on titles and abstracts. Subsequently, a full-text review was conducted for potentially eligible articles. The literature was reviewed by two authors (KGD, FY) with over four years of experience in hepatic surgery and medical imaging analysis. Any uncertainties were resolved through consensus in an author team, including a senior author with over fifteen years of experience in the management of liver cancers and meta-analysis (YCZ). When sufficient data were still unavailable after the full-text review, attempts were made to acquire the original data by contacting the corresponding authors. Endnote (version 20, Clarivate Analytics, England) was utilized for all literature management.

Data extraction

Predefined information was systematically extracted from the included literature according to the following categories: (1) bibliographic features, including authors, publication year, country, and study type; (2) patient characteristics, encompassing (2–1) demographic features such as sample size, age and gender; (2–2) etiological features, such as hepatitis viral infection status and liver cirrhosis; (2–3) clinical-pathological features, including pathological diagnosis and tumor staging; (3) procedural information related to TACE; (4) pre-treatment radiological examination details, covering imaging modalities, instrument parameters, and phases or sequences of the images; (5) post-treatment follow-up information, comprising follow-up endpoints, assessment criteria, follow-up intervals, and follow-up examinations; (6) methodological characteristics of radiomics analysis, involving image segmentation methods, feature extraction methods, modeling algorithms, predictive outcomes, etc.; (7) model validation methods; (8) model performance parameters, such as sensitivity, specificity, concordance index, area under the curve (AUC), number of true positive (TP), false positive (FP), true negative (TN), false negative (FN) quantities, etc.; and (9) other relevant information. Detailed data extraction strategies are presented in Supplementary Table 2.

In instances where a single study involved the construction of multiple radiomics models, the model demonstrating the highest performance was selected for data extraction. When a study presented results for both training and validation sets [57], outcomes were extracted separately from each dataset. If an article reported both pure radiomics models and combined models incorporating additional clinical features, only results from the pure radiomics models were extracted, while the supplementary clinical features were retained for study quality evaluation.

Quality assessment

The quality of the included studies was evaluated using the quality assessment of diagnostic accuracy studies 2 (QUADAS-2) [58], the radiomics quality score (RQS) [59] criteria, and the METhodological RadiomICs Score (METRICS) [60]. QUADAS-2 was employed to assess the risk of bias across four domains: patient selection, index test, reference standard, and flow and timing within the studies. RQS is a specialized rating framework for radiomics, covering sixteen aspects such as image acquisition, feature extraction, modeling methods, model validation, and clinical applications. METRICS is a newly proposed assessment tool for radiomics studies, based on international consensus and a well-constructed framework that includes ranking and weighting of methodological variations (refer to Supplementary Table 3 for details).

Throughout the data extraction and quality assessment process, a systematic approach was adopted. Initially, a training phase was initiated, during which three authors (KGD, ZJL, YCZ) independently reviewed three included papers. These authors aligned their interpretations of each item in the data extraction forms and the quality rating forms, with all definitions ultimately confirmed by senior author YCZ. Subsequently, two authors (KGD, ZJL) independently conducted data extraction and quality assessment. METRICS scores were assessed using the convenient web-based tool at https://metricsscore.github.io/metrics/METRICS.html. The final electronic datasheet integrated information unanimously confirmed by both authors.

Statistical analysis

Statistical analysis and visualization were carried out using Microsoft Office Excel 2019, RevMan 5.3.5 [61], and Stata Statistical Software 17 (StataCorp, TX, USA) [62]. 2 × 2 diagnostic contingency tables, containing the numbers of TP, FP, FN, and TN, were extracted wherever possible. Meta-analysis was conducted using the Stata midas package [63]. Pooled sensitivity, specificity, positive likelihood ratio (PLR), and negative likelihood ratio (NLR) were calculated as synthetic statistics. A PLR > 10 together with an NLR < 0.1 indicates high informative value, whereas a PLR within 5–10 together with an NLR within 0.1–0.2 indicates moderate informative value. Cochran’s Q-test and I² were utilized to assess inter-study heterogeneity, with I² > 50% or P < 0.1 indicating the presence of heterogeneity. The Galbraith method was employed to identify the impact of outliers. The summary receiver operating characteristic curve (sROC) was used to evaluate the overall predictive performance of different radiomics models. An AUC of 0.5–0.7, 0.7–0.9, and > 0.9 indicates low, moderate, and high predictive power, respectively [64]. Meta-regression was used to explore the relationship between major methodological factors and inter-study heterogeneity, followed by further subgroup analyses. Deeks’ funnel plot and asymmetry test were applied to evaluate publication bias. Finally, the clinical application of radiomics models was assessed. The calculated average ORR in the included studies was used as the prior probability, and the post-test probability was then calculated using a Fagan plot based on the pooled PLR and NLR and Bayes’ theorem.
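The quantities named above can be illustrated with a minimal Python sketch. It is not the Stata midas implementation used in this review (midas fits a bivariate model to obtain pooled estimates); it only shows how per-dataset sensitivity, specificity, PLR, and NLR follow from a single 2 × 2 table, how the informative-value thresholds can be applied, and how I² is derived from Cochran’s Q using the standard Higgins formula. The function names, the example counts, and the reading of the thresholds as joint conditions are assumptions made for illustration.

```python
import math


def diagnostic_metrics(tp, fp, fn, tn):
    """Per-dataset accuracy measures from one 2 x 2 contingency table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Likelihood ratios; guard against division by zero at perfect/zero specificity.
    plr = math.inf if specificity == 1 else sensitivity / (1 - specificity)
    nlr = math.inf if specificity == 0 else (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "plr": plr,
        "nlr": nlr,
    }


def informative_value(plr, nlr):
    """Apply the thresholds quoted in the text (assumed to be joint conditions)."""
    if plr > 10 and nlr < 0.1:
        return "high"
    if plr >= 5 and nlr <= 0.2:
        return "moderate"
    return "low"


def i_squared(q, df):
    """Higgins I^2 (%) from Cochran's Q and its degrees of freedom (k - 1 studies)."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Hypothetical counts, not taken from any included study.
example = diagnostic_metrics(tp=45, fp=5, fn=5, tn=45)
print(example, informative_value(example["plr"], example["nlr"]))
print(i_squared(q=40.0, df=10))
```

Under this reading of the thresholds, a model with PLR 6.13 and NLR 0.20 would be classed as moderately informative.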

Results

Literature retrieval, selection and data extraction

The literature retrieval process identified a total of 645 relevant articles, and 203 duplicated articles were subsequently removed. From the remaining 442 studies, scrutiny of titles and abstracts led to the exclusion of 396 articles that did not align with the language, study type, or PICO criteria specified for our review. After a thorough review of full texts, an additional 17 articles were excluded, resulting in the final inclusion of 29 articles (Fig. 1).

Fig. 1

PRISMA flowchart illustrating the literature selection process in this study.

Summarized information for all 29 included studies is presented in Table 1, covering a total of 5483 patients. Details of the studies are presented in Supplementary Table 4, with the primary research features illustrated in Supplementary Fig. 1. The majority of the articles originated from China (24 articles, 82.8%). The included studies date back to 2016, with a progressive increase in the number of publications, peaking at 19 studies in 2021–2022. All included studies adopted a retrospective design. Most studies utilized the BCLC staging system to determine the inclusion of patients, while others used the China liver cancer staging (CNLC) system [32] or other criteria [21, 31, 40, 43, 48]. Mid-stage patients were the most extensively studied population, with 21 articles (in whole or in part) including BCLC-B stage patients [20, 28,29,30, 33, 37, 41, 28,29,30, 49, 53], followed by BCLC-A (13 articles) and BCLC-C stage (9 articles) patients.

Table 1 Baseline characteristics of the included studies

In terms of imaging modalities, all studies utilized contrast-enhanced examinations. The most frequently investigated imaging modality was contrast-enhanced CT [20, 36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53], followed by contrast-enhanced MR [21, 28,29,30,31,32,33,34,35], with only one article exploring contrast-enhanced ultrasound [54]. Various phases or sequences of contrast-enhanced images were studied, with 18 articles incorporating single-phase/sequence images, while the remaining literature analyzed multi-phase/sequence images. The most commonly utilized phases for radiomics analysis were the arterial phase (AP) [21, 39, 42, 44, 46, 49] and the portal venous phase (PVP) [31, 34, 36, 38, 48, 50]. Regarding MR-based radiomics, T2WI was the most frequently analyzed sequence [29, 32, 35]. Several studies compared radiomics models constructed from images of different phases/sequences, but the conclusions were inconsistent. Some indicated that multi-phase/sequence models were superior [20, 30, 42] in performance, while others demonstrated that single-phase/sequence models were comparable to, or even better than, the multi-phase/sequence models [35, 48]. In addition, comparative studies between different single-phase models suggested that the prognostic value of PVP-based models might be superior to AP-based models [31, 34, 48].

The majority of studies employed 3D volumes of interest (VOIs) for analysis, with only two articles utilizing 2D regions of interest (ROIs) based on the maximum tumor section [46, 47]. Most studies only delineated the tumor regions, while a few explored radiomics features in the peritumor areas [36, 37, 45]. Song et al. suggested that the prognostic value of radiomics models based on tumor regions plus peritumor extensions was not as good as that of models considering solely the tumor regions [34]. Twenty-eight articles reported the algorithms used for feature selection and model construction, with the majority employing machine learning (ML) algorithms. Five articles adopted deep learning (DL) algorithms in the modeling process [32, 33, 39, 46, 54]. A total of 44 independent datasets (training sets or validation sets) reported both the sample sizes of the cohorts and the number of features in the predictive models. The median sample-size-to-feature-number ratio was 9.30 (P25–P75: 5.16–12.43), with only 19 datasets having a value over 10.
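As a rough illustration of how the sample-size-to-feature-number ratio can be summarized, the sketch below computes the ratio per dataset, its median and quartiles, and the number of datasets above 10. The (sample size, feature count) pairs are hypothetical and are not data from the included studies; the quartile method (`method="inclusive"`) is also an assumption, since the interpolation rule behind the reported P25–P75 is not stated.

```python
from statistics import median, quantiles

# Hypothetical (sample size, number of features in the final model) pairs.
datasets = [(120, 10), (85, 17), (200, 8), (60, 12)]


def samples_per_feature(n_samples, n_features):
    """Ratio of cohort size to the number of features retained in the model."""
    if n_features <= 0:
        raise ValueError("a model must contain at least one feature")
    return n_samples / n_features


ratios = [samples_per_feature(n, k) for n, k in datasets]

# Quartiles with linear interpolation over the observed range (assumed method).
p25, p50, p75 = quantiles(ratios, n=4, method="inclusive")
n_over_10 = sum(r > 10 for r in ratios)

print(f"median={median(ratios):.2f}, P25-P75={p25:.2f}-{p75:.2f}, >10: {n_over_10}")
```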

Quality assessment

A quality assessment was conducted for the included studies. When assessing the risk of bias using QUADAS-2, prevalent biases were identified, originating from unclear case selection procedures (whether all patients were included consecutively), the absence of specified assessment criteria for diagnostic models (cut-off values), and the lack of validation of established models in independent datasets (Fig. 2a, Supplementary Fig. 2A). The latter two factors were also the primary sources of concerns about the studies' applicability.

Fig. 2

Quality assessment of included articles. A The bias risk assessment of included studies utilizing the QUADAS-2 scale. B RQS and METRICS scores of included articles. C Relationship between METRICS and RQS for each included study, illustrated by scatter plot

The summary of study quality according to RQS and METRICS is depicted in Fig. 2b, c, with detailed results presented in Supplementary Fig. 2B and Supplementary Table 5. The overall RQS across all 29 included studies averaged 12.90 ± 5.13 (35.82% ± 14.25%). METRICS scores averaged 62.98% ± 14.58% and correlated positively with RQS (Fig. 2c). Generally, most studies demonstrated satisfactory quality in terms of reporting imaging strategies, multiple segmentations, integrating non-radiomics clinical variables, assessing the model’s discriminative power and calibration, model validation, and disclosing radiomics features within the model. Using the METRICS assessment, there were 2, 14, 10 and 3 studies meeting the criteria of “excellent”, “good”, “moderate” and “low”, respectively.

Study data synthesis (Meta-analysis)

Pooled predictive performance of radiomics for TACE response

A total of 23 independent datasets from 14 studies provided sufficient information to extract a complete 2 × 2 contingency table and thus entered the meta-analysis. The included datasets comprised 12 training datasets (with or without resampling validation), including 1628 patients, and 11 independent validation datasets (internal or external validation), comprising 815 patients. In all of these datasets, the predictive endpoint of radiomics models was the tumor response after TACE, defined by objective response (OR) according to the RECIST [14] or mRECIST [15] criteria in the majority of the studies. All datasets included in the meta-analysis and their key methodological information are outlined in Table 2. For the total of 23 datasets, a synthetic analysis of predictive performance was conducted on 2443 subjects (Fig. 3a–e, Supplementary Table 6). The pooled sensitivity was 0.83 (95% CI: 0.78–0.87) (Fig. 3a), the pooled specificity was 0.86 (95% CI: 0.79–0.92) (Fig. 3b), and the pooled PLR and NLR were 6.13 (95% CI: 3.79–9.90) (Fig. 3c) and 0.20 (95% CI: 0.15–0.27) (Fig. 3d), respectively. The AUC of the sROC was 0.90 (95% CI: 0.87–0.93) (Fig. 3e). Heterogeneity tests revealed I² exceeding 70% for sensitivity, specificity, PLR, and NLR, with Q-test P-values below 0.01, indicating significant heterogeneity (Supplementary Table 6). A Galbraith plot further suggested a limited and symmetrical impact of outliers on the result of the meta-analysis (Fig. 4a).

Table 2 Major results and methodological features of the studies in meta-analysis
Fig. 3

Summarized and pooled performance of radiomics models for predicting TACE response. A Pooled sensitivity. B Pooled specificity. C Pooled positive likelihood ratio. D Pooled negative likelihood ratio. E The summary receiver operating characteristic curve

Fig. 4

Heterogeneity among included studies. A Impact of outliers on the meta-analysis, illustrated by Galbraith plot. B Deeks’ funnel plot and asymmetry test, indicating no significant publication bias. C, D Impact of methodological factors on (C) pooled sensitivity and (D) specificity of radiomics models, according to meta-regression

Origin of heterogeneity and subgroup analysis

Meta-regression analysis was performed based on the key methodological parameters, which incorporated six variables: imaging modalities (CT/MR/US), training or validation datasets, modeling algorithms (ML/DL), imaging phases (single-phase/multi-phase), single imaging phase (AP/PVP), and the inclusion of peritumoral features (Yes/No). In relation to pooled sensitivity, noteworthy inter-subgroup variances were identified between training sets and validation sets, MR studies and other studies, ML models and DL models, single-phase models and multi-phase models, as well as AP models and PVP models (Fig. 4c, Supplementary Table 7). Concerning specificity, significant disparities were discerned between MR studies and other studies, and between ML models and DL models (Fig. 4d, Supplementary Table 7).

Subsequent subgroup analyses were conducted for variables deemed significant in the meta-regression. Because few studies reported single-phase AP models and PVP models, subgroup analyses were performed on the remaining factors, and pooled sensitivity, specificity, and heterogeneity within each subgroup were calculated (Table 3). Inter-study heterogeneity was significantly lower among the independent validation datasets compared to the training datasets, while comparable model performance was reported (I²: 51.32% vs. 82.11%, Supplementary Fig. 3). In the 11 validation datasets, further subgrouping was based on imaging modalities, imaging phases and modeling algorithms. Inter-study homogeneity with I² < 50% was observed in the MR subgroup, ML subgroup, and the multi-phase model subgroup (Supplementary Fig. 4–6). Furthermore, the five studies combining CT and ML also showed satisfactory homogeneity (I²: 45.07%, Supplementary Fig. 7).

Table 3 The pooled results from subgroup analysis

Evaluation of publication bias and clinical interpretation

The funnel plot disclosed no significant evidence of publication bias among the studies included in the analysis (P = 0.64, see Fig. 4b). Of the 20 datasets extracted from 11 studies that reported ORR following TACE, the weighted average ORR was determined to be 51.24% (Supplementary Table 8). Using this average ORR as the pre-test probability, the posterior probabilities of achieving an objective response were calculated to be 87% and 17% when the radiomics models yielded positive and negative predictions, respectively (Fig. 5a). The likelihood ratio matrix illustrated that the majority of studies failed to attain optimal PLR and NLR, with only three records falling within the upper left quadrant, indicative of high predictive informativeness (Fig. 5b).
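The post-test probabilities read from the Fagan plot can be reproduced by applying Bayes’ theorem in odds form: post-test odds equal pre-test odds multiplied by the likelihood ratio. The sketch below uses the weighted average ORR (51.24%) as the pre-test probability and the pooled PLR (6.13) and NLR (0.20) reported in this review; the function name is an illustrative assumption, and the calculation ignores the uncertainty (confidence intervals) around the pooled estimates.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Bayes' theorem in odds form: post-test odds = pre-test odds x LR."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


prior = 0.5124  # weighted average ORR used as pre-test probability
p_positive = post_test_probability(prior, 6.13)  # radiomics model predicts response
p_negative = post_test_probability(prior, 0.20)  # radiomics model predicts non-response

print(f"positive prediction: {p_positive:.0%}")  # rounds to 87%
print(f"negative prediction: {p_negative:.0%}")  # rounds to 17%
```

The two values round to 87% and 17%, matching the posterior probabilities shown in Fig. 5a.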

Fig. 5

Clinical utility of the radiomics models. A Fagan plot illustrating the posterior probabilities of the radiomics models. B Likelihood ratio matrix illustrating the informative value of each individual radiomics model

Discussion

This systematic review encompassed a total of 29 primary studies, involving 5483 patients, with 14 articles integrated into the meta-analysis, constituting 2691 patients. The studies showed noteworthy regional disparities, as 25 studies (86.2%) came from East Asia, of which 24 were from China (82.8%), including a total of 5187 Chinese patients (2640 Chinese patients in the meta-analysis). This observation aligns with the high incidence of HCC in East Asia [1] and the prevalent utilization of TACE within this region [6]. Additionally, this regional concentration resulted in a high proportion of viral-hepatitis-related HCC cases in this review (Supplementary Table 4). We further performed an exploratory bibliometric analysis by searching the Web of Science core database. The findings revealed that 42.3% of publications within the purview of "radiomics" originated from China. Similarly, Chinese researchers contributed 30.2% and 31.4% of the studies on "HCC" and "TACE", respectively, establishing China as the foremost source of studies in these domains. From the perspective of practical implication, radiomics tools may confer cost advantages and greater convenience compared to molecular-pathological tests, such as next-generation sequencing, especially in middle-income countries like China. Nevertheless, this does not diminish the potential of these studies for application in global HCC management and precision medicine.

Among the included articles, intermediate-stage HCC patients, predominantly BCLC-B stage patients, constituted the main cohorts, aligning with the prevailing indications for TACE [5, 13]. The majority of the studies examined the predictive value of cross-sectional imaging, encompassing CT and MR, both acknowledged as diagnostic modalities for HCC [65]. This alignment with traditional clinical and radiological practice might provide the basis for convenient application of these radiomics models. Since 2018, a discernible upswing in the number of studies in this field has been noted, and the studies showed considerable consistency in research patterns, indicating a trend of maturation and standardization of radiomics research [59, 60].

To our knowledge, this is the first systematic review and meta-analysis of the predictive efficacy of radiomics in determining the prognosis of TACE. Previous investigations endeavored to harness pre-treatment conventional imaging features for predicting TACE outcomes. Parameters such as tumor size, location, enhancement patterns, enhancement heterogeneity, and intraprocedural digital subtraction angiography (DSA) characteristics were explored [66, 67]. Additionally, established liver imaging assessment tools, like the liver imaging reporting and data system (LI-RADS), were also employed for this purpose [68]. Though these traditional imaging features had demonstrated some correlations with TACE response, they were not specifically tailored tools for this purpose, resulting in imprecise prediction and limited clinical value. Consequently, radiomics methodologies emerge as a promising avenue in this context. Radiomics has demonstrated substantial potential across various oncological domains since its inception. Previous systematic reviews and meta-analyses have consolidated evidence of its efficacy in diverse areas, such as predicting microvascular invasion (MVI) in HCC [69, 70], post-radiotherapy survival in non-small cell lung cancer (NSCLC) [71], and differential diagnosis and prognosis prediction for renal cell carcinoma (RCC) [72]. Diverging from preceding meta-analyses, our study placed particular emphasis on the generalizability of radiomics models. We independently analyzed outcomes from both the datasets utilized for model training and the separate datasets designated for validation. The combined results from validation sets did not show inferior model performance compared with the training sets. This finding lays the groundwork for future endeavors in validating and applying these radiomics models. Of note, utilization of public datasets for model testing is a practical way to enhance the generalizability of radiomics models. For the specific task explored in this meta-analysis, the HCC-TACE-SEG dataset from The Cancer Imaging Archive (TCIA) might provide a suitable external validation cohort [73], which is worth exploiting in future studies.

In this systematic review, we utilized QUADAS-2 [58], RQS [59] and METRICS [60] to appraise the quality of radiomics studies. In general, the included studies showed moderate to good quality. In 2020, Ursprung et al. summarized the role of radiomics in RCC, reporting an average RQS of 3.41 ± 4.43 for 57 articles [72]. In contrast, this systematic review, comprising 29 studies, reported a substantially higher average RQS of 12.90 ± 5.13, reflecting the advancements in this field in recent years. Further analysis of the 16 scoring categories in RQS revealed the most pronounced disparity in the "validation" category, while the lowest scores were noted in three domains: "detect and discuss biological correlates," "prospective study registered in a trial database," and "cost-effectiveness analysis." This discrepancy may be partly attributed to our specific inclusion criteria, where prospective studies could be constrained by the intricacies of treatment indications [12], and the joint analysis with other biomarkers might be impeded by the absence of reliable predictive biomarkers for TACE efficacy. However, it is noteworthy that integrating cost-effectiveness analysis could play a pivotal role in the subsequent validation and application of radiomics models, particularly in evaluating methodological factors such as the workforce and time required for image segmentation, additional resources for multi-phase image segmentation, and net benefits compared to traditional imaging examinations. In addition, we evaluated the studies with METRICS, a novel assessment tool that is more detailed and balanced than RQS. For most included studies, further improvement could be made in handling clinical confounding factors, appropriate use of machine learning algorithms, appropriate choice of imaging phases, independent validation of the results, as well as the transparency and reproducibility of scientific data. Hence, we advocate for future radiomics research to incorporate these considerations in their analysis and reports.

The meta-analysis involving all the included radiomics models indicated pooled sensitivity and specificity surpassing 0.80. As these models were developed in highly bespoke manners within retrospective cohorts, lacking consistent cut-off values across different models, we employed a summary receiver operating characteristic curve to assess the general prognostic performance, which yielded an AUC of 0.90, signifying robust discriminative capacity for post-TACE prognosis. As indicative parameters for clinical application, the pooled PLR and NLR were 6.13 and 0.20, respectively, with only three studies meeting the criteria of PLR > 10 and NLR < 0.1. Overall, the current radiomics models demonstrated moderate clinical informative value. Furthermore, with prior probability taken into consideration, the calculated post-test probabilities of objective response were 87% and 17% when positive and negative results were given by the radiomics models, respectively. Accordingly, radiomics tools could distinguish patients with markedly different expected outcomes after TACE, and thus hold considerable potential for informing TACE treatment decisions.

Significant heterogeneity was observed among the included studies, consistent with prior meta-analyses in the radiomics field [69,70,71,72] and with our expectations. Given the retrospective design of all studies, heterogeneity was inevitable [70]. Although similar and standard research workflows were adopted by most of the included studies, numerous methodological variables in the whole analysis process, from imaging data to the final predictive model, could contribute to substantial disparities among studies, which is an inherent limitation of highly customized models. Nevertheless, the combined statistical results of the current models are still of considerable value. In this study, we utilized the Galbraith plot to evaluate the impact of heterogeneity on the meta-analysis results, uncovering few extreme outliers and a generally symmetric impact. Furthermore, Deeks’ funnel plot and asymmetry test suggested no apparent publication bias.

Through meta-regression, we examined the sources of heterogeneity and observed that several crucial methodological factors, including imaging modalities, modeling algorithms, and the use of validation or training datasets, could be major contributors to inter-study heterogeneity. In subgroup analyses, results from testing sets among studies that adopted the same imaging method (MR), the same imaging phases (multi-phase imaging data), or the same type of modeling (ML-based modeling) showed better homogeneity. Moreover, the synthesized results in the subgroups did not deviate seriously from the combined results of all studies. Taken together, we believe that radiomics tools have great potential for predicting the treatment outcome of TACE, from both the research and clinical application perspectives.

When we analyzed the performance of different models in the validation datasets only (Table 3), CT-based radiomics models demonstrated comparable pooled sensitivity and superior specificity compared to MR models. Considering the lower time cost and expense of CT compared to MR, CT-based radiomics models may hold an advantage in practical clinical application. Interestingly, models using single-phase/sequence imaging data did not perform worse than multi-phase/sequence models in combined sensitivity and specificity. A few studies directly compared the predictive value of uniparametric and multi-parametric models [20, 30, 35, 42, 48], and there was no consistent evidence supporting joint analysis of multiple imaging phases or sequences. As pointed out by the METRICS guideline [60], multi-parametric analysis may unnecessarily increase the data dimensionality and the risk of overfitting. Therefore, future studies should either use simpler imaging data or demonstrate the added value of multi-parametric models. Notably, while many studies considered multiple phases, very few conducted voxel-to-voxel image matching across different phases [43, 49], and no study analyzed the evolution of feature values across different time points. Future research employing image alignment and feature matching to analyze the time evolution of features might expand the role of delta-radiomics [59] and fully leverage the potential of multi-phase imaging.

The choice of modeling algorithm has a decisive influence on model performance [74, 75]. In our subgroup analysis, ML-based models demonstrated inferior sensitivity and specificity compared to DL models. In addition, two studies directly compared DL and ML models and indicated the superior predictive power of DL models [33, 54]. However, it is noteworthy that the DL models in the included studies were reported with limited transparency, potentially restricting the reproducibility of their results. Overall, deep learning demonstrated significant potential for radiomics modeling, although its use in the included studies remains limited [32, 33, 39, 46, 54]. There is presently no consensus on the necessity of including peritumoral features. Neither our meta-regression nor the direct comparison by Song et al. [34] revealed a significant benefit of considering peritumoral regions. Nevertheless, in other tasks, such as predicting MVI [76] and postoperative recurrence [77] in HCC, analyzing both tumor and peritumoral regions proved to be more effective. Hence, whether incorporating peritumoral features improves model performance might be task-specific.

Finally, while our meta-analysis suggested potential advantages of some methods, there are currently no universally recognized optimal choices for these methodological options, and further validation is imperative. The ongoing exploration and standardization of radiomics research methods should lead to better predictive models for guiding decision-making and management in HCC patients undergoing TACE treatment.

Limitations and future perspectives

This study presents several limitations. Firstly, all included studies are retrospective in design. Despite conducting meta-regression and subgroup analyses, the sources of heterogeneity could not be fully elucidated. Therefore, caution is advised when directly applying the results of these studies due to potential concerns about generalizability. This limitation underscores the necessity for further methodological exploration and additional radiomics research to address these gaps. Secondly, not all included datasets met the ideal sample size criterion of ten times the number of model features, emphasizing the need for future studies to establish larger clinical cohorts or reduce the number of model features.
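
The sample size criterion mentioned above (at least ten samples per model feature) can be checked with a few lines of code. This is a minimal sketch: the function names and the example cohort of 120 patients with 15 features are hypothetical and are not drawn from any included study.

```python
def meets_sample_size_rule(n_samples: int, n_features: int, ratio: int = 10) -> bool:
    """Return True if the cohort has at least `ratio` samples per model feature."""
    if n_samples < 0 or n_features < 0 or ratio <= 0:
        raise ValueError("counts must be non-negative and ratio must be positive")
    return n_samples >= ratio * n_features


def max_features_allowed(n_samples: int, ratio: int = 10) -> int:
    """Largest number of model features the cohort supports under the rule."""
    if n_samples < 0 or ratio <= 0:
        raise ValueError("n_samples must be non-negative and ratio must be positive")
    return n_samples // ratio


# Hypothetical example: 120 patients and a 15-feature model.
print(meets_sample_size_rule(120, 15))  # 120 < 150, so the rule is not met
print(max_features_allowed(120))        # at most 12 features under the rule
```

A cohort that fails the check can satisfy the rule either by enrolling more patients or by reducing the number of model features, which mirrors the two remedies named above.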

Conclusion

Radiomics models demonstrated good accuracy and promising clinical utility in predicting prognosis after initial TACE treatment. Methodological factors such as imaging modalities, phases of imaging, and modeling algorithms are significant sources of heterogeneity across the literature.