Introduction

Resection is the preferred treatment for colorectal liver metastases (CRLM), but not all metastases or patients are eligible for it. An alternative and complementary strategy is thermal ablation, including microwave ablation (MWA) and radiofrequency ablation (RFA) [1, 2]. After thermal ablation of CRLM, local tumour progression (LTP) rates of 6–46% have been reported [3,4,5,6,7,8]. LTP is defined as the recurrence of tumour foci at the edge of the ablation zone after initial follow-up imaging showing adequate ablation [9, 10]. The detection of LTP can be challenging since post-ablation effects and recurrent disease have comparable densities on contrast-enhanced (ce) CT [11]. As a result, a sensitivity of 53% has been reported for ceCT in the detection of LTP [4]. Detecting LTP may therefore require imaging at multiple subsequent time points, which delays both the detection and the treatment of LTP.

To overcome this delay, we recently performed a study to predict LTP in CRLM with the use of radiomics of the post-ablation CT images [12]. If the prediction of LTP is successful, patients with a high risk for LTP can undergo complementary treatment without delay, and a de-intensified follow-up schedule can be considered for low-risk patients. In the previously published original study, we developed and compared three prediction models: one based on clinical parameters, one based on radiomics features of both the ablation zone (AZ) and the peri-ablational rim (PAR), and one combining clinical and radiomics parameters. The combined clinical-radiomics model yielded the highest performance with a concordance (c-) statistic of 0.78 (95% confidence interval (95%CI) 0.65–0.87). The performances were obtained with leave-one-out cross-validation (LOOCV), i.e., the models were not validated on independent patient cohorts. To evaluate whether results can be applied to other populations, external validation is crucial [13]. Therefore, the aim of the current study is to validate the clinical-radiomics prediction models from the original study to predict LTP after thermal ablation of CRLM using both independent internal and external validation cohorts.

Material and methods

Patient selection

This multicentre retrospective study was approved by the Institutional Review Board of both institutions (IRBd18.066/MEC-2019–0850), and informed consent was waived. A data license agreement was established to transfer all data to the primary research centre. For the internal validation cohort, medical records were reviewed from April 2018 until August 2021 in the same institution (The Netherlands Cancer Institute Amsterdam) where the original study was performed. In the original study, patients were included up until April 2018. For the external validation cohort, medical records were searched from January 2007 until October 2019 in the second institution (Erasmus Medical Centre Rotterdam).

The patient selection process was in line with the original study in order to select a comparable patient cohort. The original inclusion criteria comprised (1) patients successfully treated with thermal ablation for CRLM; (2) histopathological confirmation of CRLM; and (3) portal venous phase (PVP) CT available 2–8 weeks after ablation. The exclusion criteria were (1) < 6 months of follow-up without LTP; (2) > 5 CRLM; (3) unclear origin of liver metastases; (4) ablated CRLM of size > 3 cm; (5) history of diffuse liver disease; (6) history of liver treatment which could affect the parenchyma (such as stereotactic body radiation therapy (SBRT), portal vein embolisation (PVE), transarterial chemoembolisation (TACE)); (7) incomplete ablation (including residual disease, ablation margins < 5 mm and re-ablations); (8) missing clinical data (e.g. no pre-ablation imaging available); and (9) delineation problems including artefacts, air or abscess within the AZ and insufficient scan quality. Due to a relatively short inclusion period compared to the external and original cohorts, the number of eligible patients for the internal validation was small. Hence, to increase the sample size for the internal cohort, the exclusion criterion ‘> 5 CRLM’ was changed into ‘> 5 CRLM ablated’. This adjustment was assumed not to influence the results, because the AZ texture is not expected to correlate with the number of CRLM present in one liver. A flowchart of the patient selection process is depicted in Fig. 1. Patient characteristics were collected from the medical records and are presented per cohort in Table 1.

Fig. 1 Flowchart of the patient selection process

Table 1 Patient and lesion characteristics

Ablation procedures

Ablation procedures were performed either percutaneously under CT or ultrasound guidance or open, guided by intraoperative ultrasound. All percutaneous ablations were performed by an interventional radiologist under sedation analgesia, epidural, or general anaesthesia. The open ablations were performed under general anaesthesia by a liver surgeon, either with or without the assistance of an interventional radiologist. The choice between RFA and MWA was based on availability and the physician’s preference. Three different systems were used for RFA: the Cool-tip™ RF Ablation System E Series (Medtronic), the StarBurst® Radiofrequency Ablation system (AngioDynamics), and the AMICA Microwave and RF system (HS Hospital Service). For MWA, the NeuWave™ Microwave Ablation System of Ethicon (Johnson&Johnson), the Emprint™ Ablation System with Thermosphere™ Technology (Medtronic), and the AMICA Microwave and RF system (HS Hospital Service) were used. Procedures were carried out in accordance with the CIRSE Standards of Practice [10].

CT image acquisition

Contrast-enhanced CT images were acquired on a total of 19 different CT scanners. Intravenous contrast was injected at a rate of 3 ml/s followed by a 30 ml saline flush. Both bolus triggering software and fixed delay times (70 s post-injection for PVP) were used, depending on the CT scanner. Detailed information on scanning parameters is displayed in Table 2.

Table 2 Scanning parameters

Standard of reference to establish LTP

LTP was defined as any new tumour focus occurring within 10 mm of the AZ on follow-up imaging within 24 months after thermal ablation [9]. Lesions were categorised as no LTP if the patient developed (1) no new CRLM; (2) new CRLM > 10 mm distance to the AZ; or (3) new CRLM within 10 mm of the AZ after > 24 months. Follow-up imaging consisted of regular follow-up ceCT, scheduled every 3 months in the first year and every 6 months thereafter until 5 years after ablation. In case of doubt, magnetic resonance imaging (MRI) or positron emission tomography (PET)-CT was used as a problem-solver. All liver imaging until the end of follow-up was checked for disease progression.
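The labelling rule above can be written as a short decision function. This is only an illustrative sketch: the `NewLesion` record and the `is_ltp` function are hypothetical names introduced here, and the sketch assumes that the boundary values (exactly 10 mm, exactly 24 months) count as LTP, which follows from the "> 10 mm" and "> 24 months" wording of the no-LTP categories.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DISTANCE_MM = 10.0  # vicinity of the ablation zone (AZ), from the definition above
MAX_MONTHS = 24.0       # detection window after ablation, from the definition above


@dataclass
class NewLesion:
    """Hypothetical record of a new tumour focus seen on follow-up imaging."""
    distance_to_az_mm: float      # shortest distance from the focus to the AZ
    months_after_ablation: float  # time of detection, counted from the ablation


def is_ltp(lesion: Optional[NewLesion]) -> bool:
    """Return True only for a new focus within 10 mm of the AZ and within 24 months."""
    if lesion is None:                               # (1) no new CRLM
        return False
    if lesion.distance_to_az_mm > MAX_DISTANCE_MM:   # (2) new CRLM too far from the AZ
        return False
    if lesion.months_after_ablation > MAX_MONTHS:    # (3) new CRLM detected too late
        return False
    return True
```

For example, `is_ltp(NewLesion(5.0, 8.0))` is labelled as LTP, whereas a focus at 12 mm, or one detected after 30 months, is not.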

Delineation and radiomics features

The manual delineations, pre-processing steps, and features extraction process were similar to the original study [12]. An example of the delineations is displayed in Fig. 2.

Fig. 2 Delineation example. Post-ablation ceCT images of (a) the ablation zone (arrow), (b) the delineation of the ablation zone, and (c) the peri-ablational rim with the exclusion of the needle track (<) and large vessels (*)

Prediction models and analysis

Baseline patient characteristics were compared between the cohorts using the Kruskal–Wallis test and the chi-square test. p values ≤ 0.05 were considered statistically significant. The included features per model are presented in Table 3. For the two validation cohorts, the discriminative power of all three models was assessed using the c-statistic. ComBat harmonisation was applied to the radiomics features to harmonise between the three cohorts [14]. All statistical analyses were performed using RStudio software v1.4.1103. To assess the quality of this study, the Radiomics Quality Score (RQS) was calculated [15]. The methods of this study and the original study are schematically presented in Fig. 3.
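As an illustration of what the c-statistic measures, the sketch below computes it for a binary outcome as the fraction of (LTP, no-LTP) lesion pairs in which the LTP lesion received the higher predicted risk, counting ties as half. It is written in Python for illustration only (the study's analyses were run in RStudio), it assumes LTP is treated as a binary outcome per lesion rather than as a time-to-event outcome, and `c_statistic` is a hypothetical helper name.

```python
from typing import Sequence


def c_statistic(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Concordance for a binary outcome (1 = LTP, 0 = no LTP).

    Every lesion with LTP is paired with every lesion without LTP; a pair is
    concordant when the LTP lesion has the higher score, and a tie counts 0.5.
    """
    with_ltp = [s for s, y in zip(scores, outcomes) if y == 1]
    without_ltp = [s for s, y in zip(scores, outcomes) if y == 0]
    if not with_ltp or not without_ltp:
        raise ValueError("need at least one lesion with and one without LTP")

    concordant = 0.0
    for p in with_ltp:
        for n in without_ltp:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(with_ltp) * len(without_ltp))
```

A value of 0.5 corresponds to a model that ranks lesions no better than chance, and 1.0 to perfect ranking. For instance, `c_statistic([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])` gives 0.75, because three of the four pairs are ordered correctly.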

Table 3 Included features per model

Fig. 3 Methodology. Schematic presentation of the methodology of the current study (right) and the original study (left)

Results

Patient and lesion characteristics

The internal validation cohort included 68 CRLM in 39 patients. LTP was found in 11/68 CRLM (16%). The median time to LTP was 8 months (range 2–22), and the median follow-up for CRLM without LTP was 25 months (range 8–50). The external cohort comprised 78 CRLM in 52 patients. Twenty-three out of 78 CRLM (29%) developed LTP with a median time to LTP of 10 months (range 2–22). The CRLM without LTP had a median follow-up of 29 months (range 6–139). The median ablation to CT interval was 31 days (range 14–50, IQR 24–44 days) and 42 days (range 14–56, IQR 20–48 days) for the internal and external cohort, respectively. Patient and lesion characteristics were similar in terms of sex, primary tumour characteristics, and chemotherapy treatment. A higher mean age (66 vs 61 and 63 years) was found in the external validation cohort (p = 0.047). Larger CRLM were ablated (p = 0.047) in the original cohort (18 ± 6 mm), compared to the internal (11 ± 7 mm) and external cohorts (13 ± 7 mm). Significantly more metachronous metastases were included in the validation cohorts compared to the original cohort (21 and 23% vs 45%, p < 0.01). Lastly, all CRLM (100%) were treated with MWA in the internal cohort, while the majority were treated with RFA in the original and external cohorts (80% and 87%, respectively, p < 0.01).

Model performance

For the internal validation cohort, a c-statistic of 0.47 (95%CI 0.30–0.64) was found for the combined model. The radiomics model showed a c-statistic of 0.46 (95%CI 0.29–0.63) and the clinical model 0.51 (95%CI 0.34–0.68). In external validation, the combined model yielded a c-statistic of 0.50 (95%CI 0.38–0.62), the radiomics model 0.40 (95%CI 0.28–0.52), and the clinical model 0.51 (95%CI 0.39–0.63). ComBat harmonisation yielded no improvement in the combined or radiomics models. Results are presented in Table 4. This study reached an RQS of 50%. The distribution of RQS points is displayed in Supplementary Table 1.

Table 4 Model performances

Discussion

This study evaluated the reproducibility of three previously published clinical-radiomics models to predict LTP after thermal ablation of CRLM. The models were validated in an independent internal and an independent external validation cohort, and performance was poor (c-statistics 0.40–0.51). The poor validation performance is most probably explained by overfitting: the models were fitted too closely to the training data and probably (also) used image noise or random fluctuations instead of true differences between the studied groups [16, 17]. In the original study, LOOCV was applied after model development. However, this tests the fit to the training data rather than the quality of the model, which can result in an overoptimistic estimate of the performance [18].
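The difference between validating a finished model and validating the development procedure can be sketched as follows. This is a toy illustration under stated assumptions, not the pipeline of the original study: the "development" step here simply picks the single feature with the largest gap between class means, and `loocv_predictions` and `fit_best_single_feature` are hypothetical names. The point is only that every data-driven step is repeated inside each fold, so the held-out lesion never informs the model that scores it.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Model = Callable[[Row], float]


def fit_best_single_feature(rows: List[Row], labels: List[int]) -> Model:
    """Toy development step: select the feature with the largest class-mean gap."""
    best_k, best_gap = 0, 0.0
    for k in range(len(rows[0])):
        pos = [r[k] for r, y in zip(rows, labels) if y == 1]
        neg = [r[k] for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:
            raise ValueError("each training fold needs both outcome classes")
        gap = sum(pos) / len(pos) - sum(neg) / len(neg)
        if abs(gap) > abs(best_gap):
            best_k, best_gap = k, gap
    sign = 1.0 if best_gap >= 0 else -1.0
    # The score is the selected feature, oriented so that higher means more risk.
    return lambda row: sign * row[best_k]


def loocv_predictions(
    rows: List[Row],
    labels: List[int],
    fit: Callable[[List[Row], List[int]], Model],
) -> List[float]:
    """Leave-one-out: refit (including feature selection) without lesion i, then score it."""
    predictions = []
    for i in range(len(rows)):
        train_rows = [r for j, r in enumerate(rows) if j != i]
        train_labels = [y for j, y in enumerate(labels) if j != i]
        model = fit(train_rows, train_labels)  # selection and fitting happen per fold
        predictions.append(model(rows[i]))
    return predictions
```

If features are instead selected once on all lesions and only the final fit is cross-validated, the held-out lesion has already influenced the selection, which is one way an apparently validated estimate can remain optimistic.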

We hypothesise our radiomics models overfitted on image noise caused by acquisition differences. Multiple studies show that acquisition parameters affect the values of the radiomics features [19,20,21,22,23]. Our cohorts were heterogeneous in terms of CT acquisition parameters, with 19 different CT scanners involved in validation and 5 scanners in the original study. In an attempt to account for the variability between scanners, we applied ComBat harmonisation to the three cohorts. The features were only marginally adjusted without a relevant effect on the performance, possibly because each batch already included multiple scanners. Preferably, the radiomics features would have been harmonised per CT scanner, but the number of patients allocated per batch was insufficient to allow for such harmonisation. Other acquisition differences were less likely to contribute to the low validation performance, such as the difference in iodine concentration per contrast agent or the tube current and voltage [23]. The differences in slice thickness were corrected by image resampling. Furthermore, additional steps, such as testing the intra-observer correlation of the segmentations or harmonising the features across scanners, could have been undertaken to enhance the reproducibility during model development.
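To illustrate why batch size matters for harmonisation, the sketch below applies a plain per-batch location-scale adjustment to a single feature. It is deliberately not ComBat: ComBat additionally uses empirical Bayes shrinkage of the batch parameters and can preserve covariates, neither of which is shown here. The function name `location_scale_harmonise` and the batch labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def location_scale_harmonise(
    values: Sequence[float], batches: Sequence[str]
) -> List[float]:
    """Align each batch's mean and spread of one feature to the pooled data.

    Simplified stand-in for harmonisation: z-score within the batch, then
    rescale to the pooled mean and standard deviation.
    """
    pooled_mean = mean(values)
    pooled_sd = pstdev(values)

    by_batch: Dict[str, List[float]] = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    stats = {b: (mean(vs), pstdev(vs)) for b, vs in by_batch.items()}

    harmonised = []
    for v, b in zip(values, batches):
        batch_mean, batch_sd = stats[b]
        # A batch with one lesion (or no spread) has no usable scale estimate,
        # so every value in it collapses onto the pooled mean.
        z = (v - batch_mean) / batch_sd if batch_sd > 0 else 0.0
        harmonised.append(pooled_mean + pooled_sd * z)
    return harmonised
```

With a batch defined per cohort, each batch mixes several scanners, so scanner-specific shifts within a cohort are left untouched. With a batch defined per scanner, the mean and spread must be estimated from very few lesions, and in the extreme case of a single lesion all feature information is lost.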

Clinical heterogeneity between the cohorts might have contributed to the failure of the clinical model in validation. Despite the similar selection methodology, differences may have occurred due to (1) variations in hospital protocols and (2) adjustments over time due to treatment and scanner development. Both centres follow the Dutch clinical guidelines on the treatment of CRLM, but hospital variation still occurs [24]. In particular, the eligibility of patients for thermal ablation based on ‘CRLM size’ and ‘number of CRLM ablated’ has evolved over the years. The use of MWA has increased rapidly in recent years, which resulted in technique differences between the cohorts. However, we do not think this is the reason for the low validation performance since the original study showed that the ablation technique did not significantly influence the radiomics features [12]. Moreover, two out of three parameters in the clinical model were ‘patient-specific’ (adjuvant chemotherapy and T-stage), while LTP is a ‘lesion-specific’ outcome. A study exploring the risk factors for LTP found that only ‘lesion-specific’ parameters were associated with LTP, and none of the ‘patient-specific’ parameters investigated were predictive for LTP [25]. This raises the question of how robust ‘patient-specific’ characteristics can be for the prediction of a ‘lesion-specific’ outcome.

Our study has several limitations. Firstly, the study design was retrospective and included a relatively small sample. Secondly, the LTP rates in our study were relatively high, which could be attributed to the long inclusion period, considering LTP rates were higher 15 years ago. The diagnosis of LTP was based on imaging, and the absence of histopathological evaluation could be considered a limitation, but it resembles how LTP is detected in clinical practice. Next, the minimum follow-up period of 6 months might have resulted in a small subset of patients being allocated to the wrong outcome group, given the median time to LTP of 8 months. Lastly, an arbitrary cut-off of 24 months was applied for the detection of LTP, as LTP after 24 months is rare and possibly involves new metastases rather than residual tumour clusters.

Due to the risk of overfitting in the original models, we cannot draw any conclusions on the feasibility of LTP prediction based on CT radiomics. This study emphasises the need to assess the reproducibility of radiomics prediction models in independent patient cohorts. It underlines that no definite conclusions can be drawn from studies without proper internal and external validation. Future research aiming to explore radiomics in a similar setting should strive to minimise heterogeneity between and within patient cohorts, both in terms of clinical differences and imaging acquisition.