Introduction

Oesophageal cancer presents a major burden worldwide [1]. A multimodal treatment approach (surgery with chemotherapy or chemoradiotherapy) is the standard of care following landmark trials [2,3,4] and offers the best chance of survival for resectable cancer. Despite this, outcomes remain poor for patients treated with curative intent, with mortality rates of 45–53% in the first year post diagnosis [5] and median post-progression survival of only 13 months [6]. There is therefore a need for greater personalisation of management, so that treatment can be avoided in patients who are unlikely to benefit substantially. In addition, with growing interest in total neoadjuvant therapy and ongoing trials of perioperative immunotherapy, better initial risk stratification of patients at diagnosis is needed to guide management.

Currently, the clinical TNM (tumour-node-metastasis) stage guides management [7], but it has low accuracy in predicting the contemporaneous pathological stage of early-stage cancers [8] and is limited as a prognostic tool. Prognostic information is of central importance for patient decision-making, with cancer patients ranking prognosis, diagnosis and treatment options as their highest information priorities [9].

Modelling studies have demonstrated scope to improve upon prognostication in oesophageal adenocarcinoma, including through CT imaging radiomic approaches [10,11,12]. Initial publications have highlighted the potential of different locoregional features. For example, Piazzese et al identified an association of CT grey-level zone distance variance with overall survival in a multicentre cohort (n = 213), which was independent of dimensionality and contrast administration [11], whilst Larue et al developed a random forest radiomic model including 40 CT features to predict 3-year overall survival (OS) in oesophageal cancer patients treated with chemoradiotherapy (n = 239) [12].

To date, clinical translation of imaging radiomic models has been hampered by a relative paucity of independent external validation. Thus, we aimed to validate a prognostic model for 3-year OS including prespecified clinical features and previously proposed CT radiomic features in a prospective multicentre setting for patients with primary oesophageal adenocarcinoma planned for curative treatment.

Methods

Participants and datasets

Following ethical approval, clinical data and CT imaging were obtained prospectively from five institutions participating in the OCCAMS (Oesophageal Cancer Clinical and Molecular Stratification) Consortium. Consecutive participants with non-metastatic, pathologically proven oesophageal adenocarcinoma who underwent staging contrast-enhanced CT imaging and were planned for definitive treatment were eligible. Participants were excluded if (1) no tumour was visible on CT; (2) CT images were unavailable or corrupted; or (3) a concurrent malignancy was present. Data from three institutions were used for model development and data from two institutions for model testing.

CT imaging and analysis

Contrast-enhanced CT was performed according to institutional practice and included arterial phase imaging of the thorax and upper abdomen. CT acquisition and reconstruction parameters for the training and test datasets are summarised in Supplementary Table 1. Pre-processing of CT images was undertaken as per Image Biomarker Standardisation Initiative recommendations [13]. CT slice thickness was linearly interpolated to 2 mm and attenuation values were converted to Hounsfield units with PyDICOM [14]. The primary tumour was segmented on the arterial phase thoraco-abdominal CT images by a radiologist (with 5 years’ experience) who was blinded to clinical outcomes. The arterial phase was chosen over the portal venous phase because prior studies showed better tumour conspicuity [15] and tumour staging accuracy [16] for the arterial phase. Tumour segmentations were reviewed and adjusted as required by a second radiologist (with > 20 years’ experience). Examples of tumour segmentations are shown in Supplementary Figs. 1 and 2. Three-dimensional (3D) radiomic features were extracted from the segmented volume of interest using PyRadiomics version 3.0.1 under default parameters (no image filters, no normalisation, no voxel array shift, grey-level discretisation at a fixed bin width of 25 Hounsfield units) [17].
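The pre-processing arithmetic described above can be illustrated with a minimal sketch. It is not the study pipeline: it uses plain Python rather than PyDICOM or PyRadiomics, the function names and input values are hypothetical, and the interpolation is shown in one dimension only. The Hounsfield conversion applies the DICOM rescale slope and intercept, and the discretisation follows the fixed bin width scheme (bin 1 holds the lowest value).

```python
import math


def to_hounsfield(stored_values, rescale_slope, rescale_intercept):
    """Convert stored CT values to Hounsfield units: HU = slope * value + intercept."""
    return [rescale_slope * v + rescale_intercept for v in stored_values]


def resample_linear(values, spacing_mm, new_spacing_mm=2.0):
    """Linearly interpolate equally spaced samples onto a new spacing (needs >= 2 samples)."""
    extent = spacing_mm * (len(values) - 1)
    n_new = int(math.floor(extent / new_spacing_mm + 1e-9)) + 1
    out = []
    for k in range(n_new):
        pos = k * new_spacing_mm / spacing_mm          # position in original sample units
        lo = min(int(math.floor(pos)), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def discretise_fixed_bin_width(hu_values, bin_width=25):
    """Bin HU values at a fixed width; bin 1 is the bin that contains the minimum value."""
    lowest_bin = math.floor(min(hu_values) / bin_width)
    return [math.floor(v / bin_width) - lowest_bin + 1 for v in hu_values]


# Hypothetical stored values for five voxels (not study data).
stored = [1024, 1049, 1074, 1100, 1160]
hu = to_hounsfield(stored, rescale_slope=1, rescale_intercept=-1024)
bins = discretise_fixed_bin_width(hu, bin_width=25)

# Hypothetical intensity profile sampled every 5 mm along the slice axis.
profile_2mm = resample_linear([0.0, 50.0, 100.0], spacing_mm=5.0, new_spacing_mm=2.0)
```

With a 25 HU bin width, voxels at 76 HU and 136 HU fall two bins apart, which shows how the bin width sets the grey-level resolution seen by the texture features.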

Radiomic feature selection

Following recommendations [18], radiomic features were pre-selected based on previous published studies. This obviated data-driven feature selection thereby reducing the risk of data overfitting. Published studies were evaluated using the ‘Transparent reporting of a multivariable prediction model for individual prognosis’ (TRIPOD) [19] and ‘Radiomics Quality Score’ checklists [20]. Identifiable Image Biomarker Standardisation Initiative features were sought [13].

Of the identified published studies [11, 12, 21, 22], features proposed by Piazzese et al were excluded as the interpolation strategy employed (2 mm isotropic) would have resulted in a significant reduction of the axial image resolution [11]. Features proposed by Larue et al were excluded as they evaluated nonlinear effects in a high-dimensional feature set using a random forest method, complicating the extraction of a few individually informative linear predictors [12]. Zhang et al achieved a high TRIPOD score of 30 [21] and examined a limited number of previously proposed predictors, each of which was identifiable; thus, these features (GLCM_Contrast, GLCM_Correlation and GLCM_InverseDifferenceMoment) were extracted. Furthermore, GLCM correlation was independently identified as a predictor of response by Klaasen et al [22]. Following Peduzzi and Concato’s guideline recommendation of more than 10 events per modelled feature [23], an unsupervised method was applied to identify the most suitable two of the three proposed features: the GLCM feature that was most collinear with tumour volume and with the other GLCM features was excluded.
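To make the three candidate features concrete, the sketch below computes contrast, correlation and inverse difference moment from a grey-level co-occurrence matrix (GLCM) on a toy image. It is an illustration under simplifying assumptions rather than the extraction used in the study: it works in 2D with a single horizontal offset, whereas PyRadiomics operates on the 3D volume of interest and aggregates over multiple directions, and it does not handle flat regions (where the correlation denominator is zero). The images and function names are hypothetical.

```python
import math


def glcm(image, n_levels, offset=(0, 1)):
    """Symmetric, normalised co-occurrence matrix for one pixel offset (grey levels 1..n_levels)."""
    dr, dc = offset
    counts = [[0.0] * n_levels for _ in range(n_levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c] - 1, image[r2][c2] - 1
                counts[i][j] += 1
                counts[j][i] += 1      # count each pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """Contrast, correlation and inverse difference moment of a normalised GLCM."""
    levels = range(1, len(p) + 1)
    cells = [(i, j, p[i - 1][j - 1]) for i in levels for j in levels]
    contrast = sum((i - j) ** 2 * pij for i, j, pij in cells)
    idm = sum(pij / (1 + (i - j) ** 2) for i, j, pij in cells)
    mu_x = sum(i * pij for i, j, pij in cells)
    mu_y = sum(j * pij for i, j, pij in cells)
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * pij for i, j, pij in cells))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * pij for i, j, pij in cells))
    correlation = (sum(i * j * pij for i, j, pij in cells) - mu_x * mu_y) / (sd_x * sd_y)
    return {"contrast": contrast, "correlation": correlation, "idm": idm}


# Two hypothetical 4 x 4 images with two grey levels.
blocky = [[1, 1, 2, 2]] * 4                      # neighbouring pixels mostly agree
checker = [[1, 2, 1, 2], [2, 1, 2, 1]] * 2       # neighbouring pixels always differ

blocky_features = glcm_features(glcm(blocky, n_levels=2))
checker_features = glcm_features(glcm(checker, n_levels=2))
```

In this toy case the checkerboard has higher contrast, lower inverse difference moment and negative correlation compared with the blocky image, which is consistent with contrast and inverse difference moment carrying largely overlapping information.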

Statistical analysis

Differences in participant characteristics between training and testing datasets were tested with Fisher’s exact test for categorical variables or the two-sided t-test for continuous variables. For modelling, unpenalised logistic regression models were fitted to predict 3-year OS using base R. With a median survival of 24 months following surgery alone and 46 months following neoadjuvant chemoradiotherapy and surgery, 3-year OS is a meaningful endpoint in oesophageal cancer and has been used in clinical trials assessing the efficacy of neoadjuvant treatment [24].

The following four models were fitted, using the following sets of features:

  1. ‘Stage’: overall TNM stage.

  2. ‘Clinical’: age, clinical T-stage and N-stage, as determined at tumour board review.

  3. ‘ClinVol’: age, clinical T-stage and N-stage, primary CT tumour volume.

  4. ‘ClinRad’: age, clinical T-stage and N-stage, primary tumour volume, two GLCM features.

Discrimination and calibration of 3-year OS were assessed using the area under the receiver operating characteristic curve (AUC) and Brier score, respectively, using the riskRegression R library [25]. Confidence intervals were estimated using the method of Blanche [26] and AUCs were compared using the DeLong test. Following Van Rossum [27], clinical utility was assessed by thresholding the model prediction in training data to maximise specificity whilst maintaining sensitivity of > 90%. Confidence intervals for sensitivity, specificity and accuracy were estimated with 1000 bootstrap resamples drawn with replacement.
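The evaluation metrics named above can be sketched in a few lines. This is an illustrative plain-Python version, not the riskRegression implementation, and it omits the Blanche confidence intervals and the DeLong comparison. It assumes that a label of 1 denotes the event of interest and that higher predicted probabilities indicate higher risk; the function names and toy data are hypothetical.

```python
def auc(labels, scores):
    """Chance that a random event case scores higher than a random non-event case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, scores):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called high risk."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(labels, scores, min_sensitivity=0.9):
    """Cut-off with the highest specificity among those keeping sensitivity above the target."""
    best = None
    for t in sorted(set(scores)):
        sens, spec = sens_spec(labels, scores, t)
        if sens > min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    return best          # (threshold, sensitivity, specificity), or None if no cut-off qualifies


# Hypothetical training outcomes and predicted probabilities (not study data).
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
p_train = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

train_auc = auc(y_train, p_train)
train_brier = brier(y_train, p_train)
threshold, train_sens, train_spec = pick_threshold(y_train, p_train)
```

The threshold is chosen on training data only and then applied unchanged to the test data, so test sensitivity can fall below the target, as occurred for some models in this study.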

To ensure absence of dataset-partitioning bias or institutional confounding, supplementary post hoc model evaluation was performed using each institution in turn for testing and the remaining four institutions for model fitting. Kaplan–Meier curves were plotted, grouping participants according to the target 90% sensitivity threshold fitted in training data.
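The per-institution evaluation described above amounts to a leave-one-institution-out split. A minimal sketch follows, assuming each participant record carries an institution label; the field name, institution codes and record layout are hypothetical.

```python
def leave_one_institution_out(records, key="institution"):
    """Yield (held_out, train, test), using each institution once as the test set."""
    for held_out in sorted({r[key] for r in records}):
        train = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        yield held_out, train, test


# Hypothetical records: two participants at each of five institutions.
records = [{"id": n, "institution": "ABCDE"[n % 5]} for n in range(10)]

folds = list(leave_one_institution_out(records))
```

Each fold would then be used to refit the model on the four remaining institutions and evaluate it on the held-out one.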

A post hoc analysis was also performed to estimate conditional dependencies between radiomic features and survival time, using data from training and testing cohorts. A partial Spearman correlation matrix was inferred using the de-sparsified graphical least absolute shrinkage and selection operator method via the SILGGM package and 95% confidence intervals were estimated via bootstrapping with 1000 replicates. Spearman correlation was also employed to assess volume confounding of radiomic features. Analysis of variance was employed to test radiomic stability with respect to scanner manufacturer and study institution.
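Spearman correlation is used in this study both to screen the GLCM features for collinearity and to assess volume confounding. The sketch below shows the rank-based calculation and one possible exclusion rule. The rule (largest summed absolute correlation with volume and the other features) is an assumption made for illustration, since the exact criterion is not stated here, and the data are hypothetical. Partial correlation and bootstrapped confidence intervals are not covered.

```python
import math


def ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def most_collinear(features, volume):
    """Feature with the largest summed |rho| against volume and the other features (assumed rule)."""
    def burden(name):
        others = [v for k, v in features.items() if k != name] + [volume]
        return sum(abs(spearman(features[name], o)) for o in others)
    return max(features, key=burden)


# Hypothetical values for six tumours (not study data).
volume = [10, 20, 30, 40, 50, 60]
features = {
    "contrast": [2.1, 3.0, 1.8, 2.6, 1.2, 1.5],
    "idm": [0.1, 0.2, 0.4, 0.3, 0.6, 0.5],
    "correlation": [0.3, 0.4, 0.6, 0.1, 0.2, 0.5],
}

excluded = most_collinear(features, volume)
```

In this toy example the inverse difference moment is excluded, mirroring the pattern reported in the Results, but the numbers themselves carry no evidential weight.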

Results

Participant and dataset characteristics

Of 210 participants recruited, 48 participants were excluded, generating a training set of 95 participants and a test set of 67 participants. The participant flowchart is provided in Fig. 1 showing the reasons for exclusions. Participant and dataset characteristics are shown in Table 1 and Supplementary Fig. 3.

Fig. 1 Study participant flowchart

Table 1 Participant characteristics in training and testing datasets

Performance of model variables

Spearman correlations of the prespecified predictors in the training data are shown in Fig. 2A. GLCM_InverseDifferenceMoment was correlated with tumour volume (r = +0.33) and highly correlated with GLCM_Contrast (r = −0.94). Hence, GLCM_Correlation and GLCM_Contrast were selected for modelling (Supplementary Material). Clinical T-stage was the most important prognosticator, with each model that included it assigning it a significant positive coefficient (representing increasing risk with increasing stage). The second most influential predictor was age, which was also assigned a significant positive coefficient in each model that included it. Tumour volume was not a significant predictor in either the ClinVol or the ClinRad model. GLCM_Correlation was assigned a marginal negative coefficient in the ClinRad model, and GLCM_Contrast was the least influential model predictor. Model coefficients, their standard errors and associated z-tests are reported in Table 2. Histograms of model predictions demonstrated that each model had similar distributions of predictions in training and testing (Fig. 2B). Radiomic features are visualised at voxel level in Fig. 3.

Fig. 2 Spearman correlations of predictor variables in training data (A) and histograms of model predictions in training and testing data (B)

Table 2 Summary of model coefficients

Fig. 3 Visualisation of CT images and voxel-level radiomic features (GLCM correlation and GLCM contrast) in four study participants. Participant A: 71-year-old female with a clinically staged T2 tumour [correlation 0.67, contrast 1.22]; participant B: 78-year-old female with a clinically staged T3 tumour [correlation 0.5, contrast 2.56]; participant C: 62-year-old male with a clinically staged T2 tumour [correlation 0.48, contrast 3.01]; participant D: 56-year-old male with a clinically staged T3 tumour [correlation 0.57, contrast 2.10]

Prediction of 3-year overall survival

The ClinRad model showed the best discrimination of 3-year OS, achieving similar performance in training (AUC, 0.71 [95% CI 0.60, 0.82]) and testing (AUC, 0.68 [95% CI 0.54, 0.81]) (Table 3). Test discrimination of the ClinRad model was greater than that of Stage alone (Δ AUC, 0.12 [95% CI 0.01, 0.23]; p = 0.04). Stage was the least discriminative model in both training (AUC, 0.60 [95% CI 0.49, 0.71]) and testing (AUC, 0.56 [95% CI 0.44, 0.67]).

Table 3 Model discrimination and calibration metrics with respect to 3-year overall survival

The Clinical model achieved similar test discrimination to the ClinRad model (AUC, 0.66 [95% CI 0.51, 0.80]; Δ AUC, 0.02 [95% CI −0.04, 0.08]; p > 0.05). The ClinVol model attained marginally lower test discrimination than the Clinical model (AUC, 0.65 [95% CI 0.50, 0.79]). All models yielded similar test calibration. Only the Stage and ClinRad models achieved the target 90% sensitivity in both training and testing. At this threshold, the ClinRad model yielded slightly higher specificity (0.19 [95% CI 0.06, 0.33]) than Stage (0.14 [95% CI 0.03, 0.26]).

Supplementary per-institution model testing results were consistent with the main external validation results (Supplementary Table 2). Kaplan–Meier curves are provided in Fig. 4. Survival statistics are provided in Supplementary Table 3. In the test data, risk groups assigned by the ClinRad model separated survival curves for the initial 3 years, and convergence was observed at 5 years. However, few participants were assigned to the high-risk group (9/67, 13%). Risk groupings assigned by the Clinical and ClinVol models achieved little separation of survival curves in testing. Post hoc partial correlation analysis indicated that, over both training and validation datasets, the most informative predictors of overall survival time were clinical N-stage (partial ρ, −0.15 [95% CI −0.33, 0.02]) and age (partial ρ, −0.14 [95% CI −0.3, 0.02]), although both confidence intervals included zero. The level of independent predictive information contributed by GLCM_Correlation (partial ρ, 0.10 [95% CI −0.05, 0.25]) was comparable to that of clinical T-stage (partial ρ, −0.12 [95% CI −0.26, 0.04]). Partial correlation analysis results are provided in Supplementary Table 4. Feature variability with respect to scanner manufacturer and study institution is also presented in Supplementary Material.

Fig. 4 Kaplan–Meier plot of survival in high- and low-risk groups according to each model score. Risk groups were defined according to the target 90% sensitivity threshold fitted in training data

Discussion

The ability to provide individualised risk–benefit analysis would help to optimise management decisions in potentially resectable oesophageal cancer. Improving prognostication is a step in this direction as prognosis influences treatment decisions made by doctors and patients. Surgery may have an impact on quality of life for up to 12 months post treatment [28] and patients with a poorer prognosis may not fully benefit from a multimodal approach. Initial imaging studies have suggested that radiomic features may have additive prognostic value [11, 12, 18, 21]. However, prespecified models have to demonstrate reliable prediction in external datasets without local refitting. Accordingly, studies need to transition to the evaluation of previously proposed predictors and models, rather than continuing to fit new models with many degrees of freedom to new clinical data [18].

In this prospective multicentre study, we have demonstrated that a multivariable clinicoradiomic prognostic model (ClinRad) incorporating previously identified CT features improved discrimination of 3-year OS compared to TNM staging, with an AUC of 0.68 in the test dataset, but offered similar calibration. The Clinical model had similar performance to the ClinRad model, with an AUC of 0.66. Both Clinical and ClinRad models retained discriminative capacity between training and testing, though calibration deteriorated, suggesting a distributional mean shift between institutions.

Our findings are concordant with previously published data of Larue et al [12], where the high-dimensional random forest radiomic model with other features achieved AUCs of 0.69 and 0.61 in training and testing, respectively. The direction of radiomic coefficients fitted in this study is consistent with previously published results by Zhang et al [21], who observed increasing GLCM_Correlation in patients with oesophageal adenocarcinoma who responded to chemoradiotherapy. In our model, low GLCM_Correlation was an adverse prognosticator. Our finding that GLCM correlation was the most informative radiomic predictor also concurs with Klaasen et al [22]. However, as Klaasen’s model employed a random forest architecture, the directional concordance of results could not be verified.

Zhang et al [21] also observed decreasing GLCM contrast in chemoradiotherapy responders. In our study, GLCM contrast did not affect model predictions substantially, indicating that any prognostic information it encoded was already provided by the other clinical and image-based predictors in the model.

An advantage of our study is that it incorporated multicentre prospective data, thereby providing realistic conditions for the estimation of model informativeness and generalisability. The imaging equipment, protocols and acquisition parameters reflected typical clinical practice and the variation between institutions that a radiomic model must be able to accommodate. We noted that GLCM correlation varied according to institution and GLCM contrast according to scanner manufacturer. This variability introduces noise, which can complicate modelling of the underlying prognostic signal. Clinical deployment of radiomic models requires either that this noise is accommodated or that clinical imaging protocols adapt to acquire images under more standardised conditions.

Model validation was performed in test data from two institutions which were unobserved during model development, yielding a realistic estimate of model generalisability in our healthcare system. However, our study had limitations. First, manual segmentation, especially of early-stage cancers, is subject to intra-reader and inter-reader variability [29]. Second, radiomic approaches are not typically well suited to the identification of new imaging biomarkers, due to the low ratio of events to evaluated variables [18, 20]. It is noteworthy that the ClinRad model fitted here is simpler than that of Larue, whilst matching its training performance and marginally improving upon its generalisation [12]. However, a necessary cost of this study design is that other informative radiomic features may have been omitted. Third, although both the ClinRad model and TNM staging demonstrated 90% sensitivity, the low specificity achieved at this threshold is a limitation. Fourth, the improvement in performance between the Clinical and ClinRad models is small and unlikely to change clinical management substantially. Finally, the logistic regression models employed in this analysis were insensitive to nonlinear and nonmonotonic effects.

In conclusion, we have confirmed in a prospective multicentre dataset that previously proposed GLCM features (correlation and contrast) contain incremental prognostic information. The clinicoradiomic model incorporating GLCM correlation and contrast with tumour and nodal stage, age and volume outperformed TNM stage alone in the discrimination of 3-year overall survival. Nevertheless, the level of discrimination remained modest, and it is questionable whether this would substantially affect management.