Introduction

Locally advanced rectal cancer (LARC) is typically treated with neoadjuvant chemoradiotherapy (CRT) followed by surgery [1]. In 15–27% of cases, CRT results in a complete tumor remission [2]. This has contributed to the recent paradigm shift in rectal cancer treatment towards organ preservation (e.g., “watch-and-wait” or local treatment of small tumor remnants) for selected patients with clinical evidence of a very good or complete tumor response after CRT. With these organ-preservation approaches, the morbidity and mortality risks associated with major surgery are avoided, with good reported clinical outcomes in terms of quality of life and overall survival [3, 4]. Predicting the response to CRT, and thus the chance of achieving organ preservation, before the start of treatment (i.e., at baseline) may open up new possibilities to further personalize neoadjuvant treatment strategies depending on the anticipated treatment benefit, particularly for smaller tumors that do not necessarily require CRT for oncological reasons.

Recent studies have suggested a possible role for imaging in this setting [5,6,7,8,9]. Promising results for predicting rectal tumor response from baseline MRI have been reported for clinical staging variables (MRI-based TN-stage) [6, 7], tumor volume [10,11,12], and functional parameters derived from diffusion-weighted imaging (DWI) [8, 9] or dynamic contrast-enhanced MRI (DCE) [13], and more recently also for more advanced quantitative variables derived using modern post-processing tools such as radiomics [5]. However, the available evidence mainly comes from single-center studies, and comprehensive multicenter studies incorporating clinical, functional, and advanced quantitative imaging data are scarce [14, 15]. Moreover, the effects of multicenter data variations and of diagnostic staging differences between observers have so far remained largely uninvestigated. Prediction studies on larger multicenter patient cohorts, with imaging data acquired and analyzed as part of everyday clinical routine, are therefore urgently needed to develop a more realistic view of the potential role of image-based treatment prediction models in general clinical practice.

In this retrospective multicenter study, we therefore set out to develop and validate a model to predict response to neoadjuvant treatment in rectal cancer using rectal MRIs acquired for baseline staging in 9 different centers in the Netherlands, intended to be a representative sample of rectal imaging performed in everyday clinical practice.

Materials and methods

Patients

As part of an institutional review board-approved multicenter study project, the clinical and imaging data of 670 LARC patients undergoing standard-of-care neoadjuvant chemoradiotherapy between February 2008 and March 2018 were retrospectively collected from 9 study centers (1 university hospital, 7 large teaching hospitals, and 1 comprehensive cancer center). Patients were identified based on the following inclusion criteria: (a) biopsy-proven rectal adenocarcinoma, (b) non-metastasized disease, (c) availability of a pre-treatment MRI (including at least T2-weighted (T2W) sequences in multiple planes and an axial DWI sequence) with a corresponding radiological staging report, (d) long-course neoadjuvant treatment consisting of radiotherapy (total dose 50.0–50.4 Gy) with concurrent capecitabine-based chemotherapy, and (e) final treatment consisting of surgery or watch-and-wait with >2 years of clinical follow-up to establish a reliable final response to CRT. From this initial cohort, 161 patients were excluded for reasons detailed in Fig. 1, leaving a total study population of n=509. Due to the retrospective nature of this study, the requirement for informed consent was waived.

Fig. 1

In- and exclusion flowchart. Note that mucinous tumors were excluded because they are known to exhibit distinctly different signal characteristics on both T2W-MRI and DWI

Imaging and image pre-processing

MRIs were acquired according to routine practice in the participating centers, with substantial variations in scan protocols and corresponding image quality between and within centers (Fig. 2); images were acquired using 25 different scanners (19 at 1.5 T; 6 at 3.0 T) and a total of 112 unique T2W and 94 unique DWI protocols. Further parameters are summarized in Supplementary Materials A. From the source DW images we calculated the apparent diffusion coefficient (ADC) maps using a mono-exponential fit over all available b-values (varying from 2 to 7 b-values per sequence, ranging between b0 and b2000). ADC values <0 or >3 standard deviations from the tumor mean were marked as invalid. Since T2W pixel values are represented on an arbitrary scale, these images were normalized to mean=0 and standard deviation=100 [16]. All images were resampled to a common isotropic voxel spacing of 2 mm × 2 mm × 2 mm, using a linear interpolator for the DWI and ADC maps (linear interpolation was chosen to prevent the out-of-range intensities that may occur due to overshoot with higher-order interpolation) and a B-spline interpolator for the T2W images.
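
To make these pre-processing steps concrete, the sketch below outlines a possible implementation of the mono-exponential ADC fit, the T2W intensity normalization, and the isotropic resampling. It is an illustrative sketch rather than the pipeline actually used in this study; it assumes NumPy arrays and SimpleITK image objects, and all function names are hypothetical.

```python
import numpy as np
import SimpleITK as sitk

def fit_adc(dwi_stack, b_values):
    """Mono-exponential ADC fit: S(b) = S0 * exp(-b * ADC).

    dwi_stack: array of shape (n_b, z, y, x), one DWI volume per b-value.
    Returns an ADC map with the same spatial shape.
    """
    b = np.asarray(b_values, dtype=float)
    signal = np.clip(dwi_stack.astype(float), 1e-6, None)   # avoid log(0)
    log_s = np.log(signal).reshape(len(b), -1)
    design = np.stack([b, np.ones_like(b)], axis=1)          # ln(S) = -ADC*b + ln(S0)
    coeffs, *_ = np.linalg.lstsq(design, log_s, rcond=None)  # least-squares fit per voxel
    return -coeffs[0].reshape(dwi_stack.shape[1:])

def normalize_t2w(t2w_array):
    """Rescale arbitrary T2W intensities to mean 0 and standard deviation 100."""
    return (t2w_array - t2w_array.mean()) / t2w_array.std() * 100.0

def resample_isotropic(image, spacing=(2.0, 2.0, 2.0), interpolator=sitk.sitkLinear):
    """Resample a SimpleITK image to isotropic 2 mm spacing.

    Use sitk.sitkLinear for DWI/ADC (no overshoot) and sitk.sitkBSpline for T2W.
    """
    original_spacing = image.GetSpacing()
    original_size = image.GetSize()
    new_size = [int(round(sz * osp / nsp))
                for sz, osp, nsp in zip(original_size, original_spacing, spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(), interpolator,
                         image.GetOrigin(), spacing, image.GetDirection(),
                         0.0, image.GetPixelID())
```

Linear interpolation for the DWI/ADC maps mirrors the rationale stated above (no overshoot beyond the original intensity range), while B-spline interpolation better preserves anatomical detail in the T2W images.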

Fig. 2

Examples illustrating differences in image quality and acquisition for T2W-MRI (a–d) and DWI (e–h) between centers, related to, for example, field-of-view, tissue contrast (e.g., TR/TE settings), image resolution, and noise. For the DWI scans, the highest acquired b-values shown in these examples were b1000 (e), b600 (f), b800 (g), and b1000 (h)

Image evaluation

Baseline staging variables (cT-stage (cT1-2, cT3, cT4), cN-stage (cN0, cN1, cN2), and involvement of the mesorectal fascia (MRF)) were derived from the original staging reports, which had been produced by a multitude of readers. In addition, all MRIs were retrospectively re-evaluated for the purpose of this study by a dedicated radiologist (DMJL, with >10 years’ experience in reading rectal MRI), who staged all cases in line with the latest staging guidelines and reporting template of the European Society of Gastrointestinal and Abdominal Radiology [17]. For quantitative analysis, tumors were segmented using 3D Slicer (version 4.10.2). Segmentations were generated semi-automatically using a level-tracing algorithm applied to the high b-value DWI and were then manually adjusted by an expert radiologist (DMJL, the same reader who also staged the cases), taking into account the anatomical information from the corresponding T2W-MRI. Care was taken to include only tumor tissue, excluding the rectal lumen and any non-tumoral perirectal tissues. Segmentations were then copied to the ADC map and T2W-MRI, after which tumor volume and other quantitative features were extracted with PyRadiomics (version 3.0) using a bin width of 5 (T2W-MRI) and 5 × 10⁻⁵ (ADC). These bin widths were chosen such that the number of histogram bins was between 30 and 130 [16]. Quantitative features were limited to tumor volume and first-order features, as these have previously been reported to be most reproducible [19,20,21,22,23] and least dependent on acquisition differences between centers [18].
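
As an illustration of this feature extraction step, the snippet below shows how volume and first-order features could be extracted with PyRadiomics using the bin widths mentioned above. The file names are hypothetical and this is a simplified sketch, not the exact extraction script used in the study.

```python
from radiomics import featureextractor

def extract_first_order(image_path, mask_path, bin_width):
    """Extract tumor volume (shape class) and first-order features with PyRadiomics.

    bin_width: 5 for T2W-MRI, 5e-5 for ADC maps (to obtain roughly 30-130 histogram bins).
    """
    extractor = featureextractor.RadiomicsFeatureExtractor(binWidth=bin_width)
    extractor.disableAllFeatures()
    extractor.enableFeatureClassByName('shape')       # includes tumor volume
    extractor.enableFeatureClassByName('firstorder')  # mean, entropy, percentiles, etc.
    result = extractor.execute(image_path, mask_path)
    # keep only feature values, dropping PyRadiomics' diagnostic metadata entries
    return {k: v for k, v in result.items() if not k.startswith('diagnostics')}

# hypothetical file names for one patient:
t2w_features = extract_first_order('t2w.nii.gz', 'tumor_mask.nii.gz', bin_width=5)
adc_features = extract_first_order('adc.nii.gz', 'tumor_mask.nii.gz', bin_width=5e-5)
```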

Variable definitions

Five distinct variable categories were defined:

  (1) Non-imaging variables, including age, sex, basic tumor descriptors from clinical examination and endoscopy (tumor location and basic tumor morphology, e.g., polyp/circular), and the time interval between neoadjuvant CRT and surgery.

  (2) Basic image-based staging variables:

    • (2a) derived from the original staging reports: cT-stage (cT1-2, cT3, cT4), cN-stage (cN0, cN1, cN2), and MRF involvement, as routinely available in these reports.

    • (2b) the same descriptors as in 2a, but derived from the expert re-evaluations.

  (3) Advanced image-based staging variables, including advanced staging descriptors (tumor length, cT-substage (cT1-2; cT3a,b,c,d; cT4a,b), depth of extramural invasion, and extramural vascular invasion (EMVI)) that were not routinely available from the original staging reports and were derived from the expert re-evaluations.

  (4) Quantitative imaging features, including tumor volume (extracted directly from the whole-tumor segmentations) and the following first-order features (derived from the pixel values within the tumor on both T2W-MRI and ADC): mean, median, minimum, maximum, variance, mean absolute deviation, range, robust mean absolute deviation, root mean squared, 10th percentile, 90th percentile, energy, entropy, interquartile range, kurtosis, skewness, total energy, and uniformity.

These five variable categories were combined into eight variable sets for the statistical analysis, as detailed in Table 1.

Table 1 Variable category definition and variable sets

Response outcome

Final treatment response was defined in two ways [8, 24, 25]:

  • Complete response (CR) versus incomplete response: CR was defined as a pathological complete response after surgery (pCR; ypT0N0) or a sustained clinical complete response (cCR) without evidence of recurrence on repeated follow-up MRI and endoscopy for >2 years in patients undergoing watch-and-wait. Patients with ypT1-4 disease after surgery were classified as incomplete responders.

  • Good response (GR) versus poor response: GR included all patients with a Mandard tumor regression grade (TRG) of 1–2 (total and subtotal regression); patients with a TRG of 3–5 (moderate, limited, and no regression) were classified as poor responders. Patients with a sustained cCR for >2 years were considered TRG 1. If the pathology report did not explicitly mention a TRG score, the complete pathology report was reviewed together with a dedicated gastrointestinal pathologist (P.S., with >8 years of experience) to assign a TRG score retrospectively.
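
Expressed as simple rules, the two outcome definitions above could be encoded as in the sketch below. This is purely illustrative, with hypothetical argument names; it is not code from the study.

```python
def classify_response(ypT=None, ypN=None, trg=None, sustained_ccr_2yr=False):
    """Apply the two response definitions used in this study (illustrative only).

    Returns (complete_response, good_response) as booleans.
    - Complete response: ypT0N0 after surgery, or a sustained cCR >2 years (watch-and-wait).
    - Good response: Mandard TRG 1-2; a sustained cCR counts as TRG 1.
    """
    if sustained_ccr_2yr:
        trg = 1
    complete = sustained_ccr_2yr or (ypT == 0 and ypN == 0)
    good = trg is not None and trg <= 2
    return complete, good

# e.g., a surgical patient with ypT2N0 and TRG 3 -> (False, False)
print(classify_response(ypT=2, ypN=0, trg=3))
```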

Statistical analysis

The 9 centers were divided into a development set comprising 6 centers (n=412) and an (external) validation set comprising 3 centers (n=97). Differences between the development and validation sets were assessed using Chi-squared tests for categorical variables (sex and response) and Kruskal-Wallis tests for continuous/ordinal variables (age, cTN-stage). The model development and validation process is summarized in Fig. 3. For each of the eight variable sets (see Table 1), the ability to predict the two respective response outcomes (complete vs incomplete response; good vs poor response) was assessed in the development cohort by calculating the average area under the receiver operating characteristic curve (AUC) after repeated (n=100) random hold-out validation. During each iteration, the development cohort was randomly split into a 70% training / 30% test dataset. All training variables were then scaled (mean=0, standard deviation=1), with the same scaling (i.e., using the mean and standard deviation derived from the training set) applied to the test set. When two or more features in a variable set were correlated (Pearson’s ρ>0.8 in the training data), only the feature with the lowest mean absolute correlation was retained for further analysis. The remaining variables were used to train a logistic regression model with Least Absolute Shrinkage and Selection Operator (LASSO) regularization [26]. The LASSO regularization parameter (λ) was tuned to select only the most relevant variables by minimizing the negative binomial log-likelihood loss using internal repeated (n=100) 10-fold cross-validation. Each model’s performance was measured on the test dataset, and the model achieving the best average test AUC was retrained on the whole development cohort. As a final step, the performance of this best-performing model (N.B. one model for CR and one for GR) was tested on the external validation cohort. 95% confidence intervals for the averaged AUCs in the development data were estimated through bootstrapping (200 samples); confidence intervals for the validation cohort were obtained using DeLong’s method [27].
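
For illustration, one iteration of this repeated hold-out procedure could look roughly like the sketch below, using scikit-learn. The correlation filter and the regularization grid (λ is expressed through its inverse C in scikit-learn) are approximations of the procedure described above, and none of this is the authors' actual code.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

def drop_correlated(X_train, threshold=0.8):
    """From each correlated pair (|Pearson r| > threshold), keep the feature
    with the lowest mean absolute correlation to all other features."""
    corr = X_train.corr().abs()
    mean_corr = corr.mean()
    cols, to_drop = corr.columns, set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                a, b = cols[i], cols[j]
                to_drop.add(a if mean_corr[a] > mean_corr[b] else b)
    return [c for c in cols if c not in to_drop]

def holdout_iteration(X, y, seed):
    """One 70/30 hold-out split: scale, filter correlated features, fit a
    LASSO-penalized logistic regression with the penalty tuned by internal
    10-fold CV on log-loss, and return the test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    keep = drop_correlated(X_tr)
    scaler = StandardScaler().fit(X_tr[keep])          # scaling learned on training data only
    X_tr_s, X_te_s = scaler.transform(X_tr[keep]), scaler.transform(X_te[keep])
    model = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=20,
                                 cv=10, scoring='neg_log_loss', random_state=seed)
    model.fit(X_tr_s, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1])

# X: DataFrame holding one variable set; y: one binary response outcome.
# aucs = [holdout_iteration(X, y, seed) for seed in range(100)]
# print(sum(aucs) / len(aucs))
```

Averaging the returned test AUCs over 100 random seeds corresponds to the repeated hold-out estimate reported for the development cohort.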

Fig. 3

Schematic overview of the study workflow and statistical analysis. From a total cohort of 509 patients from 9 centers, 412 patients (from 6 centers) were used to develop a prediction model to predict two respective outcomes (complete response, good response) using repeated hold-out validation. For both outcomes, the best-performing model was tested on an external and independent validation cohort consisting of 97 patients (from 3 different centers)

Supplementary Materials B describes two additional analyses: (1) testing the effects of 3 different, previously described methods for multicenter data normalization (normalization using a reference organ [28], statistical correction of imaging features using the ComBat algorithm [29], and statistical correction using mixed-effects models [30]), and (2) comparing model performance in the multicenter dataset with that in a single-center data subset from the cohort acquired with a harmonized MRI acquisition protocol. The latter was done to mimic a single-center study design for comparison with our multicenter results.
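
To give an impression of what such normalization entails, the sketch below illustrates a reference-organ normalization and a simplified per-center location-scale alignment. The latter is only a conceptual stand-in for ComBat (which additionally applies empirical Bayes shrinkage and can preserve biological covariates); the function names and the choice of reference statistic are illustrative assumptions, not the methods exactly as implemented in Supplementary Materials B.

```python
import numpy as np

def normalize_by_reference(tumor_values, reference_values):
    """Reference-organ normalization: express tumor intensities relative to the
    mean signal of a reference tissue segmented on the same scan, which cancels
    scanner-dependent intensity scaling (illustrative assumption)."""
    return np.asarray(tumor_values, dtype=float) / np.mean(reference_values)

def center_align(feature, centers):
    """Simplified location-scale harmonization: align each center's feature
    distribution to the pooled mean and standard deviation. This is a
    conceptual stand-in for ComBat-style correction, not ComBat itself."""
    feature = np.asarray(feature, dtype=float)
    centers = np.asarray(centers)
    out = np.empty_like(feature)
    pooled_mean, pooled_std = feature.mean(), feature.std()
    for c in np.unique(centers):
        idx = centers == c
        z = (feature[idx] - feature[idx].mean()) / feature[idx].std()
        out[idx] = z * pooled_std + pooled_mean
    return out
```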

Results

Patients

Baseline patient information is presented in Table 2; 332 (65%) patients were male and the median age was 65 years. For the outcome complete (versus incomplete) response, 141 patients (28%) were classified as complete responders. For the outcome good (versus poor) response, 225 patients (44%) were classified as good responders. The development and validation cohorts showed no significant differences in sex, age, cT-stage, cN-stage, or tumor response (p=0.37–0.98).

Table 2 Baseline patient characteristics and variations between centers

Model performance and predictive variables

Results for model development and performance are detailed in Table 3. The best-performing model included non-imaging and advanced image-based staging variables and achieved an average AUC of 0.60 (95% CI 0.48–0.72) to predict a complete response and an AUC of 0.65 (95% CI 0.53–0.76) to predict a good response in the external validation cohort, results very similar to those obtained during testing in the development cohort. The addition of quantitative imaging features did not improve predictive performance in any of the model combinations. Basic staging variables consistently achieved lower predictive performance than the advanced staging variables, especially (though 95% confidence intervals showed some overlap) when the basic staging variables were derived from the original reports. Based on the model coefficients, a more proximal tumor location, shorter tumor length, longer waiting interval after CRT, lower cT-substage and cN-stage, negative MRF, lower extramural invasion depth, and negative EMVI status were associated with a favorable response outcome (full model coefficients are provided in Supplementary Materials C).

Table 3 Model performance

The results of Supplementary Materials B show that none of the normalization methods applied to retrospectively harmonize the data improved predictive performance. When mimicking a single-center study design (i.e., performing the same analysis on a single-center subset of our cohort with homogeneous imaging protocols), results were highly variable but showed a trend towards better single-center model performance for most variable sets when predicting a complete response. The best-performing single-center model (including non-imaging and advanced staging variables) achieved an AUC of 0.79, compared with an AUC of 0.69 in the total multicenter development cohort.

Discussion

This multicenter study shows that when combining clinical baseline variables with image-based staging and quantitative variables, overall model performance to predict neoadjuvant treatment response in rectal cancer is disappointing, with externally validated AUCs ranging between 0.60 and 0.65 to predict either a complete response (ypT0) or a good response (TRG 1-2). The best model performance was achieved when combining clinical baseline information (e.g., time to surgery) with image-based staging variables (e.g., cT-stage); quantitative imaging features had no added value. Notably, model performance was considerably better when including modern staging parameters such as cT-substage, extramural invasion depth, and EMVI, compared to more traditional staging including only simplified cTN-stage and MRF involvement. Moreover, model performance appeared to be affected by staging variations between observers, with better performance when staging was performed by a dedicated expert than when it was based on the original staging reports produced by a multitude of readers.

Previous studies typically included staging variables such as cTN-stage as part of the “baseline patient variables,” which implies that these are “objective” variables with little variation between readers [31, 32]. While measurement variations are commonly considered when analyzing quantitative imaging data, our results demonstrate that interobserver variation is also an important issue to take into account for the more basic staging variables. The improved model performance when also including modern staging variables such as cT-substage and EMVI in the expert re-evaluations further highlights the importance of high-quality diagnostic staging using up-to-date guidelines. The clinical impact of ‘state-of-the-art’ staging was also demonstrated by Bogveradze et al, who showed in a retrospective analysis of 712 patients that, compared to “traditional” staging methods, advanced staging according to recent guideline updates would have led to a change in risk classification (and therefore potentially in treatment stratification) in up to 18% of patients [33].

The fact that our cohort dates back as far as 2008 and covers a 10-year inclusion period explains why many of these advanced staging variables could not be derived from the original reports. The use of older data will likely also have impacted the quality of the images and thus the quantitative imaging features derived from them. Following developments in acquisition guidelines and software and hardware updates, image quality will have evolved over time. This is also reflected by the large number of different imaging protocols, including 112 unique T2W and 94 unique DWI protocols. The question therefore remains if and how model performance would have improved using only state-of-the-art and/or more harmonized (prospectively acquired) MRI data.

In our current dataset, quantitative imaging features showed no added benefit for predicting response. This contradicts previous single-center and smaller bi- and tri-institutional studies that achieved more encouraging AUCs ranging from 0.63 to as high as 0.97 [5, 14, 15]. These previous results are likely, at least in part, an overestimation of how such models would perform in everyday practice, as especially earlier pilot studies are hampered by limitations in methodological design (e.g., small patient cohorts, re-use of training data for testing, and multiple testing), as also outlined in several review papers reporting on the quality and/or reproducibility of image biomarker studies [5, 19, 34,35,36,37,38]. The fact that most previous studies have been single-center reports will likely also have played an important role. Though reflective of data acquired in everyday practice, our results confirm the known difficulties of building generally applicable prediction models using heterogeneous, retrospectively collected multicenter data. While some data variation is necessary to identify features that are robust to vendor and acquisition differences, too much variation will negatively impact model generalizability. To directly compare and investigate the effects of multicenter (heterogeneous) versus single-center (homogeneous) modelling using our own data, we mimicked a single-center comparison by repeating our study analyses on a homogeneous single-center subset within our cohort. Though the results have to be interpreted with caution, considering the wider confidence intervals and lack of external validation in the single-center arm, this comparison suggests that the best-performing model indeed performed better for the homogeneous single-center subset (AUC 0.79) than for the multicenter cohort (AUC 0.69). Though full data harmonization will likely never be achieved in daily clinical practice, these findings do support the need for further protocol guidelines and standardization to benefit future multicenter research.

There are some limitations to our study design. As mentioned above, data were acquired over the time span of a decade and include scans obtained using outdated protocols dating back as far as 2008. A detailed analysis of the impact of these spectrum effects was outside the scope of this study, but a preliminary analysis (results not reported) showed that the impact of temporal changes was negligible. All segmentations were performed on high b-value DWI and then copied to the T2W-MRI and ADC maps. Although care was taken to include anatomical information from T2W-MRI during segmentation, ideally a separate segmentation would have been performed for each sequence. Finally, the comparison between the original basic staging reports and the advanced staging performed as part of this study was influenced by the fact that all re-evaluations were done by a single reader, whereas the original staging reports were produced by a multitude of readers with varying levels of expertise. Due to the time-consuming nature of the expert re-evaluations (and segmentations), it was unfortunately not deemed feasible to include an independent second reader.

In conclusion, this multicenter study combining clinical information and MRIs acquired as part of everyday clinical practice over the time span of a decade rendered disappointing performance for predicting response to neoadjuvant treatment in rectal cancer. The best results were obtained when combining clinical baseline information with state-of-the-art image-based staging variables, highlighting the importance of good-quality staging according to current guidelines and staging templates. No added value was found for quantitative imaging features in this multicenter retrospective study setting. This is likely, at least in part, the result of acquisition variations, which are a major problem for feature reproducibility and thus model generalizability. To benefit from quantitative imaging features (assuming they have predictive potential), further optimization and harmonization of acquisition protocols will be essential to reduce feature variation across centers. For future research, it would also be interesting to see how model performance may improve when combining the information that can be derived from imaging with other clinical biomarkers such as molecular markers (e.g., DNA mutations, gene expression, microRNA) [39, 40], blood biomarkers (e.g., CEA, circulating tumor DNA) [39, 41], metabolomics (e.g., metabolites, hormones, and other signaling molecules) [42], organoids [43], and immune profiling [44].