Introduction

Locally advanced rectal cancer (LARC) refers to patients with rectal cancer (RC) with clinical (c) T3-cT4 or positive nodal status. The standard treatment strategy is neoadjuvant chemoradiotherapy (nCRT) followed by total mesorectal excision [1]. nCRT aims to achieve tumor downstaging, improve resection rate, increase sphincter preservation probability, and reduce local recurrence rate. For patients with a clinically complete response to nCRT, organ preservation strategies, such as a watchful waiting policy, could avoid radical surgery, preserve organ function, and enhance quality of life. Notably, studies suggest a possible link between the status of lymph nodes (LNs) after nCRT and the prognosis of LARC patients because complete LN regression consistently correlates with improved disease-free survival, overall survival, and reduced local recurrence and distal metastasis risk [2,3,4]. Chan et al found that the recurrence rate of LN-positive patients was six times higher than that of LN-negative patients; moreover, the five-year survival rate was 42% for LN-positive patients and 85% for LN-negative patients [5]. Additionally, when watchful waiting or local excision is considered, a precise assessment of LN restaging following nCRT is important. Lymph node regression after nCRT may help predict the clinical complete response of the primary tumor [6]. Conversely, LNs containing tumor cells after nCRT are a potential source of local recurrence and distant metastasis. Therefore, accurate prediction of LN metastasis (LNM) after nCRT is crucial in therapeutic decisions and for predicting prognoses for LARC patients.

At present, the preoperative evaluation of LN status and restaging following nCRT in RC relies on high-resolution magnetic resonance imaging (MRI) [7]. High-resolution T2 weighted imaging (HR-T2WI) is preferred for the evaluation of the morphological and signal characteristics of LN, such as irregular borders, uneven internal signals, and roundness. Diffusion-weighted imaging (DWI) facilitates malignant LN detection and provides biological information on cellularity. Incorporating DWI and T2WI can improve the accuracy of preoperative LNM predictions [8]. However, the reaction of LN to nCRT is heterogeneous, ranging from residual cancers to a complete fibrotic response, causing LN changes in morphology, dimension, quantity, and texture [9, 10]. In this context, visual assessment based on MRI to identify LNM following nCRT may be ambiguous, especially for small nodes (< 3 mm). Thus, there is a need for a new diagnostic method.

Radiomics extracts quantitative features from medical images and transforms them into mineable, high-dimensional data to reveal pathophysiological information about tumor heterogeneity in biomedical images [11, 12]. Studies have demonstrated that the radiomic characteristics of primary tumors can be used to predict LNM in RC [13,14,15]. For instance, one study reported that a radiomic nomogram based on T2WI, apparent diffusion coefficient (ADC) features, and clinical factors performed favorably [16]. Yang et al developed and validated an HR-T2WI radiomic model that could help predict the LNM of RC [17]. Furthermore, several studies focus on predicting LNM following nCRT based on multiparametric MRI using radiomics and report relatively high performances with areas under the curve (AUCs) of 0.812–0.865 in the validation cohort [18]. However, the absence of an external cohort to show how the model performs in the real world limits the clinical translation of these methods. Additionally, the radiomic models based on MR data performed before nCRT may miss key information related to LNM.

This study aimed to construct and validate multiparametric MR-based radiomic models with the pre- and/or post-nCRT information to predict LN status following nCRT in LARC patients. To obtain better predictive performance, we constructed and validated MR-based radiomics models based on various combinations obtained from pre- and/or post-nCRT information to predict LN status following nCRT in LARC patients, and compared them with the radiologist’s qualitative evaluations.

Materials and methods

Patients

The institutional review boards of the Nanfang Hospital (Guangzhou, China, center A) and the Second Affiliated Hospital of Guangzhou University of Chinese Medicine (Guangzhou, China, center B) granted ethical approval of the retrospective study and waived the need for informed consent. Between October 2017 and October 2020, consecutive LARC patients (n = 150) from the two medical centers were included. The inclusion criteria were: (1) histopathologically confirmed rectal adenocarcinoma; (2) diagnosed as LARC (cT3–T4 or cN1–2) at the initial treatment stages; (3) received MR scan before and after nCRT; and (4) received complete nCRT followed by surgery and confirmed by postoperative pathology. The exclusion criteria were: (1) additional targeted therapy or immunotherapy during treatment; (2) recurrent rectal carcinoma; (3) poor quality MRI; and (4) incomplete clinicopathological data. Figure 1 depicts a flowchart for patient recruitment.

Fig. 1
figure 1

Flowchart of patient selection

The clinicopathological features of patients were obtained from their medical records. The collected data included age, gender, carcinoembryonic antigen (CEA) level before and after nCRT, tumor location, chemotherapy regimen, and tumor differentiation.

Pathologic assessment

All patients underwent total mesorectal excision surgery after nCRT. The extent of LN dissection encompassed the following regions: perirectal LNs, internal iliac LNs, external iliac LNs, common iliac LNs, superior rectal LNs, presacral LNs, and pararectal LNs. The pathological assessment was performed on all the surgically resected LNs. The pathologic T and N stages were evaluated according to the American Joint Committee on Cancer’s Cancer Staging Manual (AJCC 8th edition) by two pathologists in consensus (X.H.D. and H.S.W., with 11 and 15 years of experience in gastrointestinal diagnosis, respectively).

MRI acquisition and radiologist’s assessment LN after nCRT

All study participants had an MRI one week before the start of nCRT and one week before surgery; these are referred to as pre- and post-nCRT MRI, respectively. Before the MR examinations, patients received a cleansing enema but did not receive bowel preparation antispasmodic medication, or rectal distention. The imaging protocol included T2WI, DWI, and T1-weighted imaging in the oblique axial, coronal, and sagittal planes. Oblique axial and coronal sequences were angulated perpendicular and parallel to the tumor axis, respectively. Additionally, a coronal sequence parallel to the anal canal was performed in distal tumors (lower third of the rectum). Table 1 summarizes the imaging acquisition parameters for axis T2WI and DWI.

Table 1 MRI sequences parameters of T2WI and DWI of different MR devices

A gastrointestinal radiologist (X.L., with 19 years of experience) retrospectively and independently reviewed MRI images to evaluate MR-based tumor regression grade (mrTRG) and LN status after nCRT, without knowing the patient’s pathological findings. mrTRG was assessed as outlined by Patel et al [19]. LN MR-restaging was assessed according to the European Society of Gastrointestinal and Abdominal Radiology (ESGAR) criterion, including size (short axis diameter > 5 mm) and nodal morphological features (shape, contour, signal intensity homogeneity on T2WI, and enhanced homogeneity) [7].

Tumor segmentation and radiomic feature extraction

The volume of interest (VOI) of the primary tumors was manually delineated in a blinded manner in the axis T2WI and DWI (b = 1000 s/mm2) before and after nCRT treatment by two radiologists (X.H. and L.C.), each with more than seven years of experience in consensus using ITK-SNAP software (version 3.8; http://www.itksnap.org/). The intestinal lumen and noninvaded rectal wall were carefully excluded from the tumor regions (Fig. 2). The radiologists were blinded to the patients’ clinicopathological information. To ensure consistency and reproducibility of extracted features, 45 patients were chosen at random to calculate the intraclass correlation coefficient (ICC); features with an ICC < 0.75 were eliminated.

Fig. 2
figure 2

Rectal pre-and post-nCRT MRI scans in a 54-year-old man with lymph node metastasis (LNM) proven by pathology after nCRT. Regions of interest segmentation of the primary tumor on T2WI (a, c) and DWI images (b, d) before and after nCRT. Before nCRT, a suspicious metastatic lymph node (MLN) is noticed (white arrow), with high intensity in T2WI (a) and DWI (b). The suspicious MLN shrinks (< 3 mm) on T2WI (c) and DWI images (d) after nCRT. Photomicrograph (hematoxylin-eosin stain, ×100) shows the presence of residual invasive tumor cells in the LN (e)

Before feature extraction, the MR images were preprocessed using AK software (GE Healthcare, China) to compensate for differences owing to different protocols. The pre-processing steps were: (1) the MR images were smoothed with the bilateral filter algorithm to achieve similar noise characteristics; (2) the MR images and VOI were resampled to a uniform voxel size of 1 × 1 × 1 mm3 using linear interpolation and nearest neighbor interpolation, respectively; and (3) T2WI and DWI images were Z-score normalized to eliminate the influence of different gray value ranges.

The PyRadiomics (https://pyradiomics.readthedocs.io/en/latest/) package with the default setting was used to extract the radiomic features from the T2WI and DWI images before and after nCRT (with a fixed intensity bid width of 25). Each imaging modality yielded 960 radiomic features, for a sum of 3840 radiomic features extracted for each patient. The extracted features were: (1) 14 shape-based features; (2) 18 first-order features; (3) 68 texture features (gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run-length matrix, and gray-level size zone matrix); (4) 688 wavelet features; and (5) 172 Gaussian Laplacian features.

Feature selection

To standardize the radiomics features and mitigate the impact of variability among different MR scanners, Z-score normalization was applied to the radiomics features of each patient. Subsequently, two methodologies were explored to identify optimal features for predicting LNM in LARC patients after nCRT. Initially, pairwise matching analysis was conducted on all features, with features exhibiting a Spearman correlation coefficient > 0.70 being subjected to significant testing, where the feature with the lower p-value was retrained for subsequent analysis. Following this, the most predictive radiomics features were identified using multivariate logistic regression, with LNM being significantly associated with features having a p-value < 0.05. Moreover, given that homogeneity in the radiomics features can be influenced by center and protocol/vendor-specific dependencies, we employed the ComBat harmonization approach to eliminate batch effects arising from variations across the different centers and different MR modalities [20, 21].

Model construction and evaluation

Random forest (RF) is a machine learning technique that uses an ensemble approach to combine decision regression trees and classification methods [22]. Compared to other classification methods, RF demonstrates enhanced efficacy in handling noisy data and outliers. Its robustness against overfitting and reduced sensitivity to input values contribute to heightened discriminative capabilities and improved precision [23, 24]. In the study, we used an RF classifier to establish machine learning models. We first constructed four single-sequence models based on T2WI or DWI features before and after nCRT (modelpre_T2, modelpre_DWI, modelpost_T2, and modelpost_DWI). Then, three combined radiomic models were produced by combining the features of different treatment points, including the combinations of T2WI and DWI before nCRT (modelpre_T2_DWI), T2WI and DWI after nCRT (modelpost_T2_DWI), and T2WI and DWI before and after nCRT (modelpre_T2_DWI_post). The clinical risk factors were selected via the least absolute shrinkage and selection operator (LASSO) method and the penalty parameters were tuned using ten-fold cross-validation. Variables with non-zero coefficients were included in the clinical model.

The model was trained with all the data from the training cohort and validated with data from the external validation cohort. The models’ discriminative performance was assessed via receiver operating characteristic curve (ROC) analysis, and quantified by the AUC. Figure 3 illustrates displays the pipeline for constructing and evaluating various models for predicting LNM after nCRT in LARC patients.

Fig. 3
figure 3

The study workflow for predicting LNM after nCRT in LARC patients

Statistical analysis

SPSS (version 26.0; IBM, New York, USA) and R software (version 4.2.1; R Core Team, Vienna, Austria) were used for statistical analyses. An independent t-test was used to process continuous variables and the chi-square test or Fisher’s exact test was used to analyze the classified variables. DeLong’s test was utilized to evaluate differences in the predictive performance concerning the AUCs among clinical models, radiologists’ evaluations, and different radiomic models. All statistical tests were two-sided. Benjamini and Hochberg-corrected p-values were used to assess the feature significance for multiple comparisons.

Results

Patient characteristics

A total of 150 patients with LARC from two independent institutions were enrolled in this study. The training cohort comprised 100 patients from center A. Fifty patients from center B were included in the external validation cohort. Fifty-two patients with LNM status were confirmed by pathology, with a prevalence of 37% (37/100) in center A and 30% (15/50) in center B.

A significant difference was found in the CEA level after nCRT and mrTRG between the training cohort and external validation cohort (all p < 0.05). There were no significant differences between the training cohort and external validation cohort in gender, age, mrT Stage after nCRT, CEA level before nCRT, tumor differentiation, tumor location, or chemoradiotherapy regimen (p = 0.686, 0.417, 0.349, 0.236, 0.785, 0.675, and 0.669, respectively; Table 2).

Table 2 Clinicopathological characteristics of patients in training cohort and external validation cohort

Feature selection

From each MRI sequence, 960 radiomic features were extracted. Pre_T2, Pre_DWI, Post_T2, and Post_DWI were reduced to 898, 924, 864, and 955 features after excluding those with low repeatability. After the multivariate logistic regression, there were five features in Pre_T2, three features in Pre_DWI, three features in Post_T2, seven features in Post_DWI, twelve features in Pre_T2_DWI, seven features in Post_T2_DWI, and thirteen features in Pre_T2_DWI_Post. Based on the LASSO logistic regression analysis, mrTRG and gender were identified as clinical risk factors that are associated with LNM after nCRT (Fig. S1). Table 3 lists the best radiomic features of the various models and Fig. 4 depicts the correlation matrices for the selected features used in the different radiomic models.

Table 3 Most significant radiomics features of single-sequence models and combined models
Fig. 4
figure 4

Correlation matrix plot displaying correlations between radiomic features used in different prediction models. a Modelpre_T2. b Modelpre_DWI. c Modelpost_T2. d Model post_DWI. e Modelpre_T2_DWI. f Modelpost_T2_DWI. g Modelpre_T2_DWI_post

Performance of models

The assessment of LNs following nCRT by radiologist’s interpretation, based on the ESGAR criteria, yielded an AUC of 0.624 (95% CI: 0.475–0.725) in the training cohort and 0.556 (95% CI: 0.333–0.714) in the external validation cohort.

The clinical model was built based on mrTRG and gender, and the AUC of the clinical model was 0.687 (95% CI: 0.601–0.774) in the training cohort and 0.623 (95% CI: 0.503–0.742) in the external validation cohort.

For the single-sequence model, the AUC of modelpre_T2 was 0.945 (95% CI: 0.911–0.976) in the training cohort and 0.601 (95% CI: 0.464–0.737) in the external validation cohort, respectively. The AUC generated by modelpre_DWI was 0.956 (95% CI: 0.925–0.983) in the training cohort and 0.589 (95% CI: 0.435–0.746) in the external validation cohort. The AUC generated by modelpost_T2 was 0.934 (95% CI: 0.891–0.970) in the training cohort and 0.710 (95% CI: 0.568–0.838) in the external validation cohort. The AUC generated by modelpost_DWI was 0.978 (95% CI: 0.957–0.993) in the training cohort and 0.756 (95% CI: 0.616–0.877) in the external validation cohort.

For the combined model, the AUC of modelpre_T2_DWI in the training and external validation cohorts was 0.982 (95% CI: 0.964–0.995) and 0.602 (95% CI: 0.437–0.767), respectively. The AUC for modelpost_T2_DWI in the training cohort was 0.955 (95% CI: 0.907–0.991) and 0.811 (95% CI: 0.706–0.902) in the external validation cohort. The AUC for modelpre_T2_DWI_post was 0.978 (95% CI: 0.956–0.996) in the training cohort and 0.831 (95% CI: 0.715–0.940) in the external validation cohort. Table 4 depicts the additional quantitative indicators for evaluating the performance of the various models. The ROC curves generated by the different models are shown in Fig. 5.

Table 4 Prediction performance of different models in training cohort and external validation cohort
Fig. 5
figure 5

ROC curves of radiologist’s assessment and different models for predicting LNM after nCRT. Training cohort (a, c); external validation cohort (b, d)

Model comparison

For the training cohort, the DeLong test demonstrated that the AUCs of the radiomic models were superior to the radiologist’s assessment and clinical model (all p < 0.001; Fig. 6a). The AUC of modelpre_T2_DWI_post was better than that of modelpost_T2 (p = 0.049; Fig. 6a). For the external validation cohort, the AUC of modelpost_T2_DWI was superior to the radiologist’s assessment, clinical model, modelpre_T2, and modelpre_DWI (p = 0.026, 0.047, 0.029, and 0.036, respectively; Fig. 6b). The AUC of modelpre_T2_DWI_post was superior to that of radiologist’s assessment, the clinical model, modelpre_T2, modelpre_DWI, and modelpre_T2_DWI (p = 0.004, 0.036, 0.008, 0.007, and 0.014, respectively; Fig. 6b). The supplementary material provides a detailed comparison of various models in the training and external validation cohorts.

Fig. 6
figure 6

AUC comparison of radiologist’s assessment and different models with DeLong’s test in the training (a) and external validation cohorts (b)

Discussion

In this study, data from two centers were used to build one radiologist’s assessment model, one clinical model, four single-sequence radiomic models, and three combined-sequence radiomic models based on primary tumors to identify LNM after preoperative nCRT in LARC. An independent test set was used to assess predictive performance. To avoid confusion between pathology-confirmed metastatic LNs and MR-detected LNs, we extracted radiomic features from the primary tumor instead of individually delineating LNs. Our findings are consistent with data from previous studies that reported that the biological characteristics of primary tumors and metastatic LNs are similar [25, 26]. Our results demonstrate that the multiparametric model incorporating MR features before and after nCRT (modelpre_T2_DWI_post) had the best diagnostic performance for predicting LNM after nCRT in the external set, with a significantly higher AUC (0.831) than those of radiologist’s assessment, clinical model, modelpre_T2_DWI, and the single-sequence models (all p < 0.05). These data indicate that certain MR-based radiomic models have the potential to guide therapies for LARC patients.

Most previous radiomic studies focused on imaging data before nCRT to detect LNM [13,14,15]. However, tumor heterogeneity changes dynamically during treatment, and thus extracting features from imaging data before or after nCRT may miss important information about tumor changes during treatment [27]. We used both pre- and post-treatment imaging data, including T2WI and DWI, to investigate the role of radiomic features in detecting LNM. To the best of our knowledge, this is the first study to use preoperative MR data of primary tumors at two-time points to predict LNM in patients with LARC. Our findings reveal that the radiomic analysis of the baseline and follow-up MR data obtained more significant features and information on treatment-induced tumor changes. Additionally, the features of each MR sequence can still be found in the radiomic signature constructed by the Pre_T2_DWI_ Post after feature selection and eliminating redundant features, which highlights the significance of MRI parameters of both before and after nCRT in predicting LNM.

Four features were selected from T2WI and three from DWI before nCRT. After nCRT, three features were chosen from T2WI and seven from DWI. The selected features included shape, GLCM, GLDM, and wavelet features. Shape features describe the geometry of the VOI and indicate the degree of tumor complexity. GLCM and GLDM are texture-based radiomic features that characterize the intensity relationships between pairs of neighbor voxels in all spatial directions and intensity differences between neighbors, respectively [28]. Wavelet transformation offers a comprehensive spatial and frequency analysis of low- and high-frequency signals in tumor regions [29]. Tumors are biologically heterogeneous, with differences in cells, microenvironmental factors metabolism, vasculature, structure, and functions. These radiomic features reveal tumor heterogeneity at different scales, provide insights into the tumor microenvironment, and are valuable for predicting treatment responses in various tumors (non-small cell lung cancer, breast cancer, cervical cancer, and LARC).

To explore the impact of multiparametric MR data and time points on prediction accuracy, we constructed seven radiomic models, including four single and three combined-sequence models. Among the single-sequence models, modelpost_DWI exhibited superior predictive power, with an AUC of 0.756 in the external validation set. Our results suggest that radiomic features derived from DWI might be useful for predicting LN status in LARC patients. This observation is partially consistent with data from a previous study, which reported that texture features extracted from DWI images and ADC maps can predict pathological N stages in RC, with an AUC of 0.802 [30]. DWI is a functional technique that assesses water molecule diffusion in biological tissue. The usefulness of DWI in discriminating benign from malignant tumors has been demonstrated widely. Furthermore, there is growing evidence that DWI allows for qualitative and tumor microenvironment-based quantitative assessment of the post-treatment tumor bed [31]. The histopathological characteristics of primary tumors are closely related to LNM in RC [32]. Therefore, this might be why modelpost_DWI could successfully identify the LNM after nCRT. In combined-sequence models, the model that used T2WI and DWI features before and after nCRT had better performance than modelpre_T2_DWI and the single-sequence models and had high accuracy and specificity in both the training and external validation cohorts. The multi-sequence radiomic model could accurately determine LN status after nCRT, even in the absence of surgery-related clinical data. This might be interpreted that radiomic analysis based on the baseline and follow-up, therefore, may provide more significant features and information about changes resulting from treatment.

Despite the accuracy of LN restaging MRI following nCRT being better than that of baseline staging, challenges such as size overlap between malignant and reactive LNs, and fibrosis, edema, or inflammatory changes resembling tumors remain. In our study, the accuracy of the radiologist’s assessment of LN involvement was 0.578 in the external validation cohort. When constructing the clinical model, mrTRG and gender were identified as factors associated with LNM after nCRT, consistent with previous research findings [33, 34]. Newton et al developed a nomogram based on clinicopathological variables to predict LNM after nCRT in patients with LARC with a c-index of 0.71 [35], which is in line with our clinical model (AUC = 0.687). Nevertheless, the predictive performance of the clinical model is still significantly weaker than modelpre_T2_DWI_ post (p = 0.036). This might be due to clinicopathological features reflecting the coarse features of tumors, which inevitably involve clinicians’ subjective judgments of patients. In contrast, radiomic features contain multidimensional quantitative information that can more objectively and accurately reflect tumor heterogeneity and biological characteristics.

Our study has several limitations. Firstly, the sample size was small, which may affect the generalizability of the findings. Secondly, due to the retrospective nature of our study, the potential for selection bias remains, and we were unable to achieve precise alignment between pathologically confirmed LNs and those detected on MRI scans. Thirdly, we obtained data from different scanners at different centers. Although we used data pre-processing techniques such as resampling and normalization, as well as the ComBat method to eliminate batch effects, the heterogeneity of MRI scans from different centers is unavoidable. Lastly, the manual delineation of primary tumors was a time-consuming and labor-intensive process. Future studies should explore the application of deep learning for automatic VOI segmentation of RC.

In conclusion, our findings suggest that a multiparametric model that incorporates MR radiomic features before and after nCRT is optimal for predicting LNM after nCRT in patients with LARC. The model may help guide therapies and predict prognoses for LARC patients.