Predicting lymph node metastasis in colorectal cancer patients: development and validation of a nomogram

Lymph node metastasis (LNM) is one of the crucial factors in determining the optimal treatment approach for colorectal cancer (CRC). The objective of this study was to establish and validate a nomogram for predicting LNM in CRC patients. We extracted a total of 83,430 cases of CRC from the Surveillance, Epidemiology, and End Results (SEER) database, spanning the years 2010–2017. These cases were divided into a training group and a testing group in a 7:3 ratio. An additional 8545 patients from the years 2018–2019 were used for external validation. Univariate and multivariate logistic regression models were employed in the training set to identify predictive factors. Models were developed using logistic regression, LASSO regression, ridge regression, and elastic net regression algorithms. Model performance was quantified by calculating the area under the ROC curve (AUC) and its corresponding 95% confidence interval. The results demonstrated that tumor location, grade, age, tumor size, T stage, race, and CEA were independent predictors of LNM in CRC patients. The logistic regression model yielded an AUC of 0.708 (95% CI 0.7038–0.7122), outperforming ridge regression and achieving AUC values similar to those of LASSO regression and elastic net regression. Based on the logistic regression algorithm, we constructed a nomogram for predicting LNM in CRC patients. Further subgroup analysis based on gender, age, and grade indicated that the logistic prediction model exhibited good adaptability across all subgroups. Our nomogram displayed excellent predictive capability and serves as a useful tool for clinicians in predicting LNM in CRC patients.


Introduction
Colorectal cancer (CRC) refers to malignant tumors occurring in the proximal colon, distal colon, or rectum [1]. Currently, the standard treatment for CRC is curative surgery, although endoscopic resection may be considered for some early-stage colon cancer patients. However, if lymph node metastasis (LNM) is present, the treatment principles may differ significantly. For early-stage CRC with LNM, endoscopic treatment may not be suitable [2,3]. On the other hand, neoadjuvant chemotherapy should be considered for advanced-stage CRC with LNM. Therefore, regardless of whether endoscopic or surgical resection is performed, preoperative assessment of LNM in CRC patients is essential.
Predicting the presence of LNM in CRC patients before surgery holds great significance, both in treatment selection and prognostic evaluation [4]. Several studies have developed predictive models for LNM in CRC patients; however, these studies have limitations such as small sample sizes, single-center designs, or lack of external validation cohorts [5][6][7].
Currently, the risk factors for LNM remain unclear. To address this, we extracted data from the Surveillance, Epidemiology, and End Results (SEER) database for patients diagnosed with CRC between 2010 and 2019. Subsequently, we constructed a nomogram to predict LNM in CRC patients and evaluated the applicability of the model through external validation.

Study design and population
The data were extracted from the SEER database, a population-based clinical data repository maintained by the National Cancer Institute, covering approximately 28% of the U.S. population [8]. Using SEER*Stat software (Calverton, Maryland), we obtained a list of cases diagnosed with CRC between 2010 and 2019 from the SEER database. Since patient data in the SEER database are publicly available and de-identified, this study was exempt from ethical review.

Statistical analysis
Data were retrieved from the SEER database using SEER*Stat software (version 8.4.2). The data from 2010 to 2017 were divided into training and testing sets in a 7:3 ratio. Data from 2018 to 2019 were used for external validation. Categorical variables were presented as numbers and percentages, and intergroup comparisons were performed using the chi-square (χ²) test or Fisher's exact test.
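The cohort partition described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the record layout (a `year` field), the function name `split_cohort`, the random seed, and the toy data are all assumptions made for the example.

```python
import random

def split_cohort(records, train_fraction=0.7, seed=42):
    """Partition by diagnosis year, then split the 2010-2017 cases at random.

    Returns (training set, testing set, external validation set).
    """
    development = [r for r in records if 2010 <= r["year"] <= 2017]
    external = [r for r in records if 2018 <= r["year"] <= 2019]
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = development[:]      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:], external

# Toy data (hypothetical): 100 patients per diagnosis year, 2010 through 2019.
records = [{"id": i, "year": 2010 + i % 10} for i in range(1000)]
train, test, external = split_cohort(records)
print(len(train), len(test), len(external))  # 560 240 200
```

Keeping the 2018-2019 cases entirely out of the random split is what makes them a temporally separate validation set rather than a second internal test set.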
Univariate and multivariate logistic regression analyses were sequentially conducted to identify independent risk factors for LNM and establish a predictive model. Models were developed using logistic regression, LASSO regression, ridge regression, and elastic net regression methods. The Hosmer-Lemeshow goodness-of-fit test was performed to assess the fit of the predictive model. Model performance was quantified by calculating the area under the receiver operating characteristic curve (AUC) with its corresponding 95% confidence interval (CI), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The optimal-performing model was selected, and a nomogram was constructed.
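For readers less familiar with these metrics, the following sketch shows how the AUC and the threshold-based measures can be computed from predicted probabilities. It is written in plain Python for illustration and is not the pROC-based code used in the study; the function names, the toy labels and scores, and the 0.5 threshold are assumptions, and the confidence interval is omitted.

```python
def auc(labels, scores):
    """AUC as the probability that a random LNM-positive case scores higher
    than a random LNM-negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_metrics(labels, scores, threshold):
    """Sensitivity, specificity, PPV and NPV when score >= threshold is called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy data (hypothetical): 1 = LNM present, 0 = LNM absent.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.7, 0.6]
model_auc = auc(labels, scores)
metrics = threshold_metrics(labels, scores, threshold=0.5)
print(model_auc, metrics)
```

The pairwise comparison is quadratic in the number of cases, so it suits small examples only; a rank-based computation would be preferable for a cohort of SEER size.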
All statistical analyses and visualizations were performed using R software (version 4.3.1). The glmnet package was used for constructing LASSO regression, ridge regression, and elastic net regression models with ten-fold cross-validation. Other R packages, such as compareGroups, ResourceSelection, rms, and pROC, were also utilized. A p value < 0.05 was considered statistically significant.
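The three penalized models differ only in the penalty term added to the logistic log-likelihood. As a reminder, the objective that glmnet minimizes for a binary outcome can be written as below, where y_i is the LNM status of patient i, x_i the vector of predictors, N the number of patients, and λ and α the penalty strength and mixing parameter. The notation is ours; the α used for the elastic net model is not reported in the text.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
+ \lambda\Bigl[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
      + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting α = 1 gives LASSO, α = 0 gives ridge regression, and 0 < α < 1 gives the elastic net; with λ = 0 the objective reduces to ordinary logistic regression. Ten-fold cross-validation is typically used to choose λ.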

Baseline characteristics
A total of 83,430 cases from the SEER database were included in this study as the training and testing sets for the period between 2010 and 2017, and 18,545 cases from 2018 to 2019 were used as the external validation set (Table 1). Among these 83,430 patients, 43,889 (52.6%) were male. The racial distribution consisted of 65,641 (78.7%) White individuals, 8,776 (10.5%) Black individuals, and 9,013 (10.8%) individuals of other races. None of the variables differed significantly between the training and testing sets. More detailed features are presented in Table 1.

Differences in characteristics between patients with and without LNM
Table 2 displays the characteristics of patients with and without LNM. The results indicate significant differences between patients with and without LNM in terms of age, tumor location, T stage, tumor size, tumor grade, CEA levels, and race (all P < 0.001). The occurrence rate of LNM was higher in patients younger than 60 years compared to other age groups (P < 0.001). CEA-positive patients had a higher LNM occurrence rate (P < 0.001). As tumor grade increased, the proportion of LNM also increased (P < 0.001). The occurrence rate of LNM was significantly higher in rectal cancer than in colon cancer. Additionally, the occurrence rate of LNM increased with larger tumor diameters.

Risk factors associated with LNM in CRC patients
The results of univariate and multivariate logistic regression analyses are presented in Table 3. CEA-positive patients had 1.28 times the odds of LNM compared to CEA-negative patients (OR 1.28; 95% CI 1.24-1.33). Compared to patients younger than 60 years, the odds of LNM in patients aged 60-70 years and > 70 years were 0.76 times (OR 0.76; 95% CI 0.73-0.80) and 0.55 times (OR 0.55; 95% CI 0.53-0.58), respectively.
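The odds ratios reported here are exponentiated logistic regression coefficients, and each 95% CI is the exponentiated Wald interval. The sketch below, in Python for illustration, shows the conversion in both directions using the CEA estimate quoted above. The function names are ours, and the back-calculated coefficient and standard error are approximations derived from the rounded values in the text, not figures taken from the study's model output.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a logistic coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coefficient_from_or(or_point, ci_low, ci_high, z=1.96):
    """Approximate coefficient and standard error recovered from a reported OR and CI."""
    beta = math.log(or_point)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se

# CEA-positive vs CEA-negative, as reported: OR 1.28 (95% CI 1.24-1.33).
beta, se = coefficient_from_or(1.28, 1.24, 1.33)
or_point, low, high = odds_ratio_with_ci(beta, se)
print(round(beta, 3), round(se, 3))
print(round(or_point, 2), round(low, 2), round(high, 2))
```

Note that an odds ratio compares odds, not probabilities: an OR of 1.28 does not mean that the proportion of patients with LNM is 1.28 times higher, although the two are close when the outcome is uncommon.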

Model comparison and selection
We constructed models using logistic regression, LASSO regression, ridge regression, and elastic net regression. Table 4 presents the AUC values of these models in the training and testing sets. In the testing set, the logistic regression, LASSO regression, ridge regression, and elastic net regression models achieved AUCs of 0.708 (95% CI 0.704-0.712), 0.707 (95% CI 0.703-0.711), 0.708 (95% CI 0.704-0.712), and 0.708 (95% CI 0.702-0.714), respectively. There were no significant differences in AUC among these models (P > 0.05). Although all models performed similarly, the logistic regression model was more clinically interpretable. Therefore, the logistic regression model was selected (Fig. 1).

Model fit analysis
The calibration curves of the nomogram (Fig. 3A-C) demonstrated high consistency between the predicted and observed probabilities of LNM in both the training and validation cohorts. Additionally, the decision curve analysis (DCA) curves (Fig. 3D-F) indicated the good clinical utility of our model.
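Decision curve analysis plots the net benefit of acting on the model's predictions across a range of threshold probabilities and compares it with the strategies of treating all patients or none. A minimal sketch of the standard net benefit calculation is given below in Python; the function names, the toy labels and probabilities, and the 0.5 threshold are assumptions for illustration and do not reproduce the study's curves.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating every patient whose predicted probability
    is at or above the threshold probability."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient regardless of the prediction."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy data (hypothetical): 1 = LNM present, 0 = LNM absent.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.7, 0.6]
nb_model = net_benefit(labels, probs, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
print(nb_model, nb_all)  # 0.125 0.0
```

The treat-none strategy has a net benefit of zero by definition, so a model is useful at a given threshold when its net benefit exceeds both zero and the treat-all value.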

Discussion
Colorectal cancer (CRC) is the third leading cause of cancer-related deaths worldwide, with over 1.85 million new cases and 850,000 deaths annually [9]. Lymph node metastasis (LNM) is an important prognostic factor in CRC and influences the selection of treatment options. Studies have shown that the 5-year survival rate for CRC patients with positive lymph nodes ranges from 30 to 60%, significantly lower than that for patients without lymph node involvement (5-year survival rate: 70-80%) [10]. Current treatment options for CRC include endoscopic resection, surgical resection, radiotherapy, chemotherapy, targeted therapy, and immunotherapy. Endoscopic treatment is mainly used for early-stage CRC but is not suitable for patients with LNM.
For non-early-stage CRC patients, surgical resection is the primary consideration, but if LNM is present, preoperative (neoadjuvant) therapy needs to be evaluated [3]. Therefore, LNM is a crucial determinant in choosing the appropriate treatment approach and serves as an important prognostic factor for CRC recurrence and distant metastasis [11][12][13]. Predicting LNM can support more accurate, personalized treatment strategies, which is of paramount importance for CRC patients.
Numerous models have been developed to predict LNM in CRC patients; however, they have limitations and certain shortcomings [14]. In this study, we established a model based on the SEER database and conducted internal and external validations. We compared four models (logistic regression, LASSO regression, ridge regression, and elastic net regression) and identified the significant predictors. The results indicated that LNM was associated with tumor location, grade, patient age, tumor size, T stage, race, and CEA level. Previous studies and experience have shown that larger tumor size, deeper infiltration into the intestinal wall, and later stage are associated with a higher probability of lymph node metastasis [15,16]. Our study revealed that compared to patients with tumor size < 7 cm, the risk of LNM increased by 1.20 times and 1.36 times in patients with tumor sizes of 7-15 cm and > 15 cm, respectively. The probability of lymph node metastasis also increased with higher T stages. Many previous studies have demonstrated the close correlation between tumor size and the risk of LNM [17], confirming tumor size as an independent prognostic factor. This may be related to the high expression of CCR7, as Yan C et al. found significantly higher CCR7 expression in tumors with LNM compared to those without, and CCR7 expression showed a positive correlation with tumor size.
These findings align with our results: the likelihood of lymph node metastasis increases significantly as T stage advances and tumor size increases, especially at T4 stage or with a tumor diameter > 15 cm. Treatment options should therefore be selected with care.
Studies have shown that the occurrence rate of lymph node metastasis (LNM) differs between rectal cancer and colon cancer [18,19]. Our study indicates that the risk of LNM in rectal cancer patients is 1.4 times higher than that in colon cancer patients. This may be attributed to different tumor biology and anatomical characteristics. Therefore, for early-stage rectal cancer, radical resection rather than local excision seems to be the more reasonable approach, because lymph node involvement, which is more common in rectal cancer, may be the main cause of local recurrence after surgery. Similarly, rectal cancer patients may require adjuvant chemotherapy more often after local excision [20].
Studies in the United States have demonstrated that younger CRC patients have a higher risk of lymph node positivity than older patients under comparable conditions [21][22][23]. Consistent with these reports, our study shows that across patients aged < 60 years, 60-70 years, and > 70 years, the probability of lymph node positivity decreases with increasing age. On this basis, caution is warranted when considering endoscopic treatment for young patients with early-stage CRC.
CEA plays a crucial role in the biological phenomena of tumor cells, including adhesion, immune response, and apoptosis [24]. Previous research and experience have shown that CEA levels are associated with lymph node positivity and prognosis in patients with CRC [25]. Our study indicates that CEA-positive patients have a 1.6 times higher likelihood of LNM compared to CEA-negative patients. This may be related to the mechanisms of CEA, as it enhances the metastatic potential of CRC through various pathways. In addition to being considered a pro-angiogenic molecule, CEA protects metastatic cells from death, alters the microenvironment of blood sinuses, promotes the expression of adhesion molecules, and enhances the survival of malignant tumor cells [26].
We developed a nomogram based on the SEER database to predict LNM in CRC patients and conducted internal and external validations. We further evaluated the performance of the model in different subgroups. This predictive tool for the likelihood of LNM in CRC patients can guide clinicians in selecting more appropriate treatment strategies. However, this study still has some limitations. Firstly, the data used in the study are derived solely from the SEER database and lack information on patients from other regions. Secondly, certain imaging-related data that may contribute to the prediction of LNM are lacking in the SEER database. We hope that further research will address these limitations.

Conclusion
In this study, we developed a nomogram for predicting lymph node metastasis (LNM) in CRC patients. We identified tumor location, grade, age, tumor size, T stage, race, and CEA as independent predictive factors for LNM in CRC patients. This tool can predict the likelihood of LNM in CRC patients, which may aid clinicians in formulating appropriate treatment strategies.

Fig. 3 Calibration plots showing the consistency between the predicted probabilities and actual values. A-C The consistency between the predicted probabilities and actual values in the training set, the test set, and in …
… -2.52), and 2.39 times (OR 2.39; 95% CI 2.11-2.72), respectively. Compared to T1 stage, the risk of LNM occurrence in patients with T2, T3, and T4 stages was 1.18 times (OR 1.18; 95% CI 1.68-2.03), 5.40 times (OR 5.40; 95% CI 4.96-5.89), and 8.89 times (OR 8.89; 95% CI 8.08-9.80), respectively. White individuals had the lowest risk of LNM compared to Black individuals and individuals from other races.

Table 1
Characteristics of all patients

Table 2
Difference analysis of patients with or without LNM in the training set