Introduction

Colorectal cancer (CRC) refers to malignant tumors occurring in the proximal colon, distal colon, or rectum [1]. Currently, the standard treatment for CRC is curative surgery, although endoscopic resection may be considered for some early-stage colon cancer patients. However, if lymph node metastasis (LNM) is present, the treatment principles may significantly differ. For early-stage CRC with LNM, endoscopic treatment may not be suitable [2, 3]. On the other hand, neoadjuvant chemotherapy should be considered for advanced-stage CRC with LNM. Therefore, regardless of whether endoscopic or surgical resection is performed, preoperative assessment of LNM in CRC patients is essential. Predicting the presence of LNM in CRC patients before surgery holds great significance, both in treatment selection and prognostic evaluation [4]. Several studies have developed predictive models for LNM in CRC patients; however, these studies have limitations such as small sample sizes, single-center designs, or lack of external validation cohorts [5,6,7].

Currently, the risk factors for LNM remain unclear. To address this, we extracted data from the Surveillance, Epidemiology, and End Results (SEER) database for patients diagnosed with CRC between 2010 and 2019. Subsequently, we constructed a nomogram to predict LNM in CRC patients and evaluated the applicability of the model through external validation.

Methods

Study design and population

The data were extracted from the SEER database, a population-based clinical data repository maintained by the National Cancer Institute, covering approximately 28% of the U.S. population [8]. Using SEER*Stat software (Calverton, Maryland), we obtained a list of cases diagnosed with CRC between 2010 and 2019. Since patient data in the SEER database are publicly available and de-identified, this study was exempt from ethical review.

Inclusion and exclusion criteria

Inclusion criteria: (1) Primary tumor located in the colon or rectum, (2) Pathological type classified as adenocarcinoma, (3) Patients with complete and available clinical baseline data. Exclusion criteria: (1) Multiple primary malignant tumors, (2) Distant metastasis present, (3) Zero survival time, (4) Diagnosed based on death certificates or autopsy reports.

Variable categorization

We extracted data on patients' race, age (< 60 years, 60–70 years, > 70 years), gender, marital status, T stage, tumor location, tumor grade, tumor size (< 7 cm, 7–15 cm, > 15 cm), and CEA levels. LNM served as the endpoint.

Statistical analysis

Data were retrieved from the SEER database using SEER*Stat software (version 8.4.2). Patients diagnosed from 2010 to 2017 were divided into training and testing sets in a 7:3 ratio, and patients diagnosed from 2018 to 2019 were used for external validation. Categorical variables were presented as numbers and percentages, and intergroup comparisons were performed using the chi-square (χ2) test or Fisher's exact test.
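The cohort partition described above can be sketched as follows. This is a minimal illustration, assuming a random shuffle within the 2010–2017 cases; the record layout (`id`, `year`), the function name `split_cohort`, and the seed are hypothetical and not taken from the study.

```python
import random

def split_cohort(records, train_fraction=0.7, seed=42):
    """Hold out 2018-2019 for external validation, then split 2010-2017 at 7:3.

    Assumption: the 7:3 split is a simple random split; the paper does not
    state how cases were assigned to the training and testing sets.
    """
    development = [r for r in records if 2010 <= r["year"] <= 2017]
    external = [r for r in records if 2018 <= r["year"] <= 2019]
    shuffled = development[:]              # copy so the input list is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:], external

# Toy records: 100 hypothetical cases per diagnosis year, 2010-2019
records = [{"id": i, "year": 2010 + i % 10} for i in range(1000)]
train, test, external = split_cohort(records)
print(len(train), len(test), len(external))  # 560 240 200
```

Holding out the most recent diagnosis years gives a temporal validation set drawn from the same registry, which is what the text refers to as external validation.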

Univariate and multivariate logistic regression analyses were sequentially conducted to identify independent risk factors for LNM and establish a predictive model. Models were developed using logistic regression, LASSO regression, ridge regression, and elastic net regression methods. The Hosmer–Lemeshow goodness-of-fit test was performed to assess the fitness of the predictive model. Model performance was quantified by calculating the area under the receiver operating characteristic curve (AUC) with its corresponding 95% confidence interval (CI), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The optimal-performing model was selected, and a nomogram was constructed.
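The performance measures listed above can be computed from predicted probabilities as in the following sketch. The toy labels and probabilities, the 0.5 cut-off, and the function names are illustrative assumptions; the study's analysis used R packages, and confidence intervals are not reproduced here.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability cut-off."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def auc(y_true, y_prob):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = LNM present, 0 = LNM absent
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
metrics = classification_metrics(*confusion_counts(y_true, y_prob))
print(metrics)
print(auc(y_true, y_prob))  # 0.9375
```

The pairwise AUC is quadratic in the number of cases, so it is suitable only for illustration; rank-based implementations are preferable for cohorts of this size.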

All statistical analyses and visualizations were performed using R software (version 4.3.1). The glmnet package was used for constructing LASSO regression, ridge regression, and elastic net regression models with ten-fold cross-validation. Other R packages, such as compareGroups, ResourceSelection, rms, and pROC, were also utilized. A p value < 0.05 was considered statistically significant.
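For reference, the three penalized models differ only in the mixing parameter of the elastic net penalty that glmnet minimizes for a binary outcome. The notation below (N patients, predictor vector x_i, outcome y_i, intercept β_0, coefficients β) is introduced here for illustration and does not appear in the original text.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
+ \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting α = 1 gives LASSO regression, α = 0 gives ridge regression, and 0 < α < 1 gives elastic net regression; the penalty strength λ is the quantity tuned by ten-fold cross-validation. The value of α used for the elastic net model is not stated in the text.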

Results

Baseline characteristics

A total of 83,430 cases diagnosed between 2010 and 2017 were included from the SEER database as the training and testing sets, and 18,545 cases diagnosed between 2018 and 2019 were used as the external validation set (Table 1). Among these 83,430 patients, 43,889 (52.6%) were male. The racial distribution consisted of 65,641 (78.7%) White individuals, 8,776 (10.5%) Black individuals, and 9,013 (10.8%) individuals of other races. No variable differed significantly between the training and testing sets (Table 1).

Table 1 Characteristics of all patients

Differences in characteristics between patients with and without LNM

Table 2 displays the characteristics of patients with and without LNM. The results indicate significant differences between patients with and without LNM in terms of age, tumor location, T stage, tumor size, tumor grade, CEA levels, and race (all P < 0.001). The occurrence rate of LNM was higher in patients younger than 60 years compared to other age groups (P < 0.001). CEA-positive patients had a higher LNM occurrence rate (P < 0.001). As tumor grade increased, the proportion of LNM also increased (P < 0.001). LNM in rectal cancer was significantly higher than in colon cancer. Additionally, the occurrence rate of LNM increased with larger tumor diameters.

Table 2 Difference analysis of patients with or without LNM in the training set

Risk factors associated with LNM in CRC patients

The results of univariate and multivariate logistic regression analyses are presented in Table 3. Multivariate logistic regression analysis revealed that tumor location, T stage, CEA level, tumor size, tumor grade, race, and age were independently associated with LNM in CRC patients. The risk of LNM in rectal cancer patients was 1.40 times that in colon cancer patients (OR 1.40; 95% CI 1.34–1.46). Compared to patients with tumor diameter < 7 cm, the risk of LNM in patients with tumor diameters of 7–15 cm and > 15 cm was 1.20 times (OR 1.20; 95% CI 1.01–1.43) and 1.36 times (OR 1.36; 95% CI 1.16–1.61) as high, respectively. Compared to patients with grade I tumors, the risk of LNM in patients with grade II, III, and IV tumors was 1.31 times (OR 1.31; 95% CI 1.21–1.41), 2.32 times (OR 2.32; 95% CI 2.13–2.52), and 2.39 times (OR 2.39; 95% CI 2.11–2.72) as high, respectively. Compared to T1 stage, the risk of LNM in patients with T2, T3, and T4 stages was 1.18 times (OR 1.18; 95% CI 1.68–2.03), 5.40 times (OR 5.40; 95% CI 4.96–5.89), and 8.89 times (OR 8.89; 95% CI 8.08–9.80) as high, respectively. White individuals had a lower risk of LNM than Black individuals and individuals of other races. The risk of LNM in CEA-positive patients was 1.28 times that in CEA-negative patients (OR 1.28; 95% CI 1.24–1.33). Compared to patients younger than 60 years, the risk of LNM in patients aged 60–70 years and > 70 years was 0.76 times (OR 0.76; 95% CI 0.73–0.80) and 0.55 times (OR 0.55; 95% CI 0.53–0.58) as high, respectively.
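The odds ratios above are exponentiated logistic regression coefficients. The sketch below shows how an OR and its 95% CI follow from a coefficient and its standard error, assuming a Wald-type interval. The coefficient and standard error are back-calculated from the reported rectal versus colon estimate for illustration only; they are not reported in the study, and the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def odds_ratio_ci(beta, se, z=Z_95):
    """Return (OR, lower, upper) for a logistic coefficient, Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def se_from_ci(lower, upper, z=Z_95):
    """Recover the coefficient's standard error from a reported Wald CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Assumption: values back-calculated from the reported OR 1.40 (1.34-1.46)
beta = math.log(1.40)
se = se_from_ci(1.34, 1.46)
odds_ratio, lower, upper = odds_ratio_ci(beta, se)
print(f"OR {odds_ratio:.2f}; 95% CI {lower:.2f}-{upper:.2f}")  # OR 1.40; 95% CI 1.34-1.46
```

Because the interval is symmetric on the log scale, a reported point estimate should sit close to the geometric mean of its confidence limits.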

Table 3 Univariate and multivariate logistic regression analyses of factors associated with LNM

Model comparison and selection

We constructed models using logistic regression, LASSO regression, ridge regression, and elastic net regression. Table 4 presents the AUC values of these models in the training and testing sets. In the testing set, the logistic regression, LASSO regression, ridge regression, and elastic net regression models achieved AUCs of 0.708 (95% CI 0.704–0.712), 0.707 (95% CI 0.703–0.711), 0.708 (95% CI 0.704–0.712), and 0.708 (95% CI 0.702–0.714), respectively. There were no significant differences in AUC among these models (P > 0.05). Although all models performed similarly, the logistic regression model was more clinically interpretable. Therefore, the logistic regression model was selected (Fig. 1).

Table 4 The area under the receiver operating characteristic curve (AUC) for different models
Fig. 1 Nomogram for predicting lymph node metastasis (LNM) in colorectal cancer (CRC) patients
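A nomogram of this kind assigns each predictor level a number of points proportional to its contribution to the logistic linear predictor. The sketch below illustrates one common scaling convention, in which the predictor with the widest coefficient range spans 0 to 100 points. It uses a subset of the odds ratios reported in the text, so the resulting values are illustrative and are not the point values of Fig. 1; the function name and the scaling choice are assumptions.

```python
import math

# Odds ratios as reported in the text; only some predictors and levels are
# included here, so the points are illustrative rather than those of Fig. 1.
odds_ratios = {
    "T stage": {"T1": 1.00, "T3": 5.40, "T4": 8.89},
    "Tumor location": {"Colon": 1.00, "Rectum": 1.40},
    "CEA": {"Negative": 1.00, "Positive": 1.28},
    "Age": {"<60": 1.00, "60-70": 0.76, ">70": 0.55},
}

def nomogram_points(odds_ratios, max_points=100.0):
    """Scale log-odds contributions so the widest predictor spans 0..max_points."""
    betas = {
        var: {level: math.log(ratio) for level, ratio in levels.items()}
        for var, levels in odds_ratios.items()
    }
    widest = max(max(b.values()) - min(b.values()) for b in betas.values())
    points = {}
    for var, b in betas.items():
        floor = min(b.values())  # lowest-risk level of each predictor gets 0
        points[var] = {
            level: max_points * (beta - floor) / widest for level, beta in b.items()
        }
    return points

points = nomogram_points(odds_ratios)
for var, levels in points.items():
    print(var, {level: round(value, 1) for level, value in levels.items()})
```

Shifting each predictor so that its lowest-risk level scores zero keeps all points non-negative, which is why age, whose odds ratios are below 1, assigns its highest score to patients younger than 60 years.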

Nomogram for predicting LNM in CRC patients

Table 5 displays the performance of the logistic regression model. In the testing set, the logistic regression model achieved an AUC of 0.708 (95% CI 0.704–0.712), accuracy of 0.637 (95% CI 0.63–0.641), sensitivity of 0.736 (95% CI 0.730–0.742), specificity of 0.569 (95% CI 0.564–0.574), PPV of 0.539 (95% CI 0.534–0.545), and NPV of 0.759 (95% CI 0.753–0.764). The Hosmer–Lemeshow goodness-of-fit test indicated good calibration of the predictive model (χ2 = 10.207, P = 0.251). Furthermore, during external validation, the model achieved an AUC of 0.709 (95% CI 0.701–0.716), indicating its good applicability to external validation data (Fig. 2, Table 5).

Table 5 The performance of the logistic regression prediction model
Fig. 2 Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) for the logistic regression prediction model. A ROC curve in the training set; B ROC curve in the testing set; C ROC curve in the external validation set

Further validation based on different subgroups

Further validation was conducted based on gender, age, and tumor grade (Table 6). In the testing set, the logistic regression predictive model exhibited good performance for male and female patients, as well as patients aged < 60 years, 60–70 years, and > 70 years, and those with grade I tumors. The AUC values for these subgroups were 0.705 (95% CI 0.696–0.714), 0.711 (95% CI 0.702–0.720), 0.685 (95% CI 0.673–0.696), 0.708 (95% CI 0.696–0.720), 0.704 (95% CI 0.694–0.714), and 0.737 (95% CI 0.714–0.760), respectively. In the external validation dataset, the predictive model also demonstrated good applicability to these subgroups, with AUC values of 0.710 (95% CI 0.700–0.720), 0.707 (95% CI 0.696–0.718), 0.696 (95% CI 0.683–0.710), 0.710 (95% CI 0.696–0.724), 0.705 (95% CI 0.694–0.717), and 0.746 (95% CI 0.720–0.771).

Table 6 The performance of the prediction model based on different populations

Model fit analysis

The calibration curves of the nomogram (Fig. 3A–C) demonstrated high consistency between the predicted and observed probabilities of LNM in the training, testing, and external validation sets. Additionally, the decision curve analysis (DCA) curves (Fig. 3D–F) indicated the good clinical utility of our model.

Fig. 3 Calibration plots and decision curve analysis (DCA). A–C Consistency between predicted probabilities and observed outcomes in the training set, the testing set, and the external validation set. D–F DCA assessing clinical utility in the training set, the testing set, and the external validation set
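Decision curve analysis summarizes clinical utility as net benefit across threshold probabilities. The following sketch shows the standard net benefit calculation for a model and for the treat-all strategy; the toy labels and probabilities and the function names are hypothetical, and this is not the code used to produce Fig. 3.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions at or above a threshold probability."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient as if LNM were present."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical toy data: 1 = LNM present, 0 = LNM absent
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
for threshold in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, threshold)
    all_nb = net_benefit_treat_all(y_true, threshold)
    print(threshold, round(model_nb, 3), round(all_nb, 3))
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all value and zero, the latter being the net benefit of treating no one.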

Discussion

Colorectal cancer (CRC) is the third leading cause of cancer-related deaths worldwide, with over 1.85 million new cases and 850,000 deaths annually [9]. Lymph node metastasis (LNM) is an important prognostic factor in CRC and influences the selection of treatment options. Studies have shown that the 5-year survival rate for CRC patients with positive lymph nodes ranges from 30 to 60%, significantly lower than that for patients without lymph node involvement (5-year survival rate: 70–80%) [10]. Current treatment options for CRC include endoscopic resection, surgical resection, radiotherapy, chemotherapy, targeted therapy, and immunotherapy. Endoscopic treatment is mainly used for early-stage CRC but is not suitable for patients with LNM. For non-early-stage CRC patients, surgical resection is the primary consideration, but if LNM is present, neoadjuvant therapy should be considered [3]. Therefore, LNM is a crucial determinant in choosing the appropriate treatment approach and serves as an important prognostic factor for CRC recurrence and distant metastasis [11,12,13]. Predicting LNM can therefore support more accurate, personalized treatment strategies for CRC patients.

Numerous models have been developed to predict LNM in CRC patients; however, these models have notable limitations [14]. In this study, we established a model based on the SEER database and conducted internal and external validations. We compared four models (logistic regression, LASSO regression, ridge regression, and elastic net regression) and identified the variables significantly associated with LNM. The results indicated that LNM was associated with tumor location, grade, patient age, tumor size, T stage, race, and CEA level.

Previous studies and experience have shown that larger tumor size, deeper infiltration into the intestinal wall, and later stage are associated with a higher probability of lymph node metastasis [15, 16]. In our study, compared to patients with tumor size < 7 cm, the risk of LNM was 1.20 times and 1.36 times as high in patients with tumor sizes of 7–15 cm and > 15 cm, respectively. The probability of lymph node metastasis also increased with higher T stages. Many previous studies have demonstrated a close correlation between tumor size and the risk of LNM [17], supporting tumor size as an independent risk factor. This may be related to the high expression of CCR7: Yan C et al. found significantly higher CCR7 expression in tumors with LNM compared to those without, and CCR7 expression showed a positive correlation with tumor size. These findings align with our results, indicating that the likelihood of lymph node metastasis increases as T stage advances and tumor size increases, especially at T4 stage or a tumor diameter > 15 cm. Therefore, careful consideration should be given to the selection of treatment options.

Studies have shown that the occurrence rate of lymph node metastasis (LNM) differs between rectal cancer and colon cancer [18, 19]. Our study indicates that the risk of LNM in rectal cancer patients is 1.4 times that in colon cancer patients. This may be attributed to different tumor biology and anatomical characteristics. Therefore, for early-stage rectal cancer, radical resection rather than local excision may be the more reasonable approach, as involvement of lymph nodes, which are more prone to metastasis in rectal cancer, may be the main cause of local recurrence after surgery. Similarly, rectal cancer patients may require adjuvant chemotherapy more often after local excision [20].

Studies in the United States have demonstrated that younger CRC patients have a higher risk of lymph node positivity than older patients under otherwise comparable conditions [21,22,23]. Consistent with these findings, our study shows that across patients aged < 60 years, 60–70 years, and > 70 years, the probability of lymph node positivity decreases with increasing age. Based on these results, caution should be exercised when considering endoscopic treatment for young patients with early-stage CRC.

CEA plays a crucial role in the biological phenomena of tumor cells, including adhesion, immune response, and apoptosis [24]. Previous research and experience have shown that CEA levels are associated with lymph node positivity and prognosis in patients with CRC [25]. Our study indicates that the risk of LNM in CEA-positive patients is 1.28 times that in CEA-negative patients (OR 1.28; 95% CI 1.24–1.33). This may be related to the mechanisms of CEA, as it enhances the metastatic potential of CRC through various pathways. In addition to being considered a pro-angiogenic molecule, CEA protects metastatic cells from death, alters the microenvironment of blood sinuses, promotes the expression of adhesion molecules, and enhances the survival of malignant tumor cells [26].

We developed a nomogram based on the SEER database to predict LNM in CRC patients and conducted internal and external validations. We also evaluated the performance of the model in different subgroups. This predictive tool for the likelihood of LNM in CRC patients can guide clinicians in selecting more appropriate treatment strategies. However, this study has some limitations. First, the data are derived solely from the SEER database and lack information on patients from other regions. Second, imaging-related data that may contribute to the prediction of LNM are not available in the SEER database. We hope that further research will address these limitations.

Conclusion

In this study, we developed a nomogram for predicting lymph node metastasis (LNM) in CRC patients. We identified tumor location, grade, age, tumor size, T stage, race, and CEA as independent predictive factors for LNM in CRC patients. This tool can predict the likelihood of LNM in CRC patients, which may aid clinicians in formulating appropriate treatment strategies.