Introduction

Colorectal cancer (CRC) ranks as the third most common cancer worldwide (10.2%) but second in terms of mortality (9.2%) when men and women are combined [1]. Notably, the incidence of early-onset CRC (EO-CRC, aged < 50 years) has increased by approximately 2% annually since the mid-1990s, in contrast to the decreasing incidence in older populations [2, 3], in many regions across the globe [4,5,6]. It was projected that, by 2030, 10.9% of colon and 22.9% of rectal cancers would be diagnosed in patients younger than 50 years [7]. It is therefore necessary to identify crucial prognostic factors for predicting the survival outcome of EO-CRC patients, which would benefit clinical decision-making.

Currently, the American Joint Committee on Cancer (AJCC) TNM staging system is widely used for prognosis prediction and for guiding medical treatment in many cancers. However, TNM staging does not account for all survival discrepancies. For example, some colon cancer patients with stage III disease had a statistically better prognosis than those with stage IIB and IIC disease under this staging system [8]. Furthermore, many other clinicopathological factors, such as primary site, tumor size, lymph node ratio (LNR), pretreatment carcinoembryonic antigen (CEA) level, circumferential resection margin (CRM), and tumor deposits, have been shown to influence the survival outcome in colorectal cancer, yet they are not sufficiently utilized by the TNM staging system. Therefore, an integrated prognostic system incorporating these crucial factors is needed in clinical practice.

Nomograms are statistical predictive models that combine independent prognostic factors to estimate the prognosis of individual patients. They have been built for various types of cancers [9,10,11,12,13] and have shown advantages over the TNM staging system [10, 11, 14, 15]. However, nomograms for EO-CRC patients remain rare.

Therefore, the present study aimed to identify clinicopathological and socio-economic prognostic factors associated with overall survival (OS) of EO-CRC patients using large multi-institutional data from the Surveillance, Epidemiology, and End Results (SEER) database, and then to establish and internally validate a nomogram for predicting the 3- and 5-year OS of EO-CRC patients.

Methods

Data source and patient selection

The SEER program of the National Cancer Institute (NCI) collects information on cancer incidence and survival from 17 population-based cancer registries and represents about 28% of the US population. In this study, a total of 8886 pathologically proven EO-CRC patients who were diagnosed from January 1, 2010, to December 31, 2017, were retrospectively extracted from the SEER database using the SEER*Stat program (v 8.3.6). Patients with EO-CRC were identified by the ICD-O-3 site code (C18.0, C18.2, C18.3, C18.4, C18.5, C18.6, C18.7, C19.9, C20.9) and the cancer staging scheme (version 0204). The inclusion criteria were as follows: (1) patients were 15–50 years old, (2) CRC was the only primary cancer, (3) survival information was complete, and (4) follow-up was > 1 month. Patients who had missing or incomplete clinicopathological and socio-economic information (primary site, histological type, grade, tumor size, regional nodes examined, metastatic situation, tumor stage, CEA level, perineural invasion, median household income) were excluded from this study. The detailed patient selection workflow is shown in Fig. 1. Eligible patients were randomly divided into a training cohort and a testing cohort (ratio, 70:30). The training cohort was used to explore the prognostic factors and to construct a nomogram, and the testing cohort was used for internal validation of the nomogram. This study was conducted under the SEER data use agreement, and patient informed consent was not required given the anonymized, de-identified data in the SEER database.
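The 70:30 random division can be illustrated with a minimal sketch. The function name, the seed, and the use of integer patient IDs are assumptions made for illustration only; the study does not describe how the split was implemented. The cohort size of 5585 is the number of eligible patients reported in the Results.

```python
import random

def split_cohort(patient_ids, train_tenths=7, seed=2020):
    """Randomly split patient IDs into training and testing cohorts.

    train_tenths=7 gives a 70:30 split. The seed is an arbitrary,
    hypothetical value used only to make the split reproducible.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Integer arithmetic (round half up) avoids floating-point rounding issues.
    n_train = (len(ids) * train_tenths + 5) // 10
    return ids[:n_train], ids[n_train:]

# Hypothetical IDs 0..5584 stand in for the 5585 eligible patients.
train, test = split_cohort(range(5585))
print(len(train), len(test))  # 3910 1675
```

Fixing the seed keeps the two cohorts identical across reruns, which matters when the testing cohort is reserved for internal validation.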

Fig. 1 The workflow of the patient selection process

Variables and outcome

Eighteen factors, including sex, race, primary site, histology, grade, tumor size, number of examined regional nodes, LNR, liver metastasis, lung metastasis, bone metastasis, brain metastasis, TNM stage, T stage, N stage, CEA, perineural invasion, and median household income, were retrieved to predict prognosis of the training cohort. The primary site was defined as right-side (cecum, ascending colon, hepatic flexure of colon, transverse colon), left-side (splenic flexure of colon, descending colon, sigmoid colon, rectosigmoid), and rectum. The LNR was calculated by dividing the metastatic node number by the examined regional node number. Overall survival (OS), the primary endpoint, was defined as the interval from diagnosis until death or last follow-up.
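As a brief illustration of the two derived variables defined above, the following sketch computes the LNR and groups the primary site. The function names and the handling of patients with no examined nodes are assumptions for illustration; the grouping itself follows the definition in the text.

```python
# Primary-site grouping as defined in the text.
SITE_GROUPS = {
    "cecum": "right-side",
    "ascending colon": "right-side",
    "hepatic flexure of colon": "right-side",
    "transverse colon": "right-side",
    "splenic flexure of colon": "left-side",
    "descending colon": "left-side",
    "sigmoid colon": "left-side",
    "rectosigmoid": "left-side",
    "rectum": "rectum",
}

def site_group(subsite):
    """Map an anatomical subsite to right-side, left-side, or rectum."""
    return SITE_GROUPS[subsite.lower()]

def lymph_node_ratio(positive_nodes, examined_nodes):
    """LNR = metastatic nodes / examined regional nodes.

    Returning None when no nodes were examined is an assumption of this
    sketch; the study does not state how such cases were handled.
    """
    if examined_nodes <= 0:
        return None
    if not 0 <= positive_nodes <= examined_nodes:
        raise ValueError("positive nodes must lie between 0 and examined nodes")
    return positive_nodes / examined_nodes

print(lymph_node_ratio(3, 12))      # 0.25
print(site_group("Sigmoid colon"))  # left-side
```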

Statistical analysis

Categorical variables were reported as counts and proportions. Overall survival curves of the study cohort were estimated using the Kaplan–Meier method, and differences between survival curves were examined using the log-rank test. The associations of clinicopathological and socio-economic variables with survival were evaluated using Cox proportional hazards regression models. Hazard ratios (HRs) were reported with 95% confidence intervals (CIs). Significant variables in the univariate analysis were entered into a multivariate Cox regression analysis with backward stepwise selection under the Akaike information criterion (AIC). Variables that were statistically significant in the multivariate Cox regression analysis were regarded as independent prognostic factors. These independent prognostic factors were then used to establish a nomogram for predicting the 3- and 5-year OS of patients with EO-CRC. To assign points in the nomogram, the regression coefficients were used to define the linear predictor.
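How regression coefficients translate into nomogram points can be sketched as follows. This assumes a common convention in which the variable with the widest range of coefficients spans 0 to 100 points and all other levels are scaled proportionally; the study does not state the exact scaling used. All coefficients below are invented for illustration and are not the fitted values of this study.

```python
def nomogram_points(log_hazard_ratios):
    """Convert Cox coefficients into nomogram points.

    log_hazard_ratios maps variable -> {level: beta}, where the reference
    level of each variable has beta 0.0. The variable with the widest
    beta range is scaled to 0-100 points (assumed convention).
    """
    widest = max(max(b.values()) - min(b.values())
                 for b in log_hazard_ratios.values())
    points = {}
    for variable, betas in log_hazard_ratios.items():
        lowest = min(betas.values())  # the lowest-risk level gets 0 points
        points[variable] = {level: round(100 * (beta - lowest) / widest)
                            for level, beta in betas.items()}
    return points

# Hypothetical coefficients, NOT taken from the study.
betas = {
    "LNR": {"0-0.2": 0.0, ">0.6": 1.2},
    "CEA": {"negative": 0.0, "positive": 0.6},
    "primary site": {"right-side": 0.0, "left-side": -0.3},
}
pts = nomogram_points(betas)
print(pts["LNR"][">0.6"], pts["CEA"]["positive"], pts["primary site"]["right-side"])
# 100 50 25
```

Shifting each variable so that its lowest-risk level scores 0 keeps all points non-negative, which is why a protective level (negative coefficient) gives points to the reference level instead.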

The performance of the nomogram was evaluated by its discriminatory ability and calibration [16]. Discriminatory ability refers to how well the model differentiates patients who will have an event from those who will not. The concordance index (C-index) and the receiver operating characteristic (ROC) curve were applied to evaluate the discriminatory ability of our nomogram. A C-index or area under the ROC curve (AUC) of 0.5 indicates that the nomogram has no discriminatory ability, while a C-index or AUC of 1.0 indicates perfect separation of patients with different outcomes. A C-index or AUC greater than 0.75 reflects useful discrimination [16]. Calibration refers to the consistency between the nomogram-predicted survival and the actual observed survival. Calibration plots were used to evaluate the calibration of our nomogram. In a calibration plot, the actual OS is plotted on the y-axis, and the nomogram-predicted OS is plotted on the x-axis. A perfect prediction would fall on a 45-degree diagonal line. All statistical analyses were performed using SPSS version 25 and R software version 3.3.0 (Vienna, Austria; www.r-project.org). A two-tailed P value of < 0.05 was considered statistically significant. This study has been reported in line with the TRIPOD statement [17].
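The C-index can be made concrete with a minimal sketch of Harrell's concordance index for right-censored data. This is an illustrative implementation, not the code used in the study; it ignores pairs with tied survival times, and the toy data are invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i died (events[i] == 1)
    before patient j's observed time. It is concordant when the patient
    who died earlier has the higher predicted risk; tied risk scores
    count as half-concordant. Pairs with tied times are ignored.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data (hypothetical): follow-up in months, 1 = died, 0 = censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
print(c_index)  # 0.8
```

In this toy example, four of the five comparable pairs are ordered correctly by the risk score, giving a C-index of 0.8.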

Results

Clinicopathological and socio-economic characteristics and survival outcomes of EO-CRC patients

Data on a total of 5585 eligible patients with early-onset colorectal cancer diagnosed from 2010 to 2017 were retrospectively collected from the SEER database. Patients were randomly divided into a training cohort (3910 patients) and a testing cohort (1675 patients). Clinicopathological and socio-economic characteristics of early-onset colorectal cancer patients are listed in Table 1.

Table 1 Clinicopathological and socio-economic characteristics of early-onset colorectal cancer patients from 2010 to 2017

Most patients were male (53.64%) and White (71.44%), with a median household income of 50,000–75,000 dollars (50.67%). The majority of patients had adenocarcinoma histology (91.14%), moderately differentiated tumors (75.34%), ≥ 12 regional nodes examined (84.44%), an LNR between 0 and 0.2 (79.52%), and no perineural invasion (76.22%). The left-side colon (45.8%) was the most common primary tumor site, followed by the right-side colon (27.72%) and rectum (26.48%). Tumors were < 5 cm in 51.07% of the patients and ≥ 5 cm in 48.93%. Liver metastasis, lung metastasis, bone metastasis, and brain metastasis were observed in 13.68%, 3.33%, 0.5%, and 0.14% of the patients, respectively. Patients with TNM stage I, II, III, and IV tumors accounted for 14.31%, 22.97%, 43.13%, and 19.59% of all cases, respectively. In total, 28.68% of the patients had a positive pretreatment CEA, with the remaining patients having negative CEA (39.7%) or unknown CEA (31.62%).

At a median follow-up of 42.0 months (range, 1.0 to 95.0 months), 19.7% (772 of 3910) of the patients in the training cohort and 20.6% (346 of 1675) of the patients in the testing cohort had died. The 3-year and 5-year overall survival rates were 80.7% (95% CI, 79.3–82.1%) and 72.5% (95% CI, 70.7–74.3%), respectively.
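Survival rates at fixed time points such as 3 and 5 years are read off a Kaplan–Meier curve, the method named in the statistical analysis. The following minimal sketch shows the product-limit calculation on invented toy data; it is not the study's code and does not reproduce the reported estimates or their confidence intervals.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients leave the risk set
    return curve

def survival_at(curve, t):
    """Estimated survival probability at time t (step function)."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s

# Toy data (hypothetical): follow-up in months, 1 = died, 0 = censored.
times = [6, 12, 20, 36, 36, 48, 60, 72]
events = [1, 0, 1, 1, 0, 0, 1, 0]
curve = kaplan_meier(times, events)
print(round(survival_at(curve, 36), 3))  # 0.583 (3-year OS)
print(round(survival_at(curve, 60), 3))  # 0.292 (5-year OS)
```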

Independent prognostic factors of early-onset colorectal cancer patients

Univariate Cox regression analysis indicated that race, primary site, histology, grade, tumor size, regional nodes examined, LNR, liver metastasis, lung metastasis, bone metastasis, brain metastasis, TNM stage, T stage, N stage, CEA, perineural invasion, and median household income were significantly associated with OS in the training cohort (Table 2).

Table 2 Univariate Cox regression analysis of overall survival in the training cohort

After controlling for confounding factors, the multivariate Cox regression analysis demonstrated that race, primary site, histology, grade, tumor size, regional nodes examined, LNR, liver metastasis, lung metastasis, bone metastasis, TNM stage, T stage, CEA, perineural invasion, and median household income were independent prognostic factors for EO-CRC patients (Fig. 2).

Fig. 2 Multivariate Cox regression analysis of overall survival in the training cohort. LNR, lymph node ratio; CEA, carcinoembryonic antigen

Construction of the prognostic nomogram

A prognostic nomogram to predict 3- and 5-year OS was established from the independent prognostic factors identified in the multivariate Cox regression analysis (Fig. 3). The score for each variable is obtained by projecting the patient’s value upward onto the top “points” axis. The total points are then obtained by adding the scores of all variables. By projecting the total points onto the bottom “3-year overall survival” and “5-year overall survival” axes, the 3- and 5-year OS can be estimated.

Fig. 3 Nomogram for predicting 3- and 5-year OS of early-onset colorectal cancer patients. LNR, lymph node ratio; CEA, carcinoembryonic antigen

For instance, a 45-year-old White patient (3 points) with right-sided colon (16 points) adenocarcinoma (0 points), T4 (35 points), without lung, liver, or bone metastasis (0, 0, and 0 points), TNM stage III (48 points), poorly differentiated (25 points), tumor size > 5 cm (8 points), 12 regional lymph nodes examined (0 points), LNR > 0.6 (50 points), CEA positive (24 points), without perineural invasion (0 points), and a median household income of 70,000 dollars (10 points) would have a total of 219 points, which corresponds to a predicted 3-year OS of 40.0% and a predicted 5-year OS of 20.0%.
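The arithmetic of this worked example can be checked with a short sketch. The point values are those quoted in the example; the dictionary keys are descriptive labels chosen for illustration, and the final mapping from total points to survival probability still has to be read from the nomogram axes in Fig. 3.

```python
# Points quoted in the worked example for each variable.
example_points = {
    "race: White": 3,
    "primary site: right-sided colon": 16,
    "histology: adenocarcinoma": 0,
    "T stage: T4": 35,
    "lung metastasis: no": 0,
    "liver metastasis: no": 0,
    "bone metastasis: no": 0,
    "TNM stage: III": 48,
    "grade: poorly differentiated": 25,
    "tumor size: > 5 cm": 8,
    "regional nodes examined: 12": 0,
    "LNR: > 0.6": 50,
    "CEA: positive": 24,
    "perineural invasion: no": 0,
    "median household income: 70,000 dollars": 10,
}

total_points = sum(example_points.values())
print(total_points)  # 219
```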

Validation of the prognostic nomogram

To evaluate the discriminatory ability of the constructed nomogram, the C-index and AUC were calculated. The C-index of the nomogram was 0.840 (95% CI 0.827–0.854) and 0.837 (95% CI 0.816–0.857) in the training and testing cohorts, respectively. Moreover, the 3- and 5-year AUC values of the nomogram were 0.868 and 0.84869, respectively, in the training cohort, and 0.868 and 0.86049 in the testing cohort (Fig. 4). Thus, both the C-index and the 3- and 5-year AUC values of the nomogram exceeded 0.75, which suggested that the constructed nomogram has good discriminatory ability for OS prediction.

Fig. 4 ROC curves and AUC values for training and testing cohort. a 3-year OS in the training cohort. b 5-year OS in the training cohort. c 3-year OS in the testing cohort. d 5-year OS in the testing cohort. AUC, area under the ROC curve

The calibration of our nomogram was assessed by calibration plots. Actual OS was plotted on the y-axis, and nomogram-predicted OS was plotted on the x-axis. The calibration plots of the established nomogram deviated only slightly from the 45-degree diagonal reference line in both the training cohort and the testing cohort (Fig. 5), which indicated good agreement between the actual observed survival and the nomogram-predicted survival.

Fig. 5 Calibration plots of the nomogram for predicting 3- and 5-year OS in the training cohort (a, b) and testing cohort (c, d) respectively. The actual OS is plotted on the y-axis; the nomogram-predicted OS is plotted on the x-axis. OS, overall survival
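The idea behind a calibration plot can be sketched in a simplified form. The sketch below groups patients by predicted survival and pairs the mean prediction (x) with the observed survival fraction (y) in each group. It assumes, for simplicity, that every patient was followed to the time horizon so that survival status is known; with censored data, as in this study, the observed value in each group would instead come from a Kaplan–Meier estimate. The function name and data are hypothetical.

```python
def calibration_points(predicted, survived, n_groups=4):
    """Return (mean predicted survival, observed survival) per risk group.

    predicted: predicted survival probabilities at a fixed horizon.
    survived:  1 if the patient was alive at that horizon, else 0
               (assumes no censoring before the horizon).
    """
    n = len(predicted)
    order = sorted(range(n), key=lambda k: predicted[k])
    points = []
    for g in range(n_groups):
        group = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        mean_predicted = sum(predicted[k] for k in group) / len(group)
        observed = sum(survived[k] for k in group) / len(group)
        points.append((mean_predicted, observed))
    return points

# Toy data (hypothetical), split into two groups.
predicted = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
survived = [0, 0, 0, 1, 1, 0, 1, 1]
cal = calibration_points(predicted, survived, n_groups=2)
for x, y in cal:
    print(round(x, 2), round(y, 2))
# 0.35 0.25
# 0.75 0.75
```

Points lying on the 45-degree line, such as the second group here, indicate that predicted and observed survival agree; the first group falls below the line, meaning survival was overestimated.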

Comparison of nomogram with TNM stages

We also compared the predictive ability of the nomogram with that of the TNM staging system. The C-index of the TNM staging system (0.804, 95% CI 0.788–0.820) was lower than that of the constructed nomogram (0.840, 95% CI 0.827–0.850; P < 0.001). More importantly, the constructed nomogram yielded a larger log-likelihood and a smaller AIC value than the TNM stage (Table 3). Together, these results indicate that the nomogram has stronger predictive power than the TNM staging system. The same result was observed in the testing cohort.

Table 3 Comparison of nomogram with the TNM staging system
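The AIC comparison rests on the standard definition AIC = 2k − 2 log L, where k is the number of estimated parameters and log L is the (partial) log-likelihood of the fitted model. A minimal sketch follows; the log-likelihoods and parameter counts are invented toy values and are not the values reported in Table 3.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2*logL (smaller is better)."""
    return 2 * n_parameters - 2 * log_likelihood

# Hypothetical values for illustration only.
complex_model_aic = aic(log_likelihood=-100.0, n_parameters=5)
simple_model_aic = aic(log_likelihood=-110.0, n_parameters=1)
print(complex_model_aic, simple_model_aic)  # 210.0 222.0
```

Because the AIC penalizes each additional parameter, a model with more variables achieves a smaller AIC only if its gain in log-likelihood outweighs the penalty, as in this toy comparison.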

Discussion

In contrast to the decreasing incidence in older populations, the incidence of EO-CRC has increased since the mid-1990s. Accurate survival prediction for EO-CRC patients is important for informing patients of their prognosis and for making individualized clinical decisions. Many prognostic factors affecting long-term survival have not been sufficiently utilized, and the optimal method for predicting the survival outcome of EO-CRC patients remains unclear. Based on large-population, multi-institutional data from the SEER database, the present study used independent clinicopathological and socio-economic factors to establish and internally validate a nomogram for predicting the 3- and 5-year OS of individual EO-CRC patients.

This study is valuable because a nomogram represents complex mathematical formulas as an intuitive graphic and allows clinical outcomes to be estimated quickly without complicated calculations, facilitating individual prognosis prediction and clinical decision-making regarding treatment and surveillance [18]. In addition, the data of this study were extracted from the openly accessible SEER database, which provides a sufficient sample size.

Another strength of this study is that it involved a broad range of clinicopathological and socio-economic variables that were associated with the prognosis of EO-CRC in previous reports. Survival outcome differs among colorectal cancer patients according to primary tumor location. Several previous studies, including meta-analyses, demonstrated that left-sided disease was associated with significantly better overall survival than right-sided disease [19,20,21], which is consistent with the present study (HR = 0.72, 95% CI 0.61–0.86, P < 0.01). Moreover, our results showed that rectal cancer was associated with better OS than right-side colon cancer (HR = 0.76, 95% CI 0.62–0.95, P = 0.015), which is in accordance with a previous study [22]. In our multivariate analysis, tumor size was also an independent risk factor for OS (HR 1.17, 95% CI 1.01–1.37, P = 0.038), in agreement with previous reports [23,24,25,26]. Previous research has revealed that a high lymph node ratio (LNR) is significantly correlated with inferior overall and disease-free survival in stage III [27,28,29] and stage IV [30,31,32] colorectal cancer patients, which is in line with this study.

For cancer patients, socio-economic status (SES) has been reported to be a significant predictor of prognosis [33], but it was not considered in most previous nomograms [9,10,11,12,13,14]. Previous studies showed that patients with low SES had a worse prognosis than those with high SES [34,35,36]. Similarly, our study identified a significant association between survival outcome and median household income, an indicator of SES. People with low SES participate less frequently in cancer screening programs, so their CRC is more often diagnosed at an advanced stage than at an early stage [37]. Moreover, poorer access to health services and high-quality treatments worsens their outcome [37].

Furthermore, the performance of the constructed nomogram was comprehensively evaluated. First, the constructed nomogram showed good discriminatory ability, with a high C-index of 0.840 and 3- and 5-year AUC values of 0.868 and 0.84869, respectively. In addition, the calibration plots for 3- and 5-year OS probabilities showed barely any deviation from the reference line (Fig. 5), which means that the nomogram-predicted survival was close to the actual observed survival. The same results were confirmed in the testing cohort, which further supports the strong predictive ability of our nomogram. Most importantly, compared with the TNM staging system, the nomogram displayed better predictive ability with a higher C-index (0.840 vs 0.804, P < 0.001), larger log-likelihood, and smaller AIC value. Collectively, these results suggest that the established nomogram might be utilized as a more powerful and convenient tool to predict survival outcomes for patients with EO-CRC.

Our nomogram has several strengths compared with previous related nomograms. On the one hand, unlike previous nomograms that included only patients with colon cancer [38,39,40], our nomogram covered patients with colon cancer and those with rectal cancer, regardless of stage. On the other hand, our nomogram involved some distinct variables, such as SES and LNR, which have been reported to be important predictors of prognosis. To our knowledge, our research is the only study including the socio-economic status of patients in the nomogram.

Of note, the present study has some limitations. First, several vital prognostic factors, such as KRAS, BRAF, microsatellite instability (MSI), tumor regression grade, and circumferential resection margin (CRM), were inaccessible in the SEER database and thus were not incorporated in the proposed nomogram. Second, the nomogram lacked treatment information such as surgical procedures and chemotherapy regimens, which greatly affect survival outcomes. Third, some factors, such as LNR and perineural invasion, are available only after surgery, which may restrict the preoperative application of the nomogram, although the nomogram shows a solid ability to predict postoperative patients’ overall survival. In addition, selection bias could not be ruled out because of the retrospective nature of the study. Furthermore, the constructed nomogram includes a relatively large number of variables and therefore requires highly complete patient information, which may limit its practicability. Last, this study did not involve any external validation based on other populations. Therefore, it is unclear whether the nomogram can be directly applied to other populations, and its generalizability needs further verification and prospective evaluation.

Conclusion

A novel nomogram for EO-CRC patients based on independent clinicopathological and socio-economic variables was developed and internally validated, and it showed better predictive performance than the TNM staging system. In addition, the nomogram could facilitate postoperative individual prognosis prediction and clinical decision-making.