Prognostic nomogram to predict the overall survival of patients with early-onset colorectal cancer: a population-based analysis

Purpose: The present study aimed to identify independent clinicopathological and socio-economic prognostic factors associated with overall survival of early-onset colorectal cancer (EO-CRC) patients and then establish and validate a prognostic nomogram for patients with EO-CRC.
Methods: Eligible patients with EO-CRC diagnosed from 2010 to 2017 were extracted from the Surveillance, Epidemiology, and End Results (SEER) database. Patients were randomly divided into a training cohort and a testing cohort. Independent prognostic factors were obtained using univariate and multivariate Cox analyses and were used to establish a nomogram for predicting 3- and 5-year overall survival (OS). The discriminative ability and calibration of the nomogram were assessed using C-index values, AUC values, and calibration plots.
Results: In total, 5585 patients with EO-CRC were involved in the study. Based on the univariate and multivariate analyses, 15 independent prognostic factors were assembled into the nomogram to predict 3- and 5-year OS. The nomogram showed favorable discriminatory ability as indicated by the C-index (0.840, 95% CI 0.827–0.850), and the 3- and 5-year AUC values (0.868 and 0.84869 respectively). Calibration plots indicated optimal agreement between the nomogram-predicted survival and the actual observed survival. The results remained reproducible in the testing cohort. The C-index of the nomogram was higher than that of the TNM staging system (0.840 vs 0.804, P < 0.001).
Conclusion: A novel prognostic nomogram for EO-CRC patients based on independent clinicopathological and socio-economic factors was developed, which was superior to the TNM staging system. The nomogram could facilitate postoperative individual prognosis prediction and clinical decision-making.


Introduction
Colorectal cancer (CRC) ranks as the third most common cancer worldwide (10.2%) but second in terms of mortality (9.2%) when men and women are combined [1]. Notably, the incidence of early-onset CRC (EO-CRC, diagnosed in patients aged < 50 years) has increased by approximately 2% annually since the mid-1990s, in contrast to the decreasing incidence in older populations [2,3], in many regions across the globe [4][5][6]. It was projected that, by 2030, 10.9% of colon and 22.9% of rectal cancers would be diagnosed in patients younger than 50 years [7]. It is therefore necessary to identify crucial prognostic factors for predicting the survival outcome of EO-CRC patients, which would benefit clinical decision-making.
Currently, the American Joint Committee on Cancer (AJCC) TNM staging system is widely used for prognosis prediction and treatment planning in many cancers. However, TNM staging does not account for all survival discrepancies. For example, some colon cancer patients with stage III disease had a statistically better prognosis than those with stage IIB and IIC disease according to this staging system [8]. Furthermore, many other clinicopathological factors, such as primary site, tumor size, lymph node ratio (LNR), pretreatment carcinoembryonic antigen (CEA) level, circumferential resection margin (CRM), and tumor deposits, have been shown to influence survival outcomes in colorectal cancer but are not sufficiently utilized by the TNM staging system. Therefore, in clinical practice, an integrated prognostic system incorporating these crucial factors is needed.
Nomograms are statistical predictive models that incorporate independent prognostic factors to estimate prognosis for individual patients. They have been built for various types of cancers [9][10][11][12][13] and have shown advantages over the TNM staging system [10,11,14,15]. However, nomograms for EO-CRC patients remain rare.
Therefore, the present study aimed to identify clinicopathological and socio-economic prognostic factors associated with overall survival of EO-CRC patients using large multi-institutional data from the Surveillance, Epidemiology, and End Results (SEER) database, and then to establish and internally validate a nomogram for predicting the 3- and 5-year OS of EO-CRC patients.

Data source and patient selection
The SEER program of the National Cancer Institute (NCI) collects information on cancer incidence and survival from 17 population-based cancer registries and represents about 28% of the US population. In this study, a total of 8886 pathologically proven EO-CRC patients who were diagnosed from January 1, 2010, to December 31, 2017, were retrospectively extracted from the SEER database using the SEER*Stat program (v 8.3.6). Patients with EO-CRC were identified by the ICD-O-3 site code (C18.0, C18.2, C18.3, C18.4, C18.5, C18.6, C18.7, C19.9, C20.9) and the cancer staging scheme (version 0204). The inclusion criteria were as follows: (1) patients were 15-50 years old, (2) CRC was the only primary cancer, (3) survival information was complete, and (4) follow-up was > 1 month. Patients who had missing or incomplete clinicopathological and socio-economic information (primary site, histological type, grade, tumor size, regional nodes examined, metastatic situation, tumor stage, CEA level, perineural invasion, median household income) were excluded from this study. The detailed patient selection workflow is shown in Fig. 1. Eligible patients were randomly divided into a training cohort and a testing cohort (ratio, 70:30). The training cohort was used to explore the prognostic factors and to construct the nomogram, and the testing cohort was used for internal validation of the nomogram. This study was conducted under the SEER data use agreement, and patient informed consent was not required given the anonymized, de-identified data in the SEER database.
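The random 70:30 division described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `split_cohort`, the seed value, and the use of integer patient identifiers are all assumptions made for illustration. Rounding the training size up is chosen only because it reproduces the cohort sizes reported in the Results (3910 and 1675 of 5585).

```python
import random

def split_cohort(patient_ids, train_percent=70, seed=2017):
    """Randomly split patient identifiers into training and testing cohorts.

    The seed is a hypothetical value; it only makes the split reproducible.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Integer ceiling of n * train_percent / 100 (avoids floating-point rounding).
    n_train = (len(ids) * train_percent + 99) // 100
    return ids[:n_train], ids[n_train:]

# 5585 eligible patients, represented here by placeholder integer identifiers.
train_ids, test_ids = split_cohort(range(5585))
print(len(train_ids), len(test_ids))
```

Every patient lands in exactly one cohort, so the testing cohort plays no part in factor selection or model fitting.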

Variables and outcome
Eighteen factors, including sex, race, primary site, histology, grade, tumor size, number of examined regional nodes, LNR, liver metastasis, lung metastasis, bone metastasis, brain metastasis, TNM stage, T stage, N stage, CEA, perineural invasion, and median household income, were retrieved as candidate predictors of prognosis in the training cohort. The primary site was defined as right-side (cecum, ascending colon, hepatic flexure of colon, transverse colon), left-side (splenic flexure of colon, descending colon, sigmoid colon, rectosigmoid), and rectum. The LNR was calculated by dividing the number of metastatic nodes by the number of examined regional nodes. Overall survival (OS), the primary endpoint, was defined as the interval from diagnosis until death or last follow-up.
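The two derived variables defined above can be written out as a short Python sketch. The function names are hypothetical, and the treatment of patients with no examined nodes (raising an error) is an assumption, since the text does not say how such cases were handled.

```python
RIGHT_SIDE = {"cecum", "ascending colon", "hepatic flexure of colon", "transverse colon"}
LEFT_SIDE = {"splenic flexure of colon", "descending colon", "sigmoid colon", "rectosigmoid"}

def primary_site_group(subsite):
    """Map an anatomical subsite to the three-level primary site variable."""
    name = subsite.strip().lower()
    if name in RIGHT_SIDE:
        return "right-side"
    if name in LEFT_SIDE:
        return "left-side"
    if name == "rectum":
        return "rectum"
    raise ValueError(f"unrecognized subsite: {subsite!r}")

def lymph_node_ratio(metastatic_nodes, examined_nodes):
    """LNR = metastatic nodes / examined regional nodes."""
    # Assumption: the ratio is undefined when no nodes were examined.
    if examined_nodes <= 0:
        raise ValueError("LNR is undefined when no regional nodes were examined")
    if not 0 <= metastatic_nodes <= examined_nodes:
        raise ValueError("metastatic nodes must lie between 0 and examined nodes")
    return metastatic_nodes / examined_nodes

print(primary_site_group("Sigmoid colon"), lymph_node_ratio(3, 12))
```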

Statistical analysis
Categorical variables were reported as whole numbers and proportions. Overall survival curves for the study cohort were estimated using the Kaplan-Meier method, and differences between curves were examined using the log-rank test. The associations of clinicopathological and socio-economic variables with survival were evaluated using Cox proportional hazards regression models. Hazard ratios (HRs) were reported with 95% CIs. Significant variables in the univariate analysis were subjected to multivariate Cox regression analysis with backward stepwise selection under the Akaike information criterion (AIC). Variables that were statistically significant in the multivariate Cox regression analysis were considered independent prognostic factors for predicting the survival outcome. These independent prognostic factors were then used to establish a nomogram for predicting the 3- and 5-year OS of patients with EO-CRC. To allot points in the nomogram, the regression coefficients were used to define the linear predictor.
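To make the point allocation concrete, the sketch below shows one common convention for turning Cox coefficients into nomogram points: the variable whose levels span the widest range of the linear predictor is given 100 points, and every other level is scaled proportionally. This is an assumption about the scaling rule, not a description of the authors' software. The primary site coefficients are the logarithms of the HRs quoted in the Discussion (left-side 0.72, rectum 0.76), with right-side read as the reference level; the second variable and its HR of 3.0 are invented purely for illustration.

```python
import math

def nomogram_points(coefficients):
    """Rescale Cox coefficients to a 0-100 points axis.

    coefficients maps variable -> {level: log hazard ratio vs. the reference}.
    """
    spans = {
        variable: max(levels.values()) - min(levels.values())
        for variable, levels in coefficients.items()
    }
    widest = max(spans.values())
    if widest == 0:
        raise ValueError("all coefficients are equal; nothing to scale")
    points = {}
    for variable, levels in coefficients.items():
        lowest = min(levels.values())  # lowest-risk level scores 0 points
        points[variable] = {
            level: 100.0 * (beta - lowest) / widest
            for level, beta in levels.items()
        }
    return points

example_coefficients = {
    # HRs taken from the Discussion; right-side assumed to be the reference.
    "primary site": {"right-side": 0.0, "left-side": math.log(0.72), "rectum": math.log(0.76)},
    # Invented variable and HR, for illustration only.
    "hypothetical factor": {"absent": 0.0, "present": math.log(3.0)},
}
example_points = nomogram_points(example_coefficients)
for variable, levels in example_points.items():
    print(variable, {level: round(p, 1) for level, p in levels.items()})
```

Because points are a linear rescaling of the coefficients, summing a patient's points is equivalent to computing the Cox linear predictor up to a constant shift and scale.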
The performance of the nomogram was evaluated in terms of discriminatory ability and calibration [16]. Discriminatory ability refers to how well the model differentiates patients who will have an event from those who will not. The concordance index (C-index) and the receiver operating characteristic (ROC) curve were applied to evaluate the discriminatory ability of our nomogram. A C-index or area under the ROC curve (AUC) of 0.5 indicates that the nomogram has no discriminatory ability, while a C-index or AUC of 1.0 indicates perfect separation of patients with different outcomes. A C-index or AUC greater than 0.75 reflects useful discrimination [16]. Calibration refers to the consistency between the nomogram-predicted survival and the actual observed survival. Calibration plots were used to evaluate the calibration of our nomogram. In a calibration plot, the actual OS is plotted on the y-axis, and the nomogram-predicted OS is plotted on the x-axis. A perfect prediction would fall on the 45-degree diagonal line. All statistical analyses were performed using SPSS version 25 and R software version 3.3.0 (Vienna, Austria; www.r-project.org). A two-tailed P value of < 0.05 was considered statistically significant. This study has been reported in line with the TRIPOD statement [17].
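The C-index described above can be illustrated with a small pure-Python version of Harrell's concordance index for right-censored data. This is a sketch for intuition only: the analyses in this study were run in SPSS and R, whose implementations may treat ties differently, and the function name and toy data are assumptions.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored survival data.

    times: follow-up times; events: 1 if death was observed, 0 if censored;
    risk_scores: higher score means higher predicted risk (shorter survival).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: four patients, the third one censored at 15 months.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
print(harrell_c_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1]))  # perfectly ordered risks
print(harrell_c_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5]))  # uninformative risks
```

In the toy data, risks ordered exactly opposite to survival time give a value of 1.0, and identical risks for all patients give 0.5, matching the interpretation of the two reference values in the text.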

Clinicopathological and socio-economic characteristics and survival outcomes of EO-CRC patients
Data on a total of 5585 eligible patients with early-onset colorectal cancer diagnosed from 2010 to 2017 were retrospectively collected from the SEER database. Patients were randomly divided into a training cohort (3910 patients) and a testing cohort (1675 patients). Clinicopathological and socio-economic characteristics of early-onset colorectal cancer patients are listed in Table 1.

Independent prognostic factors of early-onset colorectal cancer patients
Univariate Cox regression analysis indicated that race, primary site, histology, grade, tumor size, regional nodes examined, LNR, liver metastasis, lung metastasis, bone metastasis, brain metastasis, TNM stage, T stage, N stage, CEA, perineural invasion, and median household income were significantly associated with OS in the training cohort (Table 2). After controlling for confounding factors, the multivariate Cox regression analysis demonstrated that race, primary site, histology, grade, tumor size, regional nodes examined, LNR, liver metastasis, lung metastasis, bone metastasis, TNM stage, T stage, CEA, perineural invasion, and median household income were independent prognostic factors for EO-CRC patients, as shown in Fig. 2.

Construction of the prognostic nomogram
A prognostic nomogram to predict 3- and 5-year OS was established, which contained the independent prognostic factors identified from the multivariable Cox regression analysis (Fig. 3). The score for each variable is obtained by projecting the patient's value for that variable onto the top "points" axis. The total points are then obtained by adding the scores of all variables. By projecting the total points onto the bottom "3-year overall survival" and "5-year overall survival" axes, the 3- and 5-year OS can be estimated.
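The bottom survival axes of a Cox-based nomogram rest on the proportional hazards relation S(t | x) = S0(t)^exp(lp), where lp is the patient's linear predictor (of which the total points are a rescaled version) and S0(t) is the baseline survival at time t. A minimal Python sketch follows. The baseline survival values and the linear predictor used here are invented for illustration and are not taken from the fitted model.

```python
import math

def predicted_survival(baseline_survival, linear_predictor):
    """Cox model survival probability: S0(t) raised to exp(linear predictor)."""
    if not 0.0 < baseline_survival <= 1.0:
        raise ValueError("baseline survival must be in (0, 1]")
    return baseline_survival ** math.exp(linear_predictor)

# Hypothetical baseline survival at 3 and 5 years (not from the paper).
hypothetical_baseline = {"3-year": 0.95, "5-year": 0.90}
# Hypothetical linear predictor for one patient, relative to the reference patient.
patient_lp = 1.2

for horizon, s0 in hypothetical_baseline.items():
    print(horizon, round(predicted_survival(s0, patient_lp), 3))
```

A linear predictor of 0 returns the baseline survival unchanged, and a larger linear predictor (more total points) always yields a lower predicted survival, which is why the survival axes run in the opposite direction to the total points axis.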

Validation of the prognostic nomogram
To evaluate the discriminatory ability of the constructed nomogram, the C-index and AUC values were used. The C-index of the nomogram was 0.840 (95% CI 0.827-0.854) and 0.837 (95% CI 0.816-0.857) in the training and testing cohorts, respectively. Moreover, the 3- and 5-year AUC values of the nomogram were 0.868 and 0.84869, respectively, in the training cohort, corresponding to 0.868 and 0.86049 in the testing cohort (Fig. 4). Thus, both the C-index and the 3- and 5-year AUC values of the nomogram exceeded 0.75 and were close to 1.0, which suggested that the constructed nomogram has good discriminatory ability for OS prediction. The calibration of our nomogram was assessed by calibration plots. Actual OS was plotted on the y-axis, and nomogram-predicted OS was plotted on the x-axis. The calibration plots of the established nomogram barely deviated from the 45-degree diagonal reference line in both the training cohort and the testing cohort (Fig. 5), which indicated optimal agreement between the actual observed survival and the nomogram-predicted survival.

Comparison of nomogram with TNM stages
Moreover, we compared the predictive ability of the nomogram with that of the TNM staging system. Compared with the C-index of the constructed nomogram (0.840, 95% CI 0.827-0.850), the C-index of the TNM staging system was lower (0.804, 95% CI 0.788-0.820, P < 0.001). More importantly, the constructed nomogram yielded a larger log-likelihood and a smaller AIC value than the TNM stage (Table 3). Together, these results imply that the nomogram has stronger predictive power than the TNM staging system. The same result was also observed in the testing cohort.
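For readers unfamiliar with the AIC comparison, the criterion is AIC = 2k - 2 ln L, where k is the number of estimated parameters and ln L is the maximized log-likelihood, so a model is preferred when its gain in log-likelihood outweighs the penalty for its extra parameters. The Python sketch below uses invented log-likelihoods and parameter counts, not the values in Table 3.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_parameters - 2 * log_likelihood

# Invented values for illustration only; see Table 3 for the reported figures.
hypothetical_models = {
    "nomogram": {"log_likelihood": -5200.0, "n_parameters": 30},
    "TNM stage": {"log_likelihood": -5400.0, "n_parameters": 3},
}
aic_values = {
    name: aic(m["log_likelihood"], m["n_parameters"])
    for name, m in hypothetical_models.items()
}
preferred = min(aic_values, key=aic_values.get)
print(aic_values, preferred)
```

In this made-up example the richer model is preferred despite its larger parameter penalty, because its log-likelihood is sufficiently higher.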

Discussion
In contrast to the decreasing incidence in older populations, the incidence of EO-CRC has increased since the mid-1990s. Accurate survival prediction for EO-CRC patients is important for informing patients of their prognosis and for making individualized clinical decisions. Many prognostic factors affecting long-term survival have not been sufficiently utilized, and the optimal method for predicting the survival outcome of EO-CRC patients is currently unclear. Based on large population-based, multi-institutional data from the SEER database, the present study used independent clinicopathological and socio-economic factors to establish and internally validate a nomogram for predicting the 3- and 5-year OS of individual EO-CRC patients. This study is valuable because a nomogram represents complex mathematical formulas as an intuitive visualization and quickly estimates clinical outcomes without complicated calculations, facilitating individual prognosis prediction and clinical decision-making regarding treatment and surveillance [18]. In addition, the data for this study were extracted from the openly accessible SEER database, which ensures a sufficient sample size.
Another strength of this study is that it involved numerous clinicopathological and socio-economic variables that were associated with the prognosis of EO-CRC in previous reports. Survival outcome differs among colorectal cancer patients according to primary tumor location. Several previous studies, including meta-analyses, demonstrated that left-sided disease was significantly associated with better overall survival than right-sided disease [19][20][21], which is consistent with the present study (HR = 0.72, 95% CI 0.61-0.86, P < 0.01). Moreover, our results showed that rectal cancer was associated with better OS than right-side colon cancer (HR = 0.76, 95% CI 0.62-0.95, P = 0.015), which is in accordance with a previous study [22]. Based on our multivariate analysis, tumor size was also an independent prognostic factor for OS (HR 1.17, 95% CI 1.01-1.37, P = 0.038), which is in agreement with previous reports [23][24][25][26]. Previous research has revealed that a high lymph node ratio (LNR) is significantly correlated with inferior overall and disease-free survival in stage III [27][28][29] and stage IV [30][31][32] colorectal cancer patients, which is in line with this study.
For cancer patients, socio-economic status (SES) has been reported to be a significant predictor of prognosis [33], but it was not considered in most previous nomograms [9][10][11][12][13][14]. Previous studies showed that patients with low SES had a worse prognosis than those with high SES [34][35][36]. Similarly, in our study, we identified a significant association between survival outcome and median household income, an indicator of SES. People with low SES participate less frequently in cancer screening programs, so their CRC tends to be diagnosed at an advanced rather than an early stage [37]. Moreover, poorer access to health services and high-quality treatments contributes to their worse outcomes [37].
Furthermore, the performance of the constructed nomogram was comprehensively evaluated. Firstly, the constructed nomogram showed good discriminatory ability, with a high C-statistic of 0.840 and 3- and 5-year AUC values of 0.868 and 0.84869, respectively. In addition, the calibration plots for 3- and 5-year OS probabilities showed barely any deviation from the reference line (Fig. 5), which means that the nomogram-predicted survival was similar to the actual observed survival. The same results were confirmed in the testing cohort, which further supports the strong predictive ability of our nomogram. Most importantly, compared with the TNM staging system, the nomogram displayed better predictive ability, with a higher C-index (0.840 vs 0.804, P < 0.001), larger log-likelihood, and smaller AIC value. These results collectively suggest that the established nomogram might be used as a more powerful and convenient tool to predict survival outcomes for patients with EO-CRC. Our study also has strengths relative to previous related nomograms. On the one hand, unlike previous nomograms that included only patients with colon cancer [38][39][40], our nomogram covers patients with colon cancer and those with rectal cancer, regardless of stage. On the other hand, our nomogram includes some distinct variables, such as SES and LNR, which have also been reported to be important predictors of prognosis. To our knowledge, ours is the only study to include the socio-economic status of patients in the nomogram.
Of note, the present study had some limitations. Firstly, several vital prognostic factors, such as KRAS, BRAF, microsatellite instability (MSI), tumor regression grade, and circumferential resection margin (CRM), were inaccessible in the SEER database and thus were not incorporated into the proposed nomogram. Secondly, the nomogram lacked treatment information such as surgical procedures and chemotherapy regimens, which greatly affect survival outcomes. Thirdly, the need for information on some factors, such as LNR and perineural invasion, may restrict the preoperative application of the constructed nomogram, although the nomogram shows a solid ability to predict the overall survival of postoperative patients. In addition, selection bias cannot be ignored because of the retrospective nature of the study. Besides, the constructed nomogram includes a relatively large number of variables and therefore requires highly complete patient information, which may affect its practicability. Lastly, this study did not involve any external validation based on other populations. Therefore, it is unclear whether the nomogram can be directly applied to other populations, and its generalizability needs further verification and prospective evaluation.

Conclusion
A novel nomogram for EO-CRC patients based on independent clinicopathological and socio-economic variables was developed and internally validated, which is superior to the TNM staging system. In addition, the nomogram could facilitate postoperative individual prognosis prediction and clinical decision-making.