Background

Colorectal cancer (CRC) is the most common gastrointestinal malignant tumor worldwide [1]. Although the overall mortality rate is low, the mortality rate is higher in developing countries, and significant differences in mortality have been observed among countries and regions [2].

An increasing number of studies, especially large-scale, multicenter population studies, have identified spatial differences in the survival of colorectal and other cancer patients [3,4,5,6,7,8]. Douaiher et al. noted significant differences in survival outcomes between developed and developing countries [9, 10]. Mokdad et al. [11] found significant differences in the mortality rate of CRC patients between different regions and counties in the United States, with the highest mortality rate in Union County, Florida (58.4/100000 people; 95% Confidence Interval: 52.0–65.2), and the lowest mortality rate in Summit County, Colorado (8.1/100000 people; 95% CI: 7.0–9.3). The decrease in mortality from 1980 to 2014 also varies among counties, with the largest decreases found in Howard County, Maryland (62.2%; 95% UI: 60.7–67.4%) and Nassau County, New York (62.3%; 95% UI: 60.1%, 64.3%) [11]. Michael et al. also reported regional differences in the management and outcomes of CRC patients in Australia [12].

In studies of survival prediction modeling for CRC based on large-scale and multicenter data aggregated from a wide range of geographic regions, little attention has been paid to the spatially varying effects of predictors [13,14,15,16,17,18]. Many studies have assumed that aggregated data are homogeneous and directly used statistical models, such as the Cox proportional hazard model, which assumes that all patients are independent, regardless of origin [19,20,21,22]. Some studies suggested that patients from the same geographic region were likely to have correlated outcomes, thus violating the assumption of independent observations. Therefore, researchers should consider a multilevel survival model for survival prediction, including a single random effect that considers similarities within spatial clusters [23,24,25,26]. All the regression coefficients of these models are assumed to have a constant impact across the entire study region, meaning that the impact of the patients’ characteristics remain constant among different geographic regions. However, recent research has shown that there are spatially varying effects of predictors in breast cancer survival prediction, indicating that the impact of patient characteristics on breast cancer survival varies by location [27].

This study aims to detect and interpret the spatially varying effect of predictors using population-based CRC survival data aggregated from a wide range of geographic regions. The studied predictors included the following widely used clinical predictors: tumor, node, metastasis (TNM) tumor stage; demographic factors; tumor differentiation grade; histological type; tumor location; tumor size; and number of positive regional lymph nodes. Overall survival was considered the outcome of interest. A machine learning model (random survival forest, RSF) was then developed. The model requires no statistical restrictions or assumptions to build a survival prediction model and can be used as an alternative survival prediction model to statistical models for multilevel survival when dealing with spatially varying effects of predictors.

Methods

Patients

This study obtained CRC patient data from the Surveillance, Epidemiology and End Results (SEER) program of the National Cancer Institute (NCI), which includes 18 population-based registries [28]. We regarded these registries as spatial clusters of patients and explored the spatially varying effect of predictors among these registries. Patient information, including demographics, diagnoses and survival, are routinely collected, and this information is publicly available as deidentified data.

In this study, the analysis was limited to patients who were diagnosed with primary nonmetastatic CRC (SEER primary site recodes C180-C189, C199 or C209 without distant metastasis) as their only malignant tumor and were actively followed up from January 2004 through December 2013. We excluded individuals if their cancer status was obtained from a nursing home, hospice, autopsy report or death certificate; if their survival time was less than 1 month; if their tumor size was not reported as an exact value; or if the number of cancer-positive regional lymph nodes was not noted. In addition, patients with unknown key predictor variables were excluded (Fig. 1).

Fig. 1
figure 1

Details of the patient data screening procedure. For all SEER research data reported from 1973 to 2013, we first considered patients who were diagnosed with primary colorectal cancer after 2004 and who were actively followed-up with case reports from hospital inpatient departments, radiation treatment centers, laboratories, and physicians’ offices and for whom survival times were longer than 1 month; these criteria yielded 249,665 patients. We then excluded patients whose tumor grading and differentiation codes were of undetermined cell types or were not stable or not treated, reducing the number of patients to 211,292. We next excluded patients whose tumor size, lymph nodes, positive regional nodes and examined regional nodes were either NA or unknown, further reducing the sample to 164,331 patients. Finally, we excluded patients with metastases and patients for whom the exact tumor size and number of positive regional nodes were unknown. We ultimately obtained 128,061 patients from 18 SEER registries as our study cohort

Univariate analysis

First, the spatial varying effect was detected for a widely used clinical predictor: the TNM staging system. In the TNM staging system, T represents the depth of primary tumor penetration, N represents the number of regional nodes involved in the tumor, and M represents distant metastasis. Based on these parameters, patients were divided into 7 groups that correspond to different prognoses: stage I, stage IIA, stage IIB, stage IIC, stage IIIA, stage IIIB and stage IIIC. The TNM staging system can be used as an independent criterion for distinguishing patients based on survival outcomes. Based on these criteria of the tumor stage, each patient was assigned a TNM staging label. Then, log-rank tests (the survdiff function in the “survival” R package [29]) were used to evaluate differences in the Kaplan-Meier survival curves among spatial clusters. A quantitative comparison based on the restricted mean survival time (RMST) (the rmst2 function in the “survRM2” R package [30]) was performed, which intuitively reflects the spatially varying effects of the TNM staging system.

Second, the heterogeneity test from the meta-analysis was adopted to reveal the spatially varying effects of the patient characteristics. First, univariate associations between overall survival and predictors in each spatial cluster were examined using a Cox regression model, from which log hazard ratios were obtained with 95% confidence intervals. Then, Cochran’s heterogeneity [31] statistic Q value and the inconsistency index I2 [32] were used to examine the heterogeneity of the predictors across the entire study region, revealing the spatially varying effects of the predictors.

Multivariate analysis

The predictors listed in Table 1 were fitted using Cox regression models [33] based on the patient data from different spatial clusters, and regression coefficients were thereby obtained for predictors from different spatial clusters. For comparison and interpretation, we used “Age at diagnosis” as a category variable and defined the following groups: Group 1: less than 55 years old; Group 2: between 55 and 64 years old; Group 3: between 65 and 74 years old; and Group 4: older than 75 years. We compared the variance of the regression coefficient for each predictor among the spatial clusters, revealing the impact of predictors across the entire study region.

Table 1 Demographic and Clinical Characteristics of SEER Nonmetastatic CRC Patients from 2004 to 2013

Survival prediction based on the statistical model and the machine learning model

In this paper, we apply a machine learning model (RSF [34]) to make full use of large-scale, multicenter clinical research data without violating the statistical assumptions requiring all patients to be independent of one another and requiring the impact of predictors to remain constant across the entire study region. The performances of the machine learning model and the statistical model (Cox proportional hazards model with mixed effects Cox, Accelerated Failure Time Model AFT) were then evaluated and compared based on testing data aggregated from multiple geographic regions. We used the concordance index (C-index) [35] and prediction error curves [36] as measures of the model’s prediction performance. The C-index is one of the most commonly used performance measures of survival models. It can be interpreted as the fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects who can actually be ordered. To reflect the performance variance of both models among different spatial clusters, we compared the variance of the C-index for each model among the different spatial clusters. Prediction error curves can be used to show model calibration performance via an expected time-dependent Brier score. For correctly censored data, the squared residual (observed status-predicted status)2 of a subject at each particular time point t is weighted using the inverse probability of the censoring weights [37], which can yield the calibration ability of the prediction model within a certain follow-up period.

The modeling construction process is outlined in Additional file 1: Figure S1. First, we divided the selected SEER data based on region codes and then divided the data of each regional dataset into a training set (80%) and a test set (20%). Second, we fit all the training data to models based on the machine learning approach and the statistical approach and tested the prediction performance of both models using the test datasets for the different regions. The deduction process, including setting the model parameters and inputting the factors, is described in the Additional file 1.

All analyses were performed using R version 3.4.0.

Results

Patient demographics and characteristics

A total of 128,061 patients met the inclusion criteria and were included in the analysis. The overall percentage of excluded patients was 86.42%. The patient demographics and characteristics are listed in Table 1. The median follow-up time was 40 months (range, 1–119 months). Overall deaths were recorded for 34,599 (27.02%) patients. The median age at diagnosis was 67 years. Patients were categorized into age groups of less than 55 years, 55 to 64 years, 65 to 74 years and greater than 74 years. We considered age at diagnosis as a categorical factor in the multivariate analysis for comparison and interpretation purposes. However, we used age as a continuous variable in both the statistical and machine learning prediction models. Therefore, this parameter had both levels and interquartile range values. The median tumor size was 40 mm (interquartile range, 30 to 60; maximum, 975). The median number of positive regional lymph nodes was 0 (interquartile range, 0 to 2; maximum, 73). The probability of all-cause death varied across different geographic regions (Fig. 2). As mentioned in a previous study, the survival outcome of patients with CRC exhibits spatial cluster effects, which may impact the choice of methodology in building a survival prediction model.

Fig. 2
figure 2

Mortality distribution across all study regions. Similar to previous findings, differences in survival outcomes across geographic areas were found in this study. The urban mortality rate was lower than the suburban mortality rate, and the mortality rate in areas with high levels of economic activity was lower than that in areas with low levels of economic activity

Spatially varying effects of the predictors according to univariate analysis

For the widely used clinical predictor, the TNM tumor stage, we first compared the differences in survival outcomes for patients within the same TNM staging group among different spatial clusters. Because the “Alaska” and “Rural Georgia” clusters included too few patients [252 patients in Alaska (0.2% of the total population) and 299 patients in Rural Georgia (0.23% of the total population)], we drew Kaplan-Meier survival curves for all spatial clusters except these two clusters. As shown by the Kaplan-Meier survival curves (Fig. 3), the survival outcomes of patients within the same TNM staging group significantly differed across different spatial clusters. Except for the stage IIC group, there were significant spatially varying effects (P value < 0.05) for patients assigned to the same staging groups. Second, using the San Francisco-Oakland SMSA registry as a reference, we performed a quantitative comparison of patient survival outcomes within the same staging group among spatial clusters based on the RMST. As shown in Table 2, in most staging groups, the RMSTs of the spatial clusters significantly differed. For stage IIB, the maximum difference in RMST was approximately 15.66% of the entire assessment period, which means that for the same staging group, patients in different spatial clusters showed different survival outcomes.

Fig. 3
figure 3

Kaplan-Meier survival curves of patients within the same TNM staging group across different spatial clusters. The log-rank test was used to test differences across different spatial clusters. Except for the stage IIC group, there were significant spatially varying effects (P value < 0.05) of patients within the same staging groups

Table 2 RMST of Different Clusters for Patients within the Same AJCC7 Stage Group

For other predictors that are usually considered in studies of survival prediction, we first analyzed the univariate associations between overall survival and the predictors in each spatial cluster using a Cox regression model from which log hazard ratios were obtained with 95% confidence intervals. Then, using Cochran’s heterogeneity statistic Q and the inconsistency index I2 as measurements, we obtained the heterogeneity of the impact of predictors on survival outcomes among different spatial clusters. As shown in Table 3, the effects of “tumor size,” “number of positive regional lymph nodes” and “age at diagnosis” on survival outcomes were substantially heterogeneous among the different spatial clusters, which indicated that the impact of the patients’ characteristics may not be consistent among different geographic regions. A forest plot was used to intuitively display the spatially varying effects for “tumor size” and “age at diagnosis” (Fig. 4).

Table 3 Heterogeneity Test for Predictors among Spatial Clusters
Fig. 4
figure 4

Forest plot displaying the spatially varying effects of “tumor size” and “age at diagnosis”

The spatially varying effect of the regression coefficient on the multivariate analysis

Considering the regression coefficients of the predictors combined through multivariate Cox regression models among the different spatial cluster datasets, we found strong spatially varying effects of the regression coefficient. We considered male patients, age at diagnosis less than 55 years, well-differentiated tumor, adenocarcinoma histology type, right colon tumor location, T1 T stage and N0 N stage as the baseline hazard. As shown in Fig. 5, the pattern of the spatial variation of the regression coefficient differed among predictors; the spatially varying effect of “age at diagnosis” was large, but spatially varying effects for “tumor size” and “number of positive regional lymph nodes” were relatively small. The pattern of spatially varying effects also differed among the subgroups of one predictor. For the predictor “histological type,” the regression coefficients of the subgroups of “mucinous adenocarcinoma” and “signet ring cell carcinoma” among the different spatial clusters were more stable than those of the other subgroups of “histological type.” As shown in more detail in Fig. 6, the impact of the same predictors (e.g., “age at diagnosis”) on different geographic regions also varied, and the hazard ratio (the exponential transformation of the regression coefficient of the predictor) of “Hawaii” remained higher than that of “Greater Georgia” and “New Mexico” within all the subgroups. However, for other registries, these patterns may not exist. For example, the hazard ratio of “Detroit” for the subgroup of patients with age at diagnosis = 55–64 years was similar to that of “Hawaii” for the same subgroup but lower than that of “Hawaii” for the age at diagnosis = 65–74 years subgroup and much higher than that of “Hawaii” for the age at diagnosis > 74 years subgroup.

Fig. 5
figure 5

Spatially varying effects of the regression coefficients of predictors on the multivariable analysis. The predictors included a) age at diagnosis between 55 and 64 years; b) age at diagnosis between 65 and 74 years; c) age at diagnosis greater than or equal to 75 years; d) female; e) moderately differentiated tumor grade; f) poorly differentiated tumor grade; g) undifferentiated tumor grade; h) mucinous adenocarcinoma histology type; i) papillary adenocarcinoma histology type; j) adenoma. In adenoma, polyp histology type; k) signet ring cell carcinoma histology type; l) other histology type; m) left colon; n) rectum; o) T stage T2; p) T stage T3; q) T stage T4a; r) T stage T4b; s) N stage N1a; t) N stage N1b; u) N stage N1c; v) N stage N1nos; w) N stage N2a; x) N stage N2b; y) N stage N2nos; z) tumor size; and AA) number of positive regional lymph nodes

Fig. 6
figure 6

Hazard ratios according to “age at diagnosis” group and registry. a) Subgroup 1: age at diagnosis between 55 and 64 years, b) Subgroup 2: age at diagnosis between 65 and 74 years, c) Subgroup 3: age at diagnosis greater than or equal to 75 years. d) The varying effects of hazard ratio on different subgroups of registries, including Detroit, Greater Georgia, Hawaii, New Mexico and Utah

In summary, we found that the predictors commonly used in survival prediction models have significant spatially varying effects and that the impact of patient characteristics may not remain constant across large-scale, multicenter clinical research data that have been aggregated from a wide range of geographic regions.

Model evaluation and comparison

The predictive accuracies of the machine learning model and the statistical model were measured using the C-index. The stability of both models for different spatial cluster datasets was evaluated based on the standard deviation and the difference between the maximum and minimum of the C-index for different spatial cluster datasets. As shown in Table 4, the machine learning model exhibited better prediction performance for the test dataset than the statistical model. As shown in Additional file 1: Table S5, the proportional hazard assumption for the Cox model was violated and may affect the reliability of the model; therefore, we consider the AFT model as an alternative strategy for the analysis of time-to-event data. The AFT model can even be suitable when hazards are not proportional. The C-index for the machine learning model (0.898, 95% CI: [0.895, 0.902]) was higher than that for the AFT model (0.732, 95% CI: [0.726, 0.738]). Although there were no significant differences (P = 1) in the model’s prediction stability for different spatial cluster datasets, the RSF model had a lower standard deviation and lower maximum and minimum differences of the C-index. As shown in Fig. 7, the prediction accuracy of both models for different spatial cluster datasets also varied, especially for regions that contained fewer patients. The RSF model yielded a higher prediction accuracy with a narrower error bar for the C-index (95% confidence interval) than the AFT model, indicating that the RSF model was more accurate and stable for different spatial cluster datasets that contained a different number of patients and predictors with spatially varying effects. As shown in Fig. 8, the prediction error of both models was tested using an aggregate test dataset, and the reference model was a nonparametric Kaplan-Meier curve. The RSF model consistently had a lower prediction error than the statistical model and, therefore, had better model calibration capability than the statistical model. In addition, the statistical model had lower prediction error than the nonparametric method.

Table 4 Performance Comparison of the Statistical Model and the Machine Learning Model
Fig. 7
figure 7

Model evaluation and comparison between the machine learning model and the statistical model. a) C-indexes of both models among different spatial clusters sorted by decreasing C-indexes. b) Boxplot of C-indexes of both models among different spatial clusters using the Wilcoxon test (nonparametric) to compare the mean C-indexes of the models. c) The C-indexes of both models among different spatial clusters. d) Comparison of the standard deviations of the C-indexes of both models among different spatial clusters using the Wilcoxon test (nonparametric). The machine learning model exhibited better prediction performance than the statistical model

Fig. 8
figure 8

Model calibration performance based on prediction error curves. The nonparametric Kaplan-Meier curve was used as a reference. Cox and AFT represent the statistical model, and RSF represents the machine learning model. The prediction error was estimated at 0, 12, 36, 60, 72, 96, and 108 months after diagnosis. As the prediction error curves of Cox and AFT were nearly overlapping, we plotted their curves separately. a) Prediction error curves for Cox and RSF. b) Prediction error curves for AFT and RSF

Discussion

In this study, we used population-based data from the SEER database to detect and interpret the spatially varying effects of patients’ clinicopathological and demographic characteristics, which are commonly used in CRC survival prediction. The study period was between 2004 and 2013, a period during which patients benefited from modern therapies with improved survival probability. A new population-based survival prediction model is needed to predict CRC patient survival probabilities, as the impact of patient characteristics may not remain constant across entire study regions, especially for large-scale, multicenter clinical research, for which data are collected from a wide range of geographic regions. Strong spatially varying effects were identified for commonly used CRC predictors. To our knowledge, this study is the first to explore the spatially varying effects of the predictors used in a CRC survival prediction model with a population-based dataset. The machine learning model, which considered the varying impact of patient characteristics on different spatial clusters, achieved more accurate prediction than the statistical model, which considered only the random effects of spatial clustering and that the impact of patient characteristics remained constant across different spatial clusters of patients diagnosed with primary nonmetastatic CRC.

The spatially varying effects of predictors for CRC survival prediction were detected, while many previous studies have ignored these effects. TNM tumor staging, which is widely used worldwide for predicting cancer prognosis, assumes that patients in different geographic regions should have the same or similar survival outcomes based on the same pathological criteria. However, in our study, all stage groups except stage IIC exhibited significant variance in survival outcomes (P < 0.05). Therefore, using the TNM staging system to predict survival may potentially reveal deviations between different regions. Moreover, age at diagnosis also has spatially varying effects that are typically not considered. Using heterogeneity testing and multivariable regression analysis, we observed that age at diagnosis had large spatially varying effects. Compared with patients in Hawaii, patients from Utah had lower hazard ratios in subgroups for which the patients’ age at diagnosis ranged from 55 to 64 and from 65 to 74, but these hazard ratios were higher in the subgroup for which the patients’ age at diagnosis was greater than 74. However, for patients from Greater Georgia and New Mexico, the hazard ratios were consistently lower than those of patients in Hawaii for all subgroups related to age at diagnosis.

The spatially varying effects of the predictors imply that the impact of patients’ characteristics may not remain constant across entire study regions. The reasons for these effects should be studied further. However, many studies that have constructed survival prediction models using large-scale, multicenter clinical research data aggregated from a wide range of geographic regions did not consider these effects. Because these models assume that the impact of patient characteristics remains constant across different spatial clusters and consider only random effects regarding the spatial nature of the data, the use of these models may have over- or underestimated survival prediction. We achieved better performance using a machine learning model (RSF) that considered the spatially varying effect of predictors than when we used a statistical model.

In combination with our previous research results [38], the present study demonstrates that the RSF model can be used to study complex relationships (such as nonlinear or time-dependent relationships) regarding the problem of prognosis in nonmetastatic CRC; this topic warrants continued in-depth study. Using the proposed machine learning model framework, one can establish a global survival prediction model based on large-scale, multicenter clinical data that considers the spatially varying effects of predictors among different clusters. Once the model is finalized, patients from different regions will be able to obtain more personalized survival predictions by entering their individual characteristics. Both doctors and patients will benefit from such a model. Doctors will be able to provide more precise and generalized survival predictions and chose more appropriate treatment options, and patients will be able to better understand the progression of their disease, thus enhancing patient compliance.

However, we did not include therapeutic and molecular data, which might have further improved the predictive accuracy of the model. Moreover, we assumed that all patient treatments and clinical visits were confined to the same region because migration and cross-regional clinical visits were outside the scope of this study. Another limitation of this study is that the longest follow-up in the SEER database was only 119 months, and the median follow-up was 40 months; these follow-up times are relatively short considering a population with a potentially curable condition. Therefore, the results should be verified using other databases containing long-term follow-up data.

Conclusions

We conclude that the widely used clinical TNM tumor staging system is limited by spatially varying effects for predicting survival. The impact of age at diagnosis, tumor grade, histology and tumor location may not be consistent across study regions. Constructing survival prediction models based on population-based data collected from a wide range of geographic regions without considering these spatially varying effects may produce deviations across different regions. Machine learning models that consider these spatially varying effects are likely to produce more accurate and robust survival prediction models.