Background

Recent studies indicate that approximately 20% of stage II and III colon cancer patients will have a recurrence of their cancer within 5 years following curative treatment (surgery ± adjuvant chemotherapy) [1,2,3,4]. In the US, guideline-issuing groups such as the National Comprehensive Cancer Network (NCCN) have supported surveillance testing consisting of physical exams, carcinoembryonic antigen (CEA) blood tests, colonoscopy, and computed tomography (CT) scans of the chest and abdomen in the years following primary treatment [5,6,7]. The main purpose of colon cancer surveillance testing is to detect tumor recurrence or new primary colon cancer at an earlier point than symptom-based detection, resulting in a higher likelihood of curative treatment for the recurrence and better cancer-specific survival [8, 9]. However, although earlier randomized controlled trials (RCTs) and meta-analyses of these trials demonstrated that surveillance for local/regional colon cancer improved overall survival [10,11,12,13], a cancer-specific survival benefit from more intensive surveillance has not been demonstrated [9, 12,13,14]. This scenario has been complicated by the fact that the clinical trial data pertaining to this issue: 1) have spanned a number of decades in which treatments for colon cancer have dramatically improved and surveillance strategies have evolved [15, 16], 2) include heterogeneous studies with varying surveillance protocols that make comparisons between trials and inclusion of trials in meta-analyses problematic [15,16,17,18,19], and 3) have been derived from small controlled trials with limited follow-up and thus lack the statistical power to detect meaningful survival differences [18, 19]. The resulting contradictory evidence has led some clinicians/researchers to question whether surveillance guidelines represent the best follow-up strategy and even whether surveillance testing should be done at all [15, 20,21,22].

Observational comparative effectiveness research can provide high quality, nuanced evidence beyond that which may be feasible or ethical in RCT designs evaluating survival outcomes [23, 24]. This evidence can then be used to facilitate shared decision making between physicians and their patients [23,24,25,26]. The goal of this retrospective comparative effectiveness study was to assess the hypothesized survival benefit associated with more surveillance testing in stage II and III colon cancer patients following curative treatment. We utilized the National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results database combined with Medicare claims (SEER-Medicare) to evaluate whether the level of adherence with NCCN Surveillance Guidelines (more vs. less) is associated with survival in older adult colon cancer patients. This study was designed to address the statistical power, heterogeneous surveillance strategies, and generalizability limitations of previous RCT designs by evaluating survival differences according to surveillance testing received in real world clinical settings. To achieve this objective, we sought to leverage the strengths of the large SEER-Medicare database with extensive follow-up time to evaluate 5-year cancer-specific survival and, secondarily, 5-year overall survival in colon cancer survivors according to level of adherence with NCCN guidelines.

Methods

The study methodology has been described previously [17], although some differences are presented herein. The study population consisted of colon cancer patients, 66–84 years of age, who were diagnosed between 2002 and 2009 and included in the NCI’s SEER-Medicare database. These years were included to ensure that all patients had the potential for at least 5 years of follow-up after completion of treatment. Vital status and cause of death information was available up to the termination date of this study, December 31, 2015. Patients who received no surveillance tests (Nonadherent) for up to 3 years post-treatment were excluded. The flow diagram depicting inclusion/exclusion criteria is shown in Fig. 1.

Fig. 1

Flow diagram of the study population. SEER: Surveillance, Epidemiology, and End Results

Ascertainment of study data

Diagnostic and procedure codes from the International Classification of Diseases, Ninth and Tenth Edition (ICD-9/10); International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3); Current Procedural Terminology (CPT); and Healthcare Common Procedure Coding System (HCPCS) were utilized to obtain relevant treatment, surveillance, comorbidity [27, 28], and cause of death information (Additional file 1, Table S1).

Final treatment date

For patients who received surgery as their only primary treatment, their final treatment date was their surgery date. For patients who received adjuvant chemotherapy, their final treatment date was the last date of sequential chemotherapy treatment (claims ≤90 days apart were considered as sequential chemotherapy) or 6 months from the first chemotherapy date for patients who received chemotherapy > 6 months.
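The final treatment date rule above can be illustrated with a minimal sketch. The function name, the argument layout, and the approximation of "6 months" as 183 days are assumptions made for illustration only; they are not taken from the study's analytic code.

```python
from datetime import date, timedelta

MAX_GAP_DAYS = 90      # claims <= 90 days apart count as sequential chemotherapy
MAX_COURSE_DAYS = 183  # assumption: "6 months" approximated as 183 days


def final_treatment_date(surgery_date, chemo_claim_dates):
    """Return the final treatment date under the rules described in the text."""
    if not chemo_claim_dates:
        # Surgery was the only primary treatment.
        return surgery_date
    claims = sorted(chemo_claim_dates)
    first = last = claims[0]
    for d in claims[1:]:
        if (d - last).days > MAX_GAP_DAYS:
            break  # gap exceeds 90 days: the sequential course has ended
        last = d
    # Courses longer than 6 months are capped at 6 months from the first claim.
    cap = first + timedelta(days=MAX_COURSE_DAYS)
    return min(last, cap)


# Hypothetical patient: three monthly claims, then an isolated claim 5 months later.
example = final_treatment_date(
    date(2005, 1, 10),
    [date(2005, 2, 1), date(2005, 3, 1), date(2005, 4, 1), date(2005, 9, 1)],
)
print(example)
```

In this hypothetical case the September claim falls more than 90 days after the April claim, so it is not treated as part of the sequential course.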

Surveillance tests

In determining adherence with NCCN Surveillance Guidelines, we considered receipt of surveillance tests that were established for their ability to detect new colon cancers and intraluminal, locoregional, and distant metastatic recurrences. Hence, we did not include physical exams in determining adherence with guidelines [19]. In each year of follow-up after the final treatment date, we obtained the number of CEA blood tests, CT (and positron emission tomography [PET]) scans, and colonoscopies. To avoid duplicate counting of surveillance tests, claims had to be separated by ≥30 days. We assessed adherence with surveillance tests up to 3 years post-treatment as this is the time period in which most recurrences occur [29]. In a given year, we defined adherence with a surveillance test as receiving the minimum number of recommended tests (Table 1) [5,6,7].

Table 1 NCCN Colon Cancer Surveillance Guidelines [30]

Determining yearly and overall surveillance classification

For the assessment of yearly and overall surveillance, we modified our original strategy in order to capture the totality of the surveillance experience [17]. For CEA, which requires a minimum of two tests per year to meet minimum guidelines, we classified each year of follow-up according to the number of CEA tests as ≥2, 1, or 0. For tests in which the minimum was one test in a given year (colonoscopy and CT), each patient was labeled as ≥1 or 0. For each year of surveillance assessment, we classified colon cancer cases as “More Adherent”, “Less Adherent”, or “Nonadherent”. Patients who were More Adherent either received all recommended testing, were missing one test, or were missing more than one test but with additional testing (e.g., colonoscopy) that was absent for those who were Less Adherent. Less Adherent patients were missing at least one recommended test without additional testing to compensate for the missing tests. Nonadherent patients received no surveillance tests during the assessment period. The classification scheme for yearly surveillance assessment is shown in Table 2.
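The first step of the yearly assessment, coding the number of tests into the categories described above, might be sketched as follows. The function names are hypothetical, and the sketch deliberately stops short of assigning More Adherent or Less Adherent, because that mapping is defined by Table 2 and is not reproduced here.

```python
def code_cea(n_tests):
    """CEA requires at least two tests per year: code as '>=2', '1', or '0'."""
    return ">=2" if n_tests >= 2 else str(n_tests)


def code_single(n_tests):
    """Colonoscopy and CT require at least one test per year: code as '>=1' or '0'."""
    return ">=1" if n_tests >= 1 else "0"


def code_year(cea, colonoscopy, ct):
    """Code one follow-up year from the yearly test counts."""
    return {
        "CEA": code_cea(cea),
        "colonoscopy": code_single(colonoscopy),
        "CT": code_single(ct),
    }


def received_no_tests(cea, colonoscopy, ct):
    """True when no surveillance test of any type was received (Nonadherent)."""
    return cea == 0 and colonoscopy == 0 and ct == 0


# Hypothetical year: three CEA tests, no colonoscopy, one CT scan.
print(code_year(3, 0, 1))
```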

Table 2 Classification scheme for yearly surveillance assessments

In determining overall surveillance, all years of complete follow-up up to 3 years post-treatment were assessed, and patients were classified as More Adherent or Less Adherent (Nonadherent patients were excluded). For patients with only 1 year of complete follow-up, their Year 1 classification was also their overall classification. For patients with 2 or 3 years of complete follow-up, their yearly classifications were combined to create an overall classification. The overall classification scheme for these patients is contained in Additional file 2 (Table S2).

Statistical analysis

Demographic and clinical variables for the unweighted study population were compared across the two categories of overall surveillance (More vs. Less). Pearson’s chi-square was applied to detect differences in categorical variables. Median follow-up time and 25% cancer death time between the two groups were calculated by the Kaplan-Meier method with differences evaluated by the log-rank test. Statistical significance was defined as P value < 0.05.

In observational comparative effectiveness studies, where treatment is not randomized, appropriate methods to control for potential differences in confounding variables between the treatment groups are paramount [31]. Generalized boosted models represent a relatively new machine learning approach whereby propensity scores are obtained via multiple regression trees that model complex associations between pretreatment covariates across categories of treatment (i.e., surveillance) [32,33,34,35]. This procedure produces weights for inverse probability of treatment weighting (IPTW) that balance covariates between groups. Propensity scores for each treatment group were obtained by balancing the following covariates: age, race, sex, marital status, year of diagnosis, state buy-in coverage, census-tract poverty level, urban-rural designation, SEER region [36], disease substage, tumor grade, tumor location (proximal/distal), chemotherapy status, and the individual comorbid conditions that comprise the Charlson Comorbidity Index [27]. Balance between potential confounding factors was assessed by the absolute standardized mean difference for each group compared to the study population mean. Differences ≥0.20 were considered as evidence of imbalance [32]. For point estimates obtained in the regression models described below, the estimand is the average treatment effect.
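To make the weighting and balance check concrete, the sketch below computes average-treatment-effect weights from propensity scores that are assumed to be already estimated, and then computes the absolute standardized mean difference of one covariate for a group relative to the full study population. It does not reproduce the generalized boosted model itself; the function names and the toy data are hypothetical.

```python
from math import sqrt


def ate_weights(treated, propensity):
    """IPTW for the average treatment effect: 1/p if treated, 1/(1 - p) otherwise."""
    return [1.0 / p if t else 1.0 / (1.0 - p) for t, p in zip(treated, propensity)]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def abs_std_mean_diff(covariate, treated, weights, group):
    """|weighted group mean - unweighted population mean| / population SD."""
    n = len(covariate)
    pop_mean = sum(covariate) / n
    pop_sd = sqrt(sum((x - pop_mean) ** 2 for x in covariate) / (n - 1))
    xs = [x for x, t in zip(covariate, treated) if t == group]
    ws = [w for w, t in zip(weights, treated) if t == group]
    return abs(weighted_mean(xs, ws) - pop_mean) / pop_sd


# Toy data (hypothetical): 1 = received chemotherapy, 1 = More Adherent.
chemo = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
more_adherent = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
propensity = [2 / 3] * 6 + [1 / 4] * 4  # assumed to come from a fitted model
weights = ate_weights(more_adherent, propensity)

for group in (1, 0):
    before = abs_std_mean_diff(chemo, more_adherent, [1.0] * 10, group)
    after = abs_std_mean_diff(chemo, more_adherent, weights, group)
    print(group, round(before, 3), round(after, 3), "imbalanced" if after >= 0.20 else "balanced")
```

In this toy example the propensity scores equal the observed proportion of More Adherent patients within each chemotherapy stratum, so weighting pulls each group's chemotherapy rate back to the population rate.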

The 5-year survival curves for the study population were obtained by the Kaplan-Meier method, which incorporated IPTW. The weighted Cox proportional hazards model was used to obtain adjusted hazard ratios (HRs) with 95% confidence intervals (CIs) for the relative hazard (i.e., risk) of 5-year colon cancer-specific, cancer-specific (i.e., any cancer death), noncancer-specific, and overall mortality from the last date of treatment. The survival analysis revealed nearly identical hazard ratios for colon cancer-specific and cancer-specific survival; thus, only cancer-specific mortality is reported. We initially planned to obtain separate models according to treatment (surgery, surgery + chemotherapy). However, when we analyzed both models, the associations between surveillance status and all survival outcomes were highly similar. Thus, all patients were combined. The proportional hazards assumption was assessed for surveillance status graphically (log [−log] of survival function by log survival time) and by evaluating the interaction with time in the survival models. Although there was slight evidence of a declining hazard ratio (moving toward the null) over time for cancer-specific survival, there was not sufficient evidence to indicate a violation of the proportional hazards assumption.
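A weighted Kaplan-Meier estimate can be sketched in a few lines: at each distinct event time, the survival probability is multiplied by one minus the weighted number of deaths divided by the weighted number at risk. The function name and toy data below are hypothetical, and the sketch omits variance estimation and the weighted Cox model used for the hazard ratios.

```python
def weighted_kaplan_meier(times, events, weights):
    """Return [(t, S(t))] at each distinct event time, using case weights.

    times   : follow-up time for each patient (e.g., months from final treatment)
    events  : 1 if the death of interest occurred at that time, 0 if censored
    weights : IPTW for each patient (all 1.0 gives the ordinary estimator)
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        # Patients censored at t are still counted as at risk at t.
        at_risk = sum(w for ti, w in zip(times, weights) if ti >= t)
        died = sum(w for ti, e, w in zip(times, events, weights) if e and ti == t)
        survival *= 1.0 - died / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): five patients followed for up to 60 months.
months = [12, 24, 24, 36, 60]
died = [1, 1, 0, 1, 0]
for t, s in weighted_kaplan_meier(months, died, [1.0] * 5):
    print(t, round(s, 3))
```

Passing the IPTW values in place of the unit weights would give the weighted curve for a group; one curve would be computed per surveillance group.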

Results

Demographic and clinical characteristics of the study population (n = 17,860) are shown in Table 3. The majority of colon cancer patients were designated as More Adherent vs. Less Adherent (11,840, 66.3% vs. 6020, 33.7%). More Adherent patients were more likely to be alive at 5 years post-treatment, but also more likely to have died of cancer. Other variables that were significantly associated with surveillance status were age, race, marital status, year of diagnosis, state buy-in coverage, census tract poverty level, disease stage, tumor grade, receipt of adjuvant chemotherapy, and comorbidity. Sex, geographic residency, SEER region, and tumor site were not associated with overall surveillance status.

Table 3 Demographic and clinical characteristics of stage II/III colon cancer patients (n = 17,860)

The assessment of balance for the unweighted and weighted study samples is depicted in Additional file 3 (Table S3). Before weighting, when compared to the study population, there was evidence of unbalanced data for both groups in terms of receipt of adjuvant chemotherapy. The Less Adherent group was also unbalanced for the 80 to 84 years age category and disease stage. After weighting, all covariates were balanced between groups.

The IPTW-Kaplan-Meier survivor curves for 5-year cancer-specific, noncancer-specific, and overall survival are displayed in Figs. 2, 3, and 4, respectively. The More Adherent group experienced slightly poorer 5-year cancer-specific survival, better noncancer-specific survival for years 2–5, and no difference in 5-year overall survival. As reflected by the survival curves, the earliest death could not occur until 12 months after final treatment date due to inclusion/exclusion criteria (all patients had to have ≥1 year of surveillance follow-up).

Fig. 2

5-year cancer-specific survival probability by surveillance status. The Kaplan-Meier method was used to obtain IPTW-adjusted survival curves with statistical significance defined by the log-rank test

Fig. 3

5-year noncancer-specific survival probability by surveillance status. The Kaplan-Meier method was used to obtain IPTW-adjusted survival curves with statistical significance defined by the log-rank test

Fig. 4

5-year overall survival probability by surveillance status. The Kaplan-Meier method was used to obtain IPTW-adjusted survival curves with statistical significance defined by the log-rank test

Mortality data are provided in Table 4. Compared to colon cancer patients who were More Adherent, Less Adherent patients experienced a 17% decreased risk of 5-year cancer-specific death (HR = 0.83, 95% CI 0.76–0.90) and a 61% increased risk of 5-year noncancer-specific death (HR = 1.61, 95% CI 1.43–1.82) that was limited to years 2 to 5. There was no difference between the groups in overall survival (HR = 1.04, 95% CI 0.98–1.10).

Table 4 IPTW-adjusted hazard ratios for the association between surveillance status and 5-year cancer-specific, noncancer-specific, and overall mortality

Discussion

The controversy surrounding the use of surveillance testing in colon cancer survivors has been ongoing for several decades [37,38,39], as described in several meta-analyses which included studies published between 1995 and 2016 [12, 13, 21, 40,41,42,43]. Although guideline-issuing groups in the United States are consistent in their recommendation that stage II and III colon cancer patients receive surveillance testing following completion of treatment, we found that there is no benefit to more testing vs. less testing in terms of cancer-specific and overall survival in an older adult patient population. Future studies with more granular patient data (e.g., prognostic/predictive biomarkers) are needed to develop risk-stratified surveillance strategies with the goal of decreasing disease-related morbidity and mortality associated with recurrence.

Earlier RCTs on this topic were hampered by a number of limitations which resulted in conflicting evidence [15, 17]. Recently, higher quality studies have provided a more consistent conclusion regarding more vs. less surveillance testing. In 2014, Primrose et al. [44] published results of the Follow-Up After Colorectal Surgery (FACS) trial conducted in the UK. This study revealed that surveillance with CEA/CT increased the likelihood of curative resection by three times compared with minimal follow-up care, but no differences in survival were indicated. The results of two studies—one RCT and one observational study—were recently published in 2018 [2, 4]. Wille-Jorgensen et al. [2] reported the results of the multicenter COLOFOL trial which overcame many of the aforementioned limitations of earlier trials by boasting a large study population (n = 2555) with long-term follow-up. These investigators evaluated higher vs. lower frequency surveillance testing using CEA/CT. At the completion of the trial, there were no differences between the groups in either cancer-specific or overall survival. The only other observational study to evaluate more vs. less frequent surveillance testing was recently published by Snyder et al. [4]. This study differed from ours in a number of ways including: 1) surveillance was defined as high vs. low intensity CEA/CT testing at the facility level as opposed to individual-level assessment; 2) colon and rectal cancer patients were combined; 3) stage I patients (who only require colonoscopy to be adherent with guidelines) were included; 4) assessment of colonoscopy was not considered; and 5) differing inclusion/exclusion criteria were applied. This study, too, found no differences in overall survival according to facility-level surveillance.

Consistent with the results reported by the authors of the three aforementioned studies [2, 4, 44], in our study, patients who were More Adherent with guidelines did not experience improved cancer-specific survival compared to those who were Less Adherent. In fact, those who were Less Adherent with guidelines experienced slightly better 5-year cancer-specific survival. One possible explanation for these seemingly contradictory results is a study design limitation. Namely, as receipt of surveillance testing was not randomized, those who were deemed at greater risk for cancer recurrence/death received more surveillance testing. Similarly, patients in the Less Adherent group had a lower risk of cancer-specific death, but a higher risk of noncancer-specific death. These hypotheses are supported by the bivariable data in Table 3 and the mortality data in Table 4. Thus, due to a lower perceived risk of cancer-specific mortality and a higher risk of mortality from chronic comorbid conditions, patients in the Less Adherent group underwent, appropriately, less surveillance testing and experienced slightly better cancer-specific survival. It is plausible that fewer encounters with the medical system in the Less Adherent group led to poorer control of comorbid conditions and an increased risk of noncancer-specific mortality. It is also possible that our assessment of comorbidity using the individual comorbid conditions in the Charlson Comorbidity Index [27] did not fully capture the comorbidity burden in our study population. Although our method of IPTW balanced all measured covariates between the two groups, there could have been additional factors related to comorbid disease burden that were not measured which may explain these results.

Our findings in an older adult colon cancer population indicate that there is considerable latitude regarding adherence with surveillance guidelines and the relationship with cancer-specific survival. Given the demonstration that treatment for cancer can result in financial (in addition to treatment-related) toxicity, lower cost/less intensive surveillance strategies might be attractive for many patients in the US and other first world countries, as well as in more resource-challenged environments [45,46,47,48]. Regardless of cost, shared decision making between patients and providers concerning the best surveillance strategy is appropriate to balance patient preferences, quality of life, suitability for curative treatment if recurrence is detected, risk of recurrence and likelihood of cure, and the presence of comorbid conditions which may present a much larger risk of short-term mortality [18, 49]. The purpose of comparative effectiveness research is to give patients and clinicians more information to inform these discussions [23,24,25,26]. Given the evidence from the current study and other recent investigations which have demonstrated the lack of cancer-specific and overall survival benefit with more surveillance testing, it may be appropriate to revisit guideline recommendations.

This study has a number of strengths. We leveraged the powerful SEER-Medicare database, which contains demographic, comorbidity, tumor-related, treatment, follow-up, vital status, and cause of death information for cancer patients diagnosed in one of the 17 SEER regions in the US. The SEER-Medicare files contain an enormous amount of claims data with redundancies across file types. This redundancy helps to ensure the accuracy of information, which is otherwise a limitation of claims-based studies. Finally, we developed an individual-level classification scheme to assess surveillance testing for up to 3 years following treatment completion. We feel that our approach, which was based on obtaining an individual-level, holistic evaluation of surveillance reflecting what patients actually received, is a significant strength of this study.

These results should be interpreted with full consideration of the study’s limitations. The main limitation concerns judgments regarding comparative effectiveness using an observational study design. The gold standard for determining treatment efficacy is the RCT, but these designs are often not ethical (randomizing to receive no surveillance testing despite recommendations) or feasible due to the costs of enrolling a large number of patients with years of follow-up [23, 24]. Our study enabled the evaluation of a range of surveillance testing experiences reflecting what patients actually received. However, it should be acknowledged that although we achieved balance on the measured, potential confounders available in our dataset, unmeasured prognostic factors may have remained associated with surveillance status. Despite these limitations, the conclusions of this study remain valid. That is, more surveillance testing does not improve cancer-specific or overall survival compared to less testing.

A criticism of this study could relate to the surveillance classification scheme. Unlike a controlled trial, every variation of surveillance testing received in real world clinical settings had to be categorized and combined for up to 3 years of follow-up. Although one could possibly differ with the classification scheme used in this study, the study conclusions are the same. That is, more surveillance testing did not confer a cancer-specific or overall survival benefit compared to less testing. Another weakness of the study is that information on tumor recurrence cannot be observed via the SEER-Medicare database and the reason for testing is unavailable. Thus, it was not possible to differentiate true surveillance testing of asymptomatic patients from diagnostic testing in patients presenting with symptoms. Finally, our conclusions concerning the older adult/elderly colon cancer population may not be directly generalizable to younger patients.

Conclusions

In a population of older adults with stage II and III colon cancer, more surveillance testing did not result in better cancer-specific or overall survival. These results support an individualized surveillance testing strategy that considers patients’ preferences, risk assessment of recurrence, and other individual-level clinical factors (e.g., comorbidity). In conjunction with results from recent studies [2, 4, 44], our findings may warrant a reconsideration of guidelines. At a minimum, the results of this study support shared decision making between older adult colon cancer patients and their healthcare providers. Efforts to ensure high quality cancer care which includes a patient-specific surveillance testing strategy are necessary to achieve the best clinical and patient-centered outcomes for stage II and III colon cancer patients in the survivorship phase of care.