Background

One of the current interests of medicine is to first identify and then integrate new prognostic markers in prediction models that can help distinguish cancer patients with different risks of disease outcome after diagnosis. Genetic sequence variations, such as single nucleotide polymorphisms (SNPs), may have biological roles in modifying the outcome risk and are the focus of many current prognostic studies. Among the genetic approaches applied, genomewide SNP survival association studies are considered useful because they examine a large number of genetic markers scattered along the DNA, covering the majority of the genomic regions. Several studies have applied this approach to identify genetic markers with prognostic associations in cancers such as pancreatic [1], esophageal [2], breast [3], and lung [4] cancers.

Colorectal cancer is a common malignancy [5]. Mortality rates of this disease have been decreasing in many countries including Asian [6], European [7], and North American countries (Canada [8] and the USA [9]). Yet even with this improvement in patient survival, the 5-year survival rate for this disease is estimated to be 62-64% in North America [10,11]. In other parts of the world, such as in India and Eastern European countries, these rates are lower [12]. In Canada, one of the highest incidence rates of colorectal cancer is observed in the Newfoundland population [10]. This population is also characterized by a high incidence of familial colorectal cancer [13] and by one of the lowest cancer survival rates in Canada [14].

The disease stage remains the most important indicator of prognosis in colorectal cancer patients. Research reported in the literature suggests that other factors may also modify the prognosis, such as age at diagnosis, tumor location [15], vascular/lymphatic invasion [16], and molecular features such as the microsatellite instability (MSI) status [17,18].

MSI-high (MSI-H) is observed in almost 15% of colon cancers and is characterized by inactivation of DNA mismatch repair genes, either by germline mutations in the MLH1, MSH2, MSH6, and PMS2 genes (the inherited colon cancer syndrome known as Lynch syndrome [19]) or by somatic promoter hypermethylation of the MLH1 gene (sporadic colorectal cancer with MSI-H [20]). Patients with MSI-H tumors have better survival rates than patients with MSI-low (MSI-L) or microsatellite stable (MSS) tumors and chromosomal instability-positive (CIN+) tumors [17,18]. In addition, the MSI-H tumor phenotype is rarely observed in rectal cancers [21], and colon and rectal cancers also differ from each other in terms of other molecular alterations, risk of recurrence, and treatment approaches [15,22,23].

Identification of genetic predictors of disease outcomes in cancer patients is a promising research aim. Until now, studies testing the potential of polymorphisms as prognostic markers in colorectal cancer have been restricted to candidate gene or pathway approaches. In this study, we performed, for the first time, a genomewide survival association study for patients with MSS or MSI-L tumors (MSS/MSI-L; n = 431). In addition, because of the differences between colon and rectal cancers, we investigated the prognostic associations of the same genetic markers in the rectal (n = 171) and colon cancer (n = 334) patients separately as an exploratory analysis.

Results

The baseline demographic, clinical and pathological data for the patients in the MSS/MSI-L, colon, and rectal cancer patient groups are shown in Table 1. The MSS/MSI-L patient group was the largest (n = 431), followed by colon (n = 334) and rectal (n = 171) patient groups. The number of events for overall survival (i.e. death) was 158, 105 and 65 in MSS/MSI-L, colon and rectal groups and for disease free survival (i.e. recurrence, metastasis or death) was 184, 121 and 79 in the same groups, respectively. For the entire cohort (n = 505), the 5-year and 10-year survival probabilities were 79.0% and 46.4% for OS and the 5-year and 10-year disease-free probabilities were 68.5% and 49.1%, respectively.

Table 1 Baseline characteristics of the MSS/MSI-L colorectal, colon and rectal cancer patients investigated in this study

The quantile-quantile (Q-Q) plots drawn for each sub-group and each outcome are shown in Additional file 1. These plots show that the p-values from the genome-wide scans follow the expected distribution, which suggests that the multivariable model settings are appropriate.

Possibly because of the limited sample size, no association reached the genome-wide significance level (p < 5.0E-08) in the models constructed. However, ten SNPs showed suggestive associations with OS or DFS times (p < 1.0E-06), which is the nominal significance cut-off level in our study (Table 2). Manhattan plots for the patient cohorts and outcomes are shown in Additional file 1. For interested readers, the list of associations detected at a p-value less than 1.0E-05 can be found in Additional file 2.
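
The three p-value cut-offs used in this section can be summarized in a short sketch. The function name, the tier labels, and the example p-values are hypothetical and serve only to illustrate how a result would be sorted into the tiers described above; they are not results from this study.

```python
# Thresholds as stated in the text.
GENOME_WIDE = 5.0e-8   # genome-wide significance level
NOMINAL = 1.0e-6       # study-specific nominal cut-off (Table 2)
REPORTED = 1.0e-5      # cut-off for the extended list (Additional file 2)


def classify_p_value(p):
    """Return the most stringent tier that a p-value passes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < GENOME_WIDE:
        return "genome-wide"
    if p < NOMINAL:
        return "nominal"
    if p < REPORTED:
        return "reported"
    return "not reported"


# Hypothetical p-values, for illustration only.
examples = {"snp_a": 3.2e-9, "snp_b": 4.0e-7, "snp_c": 6.5e-6, "snp_d": 0.02}
tiers = {name: classify_p_value(p) for name, p in examples.items()}
```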

Table 2 SNPs identified from six models with nominal significance (p < 1.0x10−6)

In the MSS/MSI-L patient group association study, there were two SNPs in the DFS model that achieved the nominal-significance level (Table 2). The HRs obtained as a result of the bootstrap method demonstrated the robustness of the results from the original association analysis (Additional file 3). One of these SNPs (rs6720296) was a frequent variant (MAF: 40%) and based on the information in the dbSNP database [24], was located in a non-coding RNA gene (long intergenic non-protein coding RNA 1121; LINC01121). The second marker (rs1407508) was a relatively infrequent SNP (MAF: 5.8%) located in a non-coding region on chromosome 9.

In the colon patient group analysis, two SNPs showed nominally significant signals in the OS or DFS models (Table 2). These SNPs were located in non-coding regions on chromosomes 20 and 14.

Among all groups, the rectal cancer patient group had the largest number of SNPs identified, the majority of which were intergenic. In this patient cohort, two SNPs in the OS model and four SNPs in the DFS model had p-values lower than the cut-off level (p < 1.0x10−6) (Table 2). Interestingly, one intergenic SNP was associated with both OS and DFS in this patient group (rs6854845). Among the six SNPs, rs17057166 was the only one located in a gene (non-coding RNA; AC011343.1).

None of the SNPs in Table 2 were amino acid changing SNPs (non-synonymous). As of February 2015, there was no publication in PUBMED about these SNPs. A search at the Regulome DB database [25] returned information for five of the polymorphisms (rs17026425, rs1407508, rs17280262, rs17057166, and rs6854845). According to these results, rs17026425 (Regulome DB score: 3a) is located within a transcription factor (such as JUND) binding site, yet it is “relatively less likely to affect protein binding”. For rs1407508 (Regulome DB score: 4), rs17280262, rs17057166, and rs6854845 (Regulome DB scores: 5), Regulome DB suggests that there is minimal or no evidence of proteins binding to the DNA sequences where these SNPs are located.

Additional information on the genes related to the SNPs with p-values less than 1.0x10−5 is shown in Additional file 4.

Discussion

In this study, we aimed to investigate the associations of a large number of SNPs (n = 729,737) with overall or disease free survival times in Caucasian colorectal cancer patients from Newfoundland, Canada. Our primary aim was to investigate the associations of SNPs with the disease outcome in the patients with MSI-L or MSS tumors (n = 431) as the outcome risks for this group of patients and the patients with the MSI-H tumors are significantly different from each other [17,18]. In addition, due to the differences between the colon and rectal cancer patients (for example the relatively decreased recurrence risk as well as the higher number of MSI-H tumors in colon cancer patients compared to the rectal cancer patients [15,21]), as an exploratory analysis, separate analyses for the colon (n = 334) and rectal (n = 171) cancer patients were also performed. Of note, due to the small number of the patients with MSI-H tumors (n = 53), we have not attempted the statistical analyses in this group.

The main result of this study is that none of the genetic markers investigated reached the genomewide significance level (p < 5.0E-08) in either overall or disease free survival analyses in any of the patient groups investigated. Thus, we were not able to identify a genetic marker with a strong association with the risk of the main clinical outcomes (i.e. death, local or distant metastases) in colorectal cancer. One interpretation is that none of the genetic markers investigated are related to the survival outcomes of interest. Alternatively, the sample sizes of the patient cohorts investigated in this study were small, so our study had limited power to detect possible associations (see Methods). Considering this power issue, in this manuscript we present and discuss the SNPs that did not reach the genome-wide significance level but had p-values below 1.0E-06 as potentially promising genetic markers (Table 2). We suggest that these markers should be investigated in larger cohorts to test whether they are associated with colorectal cancer prognosis. Further research may also be performed to test whether the two non-coding RNA genes (AC011343.1 and LINC01121) shown in Table 2 have prognostic roles in colorectal cancer. In addition, while there is currently no literature report on the biological functions of the SNPs in Table 2 or their relation to health and disease, Regulome DB [25] data suggest that one of the SNPs, rs17026425, is located within the binding site of JUN/JUND transcription factors. The mammalian JUN family of transcriptional regulators includes c-JUN, JUNB, and JUND, which have important roles in cell proliferation and carcinogenesis (reviewed in [26]). According to our results, the rs17026425 polymorphism was nominally associated with overall survival in the rectal cancer sub-group (Table 2). Further studies on the rs17026425 polymorphism may test its potential binding to the JUN family of transcription factors, its potential role in variable JUN function, and its contribution to rectal cancer formation and progression.

Conclusions

In conclusion, we performed genomewide SNP survival association studies in MSS/MSI-L, colon and rectal cancer patients. A limitation of this study is that all three patient cohorts investigated had small sample sizes; thus, we cannot confidently conclude whether the investigated genetic markers indeed have no prognostic associations in colorectal cancer. However, this study also generated a small set of SNPs that may be of interest to other investigators in their future analyses.

Methods

Patient cohort

Patients registered at the Newfoundland Colorectal Cancer Registry (NFCCR) were investigated in the present study. Characteristics of the NFCCR patient cohort were described earlier [13,27]. Briefly, between 1999 and 2003, a total of 750 participants were recruited to the NFCCR. Informed consent was obtained from the patients or their family members. Recurrence, metastasis, and vital status information for the patients was collected until 2010 as described in Negandhi et al. [28]. Tumor characteristics (i.e. MSI status) were determined as explained by Woods et al. [27]. Among the 750 patients, 736 stage I-IV patients had prognostic data collected during the follow-up period.

Genotyping and quality control (QC)

Germline DNA was isolated from patients’ blood samples. Initially, a total of 539 patients with available prognostic data and germline DNA were subject to whole-genome SNP genotyping using the Illumina® Omni1-Quad human SNP genotyping platform (service provider: Centrillion Biosciences, USA).

QC analyses on the genetic data were performed using PLINK v1.07 [29] and Eigensoft 4.2 [30]. We excluded 1) one subject because of mismatching sex information; 2) 129,172 SNPs with a high proportion of missing genotype data (above 5%); 3) 21 individuals who were first-, second-, or third-degree relatives (based on the identity by state PI_HAT score > 0.125); and 4) one subject with an extreme heterozygosity value (beyond 6 standard deviations). Additionally, the multidimensional scaling (MDS) method was used to identify individuals with divergent ancestry. Based on a comparison with the HapMap III samples in the MDS plot, six outliers located far from the Caucasian cluster were removed from the study (Additional file 5). Lastly, 275,285 SNPs and 320 SNPs were excluded because of failure in the Hardy-Weinberg equilibrium (HWE) test (p < 1.0E-08) and low minor allele frequency (MAF < 0.05), respectively.
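
The SNP-level filters described above (missing genotype rate, MAF, and HWE) can be illustrated with a minimal sketch that works on the genotype counts of a single SNP. This is not the PLINK implementation: the function name and its parameters are hypothetical, and the HWE p-value is obtained here from a simple 1-degree-of-freedom chi-square goodness-of-fit test, which is an assumption made for brevity and may differ from the test applied by the software used in the study.

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           max_missing=0.05, min_maf=0.05, hwe_p_min=1.0e-8):
    """Apply the three SNP-level filters to one SNP's genotype counts."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        raise ValueError("no called genotypes")
    missing_rate = n_missing / (n_called + n_missing)

    # Allele frequency of A from the called genotypes; MAF is the rarer allele.
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1.0 - freq_a)

    # Chi-square goodness-of-fit test against Hardy-Weinberg proportions
    # (assumption: an approximation chosen for this sketch).
    expected = (n_called * freq_a ** 2,
                2 * n_called * freq_a * (1.0 - freq_a),
                n_called * (1.0 - freq_a) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df

    passed = (missing_rate <= max_missing
              and maf >= min_maf
              and hwe_p >= hwe_p_min)
    return {"missing_rate": missing_rate, "maf": maf,
            "hwe_p": hwe_p, "passed": passed}


# Hypothetical genotype counts, for illustration only.
kept = snp_qc(250, 200, 50, 5)       # common SNP close to HWE proportions
all_het = snp_qc(0, 500, 0, 0)       # extreme excess of heterozygotes
rare = snp_qc(480, 20, 0, 0)         # MAF of 0.02
```

In this sketch a SNP is retained only if it passes all three filters at the thresholds quoted in the text.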

Principal component analysis was undertaken with the EIGENSTRAT package after QC filtering (Additional file 5). The first two principal components were incorporated into the models. Additional analyses for population stratification were undertaken for each genetic marker, adjusting for the estimated principal components. As a result, five population outliers were identified and excluded from the analysis.

After these analyses, a total of 505 subjects and 729,737 SNPs were included in the final analysis. The adequacy of the probability distribution and possibility of differential genotyping were formally evaluated using the Q-Q plots of –log10(p-values) (Additional file 1).
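
As an illustration of how the coordinates of such a Q-Q plot can be obtained, the sketch below pairs each observed –log10(p-value) with the value expected under a uniform distribution of p-values. The function name and the plotting positions (rank − 0.5)/n are assumptions made for this example and are not taken from the analysis pipeline of this study.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first."""
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(p_values)
    ordered = sorted(p_values)
    # Under the null hypothesis the k-th smallest of n p-values is expected
    # near (k - 0.5) / n (assumed plotting position).
    return [(-math.log10((rank - 0.5) / n), -math.log10(p))
            for rank, p in enumerate(ordered, start=1)]


# Hypothetical p-values, for illustration only.
points = qq_points([0.91, 0.02, 0.47, 0.33, 0.0004, 0.68])
```

Points lying close to the diagonal (observed equal to expected) over most of the range are consistent with a well-calibrated test statistic, whereas an early departure from the diagonal would suggest inflation.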

Association analyses

We examined the associations of SNPs with two major clinical outcomes: overall survival (OS) and disease free survival (DFS). We define OS as the time from the date of diagnosis until the date of death from any cause or the date of last follow-up, and DFS as the time from the date of diagnosis until the date of local recurrence, metastasis, or death from any cause, or the date of last follow-up. OS or DFS status was available for 504 patients.
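
The two endpoint definitions can be made concrete with a small sketch that derives a follow-up time and an event indicator from patient dates. The function names, argument names, and example dates are hypothetical; times are expressed in days here purely for illustration.

```python
from datetime import date


def survival_endpoint(diagnosis, last_follow_up, event_dates):
    """Return (time_in_days, event_indicator) for one patient.

    The time runs to the earliest qualifying event if one occurred,
    otherwise to the last follow-up (censored observation).
    """
    observed = [d for d in event_dates if d is not None]
    if observed:
        return (min(observed) - diagnosis).days, 1
    return (last_follow_up - diagnosis).days, 0


def os_and_dfs(diagnosis, last_follow_up,
               death=None, recurrence=None, metastasis=None):
    """OS counts death only; DFS counts recurrence, metastasis, or death."""
    return {
        "os": survival_endpoint(diagnosis, last_follow_up, [death]),
        "dfs": survival_endpoint(diagnosis, last_follow_up,
                                 [recurrence, metastasis, death]),
    }


# Hypothetical patients, for illustration only.
relapsed = os_and_dfs(date(2001, 3, 1), date(2005, 3, 1),
                      death=date(2005, 3, 1), recurrence=date(2003, 3, 1))
censored = os_and_dfs(date(2002, 6, 15), date(2010, 6, 15))
```

For the first hypothetical patient the DFS time ends at the recurrence, while the OS time continues until death; the second patient is censored for both outcomes at the last follow-up.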

Associations between the genetic markers and the survival outcomes (OS and DFS) were tested using Cox proportional hazards models under an additive genetic model, adjusting for the top two principal components and other confounding factors (gender and stage for the MSI-L or MSS sub-group; gender, stage, and MSI status for the colon and rectal sub-groups). Hazard ratio (HR) estimates and the corresponding 95% confidence intervals (CIs) were calculated using the R statistical programming language [31].
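
To show how an HR and its 95% CI arise from a Cox model with additive genotype coding, the following is a deliberately simplified sketch: a single-covariate Cox model fitted by Newton-Raphson on the partial likelihood, using the Breslow approximation for tied event times. It is not the R code used in this study, it omits the adjustment for principal components and clinical covariates, and the function name, the Wald-type interval, and the example data are all assumptions made for illustration.

```python
import math


def cox_single_covariate(times, events, x, max_iter=50, tol=1e-9):
    """Fit a one-covariate Cox model by Newton-Raphson (Breslow ties).

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored
    x      : covariate, e.g. allele count 0/1/2 under additive coding
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})

    def score_and_information(beta):
        score, info = 0.0, 0.0
        for t in event_times:
            s0 = s1 = s2 = 0.0
            for tj, xj in zip(times, x):
                if tj >= t:  # subject j is still at risk at time t
                    w = math.exp(beta * xj)
                    s0 += w
                    s1 += w * xj
                    s2 += w * xj * xj
            failed = [xi for ti, ei, xi in zip(times, events, x)
                      if ei == 1 and ti == t]
            mean = s1 / s0
            score += sum(failed) - len(failed) * mean
            info += len(failed) * (s2 / s0 - mean * mean)
        return score, info

    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_information(beta)
        if info <= 0.0:
            raise ValueError("covariate does not vary within the risk sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break

    _, info = score_and_information(beta)
    se = 1.0 / math.sqrt(info)
    return {
        "beta": beta,
        "hr": math.exp(beta),
        "ci95": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
        "p": math.erfc(abs(beta / se) / math.sqrt(2.0)),  # two-sided Wald test
    }


# Hypothetical follow-up data and genotypes, for illustration only.
times = [5, 8, 12, 14, 20, 22, 30, 31, 40, 45]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
genotype = [2, 1, 2, 0, 1, 1, 0, 1, 0, 0]
fit = cox_single_covariate(times, events, genotype)
```

Under additive coding, the estimated HR is interpreted as the multiplicative change in hazard per additional copy of the counted allele.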

Clinical factors were first evaluated as potential confounding factors in the genetic analysis using univariate analysis. Among all the clinical factors, gender, stage, and MSI status were found to be significantly associated with OS (p-values: 0.0151, 2.2E-16, and 5.2E-05, respectively), and gender, site (colon/rectum), stage, and MSI status were found to be significantly associated with DFS (p-values: 0.0139, 0.0344, 3.3E-16, and 1.2E-04, respectively). These clinical variables were used to build baseline multivariable models for OS and DFS separately in each sub-cohort.

In order to present the genetic associations with the outcome variables in more detail, the MSS/MSI-L, colon cancer, and rectal cancer patient groups were analysed separately using Cox proportional hazards models. Since two outcome measures (overall and disease free survival) were examined in each of the MSS/MSI-L, colon, and rectal cancer patient cohorts, a total of six models were constructed. While the genetic markers were modeled using additive genetic models for the genome-wide screening, dominant, recessive, and co-dominant genetic models were also applied as sensitivity analyses to assess the genetic inheritance effect. For each scenario, two types of models were explored, namely crude and comprehensive models: for the crude model, we adjusted for principal components only; for the comprehensive model, we adjusted for principal components and the baseline model factors. In this report we show the comprehensive models.
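
The four genetic models mentioned above differ only in how a genotype is turned into model covariates. The sketch below shows one common coding, assuming that the genotype is expressed as the number of copies (0, 1, or 2) of the allele of interest; the function name and this convention are assumptions for illustration, and the enumeration of the six cohort-by-outcome models simply restates the design described in the text.

```python
from itertools import product


def encode_genotype(allele_count, model):
    """Return the covariate tuple for one genotype under a genetic model."""
    if allele_count not in (0, 1, 2):
        raise ValueError("allele count must be 0, 1, or 2")
    if model == "additive":      # effect per copy of the allele
        return (float(allele_count),)
    if model == "dominant":      # one or two copies vs none
        return (1.0 if allele_count >= 1 else 0.0,)
    if model == "recessive":     # two copies vs zero or one
        return (1.0 if allele_count == 2 else 0.0,)
    if model == "codominant":    # separate indicators for one and two copies
        return (1.0 if allele_count == 1 else 0.0,
                1.0 if allele_count == 2 else 0.0)
    raise ValueError("unknown genetic model: " + model)


# Three cohorts and two outcomes give the six models described in the text.
SIX_MODELS = list(product(("MSS/MSI-L", "colon", "rectal"), ("OS", "DFS")))
```

Note that the co-dominant coding uses two covariates, so it yields two HRs per SNP rather than one.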

In order to validate the results obtained from the MSS/MSI-L cohort association study, we applied the bootstrap resampling method. Based on the original data set, 200 bootstrap data sets were generated by sampling with replacement. For each of these bootstrap data sets, we applied the Cox proportional hazards regression model to the SNPs with significance levels of p < 1.0x10−5 identified in the original analysis and calculated the corresponding HRs of the genetic effect. Overall, 200 HRs were obtained for each SNP and bootstrap confidence intervals were created.
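
The resampling scheme can be sketched as follows. The sketch is generic: the estimator passed in is a placeholder for refitting the Cox model on each bootstrap data set, and a crude event-rate ratio between carriers and non-carriers stands in for the HR so that the example stays self-contained. The function names, the percentile-type interval, the fixed random seed, and the example records are all assumptions; the text does not specify which type of bootstrap interval was used.

```python
import math
import random


def bootstrap_estimates(records, estimator, n_boot=200, seed=0):
    """Re-estimate a statistic on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(records)
    return [estimator([records[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]


def percentile_interval(estimates, level=0.95):
    """Percentile interval from sorted estimates (assumed interval type)."""
    ordered = sorted(estimates)
    alpha = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return (ordered[math.floor(alpha * last)],
            ordered[math.ceil((1.0 - alpha) * last)])


def crude_rate_ratio(sample):
    """Stand-in for the HR: events per unit of follow-up, carriers vs others."""
    def rate(group):
        n_events = sum(e for c, _, e in sample if c == group)
        follow_up = sum(t for c, t, _ in sample if c == group)
        return n_events / follow_up
    return rate(1) / rate(0)


# Hypothetical records (carrier, follow-up time, event), illustration only.
records = ([(1, t, 1 if t % 5 else 0) for t in range(1, 31)]
           + [(0, t, 1 if t % 3 else 0) for t in range(11, 41)])
estimates = bootstrap_estimates(records, crude_rate_ratio)
lower, upper = percentile_interval(estimates)
```

An interval that excludes 1 across the resamples would be read as support for the robustness of the original estimate, in the spirit of the validation described above.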

We computed the study power for the MSS/MSI-L cohort (n = 431) using PASS 13 (NCSS, LLC. Kaysville, Utah, USA). Assuming a SNP in linkage disequilibrium (LD; D’ = 1) and a risk allele frequency of 0.3, we have a power of 0.76 to detect a nominally significant association at p < 1.0x10−6 under a dominant model with a strong effect size of HR 2.0 in the MSS/MSI-L cohort. To detect an association under the same assumptions at the p < 5E-08 significance level, the statistical power is reduced to 0.56. With a moderate effect size of HR 1.5, the power to detect a genome-wide significant association (p < 5E-08) is very low (0.08).