Background

Colorectal cancer is a global health problem and contributes substantially to worldwide cancer mortality [1]. In 2012, this disease was the 3rd most common cancer worldwide with higher rates occurring in developed countries [1]. In Canada, colorectal cancer is expected to cause 26,800 new cases and 9400 deaths in 2017. Newfoundland and Labrador, in particular, have the highest age-standardized rates of incidence and mortality in the country [2].

Mucins are a family of high-molecular-weight glycoproteins that are widely expressed by epithelial tissues [3]. According to the HGNC database [4], there are 22 members in this family that can be expressed in various tissues. They have been identified in two forms: cell surface (transmembrane), such as MUC1 and MUC4, and fully released (gel-forming) [3, 5, 6]. The gel-forming mucin-encoding genes are clustered at chromosome 11p15.5 [5, 7, 8]. These mucins, including MUC2, MUC5AC, MUC5B, and MUC6, constitute the major macromolecular components of mucus [5, 7, 9]. Among them, MUC2 is the most highly expressed one in the colorectum and is the predominant component of colorectal mucus [10,11,12]. MUC5B and MUC6 are highly expressed in the upper gastrointestinal (GI) tract, but low levels of both have been reported in the normal colon [12, 13]. MUC5AC is highly expressed in the upper GI tract and is not expressed in the normal colon, however, abnormal expression is observed in colorectal cancer [14,15,16].

Mucinous adenocarcinoma is a distinct form of colorectal cancer with the defining characteristic of a high mucin component (more than 50% of the tumor volume). This subtype accounts for 5–15% of colorectal cancer cases. Compared to non-mucinous colorectal cancer, mucinous adenocarcinoma patients are typically younger and are often at an advanced stage at diagnosis [17,18,19,20,21,22,23]. Mucinous tumors are more likely to occur in the proximal colon [20, 21, 24, 25] and tend to have an inferior response to systemic therapies [25, 26].

Specific molecular distinctions are also seen in mucinous compared to non-mucinous colorectal tumors, for example, increased rates of BRAF mutations and CpG island methylator phenotype (CIMP) [27]. In addition, overexpression of MUC2, strong ectopic expression of gastric MUC5AC, and decreased p53 expression in mucinous tumors are reported in the literature [28, 29]. Mucinous and non-mucinous tumors also appear to have differences in genome-wide gene expression patterns [23]. Some of the upregulated genes in mucinous tumors are involved in cellular differentiation and mucin metabolism, which are characteristics biologically relevant to the phenotype [23]. While the differences between mucinous and non-mucinous colorectal cancers are well recognized, the prognostic importance of a high mucin component has been controversial [19,20,21, 25, 26, 30,31,32,33,34,35].

Most studies investigating characteristics of mucinous colorectal tumors examined single or a limited number of candidate genes [10, 36, 37]. This study aimed to comprehensively identify common and rare genetic polymorphisms that may be influencing the production of mucin or formation of the mucinous tumor phenotype. To do so, we applied a genome-wide approach to identify genes and genetic regions that are associated with the risk of developing the mucinous tumor phenotype.

Methods

Patient cohort

The study cohort was a subgroup of the Newfoundland Colorectal Cancer Registry (NFCCR) and consisted of 505 Caucasian patients. Both the NFCCR and the study cohort were described in detail in other publications [38, 39]. In short, the NFCCR recruited 750 colorectal cancer patients in Newfoundland and Labrador collected between 1999 and 2003. All diagnoses were confirmed by pathological examination. Out of 750 patients, 505 patients constituted the study cohort as explained below.

Genotype data

The genotype data used in this study was explained in Xu et al. (2015) [39]. In short, DNA samples of 539 patients were subject to whole-genome single nucleotide polymorphism (SNP) genotyping using the Illumina Omni1-Quad human SNP genotyping platform (Centrillion Bioscience, USA). These patients were included into the genetic analysis because of the availability of their outcome and clinical data as well as the germline DNAs extracted from peripheral blood samples. The quality control analysis and filtering for this data included removing SNPs whose frequencies deviated from Hardy-Weinberg equilibrium, SNPs that had >5% missing values, and patients with discordant sex information, accidental duplicates, divergent or non-Caucasian ancestry, and first, second, or third degree relatives [39]. In Xu et al. (2015) [39], 505 patients were examined to investigate associations between overall and disease-free survival times after colorectal cancer diagnosis and genetic polymorphisms with a minor allele frequency (MAF) of at least 5%. In our study, there were 505 patients with 729,373 common SNPs (MAF ≥0.05) and 275,645 rare SNPs (MAF <0.05) that were included. No SNP was excluded due to high or perfect linkage disequilibrium (LD) with other SNPs. During this study, management and handling of these genotype data was done using PLINK v. 1.07 [40].

Statistical analysis

The response variable is a binary variable indicating existence of mucinous tumor histology or non-mucinous tumor histology.

Common SNP analysis

Univariable logistic regression analysis

Univariable logistic regression analysis was performed on each common SNP (MAF ≥5%) to determine if individual SNPs were significantly associated with mucinous tumor phenotype (i.e. mucinous versus non-mucinous tumor histology). For each SNP, the additive, co-dominant, dominant, and recessive genetic models were applied. Consequently, we report the 10 SNPs without excluding those in high LD with the highest level of significance in each genetic model (Additional file 1: Tables S1-S4).

Selection of baseline variables and multivariable logistic regression analysis

In order to select significant baseline factors to adjust for in the multivariable analyses, we first examined the variables shown in Table 1 using univariable logistic regression models. These variables were selected for inclusion into the selection process based on previous studies investigating mucinous colorectal tumors [27, 33]. Factors that had a p-value less than 0.1 were then included in a forward stepwise variable selection method. In addition, although there appeared to be a non-significant association between tumor histology and grade in the univariable analysis, tumor grade was still included in the multivariable model as has been shown to be linked to tumor histology [30, 41]. As a result, the baseline characteristics in the final models were sex, age at diagnosis, stage, and tumor location based on the 0.1 level of significance, and tumor grade (Additional file 1: Table S5). The 10 SNPs with the highest level of significance under each genetic model in the univariable logistic regression analysis were analyzed using the multivariable logistic regression model adjusting for the selected baseline characteristics (Additional file 1: Tables S1-S4).

Table 1 Baseline features of the study cohort and the results of univariable logistic regression analysis

Plausibility of the genetic models

It is common in genetic association studies that only one genetic model is applied. In this study, we applied all four genetic models and assessed the plausibility of the genetic model under which the SNP was identified. To do this, we used the Akaike Information Criterion (AIC) calculations to compare the fit of four different genetic models per SNP under the multivariable logistic regression model. The genetic model with the smallest AIC estimate was considered to be the most plausible genetic model (i.e. the best fitting model). We first ranked the SNPs based on their p-value obtained in the multivariable model with the genetic model under which the SNP was identified (Additional file 1: Table S6). Then, we excluded those SNPs that were not identified in their plausible genetic model. Of note, we present in this manuscript only the 10 SNPs that have the highest association significance levels under the multivariable logistic regression models that were identified in their most plausible genetic model. We refer to these SNPs as “the top 10 SNPs”. The LD between SNPs was not taken into account when listing the top 10 SNPs.

Assessing the discriminatory accuracy of the estimated models

We aimed to check the ability of the multivariable models of the top 10 SNPs to discriminate between mucinous and non-mucinous phenotypes. A well-known method for assessing the discriminatory accuracy of a model is using a receiver operating characteristic (ROC) curve [42,43,44]. Calculating the area under the curve (AUC) of the ROC curve for the given models provides a single numeric representation for the performance of the model [43, 45, 46]. Comparing the AUC values and their corresponding confidence intervals provides a method for determining if one model is significantly superior to another in discriminatory accuracy [44, 47].

ROC curve analysis was performed by calculating the AUC using the pROC package in R [48]. The AUC estimates for (i) the model conditioning only on the baseline characteristics, (ii) the model conditioning on only the top SNPs, and (iii) the model conditioning on the baseline characteristics and the top SNPs. Comparing the AUC, specifically the 95% confidence intervals, between these three models can quantify the differences in the capacity of the models to distinguish mucinous and non-mucinous phenotypes.

Rare variant analysis

SKAT-O analysis

SKAT-O [49] test statistic was used to test the associations between the rare variants and the mucinous tumor phenotype. For this analysis, we prioritized gene-based regions including 5 kb long sequences before and after each gene. To do so, we first obtained genome location information for genome-wide gene-based regions (for the reference genome GRCh37.p13) using the biomaRt tool [50] in the Ensembl database [51]. The SNP information within these regions were then retrieved from the patient genome-wide data and used as the region-based SNP-sets in SKAT-O. During this analysis, each SNP was assigned to one gene-based region only. As a result, when a gene is located in close proximity to another gene, the second gene-based region does not include the SNPs that are analyzed in the first gene-based region. This limits redundancy since no SNP is analyzed more than once. For this analysis, only the additive genetic model was considered as using multiple genetic models is not a practical option for SKAT-O. The associations of gene regions were examined in multivariable models, adjusting for the significant baseline characteristics sex, age at diagnosis, stage, tumor location, and tumor grade.

All statistical analysis was performed using R v. 3.1.3 [52]. Correction for multiple testing was not applied to the results as this is an exploratory study and we did not want to increase false negative rate due to conservative corrections. While this increases the chances of obtaining false positives, we believe replication of these results in other studies will assist in reducing the potential false positive findings.

Bioinformatics analysis

Potential regulatory consequences of the identified SNPs were examined through RegulomeDB (http://www.regulomedb.org/) [53]. Ensembl [51] database was used to retrieve information related to the genes identified in the common and rare variant analysis.

Results

The demographic and clinicopathological information for the sample population is shown in Table 1. We observed a non-significant association of histology with age at diagnosis (>65 versus ≤60), grade, microsatellite instability (MSI) status, lymphatic invasion (LI), and BRAF V600E mutation; a moderately significant association with stage, sex, and age at diagnosis between 60 and 65 versus ≤60; and a strongly significant association with tumor location (Table 1). In this cohort, there was a trend for female sex having increased risk of mucinous tumors. As expected, the proportion of mucinous tumors was higher in colon cancer patients compared to rectum cancer patients and in stage II-IV patients compared to stage I patients (Table 1).

Common SNP analysis

None of the associations in this analysis reached the traditional genome-wide significance level (P < 5 × 10−8), but each genetic model identified promising associations.

After the univariable analysis, there were 33 SNPs that were nominally associated with the mucinous tumor phenotype (Additional file 1: Tables S1-S4). Associations of two SNPs (rs11216624 & rs17712784) were identified in both the dominant and co-dominant genetic models; one SNP (rs7314811) was detected in the additive, recessive, and co-dominant genetic models; and three SNPs (rs4843335, rs10511330, & rs16822593) were detected in both the additive and dominant genetic models. The estimates obtained in the univariable analysis did not change significantly when the models were adjusted for the baseline characteristics (Additional file 1: Tables S1-S4).

As explained in the Methods section, the AIC estimates (Additional file 1: Table S6) were used to determine the most plausible genetic models for each of 33 SNPs. Ten SNPs with the smallest p-value in the multivariable analysis under the most plausible genetic models were further prioritized (i.e., the top 10 SNPs). The results of the univariable and the multivariable logistic regression analyses for these top 10 SNPs are summarized in Table 2. Seven of these SNPs were located within gene sequences. These genes were quite diverse and belong to a variety of biological processes and pathways (Table 3).

Table 2 Top ten promising SNPs identified in univariable analysis and the subsequent multivariable analysis under their plausible genetic models
Table 3 Genes identified in the common and rare analyses

Before the ROC analysis, the LD among the top 10 SNPs were assessed using patient genotype data. These calculations indicated that rs13019215 and rs12471607 were in complete pairwise LD (r2 = 1). The SNPs rs4837345 and kgp10457679 were also in high LD with each other, as well as rs10511330 and rs16822593 (0.99 ≤ r2 ≤ 1.0). Therefore, we kept one SNP per SNP set in high LD, which left the following SNPs for the ROC analysis: rs9481067, rs10511330, rs13019215, rs716897, rs4843335, rs11968293, and kgp10457679.

Figure 1 shows the ROC curves comparing the accuracy of the models to discriminate mucinous and non-mucinous tumor phenotypes. The model (iii) including both the baseline characteristics and the SNPs (AUC = 0.916, CI: 0.873–0.960) had the most discriminatory accuracy followed by model (ii) including only the SNPs (AUC = 0.868, CI: 0.813–0.923) and model (i) including only the baseline characteristics (AUC = 0.703, 95% CI: 0.634–0.773). Since the confidence intervals of models (i) and (iii) do not overlap, we can confidently claim that there is a statistically significant improvement in the discriminating accuracy of the model containing the SNPs [44, 47]. This also suggests that these SNPs explain some of the variation between the mucinous and non-mucinous tumor phenotypes.

Fig. 1
figure 1

ROC curves and corresponding AUC values for multivariable models. Due to high LD among some of the top 10 SNPs, ROC analysis was performed on only the following SNPs: rs9481067, rs10511330, rs13019215, rs716897, rs4843335, rs11968293, and kgp10457679. AUC: area under the ROC curve, CI: confidence interval, LD: linkage disequilibrium, ROC: receiver operator characteristic

Rare SNP analysis

In the gene region-based rare variant analysis, we investigated 29,966 regions in the patient cohort using the multivariable SKAT-O method. Table 3 and Table 4 summarize the most significant regions (P < 10−4) that potentially contain causal rare variants associated with the mucinous tumor phenotype. The number of variants aggregated in these gene-based regions varied from 5 to 10. While three of these regions (including the SEC24B, SEC24B-AS1, and CCDC109B regions) were located close to each other on chromosome 4, other regions come from different parts of the genome (Table 4).

Table 4 Most significant gene regions identified from SKAT-O multivariable analysis

Discussion

Mucinous tumors are considered an aggressive type of colorectal tumors that are poorly understood [22, 24, 54]. While their role in prognosis is not well established, several studies suggested these tumors are associated with poorer prognosis when compared to non-mucinous tumors [25, 26, 32, 33, 35]. Identification of genes and genetic variations that can have a role in mucinous tumor development, therefore, has both scientific (e.g. dissecting the biology behind the mucinous tumor histology) as well as clinical value (e.g. biological information gained may assist with development of targeted treatment for this cancer subtype). Accordingly, for the first time with this study, we examined associations of both common and rare variants with the risk of developing the mucinous tumor phenotype using a genome-wide dataset.

While our results did not reach the conservative genome-wide significance level, promising associations were detected in both the common and rare variant analyses. In common SNP analysis, we identified seven unlinked polymorphisms that significantly increased our capacity to discriminate between mucinous and non-mucinous tumor phenotypes (Fig. 1, Table 2). Their effects on tumor histology were independent from the effects of the baseline variables (Fig. 1, Table 2). It is possible these polymorphisms (or others in high LD with them (Additional file 1: Table S7), including three additional SNPs shown in Table 2) are biologically linked to tumor histology or mucin production. Since there was no reported functional consequence of these SNPs in the literature, we searched the RegulomeDB database [53] for their potential biological characteristics. As of March 2018, the only SNP with a predicted/reported regulatory function in this database was kgp10457679 (rs10819474) (RegulomeDB score = 1f). This intergenic SNP is categorized as an expression quantitative trait locus (eQTL)/Transcription Factor (TF) binding/DNAse peak site, with a likely role of influencing the expression of target genes (Additional file 1: Table S8). Specifically, PPP2R4 is noted as the eQTL for this SNP. PPP2R4 is a tumor suppressor protein [55] which has been shown to have low activity in a large portion of a small cohort of colorectal tumors [56] and is associated with shorter survival times in metastatic colorectal cancer patients [57]. A potential link of PPP2R4 to mucinous tumor phenotype risk should be examined in further studies. Interestingly, one GWAS identified a SNP within the sequences of ZBTB20, other than the one reported in this study, that is significantly associated with the risk of non-cardia gastric cancer in the Han Chinese population [58]. Overall, all the novel loci identified by the common variant analysis are interesting candidates in examination of mucinous tumor development.

Typical association studies, such as the common variant analysis, focus on a variant-by-variant approach, which is underpowered for rare variants. It has been suggested that gene/region-based approaches can be useful in increasing the power under these circumstances where the direct effects of multiple variants on a phenotype can be examined [59]. Hence, in this study, we performed the first rare variant analysis to explore gene regions that may have a role in mucinous tumor formation using SKAT-O [49]. SKAT-O is a multi-marker association test which has reasonable type I error rate and is a powerful test under many scenarios [49]. In our study, this method identified a number of gene-based regions that may harbor rare variants associated with mucinous phenotype (Tables 3 and 4). Interestingly, three of the gene-based regions in Table 4 (SEC24B, SEC24B-AS1, and CCDC109B-based regions) were located in a 341,243 bp long genomic region on chromosome 4q. Since we assigned each SNP to only one gene region, these results suggest that these three gene regions are associated with the mucinous phenotype independent of each other. A search on the RegulomeDB database [53] indicated that one of the SNPs in LINC00596 (rs8005541) could have a strong regulatory function (RegulomeDB score = 1f). This variant is located in an eQTL and seems to affect the expression of two nearby genes; DHRS4 and DHRS4L2. These two genes are a part of a gene cluster on chromosome 14 that code for dehydrogenases/reductases [60] and have not been previously linked to mucinous tumors. Similarly, none of the genes in Table 4 had a previously identified connection to the risk of developing mucinous tumors. In conclusion, these regions, genes, or SNPs, alone or in combination, may be influential on the mucinous tumor phenotype and should be explored further.

Several strengths and limitations of this study should be mentioned. Studying the mucinous tumor phenotype is inherently challenging since it is not frequently detected. Despite this and the large number of SNPs/gene-based regions investigated, this study identified promising genetic variants and genomic regions that may have a biological connection to the mucinous tumor phenotype. We are aware that our results need to be replicated in independent cohorts and remain to be verified. Of note, SNPs and genetic regions we report are different than the MUC genes, which are the typical candidate genes for mucin production and mucinous phenotype. In the common variant analysis, the recessive and co-dominant models yielded some high odds ratio estimates but also wide confidence intervals (as expected, as these are the models with relatively low power). Consequently, the interpretation of these results should be made with caution. SKAT-O is a robust test and an attractive choice for rare variant analysis, however, it cannot determine which SNPs or how many SNPs within a SNP-set are truly associated with the phenotype. Also, in the rare variant analysis, due to the assignment of one SNP to one gene region, there could be some genes whose associations may have been missed. In addition, in contrast to previous studies, we used a comprehensive genome-wide SNP genotype data, however, analysis of a more comprehensive data (such as those obtained by whole genome sequencing) would be desirable. This is particularly true for rare variants as most genotyping technologies target primarily common SNPs.

Conclusions

In this study, we performed the first genome-wide association study on common and rare SNPs in colorectal cancer patients to identify novel genetic associations with the mucinous tumor phenotype. We identified novel, promising, and independent associations of specific SNP genotypes with the risk of developing mucinous tumors. In the common and rare variant analysis, we reported SNPs within the sequences of genes encoding transporter proteins, such as SLC22A16 and SLC35F1, which may have a role in transporting molecules related to excessive mucin production. In addition, the rare variant analysis reported associations with several regulating RNA molecules, which may influence the expression of genes related to mucin production. Finally, the common SNP analysis reports genes whose protein products are involved in DNA replication (CCDC141) and transcription (ZBTB20) that could have downstream effects on the mucin genes. Furthermore, the common SNPs reported in this study significantly improved the discriminatory accuracy of the multivariable model to distinguish between mucinous and non-mucinous tumors. In addition, we detected novel promising associations between gene-based sets of rare SNPs and mucinous tumors. The results of this study, once replicated in other cohorts, can contribute further information to the molecular characteristics of this under-studied but clinically important colorectal cancer subtype.