A survey of tandem repeat instabilities and associated gene expression changes in 35 colorectal cancers
Colorectal cancer is a major contributor to cancer morbidity and mortality. Tandem repeat instability and its effect on cancer phenotypes remain so far poorly studied on a genome-wide scale.
Here we analyze the genomes of 35 colorectal tumors and their matched normal (healthy) tissues for two types of tandem repeat instability, de-novo repeat gain or loss and repeat copy number variation. Specifically, we study for the first time genome-wide repeat instability in the promoters and exons of 18,439 genes, and examine the association of repeat instability with genome-scale gene expression levels. We find that tumors with a microsatellite instable (MSI) phenotype are enriched in genes with repeat instability, and that tumor genomes have significantly more genes with repeat instability compared to healthy tissues. Genes in tumor genomes with repeat instability in their promoters are significantly less expressed and show slightly higher levels of methylation. Genes in well-studied cancer-associated signaling pathways also contain significantly more unstable repeats in tumor genomes. Genes with such unstable repeats in the tumor-suppressor p53 pathway have lower expression levels, whereas genes with repeat instability in the MAPK and Wnt signaling pathways are expressed at higher levels, consistent with the oncogenic role they play in cancer.
Our results suggest that repeat instability in gene promoters and associated differential gene expression may play an important role in colorectal tumors, which is a first step towards the development of more effective molecular diagnostic approaches centered on repeat instability.
Keywords: Tandem repeats, Colorectal cancer, MSI, Microsatellite instability, Expression, Repeat instability, Microsatellite, Cancer pathways, Cancer genes, Wnt signaling pathway, p53 pathway, Methylation, Hypermethylation
Microsatellites, short tandem DNA repeats, are among the most variable loci in the human genome. They experience mutations in the copy number of their repeat units at a rate of 10⁻³ to 10⁻⁷ per cell division [1, 2]. Most such mutations result from replication slippage that escaped the proofreading activity of mismatch repair systems. To date, microsatellite instability – an increased propensity of a microsatellite to suffer length-altering mutations – has been linked to at least 40 monogenic disorders [4, 5]. Such instability is also commonly observed in many cancers, including colorectal, gastric, endometrial, ovarian, and breast cancer [6, 7]. Among them, colorectal cancer, the third most commonly diagnosed cancer in the world and the second leading cause of cancer-related deaths in western societies [8, 9], shows several phenotypically distinct subtypes. Of these, tumors with a microsatellite instable (MSI) phenotype are found in at least 15 % of sporadic colorectal cancers and in almost all hereditary colorectal cancers. MSI tumors differ substantially from other tumors in their gene expression and methylation patterns [11, 12, 13].
Several studies reported gene expression changes associated with tandem repeat mutations in human carcinomas. For example, a CAG tri-nucleotide repeat associated with prostate cancer has been identified in the first exon of the androgen receptor gene. Expansion of this repeat decreases gene expression, and increases disease incidence and tumor aggression . Another example involves mutations in the promoter of the telomerase reverse transcriptase (TERT) gene, which causes overexpression of the gene, and is a key mechanism behind some types of cancer . In breast cancer, a dinucleotide CA-repeat within the first intron of the epidermal growth factor receptor (EGFR) gene correlates with the gene’s transcription levels. Mutant alleles of the highly polymorphic 28 base pair long repeat in the downstream region of the proto-oncogene HRAS1 significantly increase disease susceptibility for many cancers, including breast cancer, colon cancer, rectal cancer, bladder cancer, and leukemia .
Several studies of colorectal adenomas showed that tumors with mutations in different genes have distinctive expression and methylation patterns [13, 17, 18]. The patterns detected from such large-scale gene expression data sets are being used to stratify colorectal tumor subtypes [17, 19]. A study on comparability of gene expression changes in colorectal cancer, based on data produced in various laboratories, showed that on average 95 % of genes show consistent gene expression changes between two major subtypes of colorectal cancer, independent of the data source . Despite many studies on colorectal cancer, current therapeutic approaches cure only a fraction of patients [10, 21], which necessitates a more complete understanding of the kinds of mutations that contribute to tumorigenesis and their impact on tumor phenotypes. Although copy number variations of long DNA stretches, and single nucleotide polymorphisms have received much attention in colorectal tumors [13, 22], a genome-wide analysis of tandem repeat instabilities is currently not available. Most work on repeat instability in colorectal cancer focuses on variation between tumor and matched normal genomes in merely five marker repeats , a tiny fraction of the more than 3 million human microsatellite loci [24, 25]. Recent advances in next generation sequencing and accurate repeat genotyping algorithms enabled us to investigate repeat variation in tumor genomes more comprehensively, and to study their potential consequences on gene expression.
Here we analyze tandem repeat variation in 35 colorectal tumors and their matched normal genomes in proximal (near-gene) promoter regions and exons of 18,439 genes, as well as in a smaller subset of genes in known cancer-associated pathways. We find that MSI tumors are significantly enriched for de novo repeat gain, repeat loss, and copy number variation in their exonic and promoter regions. Tumors in general are also enriched in genes with such repeat instabilities compared to normal tissues. We observe that genes with repeat instability in their promoters tend to be expressed at lower levels. The promoters of genes in most well-studied cancer pathways, including the p53 and Wnt signaling pathways, are significantly enriched in unstable repeats, and those pathway genes with unstable repeats show gene expression alterations consistent with their role in carcinogenesis, whether oncogenic or tumor-suppressive.
Abundant repeat gains and losses in tumors compared to normal genomes
We identified genes with tandem repeats in the exons and promoters of 18,439 genes in 35 colorectal tumors and their matched normal genomes (see Methods and Additional file 1: Table S1 and S2). We found that a tumor genome has on average 1510 exon sequences and 4192 promoters with tandem repeats. A normal genome has on average 1475 exons and 4165 promoters with tandem repeats.
More genes with unstable repeats in tumors
For those genes where both the tumor and matched normal genomes contain a repeat, we next asked how many repeats are unstable, that is, varying in the copy number of their repeat unit. Averaged over all 35 tumor-normal genome pairs, the number of genes with unstable repeats is 158 ± 24 (for MSI tumors only: 160 ± 22). This number is significantly greater than the number of genes with unstable repeats in normal genome pairs (81 ± 19, WRS test, P < 10⁻²⁴, see Fig. 1b). When we repeated this analysis for exonic repeats, we observed a similar enrichment in repeats that varied in their copy number between the two sets of genome pairs. Specifically, we found on average 36 ± 14 (for MSI tumors only: 40 ± 13, P not significant) genes with unstable repeats in a tumor-normal genome pair, a value greater than that in a normal/normal pair (31 ± 12 genes, P = 0.04, see Fig. 2b). We conclude that tumor genomes harbor more repeat copy number variation than normal genomes both in their promoters and exons.
MSI genomes have more genes with repeat instability
Most cancer pathways in tumors are enriched for unstable and/or orphan repeats in gene promoters
Genes with repeat instability are downregulated in tumors
Pathway-specific gene expression alterations associated with repeat instability
Here we present a comprehensive analysis of exome and whole genome sequencing data from 35 patients with colorectal cancers to identify tandem repeat instabilities and their association with gene expression alterations. To our knowledge, this is the first genome-wide analysis of tandem repeat instabilities in the currently largest collection of colorectal tumor genomes. To date, there are few other studies on genome-wide tandem repeat mutations in cancers. One such study  genotyped repeat variations in breast cancer exomes in comparison to random healthy individual tissues. Another study  focuses on microsatellite mutations within various tumors (including colorectal cancer) in a small number of genes (137). Our study remains unique in its focus on comparison between tissues from the same individual.
Using the matched tumor and normal tissues, we identified two types of repeat instability between these pairs of genomes, namely (i) repeat copy number variation, and (ii) de novo gains and losses of repeats. We identified these instabilities both in promoter regions and exons of 18,439 human genes, and in a smaller set of 371 genes from five signaling pathways associated with cancer. We found evidence for enhanced repeat instability in promoters and exons of tumor tissues. We also showed for the first time that tumor genomes with an MSI phenotype, which indicates a defective mismatch repair system, contain more repeat instabilities than microsatellite stable tumors. The difference was more pronounced when we focused on mononucleotide repeats, in agreement with the finding that replication slippage alone cannot explain the incidence of polymorphisms in repeats whose repeat units are longer than one nucleotide. Although replication slippage is a major factor driving mononucleotide repeat variation, additional cellular factors, such as chromatin reorganization and telomere instability, also play a role for non-mononucleotide repeats.
Motivated by the impact of gene regulatory alterations on carcinogenesis, we studied repeat-associated gene expression changes. Using the comprehensive catalogue of information we retrieved from , we compared a gene’s expression level in genomes where the gene shows repeat instability and where it does not. We observed that genes with repeat instability are mostly downregulated, and especially so if this instability occurred in the promoter, emphasizing the importance of regulatory mutations in carcinogenesis also suggested by others. Two other studies [15, 35] identified recurrent mutations in gene promoters and their association with gene expression levels in multiple tumor genomes across many cancer types. Another study on non-coding disease associated variants  showed that these variants are concentrated in regulatory DNA marked by DNase hypersensitive sites and that these variants perturb epigenetic processes.
Gene silencing mediated by repeats is a phenomenon observed in various diseases, where, for example, DNA around tandem repeats becomes heterochromatic, leading to decreased promoter accessibility and hence to local transcription repression. This phenomenon has been documented in mammalian embryonic carcinoma cells , as well as for repeat-induced diseases such as myotonic dystrophy and Friedreich’s ataxia . Apart from chromatin reorganization, promoter hypermethylation, which is commonly observed in carcinogenesis [18, 39, 40], can also cause gene silencing or reduced gene expression. Several genes are downregulated via promoter hypermethylation in colon cancers [39, 41]. This type of downregulation can act synergistically with other genetic mechanisms, such as somatic mutations, to alter key signaling pathways critical to colorectal tumorigenesis [39, 42]. Previous smaller-scale studies based on the five markers characterized in the “Bethesda guidelines” microsatellites  showed an association between promoter hypermethylation and microsatellite instability [43, 44]. We therefore asked if promoters with repeat instability show higher promoter methylation levels, and found indeed a small but significant increase of methylation in promoters with unstable repeats (WSR test, P = 0.004, see Methods and Additional file 2: Figure S1). Our findings reveal, for the first time, a genome-wide association between promoter methylation and decreased expression in genes with repeat instability.
Although identification of mutated cancer genes provides insights into tumorigenesis, diverse and functionally heterogeneous genes can be mutated even within the same type of tumor [27, 28]. However, some pathway dysregulations are shared among multiple cancer types [22, 27, 46]. We therefore identified unstable repeats in the promoters of genes in five prominent cancer-associated pathways. One of them is the Wnt signaling pathway, which is commonly implicated in carcinogenesis due to its regulatory role in cell proliferation, gene transcription and cell migration [13, 47]. Colorectal cancers of all subtypes almost invariably start with an activating mutation in this pathway [22, 48]. Remarkably, we found that gene promoters in the Wnt pathway are significantly enriched for unstable and/or orphan repeats, and these genes are also significantly overexpressed (in contrast to the opposite genome-wide trend discussed above). Genes in the MAPK pathway, a signaling cascade that regulates cellular transcription and translation levels, also show higher repeat instability in the promoters of tumor-normal genome pairs than of normal genome pairs, and those genes whose promoters contain unstable and/or orphan repeats are also significantly overexpressed. This increase in gene expression is in line with the significant hyperactivation of the MAPK pathway revealed by a comprehensive study on colorectal tumors by TCGA. In contrast, none of the genes in the TGF beta pathway show increased repeat instability or expression alterations that are associated with repeat instability. This observation is in line with the previous observation that this pathway is the least divergent pathway between colorectal tumors and their matched normal genomes in terms of gene copy number variation and gene expression.
The final pathway we analyzed is the p53 pathway. It plays a crucial role in the cell cycle and can initiate cell death. Inactivation of the p53 pathway through multiple mutations is an almost universal feature of human cancer cells [50, 51]. In agreement with its central role in tumor suppression, we found that genes in this pathway are significantly enriched both for promoter and exonic repeat instabilities in tumor-normal pairs compared to normal genome pairs, and genes with instabilities both in promoter and exon sequences are downregulated in colorectal tumors. When we examined pathway genes with exonic repeats, we identified several genes with unstable and/or orphan repeats in tumor-matched normal genome pairs but not in normal genome pairs. One of them, TP53I3, is a well-known example of tandem repeat instability associated with cancer. It has been shown that this gene contains a pentanucleotide (TGCCC) repeat where the tumor-suppressor p53 binds to activate the gene, a mechanism suggested to mediate cell death. Copy number variation in this repeat alters TP53I3 activation and probably affects an individual’s susceptibility to cancer. We show for the first time that this repeat is actually polymorphic in a tumor tissue. We also identified two other genes (TP53I11 and CDKL1) that contain tumor-specific repeat instabilities in their exons, and where repeat instability had not been documented so far. These findings highlight the importance of analyzing tumor-specific tandem repeat instabilities and their consequences for gene regulation, which could contribute to carcinogenesis.
Among the limitations of our study is that we cannot distinguish between somatic and germline mutations. This is relevant, because some mismatch repair genes can experience germline mutations that cause colorectal cancer. These germline mutations can also play a role in forming different subtypes of colorectal cancer, as they trigger accumulation of different sets of somatic mutations throughout carcinogenesis. However, because 90 % of cancer mutations are somatic, this is not a serious drawback. Second, an ideal control analysis would compare repeat instability between normal-normal genome pairs from healthy tissues of the same individual to those of tumor-normal genome pairs. However, the necessary multiple normal genomes are currently not available, which is why we had to compare the genomes of normal tissues from different individuals as a control. As a result, we may underestimate differences in repeat instability between normal and tumor genomes. Another source of underestimating repeat number and instability is our conservative approach of identifying matched repeats (see Methods). Absent these limitations, we might see an even greater excess of unstable repeats in colorectal tumors. Some of them would be by-products of defective mismatch repair, whereas others might trigger or promote carcinogenesis. It is also important to note that our cancer gene set is unlikely to encompass all genes that may play a role in cancer, because we focused on particular, well-studied cancer-associated pathways. Finally, limitations in whole genome alignment quality may lead us to underestimate repeat copy number variation in gene promoters.
Because genetic instability is not only central to tumor pathogenesis, but may also underlie the development of resistance to chemotherapeutic agents, it is important to identify its incidence and phenotypic consequences. Our analysis, based on the best currently available data sets, is a first small step towards this understanding. Future studies using more data and more advanced technologies will enhance this understanding further, in order to develop more effective molecular diagnostic approaches centered on repeat instability. For example, studies comparing gene expression levels between tumor and healthy tissues will be able to identify tumor-specific gene expression alterations more confidently. Also, information on allele-specific expression can help explain the association between repeat instability and downregulation. Future studies with a more comprehensive set of microsatellite stable tumors will hopefully disentangle differences between microsatellite-stable and -unstable tumors in greater detail. Finally, differentiating between clonal and subclonal instabilities will facilitate a better understanding of the life histories of tumors, because they show extreme intra-tumor heterogeneity [54, 55, 56].
Genome sequence analysis
We obtained whole genome sequences of colon and rectal tumors, together with their matched normal genomes (the same individual’s genomic sequences from blood samples), from the controlled access data tier of the Cancer Genome Atlas Data Portal (TCGA, http://cancergenome.nih.gov/). For our analysis, we considered only genomes for which RNA-Seq data were also available in TCGA. The genome sequence data are based on 2–5X coverage Illumina HiSeq2000 sequencing of 80–100 million pairs of 100-nucleotide-long reads, aligned against human genome build #18 with the indel-compatible software package BWA (bwa-0.5.9rcl). For the exon analysis, we used Illumina exome-seq data exceeding 20X coverage for ~44 Mb of sequence from ~30,000 genes.
We generated consensus sequences for the promoters and exons of genes in the tumors and their matched normal genomes using SAMtools. In order to specify the exonic regions, we considered all transcript variants for each gene in the human reference genome annotation for human genome build #18. We excluded those exons that contained transcript variants in more than one chromosome, such as transposons. For genes with multiple transcripts, we merged all exonic regions from all transcripts into one super-transcript. Because our previous work on human tandem repeats suggests that the 5,000 base pairs (bp) upstream from the transcription start site contain the most regulatory signals, we focused on this region and referred to it as the promoter.
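The two region definitions above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the half-open coordinate convention, and the treatment of the transcription start site as a boundary coordinate are assumptions made only for illustration.

```python
def merge_exons(exons):
    """Merge overlapping or touching (start, end) intervals into a
    'super-transcript'. Coordinates are assumed to be half-open."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Interval overlaps or touches the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def promoter_window(tss, strand, length=5000):
    """Return the (start, end) window of `length` bp upstream of the TSS.
    `tss` is assumed to be the boundary between promoter and transcript."""
    if strand == "+":
        return (max(0, tss - length), tss)
    if strand == "-":
        return (tss, tss + length)
    raise ValueError("strand must be '+' or '-'")


# Exons from two hypothetical transcripts of the same gene.
super_transcript = merge_exons([(100, 200), (150, 300), (400, 500)])
```

In this sketch, exons shared between transcripts collapse into a single interval, so a repeat located in a shared exon is counted once per gene.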
While generating our consensus sequences, we noticed that some genomes contained many more unaligned sequences than others. We eliminated genomes with unaligned nucleotides in more than 10 % of the regions of our interest (promoters or exons), which reduced our data set to 35 genomes (see Additional file 1: Table S2 for a list of genomes). We considered a tumor MSI if its MSI status was MSI-H based on . Because this approach yielded only three genomes, we considered also other criteria of an MSI phenotype, as provided by . We found, however, only one more genome that was not MSI-H but showed all other indications of an MSI phenotype, namely a CIMP-H methylation subtype, MLH1 silencing, and an MSI-CIMP expression subtype. We therefore considered this genome also MSI (see Additional file 1: Table S2). After removing genes from the data set whose promoters or exons could not be aligned, we focused our analysis on the remaining “global” set of (one-to-one homologous) 18,439 genes (see Additional file 1: Table S1), as listed in . Apart from analyzing this global set, we also performed a more detailed analysis of 371 cancer genes (Additional file 1: Table S3) that fall into five well-studied cancer-associated pathways [13, 22, 28].
Tandem repeat identification
We used the program Tandem Repeat Finder 4.07b  to identify tandem repeats in the consensus sequences. Specifically, we identified repeats with (i) an incidence of indels (insertions or deletions) in adjacent repeat units below 10 % (e.g., a repeat unit of 20 nucleotides can have up to two single base pair indels relative to the consensus pattern, which is the repeat unit most common in the whole repeat sequence ), and (ii) a sequence identity of repeat units above 90 % (e.g., at least 18 nucleotides of a repeat unit of 20 nucleotides must match the consensus pattern). We set the Tandem Repeat Finder Score to a value of 80, as we were most interested in how repeat variation might cause gene expression differences, and variation of tandem repeats increases strongly for repeats of high Tandem Repeat Finder Scores . We considered both micro- and minisatellites with tandem repeat units up to 100 nucleotides in length. Repeats longer than that are more stable and therefore less likely to cause gene expression differences .
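The selection criteria above can be written as a simple filter over Tandem Repeat Finder hits. The sketch below is an illustration, not the code used in this study: the `TrfHit` record and its field names are hypothetical, and we assume that the thresholds are inclusive (as the worked examples of 2 indels and 18 matches per 20 nucleotides suggest) and that the score of 80 acts as a minimum.

```python
from dataclasses import dataclass


@dataclass
class TrfHit:
    """Hypothetical record holding the per-repeat values we filter on."""
    period: int             # repeat unit length in nucleotides
    copies: float           # number of repeat units
    percent_matches: float  # identity of units to the consensus pattern
    percent_indels: float   # indels between adjacent units
    score: int              # alignment score of the repeat


def passes_filters(hit, min_score=80, min_identity=90.0,
                   max_indels=10.0, max_period=100):
    """Return True if a hit satisfies all four criteria (inclusive bounds,
    an assumption of this sketch)."""
    return (hit.score >= min_score
            and hit.percent_matches >= min_identity
            and hit.percent_indels <= max_indels
            and hit.period <= max_period)


hits = [
    TrfHit(period=20, copies=4.0, percent_matches=90.0, percent_indels=10.0, score=80),
    TrfHit(period=2, copies=15.0, percent_matches=85.0, percent_indels=0.0, score=95),
    TrfHit(period=120, copies=3.0, percent_matches=99.0, percent_indels=0.0, score=300),
]
kept = [h for h in hits if passes_filters(h)]
```

Only the first hit is retained here: the second fails the identity threshold and the third exceeds the maximum repeat unit length.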
To identify repeat gains and losses, we first defined matched repeats between a tumor and its normal genome. These are repeats with the same repeat unit that occur in the promoter or exon of the same (homologous) genes in a tumor and its matched normal genome. We did not consider gene families separately. We allowed positional variation of repeats up to 50 nucleotides within a promoter or an exon, because indels can cause substantial shifts in repeat location even within a species .
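One way to express this matching rule is sketched below. It is a simplified illustration under stated assumptions: the `Repeat` record, the greedy nearest-position matching, and the exact comparison of copy numbers are our own choices for the example and are not taken from the original analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    """Hypothetical repeat record for one genome."""
    gene: str
    region: str    # "promoter" or "exon"
    unit: str      # repeat unit (consensus pattern)
    start: int     # position within the promoter or exon
    copies: float  # copy number of the repeat unit


def compare_repeats(tumor, normal, max_shift=50):
    """Match repeats by gene, region and unit, allowing a positional shift of
    up to `max_shift` nucleotides; classify the remainder as gains or losses."""
    unmatched_normal = list(normal)
    gains, unstable, stable = [], [], []
    for t in tumor:
        candidates = [n for n in unmatched_normal
                      if (n.gene, n.region, n.unit) == (t.gene, t.region, t.unit)
                      and abs(n.start - t.start) <= max_shift]
        if not candidates:
            gains.append(t)  # repeat present only in the tumor
            continue
        # Greedy choice: pair with the positionally closest normal repeat.
        partner = min(candidates, key=lambda n: abs(n.start - t.start))
        unmatched_normal.remove(partner)
        if partner.copies != t.copies:
            unstable.append((t, partner))  # copy number variation
        else:
            stable.append((t, partner))
    return {"gain": gains, "loss": unmatched_normal,
            "unstable": unstable, "stable": stable}


tumor = [Repeat("G1", "promoter", "CA", 100, 12.0),
         Repeat("G2", "exon", "A", 10, 9.0)]
normal = [Repeat("G1", "promoter", "CA", 130, 10.0),
          Repeat("G3", "promoter", "TG", 5, 7.0)]
result = compare_repeats(tumor, normal)
```

In this toy example the CA repeat of G1 is matched despite a 30-nucleotide shift and is called unstable, the G2 repeat is a de novo gain, and the G3 repeat is a loss.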
To find out whether a tumor genome shows a significant difference in repeat incidence or variability to a normal genome, it is necessary to compare (i) the incidence or variability of repeats in a tumor genome relative to its matched normal genome to (ii) the incidence or variability of repeats between two normal genomes. We computed the latter from our 35 normal genomes by pairing them in all possible (595) combinations, computing our measures of repeat incidence and variability for each pair, and pooling the resulting data.
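The number of normal-normal comparisons follows from choosing 2 of 35 genomes, 35 × 34 / 2 = 595. A minimal sketch of the pairing and pooling step is given below; the genome identifiers and the `pool_pairwise` helper are placeholders introduced for illustration only.

```python
from itertools import combinations

# Placeholder identifiers for the 35 normal genomes.
normal_ids = [f"normal_{i:02d}" for i in range(1, 36)]

# All unordered pairs of distinct normal genomes.
pairs = list(combinations(normal_ids, 2))


def pool_pairwise(ids, measure):
    """Apply `measure(a, b)` to every unordered pair and pool the results."""
    return [measure(a, b) for a, b in combinations(ids, 2)]
```

Any measure of repeat incidence or variability that takes two genomes can be passed as `measure`; the pooled list then serves as the normal-normal baseline.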
Gene expression analysis
The gene expression data we used are based on RNA sequencing of 350–450 base pair-long Illumina Cluster Station and Genome Analyzer reads by TCGA. The data comprise expression levels in reads per kilobase of transcript per million reads mapped (RPKM) for 18,439 genes in the 35 tumor genomes we analyzed.
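For reference, RPKM normalizes a gene's read count by transcript length (in kilobases) and by sequencing depth (in millions of mapped reads). The short sketch below only illustrates this standard definition; the expression values used in this study were obtained from TCGA and not recomputed in this way, and the numbers in the example are invented.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Invented example: 500 reads on a 2,000 bp transcript, 25 million mapped reads.
example = rpkm(500, 2000, 25_000_000)
```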
The promoter methylation data we used are based on Illumina Infinium HumanMethylation27 arrays used by TCGA to profile DNA methylation at gene promoters, targeting 27,578 CpG sites located in proximity to the transcription start sites of 14,475 consensus coding sequences (CCDS) in the NCBI Database (Genome Build 36). We computed for each gene the methylation level in those genomes where the gene has a repeat instability in its promoter, and compared it to the gene’s methylation level in genomes where the gene has no repeat instability.
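The per-gene comparison described above can be sketched as follows. This is an illustration under assumptions: we summarize each group by its mean, the data structures and names are hypothetical, and genes lacking either group are skipped. The resulting paired values could then be passed to a paired test such as the Wilcoxon signed-rank test.

```python
from statistics import mean


def paired_methylation(methylation, unstable):
    """For each gene, return (mean level in genomes with promoter repeat
    instability, mean level in genomes without it).

    methylation: {gene: {genome: methylation level}}
    unstable:    {gene: set of genomes with promoter repeat instability}
    """
    pairs = {}
    for gene, by_genome in methylation.items():
        affected = unstable.get(gene, set())
        with_inst = [v for g, v in by_genome.items() if g in affected]
        without = [v for g, v in by_genome.items() if g not in affected]
        if with_inst and without:  # both groups are needed for a pair
            pairs[gene] = (mean(with_inst), mean(without))
    return pairs


# Invented toy data for two genes in three tumor genomes.
methylation = {"G1": {"t1": 0.5, "t2": 0.25, "t3": 0.25},
               "G2": {"t1": 0.5, "t2": 0.5}}
unstable = {"G1": {"t1"}}
pairs = paired_methylation(methylation, unstable)
```

The same paired layout applies to expression levels, with RPKM values in place of methylation levels.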
We acknowledge support through Swiss National Science Foundation grant 315230–129708, as well as through the University Priority Research Program in Evolutionary Biology at the University of Zurich. The results published here are in whole based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. The controlled data sets (dbGaP accession number phs000544.v2.p7, a substudy of the TCGA data set phs000178.v8.p7) used in this study were accessed through the authorized access approval of the NIH committee for projects #5876 and #8774. Primary sequencing data were downloaded from the Cancer Genomics Hub (CGHub). In their analyses and publication of the results, the authors strictly followed the Data Use Certification Agreement and the dbGaP Approved User Code of Conduct, to which they agreed in order to obtain TCGA controlled data access. No further ethical approval was needed for the study. Information about TCGA can be found at http://cancergenome.nih.gov.
- 5. Gemayel R, Vinces MD, Legendre M, Verstrepen KJ. Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu Rev Genet. 2010.
- 6. Imai K, Yamamoto H. Carcinogenesis and microsatellite instability: The interrelationship between genetics and epigenetics. Carcinogenesis. 2008;673–680.
- 9. Cancer Research UK, http://www.cancerresearchuk.org/health-professional/cancerstatistics/worldwide-cancer, Accessed 11 2014.
- 14. Giovannucci E, Stampfer MJ, Krithivas K, Brown M, Brufsky A, Talcott J, et al. The CAG repeat within the androgen receptor gene and its relationship to prostate cancer. Proc Natl Acad Sci U S A. 1997;3320–3323.
- 18. Paulsen M, Ferguson-Smith AC. Methylation and colorectal cancer. J Pathol. 2001;111–134.
- 23. Umar A, Boland CR, Terdiman JP, Syngal S, de la Chapelle A, Rüschoff J, et al. Revised Bethesda Guidelines for hereditary nonpolyposis colorectal cancer (Lynch syndrome) and microsatellite instability. J Natl Cancer Inst. 2004;261–268.
- 25. Willems TF, Gymrek M, Highnam G, Mittelman D, Erlich Y. The landscape of human STR variation. Genome Res. 2014; doi: 10.1101/gr.177774.114.
- 29. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57:289–300.
- 30. Woolson RF. Wilcoxon signed-rank test. Wiley Encycl Clin Trials. 2008;1–3.
- 36. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012;1190–1195.
- 42. Chan TA, Glockner S, Joo MY, Chen W, Van Neste L, Cope L, et al. Convergence of mutation and epigenetic alterations identifies common genes in cancer that predict for poor prognosis. PLoS Med. 2008;5:0823–37.
- 52. Venot C, Maratrat M, Dureuil C, Conseiller E, Bracco L, Debussche L. The requirement for the p53 proline-rich functional domain for mediation of apoptosis is correlated with specific PIG3 gene transactivation and with transcriptional repression. EMBO J. 1998;17:4668–79.
- 54. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;413–421.
- 55. Nik-Zainal S, Van Loo P, Wedge DC, Alexandrov LB, Greenman CD, Lau KW, et al. The life history of 21 breast cancers. Cell. 2012;994–1007.
- 60. Bilgin Sonay T, Carvalho T, Robinson MD, Greminger MP, Comas D, Highnam G, et al. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence. Genome Res. Published in Advance August 19, 2015, doi: 10.1101/gr.190868.115.
- 64. Wilks C, Cline MS, Weiler E, Diekhans M, Craft B, Martin C, et al. The Cancer Genomics Hub (CGHub): overcoming cancer through the power of torrential data. Database. 2014; doi: 10.1093/database/bau093.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.