Introduction

Breast cancer is the second leading cause of cancer-related death in women [1]. Comprehensive gene expression analyses of breast cancer confirmed the presence of the following three histopathologically identified subsets: (1) estrogen receptor (ER)-positive, progesterone receptor (PR)-positive; (2) human epidermal growth factor receptor 2 (HER2)-enriched; and (3) triple-negative (lacking ER, PR, and HER2) [2]. ER+ breast cancers account for approximately 70 % of breast tumors diagnosed, and while effective targeted endocrine therapies have been identified, ~25 % of these tumors develop resistance over time and consequently undergo relapse [3]. Analysis of aromatase inhibitor-treated ER+ breast tumors using whole exome sequencing identified associations between endocrine resistance and mutations in ER-related genes, including GATA3, CBFB, TBX3, RUNX1, and PIK3CA [4]. Similarly, loss of PR in ER+ breast tumors associates with loss of estrogen-dependence, increased endocrine resistance, and diminished overall survival [5]. The discovery of underlying targetable pathways of resistance in this subgroup is required for the identification of markers and the development of tailored therapeutic strategies.

Several prognostic markers have been identified for breast cancer including lymph node involvement, tumor stage, TP53 status, PAM-50 subtype, ER status, PR status, and HER2-enrichment. Mutations in DNA damage response (DDR) pathways are also implicated in clinical outcome of breast cancer. Mutations in BRCA1 (a double-strand break (DSB) repair gene) and TP53 (a DDR checkpoint gene), for instance, are associated with triple-negative breast cancer and poor clinical outcome [6, 7]. Mutations in other DDR genes including ATM, ATR, and BRCA2 (all DSB repair genes) have been associated with increased susceptibility to breast cancer [8]. While some of these markers have contributed significantly to the tailoring of therapeutic strategies, they do not comprehensively predict resistance or increased mortality. Moreover, despite much effort no further globally significant single genes have been identified as predictors of breast cancer clinical outcome, and it is unlikely that many such genes remain to be discovered. In non-breast cancers, DNA damage affects tumor somatic mutation load (SML), and mutations in DDR genes can be predictive of clinical outcomes, such as overall and relapse-free survival [9, 10]. In this context, we postulate that genome-wide phenotypic signatures might have a wide impact on breast cancer prognosis and prediction.

In support of this idea, increased genomic instability in tumors has been associated with the basal-like tumor subtype [11] and with metastasis-free survival in lymph node-negative luminal breast tumors, although this analysis was limited by its sample size [12]. This genomic instability score was found to be highly associated with TP53 mutations and proliferative indices. However, genomic instability in this group was restricted to a very small number of tumors, indicating a potential limitation of its scope for use as a prognostic/predictive marker. Recent whole exome sequencing of colorectal cancer by The Cancer Genome Atlas (TCGA) initiative identified a high SML subset associated with microsatellite instability, mutations in mismatch repair (MMR) pathway genes, and favorable outcome [13]. However, the effects of SML on breast cancer have not yet been elucidated, and we postulated that, as in colorectal cancer, SML of a breast tumor would influence patient survival. To test this hypothesis, we analyzed whole exome sequencing data recently generated by the TCGA initiative from breast tumors [14].

Materials and methods

Informatics

Whole exome somatic variants, gene expression, clinical, and epidemiological data were downloaded from The Cancer Genome Atlas Breast Invasive Carcinoma data portal (https://tcga-data.nci.nih.gov/tcga/tcgaCancerDetails.jsp?diseaseType=BRCA&diseaseName=Breast%20invasive%20carcinoma). Details of sample acquisition, DNA sequencing, and RNA expression analyses have been described in the original TCGA publication [14]. Data processing and statistical analysis were carried out using the R statistical software suite [15] and custom scripts written in Perl.

Statistics

t tests were used to determine p-values for continuous data, with Holm’s adjustment for multiple comparisons as required. For data with non-normal distribution, the Wilcoxon Rank Sum test was used. Fisher’s exact tests were used to determine significance of categorical data. Survival analyses used log-rank tests, and Kaplan–Meier curves were plotted using R. Due to the low median follow-up time of the TCGA cohort (575 days), all survival analyses extend only 10 years. Proportional hazards were calculated using the Cox regression model, and the coxph function in R was used to confirm that the dataset met the assumptions for the Cox regression analysis.
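As an illustration only (the analyses reported here were run in R and Perl), Holm's step-down adjustment can be sketched in a few lines of Python. The function name holm_adjust and the example p-values are hypothetical and are not taken from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the raw p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values are not allowed to decrease as the rank increases.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

The adjusted values are returned in the same order as the input, so they can be reported alongside the corresponding comparisons.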

Clinical information

ER status provided in the publicly available TCGA dataset was used to sort the tumors into ER+ and ER− groups. Age at diagnosis of >50 years was used as a surrogate indicator of postmenopausal status.

Gene lists

Lists of genes within the specific DNA damage response, MAPK, NFkB, and T-cell marker pathways were generated using the KEGG database (keywords: DNA damage repair; base excision repair (BER); nucleotide excision repair (NER); MMR; homologous recombination (HR); non-homologous end joining (NHEJ); DDR checkpoint; MAPK signaling; NFkB signaling; T-cell marker) (see Tables 1, 2). A list of genes with known prognostic associations [4, 16–18] was compiled from the previous literature (see Table 5). A consensus ER+ breast cancer signature gene list (ABCA3, ACADSB, ALDH3B2, AR, ANXA9, BCL2, CA12, CCND1, CGA, DNAJC12, ESR1, ERBB4, FOXA1, GATA3, GJA1, GREB1, HPN, IGFBP4, IL6ST, KRT18, LRBA, MYB, NAT1, NRIP1, PGR, PTPRT, RABEP1, RARRES1, RERG, RET, SEMA3B1, SLC27A2, SLC39A6, SULT2B1, TFF1, TFF3, XBP1) was generated from five independent studies profiling ER+ (luminal) breast tumors and cell lines [19–23].

Table 1 KEGG-generated list of genes from three cancer-related pathways
Table 2 KEGG-generated list of DNA damage repair genes

Results

Mutation load distribution is different between ER+ and ER− breast cancer

Our sample set comprises 762 invasive breast tumors from the TCGA dataset. Immunohistochemical analysis shows that the majority of these tumors (73.4 %) are ER+ (Table 3). The mean SML is 67.23 mutations per tumor (Table 3); however, ER− tumors have a significantly higher SML than ER+ tumors (p < 0.0001; Fig. 1a). Furthermore, ER+ and ER− tumors are characterized by marked differences in SML distribution. ER+ tumors have a median SML of 46 (Fig. 1a) and a mean SML of 62.7, with a small subset of these tumors carrying markedly high mutation loads (HMLs) (Fig. 1a). Conversely, ER− tumors lack a distinct high-mutation subset; instead, almost half (42 %) of these tumors carry mutation loads higher than the mean SML (Fig. 1a).

Table 3 Descriptive characteristics of TCGA dataset used in the study
Fig. 1

HML subset of ER+ breast tumors is associated with poor clinical outcome. a Index plot. Median and mean SMLs of each population are indicated with red lines. b–e Kaplan–Meier survival curves of all breast tumors (b) and the HML (red) and LML (blue) subsets of: c ER+ breast cancer; d ER− breast cancer; and e a comparison between ER+ HML, ER+ LML, and ER− (black) breast cancer. The log-rank test was used to determine p-values

SML associates with ER+ breast cancer survival

Associations have been found between genomic instability, mutations in specific DNA damage genes, and clinical outcome in various cancers, including breast cancer [8], but there have been few previous reports on the effect of mutation load on any type of cancer. One report identified an association between high SML and good clinical outcome in colorectal cancer [13]; however, there are no such associative findings reported for breast cancer. To test whether mutation load affected survival in breast cancer, we divided all breast cancers into HML and low mutation load (LML) groups based on the mean SML across all breast cancers. We found that SML had no effect on breast cancer survival when all tumors were considered (Fig. 1b), in accord with the previous TCGA report [14]. However, we postulated that SML might differentially affect breast cancer outcomes based on ER status, as suggested by the distinct distribution of SML between ER+ and ER− breast tumors (Fig. 1a). Therefore, we next analyzed the effect of SML on overall survival independently in the ER+ and ER− subsets of breast cancer by defining tumors as LML or HML based on the mean SML for each ER subtype. We found that patients with ER+ HML tumors exhibit significantly shorter overall survival than do patients with ER+ LML tumors (p = 0.02, Fig. 1c); conversely, overall survival is not affected by SML in patients with ER− tumors (p = 0.25, Fig. 1d). In addition, the overall survival curve of ER+ HML tumors is virtually identical to the survival curve of ER− tumors (Fig. 1e), emphasizing the poor overall survival observed in the HML subset of ER+ breast tumors. For most of the remaining analyses, we focused on the effects of mutation load on ER+ breast cancer.
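The grouping rule described above (HML or LML relative to the mean SML of the relevant ER subtype) can be sketched as follows. This is a minimal Python illustration, not the code used in the study; the function name split_by_mean_sml and the tumor counts are hypothetical, and the choice to treat a tumor exactly at the mean as LML is an assumption.

```python
from statistics import mean


def split_by_mean_sml(sml_by_tumor):
    """Label each tumor HML if its SML exceeds the group mean, otherwise LML."""
    cutoff = mean(sml_by_tumor.values())
    labels = {
        # Assumption: a tumor exactly at the mean is counted as LML.
        tumor: ("HML" if sml > cutoff else "LML")
        for tumor, sml in sml_by_tumor.items()
    }
    return cutoff, labels


# Hypothetical somatic mutation counts for five tumors of one ER subtype.
example = {"T1": 20, "T2": 35, "T3": 46, "T4": 60, "T5": 239}
cutoff, labels = split_by_mean_sml(example)
print(cutoff, labels)
```

Because the cutoff is a mean, a single heavily mutated tumor (T5 in this toy input) raises the threshold for the whole group, which is why the split is computed separately within each ER subtype.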

We next used a Cox regression model that assessed the effect of SML on survival in the presence of known prognostic/predictive factors including PR status, HER2 enrichment, tumor stage, and lymph node involvement. Our results showed that mutation load was an independent prognostic factor in ER+ tumors (p = 0.04, Table 4) with a hazard ratio (2.02) higher than that of all other factors considered except nodal status. In fact, PR status no longer contributed significantly to survival (p = 0.15), although lymph node status remained significant in the multivariate analysis (p = 0.02). The fact that tumor stage did not affect clinical outcome significantly in the Cox analysis is likely due to the small number of patients and the short follow-up time in this study (see “Materials and methods” section). The HML subset overall was enriched for HER2+ tumors (36 of 395 LML tumors were HER2+ vs 36 of 105 HML tumors; p < 0.001), and the average SML of HER2-enriched tumors (79.5 ± 55.9) was higher than that of HER2-negative tumors (65.6 ± 53.5; p = 0.02). However, HER2 enrichment did not contribute significantly to overall survival (Table 4), indicating that HML may be a more compelling contributor to survival in ER+ breast cancer than HER2 status.
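For readers who wish to check the HER2 comparison, a two-sided Fisher's exact test can be computed directly from the hypergeometric distribution. The sketch below is in Python rather than the R used for the study, the function name fisher_exact_two_sided is hypothetical, and it assumes that the counts quoted above are 36 HER2+ tumors among 105 HML tumors and 36 among 395 LML tumors, with all remaining tumors treated as HER2-negative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum over every table that is no more probable than the observed one.
    total = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


# Assumed table: rows are HML and LML, columns are HER2+ and HER2-negative.
p_her2 = fisher_exact_two_sided(36, 105 - 36, 36, 395 - 36)
print(p_her2 < 0.001)
```

Under these assumed counts the p-value falls well below 0.001, in line with the value reported in the text.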

Table 4 Proportional hazards table identifying mutation load as an independent prognostic factor for ER+ breast cancer

DNA damage repair pathways are mutated in tumors with HML

To investigate the pathways underlying the HML phenotype, we next investigated whether HML associated with inactivation of DDR pathways by assessing the mutational status of genes from the DDR checkpoint, as well as from each of the five major DDR pathways: BER; NER; MMR; HR; and NHEJ (see “Materials and methods” section and Table 2). We analyzed the proportion of tumors with mutations in at least one gene from each pathway in HMLs vs LMLs, and the mutational frequency (i.e., the number of non-silent mutations in genes of a specific pathway over total number of mutations in all genes) (Fig. 2a, b).
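The two quantities defined above can be made concrete with a short sketch. It is written in Python for illustration only; the function name pathway_stats, the tumor identifiers, and the mutation lists are hypothetical, and the four MMR genes are used as a stand-in for the full KEGG-derived list in Table 2.

```python
def pathway_stats(mutations_by_tumor, pathway_genes):
    """Return (proportion of tumors hit, mutational frequency) for one pathway.

    mutations_by_tumor maps a tumor id to a list of gene symbols, with one
    entry per non-silent mutation found in that tumor.
    """
    pathway = set(pathway_genes)
    n_tumors = len(mutations_by_tumor)
    # Tumors carrying at least one mutation in any gene of the pathway.
    tumors_hit = sum(
        1 for genes in mutations_by_tumor.values()
        if any(g in pathway for g in genes)
    )
    # Mutations in pathway genes over mutations in all genes.
    pathway_mutations = sum(
        1 for genes in mutations_by_tumor.values() for g in genes if g in pathway
    )
    total_mutations = sum(len(genes) for genes in mutations_by_tumor.values())
    proportion = tumors_hit / n_tumors if n_tumors else 0.0
    frequency = pathway_mutations / total_mutations if total_mutations else 0.0
    return proportion, frequency


# Hypothetical toy subsets; real inputs would be the TCGA somatic variant calls.
mmr = ["MLH1", "MSH2", "MSH6", "PMS2"]
hml = {"H1": ["MLH1", "TP53", "PIK3CA", "MSH2"], "H2": ["GATA3", "MSH6"],
       "H3": ["PIK3CA", "CDH1"], "H4": ["TP53", "MAP3K1"]}
lml = {"L1": ["PIK3CA"], "L2": ["MLH1", "GATA3"], "L3": ["CDH1"], "L4": ["TP53"]}

hml_prop, hml_freq = pathway_stats(hml, mmr)
lml_prop, lml_freq = pathway_stats(lml, mmr)
# Fold change of HML over LML, as plotted in Fig. 2a, b.
print(hml_prop / lml_prop, hml_freq / lml_freq)
```

Dividing the HML value by the LML value gives the fold changes that are compared against the baseline-derived thresholds in the paragraphs that follow.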

Fig. 2

HML in ER+ cancers associates with mutations in DDR, but not checkpoint, genes. a, b Bar graphs representing the fold change in HMLs over LMLs of mutations in specified DDR pathway genes, DDR checkpoint genes (Chkpt), genes that are common to multiple DDR pathways (Other), all DDR-related genes included in the analysis (All), and any non-DDR gene in the genome (Any) in terms of: a proportion of tumors with at least one mutation in each pathway; and b frequency with which every gene of each pathway is mutated. The dotted line represents the threshold fold change calculated from baseline levels graphed in c–d (inset) and e–f. c, d Bar graphs. Fisher’s exact test was used to generate p-values. Inset depicts bar graphs representing tumors with mutations in all genes other than DDR-related genes. e, f Percentage of tumors with mutations in genes from three randomly selected cancer-related pathways (e), and the frequency of mutations in genes from these pathways in both HML (red) and LML (blue) tumors (f). Fisher’s exact test was used to determine p-values. Gene lists were generated from KEGG database and from the previous literature and are reproduced in Tables 1 and 2

Mutational analysis in this study is confounded by the fact that HML tumors are theoretically more likely to mutate any given gene than LML tumors. To account for this bias, we calculated baseline statistics of (1) the proportion of tumors with mutations in any given gene and (2) the frequency of mutations in any given gene for both HML and LML subset tumors. We found that the baseline proportion of tumors that have a mutation in any gene is 2.5-fold higher in the HMLs relative to the LMLs as would be expected of tumors with significantly higher mutation load (Fig. 2c, inset). However, we found that the baseline mutational frequency of any gene was similar between the HMLs and LMLs suggesting that the likelihood of any random gene being mutated was comparable between the HML and LML subsets (Fig. 2d, inset). An independent calculation of these same baseline parameters on genes from three randomly selected KEGG-generated pathways revealed no significant increase in mutational proportion or frequency in HMLs (Fig. 2e, f), indicating that high mutation load does not necessarily enrich for mutations in every pathway. Based on these analyses, we set the threshold to find mutational enrichment in the HML subset as twice the baseline difference between HMLs and LMLs. This means that in order to find mutational enrichment in DDR genes in HMLs, 5-fold more tumors would need to have these genes mutated at 2-fold higher frequencies than LMLs.

Using these conservative thresholds, we found no significant enrichment for DDR mutations overall in HMLs over LMLs (Fig. 2a, b). However, mutations in MMR pathway genes occurred in 16-fold more tumors and occurred at 7-fold higher frequency in HML than in LML ER+ tumors indicating significant enrichment over and above our set thresholds (Fig. 2a, b). Uniquely, every gene specific to the MMR pathway was mutated at least once in the HML subset of ER+ tumors (Fig. 3a). Genes from the single-strand break repair pathway, NER, were also mutated in 7-fold more HML tumors and at 2.5-fold higher frequency relative to the LML ER+ tumors (Fig. 3a). Notably, there was no significant enrichment in the HMLs in DNA damage checkpoint genes (Fig. 2a, b). Some genes from the double-strand break repair pathways, e.g., BLM and XRCC4, are mutated at higher frequencies and in more tumors in the HML subset than in the LML subset, but this enrichment is not significant (Figs. 3b; 2a, b).

Fig. 3

ER+ HML tumors are enriched for mutations in MMR and NER pathway genes. a, b Venn diagrams indicating genes from the specified DDR pathway that are mutated in either the HML (red) or LML (blue) subset of ER+ tumors, in both (purple) and in neither (white). Increasing font size indicates an increasing proportion of tumors with mutations in the specific gene. c Bar graph depicting the average SML in tumors with specified mutational status. Student’s t test with Holm’s adjustment for multiple comparisons was used to define p-values. Chkpt, genes from the DNA damage checkpoint; NL, tumors with no identified mutations in genes from the specified pathway; mut, tumors with identified non-silent mutations in genes from the specified pathway; ns not significant

In addition, we found a 50 % increase in mean SML in ER+ HML tumors with mutations in DDR pathway genes, while mutations in DDR checkpoint genes did not affect SML (Fig. 3c). Especially striking is the observation that mutations in TP53 occur in a significant fraction of breast tumors and were previously reported to affect genomic instability [11] but are not enriched over the set threshold in the HML group (0.96-fold for mutational frequency and 2.97-fold for tumor proportion) relative to the LML group. While mutations in DDR genes resulted in increased mutation load within LML subset tumors (Fig. 3c), the extremely small effect size limits the biological relevance of this finding. Together, these results indicate that the HML subset of ER+ tumors is associated with mutations in DDR pathway genes, specifically in MMR and NER genes, but not with mutations in DDR checkpoint and double-strand break repair genes.
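The threshold logic applied in this section can be restated as a small check. The sketch below is illustrative Python, not the code used in the study: the names and the is_enriched helper are hypothetical, the baseline fold change for mutational frequency is approximated as 1.0 because the text describes it only as similar between HMLs and LMLs, and meeting a threshold exactly is assumed to count as enrichment.

```python
# Baseline HML-over-LML fold changes for "any gene" (Fig. 2c, d, insets).
# The frequency baseline of 1.0 is an approximation of "similar".
baseline_fold = {"proportion": 2.5, "frequency": 1.0}

# The enrichment threshold is set at twice the baseline difference.
thresholds = {key: 2 * value for key, value in baseline_fold.items()}

# Fold changes quoted in the text for MMR, NER, and TP53.
observed = {
    "MMR": {"proportion": 16, "frequency": 7},
    "NER": {"proportion": 7, "frequency": 2.5},
    "TP53": {"proportion": 2.97, "frequency": 0.96},
}


def is_enriched(fold, limits):
    """True only when both fold changes reach their respective thresholds."""
    return all(fold[key] >= limits[key] for key in limits)


for name, fold in observed.items():
    print(name, is_enriched(fold, thresholds))
```

With these inputs, MMR and NER clear both thresholds (5-fold for proportion and 2-fold for frequency), whereas TP53 clears neither.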

Mutations in known prognostic genes do not affect survival

Next, we investigated potential pathways underlying the poor survival phenotype associated with HML tumors using a candidate approach. To determine whether the HML subset of ER+ tumors is enriched for mutations associated with poor prognosis, we generated a list of known prognostic genes mutated at >10 % frequency in human breast cancer based on the existing literature [4, 16–18] (see “Materials and methods” section and Table 5). We assessed the proportion of tumors with mutations in these genes in both the HML and LML ER+ subsets. Our results demonstrate that the LML subset has a significantly higher proportion of good prognostic mutations than poor prognostic mutations (p = 0.002; Fig. 4a). However, there were no significant associations found between these known prognostic mutations and overall survival in either HML or LML subsets (Fig. 4b). These data indicate that mechanisms other than those associated with known prognostic genetic mutations mediate the association between SML and breast cancer survival.

Table 5 List of ER signature genes with prognostic mutational status in breast cancer
Fig. 4

Coincident mutations in DDR and ER signature genes associate with poor survival irrespective of mutation load. a Percentage of tumors with mutations in genes associated with either good or poor prognosis in specified subsets. Fisher’s exact test was used to determine the p-value. b Kaplan–Meier survival curves of indicated groups. Log-rank test was used to generate p-values. c Bar graph depicting the percentage of tumors with mutations in the specified pathways. Fisher’s exact test was used to identify p-values. The list of ER signature genes is presented in “Materials and methods” section. d–f Kaplan–Meier survival curves of indicated groups. Log-rank test was used to determine p-values. ER, ER signature genes; DDR, genes from the five major DNA damage response pathways; Chkpt, genes from the DNA damage checkpoint; mut, tumors with non-silent mutations in genes from the specified pathway; NL, tumors with no identified mutations in genes from the specified pathway; ns, not significant

Coincident mutations in ER and DDR genes are enriched in HML breast tumors and associate with poor patient survival

We next hypothesized that inactivation of DDR increases the frequency of genetic mutations in ER pathways, thereby decreasing dependence on ER signaling and potentially increasing resistance to therapy. To test this hypothesis, we assessed the mutational frequency of ER signature genes in HML and LML tumors (see “Materials and methods” section), and the correlation between mutations in ER signature, DDR checkpoint, and DDR pathway genes. Mutations in ER signature and DDR checkpoint genes occurred at comparable rates between LML and HML tumors, both singly and in combination (p > 0.9; Fig. 4c). However, when we compared tumors with coincident mutations in DDR pathway and ER signature genes, we observed significant enrichment in the HML subset tumors (~20 %) compared to LML subset tumors (<10 %; p = 0.03; Fig. 4c).

We next evaluated the clinical outcome of women with tumors having mutations in both DDR and ER genes. As predicted by our hypothesis, HML tumors with mutations in both DDR pathway and ER signature genes associate with worse overall survival than all other HML tumors (p = 0.007, data not shown). Notably, even LML tumors with mutations in genes of both the DDR and ER pathways associate with significantly worse overall survival than all other LML tumors (p = 0.01; Fig. 4d). Further, ER+ tumors with coincident mutations in DDR pathway and ER signature genes (~10 % of all ER+ tumors) associate with significantly worse overall survival than all other ER+ tumors independent of mutation load (p = 0.0008; Fig. 4f), unlike ER− tumors (Fig. 4e). These data indicate that coincident mutations in DDR and ER signature genes could constitute an indicator of poor prognosis in ER+ breast tumors.

Discussion

Mutation load and cancer outcome association in breast cancer is unique

The results presented here indicate that in ER+ breast cancer high SML may contribute to poor breast cancer survival, contrary to previous reports in colorectal cancer. Our results suggest the hypothesis that ER+ tumors with mutations in both DDR and ER signature genes are inherently less dependent on ER signaling than ER-driven tumors. This hypothesis may also explain the dichotomous behavior between ER+ and ER− breast tumors with respect to mutation load. Therefore, tumors characterized by coincident mutations in DDR and ER genes may be resistant to current therapies, especially anti-estrogen-based therapies. To advance this field, it will be necessary to reinvestigate the effects of mutation load on ER− breast cancer as both the number of sequenced tumors and the length of patient follow-up in the TCGA sample set increase.

MMR gene mutations affect breast cancer survival

Large-scale studies like the TCGA have reported few new genes that have global impact on breast cancer prognosis or prediction. New discoveries will, therefore, most likely arise through pathway level, rather than gene level, analyses. In alignment with this idea, the HML subset of ER+ tumors described here is enriched for somatic mutations in MMR pathways, rather than individual genes.

While deleterious mutations in MMR genes have been identified in primary breast tumors as well as in adjacent neoplastic tissue [24, 25], we describe here a correlation between MMR genetic mutations and poor clinical outcome of patients with ER+ breast tumors. In contrast to our results, a recent publication analyzing mutational signatures of various cancers was unable to identify any correlation between MMR deficiency and mutational signature in breast cancer [26]. This discrepancy likely arose because this prior analysis examined all breast cancers as a single group instead of considering ER+ and ER− breast cancer individually. This highlights the importance of incorporating knowledge of tumor biology into analyses rather than relying on pure analytics alone.

Clinical significance of mutation load and sequencing strategies in breast cancer

Our results identify mutation load as a quantitative genomic phenotype, rather than a genotype, associated with clinical outcome. Using mutation load for prediction/prognosis enables easy, quantitative estimation, and may have a greater global impact on breast cancer clinical outcomes than many single genes which are currently considered important. Moreover, mutation load may be indicative of the increased potential of an ER+ breast tumor to quickly become resistant to endocrine therapy by mutating individual pathways that can be discovered through mutational analysis. Therefore, our discovery that high SML may serve as a marker for poor survival in a subset of breast tumors indicates that genome wide sequencing can offer important clinically relevant information for ER+ breast cancer.

Conclusions

Our data indicate a novel association between SML and clinical outcome in breast cancer. Our data also implicate somatic mutations in DDR pathway genes and in ER-related genes as predictive of poor clinical outcome for ER+ breast cancer. It is important to acknowledge the small number of samples and the short follow-up time in this dataset which warrant a larger study to ascertain the contribution of mutation load to clinical outcome. However, approximately one-third of the ER+ tumors used in this study were characterized as HML (>65 mutations). This indicates that a significant proportion of ER+ breast cancer patients could benefit from SML characterization of their tumors. As the cost of DNA sequencing steadily decreases [27], analysis of SML could become a reasonable and useful prognostic marker to help select patients with aggressive and/or endocrine-resistant ER+ tumors, who may benefit from aggressive therapy targeting non-hormonal pathways.