Background

Estrogen receptor-positive (ER+) and estrogen receptor-negative (ER-) breast cancer (BC) not only have markedly different clinical prognoses and treatment decisions [1,2,3], but also are molecularly and etiologically distinct. Several non-genetic risk factors that differ between ER + and ER- BC including age at first birth, nulliparity, obesity in younger women, and age of menarche have been identified in previous studies [4, 5]. Genetic differences of breast cancer by ER status were first characterized between BRCA1 and BRCA2-related tumors [6, 7], with BRCA1-related tumors predominantly lacking ER expression [8,9,10]. While rare mutations in some breast cancer susceptibility genes such as BRCA1 and BRCA2 with moderate to high penetrance have been shown to play a distinct role in breast cancer etiology by ER status [11,12,13], genome-wide association studies (GWAS) have additionally reported that common variants associated with breast cancer risk differ between ER subtypes [14,15,16,17,18,19,20,21]. Though these GWAS have indicated common variants may contribute to genetic differences between ER + and ER- BC, the genes by which these variants act through to impact the risk of developing each ER subtype have not been fully explored.

Transcriptome-wide association studies (TWAS) have recently emerged as an approach to explore how predicted expression of genes by common variants associates with various diseases by incorporating both summary statistics from GWAS and expression prediction models using information of expression quantitative trait loci (eQTLs) [22,23,24]. By training predictive models of gene expression on a reference panel of individuals for whom gene expression and single nucleotide polymorphism (SNP) data are available, TWAS studies bypass the need to obtain gene expression data for larger cohorts where only SNP data is available. As a result, TWAS studies have been shown to comparatively have more power than GWAS studies in identifying genes that potentially impact complex traits [25]. Most previous TWAS performed to study breast cancer have focused on overall breast cancer risk [26,27,28,29] with only a few studies that have explicitly explored ER- BC or stratify their analysis by ER status. These past studies that explore how gene associations differ by ER status were in European ancestry breast cancer TWAS [30, 31], or meta-analysis of multiple ancestry TWAS of breast cancer [32, 33]. In total, these studies have identified 22 genes that associate with ER- BC and 69 genes for ER + BC (Table S1 in Additional file 1).

Though the aforementioned studies established an initial framework for performing TWAS by ER status, there are several shortcomings. First, all these studies utilize expression prediction models to perform TWAS analyses, which do not explicitly account for alternative splicing events within each gene. Splicing is markedly important to include in subtype-specific TWAS analyses since dysregulation of alternatively spliced transcripts has been shown to play a considerable role in BC etiology in prior studies [34,35,36]. Second, these previous TWAS studies have been primarily focused on studying the association between BC risk and gene expression in breast and blood tissues. The effects of gene expression in tissues other than breast tissue on breast cancer risk have thus far not been addressed for either ER + or ER- tumors. Third, the expression prediction models used in these prior studies have been based on modest sample sizes, which limits their power to detect associations. In this study, we work to address these limitations by performing a comprehensive subtype-specific TWASs of ER- and ER + breast cancer that utilize information of splicing quantitative trait loci (sQTLs) in addition to eQTL information across multiple tissues from the latest version of GTEx models (v8) to increase our power to detect breast cancer susceptibility genes by ER status.

Methods

Subtype-specific GWAS summary statistics for women with European ancestry. We obtained ER + BC summary statistics that have been previously generated from 69,501 ER + BC cases and 105,974 controls who are women with European ancestry in the Breast Cancer Association Consortium (BCAC) [37, 38]. As described in previous studies, these summary statistics for ER + BC were obtained by using an inverse-variance fixed-effects meta-analysis of BCAC participants who were genotyped using the OncoArray, participants who were part of the Collaborative Oncological Gene-Environment Study (iCOGS), and participants from 11 other BC GWAS studies within BCAC. We obtained ER- summary statistics for a total of 30,882 ER- cases and 115,468 controls who are women with European ancestry by performing a meta-analysis of 21,468 ER- cases and 105,974 controls from BCAC and 9,414 BC cases and 9,494 controls from the Consortium of Investigators of Modifiers of BRCA1 and BRCA2 (CIMBA) [39] using an inverse variance-based approach with the METAL software package [40].

Expression and splicing prediction models. Both overall gene expression and intron excision proportion prediction models were previously built using genotyping and RNA-sequencing data from 49 tissues of European ancestry from the Genotype-Tissue Expression (GTEx) Project (v8) [41]. These models were built based on fine-mapping of cis-SNPs of each gene (or each intron) to select predicting SNPs and estimate the effect sizes of these selected SNPs by applying the multivariate adaptive shrinkage (MASH) [42] method to the marginal eQTL and sQTL effects across these 49 tissues. Specifically, building prediction models for a gene includes the following steps: (1) Only genes with cis-eQTLs with a false discovery rate of 5% in any tissues were selected. (2) Fine mapping was performed in each tissue in the corresponding cis gene region (±1 Mb of the gene) to select variants with minor allele frequency > 0.01 and posterior inclusion probabilities (PIPs) > 0.01 and genes with at least one credible set of PIP > 0.1, where the credible set PIP is sum of PIPs of variants in the set, were selected. Then in each credible set, only the variant with the highest PIP was kept. A union of selected variants across 49 tissues was obtained and linkage disequilibrium (LD) pruning was applied to the union of variants to remove redundant variants. (3) The MASH method [42] was applied to the marginal eQTL effects across the 49 tissues at the union of variants to jointly estimate effects of eQTLs. (4) The predicted expression of the gene in each tissue was calculated as the linear combination of genotypes multiplying by their estimated effect sizes of the selected variants. By a similar way as described for expression models, splicing prediction models were built for each intron in 49 tissues in GTEx (v8) samples of European ancestry. The only difference is to predict the (normalized) intron excision proportion instead of the gene expression by using cis-variants. From these prebuilt models, we specifically utilized models from 11 tissues relevant to breast cancer etiology for this study including female-specific tissues (breast, uterus, vagina, ovary), connective and fat tissues akin to those in breast (subcutaneous adipose, visceral adipose, and cultured fibroblasts), immune cell-related tissues (whole blood, EBV-transformed lymphocytes, and spleen), and liver.

Harmonizing SNPs between expression prediction models and GWAS summary statistics. To harmonize the variants utilized in expression prediction models with GWAS summary statistics, we utilized the ImpG-Summary [43] method to impute z-scores from the genotyping data from GTEx samples. ImpG-Summary assumes the distribution of z-scores for all SNPs at a locus approximately follows a normal distribution with 𝒁̴ 𝑁(𝟎,S) where S is the pairwise correlation matrix between all SNPs induced by linkage disequilibrium (LD); from these pairwise correlations, ImpG-Summary estimates the posterior mean of z-scores for unobserved SNPs. As input for ImpG-Summary, we utilized the correlation matrix estimated using GTEx genotyping data along with ER + and ER- summary statistics, separately.

Joint-tissue expression- and splicing-based TWAS analyses. In this study we utilized two joint-tissue TWAS approaches including (1) an expression-based approach and (2) a splicing-based approach. Firstly, for the expression-based TWAS approach, we performed a conventional TWAS for each gene using the S-PrediXcan software [22] separately in each of the 11 tissue types using eQTL-based prediction models. We then combined TWAS p-values for each gene across all 11 tissues using the aggregated Cauchy association test (ACAT) method [44]. The ACAT method calculates a test statistic TACAT using the following formula where pk is the p-value of the kth tissue type and wk is the weight of that tissue type: \({\sum }_{k=1}^{11}{w}_{k}\text{t}\text{a}\text{n}\left(\right(0.5-{p}_{k})\)π). In our study, we utilized equal weighting of tissues where wk = 1/11. The joint p-value of this test statistic is approximated by the equation \(\frac{1}{2}-(\text{a}\text{r}\text{c}\text{t}\text{a}\text{n}({\text{T}}_{\text{A}\text{C}\text{A}\text{T}})\))/π. For genes that did not have expression prediction models in all 11 tissue types, we appropriately changed the calculation of the ACAT test statistic to only include the tissue types that contained expression prediction models and modified the weighting to still be equal among the tissue type containing models for a gene. As a sensitivity analysis, we performed the ACAT analysis utilizing square root of sample size in the prediction models as the weights, and found highly similar results, with only 4 marginally significant genes additionally identified by the sample size-weighted ACAT. Thus, we kept the equal weighted results. While the (overall) gene expression measures the total expression of all isoforms of the gene, mRNA splicing (measured by intron excision ratio in a cluster) is complementary to information from total mRNA expression levels. Intron excision ratios measure the proportions of RNA-seq read counts aligned to specific excised introns in the total read counts aligned to a cluster. Hence, we additionally implemented an intron splicing-based TWAS to test association of each excised intron using S-PrediXcan with sQTL-based prediction models for introns. After performing these splicing-based TWAS analyses for individual excised introns, we then utilized ACAT to combine the p-values from all excised introns in each gene within each of our tissues to calculate a gene-based p-value in each tissue; we then performed an additional, second-step ACAT to collate gene-based p-values to obtain a joint p-value across all tissues. By implementing this splicing-based approach we may be able to identify genes that could be missed by expression-based TWAS.

Conditioning our TWAS analyses on nearby GWAS variants. To determine whether any genes we detected in our expression- and splicing-based TWASs were independent of previously reported GWAS signals, we performed both TWAS analyses while conditioning on genome-wide significant index SNPs (p-value < 5E-8). Specifically, we conditioned the effect sizes of SNPs (eQTL or sQTL) used in expression and splicing prediction models on nearby GWAS significant index variants within +/- 2 Mb of the transcription start or stop sites of each gene. We then utilized the conditional and joint multiple-SNP (COJO) analysis [45] method to compute the GWAS index variant-adjusted effects of each eQTL and sQTL. We then performed our expression- and splicing-based TWAS analyses with these conditioned eQTL and sQTL effect sizes similar to that described in the preceding section. Additionally, for selected genes, we queried the NHGRI-EBI GWAS Catalog [46] to identify other previously reported GWAS SNPs within the same loci, and then utilized LDpop to examine the correlation between these index SNPs and the SNPs used in our eQTL and sQTL prediction models [47].

Conditioning GWAS variants on TWAS-identified genes. To examine whether previous reported GWAS index variants affect breast cancer risk through genes identified in our TWAS analysis, we utilized the COJO analysis [45] method to calculate adjusted odds ratios of GWAS index variants in association with breast cancer risk, after adjusting for eQTL and sQTL used in gene or splicing predictions within the same locus. This is similar to mediation analysis, in which the adjusted odds ratios are the direct effect and the unadjusted odds ratios are the total effect. We calculated proportion mediated for each GWAS variant using the formula \(\frac{direct\ effect \times (indirect\ effect-1)}{total\ effect-1}\) to indicate the extent to which the index variant’s total effect is accounted for by the nearby TWAS genes. We considered a GWAS index variant as being mediated by nearby genes if the variant had a proportion mediated > 0.5 and was no longer genome-wide significant in the COJO analysis (adjusted p-value > 5.0E-8).

Gene-based fine-mapping of genes and intron excision events. LD may result in false positive BC susceptibility genes being identified that are correlated with true causal genes. To identify candidate causal genes for ER + and ER- BC by accounting for LD structure, we performed gene-based statistical fine-mapping at the level of gene-trait associations using the Fine-mapping of Causal Gene Sets (FOCUS) software [48]. FOCUS controls for the genetic correlation and other pleiotropic effects induced by both LD structure and expression prediction model weights. We separately input weights for eQTLs and sQTLs used in the prediction models, an LD reference panel we computed using GTEx genotyping data, and subtype-specific summary statistics into FOCUS separately for each of the 11 tissue types. For each LD block in the genome, we identified credible sets of genes that contained the causal genes and credible sets of intron excision events using a confidence level of 90%. We additionally computed marginal PIPs in each of the 11 tissues for each gene (or each intron) within each region to be causal given the observed TWAS signals in each tissue. We classified TWAS genes that met a PIP threshold of 0.9 as candidate causal genes.

Gene set and functional annotation enrichment analysis: To determine gene sets with annotated biological pathways and other functional categories, we performed an enrichment analysis of protein-coding and long non-coding RNA (lncRNA) genes, separately for ER + and ER- breast cancer, using the GENE2FUNC method within the Functional Mapping and Annotation of Genome-wide association studies (FUMA) software package [49]; as the background gene set for testing enrichment of gene sets, we specified 33,527 protein-coding and lncRNA genes. We utilized a multiple testing threshold of an FDR-adjusted p-value < 0.05 for reporting significantly enriched gene sets.

Results

Two joint-tissue TWAS approaches: an expression-based and a splicing-based approach. We utilized expression and splicing prediction models trained in 11 tissue types obtained from European individuals from the GTEx v8 dataset; sample sizes ranged from 129 to 670 and had weights refined using a multivariate adaptive shrinkage (MASH) method. In our TWAS analyses, we tested 19,288 genes across the 11 tissues with expression prediction models including 14,615 genes in breast tissue alone, and 14,527 genes with intron splicing prediction models including 10,928 genes in breast tissue alone.

Using either the expression- or splicing-based approaches, we discovered that 230 genes were significantly associated with ER + BC at Bonferroni significance thresholds (expression-based p-value < 2.6E-6 or splicing-based p-value < 3.4E-6) (Table S2 in Additional file 1). If only considering eQTL in breast tissue, 72 (30%) genes were identified, and the number increased to 170 in multi-tissue expression TWAS, while 60 (26%) genes were identified only using the splicing-based approach (Fig. 1). Of the 230 genes associated with ER + BC, 43 genes have not been reported in any previous TWAS (Table 1). Among previously reported TWAS genes, 30 were identified in ER + TWAS studies and 5 in ER- TWAS studies (S1 in Additional file 1). Of the 230 ER + genes, 228 were located in 85 known GWAS loci, while one protein-coding gene (FAM72C) and one pseudogene (FCGR1CP) located at the 1q21.1, L1 locus were at least 1.4 Mb away from any previously identified GWAS risk variants (p < 5E-8) and were also not in LD with any risk variants (S3 in Additional file 1). Overall, among the 113 previously reported GWAS loci for ER + BC, 72 loci had a gene identified in our ER + TWAS (S3 in Additional file 1).

Figure 1
figure a

Number of genes identified cumulatively using each TWAS approach. Abbreviations: TWAS, transcriptome-wide association study; “Breast Exp”, genes identified using the expression-based TWAS approach in breast tissue only; “Breast Exp + 11 Tissue Exp”, genes identified using the expression-based TWAS approach in either breast tissue only or jointly across tissues; “Breast Exp + 11 Tissue Exp + Breast Splice”, genes identified using the expression-based TWAS approach in either breast tissue only or jointly across tissues and the splicing-based approach in breast tissue only; “Breast Exp + 11 Tissue Exp + Breast Splice + 11 Tissue Splice”, genes identified using the expression-based TWAS approach in either breast tissue only or jointly across tissues and the splicing-based approach in either breast tissue only or jointly across tissues

Table 1 Genes significantly associated with estrogen receptor (ER)-positive breast cancer that have not been identified in previous TWAS of breast cancer

Furthermore, we discovered 66 genes that were significantly associated with ER- BC at Bonferroni significance thresholds (Table S2 in Additional file 1). If only considering eQTL in breast tissue, 27 (41%) genes can be identified, and the number increased to 53 in multi-tissue expression TWAS, while 13 (20%) genes were identified exclusively using the splicing-based approach (Fig. 1). Of the 66 genes associated with ER- BC, 24 genes have not been reported in any previous TWAS (Table 2), and among previously reported TWAS genes, 11 were identified in ER- TWAS studies and 6 in ER + TWAS studies (Table S1 in Additional file 1). The 66 genes we identified in the ER- TWASs were all located in 29 known BC GWAS loci. Among the 45 previously reported GWAS loci for ER- BC, 24 loci contained a gene identified in our ER- TWAS (Table S4 in Additional file 1).

Table 2 Genes significantly associated with estrogen receptor (ER)-negative breast cancer that have not been identified in previous TWAS of breast cancer

Overall, a vast majority of genes identified using either splicing- or expression-based approaches was unique to either the ER+ (204 genes) or ER- (40 genes) subtype with only 26 genes being associated with both ER + and ER- BC (Fig. 2). Several noteworthy genes that specifically were associated with ER + BC include FGFR2 and CHEK2. Several noteworthy genes that were associated with ER- BC include TP53 and its regulator MDM4. Interestingly, we discovered that TERT in the 5p15.33 locus was associated with both ER- and ER + BC in our TWAS, while TERT was previously believed to be a predominantly ER- locus in GWAS. As shown in Table S5 (in Additional file 1), SNPs annotating to TERT generally have larger effect sizes for ER- BC than those for ER + BC. Furthermore, we discovered that TOX3, which has been historically viewed as a gene implicated in ER + and overall BC risk (with an index SNPs having ORs generally above 1.2 and below 0.8 for ER + BC), was associated with both ER + and ER- BC risk in our study (Table S5 in Additional file 1). The 270 TWAS-identified genes of ER + and ER- genes were mapped across genome in the context of known GWAS loci of ER + and ER- breast cancer, showing that our TWAS study identified genes in most of the GWAS loci (Fig. 3).

Figure 2
figure b

Venn diagram of TWAS genes identified for ER + and ER + breast cancer using each TWAS approach. Abbreviations: TWAS, transcriptome-wide association study; ER+, estrogen receptor-positive; ER-, estrogen receptor-negative

Figure 3
figure c

Ideogram of the 270 TWAS-identified genes for ER + and ER- breast cancer in the context of known GWAS loci of breast cancer. Abbreviations: TWAS, transcriptome-wide association study; GWAS, genome-wide association study; ER+, estrogen receptor-positive; ER-, estrogen receptor-negative

Joint TWAS analyses conditioned on BC GWAS index variants. To test whether the association between genes identified in our TWAS analyses and breast cancer risk could be explained by nearby GWAS variants, we conditioned the effect sizes of eQTLs and sQTLs on nearby index SNPs separately for ER + and ER- BC prior to performing our TWAS analyses. After conditioning on nearby index SNPs, we discovered that 27 genes in 17 loci remained Bonferroni-significant for ER + BC and 9 genes in 6 loci remained Bonferroni-significant for ER- BC (Table 3). Among these conditionally significant genes, only two genes in two loci (CCDC91 at 12p11.22, L1 and TOX3 at 16q12.1-q12.2) were identified for both ER + and ER- BC. Interestingly, CHEK2, a gene previously reported to impact BC risk via rare and highly penetrant mutations, was identified to be conditionally significant for ER + BC (Table 3); moreover, the SNPs used in CHEK2 expression prediction models generally had little to no correlation with other GWAS index SNPs at the same 22q12.1-q12.2 locus (Table S6 in Additional file 1).

Table 3 Genes significantly associated with estrogen receptor (ER)-positive and/or ER-negative breast cancer after conditioning on nearby GWAS SNPs for breast cancer risk

GWAS variants conditional on TWAS-identified genes. To examine whether previous reported GWAS index variants affect breast cancer risk through our TWAS-identified genes, we conducted association analysis after conditional on eQTL and sQTL for GWAS variants in loci with both TWAS and GWAS signals. We found 126 ER + variants had a proportion mediated > 0.5 and were no longer genome-wide significant (Table S3 in Additional file 1), suggesting that these variants may affect ER + breast cancer risk via expression regulation of nearby TWAS genes. There were 131 ER + variants still significant in the adjusted analysis; for example, several index SNPs in the 22q12.1-q12.2 locus remained significant after adjusting for nearby genes including CHEK2. Of ER- GWAS index variants, effects of 40 variants were found to be mediated by nearby genes, while 38 were not (Table S4 in Additional file 1).

Gene-based fine-mapping of associations. To identify candidate causal genes in each tissue we performed fine-mapping of gene-trait associations separately using expression and intron splicing prediction models, LD matrices, and subtype-specific GWAS summary statistics. Of the 230 genes in 86 loci associated with ER + BC, we discovered that 133 genes in 75 loci had a PIP above 0.9 in at least one of the 11 tissues, where 104 genes were identified by gene expression-based fine-mapping, 58 by intron splicing-based fine-mapping, and 29 by both fine-mapping analysis (Table S7 in Additional file 1). Of the 66 genes in 29 loci associated with ER- BC, we found that 44 genes in 26 loci had a PIP > 0.9 in a least one tissue, where 32 genes were identified by gene expression fine mapping, 24 by intron splicing fine mapping, and 12 by both types of fine mapping (Table S8 in Additional file 1).

Enrichment of gene sets and functional annotations. Among the genes we discovered that were associated with ER + BC, 191 were protein-coding genes and 31 were lncRNAs. Additionally, among the genes we discovered were associated with ER- BC, 59 were protein coding genes and 6 were lncRNAs. We explored the enrichment of these lncRNA and protein-coding genes identified in the ER + and ER- TWASs using FUMA package [49]. The ER + genes were enriched for gene sets including cell cycle regulation, gland development, body fat distribution, multiple other types of cancers (e.g. lung, prostate, pancreatic, esophageal), and dysregulated immune system (Crohn’s disease, allergies) (Table S9 in Additional file 1). These results suggest that the effects of many of these ER + genes may be mediated through lifestyle factors, as well as exert pleiotropic effects on other types of cancers. In contrast, ER- genes were enriched for gene sets including numerous apoptosis pathways (caspase activation, Trail signaling, c-FLIP regulation) and other types of cancers (e.g., glioblastoma, non-glioblastoma glioma, esophageal cancer, leukemia) (Table S10 in Additional file 1). In addition, we observed that ER + and ER- genes identified in our study tended to be upregulated in female reproductive tissues such as cervix, uterus, and ovary, though this enrichment for upregulation was not statistically significant for ER- genes (Figure S1 and S2 in Additional file 2). Taken together, these enrichment results suggest that future experimental studies of how ER- genes play a role in BC etiology by impacting apoptosis are warranted, as well as additional how genes in both subtypes mechanistically exert effects in female reproductive tissues.

Discussion

In this study of ER- and ER + breast cancer, we employed an expression-based TWAS approach utilizing models trained on overall gene expression and a splicing-based TWAS approach using models trained on intron excision events. We applied both TWAS approaches in breast tissue only, as well as jointly across 11 tissues potentially related to breast cancer development. In total, we identified 66 genes in 29 loci that were associated with ER- breast cancer and 230 genes in 86 loci that were associated with ER + breast cancer at a Bonferroni threshold of significance. In general, we observed modest consistency between our findings and genes reported in previous TWAS studies (Table S1, in Additional file 1). Among the 66 genes associated with ER- BC, 11 were reported in previous ER- TWASs and 6 in ER + TWASs. Among the 230 genes associated with ER + BC, 30 had been previously reported in TWASs of ER + BC and 5 in ER- TWASs (Table S2, in Additional file 1).

Our TWASs of ER- and ER + breast cancer identified more genes compared to previously conducted TWAS studies of breast cancer subtypes (22 ER- and 69 ER + in all previous TWAS combined). One factor leading to this higher number of identified genes was that our study had increased statistical power compared to previous studies by utilizing the latest GTEx v8 prediction models that were trained on much larger sample sizes (up to 670 participants per tissue type) [50] compared to previous TWAS studies which primarily utilized prediction models trained only on 67 GTEx breast samples [26, 30]. In addition, by utilizing predictive models from multiple tissue types, we were able to identify candidate genes that may play a role in the etiology of ER- and ER + breast cancer in tissues other than breast tissue. Given that dysregulated gene expression in tissues other than breast tissue have been associated with breast carcinogenesis through either having similar characteristics as cell types in breast – such as adipose tissue resembling adipocytes in breast – and/or having a direct impact on breast cancer etiology such immune cell-related tissues, it is no surprise that utilizing predictive expression models from multiple tissues allowed us to identify more genes than that using only breast tissue [51,52,53]. Furthermore, unlike previous TWAS studies which have traditionally utilized models to predict overall gene expression, our splicing-based approach additionally incorporates the predicted expression of excised introns to draw associations between gene expression and breast cancer.

Furthermore, a vast number of genes we discovered in this study were uniquely associated with either ER + or ER- BC including 40 genes that uniquely associated with ER- and 204 genes that uniquely associated with ER + BC. In particular, one gene we discovered that was only associated with ER + BC was FGFR2, a receptor tyrosine kinase that is involved in cell growth and proliferation [54] that has been shown to contain several variants that associated with an increased risk of both ER + and overall BC [55,56,57]. We identified FGFR2 to be significant via both expression- and splicing-based TWAS approaches and tissue with the max PIP was cultured fibroblasts (PIP = 1), indicating FGFR2 likely causally impacts breast cancer risk through its expression in fibroblasts. After conditioning on a nearby GWAS index variant rs10510097, these associations were no longer statistically significant. Moreover, one of the SNPs used to predict FGFR2 expression across many GTEx tissues, rs1863744, was in moderate LD with rs10510097 (R2 = 0.40). This finding suggests that SNPs in LD with the GWAS SNP rs10510097 may partially contribute to ER + BC etiology through altered expression of FGFR2. Another interesting gene we identified that uniquely associated with ER- BC was TNFSF10. This gene was first implicated in a GWAS of ER- BC in women with African ancestry [58]; later TNFSF10 was also implicated in a GWAS of overall breast cancer risk in women with European ancestry, but with a different risk variant [37]. It has been previously shown experimentally that editing the risk variant rs13074711, which is associated with ER- BC, leads to altered expression of TNFSF10 and subsequent IFN-β-induced apoptosis, suggesting that the signal from TNFSF10 at the 3q26.21 locus may play a role in BC risk through immune defense mechanisms. However, since our study identified several other tissues including visceral adipose, EBV-transformed lymphocytes, and whole blood that had PIPs of 0.949 when fine-mapping eQTLs for TNFSF10; we suggest further research is warranted to explore the effects of TNFSF10 expression on ER- BC risk in these additional tissues. Overall, this study’s identification of many subtype-specific TWAS genes, in combination with the minimal overlap in genes identified for each subtype, supports the notion that the genetic etiologies of these two breast cancer subtypes are mostly distinct.

In our study, we discovered several genes including CHEK2 and TP53 that have historically been thought to contribute to breast cancer risk through rare, moderately or highly penetrant mutations and have thus far not been identified in other TWAS. CHEK2 is an enzyme involved in apoptosis in response to double stranded DNA damage [59] that was associated only with ER + BC in both our expression- and splicing-based TWAS. Protein-truncating or rare missense variants in coding regions of CHEK2, especially 1100delC (rs555607708), have been previously reported to have moderate penetrance for overall breast cancer risk and ER + breast cancer in familial studies [60] and population based studies [61,62,63]. We noted that the 12 SNPs used to predict CHEK2 expression across the 11 tissues had little to no correlation with rs555607708 (r2 < 0.009), as well as with the other 18 reported GWAS index SNPs at the 22q12.1-q12.2 locus (Table S6 in Additional file 1) in the NHGRI-EBI GWAS Catalog [46]. Also these GWAS index SNPs remained statistically significant after adjusting for eQTL/sQTL of nearby genes including CHEK2 (Table S3 in Additional file 1). Together, the discovery of CHEK2 in our ER + TWAS, alongside the lack of overlap and correlation between SNPs used in expression prediction models and GWAS index variants, suggest that common variants which modulate CHEK2 expression likely play a role in breast cancer risk, and these common variants are etiologically distinct from those identified in previous GWAS studies. These findings strongly support the notion that there may be a previously unidentified, polygenic basis by which CHEK2 expression contributes to the risk of developing ER + breast cancer.

While rare protein-truncating and missense mutations in TP53 have been shown to exhibit moderate to high penetrance for both ER + and ER- breast cancer, they have only been implicated in prior GWAS but not TWAS studies of breast cancer [14, 38, 61, 63]. Even though TP53 has been recognized as the most common somatically mutated gene in ER- breast tumors [64], our expression-based TWASs of ER- breast cancer for the joint analysis of 11 tissues as well as breast tissue alone were the first TWAS to identify TP53. In addition, the PIP of 0.87 in breast and 0.87 in cultured fibroblasts for TP53 were relatively high, which are corroborated by numerous experimental studies showing that the expression of TP53 in these two tissues impacts breast cancer etiology [65, 66]. Moreover, the model SNP rs78378222 which was used to predict overall TP53 expression, as well as of excised introns, across breast and several other tissues has been previously reported to associate with breast cancer risk [67]. These findings indicate that while missense and protein-truncating variants in TP53 have historically been observed to impact breast cancer risk, common variants which modulate TP53 expression can additionally contribute to the development of ER- breast cancer.

Our study identified 26 genes that were associated with both ER- and ER + subtypes, including TERT and TOX3 which have not been reported in prior TWAS studies; both genes were identified using our expression-based approaches. Though TERT has been primarily regarded as an ER- gene since several GWAS variants in TERT have been associated with ER- BC [16, 68], our study shown that the predicted expression of TERT was also significantly associated with ER + breast cancer risk. As these TWAS signals for both subtypes were no longer significant in our COJO analysis, the observed signals may be explained by nearby GWAS index SNPs. On the other hand, TOX3 has generally been regarded as a gene associated with ER + breast cancer since higher mRNA and protein expression of TOX3 have been observed in ER + BC cell lines compared to ER- cell lines, as well as larger SNP effect sizes for the risk of developing ER + compared to ER- BC [69,70,71]. We additionally discovered TOX3 expression significantly associated with ER- BC in our study. For both ER+/- BC, TOX3 had a PIP of 1 in breast mammary tissue, supporting the notion that TOX3 causally impacts breast cancer etiology.

While our study was robust in replicating many previously identified susceptibility loci, our approach had several limitations. First, even though our splicing-based TWAS allowed us to combine the p-values for the associations between different intron excision events, this method does not explicitly account for the direction of association between each intron excision event and BC risk. Future work into refining the splicing-based approach to account for directionality prior to combining p-values from alternatively spliced transcripts may help increase the power to detect breast cancer associated genes. Additionally, given that our study incorporated a meta-analysis and predictive expression models that both utilized primarily individuals with European ancestry, our findings may not be portable to individuals with non-European ancestral backgrounds. It has been shown that molecular subtypes, gene expression, and germline/somatic variants in breast cancer patients differ significantly between racial groups [72], and it is imperative that future TWAS studies include individuals with diverse ancestral backgrounds.

Conclusions

In summary, our study identified many genes that are associated with ER + and ER- BC that have not been previously identified in TWAS by utilizing two joint-tissue TWAS approaches. More importantly, most of the genes for ER + and ER- breast cancer are distinct. We also discovered one novel loci for ER + BC. Interestingly, though several breast cancer susceptibility genes including TP53 and CHEK2 have been historically thought to play a role in breast cancer through rare, highly penetrant mutations in coding regions, our study provides evidence that common variants in these genes which modulate expression also impact breast cancer etiology. Taken together, utilizing a comprehensive combination of expression- and splicing-based methods can help improve our understanding of breast cancer genetics. Functional characterization of these candidate genes, in particular genes significant in the fine-mapping analysis, could shed some light on the etiology of ER + and ER- breast cancer as well as provide targets for treatment of breast cancer. eQTLs and sQTLs that are associated with expression of these candidate genes may be used in building polygenic risk prediction models to assess ER + and ER- breast cancer risk separately and to guide risk-adaptive breast cancer screening.