Background

Breast cancer is a complex disease involving interplay between lifestyle/environmental and genetic risk factors. Risk factors such as parity, breastfeeding, age at menarche, age at first full-term pregnancy, body mass index (BMI), height, mammographic density, exogenous hormone use, and alcohol consumption are well established [1,2,3,4,5,6,7]. Through continued collaborative efforts such as the Collaborative Oncological Gene-environment Study (COGS) and the OncoArray project [8], more than 200 common single nucleotide polymorphisms (SNPs) associated with risk of breast cancer have been identified [9,10,11].

Traditional genome-wide association study (GWAS) analyses assess the marginal effects of variants and may miss variants that show an effect only within certain strata of the population. Such potential gene–environment interactions, in which SNPs are associated with disease risk in conjunction with lifestyle/environmental risk factors, can be investigated through genome-wide gene–environment interaction studies (GEWIS) [12,13,14,15].

Very few genome-wide studies of gene–environment (G×E) interactions in breast cancer have been conducted to date, and three of these focused on the use of menopausal hormonal therapy as the single environmental risk factor [16,17,18]. An exploratory analysis of G×E interactions examined ten environmental risk factors and 71,527 SNPs selected from prior evidence, using data from approximately 35,000 cases and controls in the Breast Cancer Association Consortium (BCAC). That study identified two potential G×E interactions associated with breast cancer risk [19]. In the present study, we performed a comprehensive genome-wide analysis of gene–environment interactions for risk of overall breast cancer, as well as estrogen receptor-positive (ER+) breast cancer, using data from 72,285 cases and 80,354 controls participating in the BCAC.

Methods

Study sample

Analyses were conducted using data from 46 studies (16 prospective cohorts, 14 population-based case–control studies, and 16 non-population-based studies) participating in the BCAC. We excluded participants if they were genotypically male, of non-European descent, or had in situ breast cancer or a breast tumor of unknown invasiveness. Women with prevalent breast cancer at the time of recruitment or with unknown reference age (defined as age at diagnosis for cases and age at interview for controls) were also excluded from the analyses. Further, studies with fewer than 150 cases and 150 controls for the risk factor under evaluation were excluded from those analyses. Each participating study obtained informed consent from the participants and was approved by its local ethics committee.

Risk factor data

Risk factor data from individual studies were checked for quality using a multi-step harmonization process based on a common data dictionary. Time-dependent risk factor variables were derived with respect to the reference date, defined as date of diagnosis for cases and date of interview for controls. Analyses were conducted with the following risk factors among all women: age at menarche (per 2 years), parity (per 1 birth), adult height (per 5 cm), ever use of oral contraceptives (yes/no), and current smoking (yes/no). The analysis of age at first full-term pregnancy (per 5 years) was conducted among parous women only, and that of body mass index (BMI, per 5 kg/m²) was conducted among postmenopausal women only. Menopausal status was either self-reported or assigned as postmenopausal if the reference age was greater than 54 years.

Genetic data

All samples were genotyped using either the iCOGS [20, 21] or the OncoArray [9, 10, 22] array. Briefly, iCOGS is a customized iSelect SNP genotyping array consisting of ~211,000 SNPs [20, 21], whereas OncoArray includes ~533,000 SNPs, of which nearly 260,000 were selected as a GWAS backbone (Illumina HumanCore) [22]. Detailed information is provided elsewhere [9, 10, 20, 21, 22]. Data were imputed to the 1000 Genomes Reference Panel (phase 3 version 5). Overall, 28,176 cases and 32,209 controls of European ancestry who were genotyped using the iCOGS array, and 44,109 cases and 48,145 controls who were genotyped using the OncoArray, were included in this analysis.

Genetic variants with imputation quality score < 0.5 in iCOGS or < 0.8 in OncoArray, or with minor allele frequency < 0.01, were excluded from the analyses. Variants in known breast cancer regions were also excluded from the analysis since interactions between known susceptibility variants and risk factors have been explored previously [23, 24]. After applying all exclusions, 7,672,870 genetic variants (SNPs and indels) were included in the analysis.
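The exclusion rules above amount to a simple per-variant filter. The following is a minimal sketch under stated assumptions, not the pipeline used in the study: the `Variant` record, its field names, and the boolean flag for known breast cancer regions are all hypothetical. Only the three numeric thresholds come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds quoted in the text; everything else is illustrative.
MIN_INFO_ICOGS = 0.5       # imputation quality cutoff for iCOGS
MIN_INFO_ONCOARRAY = 0.8   # imputation quality cutoff for OncoArray
MIN_MAF = 0.01             # minor allele frequency cutoff


@dataclass
class Variant:
    """Hypothetical per-variant summary record (field names are assumptions)."""
    variant_id: str
    maf: float
    info_icogs: float
    info_oncoarray: float
    in_known_region: bool  # assumed to be precomputed elsewhere


def passes_qc(v: Variant) -> bool:
    """Return True if the variant survives all exclusion criteria."""
    if v.info_icogs < MIN_INFO_ICOGS or v.info_oncoarray < MIN_INFO_ONCOARRAY:
        return False  # poorly imputed on at least one array
    if v.maf < MIN_MAF:
        return False  # too rare
    if v.in_known_region:
        return False  # known susceptibility regions were analysed previously
    return True


def filter_variants(variants: Iterable[Variant]) -> List[Variant]:
    """Keep only the variants that pass every filter."""
    return [v for v in variants if passes_qc(v)]
```

How a "known breast cancer region" is delimited is not specified in the text, so the sketch takes that flag as given rather than deriving it.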

Statistical analysis

Unconditional logistic regression was employed to assess the associations of SNPs and risk factors with breast cancer risk. Genotypes were assessed using the expected number of copies of the alternative allele (‘dosage’) as the covariate under a log-additive model. Interactions between genetic variants and risk factors were tested by comparing the fit of logistic regression models with and without an interaction term using likelihood ratio tests. All models were adjusted for reference age, study, and ten ancestry-informative principal components. To account for potential differential main effects of risk factors by study design, all models included an interaction term between risk factor and an indicator variable for study design (population-based vs. non-population based). Analyses with current smoking were further adjusted for former smoking.
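Written out, the model comparison described above takes roughly the following form. The notation is introduced here for illustration and is not taken from the source: $D$ is case status, $G$ the allele dosage, $E$ the risk factor, $S$ the study-design indicator, and $\mathbf{Z}$ the adjustment covariates (reference age, study, and the ten principal components). A single interaction term is assumed, so the likelihood ratio statistic is referred to a chi-square distribution with one degree of freedom.

```latex
\begin{align}
M_0:\ \operatorname{logit}\Pr(D = 1) &= \beta_0 + \beta_G G + \beta_E E
    + \beta_{ES}\,(E \times S) + \boldsymbol{\gamma}^{\top}\mathbf{Z}, \\
M_1:\ \operatorname{logit}\Pr(D = 1) &= \beta_0 + \beta_G G + \beta_E E
    + \beta_{GE}\,(G \times E) + \beta_{ES}\,(E \times S)
    + \boldsymbol{\gamma}^{\top}\mathbf{Z}, \\
\Lambda &= 2\,\bigl[\ell(M_1) - \ell(M_0)\bigr] \;\sim\; \chi^2_1
    \quad \text{under } H_0:\ \beta_{GE} = 0 .
\end{align}
```

Under this parameterization, $\exp(\beta_{GE})$ would correspond to the interaction odds ratio (ORint) reported in the Results, per allele and per unit of the risk factor.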

Analyses were performed separately for overall and ER+ breast cancer risk, and separately by genotyping array. Array-specific results were combined using METAL [25]. Quantile–quantile (Q-Q) plots were examined to assess the consistency of the distribution of P values with the null distribution. An interaction P value less than 5E−07 was considered suggestive evidence of interaction. We also calculated Bayesian False Discovery Probabilities (BFDP) for all suggestive interactions, assuming a prior probability of 1 × 10⁻⁵ of a true association for each SNP-risk factor pair. G×E interactions with BFDP < 15% were considered noteworthy [26]. For noteworthy SNP-risk factor pairs, we also evaluated the G×E interaction for ER-negative breast cancer risk and conducted stratified analyses by categories of the risk factor. All analyses were conducted using R version 3.5.1.
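The two summary steps above, combining array-specific estimates and computing a BFDP, can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the text does not say which METAL weighting scheme was used, so fixed-effect inverse-variance weighting is assumed; the BFDP uses Wakefield's approximate Bayes factor, and the prior standard deviation of the log odds ratio (`prior_sd`) is a hypothetical choice that the text does not report. All numeric inputs in the usage lines are made up.

```python
import math


def se_from_ci(lower: float, upper: float) -> float:
    """Standard error of a log odds ratio recovered from a 95% CI on the OR scale."""
    return (math.log(upper) - math.log(lower)) / (2 * 1.96)


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance combination of per-array log odds ratios."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def two_sided_p(beta: float, se: float) -> float:
    """Two-sided normal-approximation P value for a Wald z statistic."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))


def bfdp(beta: float, se: float, prior_prob: float, prior_sd: float) -> float:
    """Bayesian false discovery probability from an approximate Bayes factor.

    prior_prob is the prior probability of a true association (1e-5 in the text);
    prior_sd is the assumed prior SD of the log odds ratio (an assumption here).
    """
    v, w = se ** 2, prior_sd ** 2
    z2 = (beta / se) ** 2
    # Approximate Bayes factor in favour of the null hypothesis.
    abf = math.sqrt((v + w) / v) * math.exp(-0.5 * z2 * w / (v + w))
    prior_odds_null = (1.0 - prior_prob) / prior_prob
    return abf * prior_odds_null / (abf * prior_odds_null + 1.0)


# Usage with hypothetical per-array interaction estimates (log OR, SE).
beta, se = inverse_variance_meta([-0.07, -0.05], [0.016, 0.014])
print(math.exp(beta), two_sided_p(beta, se))
print(bfdp(beta, se, prior_prob=1e-5, prior_sd=math.log(1.5) / 1.96))
```

Because the BFDP depends on the prior standard deviation, values computed this way need not reproduce the figures reported in the Results.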

We estimated the overall genome-wide contribution of G×E associations for each risk factor to the familial relative risk of breast cancer using LD score regression [27]. The analysis used the G×E interaction summary statistics and was restricted to HapMap3 SNPs with MAF > 5% in the European population of the 1000 Genomes Project. Under the log-additive model, the G×E heritability on the frailty scale can be estimated by $h_f^2 = h_{obs}^2 \times \mathrm{var}(X) / [P(1-P)]$, where $h_{obs}^2$ is the observed heritability given by LD score regression, $\mathrm{var}(X)$ is the variance of the risk factor under evaluation, and $P$ is the proportion of cases in the sample. The proportion of the familial relative risk (FRR) of breast cancer due to G×E interactions is then given by $h_f^2 / (2 \log \lambda)$, where $\lambda$ is the familial relative risk to first-degree relatives of cases (assumed to be 2) [28].
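The two conversions in the preceding paragraph are simple enough to check by hand. The sketch below implements them directly, reading the first formula as dividing by P(1 − P) and taking log as the natural logarithm. The function names are hypothetical, and the observed-scale heritability and risk-factor variance in the usage lines are invented for illustration.

```python
import math


def frailty_scale_h2(h2_obs: float, var_x: float, case_fraction: float) -> float:
    """Convert observed-scale G×E heritability to the frailty scale."""
    return h2_obs * var_x / (case_fraction * (1.0 - case_fraction))


def frr_proportion(h2_frailty: float, lam: float = 2.0) -> float:
    """Proportion of the familial relative risk explained, given FRR lambda."""
    return h2_frailty / (2.0 * math.log(lam))


# Usage: the case fraction uses the overall sample sizes from the text;
# h2_obs and var_x are made-up inputs.
p_cases = 72285 / (72285 + 80354)
h2_f = frailty_scale_h2(h2_obs=0.002, var_x=1.0, case_fraction=p_cases)
print(h2_f, frr_proportion(h2_f))
```

With λ = 2, the denominator 2 log λ is about 1.39, which agrees with the overall frailty-scale heritability of ~1.4 quoted in the Discussion.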

Results

Studies included in the analysis are summarized in Additional file 1: Table S1. The number of cases and controls in each analysis varied from 61,617 cases and 74,698 controls for parity to 48,276 cases and 60,587 controls for current smoking (Additional file 1: Table S2). Consistent with the literature, increasing age at first full-term pregnancy, higher adult height, ever use of oral contraceptives, and current smoking were associated with increased overall breast cancer risk, whereas increasing age at menarche, being parous, increasing number of full-term pregnancies, and breastfeeding were associated with decreased breast cancer risk (Additional file 1: Table S3).

The genome-wide analysis of interactions with seven environmental risk factors yielded two SNP-risk factor pairs at BFDP < 15%, one for risk of overall breast cancer and one for ER+ breast cancer risk (Table 1, Figs. 1 and 2, Additional file 1: Figure S1A-S1B). No inflation in the test statistics was observed for either of the environmental risk factors. The heritability on the frailty scale of breast cancer risk explained by G×E interaction is shown in Additional file 1: Figure S2. The estimated proportion of the frailty scale heritability explained by G×E interactions was very low for all factors, being highest for age at first full-term pregnancy (~1.5% for both overall and ER+ breast cancer risk), age at menarche and postmenopausal BMI.

Table 1 Genetic variants with suggestive (Pint ≤ 5E−07) G×E interactions for overall and estrogen receptor-positive (ER+) breast cancer risk
Fig. 1 Manhattan plot of genome-wide interactions of adult height on overall breast cancer risk. The genome-wide significance threshold of P < 5 × 10⁻⁸ is indicated by the dashed black line. Genome-wide significant findings are highlighted in blue.

Fig. 2 Manhattan plot of genome-wide interactions of age at menarche on ER+ breast cancer risk. The genome-wide significance threshold of P < 5 × 10⁻⁸ is indicated by the dashed black line. Genome-wide significant findings are highlighted in blue.

For overall breast cancer risk, there was evidence of interaction between SNP rs80018847 and adult height (ORint = 0.94, 95% CI 0.92–0.96, Pint = 4.34E−08, BFDP = 11%) without an apparent marginal effect of the rs80018847 variant (ORmarg = 1.00, 95% CI 0.98–1.03, Pmarg = 0.88). By categories of adult height defined a priori, the estimated per-allele ORmeta of rs80018847-G was 1.03 (95% CI 0.94–1.13, Pmeta = 0.53) for women shorter than 158 cm, 1.13 (95% CI 1.02–1.25) for women 158–162 cm in height, and 1.01 (95% CI 0.93–1.09, Pmeta = 0.88) for women who were 168 cm or taller (Additional file 1: Table S4). The per-allele effect of the SNP therefore did not change linearly across categories of adult height. The interaction with height was also observed for ER+ breast cancer (ORint = 0.95, 95% CI 0.93–0.97, Pint = 5.62E−06) but not for ER-negative (ER-) breast cancer risk (ORint = 0.98, 95% CI 0.93–1.03, Pint = 0.77). The regional plot for overall breast cancer shows another SNP (rs1360506) at this locus in high linkage disequilibrium (LD) (r2 = 0.81) with rs80018847 (Additional file 1: Figure S3).

For risk of ER+ breast cancer, a statistically significant interaction was observed between SNP rs4770552 and age at menarche (ORint = 0.91, 95% CI 0.88–0.94, Pint = 4.62E−08, BFDP = 11%). There was weak evidence for a marginal association between the rs4770552-T allele and ER+ breast cancer (ORmarg = 1.02, 95% CI 1.00–1.05, Pmarg = 0.10). The per-allele ORmeta appeared to decrease with increasing age at menarche, from 1.07 (95% CI 1.00–1.15, Pmeta = 0.04) for age at menarche less than 13 years to 0.92 (95% CI 0.77–1.09, Pmeta = 0.33) for age at menarche greater than 15 years (Additional file 1: Table S4). There was weaker evidence of interaction between SNP rs4770552 and age at menarche for overall breast cancer risk (ORint = 0.93, 95% CI 0.90–0.96, Pint = 5.47E−06), but no interaction for ER- breast cancer risk (ORint = 0.98, 95% CI 0.89–1.08, Pint = 0.66). At this locus, we found suggestive evidence of interactions between 13 further SNPs and age at menarche for ER+ breast cancer risk. However, these 13 SNPs are in high LD (r2 = 0.8–1.0) with SNP rs4770552 (Additional file 1: Figure S4).

Discussion

This is the largest genome-wide gene–environment interaction study for breast cancer to date. We found evidence of one novel susceptibility locus interacting with adult height in relation to overall breast cancer risk, and one locus interacting with age at menarche in relation to risk of ER+ breast cancer. It is important to note, however, that while these associations reached conventional levels of genome-wide statistical significance, they may still represent chance associations. Based on the assumed prior distribution of effect sizes, the BFDP for both loci was 11%, which we considered noteworthy. Nevertheless, studies with an even larger sample size are required to confirm or refute these associations.

Many observational studies have shown an association between increasing adult height and increased breast cancer risk, in both premenopausal and postmenopausal women [7, 29, 30]. A meta-analysis estimated that each 10 cm increment in height was associated with a 17% increase in breast cancer risk [31]. The biological link between height and breast cancer is poorly understood, but some studies have suggested that increased height corresponds to more stem cells at risk of acquiring driver mutations [32]. Another hypothesis is that adult height could be a surrogate for nutritional intake, potentially implying a role for insulin-like growth factor 1 (IGF1) [33]. The functional basis of the potential interaction between adult height and the SNP rs80018847 is unclear. This SNP is in an intronic region of the leucine rich repeat and Ig domain containing 2 gene (LINGO2) on the short arm of chromosome 9 (9p13). This gene encodes a transmembrane protein belonging to the LINGO/LERN protein family [34]. Studies in mouse embryos have shown expression of LINGO2 specifically in the central nervous system [34], but it has not been implicated in breast cancer to date.

Early age at menarche is known to be associated with elevated risk of breast cancer, with an approximately 5% decrease in risk for each year of delay in the initiation of menstruation [35]. It has been postulated that younger age at menarche corresponds to longer cumulative hormonal exposure and therefore elevated levels of estradiol [3, 36]. SNP rs4770552 is an intronic variant within the spermatogenesis associated 13 gene (SPATA13) at 13q12. SPATA13 encodes a guanine nucleotide exchange factor (GEF) for RhoA, Rac1 and CDC42 GTPases [37, 38]. Although the role of this gene in breast cancer is still unclear, there could be an indirect link via the role of RhoA GTPases in breast tumorigenesis. Rho GTPase signaling is altered in human breast cancers, and dysregulation of Rho GTPase may have differential effects on the development of breast tumors depending on the stage and subtype [39]. Activation of RhoA results in release of megakaryoblastic leukemia 1 (MKL1), which in turn has been observed to alter the transcriptional activity of ERα, known to play a critical role in breast tumors [40]. Therefore, SNP rs4770552 may indirectly interact with the regulatory region of SPATA13 and affect breast tumorigenesis via activation of RhoA GTPases.

Given that the marginal effects of the common genetic variants are small and the associations of environmental risk factors with breast cancer are modest, interactions are also expected to be weak (Additional file 1: Figure S5). Although this is the largest breast cancer dataset available to date with more than 60,000 cases and 70,000 controls, the study is underpowered to detect weak interactions. Also, this study included only women of European ancestry and the findings may not be generalizable to women of other ancestries.

Using LD score regression, we estimated the overall heritability due to G×E for each of the risk factors. The estimated frailty scale heritability (≤ 0.015) can be compared with the corresponding heritability for the SNP main effects (about 0.47) or the overall heritability based on the familial risk (~1.4) [28, 41]. The implication is that G×E interactions make very little contribution to the heritability of breast cancer, at least for the known risk factors and common genetic variants that can be evaluated using genome-wide arrays, and hence do not make an important contribution to risk prediction at the population level. This is consistent with the fact that detection of G×E interactions is rare. It does not rule out the possibility that G×E interactions could be identified in additional large studies or that such interactions may provide important clues to mechanisms.

Conclusions

In conclusion, we identified two novel genome-wide gene–environment interactions for overall and ER+ breast cancer risk for women of European ancestry. These results contribute to our global body of knowledge on genetic susceptibility for breast cancer by generating plausible biological hypotheses, but they require replication and further functional studies.