Background

Genome-wide association studies have identified a substantial number of common genetic variants that are associated with the risk of each of several diseases. However, most such associations are weak and account for only a small fraction of familial disease aggregation [1]. In the case of breast cancer, seven reproducible genetic susceptibility alleles were estimated to explain about 5% of heritability [2]. Studies of low frequency genetic variants, gene-gene interactions, genotype by environment interactions, and shared environment have been suggested [1] as means to identify the 'missing heritability' for complex diseases, along with more thorough study of variants within genomic regions of interest.

Closely related to this is the role of genetic variants in model discrimination and disease risk prediction. A recent multiple-cohort analysis of ten common genetic variants that reliably associate with breast cancer concluded that 'the level of predicted breast cancer risk among most women changed little' when these SNPs were added to existing risk assessment models [3]. In response, an accompanying editorial [4] pointed out that cellular networks within which the SNPs operate may associate more strongly with risk than do tagging SNPs alone, that gene-gene and gene-environment interactions are 'likely to be profoundly important', and that associations with breast cancer subtypes may be more impressive.

A challenge to pursuing the gene-environment concept is the typical difficulty in assessing key environmental exposures. For example, given the well-established association between obesity and post-menopausal breast cancer risk, one might expect that total energy consumption and other dietary factors may influence breast cancer risk, possibly in a manner that depends on genetic factors that relate to hormone metabolism, growth factors, or inflammation. However, dietary data are attended by random and systematic assessment biases that may seriously attenuate and distort estimated associations [5].

Randomized controlled intervention trials can provide highly desirable settings for the incorporation of genotype by environment interactions into genetic association analyses. First, the intervention group assignment is known with precision; second, this assignment is statistically independent of underlying genotype by virtue of randomization. This latter feature also allows highly efficient case-only test statistics [6-8] to be used for genotype by intervention interaction testing.

The Women's Health Initiative (WHI) randomized controlled trial included four randomized and controlled comparisons among postmenopausal women in a partial factorial design [9, 10]. Specifically, it comprised a postmenopausal hormone therapy component that involved two non-overlapping trials: estrogen versus placebo (E-alone trial) among women who were post-hysterectomy, and estrogen plus progestin versus placebo (E+P trial) among women with a uterus; a low-fat dietary modification (DM) versus usual diet component, and a calcium and vitamin D (CaD) versus placebo supplementation component.

An elevation of breast cancer risk triggered the early stopping of the E+P trial in 2002 [11, 12]. In the E-alone trial, which was stopped early in 2004 primarily due to an elevation of stroke risk [13], there was a surprising suggestion of a reduction in breast cancer risk in the intervention group, as well as apparent interactions of the E-alone hazard ratio with several other breast cancer risk factors [14]. The DM trial continued to its planned termination in 2005. While overall it provided non-significant evidence of a breast cancer reduction over its 8.1-year average follow-up period, the breast cancer hazard ratio was significantly lower in the quartile of women who had a comparatively high fat content in their diet at baseline [15]. These women made a larger dietary change if assigned to the low-fat diet intervention. The CaD trial did not yield evidence of an effect on breast cancer risk [16].

We studied 4,988 SNPs in relation to breast cancer incidence and clinical trial intervention effects during the intervention phase of the WHI clinical trial. Nearly all of these SNPs were selected as the top-ranked SNPs according to significance level for association with breast cancer in the NCI Cancer Genetic Markers of Susceptibility (C-GEMS) genome-wide association study [17], while the remaining 244 were selected based on published data from the Breast Cancer Association Consortium genome-wide association study [18]. These SNPs were scattered throughout the genome. In fact, they arise from 3,224 distinct loci when a squared pairwise correlation (r²) between adjacent regions of less than 0.2 is used to define new loci. We ranked SNPs according to a null hypothesis test that combined evidence of overall breast cancer association with evidence of interaction with one or more of the randomized clinical trial intervention assignments.
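The locus-counting rule described above can be sketched as follows. This is only an illustration under stated assumptions: SNPs are taken to be ordered by genomic position within a single chromosome, r² is approximated by the squared Pearson correlation of minor-allele counts (rather than the allelic correlation), and the function names `pearson_r2` and `count_loci` are hypothetical and do not come from the study.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def count_loci(genotype_columns, threshold=0.2):
    """Count loci, starting a new locus whenever adjacent SNPs have r2 < threshold.

    genotype_columns -- list of per-SNP minor-allele-count vectors, ordered by
                        position on one chromosome (an assumption of this sketch)
    """
    if not genotype_columns:
        return 0
    loci = 1
    for previous, current in zip(genotype_columns, genotype_columns[1:]):
        if pearson_r2(previous, current) < threshold:
            loci += 1
    return loci
```

In practice each chromosome would be processed separately and the counts summed, since SNPs on different chromosomes always belong to different loci.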

Materials and methods

Study design and population

Enrollees in WHI trials were postmenopausal women aged 50 to 79 years who met component-specific eligibility criteria [19]. Women were randomized to a hormone therapy component, or a DM component, or both. At the one-year anniversary from enrollment, participating women could be further randomized into a CaD supplementation component. A total of 68,132 women were enrolled into the trials between 1993 and 1998, among which there were 10,739 in E-alone, 16,608 in E+P, 48,835 in DM, and 36,282 in CaD components. Details about distributions of demographic variables and breast cancer risk factors in the study cohort were published previously [19]. For the DM trial we chose to focus interaction testing on the subset of 12,208 women having baseline percentage of energy from fat in the upper quartile, and we denote the DM intervention in this sub-cohort by DMQ.

Case and control selection

All 2,242 invasive breast cancer cases that developed between randomization and the end of the trial intervention phase (31 March 2005) were considered for inclusion, among which a total of 2,166 (96.6%) cases had adequate quantity and quality of DNA. This leads to analyses based on 247 cases for E-alone, 471 cases for E+P, 428 cases for DMQ, 1,049 cases for CaD (cases arising after CaD randomization only), and corresponding controls that were one-to-one matched to cases on baseline age, self-reported ethnicity, participation in each trial component, years since randomization, and baseline hysterectomy status.

Laboratory methods

Genotyping and data cleaning methods at Perlegen Sciences (Mountain View, CA, USA) have been described [20]. The average call rate for these SNPs was 99.8%, and the average concordance rate for 157 blind duplicate samples was also 99.8%.

Principal component analysis was used to characterize population structure and to identify genotyping artifacts. The top 20 principal components did not associate with common sources of experimental variability (for example, date of sample processing or hybridization performance for either chip design). The first ten principal components were found to account for 86% of the total SNP genotype variation, while the first four principal components provided good separation among the major self-reported 'ethnicities' (white, black, Hispanic, Asian/Pacific Islander, northern versus southern European ancestry).

Statistical methods

A five-component test statistic was used for each SNP to test association with breast cancer. The first 'main effect' component arose as a score test from a standard logistic regression of case (1) versus control (0) status on the number of minor SNP alleles and potential confounding factors. The logistic regression model included the (log transformed) Gail 5-year breast cancer risk score [21], previous hormone use (indicators for < 5, 5 to 10, and ≥10 years for each of estrogen and estrogen plus progestin), and (log transformed) body mass index. Also included were the variables used for matching controls to cases in control selection. In addition, eigenvectors from the first ten principal components from correlation analysis of the genotype data were included to adjust for population stratification [22]. The other four test statistic components were case-only tests for dependence of intervention odds ratios on SNP genotype for each of E-alone, E+P, DMQ, and CaD. These statistics arise as score tests in logistic regression of active (1) versus placebo or usual diet (0) randomization assignment on the number of minor SNP alleles, with the logistic regression location parameter offset by log[q/(1 - q)], where q is the fraction of women assigned to active intervention for the pertinent clinical trial component. The main effect test statistic is asymptotically independent of each of the case-only test statistics [23], and the interaction tests for E-alone and E+P are independent since they are based on non-overlapping sets of women. A 'sandwich' variance estimator was used to allow for possible correlations among the other pairs of case-only test statistics. A chi-square test with five degrees of freedom was then used to test SNP association with breast cancer, for each of the SNPs. Further details about this joint test procedure are included here as Additional file 1.
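A single case-only interaction component of the kind described above can be sketched as a one-degree-of-freedom score test. The sketch is illustrative rather than a reproduction of the procedure in Additional file 1: it assumes an additive (allele count) coding and a free intercept, omits covariates and the sandwich variance, and the function name `case_only_score_test` is hypothetical. With a free intercept, the constant offset log[q/(1 - q)] is absorbed into the intercept and so does not change the score statistic itself.

```python
import math


def case_only_score_test(genotypes, assignments):
    """Score test of no SNP-by-intervention interaction, using cases only.

    genotypes   -- minor-allele counts (0, 1, or 2), one per case
    assignments -- 1 for active intervention, 0 for placebo or usual diet
    Returns (z_statistic, two_sided_p_value).
    """
    n = len(genotypes)
    if n == 0 or n != len(assignments):
        raise ValueError("inputs must be non-empty and of equal length")
    p_hat = sum(assignments) / n   # null estimate of P(active | case)
    g_bar = sum(genotypes) / n
    # Score for the genotype slope, evaluated at slope = 0.
    score = sum((g - g_bar) * (a - p_hat)
                for g, a in zip(genotypes, assignments))
    # Null variance of the score, adjusted for the estimated intercept.
    variance = p_hat * (1 - p_hat) * sum((g - g_bar) ** 2 for g in genotypes)
    if variance == 0:
        raise ValueError("no variation in genotype or in assignment")
    z_statistic = score / math.sqrt(variance)
    p_value = math.erfc(abs(z_statistic) / math.sqrt(2))  # two-sided normal
    return z_statistic, p_value
```

The squared statistic is referred to a chi-square distribution with one degree of freedom. Combining the five components into the joint test additionally requires their covariance matrix, which this sketch does not attempt.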

SNPs of interest in these association tests were subsequently examined for evidence of main effect and interaction effects separately. The latter once again employed case-only analyses, and for descriptive purposes, intervention odds ratios were estimated separately at zero, one, and two minor SNP alleles. A likelihood ratio test with two degrees of freedom assessed SNP by intervention interaction in these analyses.
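The genotype-specific intervention odds ratios can be illustrated with a simple case-only calculation. This is a sketch under stated assumptions, not the fitted model used in the study: it assumes that randomization makes assignment independent of genotype, that the disease is rare enough for the ratio of assignment odds among cases to q/(1 - q) to approximate the intervention odds ratio, and that a Wald interval on the log scale is adequate. The function name `case_only_odds_ratio` is hypothetical.

```python
import math


def case_only_odds_ratio(n_active, n_control, q, z=1.96):
    """Estimate an intervention odds ratio from cases with one genotype.

    n_active, n_control -- numbers of cases randomized to active and to
                           placebo or usual diet, within the genotype group
    q                   -- fraction of trial participants randomized to active
    Returns (odds_ratio, lower_limit, upper_limit).
    """
    if n_active <= 0 or n_control <= 0:
        raise ValueError("both case counts must be positive")
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    # Odds of active assignment among cases, relative to the design odds.
    log_or = math.log(n_active / n_control) - math.log(q / (1 - q))
    standard_error = math.sqrt(1 / n_active + 1 / n_control)
    return (math.exp(log_or),
            math.exp(log_or - z * standard_error),
            math.exp(log_or + z * standard_error))
```

Applying the function separately to cases with zero, one, and two minor alleles gives three descriptive odds ratios; differences among them correspond to the interaction assessed by the two-degree-of-freedom test.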

The potential of SNP by clinical trial interactions to contribute to the ability to discriminate between breast cancer cases and controls was evaluated by estimating areas under the receiver operating characteristic curves (AUC), and associated confidence intervals.
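As an illustration of the quantity being estimated, the AUC equals the probability that a randomly chosen case has a higher predicted risk than a randomly chosen control, with ties counted as one half. The sketch below computes this directly from two lists of predicted risks. It assumes the fitted risk scores are already available, does not produce confidence intervals, and uses a hypothetical function name, `auc`.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise case-control comparisons."""
    if not case_scores or not control_scores:
        raise ValueError("both score lists must be non-empty")
    concordant = 0.0
    for case_score in case_scores:
        for control_score in control_scores:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5   # ties contribute one half
    return concordant / (len(case_scores) * len(control_scores))
```

The pairwise loop is quadratic in sample size; a rank-based formulation would be preferable for large data sets but gives the same value.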

Some further analyses were carried out with breast cancers classified according to either the estrogen receptor status or the progesterone receptor status of the breast tumor. All significance levels (P-values) are two-sided.

Ethics approval

This research conforms to the Helsinki Declaration and pertinent legislation, and has been approved by the Institutional Review Board of the Fred Hutchinson Cancer Research Center. All women included in this report provided informed consent that permitted their biospecimens and data to be used in the present research project.

Results

Simultaneous tests of main effect and interaction with clinical trial interventions

Table 1 presents the top 20 SNPs ranked by P-value of the combined test of main effect and interaction. Among the 4,988 SNPs evaluated, six SNPs have a joint test P-value less than 10⁻⁶ and a false discovery rate (FDR) less than 0.0005, all in the FGFR2 (fibroblast growth factor receptor 2) region in chromosome region 10q26. Immediately following are several SNPs from the MRPS30 (mitochondrial ribosomal protein S30) region in chromosome region 5p12. Of these SNPs, rs7705343 is included in the set of SNPs having FDR < 0.05, while close-by SNP rs13159598 is also among SNPs having FDR < 0.10.
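The text does not state which FDR procedure was applied. Purely for illustration, the sketch below assumes the Benjamini-Hochberg step-up adjustment, which converts a list of P-values into adjusted values that can be compared with thresholds such as 0.05 or 0.10. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest P-value down so that adjusted values are monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

A SNP would then be counted as having FDR below a threshold when its adjusted value falls below that threshold.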

Table 1 Top 20 SNPs identified by combined test for main effect and interaction with clinical trial interventions

Table 1 also shows P-values and rankings for these SNPs under the main effect association test alone. While P-values for FGFR2 SNPs tend to be somewhat diluted by the inclusion of the interaction information in the test statistic, the ordering of these SNPs is rather different under the two testing procedures. For example, SNP rs3750817, which is in a somewhat separate linkage disequilibrium bin from tagging SNP rs2981582 [18], has a comparatively higher ranking with the combined test. We have previously reported suggestive evidence of interaction of rs3750817 with E-alone and E+P [24], and DMQ [25].

SNPs in the MRPS30 region of chromosome 5p12 have a higher ranking overall with the combined versus the main effect test. Moreover, the ordering of SNPs within this region is considerably altered by the inclusion of the interaction information. These analyses point to the genomic region in proximity of rs7705343 as relevant to breast cancer risk. Figure 1 shows squared pairwise correlations (r²) among SNPs in the MRPS30 region of chromosome 5p12. The combined test rankings tend to decrease as one moves from rs7705343 to the tagging SNP rs4415084 at the opposite end of this genomic region of approximately 230 kb.

Figure 1 Pairwise r² for SNPs within the MRPS30 region in chromosome 5p12, where r is the allelic correlation between SNPs.

Table 2 shows P-values individually for the five components of the combined test, for the eight SNPs in the MRPS30 region. Most of the association information derives from the main effect test, but the intervention interaction tests have rather different P-values across these SNPs, with rs7705343 having nominally significant (P < 0.05) interactions with each of E-alone, DMQ, and CaD, while interactions in relation to rs4415084 are not significant for any of the interventions.

Table 2 Significance levels (P-values) for testing interaction with WHI trial interventions for SNPs in the MRPS30 region

Table 3 shows estimated intervention odds ratios (ORs) and 95% confidence intervals as a function of the number of minor alleles of rs7705343 for each of the four interventions. The GG genotype is associated with lower intervention ORs for each of E-alone, DMQ, and CaD. Additional file 2 provides corresponding information with breast cancers classified according to estrogen receptor or progesterone receptor positivity. No clear variations by tumor receptor status were suggested, though statistical power for detecting moderate variations with tumor type is limited.

Table 3 Breast cancer odds ratios for WHI trial interventions by genotype of MRPS30 SNP rs7705343

The majority (86%) of the case-control samples are from European-ancestry populations. In Additional files 3 and 4 we provide P-values for interaction between trial components and SNPs in the MRPS30 region, and the estimated intervention odds ratios and 95% confidence intervals as a function of the number of minor alleles of rs7705343 among women of European ancestry specifically. The patterns that we observe are quite similar to the overall patterns.

We also examined the joint associations of these FGFR2 and MRPS30 SNPs with hormonal and dietary intervention effects, using case-only analysis. In a logistic regression applied to cases in DMQ, with the indicator for active treatment regressed on the genotypes of rs3750817 and rs7705343 together, both SNPs showed nominally significant interactions: the P-values for rs3750817 and rs7705343 were 0.0059 and 0.037, respectively. When E-alone was similarly considered, rs3750817 and rs7705343 had P-values of 0.053 and 0.043, respectively, in the joint interaction model.

The AUC was calculated from logistic regression analyses that included clinical trial randomization assignments for each of the four interventions and potential confounding factors. This gave an AUC (95% confidence interval) of 0.594 (0.578, 0.611). When main effect indicator variables were added for one and two minor alleles of rs3750817 and rs7705343, the AUC increased to 0.610 (0.594, 0.627). When SNP by intervention interaction indicator variables were also included, the AUC increased further to 0.621 (0.604, 0.637). A bootstrap test of significance for the genotype by intervention terms gave a nominal P-value of 0.007.

Discussion

We evaluated the association between 4,988 SNPs and invasive breast cancer incidence in the WHI clinical trial through the use of a statistic that combines SNP main effect information with SNP by intervention interaction information for each of four randomized interventions. This view of the data provided a clear focus on two genomic regions: the FGFR2 region of chromosome 10q, which has a very strong main effect along with suggestive evidence for interaction, and the MRPS30 region of chromosome 5p, which shows evidence of a comparatively smaller main effect and suggestive evidence for interaction. The inclusion of the clinical trial interventions in this testing procedure leads to interest in subregions containing FGFR2 SNP rs3750817 and MRPS30 SNP rs7705343 that are some distance from their associated tagging SNPs, possibly suggesting more than one regulatory element in these non-coding genomic regions.

We have previously [24, 25] discussed these data in relation to FGFR2. The eight MRPS30 SNPs considered here fall in a linkage disequilibrium region of approximately 230 kb from downstream of fibroblast growth factor 10 (FGF10) to downstream of MRPS30, with a minimum squared correlation among SNPs of 0.80 (Figure 1). FGF10/FGFR2 signaling [26-29] could be relevant to these associations, though there is a recombination hotspot between the FGF10 gene and the 5p12 SNPs studied here.

Our analyses suggest that interactions of these two SNPs with WHI clinical trial interventions lead to a detectable increase in the ability to distinguish breast cancer cases from controls. Note, however, that AUC values in this context may be optimistic in view of our procedure for identifying SNPs of interest. Moreover, since the interactions identified in the study have yet to be confirmed by replication studies, the increase in AUC detected here is of exploratory nature as well. Also note that AUCs estimated here tend to be somewhat low due to age matching in the case-control sample.

When our combined test is separated into its constituents, one observes nominally significant evidence of interaction of MRPS30 SNP rs7705343 with three of the four WHI interventions. Given the manner in which we ranked SNPs, these analyses (Tables 2 and 3) should be regarded as exploratory and such interactions will need to be confirmed separately. Unfortunately, other clinical trial data are not available for this purpose, and confirmation in observational study settings will involve the challenge of reliable ascertainment of the relevant hormonal or dietary exposures, and will need to be carried out in a case-control rather than case-only model. Hence, quite large numbers of cases and controls will be needed, as may be accessible through cohort consortia.

It is interesting to see a significant interaction of rs7705343 with E-alone, with the estimated intervention OR below 1.0 for the GG genotype, and a non-significant interaction of rs7705343 with E+P, with the estimated intervention OR above 1.0 for the GG genotype. Few interactions with study subject characteristics have been suggested for E+P [12], with FGFR2 SNP rs3750817 as a possible exception [24]. In contrast, interactions with several subject characteristics have been identified for E-alone, including family history of breast cancer, benign breast disease [14], and again FGFR2 SNP rs3750817 [24]. A possible explanation is that the progestin in E+P tends to overwhelm the minor variations in hormone therapy hazard ratios that would otherwise occur, giving rise to a strong and fairly uniform risk elevation.

Study strengths include its nesting within the randomized controlled WHI clinical trial, which implies randomization assignments that are known and statistically independent of genotype, and the related ability to use case-only analyses for intervention interaction testing. Other strengths of the study include the use of pre-diagnostic blood specimens, collected and stored according to a standardized protocol, and quality-controlled SNP genotyping.

A limitation of the study is that the average age at enrollment was 63 years in the WHI controlled trials, with many women well past menopause at enrollment. We have reported, in combined clinical trials and observational studies analyses, higher breast cancer hazard ratios for E+P and E-alone among women who first use these preparations soon after the menopause, compared to those using them later [30, 31]. Hence, the magnitude of the odds ratios shown here may be lower than would apply to typical hormone therapy users.

Conclusions

Simultaneous consideration of overall association and intervention interaction points to genomic regions in the vicinity of the FGFR2 and MRPS30 genes as relevant to breast cancer risk among postmenopausal women. Moreover, subregions that were not otherwise the focus of interest, in the vicinity of SNPs rs3750817 and rs7705343, were identified as worthy of further study by virtue of suggestive interactions with hormonal and dietary interventions. These analyses represent an early step in assessing the role of genotype by 'environment' interactions to help explain familial breast cancer patterns, or as a contributor to risk discrimination.