Introduction

Breast cancer is the most common cancer in women worldwide. Oestrogen exposure is a central factor in the development and progression of this cancer [1–3], and its effects on the breast epithelium are primarily mediated by oestrogen receptor alpha (ESR1) [4]. In addition to being activated by oestrogen, the ESR1 protein can be activated by growth factors such as epidermal growth factor (EGF) [3], which acts as a potent mitogen for epithelial cells, including mammary epithelia [5]. Variation in the ESR1 (MIM 133430) and EGF (MIM 131530) genes affecting the function or expression of their respective proteins could thus potentially affect the risk of developing breast cancer, characteristics of the tumour or the risk of dying from the disease.

With regard to breast cancer risk or survival, a number of single nucleotide polymorphisms (SNPs) have been studied in the ESR1 gene, yet none have previously been investigated in the EGF gene. As far as we are aware, no attempt to capture the common genetic variation in the ESR1 gene in its entirety has yet been published. One group, who genotyped 17 SNPs in the ESR1 gene, found a decreased risk of breast cancer for carriers of three common haplotypes in the gene and an increased risk for carriers of one common haplotype [6]. We genotyped 157 SNPs in ESR1 and 54 SNPs in EGF using a population-based case-control study, which included 1,590 breast cancer cases and 1,518 controls. We selected haplotype-tagging SNPs (tagSNPs) spanning the ESR1 and EGF genomic regions and assessed their association with breast cancer risk, the Nottingham Prognostic Index (NPI) and breast cancer survival.

Patients and methods

Parent breast cancer study

The study base included all Swedish-born women between 50 and 74 years of age and resident in Sweden between October 1993 and March 1995. During this period, all breast cancer cases were identified at diagnosis through the six regional cancer registries in Sweden. Controls were randomly selected from the Swedish Registry of Total Population to match the cases in 5-year age strata. Of the eligible cases and controls, 3,345 (84%) breast cancer cases and 3,454 (82%) controls participated in this initial questionnaire-based study.

Present breast cancer study

From the parent study, we randomly selected 1,500 breast cancer cases and 1,500 age- and frequency-matched controls among the postmenopausal participants without any previous malignancy (except carcinoma in situ of the cervix or non-melanoma skin cancer). With the intention of increasing statistical power in subgroup analyses, we further selected all remaining breast cancer cases and controls that had used menopausal hormones (oestrogen alone or any combination of oestrogen and progestin) for at least 4 years. We also included all remaining participants with self-reported diabetes mellitus. In total, we selected 1,801 breast cancer cases and 2,057 controls.

Following informed consent, participants donated whole blood. For deceased cases and those cases that declined to donate blood but consented to our use of tissue, we collected archived paraffin-embedded, non-cancerous tissue samples. We acquired 70% of the requested tissue samples; the main reason for non-participation was unwillingness or lack of time at the respective pathology department to provide the tissue blocks. In total, we obtained blood samples and archived tissue samples for 1,321 and 275 breast cancer patients, respectively, and blood samples for 1,524 controls. Population-based participation rates (taking into account the proportion that did not participate in the parent questionnaire study) were 75% and 61% for the cases and controls, respectively.

We extracted DNA from 4 ml of whole blood using the QIAamp DNA Blood Maxi Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. From non-malignant paraffin-embedded tissues, DNA was extracted using a standard phenol/chloroform/isoamyl alcohol protocol [7]. We successfully isolated DNA from 1,318 (blood) and 272 (tissue) breast cancer patients and 1,518 controls. We randomly selected 92 out of the 1,518 controls to be used for linkage disequilibrium characterisation and haplotype reconstruction of the ESR1 and EGF genes.

This study was approved by the Institutional Review Boards in Sweden and at the National University of Singapore.

SNP markers and genotyping

We selected SNPs in the ESR1 and EGF genes and their 20 kb flanking sequences from dbSNP (build 124, [8]) and Celera databases, aiming for an initial marker density of at least one SNP per 5 kb. The PvuII (rs2234693), XbaI (rs9340799), codon 243 (rs4986934) and codon 325 (rs1801132) variants were selected from the literature and added to our SNP selection. SNPs were genotyped using the Sequenom primer extension-based assay (San Diego, CA, USA) and the BeadArray system from Illumina (San Diego, CA, USA) following the manufacturers' instructions. All genotyping plates included positive and negative controls, DNA samples were randomly assigned to the plates, and all genotyping results were generated and checked by laboratory staff unaware of case-control status. Only SNPs where more than 85% of the samples gave a genotype call were analysed further. As quality control, we genotyped 200 randomly selected SNPs in the 92 control samples using both the Sequenom system and the BeadArray system. The genotype concordance was > 99.5%, suggesting high genotyping accuracy.
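The two quality-control rules described above (a call rate above 85% per SNP, and cross-platform genotype concordance) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names and the convention of coding a failed call as `None` are assumptions made for the example.

```python
from typing import Optional, Sequence


def call_rate(genotypes: Sequence[Optional[str]]) -> float:
    """Fraction of samples with a non-missing genotype call (None = no call)."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def passes_call_rate(genotypes: Sequence[Optional[str]],
                     threshold: float = 0.85) -> bool:
    """Keep a SNP only if more than `threshold` of the samples gave a call."""
    return call_rate(genotypes) > threshold


def concordance(calls_a: Sequence[Optional[str]],
                calls_b: Sequence[Optional[str]]) -> float:
    """Proportion of identical calls among samples called on both platforms."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no sample was called on both platforms")
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```

Samples missing on either platform are excluded from the concordance denominator here; whether the study handled missing calls this way is not stated in the text.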

Linkage disequilibrium characterisation and tagSNP selection

We genotyped a dense set of SNPs in the ESR1 and EGF genes in the 92 controls (Supplementary Tables 1 and 2 in Additional File 1, respectively). We identified regions of linkage disequilibrium (LD) and selected tagSNPs. We produced LD plots of the D' and R2 values for ESR1 and EGF (Supplementary Figures 1 and 2 in Additional File 1, respectively) using the LDheatmap function in the statistical software R [9]. We reconstructed haplotypes using the partition ligation expectation maximisation (PLEM) algorithm [10] implemented in the tagSNPs program [11] and selected tagSNPs based on the R2 coefficient described previously (equation (1) in [12]). In our case this is the squared correlation between the true number of haplotypes (defined across all SNPs typed in the 92 controls) and the number of copies of haplotypes predicted as being carried, based on the tagSNPs. The R2 coefficient in [12] can also be used for measuring association between the genotypes of all SNPs typed in the 92 controls and the genotypes predicted on the basis of knowing the tagSNPs only. We chose tagSNPs so that common SNP genotypes (minor allele frequency ≥ 0.03) and common haplotypes (frequency ≥ 0.03) were predicted with R2 ≥ 0.8 [13]. The well-studied PvuII (rs2234693), XbaI (rs9340799), codon 243 (rs4986934) and codon 325 (rs1801132) variants were included as tagSNPs. To evaluate how well our tagSNPs captured unobserved SNPs within the genes, and to assess whether we needed a denser set of markers, we performed a SNP-dropping analysis [12, 14]. In brief, each of the genotyped SNPs was dropped in turn and tagSNPs were selected from the remaining SNPs so that their haplotypes predicted the remaining SNPs with an R2 value of 0.85. We then estimated how well the tagSNP haplotypes of the remaining SNPs predicted the dropped SNP, an evaluation that can provide an unbiased and accurate estimate of tagSNP performance [12, 14].
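To make the idea of an R2 threshold concrete, the sketch below computes the squared Pearson correlation between the minor-allele counts (0, 1 or 2) of a single tagSNP and a single target SNP. This is a deliberately simplified pairwise stand-in: the coefficient used in the study is defined on haplotypes reconstructed from several tagSNPs [12], which this example does not attempt to reproduce. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def genotype_r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two vectors of allele counts."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long vectors with at least 2 samples")
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # A monomorphic SNP carries no information about any other SNP.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def is_tagged(tag: Sequence[float], target: Sequence[float],
              threshold: float = 0.8) -> bool:
    """True if the tag predicts the target with R2 at or above the threshold."""
    return genotype_r2(tag, target) >= threshold
```

Because the statistic is squared, a tag whose alleles are coded in the opposite direction from the target still counts as a perfect predictor.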

There were 19 SNPs upstream of the first tagSNP (TAG1) in ESR1 (Supplementary Table 1 in Additional File 1). Of the 19, 12 were either not polymorphic or had a minor allele frequency (MAF) of less than 3%. The remaining seven SNPs in this area were included in our LD identification and tagSNP selection analysis. Hence, all polymorphic SNPs with a MAF ≥ 3% far 5' upstream of ESR1 were captured by our tagSNPs.

Breast tumour characteristics and follow-up

We retrieved information on date and cause of death until 31 December 2003 from the Swedish Causes of Death Registry and on date of emigration from the Swedish National Population Registry. Follow-up time began at date of diagnosis and ended on 31 December 2003, or at date of death or emigration, whichever came first.
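The follow-up rule above (start at diagnosis; stop at death, emigration or 31 December 2003, whichever comes first) can be written as a small helper. This is an illustrative sketch; the function name and the choice to report follow-up in days are assumptions, not details taken from the study.

```python
from datetime import date
from typing import Optional, Tuple

# Administrative end of follow-up stated in the text.
END_OF_FOLLOW_UP = date(2003, 12, 31)


def follow_up(diagnosis: date,
              death: Optional[date] = None,
              emigration: Optional[date] = None,
              end: date = END_OF_FOLLOW_UP) -> Tuple[date, int]:
    """Return (exit date, follow-up time in days) for one case."""
    # Exit at the earliest of death, emigration and the administrative end.
    candidates = [end] + [d for d in (death, emigration) if d is not None]
    exit_date = min(candidates)
    if exit_date < diagnosis:
        raise ValueError("exit date precedes date of diagnosis")
    return exit_date, (exit_date - diagnosis).days
```

For example, `follow_up(date(1994, 1, 1), death=date(1994, 1, 31))` exits at the date of death, while a case with no recorded death or emigration is censored at the administrative end date.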

We collected information on tumour size, lymph node involvement, and grade (tumour differentiation) from medical records and calculated the Nottingham Prognostic Index (NPI) using the following formula:

NPI = 0.2 × size [in cm] + 1 × nodal stage [1, 2, or 3] + 1 × grade [1, 2, or 3] [15]

Nodal stage was defined as 1 if there were no lymph node metastases, 2 for a total of 1–3 metastatic nodes, and 3 for more than 3 metastatic nodes. A highly differentiated tumour was assigned grade 1, a tumour of intermediate differentiation grade 2, and a poorly differentiated tumour grade 3. We categorised the NPI into two groups: ≤ 4 or > 4. Four is the mean NPI value in the present study, and breast cancer survival has been shown to decrease rapidly for NPI values above 4 [15].
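As a worked illustration of the NPI formula and the nodal stage and grade definitions above, the following sketch computes the index and its two-group categorisation. The function names are hypothetical and the input checks are additions for the example.

```python
def nodal_stage(n_metastatic_nodes: int) -> int:
    """Map the number of metastatic lymph nodes to nodal stage 1, 2 or 3."""
    if n_metastatic_nodes < 0:
        raise ValueError("node count cannot be negative")
    if n_metastatic_nodes == 0:
        return 1
    if n_metastatic_nodes <= 3:
        return 2
    return 3


def npi(size_cm: float, n_metastatic_nodes: int, grade: int) -> float:
    """NPI = 0.2 x size [cm] + nodal stage + grade."""
    if grade not in (1, 2, 3):
        raise ValueError("grade must be 1, 2 or 3")
    return 0.2 * size_cm + nodal_stage(n_metastatic_nodes) + grade


def npi_group(value: float) -> str:
    """Two-group categorisation used in the analysis: <= 4 versus > 4."""
    return "<=4" if value <= 4 else ">4"
```

A 2 cm, node-negative, grade 2 tumour gives 0.2 × 2 + 1 + 2 = 3.4 and falls in the lower group; a 3 cm, grade 3 tumour with four metastatic nodes gives 0.6 + 3 + 3 = 6.6 and falls in the upper group.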

Statistical analyses

We applied unconditional logistic regression models for assessing the association between ESR1 and EGF tagSNPs and risk of breast cancer (case-control analysis) or the NPI (case-only analysis). Adjusting for age (in 5-year age groups) did not affect our results. We estimated the hazard ratio of death due to breast cancer in relation to the genes' tagSNPs using Cox proportional hazards models. The tagSNPs were included as covariates in the models either one at a time or in groups of five (codominant main effects only). The latter method was used for detection of association with haplotypes. Although it does not require resolution of gametic phase, tests based on such models can be powerful within regions of strong LD [16]. Likelihood ratio tests were used to generate p values for comparing models with or without covariates. We adjusted our test results for multiplicity, separately for each outcome (risk, NPI, and survival). We used a permutation-based approach that controls the family-wise error rate (the probability of rejecting one or more true null hypotheses of no association). This is based on the permutation step-down procedure of Westfall and Young [17] and takes into account the dependence structure of the polymorphisms/hypotheses. We also assessed association between groups of haplotypes and breast cancer risk using three approaches (each of which resolves gametic phase). We used the logistic regression expected haplotype dosage approach of [11], combining rare haplotypes. Since there is no biological reason to cluster haplotypes on the basis of their frequency, we also employed a Bayesian association mapping approach [18] that clusters haplotypes according to their allelic similarity, and a sliding-window approach described by Li et al. [19].
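The principle behind permutation-based control of the family-wise error rate can be illustrated with a single-step max-statistic adjustment. Note the simplifications: this is not the step-down procedure of Westfall and Young [17] used in the study, and the test statistic here (absolute difference in mean allele dosage between cases and controls) is a stand-in for the likelihood ratio tests described above. All names are hypothetical.

```python
import random
from typing import List, Sequence


def mean_diff(dosages: Sequence[float], labels: Sequence[int]) -> float:
    """Absolute case-control difference in mean allele dosage (1 = case)."""
    cases = [d for d, y in zip(dosages, labels) if y == 1]
    controls = [d for d, y in zip(dosages, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def max_stat_adjusted_p(snps: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        n_perm: int = 1000,
                        seed: int = 0) -> List[float]:
    """Single-step adjusted p values, one per SNP.

    Case-control labels are permuted jointly for all SNPs, so the
    correlation (LD) between SNPs is preserved in every permutation.
    """
    rng = random.Random(seed)
    observed = [mean_diff(s, labels) for s in snps]
    exceed = [0] * len(snps)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        # Largest statistic across all SNPs under this permutation.
        max_stat = max(mean_diff(s, perm) for s in snps)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction keeps the estimated p values strictly positive.
    return [(c + 1) / (n_perm + 1) for c in exceed]
```

Comparing each observed statistic with the permutation distribution of the maximum over all SNPs is what makes the adjustment account for the number of tests; the step-down refinement is less conservative for all but the strongest signal.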

To estimate power in the risk component of the study, we used a method described by Chapman et al. [20], which assumes co-dominant effects at an unobserved locus. To calculate power for log-additive effects in the survival component of the study, we used the Quanto program [21] in a manner similar to that of Manolio et al. [22]. Analyses were performed using the statistical software R or the SAS system (v. 9.1, SAS Institute Inc., Cary, NC, USA). Because lifestyle and reproductive breast cancer risk factors are unlikely to cause genetic variation in the genes, we did not adjust for them in the analyses.

Results

Characteristics of participants

Table 1 shows selected characteristics of the cases and controls included in the parent questionnaire-based study and the current genetic study. Long-term users of menopausal hormone therapy and women with self-reported diabetes mellitus were oversampled in the current study. Most other characteristics were statistically significantly different between cases and controls and reflected established associations.

Table 1 Selected characteristics of the cases and controls participating in the present and parent breast cancer study

More case-related information has been provided in our previous work [23]. The breast cancer cases that participated in our study via tissue sample donation were on average 1.5 years older (p = 0.0003) than the cases that donated blood. The former group was also more likely to have been diagnosed with TNM (tumour, nodes, metastasis) stage 2 or more advanced cancers (p < 0.0001). Since no significant differences in genotype frequencies within TNM stage 1, TNM stage 2 and TNM stages 3 and 4 were evident between the two groups of cases, this difference is unlikely to be a cause for concern.

Genotyping, LD pattern and coverage

The genotyping results and SNP coverage in the ESR1 and EGF genes are summarised in Table 2. A dense set of SNPs in the ESR1 and EGF genes was genotyped in 92 randomly selected controls (Supplementary Table 1 (ESR1) and Supplementary Table 2 (EGF) in Additional File 1), and only the SNPs that were in Hardy-Weinberg equilibrium (p > 0.01) and had a minor allele frequency of at least 3% among the 92 controls were included in LD analysis and tagSNP selection (Table 2). LD plots created from the SNPs included in our study are shown in Supplementary Figures 1 (ESR1) and 2 (EGF) in Additional File 1. Using the SNP-dropping method [14], we found that the tagSNPs selected from the included SNPs could efficiently capture non-genotyped SNPs in the genes (Table 2).
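The two inclusion criteria above (Hardy-Weinberg equilibrium p > 0.01 and minor allele frequency of at least 3%) can be sketched from genotype counts as follows. The text does not say which Hardy-Weinberg test was used; the 1-degree-of-freedom Pearson chi-square test below is an assumption chosen for simplicity, and the function names are hypothetical.

```python
import math


def maf(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chi2_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P value of the 1-df Pearson chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        # Monomorphic SNP: the test is undefined, so report no deviation.
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def include_snp(n_aa: int, n_ab: int, n_bb: int,
                maf_min: float = 0.03, hwe_alpha: float = 0.01) -> bool:
    """Apply both criteria: MAF >= maf_min and HWE p > hwe_alpha."""
    return maf(n_aa, n_ab, n_bb) >= maf_min and hwe_chi2_p(n_aa, n_ab, n_bb) > hwe_alpha
```

With only 92 samples and rare alleles, expected counts of the rare homozygote are small, so an exact test would usually be preferred over the chi-square approximation in practice.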

Table 2 Summary statistics on genotyping results and SNP coverage in ESR1 and EGF for 92 Swedish controls

Association analyses

We selected 52 tagSNPs in ESR1 and 15 tagSNPs in EGF that could predict the included SNPs and their haplotypes with an R2 of at least 0.8. The tagSNPs were genotyped in all cases and controls (Supplementary Table 3 in Additional File 1), but seven tagSNPs in ESR1 and one tagSNP in EGF could not be genotyped in the cases that participated via tissue sample donation.

ESR1

For each outcome (breast cancer risk, NPI and breast cancer survival), we first tested the association of each tagSNP and then performed a haplotype analysis using a logistic regression sliding-window approach where five adjacent tagSNPs were analysed together (without resolution of gametic phase). The results are summarised in Figure 1 and Supplementary Table 4 in Additional File 1. Analysis of the 52 tagSNPs in ESR1 (including the PvuII, XbaI, codon 243 and codon 325 SNPs) did not reveal any association with breast cancer risk, NPI or breast cancer survival whose statistical significance withstood multiple testing correction. The strongest signal of association with breast cancer risk was obtained by the window analysis including TAGs 26–30 (p = 0.001 and p = 0.086, before and after correction for multiple testing). Within the region, there were seven common haplotypes that accounted for 92% of the chromosomes. Including the expected dosages of common haplotypes and the rare haplotypes (combined into a single variable) as covariates in a logistic regression model with the most common haplotype as reference, gave a global p (likelihood ratio test) of 0.0493 in relation to breast cancer risk (Table 3).

Figure 1

Association of 52 tagging single nucleotide polymorphisms (tagSNPs) in ESR1 with breast cancer risk, Nottingham Prognostic Index (NPI) and breast cancer survival. Left column: breast cancer risk. Middle column: NPI (case-only analysis). Right column: breast cancer survival. Squares and horizontal lines represent odds and hazard (survival analysis) ratios (change in risk with each addition of the rare allele) and their confidence intervals. Sizes of the squares reflect the minor allele frequencies. NPI was categorised into ≤ 4 or > 4.

Table 3 Association between haplotypes reconstructed from ESR1 TAGs 26–30 and breast cancer risk

We also explored a sliding-window analysis of haplotypes using a variable window size. Three haplotypes within the region from tagSNP 18 to tagSNP 27 were implicated, showing frequency differences between cases and controls (Table 4). The significance of the frequency differences was, however, not clear, given the large number of haplotypes being searched in both fixed- and variable-sized sliding-window analyses. We also used a Bayesian haplotype clustering method [18], with a fixed window size of six tagSNPs. Interestingly, the posterior distribution for the position of a possible disease mutation coincided with the region suggested by both fixed- and variable-sized haplotype analyses (Supplementary Figure 3 in Additional File 1).

Table 4 Association of three haplotypes (TAGs 18–27) in ESR1 with breast cancer risk, as implicated by the variable-sized sliding-window analysis

Furthermore, analyses within groups defined by diabetes mellitus, menopausal hormone use or family history did not reveal significant evidence that any tagSNP was associated with breast cancer risk (data not shown).

EGF

None of the tagSNPs in EGF showed association with breast cancer risk, NPI, or breast cancer survival that withstood multiple testing correction (Supplementary Figure 4 in Additional File 1). This lack of association was supported by the haplotype analysis.

Discussion

We had comprehensive SNP coverage of the entire ESR1 and EGF genes and were thus able to study whether any common variants in the genes showed an association with breast cancer risk, NPI or breast cancer survival. No association was found between common variants in ESR1 and NPI or breast cancer survival by single tagSNP analysis. A region between TAG26 (rs3003925) and TAG30 (rs2144025) in the ESR1 gene showed a signal for association with breast cancer risk in the multi-locus analysis of five adjacent tagSNPs, but the result did not withstand multiple testing correction. Interestingly, the suggestive evidence from further haplotype analyses converges on this region, but its significance needs to be determined by follow-up analysis. None of the genotyped SNPs within this region were located in exons (all were in the middle of intron 4–5), and they are thus unlikely to affect ESR1 protein structure. It is still possible, however, that the SNPs themselves, or one or more SNPs in LD with any of them, may affect the regulation of ESR1 protein expression. In fact, it has been shown that ESR1 protein overexpression is common in breast cancer [24]. Common variants within the EGF gene did not appear to affect the risk of developing breast cancer, developing a tumour with high NPI, or dying from the disease.

Our study was a well-designed, population-based case-control study. Case ascertainment and case survival status were established using the nationwide, high-quality Cancer Registry and Causes of Death Registry in Sweden. Exposure status of the participants was determined using genotyping methods with low error rates, and all results underwent detailed quality control. We sought to obtain tissue samples from the deceased cases and those cases that had declined donation of a blood sample, and were able to obtain the majority of the samples requested. The relatively minor lack of tissue accessibility is unlikely to be related to our exposure, ESR1 or EGF genetic variation, as it depended on the inability of the respective pathology department to retrieve the samples. Tissue sample availability is therefore unlikely to have led to selection bias. The main concern is that the non-participation of a small number of deceased cases might have reduced the power of our study, especially for the survival analysis. Furthermore, a problem might have arisen since we were not able to genotype seven tagSNPs in ESR1 and one tagSNP in EGF in the tissue samples. If these eight tagSNPs were in fact associated with severe disease, the association with risk of breast cancer death might have been biased towards the null in our study since we did not genotype all the severe cases. The fact that the results were not different when we restricted our analyses to the most severe cases among those who donated blood samples indicates that the eight tagSNPs were unlikely to be associated with severe disease.

In the selection stage of our study, we oversampled cases and controls that were long-term users of menopausal hormones and those that had self-reported diabetes mellitus. In the case of an association between the tagSNPs under study and menopausal hormone use or diabetes mellitus, this oversampling might have caused us to detect an artificial association between the tagSNPs and breast cancer risk. We therefore assessed if the tagSNPs were associated with menopausal hormone use or diabetes mellitus. We found no connection between the factors and conclude that the oversampling is unlikely to have posed a problem in our study.

Most previous publications regarding the ESR1 gene and breast cancer risk have included only a few polymorphisms in the gene. One study, however, genotyped 17 common SNPs in the ESR1 gene and found three haplotypes to decrease breast cancer risk and one haplotype that increased the risk [6]. None of the haplotypes carried SNPs that were located in the region in ESR1 we found to be associated with breast cancer risk. Two of the haplotypes that showed a protective effect against breast cancer risk (H4 and H6) carried our TAG21 (rs1801132, codon 325) [6]. We were not able to confirm this association using a window or a haplotype analysis.

The PvuII (TAG6, rs2234693), XbaI (TAG7, rs9340799), codon 243 (TAG14, rs4986934) and codon 325 (TAG21, rs1801132) variants are among the most commonly studied polymorphisms in the ESR1 gene. The first two have been suggested in a couple of studies to decrease the risk of endometrial cancer [25, 26] and PvuII might affect breast cancer survival depending on oestrogen receptor status of the tumour [27], but no consistent effect over studies has been shown for the four variants with regard to breast cancer risk [19, 28–37]. We found no association between these SNPs and overall breast cancer risk, NPI or breast cancer survival.

Conclusion

We analysed common genetic variation in the ESR1 and EGF genes in relation to breast cancer risk, tumour characteristics and breast cancer survival using a comprehensive haplotype tagging analysis. To our knowledge, this is the first systematic association study of these two genes for breast cancer susceptibility and prognosis. We located a region in ESR1 which showed a moderate signal for association with breast cancer risk, but were unable to link common variation in the EGF gene with breast cancer aetiology or prognosis.