Introduction

Family history is a well-established risk factor for breast cancer. First-degree relatives of women with breast cancer have an approximately twofold increased risk of developing the disease relative to the general population [1]. Twin studies are consistent with this familial clustering having, at least in part, a genetic origin [2, 3]. Mutations in high-risk susceptibility genes (mainly BRCA1 and BRCA2) explain most large multiple-case families, but account for only 15 to 20% of the excess familial risk [4]. Genome-wide association studies [5, 6] have identified more than 70 common variants that are associated with breast cancer susceptibility but they account for only another approximately 15% of the excess familial risk. The so-called ‘missing heritability’ may be explained by common variants with very small effects and/or by rarer variants with larger effects, neither of which can be identified by current genome-wide association studies. A statistically efficient alternative is to increase power by trying to identify variants associated with known quantitative phenotypic markers of susceptibility to breast cancer [7], and then to test them for association with breast cancer risk. This approach might also improve our understanding of the biological mechanisms involved in breast cancer pathogenesis.

Endogenous sex hormones are well-established risk factors for breast cancer in postmenopausal women [8]; the evidence in premenopausal women is less consistent, with some, but not all, studies suggesting an association between higher circulating levels of oestrogens and increased breast cancer risk [9–17]. Genetic factors influence the levels of endogenous sex hormones [18] and therefore single nucleotide polymorphisms (SNPs) in genes regulating these hormonal pathways are good candidates for being breast cancer predisposition variants. We have previously studied 642 SNPs tagging 42 genes that might influence sex hormone levels in 729 healthy premenopausal women of European ancestry in relation to cyclic variations in oestrogen levels during the menstrual cycle. We found that the minor allele of rs10273424, which maps 50 kb 3′ to CYP3A5, was associated with a reduction of 22% (95% confidence interval (CI) = –28%, –15%; P = 10⁻⁹) in levels of urinary oestrone glucuronide, a metabolite that is highly correlated with serum oestradiol levels [19]. Analysis of 10,551 breast cancer cases and 17,535 controls of European ancestry demonstrated that the minor allele of rs10235235, a proxy for rs10273424 (r² = 1.0), was also associated with a weak reduction in breast cancer risk but only in women aged 50 years or younger at diagnosis (odds ratio (OR) = 0.91, 95% CI = 0.83, 0.99; P = 0.03) [19].

The aim of the present study was to further investigate an association between rs10235235 and breast cancer risk using a much larger set of subjects – the Breast Cancer Association Consortium (BCAC) – comprising data from 49 additional studies, and to assess whether there was evidence of effect modification by age at diagnosis, ethnicity, age at menarche or tumour characteristics.

Materials and methods

Sample selection

Samples for the case–control analyses were drawn from 52 studies participating in the BCAC: 41 studies from populations of predominantly European ancestry, nine studies of Asian ancestry and two studies of African-American ancestry. The majority were population-based or hospital-based case–control studies, but some studies were nested in cohorts, selected samples by age, oversampled for cases with a family history or selected samples on the basis of tumour characteristics (Table S1 in Additional file 1). Studies provided ~2% of samples in duplicate for quality control purposes (see below). Study subjects were recruited on protocols approved by the Institutional Review Boards at each participating institution, and all subjects provided written informed consent (Additional file 2).

Genotyping and post-genotyping quality control

Genotyping for rs10235235 was carried out as part of a collaboration between the BCAC and three other consortia (the Collaborative Oncological Gene-environment Study (COGS)). Full details of SNP selection, array design, genotyping and post-genotyping quality control have been published [5]. Briefly, three categories of SNPs were chosen for inclusion in the array: SNPs selected on the basis of pooled genome-wide association study data; SNPs selected for the fine-mapping of published risk loci; and candidate SNPs selected on the basis of previous analyses or specific hypotheses. rs10235235 was a candidate SNP selected on the basis of our previous analyses [19].

For the COGS project overall, genotyping of 211,155 SNPs in 114,225 samples was conducted using a custom Illumina Infinium array (iCOGS; Illumina, San Diego, CA, USA) in four centres. Genotypes were called using Illumina’s proprietary GenCall algorithm. Standard quality control measures were applied across all SNPs and all samples genotyped as part of the COGS project. Samples were excluded for any of the following reasons: genotypically not female XX (XY, XXY or XO, n = 298); overall call rate <95% (n = 1,656); low or high heterozygosity (P < 10⁻⁶, separately for individuals of European, Asian and African-American ancestry, n = 670); individuals not concordant with previous genotyping within the BCAC (n = 702); individuals where genotypes for the duplicate sample appeared to be from a different individual (n = 42); cryptic duplicates within studies where the phenotypic data indicated that the individuals were different, or between studies where genotype data indicated samples were duplicates (n = 485); first-degree relatives (n = 1,981); phenotypic exclusions (n = 527); or concordant replicates (n = 2,629).

Ethnic outliers were identified by multidimensional scaling, combining the iCOGS array data with the three HapMap2 populations, based on a subset of 37,000 uncorrelated markers that passed quality control (including ~1,000 selected as ancestry informative markers). Most studies were predominantly of a single ancestry (European or Asian), and women with >15% minority ancestry, based on the first two components, were excluded (n = 1,244). Two studies from Singapore (SGBCC) and Malaysia (MYBRCA; see Table S1 in Additional file 1 for all full study names) contained a substantial fraction of women of mixed European/Asian ancestry (probably of South Asian ancestry). For these studies, no exclusions for ethnic outliers were made, but principal components analysis (see below) was used to adjust for inflation in these studies. Similarly, for the two African-American studies (NBHS and SCCS), no exclusions for ethnic outliers were made.

Principal component analyses were carried out separately for the European, Asian and African-American subgroups, based on a subset of 37,000 uncorrelated SNPs. For the analyses of European subjects, we included the first six principal components as covariates, together with a seventh component derived specific to one study (LMBC) for which there was substantial inflation not accounted for by the components derived from the analysis of all studies. Addition of further principal components did not reduce inflation further. Two principal components were included for the studies conducted in Asian populations and two principal components were included for the African-American studies.

For the main analyses of rs10235235 and breast cancer risk, we excluded women from three studies (BBCS, BIGGS and UKBGS) that were genotyped in the hypothesis-generating study (n = 5,452) [19] and women with non-invasive cancers (ductal carcinoma in situ/lobular carcinoma in situ, n = 2,663) or cancers of uncertain status (n = 960). After exclusions there were 47,346 invasive breast cancer case samples and 47,570 control samples from 49 studies (38 from populations of predominantly European ancestry, nine Asian and two African-American) used in the analysis (Tables S1 and S2 in Additional file 1). After quality control exclusions (above) the call rate for rs10235235 was 100% (one no call in 94,916 samples), and for the controls there was no evidence of deviation from Hardy–Weinberg equilibrium in any of the contributing studies (Table S2 in Additional file 1).
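The Hardy–Weinberg check on control genotypes can be illustrated with a minimal sketch. The text does not state which test was applied, so the one-degree-of-freedom chi-square test below is an assumption, and the function name and genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa: int, n_ag: int, n_gg: int) -> tuple[float, float]:
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ag + n_gg
    q = (n_ag + 2 * n_gg) / (2 * n)  # frequency of the G (minor) allele
    p = 1.0 - q
    if q == 0.0 or p == 0.0:
        raise ValueError("monomorphic sample: the test is undefined")
    # Genotype counts expected under equilibrium: n*p^2, 2*n*p*q, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ag, n_gg)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical control counts (not study data), chosen to sit at equilibrium.
chi2, p_value = hwe_chi_square(8100, 1800, 100)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

For a variant as rare as rs10235235 is in the Asian studies, expected homozygote counts are very small, and an exact test would normally be preferred over this approximation.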

We did not test for an association between rs10235235 and age at menarche in our hypothesis-generating study [19]. Therefore, to maximise our power to detect an association, we included menarche data from BBCS cases (n = 2,508) and controls (n = 1,650) and from UKBGS cases (n = 3,388) and controls (n = 4,081) in this analysis. Age at menarche was not available for samples from BIGGS. Full details of genotyping of rs10235235 in BBCS and UKBGS samples have been published previously [19]. Briefly, genotyping was carried out using competitive allele-specific polymerase chain reaction KASPar chemistry (KBiosciences Ltd, Hoddesdon, Hertfordshire, UK). Call rates were 98.0% (BBCS) and 96.6% (UKBGS); there was no evidence for deviation from Hardy–Weinberg equilibrium (P = 0.29 (BBCS); P = 0.92 (UKBGS)), and the duplicate concordance based on a 1% (BBCS) and 5% (UKBGS) random sample of duplicates was 100% for both studies.

Statistical analysis

We estimated per-allele and genotypic log odds ratios (ORs) for the European, Asian and African-American subgroups separately using logistic regression, adjusted for principal components and study [5]. To test for departure from a multiplicative model we compared multiplicative and unconstrained models using a one degree of freedom likelihood ratio test. Heterogeneity in ORs between studies within each subgroup (European, Asian and African-American), and between subgroups, was assessed using the Cochran Q statistic and quantified using the I² measure [20].
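As an illustration of the heterogeneity measures, the sketch below computes Cochran's Q and I² from study-specific log ORs and their standard errors, using inverse-variance weights and the usual definition I² = max(0, (Q − df)/Q) × 100%. The function name and the input values are hypothetical, and this is not the Stata code used for the analysis.

```python
def cochran_q_i2(log_ors: list[float], ses: list[float]) -> tuple[float, float, float]:
    """Return (pooled log OR, Cochran's Q, I-squared in percent)."""
    weights = [1.0 / se ** 2 for se in ses]  # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    # Q: weighted sum of squared deviations from the fixed-effects estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    # I-squared: share of variation beyond that expected by chance alone.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical study-level estimates (not taken from Table S2).
pooled, q, i2 = cochran_q_i2([-0.05, -0.02, -0.08], [0.04, 0.05, 0.06])
print(f"pooled log OR = {pooled:.3f}, Q = {q:.2f}, I2 = {i2:.1f}%")
```

Under the null hypothesis of homogeneity, Q is referred to a chi-square distribution with (number of studies − 1) degrees of freedom.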

Analyses stratified by oestrogen receptor status (+/–), progesterone receptor status (+/–), morphology (ductal or lobular), grade (1,2,3), lymph node involvement (+/–) or age at diagnosis (≤50 and >50 years) were restricted to studies of European ancestry due to the small number of studies of Asian and African-American ancestry. In addition, studies were excluded if they had selected cases on the basis of the stratifying variable, or had collected data on that variable for less than 5% of cases or less than 10 cases in total. Availability of data for each of the stratifying variables in each study is shown in Table S3 in Additional file 1. To assess the relationship between each of the stratifying variables and genotype, stratum-specific ORs were calculated using logistic regression. Cases in each stratum were compared with all control subjects, adjusted for study and principal components. Case-only logistic regression was used to test for heterogeneity between strata (binary stratifying variables) or across strata (stratifying variables with three or more strata). P values were estimated using likelihood ratio tests with one degree of freedom.

We assessed whether rs10235235 was associated with age at menarche in cases and controls separately. Studies that had not collected data on age at menarche in both cases and controls were excluded (Table S4 in Additional file 1). We used linear regression, adjusted for principal components and study, to estimate the relationship between age at menarche (years) and rs10235235 genotype (0, 1, 2 rare alleles) and logistic regression adjusted for principal components and study to estimate the association between age at menarche and breast cancer risk. To test for effect modification of an association between rs10235235 and breast cancer risk by age at menarche, we used logistic regression adjusted for principal components, study and age at menarche (grouped as ≤11, 12, 13, 14 and ≥15 years) with and without an interaction term(s). We considered four models: no interaction (zero interaction terms); assuming a linear interaction between genotype and menarche group (one interaction term); assuming a linear interaction between genotype and menarche group but allowing the linear term to differ between women who were heterozygous and those who were homozygous for the rare allele (two interaction terms); and one interaction term for each possible genotype/menarche group combination (eight interaction terms). Nested models were compared using likelihood ratio tests. All statistical analyses were performed using STATA version 11.0 (StataCorp, College Station, TX, USA). All P values reported are two-sided.
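The nested-model comparison can be sketched as follows. This is a minimal illustration, assuming the log-likelihoods have already been obtained from fitted logistic models; the function names and numerical values are hypothetical. The eight terms of the saturated model correspond to (3 − 1) genotype contrasts × (5 − 1) menarche-group contrasts.

```python
import math

# Interaction terms in each of the four models described above.
INTERACTION_TERMS = {
    "none": 0,
    "linear": 1,
    "linear_by_genotype": 2,
    "saturated": (3 - 1) * (5 - 1),  # genotype contrasts x menarche-group contrasts
}


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))  # closed form for 1 df
    else:
        a, q = 1.0, math.exp(-half)  # closed form for 2 df
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_reduced: float, loglik_full: float,
                          extra_terms: int) -> tuple[float, float]:
    """Compare nested models; extra_terms is the number of added parameters."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, extra_terms)


# Hypothetical log-likelihoods: no-interaction model versus linear interaction.
extra = INTERACTION_TERMS["linear"] - INTERACTION_TERMS["none"]
stat, p = likelihood_ratio_test(-30210.4, -30207.6, extra)
print(f"LR statistic = {stat:.2f} on {extra} df, P = {p:.3f}")
```

The same function covers the other comparisons by changing `extra_terms`, for example 6 when testing the saturated model against the two-term model.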

Results

The case–control analysis comprised genotype data for 47,346 invasive breast cancer cases and 47,569 controls from 49 studies, including 80,518 (84.8%) subjects of self-reported European ancestry, 12,419 (13.1%) of self-reported Asian ancestry and 1,978 (2.1%) of self-reported African-American ancestry. The mean (± standard deviation) age at diagnosis was 56.1 (± 11.6) years for European cases, 51.1 (± 10.5) years for Asian cases and 53.1 (± 10.7) years for African-American cases. There were ethnic differences in the estimated minor allele frequency (MAF) of rs10235235 (Q = 7317.1, two degrees of freedom; P for heterogeneity (Phet) = 0). The overall MAF for European control women was 0.089 (95% CI = 0.087, 0.091), but with strong evidence of between-study heterogeneity (Phet = 1 × 10⁻²²) that was accounted for by the three Finnish studies (HEBCS, MAF = 0.15; KBCP, MAF = 0.21; and OBCS, MAF = 0.15; Phet = 0.01); no evidence of heterogeneity remained after taking account of these studies (MAF = 0.087 (95% CI = 0.085, 0.089); Phet = 0.23). Relative to Europeans, the overall MAF was higher for African-Americans (0.213, 95% CI = 0.195, 0.232; Phet = 0.26) but much lower for Asians (0.002; 95% CI = 0.001, 0.002), with strong evidence of between-study heterogeneity for the latter (Phet = 4 × 10⁻¹⁴).

The case–control analysis was consistent with a modest association between rs10235235 and breast cancer risk for women of European ancestry, with an estimated per-allele OR of 0.96 (95% CI = 0.93, 0.99; P for linear trend (Ptrend) = 0.02). Genotype-specific ORs were 0.98 (95% CI = 0.94, 1.01; P = 0.21) for AG versus AA (Figure 1A) and 0.80 (95% CI = 0.69, 0.93; P = 0.004) for GG versus AA (Figure 1B), with no evidence of between-study heterogeneity for either OR estimate (Phet = 0.44, I² = 1.9% and Phet = 0.76, I² = 0.0% for heterozygote and homozygote OR estimates respectively). There was, however, marginally significant evidence that the genotypic OR estimates departed from those expected under a multiplicative model with the inverse association of the GG genotype being more than the square of that of the AG genotype (test for deviation from multiplicative model, P = 0.04).
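As a rough illustration of this departure, under a multiplicative model each additional copy of the G allele multiplies the odds by the same factor, so the homozygote OR equals the square of the heterozygote OR. Plugging in the reported point estimates (and ignoring their sampling error, which the formal likelihood ratio test accounts for) gives:

```latex
\[
\mathrm{OR}_{GG}^{\text{mult}} = \left(\mathrm{OR}_{AG}\right)^{2} = 0.98^{2} \approx 0.96,
\qquad
\mathrm{OR}_{GG}^{\text{obs}} = 0.80 .
\]
```

The observed homozygote estimate therefore lies further from the null than the value implied by the heterozygote estimate.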

Figure 1

Association of rs10235235 with breast cancer risk for women of European ancestry. Forest plots of the association of the rs10235235 AG (heterozygote) genotype (A) and GG (homozygote) genotype (B) with breast cancer risk for women of European ancestry. Horizontal lines, 95% confidence intervals (CIs); square boxes, study-specific fixed-effects estimates; diamond, combined, fixed-effects estimate of the odds ratio (OR) and 95% CI. Vertical line, null effect (OR = 1.0); dashed vertical line, estimated heterozygote OR (A) and estimated homozygote OR (B). Homozygote ORs for six studies (CTS, DEMOKRITOS, kConFab/AOCS, NBCS, NBHS and RPCI) could not be estimated because there were no GG homozygotes among cases or among controls in each of these studies (see Table S2 in Additional file 1).

Data for rs10235235 in women of Asian or African-American ancestry were more limited, with just two African-American studies (1,046 cases and 932 controls) and nine Asian studies (5,795 cases and 6,624 controls). In addition, this SNP was sufficiently rare in Asian populations (MAF = 0.002) that we were unable to estimate the heterozygote OR in two Asian studies (SEBCS, one carrier among 1,114 cases and no carriers among 1,129 controls; TWBCS, one carrier among 236 controls and no carriers among 774 cases; Table S2 in Additional file 1) and we could not estimate a homozygote OR for any Asian study (Table S2 in Additional file 1). There was no clear evidence that this SNP was associated with breast cancer risk for women of Asian ancestry (heterozygote OR = 1.06, 95% CI = 0.76, 1.49) or African-American ancestry (heterozygote and homozygote ORs were OR = 1.09, 95% CI = 0.90, 1.32 and OR = 0.94, 95% CI = 0.62, 1.42 respectively; Figure S1 in Additional file 1). This analysis, however, had low power to detect associations in non-Europeans and these OR estimates were not inconsistent with the magnitude of the observed OR estimates for European women (Phet = 0.51).

Stratifying cases by oestrogen receptor (Phet = 0.83) or progesterone receptor (Phet = 0.19) status, tumour grade (Phet = 0.63) or nodal involvement at diagnosis (Phet = 0.51) showed no evidence of effect modification (Table 1). There was some evidence of effect modification by morphology (Phet = 0.03). For ductal cancers we estimated a very modest reduction of risk for heterozygotes (ORhet = 0.98, 95% CI = 0.93, 1.02; P = 0.30) and a stronger, significant reduction for homozygotes (ORhom = 0.74, 95% CI = 0.61, 0.90; P = 0.003). For lobular cancers there was no such trend (ORhet = 1.07, 95% CI = 0.98, 1.17; P = 0.14 and ORhom = 0.91, 95% CI = 0.64, 1.27; P = 0.57).

Table 1 Association of rs10235235 with risk of breast cancer for women of European ancestry: stratified analysis

The SNP rs10235235 maps to a locus (CYP3A) that has been considered an a priori candidate for involvement in determining age at menopause and age at menarche [21, 22]. Stratifying cases by age at diagnosis (≤50 or >50 years) as a proxy for menopausal status at diagnosis showed no evidence of effect modification (Phet = 0.89; Table 2), and excluding cases who were diagnosed between age 46 and 55 as potentially perimenopausal did not alter this result (Phet = 0.28). Data on age at menarche were available for 21,736 cases and 22,686 controls (Table S4 in Additional file 1); to increase the power of the analysis we included additional data from BBCS and UKBGS (5,737 cases, 5,572 controls; Table S4 in Additional file 1) [19]. There was a 1.5% (95% CI = 0.5%, 2.7%; P = 0.004) reduction in breast cancer risk associated with each additional year’s increase in age at menarche. Mean age at menarche was positively associated with number of copies of the minor allele of rs10235235 for controls (Ptrend = 0.005; Table 3) but not for cases (Ptrend = 0.97; Table 3). Consequently, there was an inverse trend in the magnitude of the heterozygote and homozygote breast cancer ORs with mean age at menarche (Phet = 0.02; Table 4); being a carrier of one or two rare alleles of rs10235235 was associated with an estimated 16% (ORhet = 0.84, 95% CI = 0.75, 0.94; P = 0.003) or 19% (ORhom = 0.81, 95% CI = 0.51, 1.30; P = 0.39) (Ptrend = 0.002) reduction in breast cancer risk for women who had their menarche at ages ≥15 years but there was no evidence of reduction for those with a menarche at age ≤11 years (ORhet = 1.06, 95% CI = 0.95, 1.19; P = 0.30 and ORhom = 1.07, 95% CI = 0.67, 1.72; P = 0.78) (Ptrend = 0.29). There was no evidence that the inverse trend in the magnitude of ORs with mean age at menarche differed between heterozygous and homozygous carriers (P = 0.97) and no evidence that the trend was nonlinear (P = 0.70).

Table 2 rs10235235 and risk of breast cancer for women of European ancestry by age at diagnosis
Table 3 Association of rs10235235 with age at menarche for women of European ancestry by case-control status
Table 4 rs10235235 and risk of breast cancer for women of European ancestry by age at menarche

Discussion

This study of more than 47,000 breast cancer cases and 47,000 controls has confirmed that rs10235235, mapping to 7q22.1 (CYP3A), is associated with a reduction in breast cancer risk for women of European ancestry. Previously, our hypothesis-generating study of 10,000 breast cancer cases and 17,000 controls found a per-allele OR estimate of 0.96 (95% CI = 0.90, 1.02; P = 0.2), with marginally significant evidence of an inverse association for breast cancer diagnosed age 50 years or younger (OR = 0.91, 95% CI = 0.83, 0.99; P = 0.03) but no evidence of an association for breast cancer at later ages (OR = 1.01, 95% CI = 0.93, 1.10; P = 0.82) [19]. In this considerably larger study, we found a heterozygote OR estimate of 0.98 (95% CI = 0.94, 1.01; P = 0.21) and a homozygote OR estimate of 0.80 (95% CI = 0.69, 0.93; P = 0.004) with marginally significant evidence that the inverse association for homozygotes is greater than predicted by a multiplicative model (P = 0.04).

To our knowledge, rs10235235 is the first SNP to be associated with both breast cancer risk and age at menarche, consistent with the well-documented association between later age at menarche and a reduction in breast cancer risk [23]. Genome-wide association studies have identified more than 70 breast cancer risk variants [5, 6] and more than 30 variants associated with age at menarche [22], none of which map to the CYP3A locus. rs10235235 was originally identified on the basis of a highly significant association with hormone levels, accounting for 4.9% of the variation in premenopausal urinary oestrone glucuronide levels [19]. In this current analysis, rs10235235 accounted for only 0.01% of the variation across controls in age at menarche and we estimate that this SNP explains just 0.01% of the familial excess breast cancer risk. Our data thus illustrate the potential statistical efficiency of studies of intermediate phenotypes in the identification of rarer (MAF < 10%) risk alleles with modest associations. Our analysis shows some inconsistency with a recent genome-wide study of circulating oestradiol, testosterone and sex hormone-binding globulin in postmenopausal women [24]. In that study there was no genome-wide significant association observed with plasma oestradiol levels in either the primary analysis of approximately 1,600 postmenopausal women who were not taking postmenopausal hormones at blood draw or the secondary analysis that included approximately 900 current postmenopausal hormone users. Further studies will be needed to determine whether the lack of an association between CYP3A variants and postmenopausal plasma oestradiol levels reflects a difference in the menopausal status of the study subjects, the hormone/metabolite that was analysed or chance.

One possible explanation for the apparent effect modification of the rs10235235–breast cancer risk association by age at menarche is that this is a function of genotyping a marker SNP rather than the true causal variant. For example, if rs10235235 were in complete, but not perfect, linkage disequilibrium with a causal variant, SNP X, with a MAF substantially lower than that of rs10235235 (D′ ~ 1.0, r² < 1.0), then there would be three types of chromosome in the population: type i, chromosomes carrying the common allele of rs10235235 and the common allele of SNP X; type ii, chromosomes carrying the rare allele of rs10235235 and the common allele of SNP X; and type iii, chromosomes carrying the rare allele of rs10235235 and the rare (protective) allele of SNP X. Only chromosomes carrying the rare allele of rs10235235 and the rare (protective) allele of SNP X (type iii) would be enriched in controls. Genotyping the marker (rs10235235) rather than the causal variant leads to misclassification. As the causal variant is associated with a protective effect on breast cancer risk, the proportion of chromosomes carrying both the rare allele of the causal variant and the marker (type iii) compared with the common allele of the causal variant and the rare allele of the marker (type ii) will be greater in controls than in cases such that the extent of misclassification will be greater for cases than controls. This will attenuate the association between genotype and age at menarche to a greater extent in cases than in controls, creating an apparent effect modification. Fine mapping and functional studies will be required to identify the causal variant and to determine the true relationship between the causal variant, age at menarche and breast cancer risk.
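The misclassification argument can be made concrete with a small numerical sketch. It assumes that control chromosome frequencies approximate those of the population and that case frequencies are proportional to the control frequency multiplied by a per-chromosome relative risk (1.0 for types i and ii, below 1.0 for type iii). The frequencies and the relative risk of 0.7 are hypothetical, not estimates from this study.

```python
def protective_share(f_type_ii: float, f_type_iii: float,
                     rel_risk: float = 1.0) -> float:
    """Share of marker rare-allele chromosomes that also carry the protective allele.

    Each chromosome type is weighted by its relative risk (type ii: 1.0,
    type iii: rel_risk); rel_risk = 1.0 gives the share among controls.
    """
    weighted_iii = f_type_iii * rel_risk
    return weighted_iii / (f_type_ii + weighted_iii)


# Hypothetical frequencies: marker MAF of 0.09, of which 0.03 is type iii.
controls = protective_share(0.06, 0.03)
cases = protective_share(0.06, 0.03, rel_risk=0.7)
print(f"controls: {controls:.3f}, cases: {cases:.3f}")
```

With any relative risk below 1.0 the share is lower for cases than for controls, which corresponds to the greater misclassification in cases described above.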

Despite our original finding of a strong association between rs10235235 and hormone levels, we found no evidence that the association between this SNP and breast cancer risk differed by the hormone receptor status of the tumour, and nor did we find any evidence that the association differed by stage, grade or lymph node involvement. There was marginally significant evidence that the association between rs10235235 and breast cancer risk differed between ductal and lobular cancers (Phet = 0.03). Given the number of stratified analyses that we carried out (six stratifying variables) and given that there is no biological basis to support an interaction between rs10235235 and morphology, this is probably a chance observation.

In contrast to our earlier study [19], we found no evidence of an interaction with age at diagnosis when we stratified cases by age at diagnosis (≤50 or >50 years), either including or excluding cases diagnosed between age 46 and 55 years as potentially perimenopausal. We used age at diagnosis as a proxy for menopausal status at diagnosis because menopausal status at diagnosis is difficult to determine by questionnaire, especially given the use of hormone replacement therapies; while information on age at diagnosis was available for all but 1.4% (n = 554) of cases, information on age at natural menopause was missing for 65.6% (n = 26,552) of cases of European ancestry. Similarly, although rs10235235 is a plausible candidate for association with age at menopause, we did not test this due to the limited amount of data on age at natural menopause for controls of European ancestry (n = 11,294, 28.2%) and the difficulty in ascertaining whether treatment for breast cancer had influenced reported age at menopause for cases.

The strengths of our study include the large size of this combined analysis, and the availability of information on tumour characteristics for the majority of cases and on age at menarche for the majority of cases and controls. Limitations include low power of the study to examine an association between genotype and breast cancer risk for non-Europeans.

Conclusions

In summary, we have confirmed that rs10235235 is associated with breast cancer, have shown for the first time that rs10235235 is associated with age at menarche in controls and have suggested a potential mechanism for these associations. rs10235235, which maps to the CYP3A locus, probably tags a causal variant that affects expression of one or more CYP3A genes.