Introduction

Breast cancer is a heterogeneous disease that is associated with genetic and environmental factors. Prior to genetic studies, investigations mainly revolved around the presence of a family history and hormonal and reproductive risk factors [1-3], with the effects of other lifestyle factors being queried only recently. To date, the risk of disease has been shown to increase with a woman’s age, age at menopause, age at first live birth, previous occurrence of atypical hyperplasia and family history [4]. The inherited predisposition to this malignancy has also been thoroughly studied, revealing two major susceptibility genes, BRCA1 and BRCA2 [5, 6], as well as other gene mutations of lower penetrance [7-12]. However, these account for less than 5% of breast cancer cases, suggesting a more widespread relevance of common genetic variants in the population when considered cumulatively [13-17]. In addition, migrant and twin studies have hinted at an environmental component that can possibly override the genetic influences on breast cancer, suggesting a multi-factorial nature of breast cancer risk or gene-environment interactions [18, 19].

In an attempt to increase the clinical utility of these findings [20], statistical models have been designed and validated to aid in personalized risk assessment. Notably, the Gail model is the most widely used for breast cancer risk prediction. However, the model does not consider genetic factors directly [1] and has limited discriminatory power [21]. Conversely, other studies have, until recently, generally ignored the hormonal and reproductive characteristics of individual women [22, 23]. Unfortunately, most of the models are still lacking in their predictive ability [24-27] and may be inapplicable to Chinese populations.

Although breast cancer incidence is much lower in Asia than in Europe [28], its rapid increase toward the rates seen in the West is a cause for concern [29]. Singapore women are reported to have one of the highest rates of breast cancer incidence in the region, with an annual increase of more than 3% [30]. As linkage disequilibrium patterns differ among ethnic groups [31], findings from genome-wide association studies (GWAS) done on Caucasian populations [7, 32-36] are potentially less relevant to Asian women [31, 37-39]. In this nested case-control study, we incorporated a set of established GWAS risk alleles into a model with well-known lifestyle factors and evaluated its impact on predicting breast cancer risk in a Singapore Chinese cohort.

Methods

Study subjects

The subjects included in this study are women enrolled in the Singapore Chinese Health Study (SCHS), a population-based cohort study that has been described in detail previously [40]. Briefly, the cohort comprises 63,257 Chinese men and women between the ages of 45 and 74 who were recruited from 1993 to 1998. Participants were Singapore citizens or permanent residents who lived in government-built housing estates and belonged to either of two major dialect groups: Hokkien or Cantonese. All participants were interviewed at baseline in their homes, where they provided information on demographics, diet, level of physical activity, occupational exposure, smoking, and medical history. The women were also asked about their menstrual and reproductive history.

Between April 1994 and December 1999, blood and single-void urine specimens were collected from a random 3% sample of study enrollees. Details of the biospecimen collection, processing and storage procedures have been described previously [41]. Between January 2000 and April 2005, we extended our biospecimen collection to all surviving cohort members and collected biospecimens from 32,575 participants, representing a consent rate of about 60% of surviving cohort participants at that time.

Informed consent was obtained from all participants at baseline interview, as well as at time of biospecimen collection. The Institutional Review Board at the National University of Singapore has approved this study.

Case ascertainment

Incident breast cancer cases were identified through the population-based cancer registry in Singapore. As of 28 June 2010, 941 women in this cohort had developed breast cancer; among them, 414 had previously donated blood and were included in this study. Compared with breast cancer patients who did not donate a blood sample, those who donated were younger at diagnosis (54.9 versus 56.0 years). Patients who did not donate blood samples were less educated (39.1% had no formal education) than those who did (25.1% had no formal education). There was also a higher proportion with a family history of breast cancer among those who donated blood (n = 11, 2.66%) compared to those who did not donate (n = 4, 0.76%).

Control selection

For each of the 414 breast cancer cases, up to three control subjects were randomly selected among all female cohort participants who had donated blood samples, and who were alive and free of breast cancer history at the time of cancer diagnosis of their index case. The chosen controls were matched to the index case on age at study enrollment (±3 years), dialect group (Hokkien, Cantonese), menopausal status at sample collection, dates of study enrollment (±2 years) and of blood collection (±6 months). Three controls were found for each of 408 cases, and only two eligible controls were found for each of the remaining six cases.

SNP selection, genotyping and quality control

We reviewed all published GWAS results related to breast cancer [42]. Single nucleotide polymorphisms (SNPs) from various studies [43-45], including more than 40 novel SNPs that were very recently identified from the Breast Cancer Association Consortium (BCAC) [46] and subsequently evaluated in a collaborative study on East Asian women [39], were evaluated for their application in breast cancer risk assessment. Due to differences in haplotype structure between Caucasian and Chinese populations, among SNPs in the same locus with linkage disequilibrium (LD, r²) of more than 0.8 in the HapMap Han Chinese in Beijing (CHB) population [47], only the SNP with the most statistically significant association with breast cancer was genotyped, to ensure that all SNPs analyzed for risk prediction were independent of each other. SNPs with minor allele frequencies (MAF) less than 5% according to the Singapore Genome Variation Project (SGVP) [48] were also excluded.
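The selection rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `prune_snps`, the toy SNP names, and the greedy strategy (visit SNPs from the smallest P value upward and drop any SNP in high LD with one already kept) are hypothetical and are not taken from the study; only the r² > 0.8 and MAF < 5% thresholds come from the text.

```python
def prune_snps(p_values, r2, maf, r2_threshold=0.8, maf_threshold=0.05):
    """Greedy sketch: keep the most strongly associated SNP of each LD group.

    p_values: dict SNP -> association P value (hypothetical input)
    r2:       dict frozenset({snp1, snp2}) -> pairwise r^2 (missing pairs = 0)
    maf:      dict SNP -> minor allele frequency in the reference panel
    """
    kept = []
    # Visit SNPs from the strongest (smallest P) to the weakest association.
    for snp in sorted(p_values, key=p_values.get):
        if maf[snp] < maf_threshold:
            continue  # excluded: minor allele too rare
        in_ld = any(r2.get(frozenset((snp, k)), 0.0) > r2_threshold for k in kept)
        if not in_ld:
            kept.append(snp)
    return kept


# Hypothetical toy data, for illustration only.
p_values = {"snpA": 1e-8, "snpB": 1e-5, "snpC": 1e-6, "snpD": 1e-7}
r2 = {frozenset(("snpA", "snpB")): 0.92, frozenset(("snpA", "snpC")): 0.10}
maf = {"snpA": 0.30, "snpB": 0.28, "snpC": 0.12, "snpD": 0.02}
selected = prune_snps(p_values, r2, maf)
print(selected)
```

In this toy example snpB is dropped because it is in high LD with the more strongly associated snpA, and snpD is dropped because its MAF is below 5%.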

Genotyping was done using the Sequenom iPLEX Gold MassARRAY system in 96-well plates (Sequenom, San Diego, CA, USA). MassARRAY Assay Design software was used to design amplification and extension primers (Sequenom). Multiplex PCR amplification was performed using Qiagen HotStart Taq DNA polymerase with 10 ng of genomic DNA (Qiagen, Germantown, MD, USA). Finally, primer extension reactions were carried out according to manufacturer’s guidelines. The investigators were blinded to the case/control status of the samples.

Of the initial 69 SNPs, seven SNPs (rs3803662, rs4808801, rs8100241, rs11199914, rs11814448, rs10069690 and rs1292011) could not be analyzed further as they produced poor, indistinguishable clusters, which could result in unreliable genotype calls. The average call rate for all SNPs was 98%. However, the minor allele frequencies of rs11571833, rs132390, rs1045485, rs614367, rs999737 and rs8170 fell below the 1% threshold, and these SNPs were removed from analysis due to low power to detect any association with breast cancer. The genotype distribution of rs7716600 deviated from Hardy-Weinberg equilibrium in controls (P <0.0007), and this SNP was also discarded. Among the samples, three cases and fifteen controls did not meet the minimum call rate of 90%; for each of the three cases, the entire matched set was removed from the analyses. Therefore, 55 GWAS SNPs and 1,623 subjects (411 cases and 1,212 controls) were used for further data analyses.
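As an illustration of the Hardy-Weinberg check, the sketch below applies a one-degree-of-freedom chi-squared goodness-of-fit test to genotype counts in controls. The text does not state which test was used, so the chi-squared form, the function name `hwe_chi2_p` and the example counts are assumptions; only the P <0.0007 threshold is taken from the text. The function assumes a polymorphic SNP (both alleles observed).

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Chi-squared test (1 df) for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts among controls.
chi2_eq, p_eq = hwe_chi2_p(100, 200, 100)   # exactly in equilibrium
chi2_dev, p_dev = hwe_chi2_p(200, 0, 200)   # no heterozygotes at all
print(p_eq, p_dev < 0.0007)
```

A SNP such as the second example, whose P value falls below the threshold, would be discarded under the rule described above.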

Data analysis

The SCHS questionnaire contained demographic data, reproductive risk factors, as well as information on diet and lifestyle. Risk factors to be included in the prediction model were selected according to results reported from other studies done on the SCHS cohort and factors used in the original Gail model [1]. Variables in the model were: level of education (no formal schooling, primary school, or secondary school or above), age at first live birth (<20, 20 to 24, 25 to 29 or nulliparous, ≥30 years), age at menarche (≥14, 12 to 13, <12 years), history of past breast biopsy (yes, no), family history (yes, no), body mass index (BMI) (<20, 20 to 23.9, 24 to 27.9, ≥28 kg/m²) and genetic risk score (GRS) in quartiles based on the controls. BMI was calculated as the weight divided by the squared height (kg/m²). Family history was limited to first-degree relatives only. The history of past breast biopsy (yes/no) was known for 218 (13.2%) women in the current nested case-control sample. For the remaining women with unknown history of breast biopsy, we imputed the value of the breast biopsy variable by generating five imputed values and picking the most frequent one, with BMI, estrogen use and family history as predictors.

The association between breast cancer and demographic, reproductive, and other baseline characteristics was investigated using the Student’s t test and Mantel-Haenszel chi-squared test (linear by linear association) for continuous and categorical variables respectively. Established risk factors, namely parity, age at first live birth, age at menarche, age at menopause, BMI, family history, history of past breast biopsy and estrogen use, were examined for their independent associations with breast cancer risk. A GRS was derived for each individual to represent the cumulative effect of the genetic variants on a woman’s risk of breast cancer. Cochran’s Q test [49], which is the weighted sum of the squared differences between individual and pooled effects across studies, was used to test for heterogeneity among the current and published studies. The P values were obtained by comparing the statistic with a chi-square distribution with k-1 degrees of freedom, where k is the number of studies. SNPs were included in the computation of GRS only if heterogeneity was not statistically significant. Otherwise, they were removed on the basis of inconsistency among studies (rs11780156, rs6504950, rs6001930, and rs2981579). To account for multiple hypothesis testing, a false discovery rate (FDR) correction according to the Benjamini-Hochberg procedure [50] was applied. A total of 51 SNPs were included in the computation of GRS. All SNP selection criteria had been decided a priori.
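The heterogeneity test and the FDR adjustment described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the inverse-variance weights in Q (the usual convention, assumed here) and the example numbers are all hypothetical. The chi-square tail probability is computed with a standard series for the regularized incomplete gamma function so that the sketch needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def cochran_q(log_ors, ses):
    """Cochran's Q with k-1 degrees of freedom (inverse-variance weights assumed)."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    return q, df, chi2_sf(q, df)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-study log odds ratios and standard errors for two SNPs.
q_same, df_same, p_same = cochran_q([0.1, 0.1, 0.1], [0.05, 0.08, 0.10])
q_diff, df_diff, p_diff = cochran_q([0.0, 1.0], [0.1, 0.1])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_same, p_diff, adjusted)
```

Under the rule in the text, a SNP like the second example (clearly heterogeneous across studies) would be left out of the GRS.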

A fixed-effects meta-analysis of published GWAS and our study’s findings was performed in order to obtain reliable estimates for each SNP’s effect size in the form of a pooled odds ratio (pOR) derived from published and local studies. We weighted the effect size estimates of each study using the inverse of the corresponding standard errors of the respective studies. The GRS for an individual woman is equivalent to the sum of (log pOR of SNP) × (number of risk alleles that the individual carries for SNP) across all 51 SNPs. The GRS was normalized by dividing it by the average effect size of all SNPs in the population, as outlined previously [51]. Box plots were used to investigate the correlation between GRS and various breast cancer risk factors: age at first live birth, age at menarche, family history, past breast biopsy, BMI and education. Conditional logistic regression was used to calculate the crude and adjusted ORs with a 95% confidence interval for each risk factor. The P value for trend across categories was reported.
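The pooling and scoring steps above can be sketched as below. The function names, the example effect sizes and the `power` parameter are hypothetical. Fixed-effects pooling conventionally weights by inverse variance (`power=2`), whereas the text describes weights as the inverse of the standard errors (`power=1`), so both are exposed rather than assuming one. The normalization follows one reading of the text, namely division by the mean log pOR across SNPs.

```python
import math


def pooled_log_or(log_ors, ses, power=2):
    """Weighted average of per-study log odds ratios for one SNP.

    power=2 gives inverse-variance weights; power=1 gives weights equal to
    the inverse of the standard errors, as literally worded in the text.
    """
    weights = [1.0 / se ** power for se in ses]
    return sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)


def genetic_risk_score(risk_allele_counts, pooled_log_ors, normalize=True):
    """Sum over SNPs of (log pOR) x (number of risk alleles carried)."""
    raw = sum(pooled_log_ors[snp] * risk_allele_counts[snp]
              for snp in pooled_log_ors)
    if not normalize:
        return raw
    # Assumed normalization: divide by the average effect size across SNPs.
    mean_effect = sum(pooled_log_ors.values()) / len(pooled_log_ors)
    return raw / mean_effect


# Hypothetical pooled effects and genotype (0, 1 or 2 risk alleles per SNP).
effects = {"snp1": math.log(1.2), "snp2": math.log(1.1)}
counts = {"snp1": 2, "snp2": 1}
grs_raw = genetic_risk_score(counts, effects, normalize=False)
grs_norm = genetic_risk_score(counts, effects)
print(grs_raw, grs_norm)
```

With this normalization, a woman's score reduces to her plain risk-allele count whenever all SNPs have the same effect size, which makes the scaled GRS easy to interpret.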

Variables in the Gail model and BMI were used to construct the conditional logistic regression model. The models with and without a GRS were compared in terms of their ability to accurately assess a woman’s five-year absolute risk. The probability that an individual i would be diagnosed with breast cancer by a certain time point t, $P_i(t)$, was calculated as $P_i(t) = 1 - (S_t)^{C_i}$, where $S_t$ is the proportion of people who were not diagnosed with breast cancer (survived) up to time point t. We estimated $S_t$ using a Kaplan-Meier survival curve, based on data from the SCHS cohort, from which this nested case-control study was drawn. Since a five-year risk is required in this study, t = 5 years. The individual-level coefficient $C_i$ is determined by the formula $C_i = \exp[\sum_j \beta_j (x_{ij} - \mu_j)]$, where $\beta_j$ is the log odds ratio (OR) of the risk conferred by a variable j, and $x_{ij}$ refers to the value of variable j for individual i. The average for the variable in the population, $\mu_j$, was approximated using the average among controls. The benefit of adding genetic markers into the predictive model was assessed using a net reclassification improvement (NRI) index [52] that compares the risk classifications under models with and without GRS. To adjust the NRI index for overfitting, the index was further corrected using a bootstrap procedure [53].
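The absolute risk calculation above can be illustrated with a short sketch. Here one minus the baseline survival raised to the power of the individual coefficient is read as the probability of being diagnosed within t years. The function name and all numeric inputs (a baseline five-year survival of 0.99, a single risk factor with OR = 2) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def five_year_risk(s_t, betas, x, mu):
    """Absolute risk 1 - S_t ** C_i, with C_i = exp(sum_j beta_j * (x_ij - mu_j)).

    s_t:   baseline proportion free of breast cancer at t = 5 years
    betas: log odds ratios of the risk factors
    x:     the individual's values for those factors
    mu:    population averages (approximated by the control averages)
    """
    c_i = math.exp(sum(b * (xi - m) for b, xi, m in zip(betas, x, mu)))
    return 1.0 - s_t ** c_i


# Hypothetical inputs: one binary-like factor with OR = 2.
risk_average = five_year_risk(0.99, [math.log(2)], [1.0], [1.0])
risk_raised = five_year_risk(0.99, [math.log(2)], [2.0], [1.0])
print(risk_average, risk_raised)
```

A woman whose covariates equal the population averages has a coefficient of 1 and therefore the baseline risk (1.0% here), while one unit above average on a factor with OR = 2 gives 1 - 0.99² = 1.99%.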

As there is a general expectation of a more reliable risk prediction model as additional risk variants become identified [27, 54-56], we tested this hypothesis by rebuilding the model with six, nine, eleven, sixteen and fifty-one (this study) SNPs. These SNPs were chosen in the order in which their association with breast cancer risk was established through time [26, 27, 57-60].

Conditional logistic regression for the association between SNPs and breast cancer risk, NRI calculation and histogram plots were performed using R version 2.13.0. All other statistical analyses were performed using IBM SPSS version 21.0 (IBM Corp., Armonk, NY, USA). Statistical tests were two-sided and P <0.05 was considered statistically significant. In the test for heterogeneity, P <0.007 was used after accounting for multiple testing through Bonferroni correction.

Results

In total, 411 cases and 1,212 controls were used in the analyses. The distribution of subjects by background characteristics is shown in Table 1. As the cases and controls were matched on age and menopausal status, they were comparable in these aspects. Compared to controls, cases tended to be better educated (P = 0.003), older at first live birth (P = 0.022), younger at menarche (P = 0.033), more likely to be current estrogen users (P = 0.042) and more likely to fall in a higher GRS quartile (P < 0.001). They also tended to have a higher BMI (P = 0.066) and a positive family history of breast cancer (P = 0.063), although these differences did not reach statistical significance. The remaining factors, comprising sleep and dietary patterns, did not differ significantly between the two groups. Figure 1 displays no statistically significant correlation between GRS and the various breast cancer risk factors.

Table 1 Distribution of demographic and established risk factors for breast cancer in cancer patients (cases) and control subjects, The Singapore Chinese Health Study, 1993 to 1998
Figure 1 Correlation of genetic risk score (GRS) with various breast cancer risk factors. No statistically significant correlation was observed between GRS and the following breast cancer risk factors: age at first live birth, age at menarche, family history, past history of breast biopsy, body mass index (BMI) and education.

The ORs of the 51 SNPs from the current study and their pooled estimates with previous GWAS studies are displayed in Table 2. The SNPs that tag the ESR1 locus, rs2046210 and rs3757318, presented ORs that were statistically significant at the 5% level. Associations with FOXQ1 (P = 0.088) and TOX3 (P = 0.098) displayed marginal significance. Apart from the two tag SNPs for ESR1, another six SNPs (rs11552449, rs13387042, rs10759243, rs3903072, rs12422552, rs2236007) were also significantly associated with breast cancer in our study. Table S1 in Additional file 1 shows the corresponding pooled estimates when only published studies were used.

Table 2 The rare allele frequencies of 51 single nucleotide polymorphisms (SNPs) in breast cancer patients and control subjects and their association with risk of breast cancer, The Singapore Chinese Health Study, 1993 to 1998

The associations between breast cancer risk, established risk factors and GRS were evaluated. Compared to the lowest quartile, women in the highest GRS quartile had 75% higher odds of breast cancer (OR = 1.75, 95% confidence interval (CI) = 1.27 to 2.41) (Table 3). The magnitude of the association with GRS and the dose-response relationship remained almost unchanged even after adjusting for the established risk factors and education. Age at first live birth and age at menarche presented statistically significant trends with breast cancer risk, but were no longer significant after adjustment. In contrast, the association of BMI with risk became significant after other factors were considered. After accounting for GRS and the above-mentioned risk factors, neither a positive family history nor a previous breast biopsy was significantly associated with breast cancer risk.

Table 3 The relation for genetic risk score and established conventional risk factors with risk of breast cancer, The Singapore Chinese Health Study, 1993 to 1998

An NRI index was used to assess the improvement in risk classification that would result from adding GRS to a model comprising established risk factors only. Table 4 shows the distribution of women across the various five-year absolute risk categories from <1.0% to ≥2.5%. Approximately 1,400 women had estimated five-year cumulative risks of less than 1.5%, while only 5.4% of cases and 2.0% of controls were assigned to high-risk groups of 2% or more. Among the 44 cases who were categorized as having a five-year absolute risk of 1.5% to <2.0% under the model without GRS, 12 were shifted to higher risk groups while 13 were moved to the lower risk category of 1.0% to <1.5% when GRS was added to the model. Similarly, for about 46.1% of the controls who were initially estimated to have 1.5% to <2.0% risk based on established risk factors only, the new model with GRS indicated a lower risk of 1.0% to <1.5%, while 12 individuals were shifted to the 2.0% to <2.5% risk stratum. The reclassification improvement among cases was 11.2% (P <0.001), while that for controls was 2.2% (P = 0.04), though the latter was not statistically significant. This led to an NRI of 13.4% (P = 0.006), which decreased to 6.2% after correcting for optimism using a bootstrap method. This meant that, overall, about 6% of women were reclassified into more appropriate risk groups when a genetic component was considered.

Figure 2 demonstrates the change in model discriminatory power as GRS that represented the cumulative effect of six, nine, eleven, sixteen and fifty-one (this study) SNPs were used in predicting the five-year absolute risk. A marginal improvement in model performance was noted.
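The reclassification arithmetic can be sketched as a categorical NRI: the net proportion of cases moved to a higher risk category plus the net proportion of controls moved to a lower one (in the results above, 11.2% + 2.2% = 13.4%). The sketch below is an illustration with hypothetical function names and toy risks; the category boundaries mirror those named in the text, and the bootstrap optimism correction is not implemented.

```python
from bisect import bisect_right

# Five-year absolute risk categories, as proportions:
# <1.0%, 1.0% to <1.5%, 1.5% to <2.0%, 2.0% to <2.5%, >=2.5%
CUTOFFS = [0.010, 0.015, 0.020, 0.025]


def risk_category(risk):
    """Index (0 to 4) of the risk category that contains `risk`."""
    return bisect_right(CUTOFFS, risk)


def net_reclassification_improvement(old_risks, new_risks, is_case):
    """Return (case component, control component, NRI)."""
    moves = {True: [0, 0, 0], False: [0, 0, 0]}  # [up, down, total] per group
    for old, new, case in zip(old_risks, new_risks, is_case):
        delta = risk_category(new) - risk_category(old)
        group = moves[bool(case)]
        group[0] += delta > 0
        group[1] += delta < 0
        group[2] += 1
    case_up, case_down, n_cases = moves[True]
    ctrl_up, ctrl_down, n_ctrls = moves[False]
    case_component = (case_up - case_down) / n_cases
    control_component = (ctrl_down - ctrl_up) / n_ctrls
    return case_component, control_component, case_component + control_component


# Hypothetical risks without (old) and with (new) the GRS: 4 cases, 4 controls.
old = [0.012, 0.018, 0.016, 0.008, 0.016, 0.012, 0.022, 0.008]
new = [0.016, 0.021, 0.016, 0.008, 0.012, 0.012, 0.022, 0.008]
is_case = [True] * 4 + [False] * 4
result = net_reclassification_improvement(old, new, is_case)
print(result)
```

In the toy data two of four cases move up and one of four controls moves down, so the components are 0.5 and 0.25 and the NRI is 0.75.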

Table 4 Reclassification of five-year absolute risk of breast cancer based on a risk model containing Gail variables with and without genetic risk scores (GRS) on 411 breast cancer patients and 1,212 healthy women, The Singapore Chinese Health Study, 1993 to 1998
Figure 2 Effect of increasing number of SNPs in breast cancer risk assessment. The rate of increase in model discriminatory power (difference in log five-year absolute risk) between cases and controls diminishes as more SNPs are incorporated into the GRS. For instance, the change in improvement of model performance is minimal when the number of SNPs used in the GRS increased from 16 to 51. GRS is obtained by pooling the ORs of local and published studies. Y-axis is the density that reflects the frequency of subjects. (Black - controls, Red - cases). SNP, single nucleotide polymorphism; GRS, genetic risk score; OR, odds ratio.

Discussion

We have evaluated a total of 51 SNPs and constructed a GRS to reflect their cumulative effect on breast cancer risk. The genetic score was independently associated with breast cancer risk after adjusting for education and other established risk factors. These common genetic markers, when considered in aggregate, together with reproductive factors and BMI, can improve the risk stratification for close to 10% of Singapore Chinese women. Similar to past SCHS studies [61, 62], cases tended to be more highly educated than the controls, hence the adjustment for education in the conditional logistic regression model. However, BMI and family history did not differ significantly between the two groups, although the direction of the associations was in agreement with prior knowledge. Failure to reach statistical significance is very likely due to the limited sample size, which restricted our study’s power to detect an association. An attenuated effect of family history (OR = 1.78) after accounting for GRS was noted too. This could be because risk variants, which are directly or indirectly incorporated into the GRS, are also traits that tend to be inherited together.

To our knowledge, we have investigated the largest number of SNPs for use in risk assessment in an Asian population. Studies by Dai et al., Sueta et al., and Zheng et al. have reported the discriminatory power of using five, seven and eight SNPs respectively [37, 54, 55]. All groups demonstrated the clinical utility that can potentially be achieved with the incorporation of common genetic variants into a model containing established risk factors. Locally, a group has reported the potential effect of combining eight SNPs with clinicopathological factors in risk prediction for a Singapore Chinese population [56]. Likewise, we have shown that common genetic markers, when considered in aggregate, together with reproductive factors and BMI, can improve the risk stratification for close to 10% of Singapore Chinese women. However, the majority of the cases (79.6%) presented low five-year cumulative risks of less than 1.5% after GRS was considered, with only 5.4% being assigned five-year cumulative risks of 2% or more. This suggests that other genetic, physiological and environmental factors not accounted for in this study, including novel factors yet to be identified, still account for a large proportion of risk for breast cancer in this population.

To gauge how much value a genetic component can add to an individual’s risk assessment, we included the GRS in a model consisting of Gail variables and compared its performance with the model without GRS. Many of the studies published thus far have reported the accuracy of their models in terms of area-under-the-curve (AUC) values [63]. However, AUC is insensitive even when strong predictors are added to the model [64], which could partially account for the insignificant increases in model discrimination. It also does not provide information about the actual risks predicted; therefore, its direct clinical relevance is limited [65]. Instead, we have quantified the degree of correct risk reclassification by calculating the NRI index [52]. Even though NRI depends on arbitrary cutoff points, it is robust to moderate changes [66]. Bootstrapping was performed to account for overfitting of our data to the model, which could in turn lead to an overestimation of model performance. Although we attempted to incorporate all variables of the Gail model in our study, the low uptake of screening mammography meant that we did not have breast biopsy information for a majority of our participants. This problem of ‘missing data’ was overcome by imputing breast biopsy status based on BMI, estrogen use and family history, variables that differed between those who had ever and those who had never had a breast biopsy.

SNPs that tagged the ESR1 gene, which codes for estrogen receptor alpha (ESRα), presented statistically significant associations and consistent ORs in our study. This is in concordance with findings from other groups, which showed substantial effect sizes for SNPs in this region, highlighting a likely association with breast cancer among Chinese and Japanese women [67]. The study conducted among Singaporean Chinese also reported that rs2046210 recorded the largest magnitude, similar to our current study [56]. This is in agreement with a comparison study by Hein et al., which found significant association of the 6q25.1 locus in both Asians and Europeans but greater effects in the former [68]. Our study also supports the finding that the MAP3K SNP, rs889312, does not increase the risk of breast cancer among the Chinese [31, 38, 56], contrary to findings in European populations [7, 33]. However, unlike the other studies [7, 8, 31, 33, 37, 38], statistical significance was not observed here for another well-established susceptibility locus containing FGFR2.

The OR of rs11242675 (FOXQ1) reported here was 1.15. This is contradictory to the findings of many published works, which have reported statistically significant protective effects, but the risk effect we found is supported by the most recent BCAC study [46]. As a result, significant heterogeneity was observed among the various studies and a pOR was not applicable. Forkhead box Q1 (FOXQ1) is a transcription factor found on the 6p25 locus. Overexpression of the protein has been shown to enhance tumorigenicity and tumor growth through its angiogenic and anti-apoptotic properties [69]. Its novel role in the metastasis of breast cancer has also been suggested [70]. In view of a plausible biological function of FOXQ1 in promoting cancer aggression, as well as marginal statistical significance that was a likely consequence of small sample size, rs11242675 was included in the GRS for risk assessment.

Another SNP that was also considered in the GRS due to its marginally significant P value (P = 0.098) was rs4784227. Rs4784227 is situated at 16q12.1 [71] and has been predicted to interfere with the affinity of FOXA1, an essential component of ESRα signaling [72], to its binding site [73]. Its position in a regulatory region that interacts with the TOX3 promoter enables it to disrupt the expression of this gene, which in turn alters chromatin structure and DNA-protein binding patterns essential for cell survival [71]. An OR of 1.17 was seen in our study. This effect size and direction were similar to the findings of others; thus, a pOR was used for GRS computation.

It was observed, in Figure 2, that the marginal improvement in model performance was not proportional to the increase in the number of SNPs used. Although the discovery of additional SNPs does not drastically improve the assessment of breast cancer risk, this is expected since the first few SNPs discovered would have been associated with much larger effect sizes. Also, as the cost of genotyping continues to decrease, we expect the use of additional SNPs in risk assessment to be cost-effective in the near future.

This study has several strengths. The study was nested within a population-based prospective cohort, which allowed the use of questionnaire data collected before the occurrence of breast cancer to reduce recall and reverse causality bias. The inclusion of genetic variants in risk assessment is advantageous as it is not subject to time-dependent errors in measurement, unlike environmental exposures such as BMI or smoking. We have also shown the strength of the association between GRS and breast cancer risk; it remains virtually unchanged even after all other established risk factors have been considered, highlighting the importance of genetics in this aspect.

There are also several limitations in our study. The small sample size of approximately 1,600 women has made it difficult to attain statistical significance for most of the SNPs that were identified in GWAS studies. Nevertheless, the direction of the effects of most SNPs was consistent with the published literature. As all the subjects recruited were Chinese women, this could restrict the generalizability of our results. Studies will need to be conducted in larger populations and among women of other ethnicities to validate the effect of these polymorphisms. We were not able to consider two factors in this study: 1) the presence of copy number variations (CNVs) and their potential effects on breast cancer risk and 2) the various subtypes of the disease. Given the proximity of some SNPs to CNV regions [74], and the relation between CNVs and familial breast cancer [75], an effect of CNVs on risk of disease is not unlikely. However, modeling this poses difficulties and may not alter the results substantially [74]. Although further analysis by disease subtype would have been ideal, we were restricted by the study’s limited sample size and power. Finally, the breast cancer cases included in this study from the cohort had a higher prevalence for positive family history of breast cancer compared to cases that were not included in this study, although this prevalence was still generally low (2.7%).

Conclusions

In summary, we have shown the extent to which 51 SNPs may improve the current assessment of breast cancer risk. Most of the SNPs identified in other Western and Asian studies have presented similar effect sizes in our Singapore Chinese population. Despite conferring minimal increase in risk, individual genetic variants when considered cumulatively can result in considerable effects, leading to improved risk stratification. By including a genetic component for risk assessment, more targeted measures of prevention and screening can be implemented. For countries such as Singapore where breast cancer incidence is relatively low and mammography screening is not as well-received, cost-efficiency and ethical issues can be more aptly addressed.