Breast cancer is one of the most common cancers among women worldwide [1]. Although lifestyle- and environment-related factors are implicated in breast carcinogenesis, it is a complex polygenic disorder in which genetic makeup also plays an important role [2, 3]. In the past decades, high-penetrance genes (for example, BRCA1, BRCA2, PTEN and TP53) have been shown to be associated with familial breast cancer [4]. However, these genes account for fewer than 5% of all breast cancer patients, and most of the risk is likely to be attributable to a larger number of low-penetrance genetic variants [5-7].

Recently, several genome-wide association studies (GWAS) reported many novel breast cancer predisposing single nucleotide polymorphisms (SNPs) [8-14]. However, most of the studies were conducted among Caucasians [8-13] and only one among Chinese [14], so it is unclear whether these genetic variants are applicable marker SNPs in Asian women. Furthermore, evaluation of risk-predicting models is an important topic in genetic studies of human diseases, including breast cancer. An effective risk-predicting model can assist physicians in disease prevention, diagnosis, prognosis and treatment [15]. Building on the GWAS findings for breast cancer, many studies have combined genetic markers with traditional risk factors to evaluate risk-predicting models of breast cancer [16-22]. However, the performance of most of these breast cancer risk models has been unsatisfactory, and only one related study was available in Chinese women [17].

In the current study, a two-stage case-control study of 1,792 breast cancer cases and 1,867 cancer-free controls was conducted among Chinese women to replicate 15 selected SNPs identified from previous GWAS. Then, risk models were constructed and absolute risk was calculated to evaluate the combined effects of the significant SNPs and clinical risk factors.

Materials and methods

Study subjects

This study was approved by the institutional review board of Nanjing Medical University. The hospital-based case-control study included 1,792 breast cancer cases and 1,867 cancer-free controls, and the detailed recruitment process has been described previously [23-25]. In brief, incident breast cancer patients were consecutively recruited from the First Affiliated Hospital of Nanjing Medical University, the Cancer Hospital of Jiangsu Province and the Gulou Hospital, Nanjing, China, between January 2004 and April 2010. Exclusion criteria included a reported history of cancer, cancer metastasized from other organs, and previous radiotherapy or chemotherapy. All breast cancer cases were newly diagnosed and histopathologically confirmed, without restriction on age or histological type. Cancer-free control women, frequency-matched to the cases on age (± 5 years) and residential area (urban or rural), were randomly selected from a cohort of more than 30,000 participants in a community-based screening program for non-infectious diseases conducted in the same region. All participants were ethnic Han Chinese women. Of the eligible participants, 878 cases and 900 controls were randomly assigned to form the testing set, and the remaining 914 cases and 967 controls formed the validation set.

After providing informed consent, each woman was interviewed face-to-face by trained interviewers using a pre-tested questionnaire to obtain information on demographic data, menstrual and reproductive history, and environmental exposure history. After the interview, each subject provided 5 ml of venous blood. The estrogen receptor (ER) and progesterone receptor (PR) status of breast cancer had been determined by immunohistochemistry, and the results were obtained from the hospitals' medical records.

SNP selection and genotyping

The SNP selection procedure followed three criteria: (a) the SNP was reported as a marker SNP in a previous GWAS (last search in November 2009); (b) the minor allele frequency (MAF) was ≥ 0.05 in Han Chinese in Beijing (CHB) based on the HapMap database (phase II, release 24, November 2008); (c) when multiple SNPs were found in the same region, only SNPs in low linkage disequilibrium (LD) with one another (r² < 0.8) were included. Overall, 15 SNPs (11 regions: 2q35, 3p24, 5p11, 5p12, 6q22, 6q25, 8q24, 10q26, 11p15, 16q12 and 17q23; Table 1) were selected and genotyped using the medium-throughput TaqMan OpenArray Genotyping Platform (Applied Biosystems Inc., Carlsbad, CA, USA) for the testing set samples (878 cases and 900 controls) and TaqMan Assays on the ABI PRISM 7900 HT Platform (Applied Biosystems Inc.) for the validation set samples (914 cases and 967 controls). For OpenArray Assays, normalized human DNA samples were loaded and amplified on customized arrays following the manufacturer's instructions. Each 48-sample array chip contained two no template controls (NTCs). For TaqMan Assays, approximately equal numbers of case and control samples were assayed in each 384-well plate. Two blank controls in each plate were used for quality control, and 96 randomly selected duplicate samples were genotyped on both platforms; the results were more than 97% concordant.

Table 1 Association of breast cancer risk with 15 SNPs selected from previous GWAS in the testing set.

Statistical analyses

Differences between breast cancer cases and controls in demographic characteristics, risk factors and SNP frequencies were evaluated by Fisher's exact test (for categorical variables) or by Student's t-test or the t'-test (equal variances not assumed) (for continuous variables). Hardy-Weinberg equilibrium was evaluated by an exact test among the controls [26].
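As a rough illustration of the Hardy-Weinberg check among controls, the sketch below uses the chi-square goodness-of-fit approximation with one degree of freedom. This is a simpler stand-in, not the exact test cited in [26], and the function name and genotype counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    Takes the three genotype counts and returns (chi2, p_value) with
    1 degree of freedom. This is an approximation, not an exact test.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-square survival function for 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical control genotype counts, for illustration only.
chi2, p_value = hwe_chi_square(100, 200, 100)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

With small expected counts the chi-square approximation becomes unreliable, which is one reason an exact test is often preferred for this check.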

As shown in Additional file 1, three steps were performed to assess the breast cancer risk model.

(1) SNP screening. Following a two-stage strategy, associations between SNPs and risk of breast cancer were estimated by computing odds ratios (ORs) and their 95% confidence intervals (CIs).

(2) Risk model construction. For model parsimony, only genetic or clinical risk factors that were independently associated with breast cancer were included. Both the OR and the absolute risk (AR) were used as indicators to evaluate the risk model. For the OR-based risk model, two methods were used. The first treated each risk allele/factor equally and combined them by counting the risk alleles/factors. The second assessed the effects of the SNPs and risk factors using a risk score, defined as a linear combination of the SNP genotypes or risk factors weighted by their individual ORs (the log odds at each SNP locus was additive in the number of minor alleles, and the log odds for the entire model was additive across SNPs and other risk factors). The risk score was then classified into four groups by its quartiles in controls. AR is the risk of developing a disease over a given time period. In this study, the AR for each woman was estimated by a modified Gail model [16, 27]. This method uses a multiplicative model to derive genotype relative risks from the allelic OR. The allelic OR for each SNP was obtained by logistic regression analysis assuming an additive genetic model. For each of the three genotypes at each SNP, the genotype relative risk was converted to the risk relative to the population. The overall risk relative to the population for an individual was derived by multiplying the risks relative to the population of all SNPs and of the two clinical risk factors (age at menarche and age at first live birth). Finally, the AR for each woman was obtained from the overall risk relative to the population, calibrated by the incidence rate of breast cancer for women (aged 20 to 85 years) and the mortality rate for all causes except breast cancer from the Shanghai registration system, China [28].

(3) Risk model discrimination. Model performance in classifying breast cancer cases and controls was evaluated by receiver-operator characteristic (ROC) curves and the area under the curve (AUC). Differences between AUCs were tested by the non-parametric approach developed by DeLong et al. [29]. Furthermore, for the absolute risk-based risk models, we used 10-fold cross-validation to check the reliability of the models. All statistical tests were two-sided, and analyses were performed with Statistical Analysis System software (9.1.3; SAS Institute, Cary, NC, USA) and Stata (9.2; StataCorp LP, College Station, TX, USA), unless indicated otherwise.
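The multiplicative step described above, in which genotype relative risks are converted to risks relative to the population and then multiplied across SNPs, can be sketched as follows. This is a minimal illustration under stated assumptions: genotype relative risks of 1, OR and OR² for 0, 1 and 2 risk alleles, and population genotype frequencies taken from Hardy-Weinberg proportions. The function names, allelic ORs and allele frequencies are hypothetical and are not values from this study.

```python
def risks_relative_to_population(allelic_or, risk_allele_freq):
    """Risk relative to the population for 0, 1 and 2 risk alleles.

    Assumes a multiplicative model (genotype relative risks 1, OR, OR^2)
    and Hardy-Weinberg genotype frequencies in the population.
    """
    p = risk_allele_freq
    genotype_rr = [1.0, allelic_or, allelic_or ** 2]
    genotype_freq = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # Average relative risk in the population (the normalising constant).
    mean_rr = sum(rr * f for rr, f in zip(genotype_rr, genotype_freq))
    return [rr / mean_rr for rr in genotype_rr]

def overall_risk_relative_to_population(snps, risk_allele_counts):
    """Multiply per-SNP risks relative to the population for one woman.

    snps: list of (allelic_or, risk_allele_freq) pairs.
    risk_allele_counts: number of risk alleles (0, 1 or 2) at each SNP.
    """
    overall = 1.0
    for (allelic_or, freq), count in zip(snps, risk_allele_counts):
        overall *= risks_relative_to_population(allelic_or, freq)[count]
    return overall

# Hypothetical SNP parameters and genotype profile, for illustration only.
example_snps = [(1.20, 0.30), (1.35, 0.40)]
print(overall_risk_relative_to_population(example_snps, [2, 1]))
```

By construction, the frequency-weighted average of the per-genotype values equals 1 for each SNP, so a woman with an overall value of 1 carries the population-average risk. Clinical risk factors could be handled in the same way by replacing genotype frequencies with category frequencies.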


Results

A total of 1,792 breast cancer cases and 1,867 cancer-free controls were included in the final analysis, and the characteristics of these subjects are summarized in Table 2. Age at menarche (P < 0.001) and age at first live birth (P < 0.001) differed significantly between the cases and the controls in all samples. Among 1,437 breast cancer cases with known ER and PR status, 662 (46.07%) were both ER and PR positive, and 498 (34.66%) were both negative.

Table 2 Distribution of demographic characteristics and known breast cancer risk factors for cases and controls included in the study

The associations between the 15 selected SNPs and breast cancer risk in the testing set samples are presented in Table 1. The call rates of the 15 SNPs were all above 95% and the MAFs in the controls were all above 0.05. Five SNPs at 2q35, 3p24, 6q22, 6q25 and 10q26 were significantly associated with breast cancer risk (2q35: rs13387042, P = 0.039; 3p24.1: rs2307032, P = 0.017; 6q22.33: rs2180341, P = 0.040; 6q25.1: rs2046210, P = 1.26 × 10⁻⁵; 10q26.13: rs2981582, P = 0.037). Therefore, these five SNPs were included in the further validation analyses.

The call rates of the five SNPs in the validation stage were all above 95% (Table 3). Consistent associations were observed for the five SNPs, with significant or borderline significant P-values. Overall, after adjustment for age, age at menarche, menopausal status and age at first live birth, the five SNPs showed significant associations with breast cancer susceptibility (dominant genetic model: 2q35, rs13387042: OR = 1.26, 95% CI = 1.07 to 1.49; 3p24.1, rs2307032: OR = 1.24, 95% CI = 1.07 to 1.44; 6q22.33, rs2180341: OR = 1.22, 95% CI = 1.06 to 1.40; 6q25.1, rs2046210: OR = 1.51, 95% CI = 1.31 to 1.75; 10q26.13, rs2981582: OR = 1.31, 95% CI = 1.14 to 1.50).
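For readers less familiar with how an OR and its 95% CI arise, the sketch below computes a crude (unadjusted) OR from a 2 × 2 table using the standard log-OR (Woolf) interval. It is a simplified stand-in: the ORs reported above come from logistic regression with covariate adjustment, and the function name and counts here are hypothetical.

```python
import math

def odds_ratio_ci(case_carrier, case_noncarrier,
                  control_carrier, control_noncarrier, z=1.96):
    """Crude odds ratio with a Woolf (log-based) confidence interval.

    Returns (odds_ratio, lower, upper). z = 1.96 gives a 95% interval.
    All four cell counts must be greater than zero.
    """
    odds_ratio = (case_carrier * control_noncarrier) / (
        case_noncarrier * control_carrier)
    # Standard error of ln(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / case_carrier + 1 / case_noncarrier +
                          1 / control_carrier + 1 / control_noncarrier)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical carrier counts under a dominant model, for illustration only.
or_, lo, hi = odds_ratio_ci(600, 400, 500, 500)
print(f"OR = {or_:.2f}, 95% CI = {lo:.2f} to {hi:.2f}")
```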

Table 3 Association of SNPs with breast cancer risk in both testing and validation sets

The cumulative effects of the five SNPs and the two risk factors (age at menarche and age at first live birth) on breast cancer risk were examined by two methods (Table 4). The first method was based on counting risk alleles/factors. Women carrying six or more risk alleles of the five SNPs (5.75% of case patients and 3.23% of control subjects) had a nearly three-fold increased risk of developing breast cancer compared with those carrying less than one of the risk alleles (11.08% of case subjects and 16.70% of control subjects). When age at menarche and age at first live birth were taken into consideration, the top group (having more than seven risk alleles/factors) had a 5.61-fold increased risk compared to the reference group (adjusted OR = 5.61, 95% CI = 4.16 to 7.56). The second method was based on a risk score calculated as a linear combination of the SNP alleles or risk factors weighted by the individual odds ratio and then classified into four groups by quartiles. Subjects with a risk score in the highest quartile had a 91% increased breast cancer risk compared to those in the lowest quartile (adjusted OR = 1.91, 95% CI = 1.56 to 2.35, P for trend: 5.60 × 10⁻¹⁰). Similarly, a 4.73-fold increased risk was observed when age at menarche and age at first live birth were taken into consideration (adjusted OR = 4.73, 95% CI = 3.80 to 5.88, P for trend: 2.27 × 10⁻⁴⁷). We then assessed the performance of the two risk prediction methods in discriminating cases and controls by ROC curve analysis. The AUC for the risk score analysis (0.649, 95% CI: 0.631 to 0.667; sensitivity = 62.60%, specificity = 57.05%, Figure 1) was significantly higher than that for the risk factor counting method (AUC: 0.637, 95% CI: 0.619 to 0.655; sensitivity = 62.16%, specificity = 60.03%, Figure 2) (P < 0.0001).
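To make the second method and its evaluation concrete, the sketch below computes a weighted risk score as the sum of risk allele counts multiplied by ln(OR), and estimates the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control (ties count one half). The function names, per-allele ORs and genotype data are hypothetical, and the comparison of two AUCs by the DeLong method is not reproduced here.

```python
import math

def weighted_risk_score(risk_allele_counts, odds_ratios):
    """Sum of risk allele (or risk factor) counts weighted by ln(OR)."""
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, odds_ratios))

def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise case-control comparisons.

    Equivalent to the Mann-Whitney U statistic divided by the number
    of case-control pairs; tied scores contribute 0.5.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical per-allele ORs and genotypes, for illustration only.
example_ors = [1.15, 1.20, 1.10, 1.30, 1.25]
cases = [[2, 1, 1, 2, 1], [1, 1, 0, 2, 2], [0, 1, 1, 1, 0]]
controls = [[0, 1, 0, 1, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 0]]
case_scores = [weighted_risk_score(g, example_ors) for g in cases]
control_scores = [weighted_risk_score(g, example_ors) for g in controls]
print(auc(case_scores, control_scores))
```

The counting method corresponds to the special case in which every weight is equal, so any gain in AUC from the weighted score reflects differences in effect size between markers.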

Table 4 Cumulative effects of associated SNPs and clinical risk factors on the risk of breast cancer in all samples
Figure 1 The area under curves (AUCs) for breast cancer risk-predicting models calculated by risk score method.

Figure 2 The area under curves (AUCs) for breast cancer risk-predicting models calculated by risk counting method.

Absolute risk was also calculated by a modified Gail model to evaluate the combined effects of the five SNPs and the two risk factors, and a 65-year absolute risk of breast cancer among women aged 20 to 85 years was estimated for each subject. Table 5 shows a clear trend: more subjects were grouped as high risk as the number of risk alleles/factors increased. However, the variation in the absolute risk distribution increased with the number of factors used in the risk-predicting model. Compared with a uniform 65-year cumulative risk of 0.07 for breast cancer in the population, taken as the risk for women carrying four risk factors (the group with the largest proportion in controls: 22.01%, Table 5), a wide spectrum of absolute risk estimates was found using the five markers and the two clinical risk factors (Figure 3). At a cutoff of 0.14 (two-fold the population median risk) or 0.21 (three-fold the population median risk), 26.57% or 10.43% of women were grouped as high risk, respectively. We also used ROC curve analysis to evaluate the performance of absolute risk in classifying cases and controls. As shown in Figure 4, we obtained an AUC of 0.658 (95% CI: 0.640 to 0.676; sensitivity = 61.98%, specificity = 60.26%) for five SNPs plus two risk factors. Cross-validation gave similar AUCs (0.572 for five SNPs only, 0.644 for two risk factors only and 0.660 for five SNPs plus two risk factors), which suggests that the models are reasonably reliable.
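The conversion from an overall risk relative to the population to an absolute risk over an age range can be sketched with a Gail-type calculation that accumulates risk year by year while accounting for competing mortality. This is a minimal illustration, not the study's implementation: hazards are assumed constant within each one-year interval, the function name is hypothetical, and the incidence and mortality rates below are invented placeholders rather than the Shanghai registry rates.

```python
import math

def absolute_risk(relative_risk, incidence, competing_mortality,
                  interval_years=1.0):
    """Cumulative absolute risk over consecutive age intervals.

    relative_risk: the woman's overall risk relative to the population.
    incidence: population breast cancer incidence rate per interval.
    competing_mortality: mortality rate from all other causes per interval.
    Hazards are assumed constant within each interval.
    """
    event_free = 1.0   # probability of being alive and breast cancer free
    risk = 0.0
    for h_cancer_pop, h_death in zip(incidence, competing_mortality):
        h_cancer = h_cancer_pop * relative_risk
        h_total = h_cancer + h_death
        if h_total > 0:
            p_any_event = 1.0 - math.exp(-h_total * interval_years)
            # Share of events in this interval that are breast cancer.
            risk += event_free * (h_cancer / h_total) * p_any_event
            event_free *= 1.0 - p_any_event
    return risk

# Placeholder annual rates for ages 20 to 84 (65 one-year intervals).
incidence = [0.0005] * 65
mortality = [0.0030] * 65
for rr in (0.5, 1.0, 2.0):
    print(rr, round(absolute_risk(rr, incidence, mortality), 4))
```

Because the competing mortality term removes women from the at-risk pool, doubling the relative risk does not exactly double the absolute risk, and the gap widens as baseline rates rise.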

Table 5 Absolute risk estimated in all samples
Figure 3 Distribution of estimated absolute risk of breast cancer by modified Gail model in all samples.

Figure 4 The area under curves (AUC) for absolute risk of breast cancer.

The analyses of the five SNPs stratified by ER or PR status are summarized in Additional file 2. No significant heterogeneity was observed in the effect of each SNP across ER or PR subgroups. A further stratified analysis of the cumulative effects of the five SNPs (0 to 2 risk alleles coded as 0 and 3 or more risk alleles coded as 1) also found no heterogeneity between subgroups (Additional file 3).


Discussion

In our study involving 1,792 breast cancer cases and 1,867 cancer-free controls, 5 of the 15 variants identified in previous GWAS [8-14] were consistently associated with breast cancer risk in this Chinese population. Risk assessment models and absolute risk calculations combining the five SNPs and two clinical risk factors indicated that these markers have only small effects in discriminating cases from controls. Overall, the results provide further evidence for the utility of GWAS-identified SNPs in breast cancer risk assessment in Chinese women.

We summarized the associations of the 15 breast cancer SNPs identified by previous GWAS and subsequent replication studies (Additional file 4). SNP rs13387042 at 2q35 was identified as a breast cancer susceptibility SNP in two GWAS conducted among Europeans [12, 13]. Significant associations were also observed in most of the later studies of European and African American women [30-36], except for one reported by Stevens et al. [37]. However, the results were conflicting in Asian populations [12, 17, 38, 39]. For 3p24, Ahmed et al. reported marker SNPs rs4973768 and rs1357245 in a four-stage GWAS, and then located the strongest marker, rs2307032, in this region [8]. Subsequent replication studies for the 3p24 region also presented consistent results in European [34-37] and Asian [38, 40] populations, including our study. SNP rs2180341 at 6q22.33 was originally found in the Ashkenazi Jewish population [10] and was well replicated in Europeans [41]. In the current study, we found consistent results among Chinese; however, no significant association was observed in other studies involving Asian populations [17, 31, 36, 38]. SNP rs2046210, located upstream of the ESR1 gene on chromosome 6q25.1, was the only SNP reported by Zheng et al. (2009) in a GWAS conducted among Chinese women [14]; it was consistently replicated in Asian populations (Chinese and Japanese women, including partly overlapping samples from our group) [17, 42-44] and in women of European ancestry [14, 36, 37, 42], but not in African American women [31, 44]. SNP rs2981582 (10q26.13) was reported by Easton et al. in the first large-scale breast cancer GWAS [10]; it was replicated in Europeans and Asians [17, 32-36, 38, 40, 45-47], and was also reported previously by our group with partly overlapping study samples [25], but it was not replicated in Africans [31, 46]. In the current study, we enlarged our study sample and obtained similar results.

For the other SNPs, Han et al. successfully replicated SNPs rs4973768 (3p24.1), rs889312 (5p11.2) and rs3803662 (16q12.1) in Korean women with breast cancer [40]. However, SNPs rs4973768 (3p24.1), rs10941679 (5p12), rs889312 (5p11.2), rs13281615 (8q24.21), rs3817198 (11p15.5), rs12443621 (16q12.1) and rs6504950 (17q23.2) were not reported to be associated with breast cancer in Chinese women [17, 24, 38, 39], which is similar to our results. A potential explanation for the failure to replicate these SNPs in Chinese women is genetic heterogeneity (both allelic and locus heterogeneity). Allelic heterogeneity is the phenomenon in which different mutations at the same locus (or gene) cause the same disorder, whereas locus heterogeneity means that mutations in different genes may give rise to the same phenotype. Further large-scale resequencing or fine-mapping studies of these regions may help to find causal variants for breast cancer.

Traditional approaches to assessing patients' disease risk rely primarily on non-genetic risk factors and have apparent limitations; better prediction is expected if genetic determinants can be incorporated. Recently, several studies of this kind were published [16-22]. Zheng et al. conducted a validation study with 3,039 breast cancer cases and 3,082 controls for 12 GWAS-identified SNPs (nine regions) in Asian women [17], and built a risk assessment model with eight SNPs and five clinical risk factors. However, only five of the eight SNPs were significantly associated with breast cancer susceptibility in that study. In our current study, two more regions were incorporated (3p24.1, 17q23.2) and we found five susceptibility SNPs with a two-stage validation, although the performance of the risk assessment model was still limited.

Overall, a risk prediction model is not a diagnostic tool but provides an estimate of the likelihood of developing disease in the future. A well-evaluated risk model that combines genetic and clinical risk factors can be used as a screening tool to identify high-risk individuals in the general population. Women at high risk for breast cancer can be identified by choosing an optimal cutoff (for example, two-fold the population median risk), and these women should undergo regular breast cancer screening [48, 49]. Results from this study suggest that GWAS-identified SNPs can be used to improve the prediction model. However, the current study has a number of limitations. First, several newly reported breast cancer risk-associated SNPs were not included in the current analysis [50]. Second, more breast cancer risk factors should be evaluated, such as body mass index (BMI) and family history of breast cancer [14]. However, the effect of BMI on breast cancer risk could not be well evaluated in our study because of its retrospective design, and our moderate sample size limited our power to evaluate family history of breast cancer (only 101 cases (7.39%) and 3 controls (0.29%) had a positive family history of breast cancer). Third, the two-stage study design, although it helps to avoid false positive findings, may miss weak but true associations because our overall sample size is moderate.


Conclusions

Overall, five GWAS-identified variants were consistently validated in this Chinese population. Risk assessment models that incorporate both a genetic risk score based on these SNPs and the established risk factors for breast cancer may be useful for identifying high-risk women for targeted cancer prevention. More genetic risk variants and other risk factors should be evaluated and incorporated into risk-predicting models to improve personalized risk assessment.