Background

A previous genome-wide association study (GWAS) identified a common single nucleotide polymorphism (SNP), rs10771399 (termed the index SNP in this paper) at 12p11 to be associated with breast cancer risk in women of European descent [1]. This association, which did not vary by estrogen receptor (ER) status, was one of the most significant associations found for breast cancer risk in Breast cancer 1 (BRCA1) mutation carriers so far, and the association was predominantly found in carriers with ER-negative (ER-(-)) breast cancer [2, 3]. This association was also replicated in East Asian women [4]. The index SNP lies in an approximately 300-kb linkage disequilibrium (LD) block, containing one known breast cancer associated gene that encodes parathyroid hormone-like hormone (PTHLH). This hormone has been shown to play a role in breast tumor initiation, progression, and metastasis in animal studies [5, 6] and was found to be associated with prognosis in breast cancer patients [7]. The index SNP, however, is located in a region with no evidence of functional significance [8]. The underlying biologic mechanisms and functional variants that drive the observed association have not yet been investigated. Furthermore, it is possible that additional independent risk signals may be present in the same region, as has been observed for other susceptibility regions [911]. In order to identify additional association signals at the12p11 locus with breast cancer risk, understand the underlying mechanisms and potential causal variants responsible for the association, we conducted a large fine-scale mapping study including data from 55,540 breast cancer cases and 51,168 controls in the Breast Cancer Association Consortium (BCAC) and 15,252 BRCA1 mutation carriers in the Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA).

Methods

Study population

The BCAC included 40 studies of women of European descent (48,155 cases and 43,612 controls), nine of Asian descent (6269 cases and 6624 controls), and two of African-American descent (1116 cases and 932 controls). The CIMBA included 45 studies of women of European descent (15,252 BRCA1 mutation carriers), of whom 7797 had been diagnosed with breast cancer. Details on the study characteristics, participant characteristics and the methodology used by the BCAC and CIMBA have been published elsewhere [1214]. Ethical approval of each study was given by the local institutional review boards. The full names of the institutional review boards that approved each study were listed in the Additional file 1.

SNP selection and genotyping

All SNPs within a 700-kb “fine mapping” interval at 12p11 (chr12: 27958733-28658733, hg19) were identified from the 1000 Genomes Project (1000G) (http://browser.1000genomes.org) CEU (April 2010) [15] and Hapmap III [16] (http://hapmap.ncbi.nlm.nih.gov/). The interval included all SNPs in LD (r 2 > 0.1) with the target SNP rs197593 (r 2 = 0.95 with the index SNP rs10771399) [1]. Tagging SNPs were selected to capture the remaining SNPs in the fine-mapping region at r 2 > 0.9. After quality control, genotypes for 441 SNPs were available for analysis. To improve the coverage, imputation was performed using data from the 1000G (March 2012) as the reference and the program IMPUTE2 [17] (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html). This was done separately for women of European, East Asian, and African descent and BRCA1 mutation carriers. Using criteria of minor allele frequency (MAF) ≥2 % and an imputation quality R 2 > 0.3, genotype data were generated for a total of 1634 SNPs for studies of European women, 1360 for studies of East Asian women, 2508 for studies of African women in BCAC and 1646 for studies of BRCA1 mutation carriers in CIMBA.

Statistical analysis

For BCAC studies, unconditional logistic regression models were used to estimate allelic odds ratios (OR) and their 95 % confidence intervals (CIs) of each of the SNPs included in the study. Analyses were performed separately for each ethnic group, and adjusted for study and principal components (seven for European studies and two each for Asian and African ancestry studies) [12]. Additional adjustment for age (age at diagnosis for cases and age at interview for controls) did not change the estimates, and thus age was not adjusted for in the main analyses. Tests of heterogeneity of the ORs across studies were conducted using Cochran’s Q test. To identify independent association signals, we performed forward stepwise selection analyses with all SNPs associated with breast cancer risk at P < 0.0001 in BCAC European descendants or at P < 0.005 for East Asian descendants in the single-marker analysis. To reduce type 2 errors, we used a less stringent statistical significance threshold because of the smaller sample size for East Asian descendants than for European descendants in this study. Pairwise SNP-SNP interactions were evaluated using the likelihood ratio test for all SNPs selected from the forward stepwise regression analysis. Stratified analyses by ER status were performed, and the heterogeneity was assessed by case-only analysis. We estimated haplotype frequencies using the haplo.stats package under R with the expectation-maximum (EM) algorithm [18] and estimated the haplotype-specific ORs for women of European descent with adjustment for studies and principal components as described above. To evaluate whether the association varied by early-onset and late-onset cancer, stratified analyses by age at cancer diagnosis (≥45 or <45 years) were performed. The familial relative risk (FRR, λ) associated with independently associated variants in this locus was calculated using the method described previously [19, 20].

For CIMBA studies, the associations between genetic variants and breast cancer risk were evaluated using a 1-degree of freedom (df) per allele trend test (P-trend), by modeling the retrospective likelihood of the observed genotypes conditional on breast cancer phenotypes [21]. To allow for the non-independence among related individuals, an adjusted test statistic was used, which took into account the correlation between study participants [22]. Per-allele hazard ratio (HR) estimates were obtained by maximizing the retrospective likelihood. All analyses were stratified by country of residence. To increase the statistical power to detect independent signals in BRCA1 mutation carriers, we conducted a meta-analysis of the BCAC and CIMBA studies [23]. Because approximately 80 % of breast tumors with known ER status in BRCA1 mutation carriers were ER(-) [2], we only included the ER(-) breast cancer cases for BCAC studies. We combined the logarithm of the per-allele HR estimated in BRCA1 mutation carriers and the logarithm of the per-allele OR estimated in BCAC using a fixed-effects model. We further determined whether there is evidence for independent association signals through a serial of conditional meta-analyses. We performed a conditional analysis on the top variant identified in the meta-analysis mentioned above in each consortium, and carried out the meta-analysis on the conditional P value for each variant to identify the most significant variant after conditioning on the top variant in the whole region. We continued to perform the conditional meta-analyses until the most significant association found had a P value >0.0001.

Functional annotation

We used the Encyclopedia of DNA Elements (ENCODE) chromatin states (chromHMM) annotation, DNase I hypersensitive, transcription factor binding sites, histone modifications of epigenetic markers (H3K4Me1, H3K4Me3 and H3K27Ac) data from ENCODE [24] (http://genome.ucsc.edu/ENCODE/) to determine the likely regulatory elements. We used chromatin interaction analysis by paired end tag (ChIA-PET), genome conformation capture (Hi-C) data from ENCODE and enhancer-promoter interaction data predicted by He et al. [25] to identify putative gene targets in mammary cell lines (human mammary epithelial cells (HMEC) and Michigan Cancer Foundation-7 (MCF7)). We used maps of enhancers as defined in Corradin et al. [8] and Hnisz et al. [26] to identify the locations of potential enhancers. We obtained RNA-seq data from ENCODE, respectively, to evaluate the expression of protein-coding genes in mammary cell lines at this locus. We also used the same data in the chronic myeloid leukemia cell line (K562) as a comparison if available.

To predict the most likely functional variants, we mapped all candidates to the transcription factor binding maps generated by ENCODE [24], based on the hypothesis that causal variants alter the binding affinity of transcription factors. We prioritized variants that were located in binding sites of master transcription factors of breast cancer and disrupted binding motif of transcription factors. We also prioritized variants that were located in active promoter regions in mammary cell lines. Two publicly available tools, RegulomeDB [27] (see http://regulome.stanford.edu/) and HaploReg V3 [28] (see http://www.broadinstitute.org/mammals/haploreg/haploreg.php), were also used to evaluate those candidate functional variants.

Expression quantitative trait loci (eQTL) analysis

The eQTL analyses in tumor tissues were performed as previously described [29, 30]. Briefly, we downloaded RNA-Seq V2, DNA methylation and SNP genotype data of 1006 breast cancer tumor tissues from The Cancer Genome Atlas (TCGA) data portal [26] (see http://cancergenome.nih.gov/). We log2-transformed the RNA-Seq by expectation-maximization (RSEM) value of each gene, and performed principal component adjustment of gene expression data to remove potential batch effects. Residual linear regression analysis was used to detect eQTLs while adjusting for methylation and copy number alterations (CNA), according to the approach proposed by Li et al. [29].

The eQTL analyses in 135 tumor-adjacent normal breast tissues were performed using data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [31] as previously described [32]. Briefly, gene expression levels were measured by the Illumina HT12 v3 microarray platform. Genotyping was performed using the Affymetrix SNP 6.0 array. Imputation was performed using data from the 1000G (CEU, March 2012) as the reference. Linear regression was performed to evaluate the association between genotypes and gene expression levels using the R (http://www.r-project.org/) package Matrix eQTL [32].

Results

Association results among women of European ancestry

Of the 2075 SNPs evaluated, 833 were associated with breast cancer risk in women of European descent at P < 0.0001 (Fig. 1). Using forward stepwise selection, we identified two SNPs that were independently associated with breast cancer risk with conditional P < 0.0001, tagging two independent signals (Table 1, Fig. 1). The index SNP is located in signal 2, approximately 30 kb upstream of the PTHLH gene and was in strong LD with the lead SNP (rs805510) for this signal (r 2 = 0.92). The lead SNP in signal 1, rs7297051, is located approximately 50 kb upstream of the PTHLH gene, and was in moderate LD with the index SNP (r 2 = 0.42). The lead SNPs for signals 1 and 2 were moderately correlated (r 2 = 0.36). After adjusting for the lead SNPs in signals 1 and 2, we found evidence of the presence of a third independent association signal (lead SNP rs1871152; conditional P = 2 × 10-4, Table 1, Fig. 1). Signal 3 lies approximately 60 kb upstream of another gene, coiled-coil domain containing 91 (CCDC91). SNP rs1871152 was not correlated with the lead SNP in signal 1 or signal 2 (r 2 = 0.01 for rs7297051 and r 2 = 0.03 for rs805510). All lead SNPs for these three signals were associated with breast cancer risk at P < 5 × 10-8 in single-marker analyses (rs7297051 OR = 0.88, P = 4 × 10-28; rs805510 OR = 0.85, P = 10-25; rs1871152 OR = 0.94, P = 3 × 10-8). No apparent heterogeneity in the ORs of the identified SNPs across the 40 studies in BCAC was found (all P heterogeneity > 0.75). No statistically significant interactions between any pair of these three lead SNPs were found (all P > 0.05).

Fig. 1
figure 1

Genetic mapping and epigenetic landscape of the 12p11 locus (a). Regional association plot of the genotyped and imputed Illumina iSelect genotyping array of the Collaborative Oncological Gene-environment Study (iCOGS) genotype data. Three independent signals were identified, marked as signal 1, 2 and 3. b Functional annotations using data from the Encyclopedia of DNA Elements (ENCODE) project. From top to bottom, the epigenetic signals evaluated include histone modifications, DNase clusters, transcription factor ChIP-seq clusters, and ENCODE chromatin states (ChromHMM) in the ENCODE cell lines. The signals of different layered histone modifications from the same ENCODE cell line are shown in the same color (the detailed color scheme for each ENCODE cell line is described in the UCSC genome browser; http://genome.ucsc.edu). Red and orange in chromatin states represent active promoter and strong enhancer regions, respectively (the detailed color scheme of the chromatin states was described in the previous study [45]). All tracks were generated by the UCSC genome browser (hg 19). c Long-range chromatin interactions. From top to bottom, genome conformation capture (Hi-C), chromatin interaction analysis by paired end tag (ChIA-PET) and RNA-Seq data from K562 cell lines, Hi-C and RNA-Seq from human mammary epithelial cells (HMEC), ChIA-PET and RNA-Seq from MCF7 cell lines, gene annotations and single nucleotide polymorphism (SNP) annotations. Black lines represent interactions with the promoter region (-1500/+500) of Parathyroid hormone-like hormone (PTHLH), and gray lines represent chromatin interaction that did not involve the PTHLH promoter region. The value of the RNA-Seq analysis corresponds to the mean reads per million (RPM) value for PTHLH from 65 K562, 4 HMEC and 19 MCF7 datasets, respectively. The annotation has been obtained through the Bioconductor annotation package TxDb.Hsapiens.UCSC.hg19.knownGene. The Hi-C and ChIA-PET raw data, available in the Gene Expression Omnibus (GEO) [GSE63525.K56, GSE33664, GSE39495], were processed using the GenomicRanges package. The tracks have been generated using ggplot2 and ggbio libraries in R

Table 1 Independent association signals identified for breast cancer risk in the 12p11 locus in women of European ancestry

Using the lead SNP from each signal, rs805510, rs7297051 and rs1871152, we identified seven haplotypes with a frequency greater than 1 % (Table 2). The most common haplotype (frequency 51 %), carrying the major allele of each SNP, was used as the reference in the association analysis. The most statistically significant association was observed for the haplotype carrying the minor alleles at both signals 1 and 2 (TTA and TTG), while less pronounced yet significant associations were observed for individuals carrying the minor allele for signal 1 but not signal 2 (CTA and CTG), consistent with results for the independent association signals from the regression analyses. The evidence for signal 3 comes largely from the observation that the CCG haplotype, which carries the rare allele for signal 3 alone, was associated with reduced risk. The haplotype carrying only the minor allele in the lead SNP for signal 2 was too rare to evaluate. Stratified analyses revealed no evidence of any apparent heterogeneity in the association of these haplotypes with breast cancer risk by age at breast cancer diagnosis (age at diagnosis <45 vs ≥45 years).

Table 2 Associations between common haplotypes derived using lead single nucleotide polymorphisms and breast cancer risk in women of European ancestry

The associations of the three SNPs did not vary appreciably by ER status (Additional file 2: Table S3). In an attempt to identify potential independent association signals that might have been missed in the analysis described above that included all breast cancer cases (Table 1), we conducted forward stepwise regression analyses separately for ER(+) and ER(-) cases. For the ER(+) breast cancer, the lead SNPs for signals 1 and 2 were identical to those found for all cases combined. For signal 3, however, a different lead SNP (rs7959641) was identified, which was moderately correlated with rs1871152, the lead SNP identified in the overall analysis (r 2 = 0.28) (Additional file 2: Table S3). The lead SNP for signal 3 in ER(-) cases is different from the SNP identified in all cases combined, but these two SNPs were highly correlated (r 2 = 0.86) (Additional file 2: Table S3).

Association results for BRCA1 mutation carriers of European descent

Of the 2087 SNPs evaluated in the CIMBA among BRCA1 mutation carriers of European descent, 234 were associated with breast cancer risk at P < 0.0001. The most significant association was found with rs113824616 (per-C allele HR 0.73, 95 % CI 0.64–0.82, P =1 × 10-7; Table 3). The three lead SNPs identified in BCAC had similar associations, although the association was statistically significant at P < 0.05 in conditional analyses only for the lead SNPs of signals 1 and 3 (rs7297051 and rs1871152, respectively) (Additional file 3: Table S4). Meta-analysis of data from BCAC ER(-) cases and CIMBA showed that rs113824616 was associated with breast cancer risk after adjusting for rs7297051 (conditional P = 7 × 10-5, r 2 with rs10773199 = 0.40; Table 3). No additional independent signals were identified. We defined the association signal represented by SNP rs113824616 as signal 4.

Table 3 Independent association signals in the meta-analysis of BCAC (ER-) and BRCA1 mutation carriers from CIMBA

Association results among women of East Asian ancestry

Of the 1801 SNPs evaluated, 118 were associated with breast cancer risk in women of East Asian ancestry (P < 0.005) (Fig. 1). The four lead SNPs in European descendants had a similar association with breast cancer risk in East Asian women, although the association was statistically significant at P < 0.005 only for the lead SNPs of signals 1 and 2 (rs7297051 and rs805510, respectively) (Additional file 4: Table S5). The MAFs for the lead SNPs of signals 1, 2 and 4 were similar to those in Europeans, but the MAF for signal 3 (rs1871152) was markedly lower in East Asians. In conditional regression analyses, only the association with signal 1 was independently statistically significant, perhaps due to the small sample size. The per-allele ORs did not differ materially from those in Europeans in the conditional analysis (data not shown).

The most significant association in Asians was with SNP rs2737455 (MAF = 0.17, per-major (T) allele OR = 1.16, 95 % CI 1.09–1.25, P = 10-5). Among women of East Asian descent, this SNP was in high LD with the two lead SNPs for signals 1 and 2 identified in populations of European ancestry, rs7297051 (r 2 = 0.67) and rs805510 (r 2 = 0.84). This variant was also associated with breast cancer in women of European descent (per T-allele OR = 1.17, 95 % CI 1.14–1.21, P = 5 × 10-25). No additional independent signal was found on stepwise regression.

Association results for women of African ancestry

Of the 2949 SNPs evaluated in African descendants, 116 were statistically significantly associated with breast cancer risk at P < 0.05. The most significant association was with rs10843021 (MAF = 0.38, per-C allele OR = 1.22, 95 % CI 1.08–1.39, P = 0.001), which is located 60 kb downstream of the gene PTHLH. This SNP is not in LD with any of the lead SNPs identified for women of European or East Asian descent (all r 2 < 0.02). There was some evidence of association of this SNP with breast cancer risk in women of European descent (P = 8 × 10-5) but not in women of Asian descent (P = 0.23). None of the lead SNPs identified for women of European or East Asian descent were associated with breast cancer risk at P < 0.05 in African descendants, although the directions of the associations were consistent and the effect sizes did not differ significantly (Additional file 4: Table S5). The MAF of the index SNP rs10771399 (MAF = 0.04) was much lower in African descendants than that in Asian and European descendants (P < 0.001).

Functional annotation

To identify putative causal variants, we used data from European descendants to exclude any variants that had a likelihood ratio <1/100 relative to the most significantly associated SNP in each signal (33). Based on this threshold, four variants in signal 1, 74 variants in signal 2, 376 variants in signal 3, and 2 variants in signal 4 were retained as candidates for causal variants (Fig. 1a and Additional file 5: Table S2).

Using data from ENCODE, we found that the histone markers (H3K27Ac and H3K4Me3) were enriched in each signal (Fig. 1b). Using both ChIA-PET chromatin interaction data and Hi-C data from ENCODE, we identified multiple and dense chromosomal interactions of variants at signals 1 and 2 with the promoter region of PTHLH in MCF7 cells (Fig. 1c). There was some evidence of interaction of variants at signal 3 with the promoter of PTHLH (Fig. 1c).

Using maps of predicted enhancer regions produced by Hnisz et al. [26] and Corradin et al. [8], we found that multiple candidate variants were located in enhancer regions in mammary cell lines (Fig. 2). Using predicted enhancer-promoter interaction data in HMEC and MCF7 cell lines generated by He et al. [25] (Fig. 2), we identified two interacting genes of these enhancers, CCDC91 and PTHLH.

Fig. 2
figure 2

Enhancer-promoter interaction data at 12p11. From top to bottom, enhancer locations as defined by Corradin et al. [8] and Hnisz et al. [26] are shown in human mammary epithelial cells (HMEC) cell lines. Enhancer-promoter (EP)-predicted interactions as defined by He et al. [25] are shown in K562, MCF7 and HMEC cells. Gene annotations and single nucleotide polymorphism (SNP) annotations. Orange EP interactions are those with the coiled-coil domain containing 91 (CCDC91) gene; blue EP are those with Parathyroid hormone-like hormone (PTHLH)

We next overlaid these candidate variants to the transcription factor binding site maps generated from ENCODE. We identified rs812020 within signal 1, rs788463 and rs10843066 within signal 2, and rs10843110, rs56318627 and rs11049453 within signal 3 to be the most likely functional variants (Fig. 3a and b; Additional file 6: Table S6). These SNPs were within or close to binding sites of multiple breast cancer-related transcription factors. Furthermore, these SNPs were predicted to disrupt the binding motifs recognized by transcription factors (Fig. 3a and b), suggesting a regulatory role. For example, in signal 1, rs812020 (per C-allele OR = 0.89, 95 % CI 0.87–0.91, P = 2 × 10-27) was annotated to a region bound by multiple key transcription factors for breast cancer, including GATA3 and FOXA1 (Fig. 3a and b). This SNP is predicted to disrupt the binding motif recognized by the transcription factor E2F3 and may change its binding affinity [32]. E2F3 has been found to increase centrosome amplification in mammary epithelial cells and regulate breast tumor development and metastasis [33]. In signal 3, SNP rs11049453 (per G-allele OR = 1.06, 95 % CI 1.04–1.08, P = 9 × 10-8) was in the binding site of transcription factors P300 and CTCF in MCF7 cell lines [31] (Fig. 3). It was also predicted to disrupt the binding motif of paired box (PAX) [33], which has been associated with the progression of breast cancer [34, 35]. No functional significance of the candidate variants in signal 4 was found.

Fig. 3
figure 3

Putative functional variants and association of rs11049453 with gene expression in breast tumor tissues. a Epigenetic signals of five potential functional variants. From top to bottom, lanes showing that those variants mapped to transcription factors predicted binding motifs, DNase I hypersensitivity sites and transcription factor ChIP-Seq binding peaks in the Encyclopedia of DNA Elements (ENCODE) cell lines and MCF7. The corresponding location of each variant is indicated by a dashed line. b Epigenetic landscape at the 12p11 locus for breast cancer risk. From top to bottom, RefSeq genes (PTHLH and CCDC91), layered H3K4Me1, H3K4Me3 and H3K27Ac histone modifications and annotation using chromatin states on the ENCODE cell lines. The signals of different layered histone modifications from the same ENCODE cell line are shown in the same color (the detailed color scheme for each ENCODE cell line is described in the UCSC genome browser). Red and orange in the chromatin states represent the active promoter and strong enhancer regions, respectively (the detailed color scheme of the chromatin states was described in the previous study [45]). c rs11049453 and the expression of coiled-coil domain containing 91 (CCDC91) and parathyroid hormone-like hormone (PTHLH). The association of the genotypes and the expression level of each gene was evaluated by residual linear regression [29]. bp base pairs, C/EBP CCAAT/enhancer-binding protein, E2F3 E2F transcription factor 3, HNF1B HNF1 homeobox B, PPARG peroxisome proliferator-activated receptor gamma, PAX paired box

To further explore the potential target genes, we performed eQTL analysis in both breast tumor and normal tissues. Using data on tumor tissues from TCGA, we found that rs10843110, rs56318627 and rs11049453 within signal 3 were associated with the expression of PTHLH at P < 0.05 and CCDC91 at P < 0.10 (Additional file 7: Table S7). Among these highly correlated SNPs, the most significant association was found for rs11049453: the risk allele G of rs11049453 was associated with increased expression of PTHLH (P = 0.01) and decreased expression of CCDC91 (P = 0.03, Fig. 3c). However, we did not find any statistically significant association for these six variants using data from adjacent normal breast tissues from METABRIC (all P > 0.05).

Discussion

Through a fine-scale mapping study at 12p11, we identified four independent association signals for breast cancer risk in women of European descent. It is of interest that the fourth signal was identified only through the meta-analysis of ER(-) breast cancer and BRCA1 mutation carriers, suggesting that this signal may be more specific to ER(-) cancers. The associations of these signals were in general consistent in women of European and East Asian descent.

Multiple genetic studies have confirmed that a locus at 12p11 is associated with breast cancer risk [2, 4]. However, it remained unknown whether the observed association was due to a single or multiple causal variants at this locus. In this study, we demonstrated that there were at least four independent signals at 12p11, three 100 kb upstream of the gene PTHLH (signals 1, 2 and 4), and one 60 kbp upstream from the gene CCDC91 (signal 3), suggesting that there may be multiple causal variants and multiple underlying mechanisms for the observed association at the 12p11 locus. Furthermore, we identified multiple candidate causal variants at each signal: four in signal 1, 74 in signal 2, 376 in signal 3 and 2 in signal 4. Using functional genomic data from ENCODE, we observed that multiple candidate functional variants located in enhancer regions, and identified PTHLH and CCDC91 as the likely target genes for these enhancers. Using data on transcription factor binding, we identified six putative functional variants with strong evidence of regulation of gene expression. Among these six variants, we observed that the rs11049453 was significantly associated with the expression of PTHLH and CCDC91. However, we could not exclude the possibility that there were other functional variants and other target genes at this locus.

PTHLH encodes the protein PTHrP, which has intracrine, autocrine or paracrine action in most normal tissues; its downstream effects include promotion of growth and anti-apoptotic effects [36]. It is a cause of humoral hypercalcemia of malignancy [37], and is expressed in more than two thirds of breast tumor tissue samples [7, 38]. It has been shown to affect the regulation of tumor-related genes, and is thought to affect the proliferation and migration of breast cancer cells [39]. PTHrP plays an important role in the formation of osteolytic bone metastases in breast cancer through its action on osteoblasts to increase RANK-ligand and promote osteoclast formation [40]. It has been proposed that PTHrP may promote breast cancer tumorigenesis; however, previous studies had conflicting results [41]. Less is known about the function of the CCDC91 gene, which is located approximately 232 kb from the PTHLH gene. CCDC91 encodes a protein known as p56 accessory protein or GGA binding partner, which binds proteins, and facilitates the transportation of secreted proteins through the trans-Golgi network [42]. CCDC91 is also expressed in a variety of cancer cell lines including MCF7 [43]. Using cBioPortal (http://www.cbioportal.org/public-portal/), we found that both PTHLH and CCDC91 genes were altered in breast tumors and that there was a statistically significant co-occurrence of alternations (including mutations and copy number aberrations) in both genes (P for tendency towards co-occurrence <0.001). Together with our findings, these results suggest that there might be correlation between these two genes and that alterations in both genes might contribute concurrently to breast cancer susceptibility. Future studies evaluating both genes and their interrelationship are needed to elucidate the underlying mechanism.

Functional annotation data suggested that the functional variants underlying the observed association, mainly those in signal 2, are located in enhancer regions involved in the transcriptional regulation of PTHLH and CCDC91 in the MCF7 and HMEC cells. Moreover, we did not find similar functional evidence for the same region in the K562 cells, which suggests that the regulatory effects might be context-specific. We identified multiple putative functional variants associated with transcriptional factors that have been found to be important for breast cancer, including GATA3, FOXA1, C/EBP, P300 and STAT3, and overlapped with binding motifs of transcriptional factors, including E2F3, C/EBP, HNF1B, PPARG and PAX. Despite strong evidence for altering the binding of transcription factor and regulating gene transcription, we found only one eQTL among these putative functional variants, which lies in signal 3, suggesting that the underlying functional variants might exert a more subtle regulatory effect on gene expressions than expected. Although we found strong genetic and epigenetic evidence for potential functional variants in signals 1 and 2, we did not observe statistically significant association between these variants and the expression of PTHLH or CCDC91, or any other protein-coding genes within a flanking region of 500 kb for each variant. It is possible that the causal variants in these two signals might be involved in regulating non-coding genes or more distant genes. Future functional studies that comprehensively investigate the regulatory elements at these loci and their target genes will be needed to elucidate the molecular mechanisms.

The top risk variants identified in women of Asian and European ancestry were not associated with breast cancer risk in African descendants. It is possible that these top risk variants might not be correlated with the causal variants in African descendants due to their different LD structures. For example, the effect allele frequencies (EAFs) for the index SNP rs10773199 and the top risk variant rs805510 in African descendants were 0.04 and 0.45, respectively, and the EAFs for these two SNPs were similar in European descendants (EAF = 0.12 for both SNPs) and in East Asian descendants (EAF = 0.17 and 0.15, respectively), suggesting a distinct LD structure at this locus in African descendants. Similarly, the EAF for the SNP rs113824616 in African descendants (EAF = 0.01) was substantially lower than that in European descendants (EAF = 0.05). In addition, the sample size for African descendants included in this study was small and the power to detect the association of these variants was low. A previous fine-mapping study in African Americans with a larger sample size (3016 cases/2745 controls) than our study (1116 cases/932 controls) showed that rs10773199 is marginally associated with breast cancer risk (OR = 0.84, P = 0.089) [44], suggesting that there might be an association of the 12p11 locus with breast cancer risk in African descendants. Studies with a large sample size are needed to elucidate the association between this locus and breast cancer risk in African descendants.

To date this is the largest and most comprehensive fine-mapping study of the 12p11 region in relation to breast cancer risk. By using densely genotyped data from a very large number of cases and controls of European descent, we derived highly reliable estimates of the association between each common SNP and breast cancer risk in women of European descent. The sample size was relatively small for East Asian and African descendants, and associations with risk of overall breast cancer and molecular subtypes in these populations should be further evaluated in future larger studies.

Conclusions

Through fine-mapping of the 12p11 locus, we identified multiple independent association signals for breast cancer risk. We estimate that the four independent signals identified by this study explain approximately 1 % of the familial relative risk of breast cancer in populations of European ancestry, more than doubling the risk explained by the index SNP (0.4 %). Bioinformatics analyses revealed that these signals are mapped to enhancer regions that interact with the gene PTHLH and CCDC91. We identified putative functional variants that might contribute to the observed association. Our findings also suggest a possible interrelation between PTHLH and CCDC91 in the etiology of breast cancer. Our study has expanded the knowledge of genetic risk associated with breast cancer at the 12p11 locus and provided clues for future functional characterization.

Abbreviations

BCAC, Breast Cancer Association Consortium; BRCA1, Breast cancer 1; C/EBP, CCAAT/enhancer-binding protein; CCDC91, Coiled-coil domain containing 91; ChIA-PET, chromatin interaction analysis by paired end tag; CI, confidence interval; CIMBA, Consortium of Investigators of Modifiers of BRCA1/2; CNA, copy number alterations; E2F3, E2F transcription factor 3; EAF, effect allele frequency; EM, expectation-maximum; ENCODE, Encyclopedia of DNA Elements; eQTL, expression quantitative trait loci; ER, estrogen receptor; FOXA1, forkhead box A1; GATA3, trans-acting T-cell-specific transcription factor GATA-3; GWAS, genome-wide association study; Hi-C, genome conformation capture; HMEC, human mammary epithelial cells; HNF1B, HNF1 homeobox B; HR, hazard ratio; iCOGS, Illumina iSelect genotyping array of the Collaborative Oncological Gene-environment Study; IMPUTEv2, IMPUTE version 2; LD, linkage disequilibrium; MAF, minor allele frequency; MCF7, Michigan Cancer Foundation-7; METABRIC, Molecular Taxonomy of Breast Cancer International Consortium; OR, odds ratio; PAX, paired box; PPARG, peroxisome proliferator-activated receptor gamma; PTHLH, parathyroid hormone-like hormone; QC, quality control; SNP, single nucleotide polymorphism; STAT3, signal transducer and activator of transcription 3; TCGA, The Cancer Genome Atlas