Background

Breast cancer is the most common malignancy among women in the United States, with more than 230,000 new diagnoses expected in 2015 [1]. Breast cancer has a heritable component [2], and researchers in recent genome-wide association studies (GWASs) have identified more than 90 [320] genetic regions associated with breast cancer risk. However, the underlying causal mechanism in these regions is not fully known, and it is likely that the index GWAS signal originates from one or many as yet unidentified causal variants. Because GWASs rely on linkage disequilibrium (LD) or correlation between neighboring common genetic variants, they cannot be used to localize causal variants with precision. Instead, one genetic variant is used to tag segments of the genome over which LD is maintained (“LD blocks”), which can contain multiple genes. Localizing causal variation is further complicated by the possibility of multiple causal variants within one tagged segment of the genome.

Moreover, GWAS-identified regions that contain common risk variants may also contain rare variations associated with disease risk. For example, rare susceptibility variants for ulcerative colitis [21] and inflammatory bowel disease [22] have been identified in GWAS regions. In principle, GWAS-identified risk single-nucleotide polymorphisms (SNPs) may be proxies for multiple rare risk alleles (“synthetic association”) [23]. Indeed, dense genotyping of the HOXB region revealed a cluster of common, low-penetrance prostate cancer risk alleles that appear to tag the rare, moderate-penetrance coding variant rs138213197 [24]. However, in practice, the LD between common SNPs on GWAS platforms and rare variants is low. This implies that direct measurement of rare variants is needed to identify rare-variant associations at GWAS-identified loci. Fine-mapping of GWAS-identified regions with the aim of identifying and prioritizing causal variants requires not only large sample sizes but also a comprehensive capture of the genetic variation, with the latter often not achieved through standard GWAS arrays. Sequencing is an attractive approach, but until recently it has been prohibitively expensive to do on a large-scale basis. Early fine-mapping studies sequenced a small number of cases and then genotyped detected SNPs in a larger population [22, 25]. Like GWAS, however, this approach will likely miss rare variants because of the low number of subjects initially sequenced.

A limitation in previous breast cancer studies is the lack of well-powered studies across multiple ancestral populations. Although breast cancer GWASs have been conducted in populations of Asian [5, 11, 12, 19, 20], African [26], and Hispanic [27] ancestry, the vast majority of studies have been conducted in European [3, 4, 68, 10, 13, 14, 1618] ancestry populations, and only a few GWASs have been conducted across ethnicities [9, 15]. LD blocks differ by ancestry, which limits the consistency of some GWAS findings across populations, and studies with subjects of a single ethnicity may miss risk alleles that are observed at higher frequencies in other populations [28]. Indeed, multiethnic studies of genetic susceptibility regions discovered in a specific ethnicity often identify different “top” variants across ethnicities [2933]; therefore, multiethnic studies have been proposed to aid fine-mapping of causal variants [28]. In this study, we attempted to overcome many of the issues related to fine-mapping of GWAS regions by using next-generation sequencing to characterize 12 breast cancer susceptibility regions in a multiethnic sample of 2288 breast cancer cases and 2323 controls.

Methods

Study subjects

The Nurses’ Health Study (NHS) was initiated in 1976, when 121,700 U.S. registered nurses aged 30 to 55 years returned an initial questionnaire. The NHS breast cancer case-control study is nested within a subcohort of 32,826 women who donated blood during 1989 and 1990 and were followed until 2004 for incident disease [34, 35]. In 1989, 116,430 additional U.S. registered nurses returned an initial questionnaire (Nurses’ Health Study II [NHSII]). The NHSII breast cancer case-control study is nested within a subcohort of 29,611 women who donated blood during 1996–1999 and were followed until 2005 [36]. Medical records were used to confirm the diagnoses in women who reported a diagnosis of breast cancer on the biennial questionnaires for both NHS and NHSII. Control subjects were matched to cases based on age, menopausal status, recent hormone replacement therapy, and blood draw-specific variables (such as date and time of day). For this study, we included a total of 771 cases and 789 controls from the NHS and NHSII who have previously been genotyped as a part of a GWAS [10] and had DNA available (Additional file 1: Table S1).

The Multiethnic Cohort (MEC) is a population-based prospective cohort study (n = 215,251) that was initiated between 1993 and 1996 and includes subjects from various ethnic groups: African Americans and Latinos primarily from California (Greater Los Angeles area), Native Hawaiians, Japanese Americans, and European Americans primarily from Hawaii [37]. State driver’s license files were the primary sources used to identify study subjects in Hawaii and California. Additionally, in Hawaii, state voter registration files were used, and in California, Health Care Financing Administration files were used to identify additional African American study subjects. In the cohort, incident cancer cases are identified annually through cohort linkage to population-based cancer Surveillance, Epidemiology, and End Results registries in Hawaii and Los Angeles County as well as to the statewide California Cancer Registry. Blood sample collection in the MEC began in 1994 and targeted incident breast cancer cases and a random sample of study participants to serve as controls for genetic analyses. Subjects are frequency-matched on age at blood draw and on ethnicity. For this study, we included subjects who had already been genotyped as part of a GWAS [15] and had DNA available: 468 cases and 469 controls of African American ancestry, 452 cases and 458 controls of Latino ancestry, and 622 cases and 638 controls of Japanese ancestry (Additional file 1: Table S1).

Sequencing

We selected and sequenced 12 regions because of their association with breast cancer (Additional file 2: Table S2). These regions were 2q35 (rs13387042), TERT (rs10069690), MAP3K1 (rs889312), ESR1 (rs2046210), 8q24 (rs1562430), ZNF365 (rs10995190), ZMIZ1 (rs704010), FGFR2 (rs2981579), 11q13 (rs614367), RAD51B (rs99737), TOX3 (rs3803662), and 19p13 (rs8170). In addition, we sequenced the TERC region on chromosome 3 because of its involvement in telomere length. Initial quality control (QC) was conducted on all 13 regions, but we present results for only the 12 regions associated with breast cancer here. Region boundaries were defined by nearest recombination hot spot downstream and upstream from the original GWAS signal as identified using the HapMap CEU (Utah residents with ancestry from northern and western Europe), YRI (Yoruba in Ibadan, Nigeria), JPT (Japanese in Tokyo, Japan), and CHB (Han Chinese in Beijing, China) populations. We set out to hybrid-capture and sequence a total of 5500 kb (Additional file 2: Table S2). Sequencing was conducted at The Broad Institute using an Illumina HiSeq sequencing system (Illumina, San Diego, CA, USA). Sequencing was performed using a capture method that uses biotinylated RNA “baits” to fish targets out of a “pond” of DNA fragments [38]. Agilent eArray software (Agilent Technologies, Santa Clara, CA, USA) was used to design the baits using 2× tiling frequency and a centered layout strategy, avoiding standard repeat masked regions, and allowing a maximum of 20-bp overlap with repeat masked regions. A complex pool of ultralong 200-mer oligonucleotides (“baits”) consisting of a target-specific 170-mer sequence flanked by 15 bases of a universal primer sequence on each side are synthesized in parallel on an Agilent microarray and then cleaved from the array. We then used in vitro transcription to generate a single-stranded RNA hybridization bait for fishing targets of interest out of a “pond” of randomly sheared, adaptor-ligated, and polymerase chain reaction-amplified DNA. RNA bait-DNA hybrids are “fished” out of the complex mixture by incubation with streptavidin-labeled magnetic beads and captured onto a strong magnet. After the beads are washed, the RNA bait is digested so that the only remaining nucleotide is the targeted DNA of interest. A few cycles of DNA amplification are performed at the end of the capture, and the targeted sample is then loaded onto the sequencing instrument. This method allows preparation of large amounts of bait from a single oligonucleotide array synthesis that can be tested for quality, stored in aliquots, and used repeatedly over the course of a large-scale targeted sequencing project. Within the nonrepetitive regions, we could design baits to cover 82.8 % of the sequence.

Alignment and genotype calling

We used Burrows-Wheeler Aligner (BWA) software to align reads to the genome [39]. Genotype calling was done using GATK software with default standard filters [40, 41]. GATK takes the raw BAM files and does initial checking by correcting for possible SNP artifacts due to local realignment around indels and mark reads that were duplicately sequenced. Owing to the size of the dataset, it was not practically feasible to recalibrate the base quality scores that are provided by the sequencing machine. Therefore, we used the following filtering for SNP calling: QD <2.0, MQ <40.0, FS >60, HaplotypeScore >13.0, MQrankSum <12.5, and ReadPosRankSum <8.0. We used QD <2.0, FS >200, Read PosRankSum <20.0, and InbreedingCoeff <0.8 for indel calling. Variant calling was made in 47 different batches (about 100 samples in each batch). We randomly assigned subjects to batches after conditioning on ethnicity and case-control status to ensure full representation in each batch. All variant calls with a quality score <30 were omitted. To account for variants that were seen in only one or a few batches, we recalled all individuals in batches where the variant was not seen from missing to reference homozygous.

Genotype and sample filtering

We initially observed 158,265 single-nucleotide variants (SNVs). We removed SNVs where >10 % of the samples had no reads or when the total of reads across all samples was <20,000. We then set individual genotypes to missing if the number of reads was <5 or the quality score was <10. Finally, we removed SNVs with >10 % missing or due to evidence of departure from Hardy-Weinberg equilibrium (p < 10−6) in any ancestry group. We excluded 8 samples that were unexpected pairwise duplicates, 43 samples that showed <90 % concordance with GWAS data (indicating sample mixup), 16 samples that had a call rate <90 % (Fig. 1d), 36 samples for which we did not have GWAS data, and 5 samples showing unexpected non-European ancestry. After applying these filters, there were 138,792 SNVs left for analysis.

Fig. 1
figure 1

Number of single-nucleotide variants (SNVs) that passed quality control within and across ethnicities for 4611 women. The majority of variants were population-specific, and only 9420 (6.8 %) of SNVs were shared among all ethnicities

Baited vs. nonbaited regions

We were not able to design baits for 48.9 % of the original targeted sequences. Before QC, 32 % of the SNVs fell within nonbaited regions. Across all samples, 61.6 % of nonbaited SNVs had an average read depth >10× compared with 99.0 % of the baited SNVs, and 43.2 % of nonbaited SNVs had an average read depth >20× compared with 96.9 % of baited SNVs. After QC, 23.5 % SNVs fell in the nonbaited regions. Of those, 94.2 % had an average read depth >10× compared with 99.7 % of the baited samples, and 67.4 % of nonbaited SNVs had an average read depth >20× compared with 97.4 % of baited SNVs. Of the SNVs that were removed in the QC, 86.5 % were in nonbaited regions (Additional file 3: Figure S1).

Coding variant annotation

Annotation of variants or assignment of a variant to a gene was implemented using GEMINI (GEnome MINIng) [42]. GEMINI is a flexible, UNIX-compatible framework for exploring genome variation that pulls information from SnpEff [43] as mapped by build 37 of the human genome. We further annotated each variant with Polymorphism Phenotyping version 2 (PolyPhen-2) scores [44], which predict the functional impact of individual variants using Variant Effect Predictor (VEP) [45]. Of the 2085 nonsynonymous coding variants that passed QC filters in our study population, we obtained PolyPhen-2 scores for 1975 (95 %). Of those, we obtained exclusive predictions for 1852 variants after excluding variants with none or unknown predictions. Lower scores correspond to less damaging qualitative values, and scores ranged from 1 to 617 (Additional file 4: Table S3). Scored variants were assigned to at least one of PolyPhen-2 qualitative prediction: benign; possibly damaging; probably damaging; or, in the case of insufficient data, unknown or none. For the rare-variant tests, we further refined the set of variants in each gene and included only nonsynonymous variants that were predicted by PolyPhen-2 to be possibly damaging or probably damaging.

Statistical methods

We used logistic regression to assess the association between each common SNV and breast cancer risk by population as well as across populations. We conducted ethnicity-specific analysis adjusting for the top three principal components within each ethnicity to adjust for potential population stratification [46]. We combined results across ethnicities using fixed-effect meta-analysis. We adjusted for multiple testing using a modified Bonferroni correction, which allows for dependence between tests within each GWAS region by calculating a region-specific effective number of tests [47]. To assess whether regions contained multiple statistically independent risk alleles, we reran the association analysis, conditioning on the top SNP/index SNP in each region.

We also conducted an approximate Bayesian analysis to estimate the posterior probability that a given SNP is a causal variant, assuming there is only one causal SNP in the region. We estimated the posterior probability using the ratio of the likelihood from the logistic regression for a particular SNP to the sum of the likelihoods for individual SNPs in the region. The highest posterior density set is then defined as the smallest set of SNPs such that the total posterior density (summed over all SNPs in the set) is >90 %. All analyses were performed in PLINK ([48], http://pngu.mgh.harvard.edu/~purcell/plink/), R [49], and METAL [50]. We conducted additional analysis using a novel fine-mapping framework (PAINTOR) [51] that integrates external functional annotation with genetic data for prioritization of causal variants. PAINTOR jointly models multiple causal variants from all included loci simultaneously, increasing localization accuracy. We included two sets of annotations in our analysis, coding variants and variants located in DNase I hypersensitive sites (DHSs) identified in the breast tissue cell lines MCF-7, HMEC, and HMF. We ran two sets of analyses, the first assuming one causal SNP per region and the other assuming two causal SNPs per region.

We used two gene-based rare-variant tests in each ethnicity: a burden test and a sequence kernel association test (SKAT) [52]. Each association test was performed separately by ethnicity and adjusted for the first three ethnicity-specific principal components. To combine evidence across ethnicities, we applied two meta-analytic techniques: inverse-variance (BURDEN)-weighted fixed-effect meta-analysis [53] and meta-analysis of SKAT assuming the effect of each variant is homogeneous, regardless of ethnicity (Hom-Meta-SKAT) [54]. Each technique uses a distinct approach: a mean-based approach for fixed-effect meta-analysis and a variance-based approach for Hom-Meta-SKAT.

Results

We sequenced 937 women of African American ancestry, 1256 women of Japanese American ancestry, 907 women of Hispanic ancestry, and 1511 women of European ancestry (Additional file 1: Table S1). Subjects were participants in NHS [34], NHSII [36], and MEC [37]. In total, we sequenced 2288 breast cancer cases and 2323 controls. We were not able to capture a total of 2740 kb (49.8 %) originally targeted, primarily because of repetitive sequence content. The median proportion of captured regions with coverage >20× was higher than 93 % across all regions (range 93.8–99.9 %).

After quality control of the 12 regions (see the Methods section above), we obtained genotype data on 137,530 SNVs. Of those, 34,532 (25.11 %) were located in intergenic regions, 62,427 (45.39 %) were intronic, 1306 (0.95 %) were synonymous, and 1983 (1.44 %) were nonsynonymous. The rest (27 %) were located upstream/downstream genes, 3′ and 5′ untranslated regions, and in coding regions (e.g., nonsense variants) (Additional file 5: Table S4). On average, each region contained 6.3 genes (range 1–26), and the median number of SNVs by region was 8401 (range 1833–23,757). We observed an abundance of rare SNVs, with 119,980 SNVs (87.2 %) having a minor allele frequency (MAF) <0.005; of these, 64,747 SNVs (47.1 %) were private mutations (54 variants were homozygous in one carrier). The number of polymorphic variants ranged from 36,799 in Japanese Americans to 64,632 in African Americans. Most variants were population-specific: 70 % of all variants were observed in only one ethnicity (Fig. 1), emphasizing the genetic diversity across populations. In contrast, a total of 9420 (6.8 %) SNVs were shared across all ethnicities. In general, Japanese Americans had the largest proportion of SNVs not shared with others (61 %), whereas Latinas had the smallest proportion of population-specific SNVs (35 %).

The majority of observed SNVs were novel: Only 27.3 % were observed in the 1000 Genomes Project [55] (Additional file 6: Figure S2, Additional file 7: Figure S3, and Additional file 8: Figure S4). Due in part to the low read depth in uncaptured repetitive sequences, 42,105 (47.4 %) of the SNVs present in the 1000 Genomes Project for these regions were not observed in the targeted sequencing data (Additional file 9: Figure S5). However, we observed 26,205 (73.4 %) of the 35,714 1000 Genomes Project SNVs that were located in captured regions in our targeted sequencing data, suggesting that the majority of 1000 Genomes Project SNVs that were not observed in our targeted sequencing data were located within regions not captured by our sequencing technology.

We first conducted individual SNP analysis of common variants in at least one ancestry (MAF >0.005, n = 27,380). After Bonferroni correction, we did not observe any significant associations in ethnicity-specific analyses (data not shown) or across all ethnicities (Additional file 10: Table S5). Given the strong correlation between SNPs in these regions, we also applied a more relaxed p value threshold, adjusting for number of effective tests [47]. Using this approach, we observed four SNPs that remained statistically significant after correcting for multiple testing (p < 4.59 × 10−6), all in the 11q13 region (rs61041893, rs7123796, rs597587, rs644376). These SNPs are all within 13 kb of each other, and SNPs rs7123796, rs597587, and rs644376 are all in strong LD with each other (r 2 > 0.74), whereas rs61441893 show only moderate correlation with the other SNPs (r 2 = 0.31–0.39) with the others. Interestingly, these SNPs are not correlated with the GWAS index SNP rs614367 (r 2 < 0.01) and approximately 40 kb away from the nearest gene (CCND1). The strongest association was observed for rs61041893 (OR 0.63, 95 % CI 0.52–0.76, p = 2.15 × 10−6). Of note, this SNP was evaluated only in African Americans and Hispanics because it was not observed in Japanese Americans and was very rare (MAF = 0.002) in European Americans. Of the 12 GWAS index SNPs previously reported in the literature, we replicated 6 across all populations using an unadjusted p value threshold of 0.05 (Table 1).

Table 1 Breast cancer association results

We used SnpEff [43] to annotate and predict SNV effects. We observed 81 SNVs that had a predicted disruptive impact (e.g., splice site donators/acceptors, loss of start/stop codon, gained stop codon, frame shift variants), 1983 nonsynonymous coding SNVs, 1374 synonymous coding SNVs, and 134,092 noncoding SNVs (Fig. 2, Additional file 5: Table S4). Two SNVs with predicted disruptive function had a MAF >0.005 in the full dataset. SNP rs79619171 in the FGFR2 region is a splice site donor in the TACC2 gene; it was observed only in the Japanese American samples (MAF 0.08) and was moderately associated with breast cancer (OR 1.47, 95 % CI 1.09–1.98, p = 0.01). SNP rs55670604 is a splice site acceptor in the RAD51L1 gene; it was observed in Latinas (MAF 0.03), African Americans (MAF 0.02), and European Americans (MAF 0.08), but it was not associated with breast cancer in any population (data not shown) or across all ethnicities (OR 1.00, p = 0.99).

Fig. 2
figure 2

Minor allele frequency distribution of observed single-nucleotide variants (SNVs). SNVs are categorized by functional impact (high, moderate, low, or modifier) as predicted by the SnpEff algorithm [43]. High (n = 81) is defined as disruptive impact SNVs, moderate (n = 1983) as nonsynonymous coding SNVs, low (n = 1374) as synonymous coding SNVs, and modifier (n = 134,092) as noncoding SNVs

To identify multiple independent signals, we reran the association analysis, conditioning on the top SNP in each region. We observed no new significant associations after correcting for multiple testing (all p > 10−4) (Table 1). We also assessed if there was evidence of additional signals beyond the index SNP in these regions by conditioning on the original GWAS index SNP (Table 1). For the ZNF365 and 11q13 regions, we observed evidence of association signals beyond the index signal (p < 10−4), in agreement with previous studies [5, 56, 57]. We also conducted an approximate Bayesian analysis [58] to estimate the posterior probability that a given SNP is a causal variant, assuming there is only one causal SNP in the region (Additional file 11: Figure S6). The highest posterior probability set (PPS) is then defined as the smallest set of SNPs such that the total posterior density (summed over all SNPs in the set) is >90 % and can help guide the selection of candidate SNPs for further downstream functional and bioinformatic analyses. The number of SNPs included in the highest posterior density set varied widely, between 11 (11q13) and 2954 (8q24), in analysis across all ethnicities (Table 2). For two regions (2q35 and 11q13), <5 % of the original SNVs were needed to obtain a 90 % PPS. Population-specific analysis showed similar results (data not shown). We conducted additional fine-mapping analyses using a novel fine-mapping framework (PAINTOR) [51] that integrates external functional annotation with genetic data and calculates SNV-specific posterior probabilities for causality. We included two sets of annotations in our analysis: coding variants and variants located in DHSs identified in the breast tissue cell lines MCF-7, HMEC, and HMF [59]. We limited our analysis to SNPs with an allele frequency >0.01. In total, we included 13,373 SNPs, and of those, 104 (0.8 %) were coding, 371 (2.8 %) were located in breast tissue DHSs, and one SNP was both coding and located in a DHS. We ran two sets of analyses, the first assuming one causal SNP per region and the other assuming two causal SNPs per region. Overall, the results from either of these analyses did not qualitatively differ from the Bayesian analysis without incorporating functional annotation data (Additional file 12: Table S6). However, for the 11q13 region, we noticed that while both the Bayesian approach and PAINTOR assuming one causal variant predicted that rs61041893 had the highest posterior probability (0.22 using both approaches), this SNP had a posterior probability of only 2.2 × 10−5 when we ran PAINTOR assuming two causal variants. Instead, rs12279395 and rs11823311 both had high posterior probabilities (>0.99) of being causal. These two variants were located 74 kb and 75 kb apart from rs61041893, respectively, and 150 kb apart from each other, They were both nominally associated with breast cancer risk (p < 0.0005). SNP rs12279395 is a nonsynonymous SNP located in the ORAOV1 gene, whereas rs11823311 is located in an intergenic region. Interestingly, these three variants (rs12279395, rs11823311, and rs61041893) all show regulatory properties in breast tissue in ENCODE [59] as defined by HaploReg [60].

Table 2 Results of posterior probability analysis

Three of the sequenced regions have shown a stronger association with estrogen receptor negative (ER) breast cancer [4, 9, 61], which is a more aggressive subtype of breast cancer. A total of 393 breast cancer cases in our data were ER. Recognizing the limited power of our study, we reran the analysis with ER breast cancer as the outcome. No SNP was significantly associated with ER breast cancer after our analysis was adjusted for number of tests (Additional file 13: Table S7). The strongest associated SNP was rs112613843 on 8q24 (p = 0.0002). However, this SNP was observed only in African Americans. In population-specific analysis, the GWAS index SNP rs10069690 in the TERT gene was significantly associated with ER breast cancer in African Americans (p = 1.11 × 10−6) but in no other population (all p > 0.3).

Of the 12 sequenced regions, 11 contained coding regions with rare (MAF <0.005), nonsynonymous variation. For each of the 47 genes in these 11 regions, we performed 2 aggregate tests of association between breast cancer risk (overall and by ER status) and all nonsynonymous rare variants (a burden test and SKAT [52]); we also repeated these tests restricting the analysis to nonsynonymous rare alleles predicted to be damaging by PolyPhen-2 [44]. Tests were conducted stratified by ethnicity (Additional file 14: Table S8) and then meta-analyzed across ethnicities (Table 3). After applying the Bonferroni correction for the number of genes tested, we did not observe any significant findings using either SKAT or the burden test: The smallest p value was 0.004 (SKAT) for overall breast cancer when we included all nonsynonymous rare variants in ORAOV1 on chromosome 11q13 (Table 3, Fig. 3). Of note, rs12279395, which was highlighted by the PAINTOR analysis in the two-causal model, is a common nonsynonymous variant (MAF 0.12) in ORAOV1, providing additional support that ORAOAV1 is important in breast cancer development.

Table 3 Nominally significant rare-variant tests for breast cancer overall and by estrogen receptor status, across all ethnicities
Fig. 3
figure 3

Gene-based rare-variant association for overall breast cancer by distance from index genome-wide association study (GWAS) single-nucleotide polymorphism (SNP). y-Axis displays log10 gene association p values. Horizontal red line represents alpha = 0.05. Only the lower of two p values is plotted. For round points, the sequence kernel association test had the lower p value, and for square points, the burden test had a lower p value. Points are color-coded for the 12 breast cancer GWAS-identified index SNPs on 9 chromosomes. Region legend: 2 = TERT, 3 = MAP3K1, 4 = ESR1, 5 = 8q24, 6 = ZNF365, 7 = ZMIZ1, 8 = FGFR2, 9 = 11q13, 10 = RAD51B, 11 = TOX3 and 12 = 19p13

Discussion

In this study, we sequenced 12 genetic regions that have been found to be associated with breast cancer in 2288 breast cancer cases and 2323 controls. We found no strong evidence for a single causal allele in any of the regions. It is likely that the lack of strong signals in our data is due to the inadequate sample size resulting in a low signal-to-noise ratio. The initial GWASs identifying these loci were larger than our study population for many of these regions. It has been shown that fine-mapping studies that use multiple ethnic populations and leverage the genetic variability across populations have greater ability to localize causal variants [28]. Although we included four different ethnicities in this study, our total sample size of 4611 subjects was most likely too small for pinpointing causal variants with high probability. Further, subsequent population-specific efforts have shown that not all regions are associated with breast cancer across ethnicities. A recent study of 3016 cases and 2745 controls of African American ancestry replicated only 4 (2q35, TERT, FGFR2, and MERIT40) of the 12 regions investigated here at p < 0.05 [30]. A study of up to 15,130 cases and 14,584 controls of East Asian descent replicated 7 (2q35, MAP3K1, ESR1, ZMIZ1, FGFR2, 11q13, and TOX3) of the 12 regions at a significance level of 0.05 (although strong evidence has been found for the rs10822013 SNP located in the ZNF365 region). In addition, the TERT region was associated with ER breast cancer [62]. In Latinas, 4 (MAP3K1, ZMIZ1, FGFR2, and TOX3) of the 12 regions have been associated with breast cancer in 1497 cases and 3213 controls [27]. However, when taking tumor subtypes into account, associations were also observed for ER breast cancer (MERIT40) and 2q35 and ZNF365 (ER+ breast cancer).

Another weakness of our study was the incomplete capture of our targeted regions. The capture method we used was able to capture only 50.1 % of the original targeted regions (range 38.8–79.1 %). Nevertheless, we discovered a large proportion of novel variants not observed in the 1000 Genomes Project, illustrating the importance of sequencing depth as well as large, diverse populations to obtain a comprehensive catalogue of the genetic variation within a specific region. Despite the large number of novel and low-frequency variants, we did not detect a significant association between rare, nonsynonymous variation and breast cancer risk. Of note, the vast majority (>98 %) of the variants sequenced in our study were outside coding regions, and one region, 2q35, did not contain any rare, nonsynonymous variants.

Earlier breast cancer fine-mapping studies [30, 56, 6365] identified multiple candidates for causal variants, but it remains a challenge to determine the evidence required to confidently declare a variant causal. In contrast to previous studies, this is the first study, to our knowledge, to use sequence data rather than genotyped and imputed data, greatly improving genomic coverage. We attempted to identify secondary signals by running conditional analyses as well as using Bayesian approaches to identify the best candidate(s) for causal variants. For two of the regions, 2q35 and 11q13, the Bayesian analysis allowed us to create 90 % PPSs including <5 % of the original SNVs, greatly reducing the number of potential candidate causal SNVs. We also incorporated functional annotations with the goal of upweighting SNVs that were of functional importance. We included coding and breast-specific DHS as our two annotations, but none of these annotations showed evidence of being enriched for causal SNVs. It is possible that our lack of findings for these annotations is due either to limited power in our analysis or to these annotations not truly being enriched for causal breast cancer SNVs. Generation of large-scale databases including functional annotations throughout the genome is a constantly evolving area, and, as more data become available, annotations such as those used here can readily be updated and expanded.

For the 2q35 region, our results agree with those of a previous fine-mapping study [63] of the same region. On the basis of data from 46,451 cases and 42,599 controls of European ancestry and 6269 cases and 6624 controls of Asian ancestry in the Breast Cancer Association Consortium (BCAC), the investigators found evidence that one of two highly correlated SNPs (rs4442975 and rs6721996) is likely to explain the association signal observed in this region. This is in agreement with our results where we found that rs6721996 (p = 2.68 × 10−5, posterior probability 0.29) and the strongly correlated rs13412666 (p = 2.98 × 10−5, posterior probability 0.25) showed the strongest association in our data. SNP rs6721996 is also strongly correlated (r 2 = 0.97) with the original GWAS SNP rs13387042. Our results, together with the BCAC results, argue that the breast cancer signal from 2q35 can be explained with only a few SNPs.

In another fine-mapping study [66] by the BCAC, the authors found evidence for three independent signals in the 5q11.2 (MAP3K1) region. After adjustment for multiple testing, we did not identify any significant associations in this region; however, our top finding (rs111944656, p = 0.0004) is correlated (r 2 = 0.63) with one of the top signals in their study (rs113317823). The BCAC also explored the FGFR2 region at 10q26 [64] and found three independent signals. Our top SNP, rs10736303 (p = 4.42 × 10−5), is strongly correlated with two of their signals (rs2981578 [r 2 = 0.94] and rs2912779 [r 2 = 0.79]). The 11q13 region was the only region where the results dramatically changed if we assumed two causal variants rather than one in our PAINTOR analysis. The initial top SNP rs61041893 lies between rs12279375 and rs11823311, and all three SNPs are in low to modest LD in our data (r 2 = 0.01–0.30). In a previous fine-mapping study [57] of this region, the BCAC investigators identified three independent regions. Unfortunately, we did not capture their top variants in our data, making it difficult to directly compare the results. However, on the basis of our and their results, it seems likely that multiple independent breast cancer associations exist in this region.

We used two approaches – SKAT and a burden test – to assess if rare genetic variations in these regions were associated with breast cancer. We limited our analysis to nonsynonymous SNPs and conducted additional analysis including only nonsynonymous variants predicted to be damaging. After adjusting for the number of tests conducted, we did not observe any evidence that rare variations in these regions affect breast cancer risk. Despite some limitations, our study population is relatively large for a sequencing study and incorporates multiple ethnicities. Mensah-Ablorh et al. [67] found that multiethnic studies with genetically diverse subjects were better powered than some single-ethnicity study populations. However, in this case, the power gain derived from including multiple ethnicities was not large enough to overcome the small number of cases and controls (<5000) included in the analysis. For genes where a given population did not carry much rare variation, multiethnic meta-analysis allowed detection of gene-based rare-variant associations that may have been missed in a monoethnic study. Indeed, sufficiently large sample sizes that incorporate ancestral populations with greater genetic diversity and that target regions appropriate to the phenotype under study are critical for effective fine-mapping.

This study makes multiple important contributions to the research field. First, the inclusion of multiple ethnicities allowed us to explore the diversity of genetic variation in these regions and highlight the importance of conducting well-powered studies within multiple ethnicities. Further, we show evidence that sequencing, as compared with genotyping, variants identified through an existing database (e.g., the 1000 Genomes Project) results in identification of many additional rare variants. On the basis of our results, we show that, for breast cancer specifically, there are no rare variants of very large effect lingering at known GWAS loci.

We make the following recommendations for future study design. Future fine-mapping studies should conduct more comprehensive sequencing (whole genome rather than capture) to fully capture the genetic variation in a region. Further, we argue that follow-up studies require even larger and carefully selected study populations than the initial GWAS.

Conclusions

We report the first large-scale follow-up of breast cancer susceptibility loci using sequencing. We did not find any strong evidence for a single causal variant in any of the regions; however, we were able to narrow the number of potential causal SNVs in two regions (2q35 and 11q13). In addition, we did not find evidence that rare genetic variation in these regions is associated with breast cancer risk. This study illustrates some of the challenges faced in fine-mapping studies in the post-GWAS era.