Background

Over the last decade, on the basis of genome-wide association studies (GWASs), > 100 common variants (single-nucleotide polymorphisms [SNPs]) have been reported to be associated with minor increases in breast cancer risk [1,2,3]. Researchers in fine-mapping studies have tried to identify the causal variants as a first step toward understanding how the elevated cancer risk is mediated. Nearly all of the SNPs are non-coding, and evidence to date suggests that some are in regulatory regions of neighbouring target genes and mediate subtle alterations in target gene expression, such as CCND1 [4], or through changes in post-transcriptional regulation, such as altered splicing in TERT [5]. However, for most of the risk loci, the mechanism of risk modification has not been explained, although it is reasonable to expect that for many it will be through modifying expression or regulation of a target gene in the vicinity of the SNP. We hypothesised that if subtle expression changes confer a low susceptibility to breast cancer, coding variants in some of these genes might confer much higher levels of risk. This concept is supported by the finding of low-penetrance SNPs associated with known moderate- and high-penetrance genes such as BRCA2, CHEK2 and potentially RAD51B (RAD51L1) [1,2,3], raising the possibility that other genes associated with low-penetrance SNPs might be enriched for coding high-penetrance predisposition alleles. To address this question, we sequenced all exons and exon-intron boundaries in 56 genes that are plausibly associated with breast cancer risk SNPs in index cases from 1043 familial breast cancer families who previously had negative test results for BRCA1 or BRCA2 pathogenic mutations and 944 population-matched cancer-free control participants from an Australian population.

Methods

Candidate genes

Because the target genes influenced by most reported breast cancer predisposition SNPs remain unknown, we used two strategies to identify genes of interest: (1) we selected genes reported as the plausible target gene in GWASs at the time of our gene panel design [2, 3, 6,7,8,9,10,11,12,13], and (2) where no gene had previously been proposed for a particular SNP, we screened any gene located within ± 500 kb of the risk-associated SNP on the basis that most enhancers are < 500 kb away from the gene that they regulate and that most linkage disequilibrium (LD) blocks are < 500 kb in size [14]. In total, 56 genes associated with 56 SNPs were sequenced (Table 1, Additional file 1: Table S1), along with other candidates, as part of a custom sequencing panel [15,16,17,18].
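The location-based rule in strategy (2) amounts to a window query around each risk SNP. The following is a minimal sketch of that idea, not the pipeline used in this study: the gene names, coordinates and the function name genes_within_window are hypothetical, and treating a gene as eligible when any part of it overlaps the window is an assumption.

```python
from typing import List, NamedTuple

WINDOW_BP = 500_000  # +/- 500 kb around the risk SNP, as stated in the text


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_within_window(snp_chrom: str, snp_pos: int, genes: List[Gene],
                        window: int = WINDOW_BP) -> List[str]:
    """Return names of genes overlapping [snp_pos - window, snp_pos + window]."""
    lo, hi = snp_pos - window, snp_pos + window
    return [
        g.name
        for g in genes
        if g.chrom == snp_chrom and g.start <= hi and g.end >= lo
    ]


# Hypothetical coordinates, for illustration only.
example_genes = [
    Gene("GENE_A", "chr1", 1_000_000, 1_050_000),  # ends 450 kb before the SNP
    Gene("GENE_B", "chr1", 2_200_000, 2_300_000),  # starts 700 kb after the SNP
    Gene("GENE_C", "chr2", 1_400_000, 1_450_000),  # different chromosome
]
selected = genes_within_window("chr1", 1_500_000, example_genes)
print(selected)
```

Only the gene on the same chromosome that overlaps the 1 Mb window is returned in this toy example.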

Table 1 Candidate genes identified and corresponding breast cancer risk single-nucleotide polymorphisms

Cohorts

A total of 1043 female breast cancer-affected index cases from high-risk breast cancer families were identified from the Variants in Practice Study and ascertained from familial cancer centres (FCCs) in Victoria and Tasmania, Australia, as described previously [17]. The personal and/or family history of each case was assessed by a specialist FCC and determined to be sufficiently strong to be eligible for clinical genetic testing for hereditary breast cancer predisposition genes by local criteria. All cases in this study had a negative test result for pathogenic mutations in BRCA1 and BRCA2. The average age of cases in this study was 45 years (range, 22–81).

The control participants comprised 944 female subjects randomly selected from among the > 54,000 female participants of the Lifepool Study (http://www.lifepool.org/). The control participants had no self-reported or cancer registry-confirmed cancers diagnosed as of May 2016. Lifepool has recruited women > 40 years of age through the population-based mammographic screening program in Victoria, Australia (BreastScreen Victoria). The average age of Lifepool control DNA donors in this study was 59 years (range, 40–92).

Targeted sequencing, variant calling and variant filtering

The coding regions and exon-intron boundaries (plus ≥ 10 bp of each intron) of 56 genes were enriched from germline DNA using a custom-designed HaloPlex Targeted Enrichment Assay panel (Agilent Technologies, Santa Clara, CA, USA). The libraries were sequenced on a HiSeq2500 Genome Analyzer (Illumina, San Diego, CA, USA) as described previously [17].

Sequencing data were processed and analysed using an in-house bioinformatics pipeline constructed using SEQLINER v0.1a (http://bioinformatics.petermac.org/seqliner). Raw reads (FASTQ files) were first quality-checked using FastQC (v0.11.2; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and trimmed using cutadapt (1.7.1) [19] to ensure high read quality. Filtered reads were then aligned to the human reference genome (GRCh37/hg19) using the Burrows-Wheeler Aligner tool [20], with base quality score recalibration and indel realignment performed using the Genome Analysis Toolkit (GATK v3.2.2) [21]. GATK UnifiedGenotyper v2.4 (Broad Institute, Cambridge, MA, USA) [22], HaplotypeCaller [23] and PLATYPUS [24] were used for variant calling. Annotation of variants was performed using a local copy of the Ensembl [25] version R73 database and a customised version of Ensembl Variant Effect Predictor. Variants were determined by reference to the canonical transcripts. The Ensembl definition was as follows: (1) longest Consensus Coding Sequence Project translation with no stop codons; (2) if no (1), choose the longest Ensembl/Havana merged translation with no stop codons; (3) if no (2), choose the longest translation with no stop codons; (4) if no translation, choose the longest non-protein-coding transcript. Only variants that were identified by at least two variant callers with a total read depth of at least ten and an alternate allele read proportion ≥ 20% were included in the analysis. Loss-of-function (LoF) mutations were defined as stop-gained, frame shift or essential splice site mutations. The in silico assessment tools Condel [26], Polymorphism Phenotyping version 2 (PolyPhen-2) [27], SIFT [28], Combined Annotation Dependent Depletion (CADD) [29] and rare exome variant ensemble learner (REVEL) [30] were used to examine the likely pathogenicity of missense variants. 
Variants were defined as “likely deleterious” when predicted deleterious or damaging by Condel, PolyPhen-2 or SIFT, or when they had a CADD score ≥ 15 or a REVEL score ≥ 0.5. The Exome Aggregation Consortium (ExAC) and Exome Variant Server (EVS) databases were used as additional references for the frequency of variants in the general population. Because this study was focused on the identification of moderate- to high-penetrance alleles, which will be rare [31, 32], only variants with a population allele frequency ≤ 0.001 (in both overall and European Caucasian populations) were assessed. Variants were visually inspected using Integrative Genomics Viewer [33, 34] to exclude artifacts.
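The inclusion thresholds and the “likely deleterious” definition above can be expressed as two small predicates. This is a sketch under stated assumptions rather than the in-house pipeline: the Variant record, its field names and the example values are hypothetical, and the definition is read as "any one of the listed criteria is sufficient".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

MIN_CALLERS = 2          # identified by at least two variant callers
MIN_DEPTH = 10           # total read depth of at least ten
MIN_ALT_FRACTION = 0.20  # alternate allele read proportion >= 20%
MAX_POP_AF = 0.001       # population allele frequency <= 0.001
CADD_CUTOFF = 15.0
REVEL_CUTOFF = 0.5


@dataclass
class Variant:  # hypothetical record layout, for illustration only
    callers: Set[str]
    depth: int
    alt_reads: int
    pop_af_overall: float
    pop_af_european: float
    tool_calls: Dict[str, bool] = field(default_factory=dict)
    cadd: Optional[float] = None
    revel: Optional[float] = None


def passes_filters(v: Variant) -> bool:
    """Apply the caller, depth, allele-proportion and frequency thresholds."""
    alt_fraction = v.alt_reads / v.depth if v.depth else 0.0
    return (
        len(v.callers) >= MIN_CALLERS
        and v.depth >= MIN_DEPTH
        and alt_fraction >= MIN_ALT_FRACTION
        and v.pop_af_overall <= MAX_POP_AF
        and v.pop_af_european <= MAX_POP_AF
    )


def likely_deleterious(v: Variant) -> bool:
    """True if any categorical tool flags the variant or a score cutoff is met."""
    flagged = any(v.tool_calls.get(t, False)
                  for t in ("Condel", "PolyPhen-2", "SIFT"))
    cadd_hit = v.cadd is not None and v.cadd >= CADD_CUTOFF
    revel_hit = v.revel is not None and v.revel >= REVEL_CUTOFF
    return flagged or cadd_hit or revel_hit


# Hypothetical variants, for illustration only.
kept = Variant({"UnifiedGenotyper", "HaplotypeCaller"}, depth=40, alt_reads=18,
               pop_af_overall=0.0002, pop_af_european=0.0003,
               tool_calls={"SIFT": False}, cadd=22.1, revel=0.31)
dropped = Variant({"PLATYPUS"}, depth=8, alt_reads=4,
                  pop_af_overall=0.0, pop_af_european=0.0)
print(passes_filters(kept), likely_deleterious(kept), passes_filters(dropped))
```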

Statistical analysis

ORs and p values were calculated using a two-tailed Fisher’s exact test and the chi-square test in R version 3.3.2 [35].
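For readers unfamiliar with the test, a two-tailed Fisher’s exact p value for a 2 × 2 table can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library and is not the R code used in this study; the function name and the toy table are illustrative. It follows the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. cases, controls); columns are carrier / non-carrier.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers in the first row,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    tol = 1 + 1e-7  # relative tolerance so floating-point ties are included
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * tol)
    return min(1.0, total)


# Toy table: 3 of 4 in one group versus 1 of 4 in the other.
p_toy = fisher_exact_two_sided(3, 1, 1, 3)
print(p_toy)
```

The same function accepts carrier and non-carrier counts for the case and control cohorts; exact agreement with any particular statistics package depends on how that package handles ties.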

Results

All exons and exon-intron boundaries of 56 genes identified by either GWAS-proposed or location-based neighbouring criteria (Table 1; see also selection criteria described in the Methods section) were sequenced with consistent high coverage in cases and control participants (average sequencing depths of 170.4 and 175.6, respectively). Overall, 96.0% of the bases among the cases and 97.1% of the bases among the control participants were sequenced to a depth greater than tenfold (Additional file 1: Table S2). As previously described, principal component analysis using 7574 variants from all genes in the sequencing panel showed that ~ 98% of study subjects were of European Caucasian ancestry, and no bias was observed in the population distribution between the case and control cohorts [18].

Loss-of-function variants

LoF variants (minor allele frequency [MAF] in ExAC and EVS, ≤ 0.001) were rare in both the cases and control participants across all the candidate genes, with only 38 unique variants observed in a total of 39 carriers (Table 2). For the majority of genes (36 of 56), no LoF variants were detected in either the case or control cohorts (Table 3).

Table 2 Loss-of-function variants detected in case and control cohorts
Table 3 Number of carriers with loss-of-function and missense variants detected in case and control cohorts

No gene had a significant excess of LoF mutations in the cases versus the control participants. TET2 had the largest number of LoF variants, with five in the cases and two in the control participants, whereas three LoF mutations were detected in NRIP1 in the cases but none in the control participants. No more than two mutation carriers were identified in each cohort for the remaining 18 genes harbouring LoF variants. Across all 56 genes, there was a total of 26 LoF mutations in the cases compared with 13 among the control participants (OR, 1.83; p = 0.077; 95% CI, 0.9–3.9). Notably, there were ten genes with LoF variants detected only in the cases, compared with only three genes with LoF variants detected only in the control participants. When we restricted this analysis to the 35 genes directly proposed by GWASs, which have a potentially higher likelihood of being the target gene (as opposed to being selected solely on the basis of their location within ± 500 kb of the SNP), we observed a significant excess of LoF mutations in the cases (17 versus 4; OR, 3.89; 95% CI, 1.26–15.95; p = 0.008). In contrast, no difference was observed for the 21 location-only-based candidate genes (9 versus 9).
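As a check on the arithmetic, the reported odds ratios agree with the cross-product ratio of the corresponding 2 × 2 carrier tables. This assumes that each carrier is counted once and that the denominators are the full cohorts of 1043 cases and 944 control participants given in the Methods; the symbols a, b, c and d are introduced here for illustration, denoting carriers and non-carriers among the cases (a, b) and among the control participants (c, d).

```latex
\mathrm{OR} = \frac{a\,d}{b\,c}

\mathrm{OR}_{\text{all 56 genes}}
  = \frac{26 \times (944 - 13)}{(1043 - 26) \times 13}
  = \frac{26 \times 931}{1017 \times 13}
  \approx 1.83

\mathrm{OR}_{\text{35 GWAS-proposed genes}}
  = \frac{17 \times (944 - 4)}{(1043 - 17) \times 4}
  = \frac{17 \times 940}{1026 \times 4}
  \approx 3.89
```

An exact-test routine may report a conditional maximum-likelihood estimate that differs slightly from this simple ratio.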

Missense variants

Similar to the LoF variants, the total number of carriers with rare missense variants (MAF ≤ 0.001 in ExAC and EVS) (Table 3, Additional file 1: Table S3) across all 56 genes was greater in the cases than in the control participants (406 versus 353; OR, 1.07), but this finding was not statistically significant (p = 0.512). In addition, 34 genes had a higher frequency of missense variants in the cases compared with only 16 genes with a higher frequency in the control participants. ZNF283 showed the strongest enrichment for missense variants in the cases (17 versus 6); however, this difference was not statistically significant. There was no obvious difference in the rare missense variant frequency based on whether they were GWAS-proposed genes or location-only-based genes.

The missense variants were further stratified according to a series of in silico prediction tools (Condel, PolyPhen-2, SIFT, CADD and REVEL) as a means of enriching for variants with a higher likelihood of pathogenicity (Table 4). There was a trend towards a slightly higher frequency of predicted pathogenic missense variants observed in the cases than in the control participants using any single prediction tool (ORs ranging from 1.11 to 1.37), but none of the comparisons reached statistical significance. Further restricting the analysis to only those variants predicted to be pathogenic by all five in silico tools, we detected no significant difference between the cases and the control participants (58 versus 39; p = 0.170).

Table 4 Number of carriers with likely deleterious missense variants predicted by in silico tools

Discussion

The majority of common, low-penetrance breast cancer SNPs are located in non-coding genomic regions, and although different hypotheses have been proposed, the biological mechanisms underlying these risk associations remain inconclusive. Studies to date have demonstrated mechanisms at least for some risk SNPs involving altered expression of the target gene as a result of disruption to enhancer or promoter regions or by affecting RNA splicing [4, 5]. On this basis, we hypothesised that if subtle alterations to gene expression result in small increases in breast cancer risk, then coding variants with more profound effects on gene function might convey much higher levels of risk. BRCA1 and BRCA2 are the prime examples of such a scenario where both highly penetrant coding mutations and low-penetrance non-coding SNPs exist. GWASs are not designed to identify such variants, owing to their rarity in the population.

Among the 56 candidate genes sequenced, LoF variants were rare, with over half of genes having no LoF variants in either the cases or control participants. However, there was a small excess of both the total number of LoF and missense variants in the cases compared with the control participants (LoF OR, 1.83; missense OR, 1.07), but because the mutation frequency for each individual gene was very low, it is unclear if this result reflects a higher penetrance effect of a small number of genes or if many of the variants contributed to a small excess in breast cancer risk. The genes with the greatest contribution to the excess of LoF variants in the cases included TET2, NRIP1, RAD51B and SNX32 (12 cases versus 2 control participants), whereas ZNF283 and CASP8 contributed largely to the excess of missense variants (25 cases versus 8 control participants). However, on an individual gene level, none showed a significant difference in the cases compared with the control participants. A larger cohort size is needed to confirm this trend and identify the contribution of any single gene. Of note, there were no LoF variants detected and no excess of missense variants (four in cases versus four in control participants) in FGFR2, the “top hit” in many independent breast cancer GWASs.

The strongest excess of LoF variants in this study was in TET2 (five cases versus two control participants). This gene has been reported to have a genome-wide influence on gene expression by altering DNA methylation, and its dysregulation has been associated with aberrant DNA methylation and implicated in the development of acute myeloid leukaemia [36, 37]. Guo et al. showed that the association with cancer appeared to be with functional SNPs that lie in the promoter or enhancer and consequently affect TET2 expression [38]. Such evidence suggests that rare coding variants in TET2 could plausibly compromise TET2 function and contribute to breast cancer susceptibility. However, the data for TET2 need to be interpreted cautiously because it is a gene known to accumulate age-related somatic mutations in blood [39]. It is possible that some of the variants we identified are somatic mutations rather than germline variants, particularly in light of the fact that the alternate allele read proportions of LoF variants were generally in the low range (≤ 35%).

Researchers have proposed that LoF variants in RAD51B (RAD51L1) confer a high risk of breast cancer [40], but this remains inconclusive owing to the extreme rarity of the LoF mutations (only 48 carriers in 60,706 participants in ExAC; carrier frequency, 0.08%). Few germline LoF mutations have been reported: one splicing variant in a breast and ovarian cancer family [41], one splicing and one nonsense variant in two patients with ovarian cancer [42], and one nonsense variant in a melanoma family (p.Arg47Ter) [43]. We observed two carriers of the same nonsense mutation, p.Arg47Ter, which is the most common LoF variant seen in the ExAC database (21 carriers in total, including 14 South Asian and 5 non-Finnish European carriers). In addition to a family history of breast cancer, each carrier had a relative with ovarian cancer (mother, grandmother), and one had both parents diagnosed with melanoma. Together with the previously cited reports, our data support RAD51B as a plausible candidate gene in breast cancer families, especially breast and ovarian cancer families, and it may also play a role in melanoma predisposition.

With respect to missense variants, CASP8 showed a strong signal towards an excess of rare variants (eight cases versus two control participants). Notably, the corresponding low-penetrance GWAS SNP rs1045485 (p.Asp344His; MAF in ExAC, 0.12) is a missense variant in CASP8; however, it is not included among the missense variants in this study, because we focused only on rare variants (MAF ≤ 0.001). In a meta-analysis of one promoter polymorphism that decreased CASP8 expression, Cai et al. concluded that it was associated with a reduced risk of a broad range of cancers, including breast cancer [44]. This evidence and our data would be consistent with a model whereby a subtle reduction in CASP8 function leads to a reduction in cancer risk, whereas missense mutations conferring an enhanced or altered function increase cancer risk. Regardless of the status of these leading candidate genes, our data clearly show that low-penetrance SNP-associated genes are not conspicuously enriched for high-penetrance breast cancer predisposition alleles and at best could explain only a small proportion of hereditary breast cancer families with no known pathogenic variants.

It has been suggested that one possible mechanism contributing to the minor risks detected in GWASs for common variants that lie close to the coding sequence of a gene could be an uneven distribution of much rarer, high-risk coding variants between the different SNP alleles. For many SNPs this explanation appears unlikely on the basis of underlying LD structure and the distance between the tagging SNP and the nearest gene, and for a smaller number this has been excluded by fine-mapping and functional studies that have directly demonstrated the effect of the causative variant. However, our data provide an opportunity to examine this potential mechanism systematically for all of the genes sequenced. We compared the frequency with which LoF and rare missense variants in the 56 genes were observed in association with either the corresponding risk SNP or the alternate allele, both in the case group and in the control group (Additional file 1: Table S4), and we found no convincing evidence of an interaction between the common and rare variants. For a few genes, including PDE4D and TERT, there was a notable trend towards an excess of rare variants in association with the risk form of the SNP, but this was not statistically significant when adjusted for the effect of multiple testing. Similar trends were observed for some genes, including UNC13A and DNAJC1, in the opposite direction, indicating that the trends on each side of the association were very likely due to random chance. Of note, the greatest excess of rare variants in carriers of the risk allele was found for the PDE4D gene, where pathogenic missense variants have previously been associated with an unrelated rare high-penetrance dominant disorder, acrodysostosis type 2 [45].
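To illustrate why a nominal per-gene trend can disappear after adjustment for multiple testing, the sketch below applies a Bonferroni correction. The source does not state which correction was used, so Bonferroni is an assumption, as is the choice of one test per gene (56 tests); the gene labels and p values are hypothetical and are not taken from Table S4.

```python
from typing import Dict


def bonferroni_adjust(p_values: Dict[str, float], n_tests: int) -> Dict[str, float]:
    """Multiply each p value by the number of tests and cap the result at 1."""
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}


N_GENES = 56  # assumed: one comparison per sequenced gene

# Hypothetical nominal p values, for illustration only.
nominal = {"GENE_A": 0.01, "GENE_B": 0.20, "GENE_C": 0.65}
adjusted = bonferroni_adjust(nominal, n_tests=N_GENES)
significant = [gene for gene, p in adjusted.items() if p < 0.05]
print(adjusted, significant)
```

In this toy example a nominal p value of 0.01 no longer falls below 0.05 once 56 comparisons are taken into account.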

This study has several main limitations. Firstly, as a consequence of the rarity with which LoF variants were observed in these candidate genes, our cohort size could not provide sufficient power to determine the cancer predisposition role for any individual gene. Secondly, further breast cancer predisposition SNPs continue to be identified, and we have not analysed genes that are located near more recently identified SNPs, although there is no reason to believe that the genes we studied are not representative of SNP-related genes in general. Thirdly, the cases and control participants in this analysis are well matched for ethnicity and represent a population very similar to that in which the predisposition SNPs were originally identified; consequently, we are unable to evaluate whether moderate- to higher-penetrance predisposing variants exist in other ethnic groups. In addition, in this study, we were not able to examine whether some candidate genes were significant in specific molecular subtypes of breast cancer.

Conclusions

In summary, our study describes, for the first time to our knowledge, an assessment of the contribution of rare coding variants in SNP-associated genes to familial breast cancer risk. Although confirmatory studies are required, our data suggest that rare LoF and missense variants in genes associated with low-penetrance SNPs may contribute some additional risk but that they are unlikely to be major contributors to breast cancer heritability.