Introduction

Breast cancer is the most common cancer among women in Europe, with approximately 523,000 cases diagnosed annually [1], and remains a leading cause of death among adult women. Primary prevention of breast cancer by endocrine therapy has side effects and does not offer complete protection, whereas prophylactic surgery is very effective [2] but socially and emotionally burdensome. Secondary prevention by early detection through mammographic screening can reduce mortality but at the cost of overdiagnosis and the burden of false-positive results [3, 4]. Stratification of women according to their risk of developing breast cancer could provide a persuasive rationale for surgical intervention and improve the efficacy of risk-reduction and screening strategies by tailoring starting age and frequency [5, 6•].

Box 1 Definition of breast cancer risk

Clinically, terms such as low, moderate, and high breast cancer risk are often used. However, these terms can refer to either relative or absolute risks. For a given relative risk (RR), the absolute risk can vary between countries depending on the background cancer incidence. Another term often used is lifetime risk, which is the absolute risk of breast cancer over the period of a woman’s life. Here, we define moderate risk as RR = 2 to 4, high risk as RR > 4, and low or population risk as RR < 2.
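The distinction between relative and absolute risk made in Box 1 can be illustrated with a minimal sketch. It assumes proportional hazards and ignores competing mortality, so that the absolute risk at a given RR equals 1 − (1 − baseline)^RR. The function names and the two baseline lifetime risks are hypothetical and serve only as an illustration.

```python
def risk_category(rr: float) -> str:
    """Classify a relative risk using the thresholds defined in Box 1."""
    if rr < 2:
        return "low/population"
    if rr <= 4:
        return "moderate"
    return "high"


def absolute_risk(baseline_risk: float, rr: float) -> float:
    """Absolute risk for a given RR, assuming proportional hazards
    and no competing causes of death (a simplifying assumption)."""
    return 1.0 - (1.0 - baseline_risk) ** rr


# Two hypothetical populations with different baseline lifetime risks:
# the same RR of 3 ("moderate") translates into different absolute risks.
for baseline in (0.08, 0.12):
    print(baseline, risk_category(3.0), round(absolute_risk(baseline, 3.0), 3))
```

The same relative risk category therefore corresponds to a different absolute risk in each population, which is why the two scales should not be used interchangeably.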

To accurately assess a woman’s risk, it is important to take all risk factors into account. Having a positive family history is one of the main risk factors for breast cancer. For women with a first-degree relative with breast cancer, the risk for developing breast cancer is twofold compared with women without such a family history [7]. Approximately 25% of this familial relative risk (FRR) is explained by (likely) pathogenic variants in a small number of genes, and a further 18% by the currently known common low-risk variants, mostly single nucleotide polymorphisms (SNPs) [8, 9, 10•, 11]. Besides the familial relative risk, other risk factors such as mammographic density and lifestyle factors are also important. In this review, we briefly summarise the breast cancer susceptibility factors, and then discuss avenues towards combining all these factors to create individual risk scores, and towards the identification of factors capable of explaining the remainder of familial relative risk.

Rare Genetic Variation Associated with Breast Cancer

The definition of ‘rare’ variation is somewhat arbitrary, but a variant is generally considered rare when it occurs in < 0.5% of the general population. Indeed, we currently know that some variants associated with breast cancer risk are extremely rare (< 0.001%), others moderately rare (~ 0.1%), or even almost ‘common’ (~ 1%). In addition, the risks conferred by these variants may vary from less than twofold to over tenfold. Classic linkage analysis in multiple-case families discovered some of the genes, but many were discovered by DNA sequencing of candidate genes. The best-known examples of linkage-detected genes are BRCA1 and BRCA2 [12, 13]. Pathogenic variants in either gene, each with a joint allele frequency of ~ 0.1%, lead to a high risk of breast and ovarian cancer [14, 15]. Other genes, particularly TP53, PTEN, STK11, CDH1, and NF1, were discovered because of their association with typical familial cancer syndromes of which breast cancer is one feature [16,17,18,19,20]. Accordingly, pathogenic variants in these genes are extremely rare in the population. These findings also underscore the pleiotropic effects that some DNA variations display by predisposing to cancers of diverse tissue origin. Yet for most breast cancer genes discovered so far, the most conspicuous ‘other’ cancer with which an association has been firmly established is ovarian cancer. Another ‘syndromic’ gene is ATM; pathogenic variants in ATM act in a recessive way to cause ataxia telangiectasia, a neurodegenerative disorder, but heterozygous carriers are at moderately increased risk of breast cancer [21]. The discovery that BRCA1, BRCA2, and ATM are involved in DNA damage repair, and that BRCA2 is a Fanconi anaemia gene [22], suggested that other DNA repair genes might also confer breast cancer susceptibility. Sequence analysis of these candidates then led to the discovery of CHEK2, BARD1, PALB2, NBN, and RAD51D [23,24,25,26,27] as breast cancer genes, although evidence is sometimes limited to specific variants in populations of specific ethnic background [26]. Breast cancer risks conferred by variants in these five genes are generally moderate, with the exception of loss-of-function variants in PALB2, which can lead to breast cancer risks comparable to those of BRCA2 [26, 28].

There is a long list of genes, including BRIP1, FANCC, FANCM, MEN1, MRE11A, PPM1D, RAD50, RAD51B, RAD51C, RECQL, and XRCC2, for which an association with breast cancer has been reported in a few studies, but for which replication in sufficiently large samples of cases and controls and establishment of effect sizes are still lacking. In fact, BARD1 and RAD51D were only recently confirmed in such analyses as moderate-risk genes [29•]. Finally, a long-standing issue is whether the Lynch syndrome genes (MLH1, MSH2, MSH6, and PMS2) and MUTYH are associated with breast cancer risk. Interpretation of breast cancer incidence in studies of Lynch syndrome families is complicated due to various biases (e.g., ascertainment). The issue remains controversial to date, even though a recent study again found an association between pathogenic variants in MSH6 and breast cancer risk [30]. More detailed discussions on the association of gene variants and breast cancer and the corresponding risks can be found in reviews by Wendt et al., Easton et al., and Graffeo et al. [26, 27, 31].

Box 2 Classification of gene variants

The American College of Medical Genetics and Genomics (ACMG) has recommended a five-tier classification system, which has been adopted by many countries [32]. These classes are (1) Benign, (2) Likely Benign, (3) Variant of Uncertain Significance (VUS), (4) Likely Pathogenic, and (5) Pathogenic. For a VUS, the pathogenicity and hence the association with disease risk are unknown, usually because the variant substitutes an amino acid with a similar one or resides in a part of the gene not essential for its function.

Challenges in Risk Assessment and Clinical Translation

Once a gene has been repeatedly associated with breast cancer, other challenges arise that may hamper introduction into the clinic. One is allelic diversity and the notion that different types of variants (e.g., nonsense versus missense changes) might confer different breast cancer risks [26]. For BRCA1 and BRCA2, the effect of mutation-position on the relative risks for breast and ovarian cancer has been firmly established [33]. Furthermore, several missense changes have been identified in BRCA1 and BRCA2 that cause much more moderate risks than the typical loss-of-function variants [34•, 35]. Conversely, while most pathogenic variants in ATM will give an intermediate breast cancer risk, one specific missense mutation (c.7271C>G) seems to reach a level of risk approaching that of BRCA1/2 pathogenic variants [36, 37]. The presence of allelic diversity in breast cancer genes also highlights the difficulties we are still having with establishing pathogenicity for each variant. This seems straightforward for protein-truncating variants (although exceptions exist [38]), but for many missense and ‘spliceogenic’ variants the impact on protein function (and, by inference, on cancer risk) is hard to predict. The many in silico tools available for this purpose still perform poorly with respect to clinical standards, and for virtually all genes listed above, well-calibrated high-throughput functional analyses in model systems are lacking [39]. As a result, many variants detected by sequencing in these genes are still classified as Variants of Uncertain Significance (VUS).

Another challenge is to establish the penetrance of pathogenic variants and the corresponding breast cancer risks with sufficient accuracy. With some exceptions, there is still much uncertainty surrounding the magnitude and precision of the risks conferred by pathogenic variants in these genes. One problem underlying this issue is ascertainment bias in the samples used in the analyses. Patient series consisting mostly of women with a positive family history almost certainly overestimate risk because they are enriched for other risk factors. This is especially true for tumour syndrome genes, investigation of which is usually triggered by the syndrome criteria. For example, the penetrance of TP53 variants was initially estimated to be very high [40]. But with the introduction of gene panel sequencing, pathogenic variants in TP53 were also reported in families who do not fulfil the classical criteria of Li-Fraumeni Syndrome [41]. These families show older ages of onset of breast cancer [42], suggesting lower penetrance of at least some TP53 pathogenic variants. This is consistent with recent estimates of the prevalence of pathogenic germline TP53 variants in the general population [43], which are also much higher than expected on the basis of the prevalence of Li-Fraumeni Syndrome alone. The other problem is the rarity of variants, which necessitates the analysis of very large case-control series in order to sufficiently narrow down the confidence intervals of risk estimates. For this reason, we have reasonably good breast cancer risk estimates for the 1100delC variant in CHEK2, which occurs in ~ 0.5% of the general population in Europe [44, 45] and the USA [29•, 45], but not for most other, much rarer variants in this gene. Establishing an odds ratio of 2 with a 95% confidence interval of 1.4–2.8 for a variant with an allele frequency of 0.01% would require genotyping 100,000 cases and 100,000 controls. Larger numbers are needed for lower risks and lower allele frequencies.
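How sample size and variant frequency drive the width of the confidence interval can be explored with a rough sketch based on the standard log (Woolf) method for an odds ratio. The function name and the inputs are assumptions: in particular, the carrier frequency of 0.02% is taken as roughly twice an allele frequency of 0.01%, and the exact assumptions behind the interval quoted above are not stated, so this sketch need not reproduce it.

```python
import math


def odds_ratio_ci(n_cases, n_controls, carrier_freq_controls, odds_ratio, z=1.96):
    """Expected confidence interval for an odds ratio (log/Woolf method),
    given the expected 2x2 table for a rare variant."""
    p0 = carrier_freq_controls
    odds_cases = odds_ratio * p0 / (1 - p0)
    p1 = odds_cases / (1 + odds_cases)              # carrier frequency among cases
    a, b = n_cases * p1, n_cases * (1 - p1)         # carrier / non-carrier cases
    c, d = n_controls * p0, n_controls * (1 - p0)   # carrier / non-carrier controls
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of log(OR)
    log_or = math.log(odds_ratio)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Assumed carrier frequency of 0.02% in controls and a true odds ratio of 2.
for n in (10_000, 100_000, 1_000_000):
    lo, hi = odds_ratio_ci(n, n, carrier_freq_controls=0.0002, odds_ratio=2.0)
    print(f"{n:>9} cases and controls: 95% CI {lo:.2f}-{hi:.2f}")
```

The interval narrows roughly with the square root of the number of carriers observed, which is why lower allele frequencies and smaller effects call for disproportionately larger studies.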

Gene Panel Studies—Non-BRCA1/2 Genes

Gene panel sequencing (GPS) has become a diagnostic reality in cancer genetics. Due to the lower costs and improving data quality, it became possible to test multiple genes in addition to BRCA1 and BRCA2 in a single assay, driven by a desire to explain familial clustering of breast cancer in more families and thus impact clinical management. As explained above, the frequency of pathogenic variants found in clinic-based series of familial cases is dependent on the selection criteria of the families included. The highest frequencies, up to 10%, of pathogenic variants are still found in the BRCA1 and BRCA2 genes in familial breast cancer cases [46,47,48]. Pathogenic variants in non-BRCA1/2 genes are found in 3.7–6.2% of the cases [29•, 46,47,48,49,50]. The highest frequencies of pathogenic variants in non-BRCA1/2 genes are found in CHEK2, ATM and PALB2 [29•]. However, this increased diagnostic yield comes at the expense of a large proportion of detected VUS, which poses a significant clinical problem. Gene panel studies have found a VUS in 13.6–41.6% of the cases [46, 48, 49, 51]. This means that for every pathogenic variant found in a case, two to three cases with VUS are detected. Furthermore, gene panels contain many genes for which the relevance to breast cancer is unknown or uncertain, as outlined above. Due to these uncertainties, most of the test results of commercial gene panels do not translate well into cancer risk assessment. Even the relatively well-defined cancer risks conferred by BRCA1 and BRCA2 are influenced by mutation position and mutation class, as well as by non-genetic exposures and lifestyle factors [35, 52, 53]. Therefore, the gain in clinical utility of testing genes for which evidence of their association with breast cancer is still ill-defined remains limited [26, 54].

SNPs and Polygenic Risk Scores

Since 2005, genome-wide association studies, using SNP arrays and very large case-control samples, enabled the identification of common low-risk variants for breast cancer [11]. Collaborative groups, such as the Breast Cancer Association Consortium (BCAC), have currently identified ~ 180 SNPs as significantly associated with breast cancer [10•]. The first substantial batch of SNPs was found by the Collaborative Oncologic Gene-environment Study (COGS) in 2013, coordinated by BCAC, which was subsequently confirmed and extended by combining with other GWAS data [55]. Another 65 loci were detected after the introduction of the OncoArray, a SNP array with a much denser SNP coverage than COGS [10•]. Some of the associated SNPs are more strongly associated with Estrogen Receptor (ER)-negative or ER-positive subtypes of breast cancer [10•, 56•]. The currently known SNPs explain 18% of the familial relative risk for breast cancer, but a much greater proportion (~ 40%) can be explained when variants that can be reliably imputed from the OncoArray data are included [10•, 57•]. To validate these latter SNPs, very large case-control studies are needed to reach genome-wide significance levels of association because many of these are expected to be relatively rare (< 5%) and/or of very small effect sizes.

The number of breast cancer–associated SNP risk alleles carried per individual is approximately normally distributed in the general population. This means that, in contrast to pathogenic variants in breast cancer susceptibility genes, all individuals in the population carry a certain number of risk alleles, with most individuals carrying close to the average number. Individually, these risk alleles confer a very small increase in breast cancer risk, but their joint effect may be substantially higher [8]. In the absence of evidence of clear interactions between SNPs [8, 58], a simple log-additive (or multiplicative) model is used to combine all SNPs into a single Polygenic Risk Score (PRS).
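The log-additive model can be sketched in a few lines: the PRS is the sum, over all SNPs, of the number of risk alleles carried multiplied by the log of the per-allele odds ratio. The function name, the three SNPs and their odds ratios below are hypothetical and are not taken from any published PRS.

```python
import math


def polygenic_risk_score(risk_allele_counts, per_allele_odds_ratios):
    """Log-additive PRS: sum over SNPs of (risk-allele count x log per-allele OR)."""
    if len(risk_allele_counts) != len(per_allele_odds_ratios):
        raise ValueError("one odds ratio is needed per SNP")
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, per_allele_odds_ratios))


# Hypothetical example with three SNPs.
counts = [0, 1, 2]                  # risk alleles carried at each SNP (0, 1 or 2)
odds_ratios = [1.05, 1.10, 1.26]    # assumed per-allele odds ratios
prs = polygenic_risk_score(counts, odds_ratios)
print(f"PRS = {prs:.3f}, odds relative to carrying no risk alleles = {math.exp(prs):.2f}")
```

Adding on the log scale is equivalent to multiplying the per-allele odds ratios, which is why the model is described as either log-additive or multiplicative.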

Many different PRSs for breast cancer have been published in recent years (Table 1). Most studies have generated PRSs for overall unilateral breast cancer; a few have addressed ER status-specific PRS models using subtype-specific odds ratios for certain SNPs. Subtype-specific PRSs can potentially be useful to guide clinical management for chemoprevention and other prevention strategies. Two studies [74, 75] have used a PRS to predict contralateral breast cancer, and two have studied the PRS as a risk modifier in carriers of rare gene mutations (BRCA1, BRCA2, and CHEK2) [72•, 73•]. The number of SNPs, their allele frequencies, and effect sizes determine the discriminatory and predictive power of a PRS. Predictive power of a PRS is usually expressed as the odds ratio (OR) per standard deviation unit of the distribution; discriminatory power is assessed by the area under the receiver operating characteristic curve (AUC). The number of SNPs included in a PRS is not strongly correlated with the overall effect size or the AUC. This is because the SNPs detected in the earliest studies, although smaller in number, generally have higher effect sizes than those detected more recently in studies with more statistical power. Including large numbers of SNPs at thresholds less stringent than genome-wide significance may increase predictive power of the PRS but at the expense of being less specific [57•].
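The two measures are closely linked. As a sketch under simplifying assumptions (the PRS is normally distributed with the same standard deviation in cases and controls, and the case mean is shifted by the log of the OR per SD), the AUC equals Φ(ln(OR per SD)/√2), where Φ is the standard normal distribution function. The function name and the example odds ratios are illustrative only.

```python
import math
from statistics import NormalDist


def auc_from_or_per_sd(or_per_sd: float) -> float:
    """AUC implied by an OR per SD, assuming a normally distributed PRS with
    equal variance in cases and controls (a simplifying assumption)."""
    mean_shift = math.log(or_per_sd)   # difference between case and control means, in SD units
    return NormalDist().cdf(mean_shift / math.sqrt(2))


# Illustrative odds ratios per SD, not taken from Table 1.
for or_per_sd in (1.2, 1.4, 1.6, 2.0):
    print(f"OR per SD {or_per_sd:.1f} -> AUC {auc_from_or_per_sd(or_per_sd):.3f}")
```

Under these assumptions even a sizeable OR per SD corresponds to a modest AUC, which helps explain why the AUC alone undersells the stratification achieved in the tails of the distribution.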

Table 1 Effect size and AUC of Polygenic Risk Scores

A limitation of many PRSs is that most of the SNPs they contain were discovered in European-descent populations, so their effects cannot be translated directly to other ethnicities. Studies are ongoing to define breast cancer–associated SNPs and evaluate the European-descent-derived PRSs in Asian and African-American populations.

For all PRS models, the AUC is modest, but should this alone preclude their application as an individual test to predict whether a woman will develop breast cancer or not? A comparison with gene panel testing, which is widely used in the clinic for this purpose, is illustrative. A PRS has been shown to be capable of stratifying women into different risk categories in a clinically meaningful way [8, 62, 73•, 74], but the most relevant clinical information of the PRS is in the extreme tails of the distribution. And because these tails concern the general population (as opposed to gene carriers only), the associated attributable risks of the PRS are in fact far greater than those achieved by gene panel testing. For example, the best performing PRS at this moment includes 313 SNPs with an association at a p value threshold about three orders of magnitude less stringent than genome-wide significance (P < 10⁻⁵). For this PRS, in the general population, 35% of all breast cancers occur in women in the highest quintile and only 9% of all breast cancers in the lowest quintile [57•]. Women in the top 1% of the PRS313 are at fourfold elevated risk relative to population average (95% CI 3.34–4.89), a risk level defined in many countries as ‘high’. In comparison, BRCA1 mutation carriers explain < 2% of all breast cancer in Western Caucasian populations [76] and comprise ~ 0.1% of the general population. Implementation research is ongoing to introduce the PRS into clinical genetic testing, e.g. in the Netherlands, Germany, the UK and the USA. An example of how individual PRS testing could aid risk counselling in the setting of familial breast cancer is shown in Fig. 1, which highlights how two individuals who would otherwise have received the same risk assessment (sisters in generation IV) on the basis of their identical family history are clearly classified into distinct risk classes on the basis of their PRS313.
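The share of cases arising in each quintile of the PRS can be approximated with a simple normal model, sketched below. The assumptions are that the standardised PRS is normally distributed in the population, that breast cancer is rare enough for the population and control distributions to coincide, and that cases have the same spread with a mean shifted by the log of the OR per SD. The OR per SD of 1.61 is an assumed value for illustration and is not stated in this text; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def share_of_cases(lower_pct: float, upper_pct: float, or_per_sd: float) -> float:
    """Fraction of all cases whose PRS lies between two population percentiles
    (given as fractions between 0 and 1), under a normal model."""
    population = NormalDist(0.0, 1.0)              # standardised PRS in the population
    cases = NormalDist(math.log(or_per_sd), 1.0)   # same spread, mean shifted by log(OR per SD)
    below_upper = 1.0 if upper_pct >= 1 else cases.cdf(population.inv_cdf(upper_pct))
    below_lower = 0.0 if lower_pct <= 0 else cases.cdf(population.inv_cdf(lower_pct))
    return below_upper - below_lower


OR_PER_SD = 1.61  # assumed for illustration, not taken from this text
quintiles = [(i / 5, (i + 1) / 5) for i in range(5)]
for lower, upper in quintiles:
    share = share_of_cases(lower, upper, OR_PER_SD)
    print(f"PRS percentile {lower:.0%}-{upper:.0%}: {share:.1%} of cases")
```

With this assumed effect size the model places roughly a third of cases in the highest quintile and under a tenth in the lowest, broadly in line with the proportions quoted above; the extreme tails are more sensitive to departures from normality and are better taken from empirical estimates.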

Fig. 1 Standardised Polygenic Risk Scores for breast cancer cases and their female relatives. In this non-BRCA1/2 breast cancer family, multiple family members were genotyped by SNP array. For all genotyped individuals, the SNP313 Polygenic Risk Score (PRS) was calculated. The individual PRSs are standardised to population controls in the BCAC dataset (mean = 0 and SD = 1 in controls). The numbers in the figure are therefore Z-scores of the individual PRSs. A higher Z-score indicates a higher breast cancer risk.
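The standardisation described in the caption amounts to a Z-score transformation against the control distribution. A minimal sketch follows; the function name and the control values are hypothetical, and a real reference set would contain many thousands of controls.

```python
from statistics import mean, stdev


def standardise_prs(raw_prs: float, control_prs: list) -> float:
    """Express a raw PRS as a Z-score relative to population controls,
    so that controls have mean 0 and SD 1."""
    control_mean = mean(control_prs)
    control_sd = stdev(control_prs)   # sample standard deviation of the controls
    return (raw_prs - control_mean) / control_sd


# Hypothetical raw PRS values in a (far too small) set of controls.
controls = [0.10, 0.30, 0.50]
print(round(standardise_prs(0.50, controls), 2))
```

A Z-score of 0 then corresponds to the control average, and positive values indicate a PRS above that average.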

Another potential application of the PRS is in deciding when and how frequently women should undergo breast cancer screening [6•, 77]. In most countries running such screening programs, women are offered screening from a certain age, usually between 45 and 50, when their breast cancer risk exceeds a level at which screening is considered cost-effective. Women in the lowest quintile of the PRS313 in fact never reach that threshold, whereas those in the highest quintile attain this level of risk before age 40 years [57•].

Hormonal, Environmental and Lifestyle Risk Factors

A number of non-genetic risk factors are presently firmly established as being associated with breast cancer. Besides age, these include physical factors such as body height and weight [78, 79]. For weight, the effect on breast cancer risk depends on menopausal status. Weight gain and obesity (BMI > 30) after menopause are associated with an increase in postmenopausal breast cancer [78]. It is likely that higher oestrogen levels underlie this effect in postmenopausal women [80]. A higher mammographic density, due to a high proportion of connective and glandular tissue relative to adipose tissue, leads to a higher risk of breast cancer [81, 82]. Hormonal factors influencing breast cancer risk include the use of oral contraception and hormone replacement therapy (HRT) [83, 84], as well as age at menarche and menopause [85]. Reproductive history (age at first childbirth or nulliparity) may have a similar impact on mammary gland biology [82, 86]. Lifestyle factors such as alcohol use and smoking increase breast cancer risk as well, while physical activity and breastfeeding seem to act protectively [87,88,89]. Finally, a personal history of benign breast disease also signifies an increased breast cancer risk [82].

Combining Risk Factors

Since any woman has, at a given moment in time, only a single risk of developing breast cancer over the course of her life, genetic and non-genetic risk factors must somehow combine to define that risk. A major challenge for individual breast cancer risk prediction, therefore, is to design risk calculation models that accommodate all known risk factors, which requires knowledge of how these factors interact. Through the large international consortia such as BCAC, data to design and validate such models are now forthcoming. There are now much more accurate estimates of how the PRS can modify the breast and ovarian cancer risks conferred by pathogenic variants in BRCA1, BRCA2, and CHEK2 [72•, 73•] (Table 1). This can help inform choices and timing of preventive surgery or chemoprevention. The combined effect of the 1100delC variant in CHEK2 and the PRS appears to follow a simple multiplicative model, but the per SD hazard ratio estimates in BRCA1 and BRCA2 carriers were smaller than those in the general population (Table 1). In BRCA1 carriers, the PRS based on SNPs associated with ER-negative disease showed a much stronger association with breast cancer risk in comparison with the ER-positive PRS, consistent with the predominant ER-negative tumour subtype in BRCA1 carriers [73•]. These issues highlight the complexity of some of these interactions and underscore the necessity of large prospective cohort studies to validate these models. A similar deviation from simple multiplicative interactions has been found for individuals with rare pathogenic variants in more than one breast cancer–associated gene [90]. There is limited evidence for interaction between SNPs and lifestyle/hormonal factors [91]. For environmental factors (e.g. reproductive factors, BMI and alcohol intake), the PRS can, in general, be combined in a multiplicative way [92].
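The simple multiplicative case can be sketched as follows. All function names and numerical values are hypothetical, and the sketch deliberately ignores the departures from multiplicativity described above for BRCA1 and BRCA2 carriers and for carriers of more than one pathogenic variant.

```python
import math


def prs_relative_risk(z_score: float, or_per_sd: float) -> float:
    """Relative risk implied by a standardised PRS under a log-linear model."""
    return or_per_sd ** z_score


def combined_relative_risk(relative_risks) -> float:
    """Combine independent risk factors multiplicatively,
    i.e. additively on the log scale."""
    return math.exp(sum(math.log(rr) for rr in relative_risks))


# Hypothetical woman: a moderate-risk variant, a PRS one SD above the mean,
# and one lifestyle factor. None of these values are taken from the text.
variant_rr = 2.3
prs_rr = prs_relative_risk(z_score=1.0, or_per_sd=1.6)
bmi_rr = 1.2
print(f"Combined relative risk: {combined_relative_risk([variant_rr, prs_rr, bmi_rr]):.2f}")
```

Where the data show a weaker PRS effect in carriers, the appropriate per SD estimate for that carrier group would have to replace the population value used here.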

Breast Cancer Risk Prediction Models

Currently, predicting whether a woman will develop primary breast cancer or not is mainly done within Cancer Family Clinics. Healthy women who are worried because of their family history of breast cancer can be referred by their general practitioner to such a clinic; alternatively, breast cancer patients with a clear family history are referred by oncologists, also because of the potential impact a gene diagnosis may have on their therapeutic options. The major incentive behind these referrals is the possibility to detect a high-risk variant in BRCA1, BRCA2 and, more recently, PALB2. As set forth above, however, such variants are found in < 10% of all referred families. For women from non-BRCA1/2 breast cancer families, breast cancer risk is often based on family history alone, although more than 20 risk prediction algorithms known today [93] include other risk factors as well. Some well-known risk prediction algorithms are the Gail model, BRCAPRO, Tyrer-Cuzick and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA). The most appropriate model depends on what is to be predicted and for which population.

The Gail model predicts breast cancer lifetime risks for women older than 35 years and is widely studied and validated. It includes hormonal risk factors, breast biopsies and affected first-degree relatives [93, 94]. The Chen model extends this by incorporating mammographic breast density as well [95]. The BRCAPRO model calculates breast cancer lifetime risks and the risk of contralateral breast cancer. The calculation is based on family history, the prevalence of BRCA1 and BRCA2 pathogenic variants, population incidence rates and pathological markers for breast cancers [96]. The Tyrer-Cuzick model incorporates hereditary (first- and second-degree relatives with breast or ovarian cancer), hormonal and environmental risk factors (age, BMI, menarche, reproductive factors, menopause and HRT) and pathological variables (breast biopsies and benign breast pathology). Mammographic density will be incorporated in the model in an upcoming version [93]. BOADICEA calculates breast cancer lifetime risks and contralateral cancer risks for women with a family history of breast cancer [97]. The model includes tumour pathology characteristics, current cancer incidences and pathogenic variants in ATM, BRCA1, BRCA2, CHEK2 and PALB2. For BOADICEA, family history is not restricted to a number of relatives or a particular degree.

Several studies have shown an improved discriminative power between breast cancer cases and controls by combining the PRS with a breast cancer risk prediction tool [60, 63, 66, 69]. In one study [62], new breast cancer lifetime risks for women from breast cancer families were calculated by adding the PRS to family-based risk prediction. For up to 23% of the women, screening recommendations, as stipulated by local management guidelines, could change.

The BOADICEA model has recently been extended to accommodate a broad range of genetic and non-genetic risk factors for breast cancer, adding mammographic density, reproductive factors, age at menarche and menopause, use of hormones, BMI, body height, alcohol use and the SNP313 PRS to the previous version [98•]. This is the first time that so many factors have been combined into a single model. Unsurprisingly, the potential for risk stratification was the greatest when all risk factors were used for risk prediction. Of all factors, the PRS made the largest contribution to risk stratification. Without knowledge of the genetic status of a woman for the rare genes, or family history, the lifetime breast cancer risk varied from 2.8% for the lowest to 30.6% for the highest percentile of the PRS. The model assumes that the risk factors and the PRS act multiplicatively, consistent with evidence from previous studies but not yet formally demonstrated for PRS313. Similarly, the assumption that the PRS313 combines multiplicatively with the effects of rare truncating variants in the five breast cancer genes will need validation. Finally, the current BOADICEA model uses population breast cancer risks of several countries but risk factor distributions from the UK only, and may therefore require tailoring for application in other populations.

Conclusion

Approximately half of the familial relative risk of breast cancer can be explained by the genes and variants identified over the past three decades. In order to be able to maximally exploit a woman’s genomic data for breast cancer risk prediction, we will have to detect the genetic factors underlying the remaining half. To do so, researchers must face the conundrum of genome-wide significance and costs. When the analysis is restricted to protein-coding regions by whole-exome sequencing, a so-called burden-type association analysis (counting presumable loss-of-function variants in cases and controls) using a Bonferroni-corrected significance level of p < 2.5 × 10⁻⁶ will require data on at least 10,000 cases and 10,000 controls to be sufficiently powered. For whole-genome sequencing, not only are the costs per sample several-fold higher than for exome sequencing, but genome-wide significance is also at least 50-fold more stringent, requiring many more samples to be analysed. In addition, functional annotation of intronic and intergenic variants, to guide which variants to include in the association analysis, is still in its infancy.
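The thresholds quoted here follow from simple arithmetic, sketched below. Two inputs are assumptions not stated in the text: that roughly 20,000 genes are tested in a gene-based burden analysis, and that the conventional genome-wide significance threshold of 5 × 10⁻⁸ applies to whole-genome analyses.

```python
import math

ALPHA = 0.05               # overall type I error rate
N_GENES = 20_000           # assumed number of genes tested in a burden analysis
GWAS_THRESHOLD = 5e-8      # conventional genome-wide significance level (assumption)

# Bonferroni correction: divide the overall error rate by the number of tests.
exome_threshold = ALPHA / N_GENES
fold_more_stringent = exome_threshold / GWAS_THRESHOLD

print(f"Exome-wide burden threshold: {exome_threshold:.1e}")
print(f"Genome-wide threshold is {fold_more_stringent:.0f}-fold more stringent")
```

Under these assumptions the gene-based threshold and the fold difference agree with the figures given above.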

Epidemiology has firmly established and quantified the role of many non-genetic factors in causing breast cancer. Currently, computational models are being built that integrate available knowledge so as to allow highly personal risk estimates. Ultimately, such models will empower women to exploit these risk estimates and take appropriate actions to lower this risk (many risk factors are modifiable). While there are many challenges still to overcome (particularly the lack of evidence to demonstrate improved clinical or economic outcomes), the use of genomic and personal lifestyle data in breast cancer prediction seems imminent.