Individualized risk prediction is a major goal of clinical genetics. For prediction of breast cancer risk, family history is an important factor: women with an affected first-degree relative have a twofold increased risk [1], while women from families with multiple breast/ovarian cancers or cancers diagnosed at a younger age experience a much higher risk of developing the disease [2]. Rare mutations in high- and moderate-risk genes including BRCA1, BRCA2, TP53, PTEN, PALB2, and CHEK2 explain about 20 % of the familial relative risk for breast cancer [3]. A polygenic component comprising many variants of small effect contributes to the risk of developing the disease in the general population and may also modify the risk in cancer families [3–5].

Over the last few years, genome-wide association studies (GWAS) have been successful in identifying some of the common low-penetrance variants predisposing to breast cancer [6–8]. To date, more than seventy variants have been identified, which together explain about 14 % of the familial risk of breast cancer [5, 6]. Individually, the effect sizes associated with these common variants are small. However, their combined effect, summarized as a polygenic risk score (PRS), is more substantial [5]. In a recent population-based case–control study, eight percent of women at the high end of the PRS distribution were found to fall into a group of intermediate life-time risk (17–30 %) according to the UK NICE guidelines [9]. In recent studies, the PRS has been tested in combination with other risk prediction methods, such as BOADICEA and BRCAPRO [10], mammographic density (BI-RADS) [11], and a combination of family history and established risk factors (BCRAT and IBIS) [10].

The contribution of the PRS to disease risk for individuals with a family history of breast cancer and within breast cancer families has not been studied extensively. Here, we investigate the association between a 75-variant PRS and disease status in individuals with and without family history in a large Finnish case–control study and in 52 Finnish breast cancer families, for which extensive pedigree information is available and which have been well characterized genetically and pathologically. We use a family history score based on the BOADICEA risk prediction algorithm to evaluate whether the PRS predicts disease status among women sharing similar family history, and discuss the clinical utility of the PRS for risk prediction in familial breast cancer.

Patients and methods

Study subjects

We included two separate sets of study subjects in the analyses. The case–control dataset consisted of

  (i) three series of consecutive, unselected breast cancer patients (n = 1303) enrolled for their first primary breast cancer at the Helsinki University Central Hospital during 1997–1998, 2000, and 2001–2004, as described previously [12–14],

  (ii) additional index cases (n = 378) with a positive family history of breast cancer (one per family), who had tested negative for germline mutations in the high-penetrance susceptibility genes BRCA1 and BRCA2, from an ongoing collection started in 1995 at the Helsinki University Central Hospital, Department of Clinical Genetics [15, 16], and

  (iii) healthy population controls (n = 1272; blood donors, Finnish Red Cross) (Supplementary Table 1).

In the case–control dataset, the index cases with positive family history of breast cancer were categorized into two groups based on whether they came from a family with two breast cancers in first-degree relatives (later referred to as “small families”) or from a family with three or more cases of breast or ovarian cancer in first- or second-degree relatives (later, “large families”).

The breast cancer family dataset consisted of 493 genotyped study subjects (427 women) and registry data for a further 3992 family members from 52 Finnish families with multiple cases of breast cancer, collected systematically as described previously [16] (Supplementary Table 1). Families were selected to maximize the number of informative study subjects (women with breast cancer) with an available DNA sample. The families were traced back to find the most recent common ancestors of the breast cancer patients and then traced forward, including in the pedigree all the descendants of the most recent common ancestors according to records of church parish registries, the Population Register Center, and the Finnish Cancer Registry. Age and disease status were ascertained up to 31st December 2010. The index case of each family had previously been tested for germline mutations in BRCA1 and BRCA2 and was found to be negative [16]. The number of family members varied between 22 and 356 (median 57.5) (Supplementary Table 2). The median proportion of affected women born between 1910 and 1970 was 22 % (Supplementary Table 2). The mean follow-up age of genotyped healthy women was 60.3 years, and the mean diagnosis age of genotyped breast cancer patients was 54.1 years. Seven pedigrees originated from two unrelated founder couples. One of the moderate-penetrance mutations, CHEK2:c.1100delC, PALB2:c.1592delT, or the putative moderate-penetrance mutation FANCM:c.5101C>T, was transmitted in seven families.


DNA was extracted from peripheral blood samples and genotyped at the CNIO genotyping unit, Madrid, or the Génome Québec Innovation Centre using a custom Illumina Infinium array, which was designed for the Collaborative Oncological Gene-environment Study (COGS). Data quality was monitored as described earlier [6]. Analyses were based on 75 variants reported to be associated (at P < 5 × 10⁻⁸) in the analysis of the COGS dataset or previous publications [5–8], with either overall breast cancer or ER-negative disease (Supplementary Table 3). All variants used in the risk score passed quality control filters. For the majority of variants, genotypes were missing in <0.1 % of individuals. Four variants used in the risk score (rs17879961, rs10941679, rs4973768, and rs13281615) had a missing rate of <0.4 %, and for one variant, rs2943559, 1.06 % of individuals had missing genotypes. Twelve samples with missing genotype calls for over 10 % of the susceptibility variants were excluded from further analyses.

In the analyses, the genotype of each variant was represented by allele dose (the number of copies of the rare allele). Missing genotypes were imputed for controls using the mean of the known genotype doses, and for cases using the following formula:

$$\text{imputed dose} = 2\left[ \frac{p\,\emptyset}{(1 - p) + p\,\emptyset} \right],$$

where Ø is the effect size (per-allele Odds Ratio) for association between that variant and breast cancer risk in the case–control dataset, and p is the minor allele frequency in controls for the missing variant.
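The two imputation rules described above can be illustrated with a minimal Python sketch. This is not the authors' code (the study used R); the function names and the example values are hypothetical and serve only to show the arithmetic of the formula.

```python
from statistics import mean


def impute_control_dose(known_doses):
    """Controls: use the mean of the observed allele doses for the variant."""
    return mean(known_doses)


def impute_case_dose(maf_controls, per_allele_or):
    """Cases: expected dose 2 * p*OR / ((1 - p) + p*OR).

    maf_controls  -- minor allele frequency p in controls
    per_allele_or -- per-allele odds ratio (the effect size in the formula)
    """
    p = maf_controls
    weighted = p * per_allele_or
    return 2.0 * weighted / ((1.0 - p) + weighted)


# Hypothetical example values, not taken from the study data.
observed_control_doses = [0, 1, 0, 2, 1, 0]
print(impute_control_dose(observed_control_doses))
print(impute_case_dose(maf_controls=0.25, per_allele_or=1.2))
```

With an odds ratio of 1 the case formula reduces to 2p, the expected dose under no association; an odds ratio above 1 shifts the imputed dose for cases upward.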

Three moderate-penetrance mutations, which are relatively frequent in the Finnish population, CHEK2:c.1100delC, PALB2:c.1592delT, and FANCM:c.5101C>T, had previously been genotyped locally using a custom-made TaqMan assay (Applied Biosystems, Thermo Fisher Scientific Inc.), AmpliFluor fluorescent genotyping (KBiosciences), conformation sensitive gel electrophoresis heteroduplex conformation analysis and Sanger sequencing [17], or with the Sequenom MassARRAY system using iPLEX Gold assays, respectively, as described previously [18–20]. All study subjects of the case–control dataset were genotyped for all three moderate-penetrance mutations. In the breast cancer family dataset, however, the index case of each family was genotyped first. If a mutation was found, all family members were genotyped for that mutation. If no mutation was found, all family members were considered noncarriers.

Statistical analyses

Polygenic risk score

Statistical analyses were performed using R version 3.0.2 [21], including packages multiwayvcov [22] and lmtest [23]. A PRS summarizing the risk effects associated with 75 common breast cancer susceptibility variants [6–8] was calculated for each individual using the following formula: \(\mathrm{PRS} = \sum\limits_{i = 1}^{n} a_{i} \log (\mathrm{OR}_{i}),\) where n is the number of loci included in the model, \(a_{i}\) is the number of disease alleles at locus i, and \(\mathrm{OR}_{i}\) is the corresponding per-allele odds ratio for breast cancer (Supplementary Table 3). The ORs for each variant, used as weights in the PRS, were taken from previously published estimates from the Breast Cancer Association Consortium (BCAC) [5]. The case–control component of the present study was included in this estimation, but it comprised only a small fraction (3 %) of the total BCAC data available for estimation of the ORs. To avoid ascertainment bias, studies oversampling for cases with family history were excluded from this estimation. The PRS-values were standardized by the mean (0.50) and standard deviation (0.45) of the PRS-values in healthy population controls, so that odds ratios could be reported per unit standard deviation of the PRS. This PRS corresponded well to a recently introduced PRS of 77 common variants [5], although we did not include data for two imputed variants on 11q13 (rs78540526 and rs75915166). When study subjects were categorized into centiles of the PRS distribution, the centile boundaries were defined on the basis of the PRS distribution in healthy population controls.
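As an illustration of the score construction described above, the following Python sketch computes a raw PRS, standardizes it against control values, and derives centile boundaries from a control distribution. It is a sketch under stated assumptions rather than the study's R code: the function names, the toy doses and odds ratios, and the choice of the "inclusive" empirical quantile method are all hypothetical.

```python
import math
from statistics import quantiles


def raw_prs(doses, odds_ratios):
    """Sum over loci of allele dose times log per-allele odds ratio."""
    return sum(a * math.log(o) for a, o in zip(doses, odds_ratios))


def standardize(prs, control_mean, control_sd):
    """Express a PRS in units of the control standard deviation."""
    return (prs - control_mean) / control_sd


def centile_cutoffs(control_prs, percents):
    """PRS values at the given percentiles of the control distribution.

    Assumption: empirical quantiles with linear interpolation
    ("inclusive" method); the study does not state which estimator was used.
    """
    cuts = quantiles(control_prs, n=100, method="inclusive")
    return [cuts[p - 1] for p in percents]


# Hypothetical three-locus example (the study used 75 variants).
doses = [1, 2, 0]
odds_ratios = [1.10, 1.05, 1.26]
score = raw_prs(doses, odds_ratios)

# Control mean and SD as reported in the text (0.50 and 0.45).
print(standardize(score, control_mean=0.50, control_sd=0.45))
```

Defining the cutoffs on controls only, as in the text, keeps the categories comparable between datasets that differ in their proportion of cases.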


Familial history was quantified with a single quantity, which we term a BOADICEA score. This was calculated for all genotyped women (n = 427) in the 52 breast cancer families by applying the BOADICEA risk prediction algorithm [24]. The BOADICEA algorithm is implemented in a modified version of MENDEL [25], coded in FORTRAN 90. BOADICEA is a genetic model for breast and ovarian cancer that models risk in terms of BRCA1 and BRCA2 mutations and a polygene representing a large number of loci of small effect to capture the residual familial aggregation of breast cancer. The polygenotype is assumed to be normally distributed, but this is implemented using a binomial approximation, with transmission determined by a discrete hypergeometric polygenic model [26]. Thus, BOADICEA codes the polygenotype as a discrete variable (0–6). The polygenic score for each discrete polygenotype is the mean value from a standard normal distribution over the same probability interval. The BOADICEA algorithm can compute the posterior probability for each polygenotype, conditional on their family history. The overall BOADICEA score used in this analysis was calculated by summing the scores for the discrete polygenotypes weighted by the posterior probability. Hence, the BOADICEA score is an estimate of the mean polygenotype for the individual, given their family history (and hence would predict their future cancer risk, although it is not identical to the cancer risk output by the BOADICEA program, which also depends on age). The score was calculated assuming women were tested negative for the mutations in the BRCA1 and BRCA2 genes, with sensitivities of 0.7 and 0.8 for BRCA1 and BRCA2 testing, respectively. The study subject (target) was coded uninformative with regard to age and disease status (in effect, computing the mean polygenotype at birth, given the family history). 
Other available information on the birth year, age, and cancer diagnoses of her relatives was included in the calculations as in the BOADICEA data form. To standardize the data, only first-, second-, and third-degree relatives of the same and older generations were included in the input pedigrees.
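The construction of the BOADICEA score as a posterior-weighted mean can be sketched as follows. This Python example is illustrative only and is not the BOADICEA or MENDEL implementation: it assumes that the seven discrete polygenotypes (0–6) have Binomial(6, 1/2) prior probabilities, assigns each state the mean of a standard normal distribution over the matching probability interval, and takes a hypothetical posterior distribution as input. In the study, the posterior probabilities were produced by the BOADICEA algorithm itself.

```python
from math import comb
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def polygenotype_scores(n_alleles=6):
    """Return (prior probabilities, scores) for states 0..n_alleles.

    Assumption: Binomial(n_alleles, 1/2) state probabilities. The score of
    a state is the mean of a standard normal truncated to the probability
    interval of that state: (pdf(lower) - pdf(upper)) / probability.
    """
    probs = [comb(n_alleles, k) / 2 ** n_alleles for k in range(n_alleles + 1)]
    scores = []
    lower_cum, lower_pdf = 0.0, 0.0  # density vanishes at minus infinity
    for k, prob in enumerate(probs):
        upper_cum = lower_cum + prob
        if k == n_alleles:
            upper_pdf = 0.0  # density vanishes at plus infinity
        else:
            upper_pdf = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(upper_cum))
        scores.append((lower_pdf - upper_pdf) / prob)
        lower_cum, lower_pdf = upper_cum, upper_pdf
    return probs, scores


def mean_polygenotype(posterior, scores):
    """Posterior-weighted mean of the state scores."""
    if abs(sum(posterior) - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    return sum(w * s for w, s in zip(posterior, scores))


priors, scores = polygenotype_scores()
# Hypothetical posterior shifted toward higher polygenotypes.
posterior = [0.00, 0.02, 0.10, 0.28, 0.32, 0.20, 0.08]
print(mean_polygenotype(posterior, scores))
```

Under these assumptions, a woman whose posterior equals the prior receives a score of zero, and a family history that shifts probability toward higher polygenotypes gives a positive score.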

Association analyses

Association between breast cancer risk and the PRS was modeled using logistic regression. Analysis performed within the breast cancer families (affected vs. healthy women) was adjusted for age (diagnosis age for cases, follow-up age for healthy women) and family history as summarized by the BOADICEA score, and additionally for moderate-penetrance mutations in CHEK2, PALB2, and FANCM (separate terms for each mutation, coded as 0, noncarriers; 1, heterozygotes; 2, homozygotes). P-values and 95 % confidence intervals (CI) were corrected using robust variance estimation, clustering the study subjects in families. Association between PRS and the tumor estrogen receptor (ER) expression was assessed with logistic regression after categorizing the study subjects into quintiles of the PRS distribution. Association between PRS and age at diagnosis was examined by linear regression. Pearson’s r was used as a measure of correlation in comparisons between PRS and BOADICEA scores.

Results


Association between PRS and case–control status

The combined effect of 75 variants with confirmed associations with breast cancer risk was captured by constructing a PRS for each individual. The PRS was associated with an increased risk of breast cancer in the unselected patient series and within breast cancer families (Table 1, Supplementary Fig. 1). When healthy population controls were used as a reference group, the odds ratio (OR) for association between the PRS and breast cancer risk was lower for sporadic cases (OR: 1.41, 95 % CI [1.30–1.54]) than for cases with a positive family history, but did not differ between index cases of “small” (OR: 1.85 [1.63–2.11]) and “large” families (OR: 1.81 [1.59–2.06]). Similarly, when comparing affected women from the 52 breast cancer families to healthy population controls, the PRS was associated with disease risk (OR: 1.82 [1.55–2.13]), but no difference was seen by the number of affected first-degree relatives (Table 1, Supplementary Fig. 1c). The PRS was significantly higher among healthy women in the 52 breast cancer families than among population controls (OR: 1.29 [1.12–1.48]), consistent with an association between PRS and positive family history of breast cancer (Table 1, Supplementary Fig. 1b).

Table 1 Association between the PRS and breast cancer risk in case–control and breast cancer family datasets

PRS as a risk predictor in breast cancer families

The correlation between PRS and BOADICEA score was 0.15 in affected and 0.0099 in healthy women. The association between the PRS and breast cancer risk was investigated in 427 women from 52 breast cancer families. Healthy female family members were considered as the reference group in this analysis. The OR for association between the PRS and breast cancer in these families was 1.55 [1.26–1.91] (P: 3.3E−5), adjusting for age and the family history (BOADICEA score). When the model was further adjusted for the moderate-penetrance mutations, the PRS OR was 1.59 [1.28–1.98] (P: 2.9E−5, Supplementary Table 4).

Previously, the relevance of the PRS in risk prediction has been evaluated by deriving the relative and absolute risk estimates for women at specific percentiles of the PRS distribution [5]. We also assigned the affected and healthy women from the breast cancer families into centiles of the PRS distribution based on the healthy population controls. The ORs for breast cancer estimated in the 52 families were similar in all categories to those obtained in the BCAC dataset of population-based case–control studies [5], except for the lowest ten percent, where the number of study subjects was quite low and the confidence interval included the published estimate (Fig. 1). The OR for women in the highest ten percent was 2.71 [1.46–5.03] (P = 0.0015).

Fig. 1 Estimated effect sizes (odds ratios, OR, with confidence intervals, CI) by percentile of the PRS in the 52 breast cancer families

Estrogen receptor status and age at diagnosis

Among cases, lower PRS-values were associated with a higher proportion of ER-negative disease in both nonfamilial (P = 0.00082) and familial index cases (P = 0.023), but this trend was not observed within breast cancer families (P = 0.22) (Supplementary Table 5). No significant association between PRS and age at diagnosis was seen in our case–control or breast cancer family dataset.

Discussion


In this paper, we have investigated the potential of the PRS to improve risk stratification in the setting of familial breast cancer. The PRS we constructed used information from 75 variants, almost all breast cancer susceptibility variants known to date. We tested for association of this PRS with breast cancer risk among different groups according to their family history of breast cancer. The PRS was on average higher in patients with a family history of breast cancer in a first-degree relative than in sporadic cases in the case–control dataset (Table 1). We saw neither a difference in the PRS between the “large” and “small” families nor an increasing trend with the number of affected first-degree relatives within the family dataset. Epidemiological studies have reported the life-time risk for women with two affected first-degree relatives to be higher than for women with only one affected relative (21.1 vs. 13.3 %) [1]. This difference was not reflected in a change in PRS, although power for this comparison was limited in our dataset (Table 1, Supplementary Fig. 1). The PRS was significantly associated with increased risk of breast cancer within the breast cancer families, when comparing affected to healthy women. Furthermore, the PRS was higher among the healthy women of the breast cancer families than among healthy controls from the population (Table 1, Supplementary Fig. 1), supporting the notion that common genetic risk variants cluster in breast cancer families.

Our 52 breast cancer families comprised 427 genotyped and ~4000 nongenotyped individuals. The proportion of affected women in the breast cancer families ranged between 6 and 67 % per family (median 22 %) (Supplementary Table 2). Two-thirds of the healthy women of the dataset were first-degree relatives of the breast cancer cases, and the remainder second- or third-degree relatives. The BOADICEA risk prediction algorithm is able to capture complex family structure, including information on more distant relatives as well as information on ages of diagnosis or interview, and is more informative than simpler measures of family history. We therefore calculated a ‘BOADICEA score’ as an estimate of the mean polygenotype for each individual, given their family history. The correlation between our BOADICEA score and the PRS in affected women from the 52 breast cancer families was 0.15, consistent with the relatively small fraction of the heritability explained by these variants. As expected, the correlation among the healthy women was lower (Pearson’s r = 0.0099). These estimates are somewhat higher than that reported recently between PRS and BOADICEA risk prediction output (0.01 [−0.05–0.07]) by an Australian study [10]. This could be explained partly by the different study designs, as the Australian study included a large number of sporadic cases, and any pedigree data were collected by interview, whereas we collected family data systematically from population and cancer registries. Furthermore, they included in the correlation both breast cancer cases and healthy controls, which may have masked an existing correlation in cases.

The PRS was significantly associated with breast cancer risk in a logistic regression model adjusted for age and the BOADICEA score (OR: 1.55 [1.26–1.91]). The magnitude of the effect size associated with the PRS was consistent with the estimate made in the unselected series of the case–control dataset (OR: 1.49 [1.38–1.62]) and with the estimate reported in a recent population-based study (OR: 1.55 [1.52–1.58]) [5], and provides further support for using the PRS in risk prediction. Furthermore, when the model was adjusted for the moderate-penetrance mutations in CHEK2, PALB2, and FANCM, the OR associated with the PRS was very similar (1.59 [1.28–1.98]), supporting a multiplicative mode of interaction between the low- and moderate-penetrance genetic variants. This has been demonstrated explicitly for CHEK2:c.1100delC and the PRS (Muranen et al. in review), but further analyses in larger datasets would be needed to evaluate such interactions for the FANCM and PALB2 variants.

A recent study examined the utility of the PRS for risk prediction in breast cancer families in the BCFR [27]; our results are broadly in agreement with the conclusions of this prospective study. This study used only 24 variants rather than the full complement of 75 (or 77 as in [5]), so we would have expected to see a larger effect size in our study. However, we were not able to directly compare the effect sizes reported, as the BCFR-study was prospective in design, used Cox regression for modeling, and the effect of the PRS was studied in the context of 10-year BOADICEA risk estimates. In addition, individuals with moderate-penetrance mutations were excluded from their analyses, while these individuals were included in our analyses.

Our results on subtype-specific associations are consistent with previous observations that most common genetic variants are more strongly associated with ER-positive disease, while fewer ER-negative specific variants have been identified. Although pathology information in the familial study was limited, we noted a higher proportion of ER-negative cancers among women with the lowest PRS in this study. Notably, none of these cancers were from carriers of PALB2 or FANCM mutations, which have been previously associated with ER-negative disease [19, 20]. It would be of interest to study the pathology of breast cancers within families and evaluate any correlations in the context of the PRS.

The effect sizes associated with the PRS can be used to derive estimates of the absolute risk of breast cancer according to PRS [5]. In women with a family history of breast cancer, the baseline absolute risk of developing breast cancer is higher. Our observations indicate that the relative risks associated with the PRS are similar to those in the general population, consistent with a model in which the PRS multiplies the effects of other familial risk factors. Hence, the effect of the PRS on the absolute risk of disease will be much greater than in the general population. For example, if the pedigree-based familial risk for a woman was about 17 % [9], and her PRS was in the top 20 % of the PRS distribution, the combined risk would be 30.9 %, moving her from the intermediate- to the high-risk category according to UK NICE guidelines [9]. By comparison, women with no data on family history would have to be in the top centile of the PRS to have the same absolute risk (~30 %). We did not find support for a protective effect associated with low PRS-values (Fig. 1). This may reflect low power, as few women have a very low PRS in the breast cancer families. Alternatively, it might be explained by the presence of yet-unidentified moderate/high-risk genetic or shared environmental factors [28].
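The arithmetic behind combining a baseline risk with a relative risk can be made explicit with a short Python sketch. It does not reproduce the authors' calculation, which is not detailed in the text; it only shows which relative risk the two quoted figures (17 % and 30.9 %) would imply under two common simplifications. Both simplifications, and all function names, are assumptions: the first multiplies the absolute risk directly, the second scales the cumulative hazard, and both ignore age and competing mortality.

```python
import math


def combined_risk_product(baseline_risk, relative_risk):
    """Simplest approximation: absolute risk times relative risk (capped at 1)."""
    return min(1.0, baseline_risk * relative_risk)


def combined_risk_hazard_scaled(baseline_risk, relative_risk):
    """Scale the cumulative hazard: 1 - (1 - baseline) ** relative_risk."""
    return 1.0 - (1.0 - baseline_risk) ** relative_risk


def implied_rr_product(baseline_risk, combined_risk):
    """Relative risk that maps baseline to combined under the product rule."""
    return combined_risk / baseline_risk


def implied_rr_hazard_scaled(baseline_risk, combined_risk):
    """Relative risk that maps baseline to combined under hazard scaling."""
    return math.log(1.0 - combined_risk) / math.log(1.0 - baseline_risk)


# Figures quoted in the text: 17 % familial risk, 30.9 % combined risk.
baseline, combined = 0.17, 0.309
print(round(implied_rr_product(baseline, combined), 2))
print(round(implied_rr_hazard_scaled(baseline, combined), 2))
```

Either way, the same relative risk applied to a lower population baseline yields a much smaller absolute change, which is the point made in the paragraph above.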


Our results suggest that it would be valuable to combine the PRS with pedigree-based risk estimation in the context of familial disease.