Introduction

Breast cancer is the most common cancer among females in high-, middle- and low-income countries and it accounts for 23% of all new female cancers globally [1, 2]. While there has been a significant reduction in mortality, incidence rates have continued to rise [3]. Breast cancer incidence rates are high in North America, Australia, New Zealand, and Western and Northern Europe. It has intermediate levels of incidence in South America, Northern Africa, and the Caribbean but is lower in Asia and sub-Saharan Africa [1].

Early detection of breast cancer improves prognosis and increases survival. Mammographic imaging is the best method available for early detection [4] contributing substantially in reducing the deaths caused by breast cancer [5]. Unfortunately mammography mass screening still leads to some levels of over-diagnosis and over-treatment [6]. As yet routine mammography screening is not readily available globally, particularly in some developing countries [7, 8]. This is supported by the observations that for every million adult women there are only four mammogram screening machines in Sudan has four mammogram machines, whereas Mexico has 37 and Canada has 72 [9]. Under these circumstances, it is clearly more appropriate to prioritize access to mammographic screening or other targeted interventions (such as tamoxifen chemoprevention) for higher-risk individuals who could be identified using a sensitive and specific risk prediction model [10]. Such risk prediction models are individualized statistical methods to estimate the probability of developing certain medical diseases. This is based on specific risk factors in currently healthy individuals within a defined period of time [11]. Such prediction models have a number of potential uses such as planning intervention trials, designing population prevention policies, improving clinical decision-making, assisting in creating benefit/risk indices and estimating the burden cost of disease in population [10].

A general case can also be made for using risk models for certain diseases. For example, their use can allow the application of risk-reducing interventions that may actually prevent the disease in question. If their application can be based on use of existing health records this will avoid increasing levels of anxiety in at least low to moderate risk individuals. The National Cancer Institute of the USA (NCI) has confirmed that the application of “risk prediction” approaches has an extraordinary chance of enhancing “The Nation’s Investment in Cancer Research” [12]. This provides an explanation for the rapid increase in the number of models now being reported in the literature [11, 13]. It is clear that not all developed models are valid or can be widely used across populations. The minimum performance measures required for a useful and robust risk prediction model in clinical decision making are discrimination and calibration [14].

We recognize that risk models are increasingly now being used as part of a “triage” assessment for mammography and/or for receipt of other more personalized medical care. There is a growing interest in applying risk prediction models as educational tools.

The models developed can differ significantly with regard to; the specific risk factors that are included; the statistical methodology used to estimate, validate and calibrate risk; in the study design used; and in the populations investigated to assess the models. These differences make it essential that any assessment of model usefulness takes into account both their internal and external validity. Here, we focus on the reliability, discriminatory accuracy and generalizability of breast cancer risk models that exclude clinical (any variable which needs physician input e.g., presence of atypical hyperplasia) and any genetic risk factors. Accurate assessment of risk using easily acquired data is essential as a first stage of tackling the rising burden of breast disease globally. Well-validated models with high predictive power are preferable although this is not the case for all models. The usability of any model is dependent on the purpose the model will be used for and its target populations [15]. Furthermore, it has been suggested that adapting existing predictive models to the local circumstances of a new population rather than developing a new model for each time is a better approach [16].

This review focuses on breast cancer risk predicting models that incorporated modifiable risk factors and/or factors that can be self-reported. Such models could be applied as an educational tool and potentially used to advice at risk individuals on appropriate behavioural changes.

Methods

Databases

The following databases were searched for all related publications (up to July 2016): PubMed (https://www.ncbi.nlm.nih.gov/pubmed/); ScienceDirect (http://www.sciencedirect.com/); the Cochrane Database of Systematic Reviews (CDSR) (http://www.cochranelibrary.com/). Terms used for the search were “assessment tool, assessment model, risk prediction model, predictive model, prediction score, risk index, breast cancer, breast neoplasm, breast index, Harvard model, Rosner and Colditz model, and Gail model”. Risk models were retrieved based on any study design, study population or types of risk factors.

A Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach was applied for selecting reviewed articles [17]. A total of 61 genetic and non-genetic breast cancer risk models were identified and then filtered to include only risk models with non-clinical factors (Fig. 1). These models contain variables which are considered to be modifiable and/or self-reported by the respondents. For this review, 14 studies were eventually considered to be eligible. No literature reviews were found on breast cancer risk models solely focusing on epidemiological risk factors although all the selected reviews summarized generic composite risk models. The literature search was extended to include publications relating to systematic reviews and meta-analyses; this did not reveal any appropriate publications.

Fig. 1
figure 1

Identification of eligible risk models using PRISMA flowchart

Confidence in risk factors

Details relating to the degree of confidence in variables used as risk factors in the risk models were taken from the Harvard report [18]. The degree of confidence was categorized as either:

  • definite (an established association between outcome and exposure where chance, bias [systematic error], confounders [misrepresentation of an association by unmeasured factor/s] are eliminated with significant confidence)

  • probable (an association exists between the outcome and the exposure where chance, bias, confounders cannot be eliminated with sufficient confidence—inconsistent results found with different studies)

  • possible (inconclusive or insufficient evidence of an association between the outcome and the exposure)

Results

Potential risk factors included in breast cancer non-clinical predictive models

The variables used in the 14 models under review and specifies the degree of confidence (definite, probable or possible) in those variables as risk factors for breast cancer based on the current literature are summarized in Table 1.

Table 1 Breast cancer risk factors included in the 14 models

Age, age at first birth, age at menarche, family history of breast cancer, and self-reported history of biopsies were the most common variables used amongst the 14 models selected. These variables are considered as definite risk factors for developing breast cancer [18]. Other additional variables were observed in fewer models. These included ethnicity (Jewish—definite), definite hormonal replacement therapy, diet (some probable and others possible), physical activity (possible), height (definite), weight (probable- for pre-menopausal women and definite for post-menopausal women). Among pre-menopausal females, weight is considered to be a protective factor [19]. In contrast amongst post-menopausal women, weight is considered to be a risk factor [20,21,22] as is parity, oral contraceptive pill use (definite), pregnancy history, timing and type of menopause (definite), menstrual regularity (possible), menstrual duration and gestation period (probable), smoking (possible), mammogram screening (probable) and age of onset of breast cancer in a relative (definite).

The largest number of definite factors included in a model (n = 10 variables) was seen in the study reported by Colditz and Rosner [18]. This was followed by studies by reported by Park [23], Novotny [24] and Rosner [25]. We evaluated the number of the definite, probable and possible variables in the models to compare their performance based on the type and number of the variable included.

Evaluation measures of the risk models

The most important measures used to assess the performance of the models were considered to be as follows:

  • Calibration (reliability): the E/O statistic measures the calibration performance of the predictive model. Calibration involves comparing the expected versus observed numbers of the event using goodness-of-fit or chi square statistics. A well-calibrated model will have a number close to 1 indicating little difference between the E and O events. If the E/O statistic is below 1.0 then the event incidence is underestimated, while if the E/O ratio is above 1.0 then incidence is overestimated [14, 26].

  • Discrimination (precision): the C statistic (Concordance statistic) measures the discrimination performance of the predictive model and corresponds to the area under a receiver operating characteristic curve. This statistic measures how efficiently the model is able to discriminate affected individuals from un-affected individuals. A C-statistic of 0.5 indicates no discrimination between individuals who go on to develop the condition and those who do not. In contrast, a C-statistic of 1 implies perfect discrimination [27, 28]. Good discrimination is important for screening individuals and for effective clinical decision making [10].

  • Accuracy: is tested by measuring of ‘sensitivity’, ‘specificity’, ‘positive predictive value’ (PPV) and negative predictive value (NPV). All of these terms are defined in Table 2. These measures indicate how well the model is able to categorize specific individuals into their real group (i.e., 100% certain to be affected or unaffected). Accuracy is equally important for both individual categorisation and for clinical decision making. Nevertheless, even with good specificity or sensitivity, low positive predictive values may be found in rare diseases [10] as the predictive values also depend on disease prevalence. With high prevalence, PPV will increase while NPV will decrease [29].

  • Utility: this evaluates the ease with which the target groups (public, clinicians, patients, policy makers) can submit the data required by the model. Utility evaluation assesses lay understanding of risk, risk perception, results interpretation, level of satisfaction and worry [30]. This evaluation usually uses surveys or interviews [26].

Table 2 Formulas used to calculate the accuracy of the model

Calibration and discrimination were the most common measures used to assess the breast cancer risk models under review and these measures are summarized in Fig. 2. Internal calibration was performed in just three of the 14 models with values ranging from 0.92 to 1.08. These calibration values represented a good estimate of the affected cases using these models. For external calibration, six of the 14 models used an independent cohort. Rosner [25] and Pfeiffer [31] reported the highest with E/O values of 1.00 and followed by Colditz [18] with an E/O of 1.01.

Fig. 2
figure 2

Calibration and discrimination performances of the 13 breast cancer risk models

The C-Statistic values measuring internal discrimination ranged across studies from 0.61 to 0.65. The Park [23] model achieved the best outcome (C-Statistic = 0.64). Additionally, Park [23] showed the highest value with a C-Statistic of 0.89 when applied to subjects recruited from the NCC (National Cancer Centre) screening program. The lowest C-Statistic (0.56) was observed in the Gail model [32]. Overall, this demonstrates that the models have better calibration than discrimination. Accuracy was only evaluated in the Lee model [33]. Sensitivity, specificity and overall accuracy were calculated. The values indicate low accuracy with values ranging from 0.55 to 0.66 (Table 3).

In qualitative research relating to the impact and utility [34] of the Harvard Cancer Risk Index (HCRI) [18], nine focus groups (six female, three male) showed good overall satisfaction with HCRI. Participants appreciated both the detailed explanation and the updated inclusion of risk factors. On the other hand, some participants criticized the absence of what they considered to be important factors (e.g., environmental factors and poverty). Some participants believed that some of the factors on which subjects had been assessed might cause anxiety. It is also noted, however, that the case has been made that such anxiety provides motivation for action to mitigate risk [35].

Overview of current models

All the models described (except for Lee et al. 2004) [36] are extended versions of either the Gail model or the Rosner and Colditz model (Tables 4, 5). The Gail model developed in 1989 [37] was the first risk model for breast cancer and included the following variables: age, menarche age, age at first birth, breast cancer history in first-degree relatives, history of breast biopsies and history of atypical hyperplasia. The range of calibration of the Gail modified models was E/O = (0.93–1.17) and the discrimination range was C-Statistics = (0.56–0.65). This indicates that these models are well calibrated, although discrimination could be improved.

Table 3 Summary of the evaluation measures of the risk models
Table 4 Characteristic summary of the reviewed breast cancer risk models
Table 5 Models reviewed in this article

Ueda et al. [38] modified the Gail model by including age at menarche, age at first delivery, family history of breast cancer and BMI in post-menopausal women, as risk factors in his model for Japanese women. However, as with the original Gail model, no validation was performed. In the Boyle model [39], more factors were included such as alcohol intake, onset age of diagnosis in relatives, one of the two diet scores and BMI and HRT. This results in calibration with E/O close to unity and less acceptable discrimination of C-stat = 0.59. The Novotny model [24] added the number of previous breast biopsies performed on a woman and her history of benign breast disease. However, no validation assessment was performed for this model. Newer models [32, 40, 41] included the number of benign biopsies. This resulted in acceptable calibration but less acceptable discrimination (Gail [32]: E/O = 0.93; C-stat = 0.56; Matsuno: E/O = 1.17, C-statistic = 0.614; and Banegas E/O = 1.08). Park et al. [23] included menopausal status, number of pregnancies, duration of breastfeeding, oral contraceptive usage, exercise, smoking, drinking, and number of breast examinations as risk factors. This model has an E/O = 0.965; C-stat = 0.64. However, the C-statistic reported from the external validation cohort was high compared to the original C-statistic. They reported a C-statistic of 0.89 using the NCC cohort. This discrepancy was claimed to be caused by the population characteristics (participants were 30 years and above, recruited from cancer screening program, from a teaching hospital in an urban area) [23]. In the same year, Pfeiffer et al. [23] developed a model where parity was considered as a factor and had E/O of 1.00 and a C-statistic of 0.58. The later Gail model published in 2007 used logistic regression to derive relative risks. These estimates are then combined with attributable risks and cancer registry incidence data to obtain estimates of the baseline hazards [32].

The Rosner and Colditz model of 1994 [42] was based on a cohort study of more than 91,000 women. The model used Poisson regression (rather than logistic regression as in the Gail model). The variables were as follows: age, age at all births, menopause age, and menarche age. This model was not validated. A new version in 1996 [25] included one modification (current age was excluded) and gave an E/O = 1.00 and a C-statistic = 0.57. In 2000, Colditz et al. [18] modified the model with risk factors for: benign breast disease, use of post-menopausal hormones, type of menopause, weight, height, and alcohol intake. This model gave an E/O = 1.01; C-statistic = 0.64.

Lee et al. [36] used two control groups: a “hospitalised” group and a nurses and teachers group. The risk factors in the hospitalized controls were as follows: family history, menstrual regularity, total menstrual duration, age at first full-term pregnancy, and duration of breastfeeding. The risk factors in the nurses/teachers control group were as follows: age, menstrual regularity, alcohol drinking status and smoking status. This model was not based on Gail or Rosner and Colditz. Hosmer–Lemeshow goodness of fit was used to assess model fit which had a p value = 0.301 in (hospital controls) and p value = 0.871 in (nurse/teacher controls). No calibration or discrimination measures were reported.

Lee [33] used three evaluation techniques to assess the discrimination and the accuracy of their model: support vector machine, artificial neural network and Bayesian network. Of the three, support vector machine showed the best values among the Korean cohort. However, accuracy and discrimination were less acceptable in this model.

In summary, calibration performance is similar between models (Modified Gail and modified Rosner, Colditz), yet modified Gail models showed better discrimination performance with the C-statistic of the Park model being 0.89.

Discussion

There is increasing interest among clinicians, researchers and the public in the use of risk models. This makes it important that we fully evaluate model development and application. Each risk model should be assessed before it can be recommended for any clinical application. Performance assessment should involve the use of an independent population [43] separate from the population used to build the model. We have reviewed breast cancer risk models that include non-genetic and non-clinical risk factors but exclude clinical risk factors. By using PubMed, ScienceDirect, Cochrane library and other research engines, 14 models met these criteria. The most recent model examined was developed in 2015 [33]. Most models were based on two earlier risk models developed over 20 years ago—the Gail model [37] and the Rosner and Colditz model [42]. The modified versions of these two original models varied in the risk factors included and the estimation methods used. In 2012, there were two literature reviews published which analysed breast cancer risk prediction models [11, 28]; however, our review focuses particular on modifiable risk factors and/or self-reported factors and we have updated the models published after 2012 [23, 31, 33].

Most models with modifiable risk factors included report acceptable calibration, with E/O close to 1 but less acceptable discrimination with C-statistic close to 0.5. Calibration and validation were improved when more definite factors were included. A possible explanation for less acceptable discrimination performance could be the inclusion of weaker evidence-based factors (probable and possible risk factors). All the models had combinations of probable and possible factors with no single model restricted to the inclusion of the definite factors.

Various factors affect model performance. Inclusion of less significant factors is likely to occur in studies with small sample sizes [11, 28]. Some important clinical risk factors were not included and this may affect the model’s final performance [44]. Breast cancer heterogeneity may also contribute to poor performance as different cancer types may have different risk factors [11]. Most of the models included in this review did not stratify breast cancer into its subtypes during model development. Rosner and Colditz however evaluated the model’s performance based on breast cancer subtypes (ER±, PR ± or HR2±) and concluded that risk factors vary according to the subtypes [45, 46]. Finally, even when strong risk factors are included in a model, significant increases in C-statistic have not been seen [47].

Model performance statistics were affected by the criteria used to stratify the analysis. Four models were stratified by age (below 50 and above 50). One model was further stratified by menopausal status [38], one by ethnicity [41] and one by number of births [42]. Breast cancer risk models could be improved if appropriate factors were used to stratify the population. For example, pre-menopausal and post-menopausal females have different risk factors in breast cancer development. The models that applied menopausal status have some limitation in that this may not be applicable to women who have had hysterectomy. For example, in the US, hysterectomy is the second most common procedure performed and the likelihood of oophorectomy varies by age at hysterectomy [48]. Hence, completion of risk assessment outside of a clinical setting is problematic as women may be challenged to define their menopausal status. Even though the overall performance of these models appears to be moderate in differentiating between cases and non-cases, they may still serve as a good educational tool as part of cancer prevention. Utility evaluation assesses the public’s knowledge of breast cancer risk factors rather well and could be used to promote cancer risk reduction actions.

A significant limitation in the development of risk models is the absence of consensus standards for defining and classifying a model’s performance. For example what is the level of good or acceptable calibration or measures of discrimination? what are acceptable measures of specificity and sensitivity in diagnostic/prognostic/preventive models? how close to unity should calibration and discrimination be for a model to be considered valid? what is the utility cut-off in each type of model? All of these questions are hard to answer without global agreement. However, this lack of consensus is understandable as these values vary depending on the type of the model type (diagnostic, prognostic, preventive), goal (clinical tool, educational tool, screening tool), targeted audience (public, high-risk patients, patients visiting the clinic) and the disease itself and its types or subtypes (such as breast cancer, familial breast cancer, lobular/ductal/invasive/in situ carcinoma breast cancer). This suggests that the closer value of E/O and C-statistics to 1, the better model performance. Such a pragmatic attitude permits us to begin to focus on improving the availability of effective risk reduction actions.

Furthermore, some of the models reviewed cannot be applied to some of the populations as the risk factors may vary between different populations. For example, alcohol consumption would not be applicable to Muslim women. We recommend that researchers develop a more reliable and valid breast cancer risk model which has good calibration, accuracy, discrimination and utility where both internal and external validations indicate that it can be reliable for general use. In order to improve our models, the following should be considered: (1) the model type (diagnostic, prognostic, preventive), goal (clinical tool, educational tool, screening tool), targeted audience (public, high-risk patient), (2) inclusion of definite risk factors while incorporating the clinical and/or genetic risk factors where possible, (3) dividing the model into disease subtypes, age and menopausal status, (4) ensuring that a model is developed that can be validated externally.