Background

Since 2009, the U.S. Preventive Services Task Force has recommended breast cancer screening with biennial mammograms for women aged 50 to 74 years [1]. In 2013, Switzerland also adopted a national strategy recommending biennial breast cancer screening for women over 50 [2, 3]. Age over 50 years is the sole risk factor considered for entry into a population screening program [4,5,6]. However, about 25% of breast cancer diagnoses occur in women under 50 years old [7, 8]. Mammograms are less effective as a screening tool for younger women, who are more likely to have dense breast tissue; this compromises the utility of routine mammograms in this age group and contributes to diagnostic delays and increased morbidity and mortality [8, 9]. Risk-based screening could be more effective, less morbid, and more cost-effective [10,11,12,13,14,15,16,17]. Comprehensive breast cancer risk prediction models, able to classify women into clinically meaningful risk groups, would enable identification and targeting of women at high risk while reducing interventions in those at low risk.

The Breast Cancer Risk Assessment Tool (BCRAT), also known as the Gail model, and the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) model were developed to identify high-risk women based on known risk factors and have been integrated into clinical guidelines to help guide decision making about breast cancer risk management [18, 19]. BCRAT was developed and validated with data from the US Surveillance, Epidemiology, and End Results registry [20]. The model uses eight risk factors, i.e., age, age at menarche, age at first live birth, number of previous biopsies, benign breast disease, BRCA mutations, race, and number of first-degree relatives affected with breast cancer, to calculate 5-year and lifetime risk for women older than 35 years [21]. The National Comprehensive Cancer Network suggests using BCRAT to identify women with a 5-year risk greater than 1.66% and women with a remaining lifetime risk greater than 20%, who could consider risk-reducing chemoprevention and annual screening with mammograms and MRIs (magnetic resonance imaging) starting at 30 years old. The BOADICEA model was the first polygenic breast cancer risk prediction model, based on data from 2785 UK families. BOADICEA uses information from personal and family history of breast cancer, including breast cancer pathology, ethnicity, and BRCA mutations [22]. Clinical guidelines in several European countries and in Switzerland recommend using BOADICEA for breast cancer risk prediction [23, 24].

However, both models have considerable limitations. BCRAT can only be used for women above 35 years old and only takes into account the history of breast cancer in first-degree relatives (mother, sisters, or daughters), without including the age at diagnosis of these relatives. It does not consider family history of ovarian cancer, which may be of crucial importance for women with hereditary breast and ovarian cancer (HBOC). The BOADICEA model does not account for risk factors associated with reproductive history and hormonal exposure and has limited utility for women with limited family history. Although both models have been validated with large cohort data, their discriminatory ability, the area under the ROC (receiver operating characteristic) curve, is between 0.53 and 0.64 [21, 25,26,27,28]. There is thus a 36 to 47% chance that BCRAT and the BOADICEA model will fail to identify high-risk women, while some low-risk women may receive unnecessary preventive treatments. Both models implicitly assume that risk factors relate to cancer development in a linear way and are mostly independent of one another. Thus, both models likely oversimplify the complex relationships and non-linear interactions among numerous risk factors [27].

Machine learning (ML) forecasting

ML offers an alternative approach to standard prediction modeling that may address current limitations and improve the accuracy of breast cancer prediction tools [29]. ML techniques developed from earlier work on pattern recognition and computational statistical learning. They make fewer assumptions and rely on computational algorithms and models to identify complex interactions among multiple heterogeneous risk factors, which is achieved by iteratively minimizing specific objective functions of predicted and observed outcomes [30]. ML has been used in models of cancer prognosis and survival and has produced better accuracy and reliability estimates [31,32,33,34]. To date, very few studies have applied ML methods to personalized breast cancer risk prediction or compared their predictive accuracy and reliability with models commonly used in clinical practice [35]. The purpose of this study was to apply different ML techniques to forecast individualized breast cancer risk and to compare the discriminatory accuracy of ML-based estimates against the BCRAT and BOADICEA models.

Methods

To provide a strong assessment, reliable comparisons, and reproducible results, we compared ML-based estimates against estimates from the BCRAT and BOADICEA models using eight simulated datasets and two observational datasets. To ensure fair comparisons, the ML algorithms in each comparison used the same risk factors as the BCRAT and BOADICEA models, respectively, as input.

Simulated datasets

We used simulated data to compare the performance of the different ML algorithms and to determine the stability and validity of the predictions within each algorithm. We generated two sets of four simulated datasets (eight in total), one set consistent with the input values of BCRAT and the other consistent with the input values of the BOADICEA model; this dichotomy was necessary because the two models rely on different risk factors. For each of the two scenarios, we generated four synthetic datasets: A, simulated data with no signal (null data); B, simulated data with artificial signals; C, simulated dataset (B) with 20% missing values added; and D, simulated dataset (C) after applying multiple imputation. We randomly masked as missing 20% of the values in datasets (B) to generate datasets (C), then applied multiple imputation to datasets (C) to generate datasets (D). Variables in the null datasets (A) carried no signal; they were generated with completely random values within specific ranges. The cancer outcome in simulated dataset (B) for BCRAT was based on the linear aggregation of the effects of all variables, each with an artificial effect size. In our simulation, having certain risk factors elevated an individual’s breast cancer risk; this relative risk (signal or artificial effect size) was assigned according to published meta-analyses for the specific risk factor. Each individual was randomly assigned a baseline probability. After adding each risk factor’s contribution (relative risk multiplied by baseline) to the baseline, we applied a cutoff to the final probability to classify each sample as “healthy” or “sick”. Datasets (B) for BCRAT and BOADICEA have different input variables and data structures. For example, in the data used for the BOADICEA model, each individual is embedded in a family pedigree and has two individuals as parents. We randomly set family sizes between 3 and 80 members and the number of generations from 1 to 5 in each family, based on our observations in the Swiss clinic-based dataset. Family members’ ages and the age gaps between adjacent generations were set according to the average age at first childbirth. The pedigree (hierarchical) dataset (B) with artificial signal for the BOADICEA model was generated with the R package “pedantics,” which enables pedigree-based genetic simulation, pedigree manipulation, characterization, and viewing [36]. Multiple imputation with the R package “MICE” (multivariate imputation by chained equations) [37] addressed missing data in datasets (C).
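As an illustration of this simulation protocol, the following R sketch generates a flat (BCRAT-style) dataset (B), masks 20% of values to obtain dataset (C), and applies multiple imputation to obtain dataset (D). The variable names, effect sizes, baseline range, and 75th-percentile cutoff shown here are illustrative assumptions, not the exact values used in the study.

# Illustrative sketch: dataset (B) with artificial signal for the BCRAT-style comparison
set.seed(42)
n <- 1200
sim_b <- data.frame(
  age          = sample(35:74, n, replace = TRUE),
  age_menarche = sample(10:16, n, replace = TRUE),
  n_biopsies   = rpois(n, 0.3),
  n_fdr_bc     = rpois(n, 0.2)   # first-degree relatives with breast cancer
)
# Hypothetical relative risks standing in for published meta-analysis values
rr <- c(age = 1.02, age_menarche = 0.97, n_biopsies = 1.5, n_fdr_bc = 1.8)
baseline <- runif(n, 0.05, 0.15)          # random baseline probability per individual
risk <- baseline
for (v in names(rr)) risk <- risk + baseline * (rr[v] - 1) * sim_b[[v]]
sim_b$cancer <- factor(ifelse(risk > quantile(risk, 0.75), "sick", "healthy"))

# Dataset (C): mask 20% of predictor values completely at random
sim_c <- sim_b
for (v in setdiff(names(sim_c), "cancer")) {
  sim_c[sample(n, round(0.2 * n)), v] <- NA
}

# Dataset (D): multiple imputation by chained equations
library(mice)
sim_d <- complete(mice(sim_c, m = 5, printFlag = FALSE), 1)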

US population-based retrospective data

We used baseline data from a prospective randomized trial conducted in Michigan (USA) that included a statewide, randomly selected sample of young breast cancer survivors (YBCS) diagnosed with invasive breast cancer or ductal carcinoma in situ (DCIS) and their cancer-free female relatives [38, 39]. The trial recruited women diagnosed with breast cancer before 45 years of age from the state cancer registry. The sample was stratified by race, Black versus White/Other, for adequate representation of Black YBCS. YBCS recruited cancer-free, first- and second-degree female relatives. The trial collected all information required for calculating BCRAT scores from 850 YBCS and 293 relatives (total n = 1143), after excluding individuals younger than 35 years old.

Swiss clinic-based retrospective data

The oncology department at the Geneva University Hospital (HUG) has offered genetic evaluation and testing to breast cancer patients and cancer-free individuals since 1998. During the genetic consultation process, information about demographic and clinical characteristics, disease history, previous genetic test results, and a detailed family pedigree is recorded with the “Progeny” software [40]. Information from pathology reports, archived tumor tissue, and cancer treatment is recorded for breast cancer patients. Data from genetic consultation records and Progeny files were extracted with the R packages “tm” and “gdata” [41] from 2481 families comprising a total of 112,587 individuals. The extracted data are suitable for risk calculation with the BOADICEA model for one female member from each family. This study includes information from 2481 women, each of whom was either the first female in her family to receive genetic evaluation or testing or a first-degree relative of a male who received genetic evaluation or testing.

Missing values

For the US population-based dataset, there were less than 3% missing values among the variables used by the BCRAT model. For the Swiss clinic-based dataset, there were about 13% missing values among the variables used by the BOADICEA model; BRCA mutation, estrogen receptor, and progesterone receptor status accounted for most of these (11%). Thus, missing values in BRCA mutation and hormone receptor testing were given a separate category of “unknown” in the analyses, in addition to “positive” and “negative.” This approach is also consistent with the flexibility of the BOADICEA model in handling missing information.
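A minimal sketch of this recoding step, assuming a data frame swiss with hypothetical column names for the BRCA and hormone receptor variables:

# Recode missing BRCA/receptor results as an explicit "unknown" level
recode_unknown <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- "unknown"
  factor(x, levels = c("negative", "positive", "unknown"))
}
vars <- c("brca_status", "er_status", "pr_status")   # hypothetical column names
swiss[vars] <- lapply(swiss[vars], recode_unknown)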

Statistical analyses

Descriptive statistics, i.e., frequencies, percentages, means, and standard deviations, were computed to describe sample characteristics for the categorical and continuous variables used in the BCRAT and BOADICEA models and in the ML approaches, for the n = 1143 US YBCS and cancer-free relatives and the n = 2481 Swiss cancer patients and cancer-free individuals.

BCRAT

Comparisons between ML and BCRAT were based on performance assessment on five datasets: simulated data A to D (n = 1200) and retrospective data from the US population-based trial (n = 1143 women). The R package “brca” version 2.0 was used to calculate the absolute lifetime risk of invasive breast cancer according to the BCRAT algorithm for specific race/ethnic groups and age intervals for each individual in the datasets [42].

BOADICEA model

Comparisons between ML and the BOADICEA model were based on performance assessment on five datasets: simulated data A to D (n = 2500 women) and retrospective data from HUG with 2481 women from 2481 families comprising 112,587 family members. Lifetime risk predictions were generated with the web-based batch processing of the BOADICEA web application. The lifetime risk for each woman was calculated using data from all members of her family. In simulated datasets A to D, we randomly assigned a female member of each family as the index case.

ML algorithms

We used both model-based and model-free ML techniques for predictive analytics. The model-based approaches included generalized linear models (GLM), logistic regression (LOGIT), linear discriminant analysis (LDA), Markov Chain Monte Carlo generalized linear mixed models (MCMC GLMM), and quadratic discriminant analysis (QDA) [43]. The model-free predictive analytics involved adaptive boosting (ADA), random forest (RF), and k-nearest neighbors (KNN) [43]. We selected these algorithms based on prior reports of their reliability and effectiveness in identifying, tracking, and exploiting salient features in complex, heterogeneous, and incongruent biomedical and healthcare datasets [29, 43,44,45,46]. The variables included in each comparison are listed in Table 1.

Table 1 Variables included in ML for comparison with BCRAT and BOADICEA
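As an illustration of these learners, the following R sketch fits representative model-based and model-free classifiers. It assumes data frames train and test with a binary factor outcome cancer and a hypothetical family grouping variable family_id; the exact implementations and tuning used in the study may differ.

library(MASS)          # lda(), qda()
library(randomForest)  # randomForest()
library(class)         # knn()
library(ada)           # ada(): adaptive boosting
library(MCMCglmm)      # MCMCglmm(): Markov Chain Monte Carlo GLMM

fit_logit <- glm(cancer ~ ., data = train, family = binomial)   # GLM / LOGIT
fit_lda   <- lda(cancer ~ ., data = train)                      # LDA
fit_qda   <- qda(cancer ~ ., data = train)                      # QDA
fit_rf    <- randomForest(cancer ~ ., data = train)             # RF
fit_ada   <- ada(cancer ~ ., data = train)                      # ADA
pred_knn  <- knn(train = subset(train, select = -cancer),       # KNN (numeric predictors)
                 test  = subset(test,  select = -cancer),
                 cl    = train$cancer, k = 5)
fit_glmm  <- MCMCglmm(cancer ~ age + n_biopsies, random = ~ family_id,
                      family = "categorical", data = train,
                      verbose = FALSE)                           # MCMC GLMM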

One benefit of using ML approaches was supervised classification of breast cancer patients and cancer-free controls even when controls outnumbered patients or vice versa. We rebalanced the datasets prior to ML predictions to reduce the potential for estimation bias, using the R packages “unbalanced” (racing for unbalanced methods selection) and “SMOTE” (Synthetic Minority Over-sampling TEchnique) [47, 48]. These packages implement known rebalancing techniques and a racing algorithm for adaptively selecting the most appropriate strategy for a given unbalanced task.
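A minimal sketch of SMOTE-based rebalancing, assuming the ubBalance() interface of the “unbalanced” package; the argument names, the us_data data frame, and the cancer column are assumptions here, not the study’s exact calls.

library(unbalanced)

X <- subset(us_data, select = -cancer)
Y <- factor(ifelse(us_data$cancer == "sick", 1, 0))   # positive class coded as 1

bal <- ubBalance(X = X, Y = Y, type = "ubSMOTE", positive = 1)
balanced_data <- data.frame(bal$X,
                            cancer = factor(ifelse(bal$Y == 1, "sick", "healthy")))
table(balanced_data$cancer)   # classes should now be approximately balanced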

To ensure the reliability of the ML predictions and the consistency of the forecasts, we used internal statistical n-fold cross-validation. This is an alternative strategy for validating risk estimates when a prospective dataset is not available [49] and provides a powerful preventive measure against model overfitting [50]. Random subsampling split each dataset into n subsets of equal size (folds). In each of the n experiments, the algorithm was trained on n − 1 folds and tested for accuracy on the remaining fold. The final classification error estimate was obtained by averaging the n individual error estimates. We used 10-fold cross-validation with 20 repetitions in this process [51].
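A sketch of the repeated cross-validation, assuming the caret package as one possible implementation: 10 folds repeated 20 times, with performance summarized as the ROC area. The balanced_data object and its cancer outcome are the assumed inputs from the rebalancing step above.

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 20,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

cv_rf <- train(cancer ~ ., data = balanced_data, method = "rf",
               metric = "ROC", trControl = ctrl)
cv_rf$results   # mean AU-ROC and its standard deviation across folds and repeats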

Comparisons of predictive accuracy

The performance of the BCRAT and BOADICEA models was evaluated using the area under the receiver operating characteristic curve (AU-ROC), while the performance of the ML techniques is presented as the mean AU-ROC across the 10-fold cross-validations.
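A sketch of the AU-ROC computation for the model-based estimates, assuming the pROC package; lifetime_risk and cancer are placeholder column names for the risk estimate and the observed outcome.

library(pROC)

roc_bcrat <- roc(response = us_data$cancer, predictor = us_data$lifetime_risk)
auc(roc_bcrat)    # AU-ROC for the model-based lifetime risk estimate
plot(roc_bcrat)   # ROC curve, as visualized in Fig. 1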

Variable importance ranking

To understand, interpret, and gain trust in the ML techniques, we identified the salient features contributing most to predictive accuracy by ranking them within each cross-validation using the training sets (n − 1 folds). These features were examined to ensure that they were in line with both human domain knowledge and reasonable expectations. For decision-tree classification methods (e.g., RF and ADA), we ranked variable importance by how frequently a variable was selected as a decision node. For the GLM, LOGIT, LDA, QDA, and MCMC GLMM algorithms, variable importance was determined by the coefficient effect size. For KNN, importance was based on the overall weighting of each variable within the model.
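An illustrative sketch of this ranking within one training fold, using node-impurity importance for RF (an approximation of decision-node selection frequency) and absolute coefficient size for a GLM; train_fold is an assumed data frame holding the n − 1 training folds.

library(randomForest)

fit_rf <- randomForest(cancer ~ ., data = train_fold, importance = TRUE)
imp_rf <- importance(fit_rf, type = 2)               # node-impurity based importance
head(sort(imp_rf[, 1], decreasing = TRUE), 5)        # top five variables for RF

fit_glm <- glm(cancer ~ ., data = train_fold, family = binomial)
coefs   <- summary(fit_glm)$coefficients[-1, "Estimate"]   # drop the intercept
head(sort(abs(coefs), decreasing = TRUE), 5)         # rank by coefficient effect size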

Results

Sample characteristics

Table 2 presents the sample characteristics of the two independent observational retrospective datasets. The US population-based trial oversampled Black participants. There were more cancer cases than controls in the US sample, while the opposite was true for the Swiss sample. The average number of family members affected by breast cancer was higher in the US database, while the Swiss database included more known mutation carriers. Despite these differences, using breast cancer as the outcome grouping variable, we had sufficient numbers in each group even before applying a data balancing protocol.

Table 2 Sample characteristics of the US population-based sample (n = 1143) and the Swiss clinic-based sample (n = 2481)

Prediction accuracy

Tables 3 and 4 compare the predictive ability of the BCRAT and BOADICEA models and the ML techniques. In simulated dataset A with no signal, all approaches failed to discriminate cancer cases from cancer-free controls, i.e., AU-ROCs were around 50%. In simulated dataset B with artificial signal, most ML algorithms (except GLM) reached about 90% predictive accuracy. The ML methods (except GLM) also maintained high accuracy (89.77–93.00%) in dataset C with 20% missing values and dataset D with multiple imputations. Using the same risk factors and similar sample sizes, the accuracy of the ML techniques was superior to that of the BCRAT and BOADICEA models in the US and Swiss observational retrospective samples. For the US population-based sample, predictive accuracy reached 88.28% using ADA and 88.89% using RF, versus an AUC of 62.40% for BCRAT. For the Swiss clinic-based sample, predictive accuracy reached 90.17% using ADA and 89.32% using MCMC GLMM, versus an AUC of 59.31% for BOADICEA. Compared to the BCRAT and BOADICEA models, predictive accuracy increased by approximately 35% and 30%, respectively. To visualize this improvement, Fig. 1a, b shows the ROC curves for the BCRAT and BOADICEA models and the best-performing ML approach in each comparison.

Table 3 Performance AU-ROC curve of BCRAT and ML algorithms (with standard deviation) predicting breast cancer lifetime risk from simulated datasets (n = 1200) and the US population-based sample (n = 1143)
Table 4 Performance AU-ROC curve of the BOADICEA model and ML algorithms (with standard deviation) predicting breast cancer lifetime risk from simulated datasets (n = 2500) and Swiss clinic-based sample (n = 112,587 women from 2481 families)
Fig. 1

a The area under the receiver operating characteristic curve (AU-ROC) for BCRAT and the ML random forest approach. b The area under the receiver operating characteristic curve (AU-ROC) for the BOADICEA model and the ML adaptive boosting approach

ML variable importance rankings

Tables 5 and 6 present the most influential variables in the different ML algorithms, with the top five variables ranked in decreasing order of importance. In the US population-based sample, three of the risk factors included in BCRAT (number of biopsies, age, and number of first-degree relatives with breast cancer) were the top-ranked risk factors for almost all ML algorithms, except LDA. Four ML algorithms (RF, ADA, KNN, and MCMC GLMM) identified the number of biopsies as the most important risk factor for discriminatory accuracy (Table 5). For the Swiss clinic-based sample, two of the risk factors included in the BOADICEA model (age and family history) were the top-ranked risk factors for all ML algorithms, except KNN and QDA (Table 6).

Table 5 Top five important risk factors in descending order for different ML algorithms based on the US population-based training samples in 10-fold internal statistical cross-validations
Table 6 Top five important risk factors in descending order for different ML algorithms based on the Swiss clinical-based training samples in 10-fold internal statistical cross-validations

Discussion

We examined whether ML algorithms could improve breast cancer predictive accuracy compared to the BCRAT and BOADICEA models. We computed the predictive accuracy of these two models and of eight different ML algorithms using datasets with artificial signals (datasets B to D) and two observational retrospective datasets from different countries and with different target samples (population-based versus clinic-based). Compared to BCRAT and the BOADICEA model, most of the ML techniques we tested were superior at distinguishing cancer cases from cancer-free controls. ML algorithms significantly improved the predictive accuracy of both models, from less than 0.65 to about 0.90, especially when tested with real samples. The ML algorithms with the best accuracy were ADA followed by RF using the variables of BCRAT, and MCMC GLMM using the variables of the BOADICEA model. The increased predictive accuracy observed with the ML algorithms was not due to additional input variables, since we used exactly the same risk factors as the BCRAT and BOADICEA models; rather, it reflects the inherently better predictive ability of the ML algorithms. With supervised learning, the different algorithms captured the artificial or natural complexities of each dataset with high accuracy. When the datasets were intentionally perturbed by introducing missing values or performing multiple imputations, the prediction performance of the ML algorithms remained stable.

Using different simulated datasets allowed us to control the input and assess the case-classification/prediction results relative to “ground truth.” We simulated dataset (A) as a “null” reference case study; it helps identify false-positive predictions, because when no signal exists in the dataset, all approaches should fail to classify the samples. In simulated datasets (B), (C), and (D), we created artificial signals within the datasets that strongly correlate with the outcome (breast cancer yes/no). This approach allowed us to test whether the machine learning algorithms can detect these artificial signals and provide valid and stable predictions, even in the presence of missing values; it helps identify false-negative predictions.

In the simulated datasets, we assigned an estimate (e.g., coefficient or weight) to each risk factor based on published epidemiological data. Unfortunately, no information is available about the underlying estimates for each risk factor used in the BCRAT and BOADICEA models; the only available information is that these estimates were derived from large cohort studies over time. Therefore, the estimates in the simulated datasets may differ from those used by the BCRAT and BOADICEA models, which may explain the underperformance of the latter models in predicting the outcome class in the simulated datasets. Moreover, the simulated datasets have oversimplified artificial signals, which makes it relatively easier for the more general machine learning approaches to pick up a signal and identify features in the controlled simulated data than in real datasets. Thus, the machine learning algorithms showed opposite trends on the simulated data compared to the model-based methods. Finally, the simulated datasets were not used for comparison between the machine learning algorithms and the BCRAT or BOADICEA model; their main purpose was to compare predictions between the different machine learning algorithms and to assess the stability within each machine learning method.

The ranking of variable importance in each model was consistent with our expectations. A history of biopsy indicates prior suspicious cell abnormality. The number of first-degree relatives affected with breast cancer, as well as the age at cancer onset in a family pedigree, can partially reflect shared environmental exposures, inherited factors, and lifestyles. We observed variations and similarities in the importance of risk factors depending on the core algorithm of each ML approach and the variable types. ADA and RF are both based on decision trees and closely resembled each other in selected variables and rankings. QDA placed more importance on categorical variables, e.g., number of first-degree relatives with breast cancer, while LDA placed more importance on continuous variables, e.g., age, in both comparisons. This finding has implications for future research aiming to develop a new breast cancer risk prediction model incorporating established and newly evaluated risk factors.

As firm supporters of “open-science,” we have packaged, documented, and distributed the complete end-to-end R-protocol used to generate the synthetic data and perform all data analytics reported in this manuscript. We have shared the protocol via GitHub (https://github.com/SOCR/ML_BCP/).

Strengths and limitations

The inclusion and exclusion criteria of the US and Swiss datasets may have influenced the observed associations between variables and outcomes. In the US population-based sample, YBCS had fewer affected relatives than their cancer-free relatives. Thus, the number of affected relatives was detected as an important variable, but this finding lacks external validity in interpretation. The interpretability of the function modeled by the ML algorithms is only partially limited by the “black-box” nature of ML in our study, because we included a limited number of well-established breast cancer risk factors. However, the inherent complexity of how risk factors interact with each other, their independent effects on the outcome, and how effect sizes are determined within each ML algorithm remains unknown.

Significant strengths of the study include the novelty of the approach, i.e., applying ML algorithms to individual breast cancer risk prediction and comparing their predictive accuracy with existing models. The improvement achieved with ML algorithms in the accurate classification of women with and without breast cancer, compared with state-of-the-art model-based approaches, was striking. We demonstrated a range of ML algorithms with cross-validation, which is lacking in other applications of ML to cancer prognosis [32]. Different ML algorithms for feature selection and classification showed great adaptability and discriminatory accuracy in our study by handling multidimensional and heterogeneous data. Ranking variable importance may inform algorithm selection with diverse predictive risk factors for the future development of new risk prediction models.

Conclusions

Predictive models are essential in personalized medicine because they contribute to the early identification of high-risk individuals based on known epidemiological and clinical risk factors. Accurate breast cancer risk estimates can inform clinical care and risk management across the breast cancer continuum, e.g., behavioral changes, chemoprevention, personalized screening, and risk-stratified follow-up care. Available risk prediction models have an overall accuracy of less than 0.65. ML approaches offer the exciting prospect of achieving improved and more precise risk estimates. This is a first step toward developing new risk prediction approaches and further exploring diverse risk factors. ML algorithms are not limited to a specific number of risk factors but have the flexibility to change or incorporate additional ones. The improvement in predictive accuracy achieved in this study should be further explored and replicated with prospective databases and additional risk factors, e.g., mammographic density, the risk factors in the IBIS Breast Cancer Risk Evaluation Tool, and polygenic risk scores. Improvements in computational capacity and data management in healthcare systems can be followed by opportunities to exploit ML to enhance the risk prediction of disease and survival prognosis in clinical practice [52].