1 Introduction

1.1 Historical review

Cancer is recognized as a significant healthcare challenge by the Horizon Europe program [1]. Among female cancers, Breast Cancer (BC) is the most prevalent, with an incidence rate of 5 cases per 1,000 women, as extensively documented in the literature [2,3,4,5,6,7,8]. In the European Union (EU) in 2020, 2.7 million cancer cases were diagnosed and 1.3 million people died of the disease. The World Health Organization (WHO) guidelines strongly recommend optimising cancer treatment and care [9]. The "European Commission Cancer Plan" highlights the crucial role of cancer prevention and treatment optimization [10]. It also provides information on the allocation of funds for cancer research on early detection and introduces a new "EU supported Cancer Screening Scheme" aiming to offer screening to 90% of the EU population by 2025. As an immediate objective, the European Commission plans to propose an update to the Council Recommendation on cancer screening by 2022, incorporating the most recent scientific evidence. The updated recommendation suggests expanding cancer screening campaigns beyond breast, colorectal, and cervical cancer to include prostate, lung, and gastric cancer. Furthermore, the Commission proposes identifying criteria to target screening based on personal risk and characteristics rather than just age.

BC is categorised into three subtypes based on the presence or absence of molecular markers for estrogen receptor (ER) or progesterone receptor (PR) and human epidermal growth factor receptor 2 (ERBB2 or HER2). Specifically, hormone receptor positive/ERBB2 negative cancers account for 70% of all BCs, ERBB2 positive cancers for 15%-20%, and triple-negative cancers for 15% [11]. More than 90% of BC cases are non-metastatic at the time of diagnosis, and the therapeutic goals in such cases include tumor eradication and prevention of recurrence.

As shown in Table 1, BC mortality exhibits significant geographical variability [12,13,14,15], influenced by factors such as population structure, lifestyle, genetics, and the environment [16]. The 5-year net survival rate after BC diagnosis also varies and reaches 87% in developed countries where screening and early diagnosis are practised [17].

Table 1 Mean (SD) for breast cancer mortality rate for each IHME super region from 1995 to 2015, from [15]

Risk factors for BC incidence and mortality can be classified into two groups: genetic risk factors (such as BRCA1 and BRCA2) and non-genetic risk factors (age at menarche, menopause, childbearing, breastfeeding, mammography density, overweight and obesity, physical inactivity, alcohol consumption, and lifestyle choices).

Breast Cancer Risk Models [18] utilise a model-driven analysis that incorporates a combination of several factors. However, aside from female gender and increasing patient age, certain risk factors have shown weak effects on BC, necessitating a large amount of data for accurate evaluation [19]. Data-driven analysis approaches in the field of Artificial Intelligence (AI) offer the potential to more effectively identify combinations of risk factors that contribute to increased BC incidence. These approaches leverage AI techniques to analyse and derive insights from extensive datasets, allowing for the identification of complex relationships and interactions among various risk factors. By harnessing the power of AI, we can enhance our understanding of BC risk and potentially improve the accuracy of risk prediction models.

1.2 Brief summary of the most recent studies

In recent years, Machine Learning (ML) techniques have emerged as one of the most prominent topics in the fields of Information Technology (IT) and Artificial Intelligence (AI). ML has experienced continuous growth and its applications extend across various domains, including pattern recognition, computer vision, finance, entertainment, computational biology, as well as biomedical and medical applications [20, 21]. ML represents an engineering approach that aims to enhance the ability to extract valuable information from data itself, without relying heavily on external inputs or prior knowledge. The primary objective of ML is to develop and refine models that can be trained using context-specific data, enabling decision-making without complete knowledge of external factors. The process of ML involves two essential steps: training and inference. During the training phase, an ML algorithm processes a dataset and identifies the function that best captures the underlying patterns in the data. This function is then encoded and referred to as the model, which is subsequently employed to extract knowledge from new data instances [22].

1.3 Opening problems under investigation

In recent years, significant advancements have been made in applying ML techniques to healthcare, as extensively documented in the literature [23]. Previous studies have demonstrated that augmenting the widely-used Gail risk model with additional inputs improves its ability to predict BC risk.

The Gail model incorporates six breast cancer risk factors: age, age at menarche, age at first live birth, number of breast biopsies, history of atypical hyperplasia, and number of first-degree relatives with breast cancer. From this information, the model provides an individual estimate of BC risk. Women with an estimated 5-year breast cancer risk greater than 1.66% are considered high-risk [24].

However, these models, including Gail, typically rely on simple statistical architectures and incorporate inputs obtained from expensive and/or invasive procedures. In contrast, recent studies [25] have presented ML models that utilise readily available personal health data to predict BC risk over a five-year period. Many of these studies have compared the accuracy of different models based on various ML algorithms and techniques, such as Random Forest (RF), K-Nearest Neighbour (K-NN), Naive Bayes (NB), Neural Network (NN), Decision Tree (DT), Logistic Regression (LR), Discriminant Analysis (DA), or Support Vector Machine (SVM) [26,27,28].

ML methods employed for tumor identification, classification, detection, or differentiation have demonstrated highly competitive results [29].

This review primarily focuses on the potential role of Artificial Intelligence (AI) in supporting BC prevention and the challenges that need to be addressed to enhance operational quality. Specifically, this work aims to analyse various ML techniques applied in the field of early detection of BC. To achieve this objective, recent papers (from 2017 to 2022) employing these techniques were collected and compared to determine the optimal combination of data types, feature extraction methods, and models that yield the most accurate results. Additionally, a secondary goal is to investigate the reasons behind the preference for certain ML techniques while neglecting others.

Several factors can lead researchers to prefer certain ML methodologies over others: the availability and quality of training data, computational resource constraints, the established body of previous research in the respective field, the inherent complexity of the problem to be addressed, and the potential interpretability and explainability of the chosen ML models. The ML models used by the authors of the publications included in this review are examined in the discussion chapter, with an emphasis on how the most commonly used models have changed over time.

2 Methods

2.1 Search and selection of literature

The studies included in this review were identified through a systematic literature search conducted on the PubMed and Scopus databases until December 2022. The search included articles published between 2017 and 2022. The search terms used were: "[Model name]" AND "machine learning" AND "breast cancer" AND "validation" AND ("prevention" OR "diagnosis" OR "risk analysis") AND "AUC" AND "accuracy" AND PUBYEAR > 2016. Only full-text documents written in English that defined validation methods and presented performance results in terms of Area Under the Curve (AUC) and accuracy were considered eligible for inclusion. Articles were excluded if they did not report both of the performance metrics mentioned above, proposed Deep Learning (DL) models instead of ML models, were unrelated to BC, or did not specify the dataset size.

DL is a subclass of ML and a data-driven technique for learning features and tasks. The term 'deep' refers to the multiple layers that data passes through during computation in a neural network. DL algorithms were excluded from this study because they require much larger quantities of data than typical ML algorithms, so that studies based on datasets of similar cardinality could be compared with each other. The distinctions between the two approaches are highlighted in Section 2.1 of [30], which states that they differ significantly both in approach and in the data required.

Figure 1 illustrates the search queries used in the PubMed and Scopus databases to retrieve articles related to the early detection and prevention of breast cancer using ML algorithms tested between 2017 and 2022.

Fig. 1 Papers selection flow

Model validation is a critical step in the process that ensures the effectiveness of the developed model. It involves evaluating the model's performance using an external dataset known as the validation set. The validation set is separate from the training data and is used to assess the quality and fit of the model's results. In the reviewed papers, the majority of studies employed the cross-validation method for model validation.

Cross-validation is a resampling technique where different subsets or partitions of the data are used for training and testing the model. There are several variations of cross-validation based on how the data is divided and utilised. One of the most commonly used approaches is the tenfold cross-validation, as depicted in Fig. 2 in [31]. In this method, the data is divided into 10 equal-sized subsets or folds. The model is then trained on 9 folds and tested on the remaining fold. This process is repeated 10 times, with each fold serving as the test set once. The results from each iteration are aggregated to assess the overall performance of the model.

Fig. 2 10-fold cross-validation, modified from [31]

Figure 2 in [31] provides an illustration of the tenfold cross-validation approach, highlighting the repeated training and testing steps with different subsets of the data.
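The tenfold procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any of the reviewed papers: the helper names (`k_fold_indices`, `cross_validate`), the majority-class stand-in model, and the synthetic labels are all hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and repeat k times."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    # Aggregate the per-fold accuracies into a single estimate.
    return scores, sum(scores) / len(scores)

# Hypothetical stand-in model: always predict the most frequent training label.
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

# Synthetic data (30 positives, 70 negatives), used only for illustration.
features = [[i] for i in range(100)]
labels = [1] * 30 + [0] * 70
fold_scores, mean_score = cross_validate(features, labels,
                                         train_majority, predict_majority)
print(len(fold_scores), round(mean_score, 3))
```

Each sample appears in exactly one test fold, so every observation contributes to the performance estimate once.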

2.2 ML performance metrics considered for the paper selection

The performance evaluation of the ML techniques was conducted by comparing them in terms of two key metrics: Area Under the Curve (AUC) and accuracy.

Accuracy is an important and intuitive performance measure in evaluating classification models. It represents the ratio of correctly predicted observations (True Positives and True Negatives) to the total number of observations. In the context of a binary classification problem, True Positives (TP) and True Negatives (TN) correspond to the correctly classified instances of the positive and negative classes, respectively. False Positives (FP) and False Negatives (FN) represent the instances that are incorrectly classified as positive and negative, respectively.

$$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$$
(1)

Although accuracy is a regularly used metric to evaluate classifier performance, it may be insufficient to provide a thorough evaluation, particularly in circumstances involving unbalanced datasets. When dealing with imbalanced classes, a classifier that predicts the majority class for all occurrences can nevertheless generate a high rate of accuracy even if it fails to identify the minority class adequately. This is the most obvious limitation in depending solely on accuracy as a performance metric.
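The limitation can be made concrete with a small sketch. The data below are invented for illustration (5 positives among 100 cases) and do not come from any reviewed study; the function names are likewise assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Eq. (1): correctly classified cases over all cases."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(y_true, y_pred):
    """True positive rate: share of positives that were detected."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative imbalanced sample: 5 positive cases, 95 negative cases.
y_true = [1] * 5 + [0] * 95
# A classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))     # 0.95
print(sensitivity(y_true, y_pred))  # 0.0
```

The classifier reaches 95% accuracy while detecting none of the positive cases.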

To address this limitation, additional measures must be incorporated into the evaluation process.

The Area Under the Receiver Operating Characteristic Curve (AUC) is the performance measure most frequently used in ML studies working with imbalanced datasets, because it is visually appealing and provides an overview of a classifier's performance across a wide range of specificities [32].

Additionally, by incorporating the AUC alongside accuracy, practitioners can acquire a more nuanced and trustworthy picture of classifier performance, which is especially important in cases when class imbalances are widespread.

The AUC measures the classifier's ability to distinguish between classes and summarises the Receiver Operating Characteristic (ROC) curve (see Fig. 3). The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds. The AUC is the area under this curve and corresponds to the probability that the model will rank a random positive case higher than a random negative case.

Fig. 3 Example of ROC curve, with False Positive Rate (FPR) on x-axis and True Positive Rate (TPR) on y-axis

A higher AUC indicates better performance of the model in distinguishing between the positive and negative classes. When the AUC is equal to one, the classifier achieves perfect discrimination between all positive and negative class instances. An AUC of 0.5 corresponds to random guessing. Conversely, an AUC of zero means that the classifier ranks every negative instance above every positive one, so that all negatives are predicted as positives and all positives as negatives.
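The probabilistic reading of the AUC can be illustrated by counting, over all positive/negative pairs, how often the positive case receives the higher score. The sketch below assumes that ties count as half a win, which is the usual convention; the function name and the scores are invented for the example.

```python
def auc_by_ranking(scores_pos, scores_neg):
    """Share of (positive, negative) pairs in which the positive case scores higher.

    Ties contribute 0.5, which is the usual convention.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Invented classifier scores for three positive and four negative cases.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]

print(auc_by_ranking(pos, neg))                 # 11 of 12 pairs ranked correctly
print(auc_by_ranking([0.9, 0.8], [0.2, 0.1]))   # 1.0, perfect separation
print(auc_by_ranking([0.2, 0.1], [0.9, 0.8]))   # 0.0, every pair inverted
```

This pairwise count is quadratic in the number of cases, so it is suited to small examples rather than large datasets.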

It is worth noting that AUC, like accuracy, is distorted by imbalanced training data, a problem that occurs frequently in bioinformatics. When ML methods are trained on very imbalanced datasets, they often tend to produce majority classifiers, which over-predict the presence of the majority class. Even mild levels of imbalance, with 30–40% of the data in the minority class, are sufficient to alter the values of the measures commonly used to assess model performance. When large amounts of minority-class data are easy to obtain, some authors suggest undersampling the majority class to balance the datasets effectively. When such data are sparse, the same authors suggest that bioinformatics researchers consider the oversampling and cost-sensitive learning techniques developed in machine learning in recent years [33, 34]. Furthermore, Saito and colleagues, focusing on the performance evaluation of the final ML models, discussed alternatives to the ROC plot such as the Concentrated ROC (CROC), the Cost Curve (CC), and the Precision/Recall (PRC) plots. The authors concluded that the PRC, being the only visual analysis tool that changes with the ratio of positives to negatives, is the most informative [32].
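Why the precision/recall plot reacts to the class ratio while the ROC plot does not can be seen with a short calculation. The sketch below assumes a hypothetical classifier with a fixed true positive rate of 0.8 and false positive rate of 0.1; these values and the sample sizes are illustrative only.

```python
def roc_and_pr_point(n_pos, n_neg, tpr, fpr):
    """Return the ROC point (FPR, TPR) and the PR point (recall, precision)."""
    tp = tpr * n_pos          # expected true positives
    fp = fpr * n_neg          # expected false positives
    precision = tp / (tp + fp)
    return {"roc_point": (fpr, tpr), "pr_point": (tpr, precision)}

# Same hypothetical classifier, two different class ratios.
balanced = roc_and_pr_point(n_pos=1000, n_neg=1000, tpr=0.8, fpr=0.1)
imbalanced = roc_and_pr_point(n_pos=1000, n_neg=10000, tpr=0.8, fpr=0.1)

print(balanced["roc_point"], imbalanced["roc_point"])  # identical ROC points
print(round(balanced["pr_point"][1], 3))               # precision near 0.889
print(round(imbalanced["pr_point"][1], 3))             # precision near 0.444
```

The ROC point is unchanged because TPR and FPR are each computed within a single class, whereas precision mixes both classes and therefore drops as negatives become more numerous.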

3 Results

A total of 184 articles were initially selected, with 86 retrieved from Scopus and 98 from PubMed. After removing duplicates, the remaining 160 papers underwent screening and evaluation. Among these, 106 articles were excluded based on specific criteria. This included 11 articles that were not available in full-text, 1 article written in Chinese, 31 articles that focused on Deep Learning (DL) or other techniques instead of Machine Learning (ML) techniques, 49 articles with incomparable results, and 14 articles that either focused on the prevention of morbidities other than BC or were not specifically focused on prevention.

After applying these exclusion criteria, a total of 54 papers were included in the review. The LR model was the most frequently used, accounting for 22.4% of the included studies. The SVM model followed with 18.3%, while the RF model accounted for 13.8%. The remaining papers, approximately 45.5%, employed various other ML techniques. The distribution of these works across different methods and years of publication is depicted in Fig. 4.

Fig. 4 Distribution of different methods used in selected related works in years

Although the majority of the analysed studies were conducted in the USA, the patient populations in the included papers encompassed individuals from other countries and regions, such as China, Japan, Africa, and Iran, indicating a broader geographical representation in the research.

This information highlights the selection process, the distribution of ML techniques used in the included papers, and the geographical diversity of the studied patient populations.

The temporal distribution of papers based on the ML models provides insights into the growing interest in ML methods, particularly the notable increase in publications focusing on Logistic Regression (LR) since 2021. The high number of papers utilising the SVM model could be attributed to its frequent use as a benchmark when evaluating the performance of other algorithms. Additionally, the number of studies utilising the RF technique has also increased in recent years. RF is often compared to DT to showcase the differences in results when using a single DT versus combining multiple DTs to obtain the final outcome.

The results of the 54 selected articles are organised by ML method in Tables 2 to 15. Each table provides the main characteristics of the study and presents the accuracy and AUC results. These metrics serve as indicators of the performance achieved by the respective ML models.

By examining these tables, readers can gain an understanding of the characteristics and outcomes of studies conducted using different ML techniques, allowing for comparisons and evaluations based on accuracy and AUC performance measures.

3.1 Support Vector Machine (SVM)

Support Vector Machines (SVMs) are supervised learning algorithms commonly used for binary classification tasks. In SVMs, a hyperplane is established to separate the sample items into two classes, as illustrated in Fig. 5. The hyperplane established by SVMs is determined by a subset of data points called support vectors, which lie closest to the decision boundary. These support vectors play a crucial role in defining the hyperplane and determining the classification boundaries.

Fig. 5 Optimal hyperplane in SVM, for a binary classification. Modified from [35]

The goal is to find the optimal hyperplane that maximises the margin between the two classes, ensuring the best generalisation ability.

By maximising the margin, SVMs aim to achieve high accuracy not only on the training set but also on new, unseen data that may be added to the dataset in the future [35]. This ability to generalise well to new data is a key advantage of SVMs.

Overall, SVMs are powerful tools for binary classification, providing an effective means of separating data into distinct classes by finding an optimal hyperplane that maximises the margin between them.
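The notions of margin and support vectors can be illustrated on a toy example. The sketch below does not train an SVM; it only evaluates two candidate hyperplanes on four invented 2-D points, assuming class labels coded as +1 and -1. All names and values are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b of point x with respect to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def signed_distances(w, b, points, labels):
    """Distance of each point to the hyperplane, positive if correctly classified."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [y * decision_value(w, b, x) / norm for x, y in zip(points, labels)]

def geometric_margin(w, b, points, labels):
    """Distance from the hyperplane to the closest point."""
    return min(signed_distances(w, b, points, labels))

def support_vectors(w, b, points, labels, tol=1e-9):
    """Points lying closest to the decision boundary."""
    distances = signed_distances(w, b, points, labels)
    margin = min(distances)
    return [x for x, d in zip(points, distances) if abs(d - margin) <= tol]

# Invented, linearly separable toy data.
points = [(2, 0), (3, 1), (0, 0), (-1, 1)]
labels = [1, 1, -1, -1]

# Two candidate separating hyperplanes: x1 = 1 and x1 = 1.5.
print(geometric_margin((1, 0), -1.0, points, labels))   # 1.0
print(geometric_margin((1, 0), -1.5, points, labels))   # 0.5
print(support_vectors((1, 0), -1.0, points, labels))    # [(2, 0), (0, 0)]
```

Both hyperplanes separate the toy data, but the first leaves twice as much room to the closest points, which is the property an SVM optimises for.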

Articles concerning the application of SVM models in the BC prevention field are summarised in Table 2, sorted by dataset cardinality.

Table 2 Support vector machine results

SVM models in some cases are combined with Fast Fourier Transform (FFT), in other cases with Discrete Cosine Transform (DCT), with Structural Similarity Index (SSIM) or with Sequential Forward Feature Selection (SFSS) as feature selection techniques. Paper [45] combines SVM with PET (Positron Emission Tomography) features and CT (Computed Tomography) features. Paper [56] instead applies SVM to Quantitative UltraSound (QUS) features. In other cases, SVM is combined with Semi-Supervised Learning (SSL) or Supervised Learning (SL) techniques.

The selected papers in the field of BC prevention propose various approaches to achieve optimal performance. Some common strategies include feature selection, modifying the algorithm to fit the data, and selecting data to suit the specific model being used.

In study [36], the SVM model is combined with three different feature selection techniques: the SSIM, which quantifies image quality degradation due to compression or data transmission losses, the DCT, which transforms pixel information from spatial domain to frequency domain, and the FFT, which computes the discrete Fourier transform of input sequences. These techniques aim to identify important patterns and information in the input images.

Another approach is presented in a different paper [37], where the dataset is divided into a modelling dataset and an external verification dataset. The authors selected 75% of the samples from the modelling dataset as the training set. They employed variable selection, one-hot encoding, and a basic model, which were assembled into a pipeline. This pipeline was then entered into grid search using the tenfold cross-validation technique, allowing for thorough evaluation and optimization of the model's performance.

The selected papers [40, 42, 44, 54] introduce specific variations of the SVM model, different from the standard version, and investigate their impact on performance. For instance, paper [61] achieved the best accuracy and AUC values by using SVM with the quadratic kernel function (SVMQ), while the worst performance was observed when using SVM with the linear kernel function (SVML). This indicates that different SVM models applied to the same data can have varying effects on performance outcomes.

In contrast, paper [39] employed the linear version of the SVM and combined it with both SL and SSL techniques. The performance achieved in this case was significantly better than that of paper [61]. The authors attribute this improvement to the utilisation of labelled data in training the SL algorithm and the availability of a larger training dataset. SSL, on the other hand, is a hybrid approach that combines SL and unsupervised learning, utilising unlabeled data and unsupervised information. It achieves competitive results with a smaller amount of data compared to SL methods.

Other papers in the selection combine the standard SVM model with specific features or additional learning models. Among these, paper [50] achieved the highest values for both accuracy and AUC, followed by papers [42, 49, 52]. These studies demonstrate the potential of incorporating additional features or employing hybrid models to enhance the performance of SVM in breast cancer prevention.

The paper [38] achieved the worst result by combining the SVM model with molecular subtype features. The authors reviewed histopathological reports to identify prognostic biomarkers (such as lymph node status, tumor grade, ER, PR, HER2, and Ki67) that were strongly associated with molecular subtypes of BC. Despite the inclusion of these features, the performance of the SVM model in this study was low.

Similarly, paper [54] obtained low performance values when using precontrast and postcontrast images. Precontrast images are acquired before contrast material injection, while postcontrast images are acquired during the fifth phase after contrast material injection. Subtraction images are obtained by subtracting the precontrast images from the postcontrast images. The SVM model performed better when applied to the subtraction images than to the other two types of images.

These findings suggest that the inclusion of certain features or imaging modalities does not always lead to improved performance in breast cancer prevention using SVM models. This highlights the importance of carefully selecting relevant features and considering the specific characteristics of the data to optimise the performance of the SVM model.

3.2 Naive Bayes

The Bayesian classifier, as demonstrated in the paper [63], is capable of predicting the probabilities of class membership, which represent the likelihood that a given sample belongs to a specific class. This classifier is built on the foundation of Bayes' theorem, which is expressed by the following formula:

$$P(h|d) = \frac{P(d|h)*P(h)}{P(d)}$$
(2)

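Here, h denotes the hypothesis (the class), d the observed data, P(h) the prior probability of the class, P(d|h) the likelihood of the data given the class, and P(d) the probability of the data. The Bayesian classifier uses probabilities to estimate the likelihood of a sample belonging to different classes and assigns it to the class with the highest probability. It considers both the prior probability of each class and the likelihood of observing the input given each class, providing a probabilistic framework for classification.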

The naive Bayesian classifier makes the assumption of class conditional independence, meaning it assumes that the effect of an attribute value on a class is independent of the values of other attributes. This assumption simplifies the computation and is the reason behind the "naive" characteristic of the classifier.
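Under the class conditional independence assumption, the likelihood P(d|h) in Eq. (2) factorises into a product of per-attribute probabilities. The sketch below works through one such calculation. The class names, attribute names, and every probability are invented for illustration and carry no clinical meaning.

```python
def naive_bayes_posterior(priors, likelihoods, observation):
    """Posterior P(h|d) for each class h, assuming independent attributes."""
    unnormalised = {}
    for cls, prior in priors.items():
        score = prior                                   # P(h)
        for attribute, value in observation.items():
            score *= likelihoods[cls][attribute][value]  # times P(d_i | h)
        unnormalised[cls] = score
    evidence = sum(unnormalised.values())               # P(d)
    return {cls: score / evidence for cls, score in unnormalised.items()}

# Hypothetical prior and conditional probabilities (illustrative values only).
priors = {"malignant": 0.3, "benign": 0.7}
likelihoods = {
    "malignant": {"shape": {"irregular": 0.8, "round": 0.2},
                  "margin": {"spiculated": 0.7, "smooth": 0.3}},
    "benign": {"shape": {"irregular": 0.2, "round": 0.8},
               "margin": {"spiculated": 0.1, "smooth": 0.9}},
}

observation = {"shape": "irregular", "margin": "spiculated"}
posterior = naive_bayes_posterior(priors, likelihoods, observation)
predicted = max(posterior, key=posterior.get)
print(predicted, round(posterior[predicted], 3))
```

For this observation the unnormalised scores are 0.3 x 0.8 x 0.7 = 0.168 and 0.7 x 0.2 x 0.1 = 0.014, so the sample is assigned to the first class with a posterior of about 0.923.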

Table 3 collects all the performance results obtained by the articles using the NB algorithms, sorted by descending cardinality of the dataset.

Table 3 Naive Bayes results

The paper [39] achieved the best results in terms of NB application, similar to the findings for SVM models. Additionally, the NB model showed different performance values when combined with Supervised Learning (SL) and Semi-Supervised Learning (SSL) techniques. The authors concluded that better performance can be obtained using a fuzzy version of the algorithm instead of the Gaussian one.

On the other hand, the worst performance was observed in the paper [38], where Naive Bayes was combined with Lymph node features and Molecular subtype features.

3.3 Linear and logistic regression

In the paper [66], the performance values of different types of regression models were described. Regression models are used to estimate the impact of independent predictors on a single dependent variable. Specifically, the Linear Regression model assumes a linear relationship between the predicted continuous variable and the predictor variables (Fig. 6). The LR model, on the other hand, assumes that the logarithm of the odds of an event occurring is a linear function of the predictor variables. The outcome variable in LR is dichotomous (0 or 1), and the model output ranges from 0 to 1, representing the probability of the event happening.
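The relation between the linear predictor and the predicted probability in logistic regression can be sketched as follows. The coefficients, the intercept, and the two predictors (age and a family-history flag) are hypothetical values chosen for the example, not estimates from any reviewed study.

```python
import math

def log_odds(coefficients, intercept, x):
    """Linear predictor: the log-odds of the event."""
    return intercept + sum(c * xi for c, xi in zip(coefficients, x))

def predicted_probability(coefficients, intercept, x):
    """Logistic (sigmoid) function maps the log-odds into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(coefficients, intercept, x)))

def classify(coefficients, intercept, x, threshold=0.5):
    """Turn the probability into a dichotomous outcome."""
    return 1 if predicted_probability(coefficients, intercept, x) >= threshold else 0

# Hypothetical model with predictors [age in years, family history (0 or 1)].
coefficients = [0.04, 0.8]
intercept = -3.0
case = [50, 1]

p = predicted_probability(coefficients, intercept, case)
print(round(log_odds(coefficients, intercept, case), 3))  # log-odds
print(round(p, 3))                                        # probability
print(classify(coefficients, intercept, case))            # class label
```

Taking log(p / (1 - p)) of the output recovers the linear predictor, which is why each coefficient can be read as a change in log-odds per unit of its predictor.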

Fig. 6 Scatter plot showing a linear relation between the two variables, from [66]

As in the previous sections, the data presented in Table 4 are sorted by decreasing dataset cardinality.

Table 4 Logistic regression results

All of the articles in Table 4 use logistic regression (LR) on their datasets, combining it with various feature selection approaches [62], applying it to training and testing subsets [68], to different types of images [54], or to different forms of analysis [69]. The study [65] proposes a variant of the conventional LR that incorporates the LASSO (Least Absolute Shrinkage and Selection Operator) technique, whereas the paper [53] employs a logistic elastic net to analyse its data.

The paper [39] demonstrated the best performance in terms of the LR model, consistent with the findings for other models in previous tables. In [69], the authors aimed to highlight the importance of early detection of lymphedema in BC survivors. They identified 24 lymphedema-associated symptoms as potential predictors and found that Logistic Regression achieved the best performance for early detection. Conversely, in [49], all studied models showed no significant differences in performance. In this study, features were extracted using LASSO from DCE-MRI images.

In [57], excellent results were achieved by utilising 279 textural features for each case. These features were analysed using the MaZda software, publicly accessible through [71, 72]. To reduce the complexity of the subsequent ML analysis, feature selection was performed using SPSS to remove weak features.

Interestingly, in [54], the LR model performed well when applied to subtraction images, but yielded poorer results when applied to Precontrast and Postcontrast images.

In [67], the LR model was applied to data collected through a questionnaire to investigate the impact of demographic and other risk factors on BC onset. The authors selected a total of 10 variables, including 3 demographic factors, 6 reproductive history factors, and family history of BC. However, the LR model yielded poor performance results in this study.

Similarly, in [38], when the LR model was combined with the molecular subtype features of BC, low performance results were obtained.

3.4 K-nearest neighbour

In the paper [73], the K-Nearest Neighbors (K-NN) algorithm is described as one of the most fundamental and straightforward classification methods. It involves associating new data with the most common class among its k nearest neighbours, where k is a predefined parameter that influences the final result. The accuracy of the K-NN classifier is influenced by both the choice of k and the distance metric used to compute distances between data points. Different distance measures can yield varying levels of accuracy depending on the presence of noise in the data, as discussed in [74].
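The influence of k and of the distance metric can be explored with a minimal K-NN sketch. Only two metrics (Euclidean and Manhattan) are implemented here, and the seven training points, including one deliberately mislabelled "noisy" point, are invented for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Assign the most common class among the k nearest training points."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training data: two clusters plus one noisy point labelled 0
# that sits inside the cluster of class 1.
train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7), (5.5, 5.5)]
train_y = [0, 0, 0, 1, 1, 1, 0]

query = (5.6, 5.6)
print(knn_predict(train_x, train_y, query, k=1))                      # 0
print(knn_predict(train_x, train_y, query, k=3))                      # 1
print(knn_predict(train_x, train_y, query, k=3, distance=manhattan))  # 1
```

With k = 1 the prediction follows the single noisy neighbour, whereas k = 3 outvotes it, which illustrates why k affects robustness to noise.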

In Fig. 7, from paper [74], a visual comparison of the top-performing distance measures is presented, including the Average (L1, Linf) distance, Canberra distance, Clark distance, Correlation distance, Cosine distance, Dice distance, Divergence distance, Lorentzian distance, Manhattan distance, Squared Chi-Squared distance, and Whittaker's index of association distance. These distance measures are used to compute the distances between data points in the K-NN algorithm.

Fig. 7 Average accuracy of KNN classifier using top 10 distance measures with different levels of noise, from paper [74]. AvgD, Average (L1, Linf) distance; CanD, Canberra distance; ClaD, Clark distance; CorD, Correlation distance; CosD, Cosine distance; DicD, Dice distance; DivD, Divergence distance; LD, Lorentzian distance; MD, Manhattan distance; SCSD, Squared Chi-Squared; WIAD, Whittaker’s index of association distance

In Table 5, all the papers listed combine the standard K-NN model with a set of features. However, the paper [61] is excluded from the table as its results for the K-NN model are not comparable with the other papers.

Table 5 K-nearest neighbour results

The paper [53] proposes a different version of the K-NN model called Weighted KNN. In this version, the distance of the nearest neighbours is incorporated, and the observations of the nearest neighbours are upweighted compared to those of more distant neighbours. The authors conclude that this weighted version improves the classifier's performance [75].
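The effect of upweighting close neighbours can be sketched as below. The exact weighting scheme used in [53] is not specified here, so inverse-distance weighting is assumed purely for illustration; the function names and the three training points are also hypothetical.

```python
import math
from collections import Counter, defaultdict

def nearest(train_x, train_y, query, k):
    """Return (distance, label) pairs of the k nearest training points."""
    return sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]

def majority_knn(train_x, train_y, query, k=3):
    """Standard K-NN: every neighbour has one vote."""
    labels = [label for _, label in nearest(train_x, train_y, query, k)]
    return Counter(labels).most_common(1)[0][0]

def weighted_knn(train_x, train_y, query, k=3, eps=1e-9):
    """Weighted K-NN with assumed weights 1 / distance (eps avoids division by zero)."""
    weights = defaultdict(float)
    for d, label in nearest(train_x, train_y, query, k):
        weights[label] += 1.0 / (d + eps)
    return max(weights, key=weights.get)

# Invented data: one very close neighbour of class 1, two distant ones of class 0.
train_x = [(0, 0), (2, 0), (0, 2)]
train_y = [1, 0, 0]
query = (0.1, 0)

print(majority_knn(train_x, train_y, query, k=3))  # 0
print(weighted_knn(train_x, train_y, query, k=3))  # 1
```

The two distant neighbours win the unweighted vote, while the weighted vote follows the single neighbour that is almost identical to the query.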

In the context of supervised and semi-supervised learning, the paper [39] achieved the best results among the studies examined. On the other hand, the paper [47] obtained low results when combining the K-NN model with twofold and threefold cross-validation. This can be attributed to the fact that with small training sets, increasing the number of folds in cross-validation helps reduce bias in generalisation error estimation by utilising more training data in each iteration.

3.5 Decision tree

According to the paper [76], a DT is a formal representation of classification flow within a given set of instances. In a DT, each leaf node represents one of the possible classes, and the intermediate nodes correspond to the tests performed on the data. Each branch originating from a node represents one of the possible outcomes of the test conducted at that node.
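The structure described above (tests at intermediate nodes, outcomes on branches, classes at leaves) can be represented as a nested dictionary and traversed in a few lines. The tree below is hand-written for illustration; its features, thresholds, and class labels are hypothetical and not clinically meaningful.

```python
# Hand-written toy tree. Intermediate nodes hold a test on one feature;
# leaves hold a class label. All names and thresholds are hypothetical.
tree = {
    "feature": "lesion_size_mm", "threshold": 20,
    "left": {"leaf": "class A"},
    "right": {
        "feature": "age", "threshold": 50,
        "left": {"leaf": "class A"},
        "right": {"leaf": "class B"},
    },
}

def classify(node, sample):
    """Follow one branch per test until a leaf is reached."""
    while "leaf" not in node:
        # Outcome of the test at this node selects the branch to follow.
        went_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if went_left else node["right"]
    return node["leaf"]

print(classify(tree, {"lesion_size_mm": 12, "age": 63}))  # class A
print(classify(tree, {"lesion_size_mm": 31, "age": 44}))  # class A
print(classify(tree, {"lesion_size_mm": 31, "age": 63}))  # class B
```

In practice the tests and thresholds are learned from data rather than written by hand; the traversal logic stays the same.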

Table 6 contains papers that utilise a DT model, while Table 7 includes papers that focus on other tree-based classification approaches.

Table 6 Decision tree results
Table 7 Other tree results

The paper [39] stands out as one of the best-performing articles in Table 6. On the other hand, the results of the paper [49] cannot be considered the best due to overfitting, even though it achieved 100% accuracy on the training set. Therefore, the paper [39] still holds the best results.

Additionally, the paper [38] has already been mentioned in previous sections for its poor performance when combining the DT model with molecular subtype features. In the case of the paper [62], the authors extracted 23 features from the dataset for each lesion but only used three features as reference standards: histopathologic Residual Cancer Burden (RCB) class, Recurrence-Free Survival (RFS), and Disease-Specific Survival (DSS). The DT model combined with RFS yielded low results. Table 6 presents the median of the four-fold cross-validation results.

While Table 6 compiles all publications that use DTs as a model, Table 7 compiles the research that presents other forms of tree. These models can be utilised in various forms of analysis, as in article [69], or can be used for training and testing datasets, as in papers [37, 77, 79]. Paper [77] includes a Linear Discriminant Analysis (LDA) step to maximise the ratio of between-class variance to within-class variance in the dataset, ensuring maximum separability. The LDA model is presented in more detail in Section 3.6.

Study [49] experienced overfitting when the model was applied to the training set, and therefore, its result cannot be considered the best in Table 7.

In the study [69], the authors aimed to determine the most effective model for stratifying the risk of breast cancer survivors and excluding potential patients with lymphedema. When the two proposed models were applied to the late detection of lymphedema, they achieved some of the best results, as shown in Table 7.

Additionally, it is noted that the paper [49] obtained the worst result when the Gradient Boosting Decision Tree model was applied to the testing set. This confirms that the high result achieved when applying the same model to the training set was solely due to overfitting.

3.6 Discriminant analysis

Linear Discriminant Analysis (LDA) [78] is a method that handles cases where within-class frequencies are uneven, and its performance has been examined on randomly generated test data. The goal of LDA is to maximise the ratio of between-class variance to within-class variance in a given dataset, ensuring maximal separability.
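The ratio that LDA maximises can be sketched for the two-class case. The snippet below is an illustrative assumption rather than the procedure of [78]: it only evaluates the ratio for a given projection direction, using sums of squared deviations (scatter) in place of variances, and the data points are made up.

```python
def project(points, w):
    """Project each point onto the direction w (dot product)."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in points]

def mean(values):
    return sum(values) / len(values)

def fisher_ratio(class_a, class_b, w):
    """Between-class over within-class scatter of the data projected onto w."""
    a, b = project(class_a, w), project(class_b, w)
    mean_a, mean_b = mean(a), mean(b)
    between = (mean_a - mean_b) ** 2
    within = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    return between / within

class_a = [(1.0, 2.0), (2.0, 3.5), (3.0, 4.0)]
class_b = [(2.0, 0.0), (3.0, 1.5), (4.0, 2.0)]

# The direction (1, -1) separates the two classes far better than the first axis alone
print(fisher_ratio(class_a, class_b, (1.0, -1.0)) > fisher_ratio(class_a, class_b, (1.0, 0.0)))  # True
```

LDA searches for the direction with the largest such ratio; the sketch only compares two candidate directions to show what the criterion rewards.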

LDA is applied to data classification in particular for speech recognition problems. The literature suggests that LDA performs better than Principal Component Analysis (PCA) [79], another dimensionality reduction technique commonly used in machine learning: LDA has been shown to provide improved results in terms of separability and classification accuracy.

Among the papers listed in Table 8, the paper [60] achieved the best results. The study aimed to assess the differentiation of benign and malignant breast lesions using Blood Oxygenation Level Dependent Magnetic Resonance Imaging (BOLD-MRI) and Diffusion Weighted Magnetic Resonance Imaging (DW-MRI). The combination of LDA with robust BOLD and DWI features, extracted using the Least Absolute Shrinkage and Selection Operator (LASSO), yielded the best performance.

Table 8 Discriminant analysis results

Another paper, [51], also obtained high results. The study compared the effects of radiomic analysis on 2D and 3D tumor segmentation using different machine learning (ML) models. An independent testing technique was employed, where a training set of 103 patients and a testing set of 51 patients were used. The LASSO method, followed by a tenfold cross-validation, was used to select the best subset of features based on mean square error. The performance of the models was evaluated using these selected features, comparing their distributions in 2D and 3D analysis.

On the other hand, the papers [50, 62] achieved the worst outcomes in Table 8. In the first paper, [50], the authors mainly focused on comparing their results with artificial neural network (ANN) results and did not extensively investigate the reasons behind the performance of other machine learning models. The second paper, [62], extracted 23 features for each lesion but only utilised three of them: histopathologic Residual Cancer Burden (RCB) class, Recurrence-Free Survival (RFS), and Disease-Specific Survival (DSS). The results presented in Table 8 represent the median of a fourfold cross-validation.

3.7 Artificial neural network

As defined in the paper [81], an Artificial Neural Network (ANN) is an intelligent system inspired by biological neural networks.

ANNs are characterised by the activation function used by their artificial neurons (see Fig. 8) and by the links between artificial neurons in different layers of the network, as presented in Fig. 9.

Fig. 8 Different types of activation functions from [81]: (a) Threshold, (b) Piecewise linear, (c) Sigmoid and (d) Gaussian
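The four activation functions named in Fig. 8 can be written out as follows. The exact parameterisation used in [81] is not reproduced here; the breakpoints, slope and width below are common textbook choices and should be read as assumptions.

```python
import math

def threshold(v):
    """Step function: the neuron fires (1) when its input is non-negative."""
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v):
    """Linear on [-0.5, 0.5], saturating at 0 below and at 1 above that range."""
    return min(1.0, max(0.0, v + 0.5))

def sigmoid(v, slope=1.0):
    """Smooth, S-shaped transition from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-slope * v))

def gaussian(v, sigma=1.0):
    """Bell-shaped response, maximal at v = 0."""
    return math.exp(-(v ** 2) / (2 * sigma ** 2))

for f in (threshold, piecewise_linear, sigmoid, gaussian):
    print(f.__name__, [round(f(v), 3) for v in (-2.0, 0.0, 2.0)])
```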

Fig. 9 Different types of artificial neural networks from [81]

In Table 9, the paper [69] achieved the best results, in addition to the previously discussed paper [39]. The study described in [69] aimed to find an effective approach to stratify the risk of BC survivors while excluding potential lymphedema patients. The DT model was used to achieve this goal, and it yielded impressive results.

Table 9 Artificial neural network results

On the other hand, the paper [38] obtained low results when applying the Artificial Neural Network (ANN) model to the molecular subtype features. This suggests that the chosen model was not suitable for the molecular subtype features, but it performed well with the other types of features.

It's important to note that the performance of a model can vary depending on the specific features and characteristics of the dataset being analysed. The success or failure of a model in one context does not necessarily guarantee the same outcome in a different context or with different features.

Table 10 collects results from papers that describe other types of Neural Networks (NN), such as the MultiLayer Perceptron Neural Network (MLP-NN) or the Feed Forward Neural Network (FNN). The studies in Table 10 apply the MLP-NN to specific feature sets, namely PET (Positron Emission Tomography) or CT (Computed Tomography) features, or to certain types of images, DES (Dual-Energy Subtracted) or LE (Low-Energy), with associated segmentation.

Table 10 Other neural network results

In Table 10, the paper [50] achieved the best results. This study focused on designing and testing different types of Feed Forward Neural Networks (FNN) with varying layer sizes. The authors found that the optimised learning rate for each model was 0.01, and they determined the specific number of nodes per layer to maximise performance. The best-performing models were FNN2 with 350 nodes, FNN4 with 400 nodes, and FNN8 with 300 nodes.

On the other hand, the paper [45] obtained the worst result among the papers in Table 10. This study applied Artificial Neural Networks (ANN) to PET features. The PET and CT images were processed using different methods, including Exponential, Gradient, Laplacian of Gaussian (LoG), Logarithm, Square, Square root, and Wavelet filtering. The PET/CTconcat features represented an integration of all the PET and CT radiomic features, while the PET/CTmean radiomic feature was the average value of the individual CT and PET radiomic features.
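The difference between the two integrated feature sets can be sketched as follows. The function names and the feature values are hypothetical, and the sketch assumes that the PET and CT vectors list the same radiomic features in the same order.

```python
def concat_features(pet, ct):
    """PET/CTconcat-style vector: all PET features followed by all CT features."""
    return list(pet) + list(ct)

def mean_features(pet, ct):
    """PET/CTmean-style vector: average of each matching pair of PET and CT features."""
    if len(pet) != len(ct):
        raise ValueError("PET and CT vectors must list the same features")
    return [(p + c) / 2 for p, c in zip(pet, ct)]

pet = [0.25, 1.5, 3.0]   # made-up radiomic feature values
ct = [0.75, 1.0, 5.0]

print(concat_features(pet, ct))  # [0.25, 1.5, 3.0, 0.75, 1.0, 5.0]
print(mean_features(pet, ct))    # [0.5, 1.25, 4.0]
```

Concatenation doubles the number of features per patient, while averaging keeps the dimensionality unchanged.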

The performance of a model can be influenced by various factors, such as the choice of features, data preprocessing techniques, and model configuration. The best and worst results in Table 10 reflect the outcomes specific to the approaches taken in each respective paper.

3.8 Random forest

The RF algorithm combines several decision trees and aggregates their predictions by averaging [83]. Some authors use the term RF for any aggregation of random decision trees, regardless of how the trees are obtained; others reserve it for Breiman’s original algorithm [84].
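The aggregation step can be illustrated with a deliberately simplified sketch: one-level trees (stumps) on a single feature, each fitted on a bootstrap sample and combined by averaging their votes. This is bagging of randomised trees rather than Breiman's full algorithm, which also samples a random subset of features at each split; the function names and the data are made up.

```python
import random

def fit_stump(samples):
    """Pick the threshold on a 1-D feature that misclassifies the fewest samples."""
    best = None
    for threshold, _ in samples:
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def fit_forest(samples, n_trees=25, seed=0):
    """Fit each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        forest.append(fit_stump(bootstrap))
    return forest

def predict_proba(forest, x):
    """Average the 0/1 votes of all trees."""
    return sum(x > threshold for threshold in forest) / len(forest)

# (feature value, class) pairs, class 1 for the larger values
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
forest = fit_forest(data)
print(predict_proba(forest, 0.2), predict_proba(forest, 5.0))  # 0.0 1.0
```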

For Table 11, it is important to consider overfitting rather than rely solely on the accuracy obtained on the training set. The paper [49] achieved 100% accuracy on the training set but lower accuracy on the validation set, which indicates potential overfitting: the model has memorised the training data instead of learning general patterns.

Table 11 Random forest results

As mentioned, the paper [39] consistently achieved the best results among the studies reported in Table 11. This indicates the effectiveness of the approach presented in that paper across multiple evaluation metrics or datasets.

The paper [77] also achieved excellent results by manually segmenting scanner images and extracting texture features for classification. This indicates the importance of careful pre-processing and feature extraction techniques in achieving good performance.

On the other hand, the papers [62, 67] obtained poor results in Table 11, as already noted in previous sections. This suggests that the chosen models or feature representations might not have been suitable for the respective datasets or classification tasks.

In summary, it is crucial to consider the impact of overfitting and the generalizability of results when evaluating the performance of classification models. The best results are often achieved by papers that effectively address these considerations and demonstrate good performance on validation or independent test sets.

3.9 Boosting

Boosting is an ML approach that builds a highly accurate prediction rule by combining many weaker, less accurate rules.

As explained in [86], Adaptive Boosting (AdaBoost) was the first practical boosting algorithm and is still one of the most used and studied. The AdaBoost pseudocode is presented in Algorithm 1.

Algorithm 1 AdaBoost pseudocode from paper [86]

XGBoost is short for the eXtreme Gradient Boosting package. As described in the paper [87], XGBoost is an efficient and scalable implementation of the Gradient Boosting Machine defined in [88].

In AdaBoost (Algorithm 1), for each round t = 1,…,T, a distribution \({D}_{t}\) is computed over the m training instances, and a weak learner (weak learning algorithm) is applied to find a weak hypothesis \({h}_{t}\), as defined in Eq. 4. The weak learner's goal is to find a weak hypothesis with a low weighted error \({\epsilon }_{t}\) relative to \({D}_{t}\). Equation 8 gives the final, or combined, hypothesis H(x), which is calculated as a weighted majority vote of the weak hypotheses \({h}_{t}\), with weight \({\alpha }_{t}\) assigned to each hypothesis.
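The rounds described above can be sketched in a few lines. The sketch assumes labels in {-1, +1}, threshold stumps on a single feature as the weak learner, and the usual choice of weight, alpha_t = 0.5 ln((1 - eps_t) / eps_t); the function names and the toy data are hypothetical and are not taken from [86].

```python
import math

def best_stump(xs, ys, weights):
    """Weak learner: the threshold stump with the lowest weighted error under `weights`."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x > threshold else -sign for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def adaboost(xs, ys, rounds=3):
    m = len(xs)
    weights = [1.0 / m] * m                       # D_1: uniform over the m instances
    ensemble = []
    for _ in range(rounds):
        error, threshold, sign = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, sign))
        # D_{t+1}: misclassified instances gain weight, then the weights are normalised
        weights = [w * math.exp(-alpha * y * (sign if x > threshold else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """H(x): sign of the alpha-weighted vote of the weak hypotheses."""
    score = sum(alpha * (sign if x > threshold else -sign) for alpha, threshold, sign in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]   # no single threshold separates these labels
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

On this toy set every individual stump misclassifies at least two instances, while the weighted vote of three stumps classifies all six correctly.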

The paper [88] presents a general gradient descent “boosting” model developed for any fitting criterion. Specific algorithms for Least Squares, Least Absolute Deviation, Huber-M loss functions for regression, and multiclass logistic likelihood for classification are also described.

To describe the results of the Boosting algorithms, the papers are divided across three tables: Tables 12, 13, and 14.

Table 12 AdaBoost results
Table 13 Boosting results
Table 14 XGBoost results

Table 12 focuses on papers that describe the results of AdaBoost [80] in various machine learning tasks. Among the papers listed, the best result was achieved by the paper [58], which applied AdaBoost with two different classifiers: SVM and DT. The AdaBoost-DT classifier was the better of the two and also surpassed the hybrid classifier proposed in the same paper, which combined SVM, RF, and DT classifiers. On the other hand, the worst result in Table 12 was obtained by the paper [62], which was already mentioned in the previous sections.

AdaBoost is known for its ability to boost the performance of weak classifiers by focusing on misclassified instances and iteratively updating the weights of the training data. It has been successfully applied to various classification and regression problems, and the papers in Table 12 provide insights into its effectiveness when combined with different base classifiers and applied to specific datasets.

Table 13 focuses on papers that present results of Gradient Boosting algorithms. Among the papers listed, the best result was achieved by the paper [39], which has already been discussed extensively in previous sections. Another paper, [52], introduced a modified version of the standard Gradient Boosting algorithm [89] called LightGBM, which showed excellent results. LightGBM is considered a new addition to the collection of boosting models and is known for its efficiency and performance advantages over XGBoost in certain aspects. The principles of LightGBM, Gradient Boosting Decision Trees (GBDT), and XGBoost are similar, with all three methods utilising the negative gradient of the loss function to approximate the residuals and fit decision trees.
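The shared principle, fitting each new tree to the negative gradient of the loss, can be sketched for the squared loss, where the negative gradient is exactly the residual. The sketch uses one-level regression trees on a single feature and a fixed learning rate; it is an illustrative assumption and not the implementation of LightGBM, GBDT or XGBoost, and all names and data are hypothetical.

```python
def fit_stump(xs, targets):
    """Regression stump: split a 1-D feature and predict the mean target on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:          # keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                        # initial constant model
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        step = (threshold, learning_rate * left_mean, learning_rate * right_mean)
        stumps.append(step)
        predictions = [p + (step[1] if x <= threshold else step[2])
                       for x, p in zip(xs, predictions)]
    return base, stumps

def predict(base, stumps, x):
    return base + sum(left if x <= threshold else right for threshold, left, right in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]                 # made-up targets
base, stumps = gradient_boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, [predict(base, stumps, x) for x in xs]))
```

The second printed value, the training error after boosting, is lower than the first, the error of the constant baseline.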

In the paper [52], radiomic feature extraction was performed on the original parametric images without any filtering, and features related to shape, grey-level, and grey-tone were calculated. A total of 293 features for each subject's imaging set were extracted in this study.

On the other hand, the worst result in Table 13 was obtained by the paper [40], where only eight features were selected because they were detected in all patients in the dataset. Among these features, five markers were identified for differentiating between breast cancers and benign tumors.

Table 14 presents the results of papers that utilised the XGBoost model. Apart from the paper [39], the best results among the papers in the table were achieved by the papers [52, 59]. In the paper [59], a large number of quantitative imaging features (1,409 features) were automatically extracted from each VOI (Volume of Interest). These features were then categorised into four groups. Data preprocessing and a tenfold cross-validation were performed, followed by the use of the LASSO function to select relevant features. Radiomic features with non-zero coefficients were identified as the final set of features.
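Selecting the features with non-zero LASSO coefficients can be sketched as follows. The sketch minimises (1/2n)·||y − Xw||² + λ·||w||₁ by cyclic coordinate descent with a fixed, made-up λ and no intercept; in [59] the penalty was tuned by cross-validation, which is omitted here, and the data and function names are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink `value` towards zero by `lam`; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of feature j with the residual computed without feature j
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                      for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Made-up data: the target depends on the first feature only
X = [[1.0, 0.5, 0.1], [2.0, -0.3, -0.2], [3.0, 0.2, 0.1],
     [-1.0, 0.4, -0.1], [-2.0, -0.5, 0.2], [-3.0, -0.3, -0.1]]
y = [2.0 * row[0] for row in X]

w = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, coefficient in enumerate(w) if coefficient != 0.0]
print(selected)  # [0]
```

The L1 penalty drives the coefficients of the two uninformative features exactly to zero, so only the first feature is retained.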

On the other hand, the paper [45] obtained the worst results when the XGBoost model was applied to CT and PET features. In this study, a semi-automatic segmentation algorithm was used to determine the VOI of the three-dimensional gross tumor volume (GTV), and manual adjustment was performed for accuracy. Subsequently, PET and CT images underwent various processing methods, including Exponential, Gradient, Laplacian of Gaussian (LoG), Logarithm, Square, Square root, and Wavelet filtering. A total of 1,218 CT and 1,218 PET radiomic features were extracted from the segmented tumor region of each patient.

3.10 RBF network

The Radial Basis Function (RBF) Networks, as described in the paper [90], are a type of machine learning model commonly used for prediction and forecasting tasks. The structure of an RBF Network typically consists of three layers: an input layer, a hidden layer, and an output layer, as shown in Fig. 10.

Fig. 10 RBF Network structure, from [90]

Unlike traditional neural networks, which may have multiple intermediate layers, RBF Networks have only a single hidden layer. However, they are still capable of solving complex problems. This is achieved by applying Gaussian functions in the hidden layer, which turn the inputs into nonlinear activations that the output layer can then combine linearly. Each hidden unit computes a Gaussian radial basis function centred on a specific point, so its output depends on the distance of the input from that centre.

The linear output of the RBF Network is obtained by summing the weighted nonlinear outputs from the hidden layer. This combination of nonlinear transformations and linear aggregation enables RBF Networks to effectively capture and model complex relationships in the data.
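A forward pass through such a network can be sketched as follows, assuming a single output, Gaussian hidden units that share one width, and made-up centres and weights; training (choosing the centres and fitting the weights) is not shown.

```python
import math

def rbf_forward(x, centres, width, weights, bias=0.0):
    """Output of a single-output RBF network for the input vector x."""
    # Hidden layer: one Gaussian unit per centre, driven by the distance to that centre
    hidden = [math.exp(-math.dist(x, c) ** 2 / (2 * width ** 2)) for c in centres]
    # Output layer: weighted linear sum of the hidden activations
    return bias + sum(w * h for w, h in zip(weights, hidden))

centres = [(0.0, 0.0), (1.0, 1.0)]   # hypothetical centres
weights = [1.0, -1.0]                # hypothetical output weights

print(rbf_forward((0.0, 0.0), centres, width=1.0, weights=weights))
print(rbf_forward((0.5, 0.5), centres, width=1.0, weights=weights))
```

The first input coincides with the first centre, so that unit responds fully and the output is positive; the second input is equidistant from both centres, so the two opposite-signed contributions cancel.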

Paper [39], the only one in Table 15, has been discussed thoroughly in the preceding sections, where it often achieved one of the best results in the tables.

Table 15 RBF network results

In order to understand why the RBF+SSL model outperforms the RBF+SL model in paper [39], one first needs to outline the dataset on which the two models are applied. The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 569 samples, 357 of which are benign and 212 of which are malignant. Although it is not a small dataset, it is not especially large either. The RBF+SSL model can improve its knowledge of the data by making better use of the limited labelled samples. Furthermore, where there is a class imbalance, such as more benign than malignant instances in the WDBC, RBF+SSL is able to balance the model's exposure to both classes, increasing classification performance.

4 Discussion

AI techniques have advanced quickly in recent years, allowing consistent improvements in medical image processing, computer-aided diagnosis, image interpretation, fusion, registration, segmentation, image-guided therapy, image retrieval, and image analysis. To date, understanding the associated data structures and statistics well enough to turn an ML algorithm into a product that works consistently in broad clinical use is still complex and prone to ethical issues [91]. Scientific research on the performance of ML algorithms, as well as on the most recent developments in DL techniques, is therefore central to the discussion about early pathology detection and about savings in medical image interpretation time and health costs, with the final goal of meeting society's expectations for innovative health solutions based on concrete, safe models and policies.

This study selected and analysed the recent literature on the application of ML techniques in the field of preventive BC diagnosis along three dimensions: Accuracy, AUC and Dataset cardinality. Papers that applied the same models to different datasets and used different feature selection methods have been compared. The selected models included LR, RF, DT, Boosting algorithms (such as XGBoost), and Artificial Neural Networks. To reduce the heterogeneity in cardinality among the reviewed studies, those that applied DL models only were excluded, as they are generally based on substantially larger datasets [92]. This exclusion is also motivated by the more realistic condition of having to manage limited medical imaging datasets. Limited data is a fundamental issue affecting the creation of ML models that learn from the available data, and it largely depends on the time-consuming procedures of medical image segmentation and annotation, which greatly limit medical imaging data collection [92].
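For reference, the two performance measures used in this comparison can be computed as in the following sketch. It assumes binary labels (1 for the positive class) and computes the AUC as the fraction of positive-negative pairs that the scores rank correctly; the labels, scores and the 0.5 decision threshold are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))  # 4 of 6 cases are classified correctly
print(auc(y_true, scores))       # 8 of 9 positive-negative pairs are ranked correctly
```

Accuracy depends on the chosen decision threshold, whereas the AUC summarises the ranking of the scores over all thresholds, which is why the two measures can disagree.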

Based on the analysis of the reviewed papers, the performances of the selected models proved to be generally comparable. Figure 11 plots the performance results across different models. It is important to note that the apparent superiority of the RBF model's performance may be skewed by the fact that only one paper meeting the selection criteria was found for this particular model.

Fig. 11 Mean results for analysed papers, grouped by ML model

This suggests that while different models may exhibit varying degrees of performance in different scenarios, no single model emerged as consistently superior across all studies. The choice of the most appropriate model may depend on factors such as the specific dataset, feature selection methods, and other considerations relevant to the particular application of preventive breast cancer diagnosis.

The paper by Al-Azzam et al. [39] demonstrated the best performance among the papers reviewed in terms of tumor type diagnosis. The authors explored various algorithms and combined them with both SL and SSL approaches to achieve high specificity in their diagnosis.

In SL, the use of labelled data is required for training the algorithm. However, labelling data can be a time-consuming task. The advantage of SSL is that it can achieve competitive results with less labelled data, reducing the overall cost of diagnosis. The authors found that SSL algorithms yielded very similar accuracies to SL algorithms, ranging from 91 to 98%.

These findings indicate that SSL is a promising and competitive solution for tumor type diagnosis, particularly when dealing with a small sample of labelled data and limited computational resources. SSL algorithms can provide accurate results while optimising the use of labelled data, making them an efficient approach for tumor type classification.

Our results agree with those of a recent review of the most widely used ML approaches (SVM, DT, Nearest Neighbour, Naive Bayesian Networks, ANN, and Convolutional Neural Networks), which highlighted the need for labelled images throughout the training of SL methods [92].

5 Conclusion

The analysis of the reviewed literature enables us to draw conclusions regarding the factors that influence the performance of the selected models. One notable finding is that the feature selection process has a greater impact on model performance than the size of the dataset. This highlights the importance of selecting relevant and informative features, alongside the need to have large quantities of labelled diagnostic images available to achieve accurate predictions.

This review confirms previous findings: compared with Supervised Learning, Semi-Supervised Learning is a promising approach, since it achieves similar accuracy while exploiting a smaller set of labelled data [93]. This makes the diagnosis process for breast cancer screening faster and cheaper. The future research challenge is to enhance the efficiency and accuracy of ML algorithms to support precision medicine, in a holistic patient-centric approach that integrates personal, clinical, genetic and environmental data in order to improve both diagnosis (faster, more accurate, cheaper) and therapy (reducing side effects and improving efficacy) [94]. In this perspective, ML is the basis for continued progress. However, the ethical, social, and legal implications of using Artificial Intelligence in healthcare need to be investigated in depth [95].

To further enhance the accuracy of predictive models in BC risk assessment, future work should focus on standardising image acquisition (scanners, lighting and enlargement factor configurations, image sizes) and on incorporating large volumes of personal and behavioural health data. This additional data can provide valuable insights and improve the model's ability to accurately predict an individual's risk of developing BC. Continued research and advancements in this area can contribute to more effective and personalised BC risk prediction models in the future.