1 Introduction

Breast cancer is the most prevalent cancer among adults, with over 2.3 million cases occurring annually. In 95% of countries, breast cancer ranks as the first or second most frequent cause of cancer-related death in women (World Health Organization, 2023). Breast imaging is crucial for detecting breast cancer at an early stage and for monitoring and assessing the effectiveness of treatment. Research indicates (Balkenende et al., 2022) that Deep Learning (DL) algorithms demonstrate comparable or superior performance to radiologists in breast cancer imaging; however, extensive clinical trials are still necessary, particularly for ultrasound, to precisely determine the added benefits of DL in this field. Ultrasound is a safe, painless, and non-invasive screening technique compared to procedures that use ionizing radiation, such as X-rays in mammography. It is also fast and somewhat less expensive. Considering these facts, more and more young women choose an ultrasound examination instead of a mammography.

When abnormalities are detected by other imaging modalities or on palpation, ultrasonography is utilized to detect and diagnose breast lesions due to its benefits, such as safety, accessibility, and low cost. Moreover, breast ultrasound is expected to become an additional screening technique for women with mammographically dense breasts. This screening approach is anticipated to identify tumours at an early stage and to decrease the risk of women dying of breast cancer. Therefore, the proposed intelligent system identifies breast lesions in ultrasound images. In the first phase, the system locates the tumour, if any. In the second phase, it fits the tissue into one of three categories: healthy, benign, or malignant.

Automatic tumour segmentation and classification remain difficult because of the high noise, low contrast, weak or blurry boundaries, and significant quantity of shadows in breast ultrasound images. Recent works in medical image segmentation, such as Das et al. (2022), utilize non-deep-learning techniques like hybrid ellipse fitting based on bounded opening and Fast Radial Symmetry. These methods are ingenious but can face challenges, particularly when dealing with images that exhibit high variability, noise, and complex morphological structures. Similarly, traditional techniques reviewed in Ganesan et al. (2013), like thresholding and edge detection, are foundational for understanding the limitations of non-deep-learning methods: they often require manual tuning and may not effectively handle the subtle nuances present in medical images, such as overlapping tissues or varying densities. These limitations are significant given the precision required for medical diagnosis, where deep learning approaches like CNNs demonstrate superior performance. Their capacity to learn representations that capture underlying data distributions makes deep learning methods indispensable in modern medical image analysis.

Efficiency is vital for the system, given the delicate issue it has to address. A large and complex dataset is essential for proper algorithm training. Considering the limited public data available at the moment, the system initially employs a data augmentation step to enhance the dataset, followed by image segmentation and classification steps to ensure robust performance. The augmentation phase consists of creating new images to handle the limited and unbalanced dataset. These new images are built by a Generative Adversarial Network (GAN), a recent development in Machine Learning able to yield new data instances that mimic the existing training data. The UNet model, a popular deep architecture in medical imaging for disease detection and diagnosis, is utilized for image segmentation. Further, a classic Convolutional Neural Network is employed for image classification. This work addresses the significant challenge of diagnosing breast cancer from ultrasound images using deep learning models by focusing on data limitations, on end-to-end model validation, and on the impact of data on performance. Firstly, public datasets in this domain are often limited in size and imbalanced, hindering model training. This work proposes a novel GAN-based data augmentation approach, generating realistic and diverse synthetic images to enrich the dataset and achieve performance comparable to state-of-the-art (SOTA) methods. Secondly, this research introduces and validates a novel end-to-end deep learning model for breast cancer segmentation and classification. The model streamlines the analysis process by combining segmentation and classification tasks into a single framework, offering advantages in efficiency compared to traditional approaches employing separate models. Finally, this research investigates the impact of data quality and quantity on the performance of the proposed model. The systematic investigation of data augmentation techniques, including the proposed GAN-based approach, highlights the potential benefits of data augmentation for overcoming limitations in public datasets.

This work proposes a novel approach to address the challenges of limited and imbalanced datasets in breast cancer segmentation using ultrasound images. A Generative Adversarial Network (GAN)-based data augmentation technique that generates realistic and diverse synthetic images is introduced to enhance the training data and improve the performance of the proposed end-to-end deep learning model. This continuous augmentation strategy aims to address data imbalances and increase training data diversity. By employing this novel GAN-based approach, the main aim is to contribute to the field by overcoming data limitations and fostering further research into reliable and efficient breast cancer diagnosis tools. This technique brings consistent benefits to the learning process and helps the proposed model outperform the state-of-the-art model developed in Yap et al. (2019).

In summary, the contributions of this paper are three-fold:

  • the design, implementation, and validation of the end-to-end model, followed by a comparison with other models from the literature. The validation and comparison, performed on a benchmark dataset, indicate the efficiency and robustness of the proposed system.

  • an investigation of the training data’s quality and quantity, since the automatic learning of the end-to-end model is somewhat restricted by the limited and unbalanced ground truth provided by the radiologists. Observations reveal that the generated data consistently enhance the learning process. A GAN-based method is introduced for automatic augmentation, addressing the challenges of unbalanced and limited datasets, a common issue in the medical field.

  • the experimental results on a dataset that contains the required annotations (segmentation masks and classification labels) show that the proposed end-to-end model outperforms Yap et al. (2019), even though images with a Dice score lower than 0.5 were also considered when computing the overall accuracy.

Therefore, the proposed developments support answering the research questions addressed in this study:

  • RQ\(_1\): What is the most efficient approach to designing a robust system for breast tumour identification and characterization: a sequential system or an end-to-end one?

  • RQ\(_2\): How do the quality and quantity of training data impact the performance of automatic staging of breast tumours?

The paper is structured as follows: Section 2 outlines the significance of this research. Section 3 gives a brief overview of algorithms used for breast cancer identification. Section 4 presents the proposed approach, including the preprocessing, augmentation, segmentation, and classification stages. Section 5 describes the dataset, the applied performance metrics, and the conducted experiments, together with their results. Section 6 provides answers to the research questions proposed in the introduction. Finally, Section 7 contains the conclusions, a short evaluation of the proposed end-to-end model, and proposals for future improvements.

2 Research Significance

This proposed work addresses the significant challenge of diagnosing breast cancer from ultrasound images using deep learning models by focusing on:

  • addressing data limitations: public datasets in this domain are often limited in size and imbalanced, hindering model training. This work proposes a novel GAN-based data augmentation approach, generating realistic and diverse synthetic images to enrich the dataset and achieve performance comparable to state-of-the-art (SOTA) methods.

  • end-to-end model validation: this research introduces and validates a novel end-to-end deep learning model for breast cancer segmentation and classification. The model streamlines the analysis process by combining segmentation and classification tasks into a single framework, offering advantages in efficiency compared to traditional approaches employing separate models.

  • investigating data impact on performance: this research investigates the impact of data quality and quantity on the performance of the proposed model. The systematic investigation of data augmentation techniques, including the proposed GAN-based approach, highlights the potential benefits of data augmentation for overcoming limitations in public datasets.

In summary, this research contributes to the field by:

  • proposing a novel approach that leverages Generative Adversarial Networks (GANs) to address limited and imbalanced public datasets.

  • systematically investigating the impact of data quality and quantity on the performance of the proposed end-to-end model.

  • prioritizing computational efficiency while maintaining accuracy, considering the sensitive nature of breast cancer diagnosis.

  • conducting a rigorous evaluation of the proposed end-to-end model by validating its performance on an established benchmark dataset.

3 Related Work

Ultrasound imaging is used increasingly often, and radiologists spend a very long time examining large volumes of these images. This has become a major problem in many countries because it increases medical expenses and worsens the quality of medical services.

According to MD et al. (2019), in recent years AI has revolutionized medical research for detecting and diagnosing cancer. These methods use different types of algorithms, e.g., Convolutional Neural Network (CNN) architectures and learning procedures, for cancer classification, and have achieved outstanding performance. Lately, Deep Learning technologies have been applied to radiological images, for example, to detect tuberculosis in chest X-rays, lung nodules, or cranial tumours in MRI. Also, in the case of breast cancer, recent advanced AI methods have proved useful in analysing various medical modalities (ultrasound images, MRIs, CTs). In the case of breast ultrasound images, datasets with sizes comparable to those used in the current numerical experiments have been cited 325 times (Al Saleh et al., 2021).

According to Roslidar et al. (2019), research on the classification of breast cancer based on histological images using CNNs has reached an accuracy of 98%. Using mammography, some studies in which Convolutional Neural Networks were applied to tumour classification achieved an accuracy of 97%. Meanwhile, an improved approach using a support vector machine (Chen et al., 2017) for processing ultrasound images achieved 76.8% accuracy in binary classification, the two classes being benign and malignant. Considering these results, and the fact that the number of studies on ultrasound images is significantly lower than the number using mammography, this article focuses on improving the performance of ultrasound cancer diagnosis.

Next, approaches focused on tumour identification are presented, grouped by image type: mammography or breast ultrasound images (BUSI). In some approaches, the use of pre-trained models helps to speed up the process of adapting networks, leading to faster problem-solving.

Table 1 Performance of various models discriminating between benign and malignant breast tumours
Table 2 Performance of two detectors of benign and malignant breast tumours

Several models have been developed to detect and discriminate between various breast tumours in mammography images. In Ragab et al. (2019), the authors used the AlexNet network to achieve binary classification, which they modified by introducing Support Vector Machines on the last layer. They also used a threshold-based method to automate the segmentation process. They successfully classified benign and malignant tumours, working on mammographic datasets (the DDSM (Heath et al., 2007) and CBIS-DDSM (Lee et al., 2016) datasets) and obtaining an accuracy of 87.2%. Levy and Jain (Lévy & Jain, 2016) compared AlexNet, GoogLeNet, and a simple CNN architecture, to which they added transfer learning techniques, batch normalization, preprocessing, and augmentation. Their best model achieves 0.934 recall and 0.924 precision on the DDSM dataset (Heath et al., 2007) for discriminating between benign and malignant tumours. Jung et al. (2018) proposed the RetinaNet network, with weights pre-trained on an in-house dataset (GURO), to demonstrate that pre-trained models achieve performance similar to models trained directly on the public INbreast dataset (Moreira et al., 2012), which contains both benign and malignant tumours. Their detection model obtained an average false positive rate of 0.34. In William Hang and Hannun (2017), an adversarial network was used to detect tumours. The study was conducted on a convolutional network, followed by Conditional Random Fields (CRF) for structured learning. The adversarial model was also used to control the overfitting that could occur, since the authors had access to a small amount of data. They used two datasets, INbreast (Moreira et al., 2012) and DDSM-BCRP (Heath et al., 2007), on which they obtained 66.18% accuracy, which can be considered state-of-the-art performance for discriminating between benign, malignant, and healthy tissue. In another paper, Bakkouri and Afdel (Bakkouri & Afdel, 2017) proposed a new discriminative technique for supervised learning, using the Softmax layer as a classifier. The network was improved using Gaussian pyramids to highlight the regions of interest. Results on DDSM (Heath et al., 2007) and BCDR (Oliveira et al., 2011) revealed an accuracy of 97.28% on 2 classes: benign and malignant.

In what follows, the results obtained on breast ultrasound images are presented, starting from approaches already reviewed in T et al. (2020).

Several models have been built, and their performance in differentiating benign and malignant breast tumours has been evaluated. The input of these models is represented by B-mode images (Han et al., 2017; Fujioka et al., 2019; Mango et al., 2020) or shear wave elastography images (Zhang et al., 2016; Fujioka et al., 2020), while the architecture varies from traditional backbones such as GoogLeNet (Han et al., 2017; Fujioka et al., 2019) or DenseNet (Fujioka et al., 2020) to various Boltzmann machines (Mango et al., 2020). Even though the results are promising (Table 1), no access to the datasets used in these experiments is provided (Mango et al., 2020). Regarding the systems able to perform the detection task (that is, localizing and categorizing the lesion in an image), different input types (hand-held B-mode images (Cao et al., 2019) or automated breast B-mode images (Jiang et al., 2018)) and different detection models (YOLO or SSD (Cao et al., 2019)) can be identified. The experiments’ datasets are not published, even though the findings are encouraging (Table 2).

In addition to detectors, some methods enable high-precision identification of all the pixels that belong to a particular object in the image, even if they are sometimes computationally intensive. The task is called semantic segmentation, and the ML algorithms learn a label (from a prefixed set) for every pixel of an image. Regarding breast lesion segmentation, many CNN architectures used in various approaches are based on the UNet model, one of the most popular neural networks in the medical field because it produces a satisfactory segmentation with very few data samples available (Ronneberger et al., 2015). For instance, in Zhuang et al. (2019), the authors proposed a U-Net-based model able to segment benign and malignant tumours in breast ultrasound images. They trained and validated their model on a large dataset (but only a part of it is public) and obtained very good performance in segmenting various lesions in breast images.

Other approaches are based on deep architectures different from UNet. In Hu et al. (2019), the authors combined a dilated fully convolutional network with a phase-based active contour model to segment breast tumours, achieving a Dice score of 88.97%. Recently, Kumar et al. (2020) used a contextual-information-aware deep adversarial learning framework to propose an effective model for breast tumour segmentation in BUS images. In this framework, they applied a deep learning paradigm to capture both the textural features and the contextual dependencies in the BUS images. Two datasets were involved in their experiments: a semi-public dataset of BUS images (without ground truth masks, which constrains the possibility of replicating their results) and a public dataset. Even though the results on the public dataset indicate good performance of this approach (a Dice score over 86%), its segmentation accuracy was limited on some BUS images.

Ensemble methods have been investigated for breast cancer classification, achieving promising results. For instance, the study titled "The stratified K-folds cross-validation and class-balancing methods with high-performance ensemble classifiers for breast cancer classification" (T R et al., 2023) employed stratified K-folds cross-validation alongside ensemble classifiers. While their approach addressed class imbalance to some degree, the authors recognized the potential for bias in highly imbalanced datasets. To tackle this issue, they proposed the incorporation of the Synthetic Minority Over-sampling Technique (SMOTE). This technique generates synthetic data points for the minority class by interpolation: specifically, by taking a weighted average between an original data point and one of its nearest neighbours. This approach achieved an accuracy of 99.3% and a precision of 99.2% for the majority-voting ensemble, demonstrating the potential of SMOTE to improve model performance when dealing with class imbalance.
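To make the interpolation step concrete, the following minimal NumPy sketch shows how one SMOTE-style synthetic sample can be produced; the function name is illustrative, and a complete implementation (such as the SMOTE class in the imbalanced-learn library) would also perform the nearest-neighbour search and class bookkeeping.

```python
import numpy as np

def smote_sample(x: np.ndarray, x_neighbour: np.ndarray,
                 rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic minority-class sample by interpolating
    between a minority point and one of its nearest neighbours."""
    lam = rng.uniform(0.0, 1.0)          # random interpolation weight in [0, 1]
    return x + lam * (x_neighbour - x)   # point on the segment between the two samples

# Example: interpolate two minority-class feature vectors
rng = np.random.default_rng(42)
synthetic = smote_sample(np.array([1.0, 2.0]), np.array([3.0, 6.0]), rng)
```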

Building upon the foundation laid by hybrid CNN models and ensemble frameworks, recent studies published in Biomedical Signal Processing and Control (Sahu et al., 2023, 2024) have made significant strides in deep learning for breast cancer detection using ultrasound images. These studies have introduced innovative approaches, such as combining efficient deep CNN networks into hybrid models that utilize weight factors for enhanced accuracy and speed, and ensembling powerful transfer learning models like AlexNet, ResNet, and MobileNetV2 for their synergistic benefits. The application of advanced image processing techniques, such as Laplacian-of-Gaussian-based modified high-boosting filters, further refines the quality of ultrasound images, leading to more precise detection capabilities. Moreover, the focus on small datasets, as highlighted in the proceedings of the 22nd International Conference on Intelligent Systems Design and Applications (ISDA’22), showcases the evolving ability of deep learning frameworks to achieve remarkable performance even with limited data availability (Sahu et al., 2023). These developments highlight the continuous advancements in deep learning architectures and image processing techniques, contributing significantly to both the reliability and the efficiency of diagnostic processes for breast cancer through ultrasound imaging.

As previously mentioned, substantial advancements have been achieved in the domain of artificial intelligence (AI)-enabled detection of breast cancer through ultrasound imaging. However, many important challenges still prevent it from being widely adopted and achieving its full potential:

  • Data Quantity and Quality - small and potentially biased datasets: many studies rely on limited and retrospective datasets, which can restrict the generalizability of the models and potentially introduce bias into the results. These limitations can lead to models with reduced accuracy, reliability, and generalizability, hindering their real-world application in clinical settings (Fujioka et al., 2020).

  • Overfitting: the concern of overfitting, a limitation often discussed in the realm of deep learning models, remains relevant for ultrasound-based breast cancer detection. Even when working with larger datasets, there is a risk of models becoming overly focused on the specific characteristics present in their training data, leading to reduced generalizability and poor performance on unseen images (Sahu et al., 2023).

This article aims to address these challenges by utilizing data augmentation techniques to artificially create diverse variations of existing data points. This method allows models to learn from a richer and more varied representation of the underlying patterns, potentially improving generalizability and performance without requiring extensive data collection. To the best of our knowledge, the approaches mentioned previously have focused on a single task: either classification, detection, or segmentation of tumours. Furthermore, the numerical experiments have been conducted on datasets that provide only task-specific information, such as tumour labels, bounding boxes, or masks. The proposed end-to-end approach is able to perform both tumour segmentation and tumour categorization. Nevertheless, for this purpose, a corresponding dataset (with both labels and masks) must be considered.

Fig. 1 Pipeline diagram

4 Materials and Methods

This study utilized a publicly available dataset of breast ultrasound images, obtained from Dataset-BUSI-with-GT (W et al., 2020). As the dataset was publicly accessible and did not involve direct interactions with human participants or animals, formal review and approval were not applicable. However, ethical considerations associated with the original collection of the data were considered.

The main aim is to develop an end-to-end decision support system that automatically performs both steps (cancer segmentation and stratification) and could be an efficient solution (in terms of quality of predictions, but also speed) to the unmet medical need for proper delimitation and characterization of the tumour. The input of the intelligent system is a breast ultrasound image. In the first phase, the intelligent system locates the lesion, if any. In the second phase, it fits the tissue into one of several categories (e.g., normal, benign, malignant). Using a dataset that includes ultrasound breast images of 600 women aged between 25 and 75 years, the intention is to validate the hypothesis that an end-to-end system outperforms the two-stage systems developed so far in the literature. In addition to its computational efficiency (reduced time and space complexity), the proposed system can improve the process of cancer identification because of its architecture: being end-to-end, the training of the decision core algorithm benefits simultaneously from both cost functions, which measure the quality of predictions in terms of lesion segmentation and lesion discrimination. Furthermore, the entire learning procedure behind the AI algorithm is agnostic to the input type or size; the current results are obtained on 2-dimensional B-mode ultrasound images, with new experiments planned on SWE ultrasound images and 3D tomosynthesis images. The ground truth data required by such a system must include two annotations for every breast image: the location of the lesion and its type. To the best of our knowledge, such datasets are not available elsewhere, except for the investigated set of images. This lack of similar datasets could affect the validity of the proposed system, as there are currently no other sets on which to test and confirm its robustness. However, good results were obtained when validating specific components of the approach.

Figure 1 illustrates a graphical overview of the applied pipeline. Starting with the preprocessing steps, continuing with the augmentation, followed by segmentation combined with classification through 4 intermediate layers, the end-to-end model was obtained. Two filters (a Gamma correction and a Gaussian blur) and thresholding are applied during the preprocessing stage (more details are given in Section 4.2). To handle the small and unbalanced initial dataset and to increase the training data required by the segmentation model, a GAN is employed in the augmentation step to generate new images (the details of the image generation process are given in Section 4.3). The next step partitions the images into specific and meaningful regions and is performed by a UNet-based model (see Section 4.4). Finally, the segmented regions are labelled as benign, malignant, or normal by a trained classifier (see Section 4.5). By integrating segmentation and classification into a single and efficient framework, the proposed approach not only enriches the dataset but also enhances the learning process, showcasing a significant leap over traditional data augmentation techniques.

Fig. 2 Gamma Correction/Logarithmic Correction for various image samples: (a) benign (b) malignant (c) normal

4.1 Dataset

The dataset used is the Breast Ultrasound Images dataset (breast-ultrasound-images-dataset: Dataset-BUSI-with-GT) (W et al., 2020), which includes breast ultrasound images of women aged between 25 and 75 years. The data were collected in 2018 from 600 female patients. The dataset consists of 780 images, with an average image size of 500×500 pixels, stored in PNG format. The 2D images are divided into 3 categories: benign, malignant, and normal.

  • The “benign” category contains 891 files: approximately 437 ultrasound images and their corresponding masks.

  • The “malignant” category contains 421 files: approximately 210 ultrasound images and their corresponding masks.

  • The “normal” category contains 266 files: approximately 133 ultrasound images and their corresponding masks.

In addition to the category label, the dataset contains ground truth images (the breast tumour masks). By providing both elements (lesion label and contour), this dataset matches the aim of the proposed approach: an end-to-end model able to localize the breast tumour and classify it as benign or malignant.

4.2 Preprocessing

The proposed system reads the pixel matrix in RGB format and applies different techniques to extract the most relevant information. At each step, various filters were tested, and only those showing improvements were retained for further processing. Next, images with the individual filters applied (one image from each category: benign/malignant/healthy tissue) are shown, followed by more details about the filters chosen for further analysis.

4.2.1 Step 1 - Gamma Correction

Gamma correction is a nonlinear operation that defines the relationship between the numerical value of a pixel and its true brightness (in Colour, 2021). With this correction, the shades captured by a device are brought as close as possible to those perceived by the human eye.

$$\begin{aligned} s = c \cdot r^{\gamma } \end{aligned}$$
(1)

where:

  • r = input pixel value;

  • s = output pixel value;

  • c = constant (scaling factor);

  • \(\gamma \) = the exponent responsible for changing the brightness threshold of the image.

Two cases are possible:

  • Case 1: \(\gamma < 1\): Gamma encoding is useful when there is a narrow range of dark pixels in the original image and the range of output values needs to be expanded. The effect of this curve is to accentuate the bright areas of an image.

  • Case 2: \(\gamma > 1\): Gamma decoding (or Gamma correction) is useful when there is a wide range of dark pixels in the original image and the range of output values needs to be narrowed. The effect of this curve is to reduce the bright areas of an image.

Logarithmic correction is useful for enhancing images with low contrast, since it compresses the dynamic range of pixel values, bringing out more details in dark areas while avoiding overexposure of bright areas. However, logarithmic correction may also result in a loss of detail in bright regions, especially when the image has a high dynamic range (Akram & Hussain, 2015).

Comparing the images resulting from the Gamma and Logarithmic corrections, the former was preferred. For this dataset, a Gamma correction with \(\gamma = 2\) was applied to reduce the brightness of the pixels and emphasize the darker areas of the image, which may be suspicious for tumours (some examples are given in Fig. 2).
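As an illustration, a minimal NumPy sketch of applying (1) to an 8-bit grayscale image, with \(\gamma = 2\) as used here, could look as follows (the function name is illustrative):

```python
import numpy as np

def gamma_correction(image: np.ndarray, gamma: float = 2.0, c: float = 1.0) -> np.ndarray:
    """Apply Eq. (1), s = c * r**gamma, to an 8-bit grayscale image."""
    r = image.astype(np.float32) / 255.0       # normalize pixel values to [0, 1]
    s = c * np.power(r, gamma)                 # gamma > 1 darkens bright regions
    return (np.clip(s, 0.0, 1.0) * 255.0).astype(np.uint8)
```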

4.2.2 Step 2 - Noise removal by Gaussian Blur

The Gaussian filter is applied to an image for noise and detail reduction (ScienceDirect, 2021). It uses a Gaussian function to compute the transformation of each pixel in the image. Thus, the use of this filter represents the application of the convolution operation on the image, with a Gaussian function.

The Gaussian formula for two-dimensional space is:

$$\begin{aligned} G(x,y)=\frac{1}{2 \pi {\sigma ^2}} e^{-{\frac{x^2+y^2}{2 \sigma ^2}}} \end{aligned}$$
(2)

where:

  • x = distance from the origin, on the Ox axis;

  • y = distance from the origin, on the Oy axis;

  • \(\sigma \) = standard deviation.

The amount of smoothing depends on the value of the standard deviation: the smaller the deviation, the more concentrated the kernel weights are around the center, and the weaker the blurring effect. A Gaussian kernel features the highest value at the center, which decreases symmetrically towards the edges.

In the proposed approach, the images are filtered (through convolution operations) by the following Gaussian kernel:

$$ A_{5\times 5} = \frac{1}{256} \begin{bmatrix} 1 & 4 & 6 & 4 & 1\\ 4 & 16 & 24 & 16 & 4\\ 6 & 24 & 36 & 24 & 6\\ 4 & 16 & 24 & 16 & 4\\ 1 & 4 & 6 & 4 & 1 \end{bmatrix} $$

Convolution: Mathematically, convolution is an operation that combines two signals and produces a third signal. Given two functions f(t) and g(t), their convolution is defined as the integral that measures the overlap of the function g as it is shifted over the function f:

$$\begin{aligned} (f*g)(t)=\int _{-\infty }^{\infty } {f(\tau )}{g(t-\tau )}d\tau \end{aligned}$$

In image processing, convolution is the process in which each pixel of the output image is computed as a weighted sum of its local neighbours, with the weights given by the kernel (Saha, 2020). A kernel is a small matrix, with an odd number of rows/columns, in which each cell holds a number, together with an anchor point (used to find the position of the kernel relative to the image). Each pixel covered by the kernel is multiplied by the corresponding kernel element and added to the sum; in the end, the resulting matrix is composed of these sums.

Convolution is very important in image processing because it can be used for blurring, sharpening, edge detection, and noise reduction.
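As an illustration, the following sketch convolves a grayscale image with the kernel \(A_{5\times 5}\) defined above using SciPy; the border-handling mode is an assumption, since the text does not specify how image edges are treated.

```python
import numpy as np
from scipy.ndimage import convolve

# The 5x5 Gaussian kernel A from above (integer weights divided by 256)
KERNEL = np.array([[1,  4,  6,  4, 1],
                   [4, 16, 24, 16, 4],
                   [6, 24, 36, 24, 6],
                   [4, 16, 24, 16, 4],
                   [1,  4,  6,  4, 1]], dtype=np.float32) / 256.0

def gaussian_blur(image: np.ndarray) -> np.ndarray:
    """Convolve a grayscale image with the 5x5 Gaussian kernel."""
    return convolve(image.astype(np.float32), KERNEL, mode="reflect")
```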

Some examples of applying these filters are given in Fig. 3.

Fig. 3 Edges obtained by applying Canny / Box Blur / Gaussian Blur filters for various image samples: (a) benign (b) malignant (c) normal

4.2.3 Step 3 - Binarization by Sauvola Thresholding

At this step, image pixel values are normalized to the [0, 1] range, and only pixels whose value exceeds a certain threshold are kept: pixels above the threshold receive the value 1, while all others are set to 0. In the experiments, the value 0.45 was chosen for the threshold.

Sauvola Thresholding is an extension (it can be considered an improvement) of Niblack’s algorithm. This technique is used for images with a nonuniform background.

$$\begin{aligned} T = m \cdot \left( 1 + k \cdot \left( \frac{stdN}{R} - 1\right) \right) \end{aligned}$$
(3)

where:

  • m = mean of the neighbourhood;

  • k = constant in the range [0.2, 0.5] (default 0.5);

  • stdN = standard deviation of pixel values in the neighbourhood;

  • R = dynamic range of the standard deviation (default 128).

Instead of calculating a single global threshold for the entire image, this technique calculates a threshold for each pixel, considering the mean and the standard deviation of the local neighbourhood (defined by a window centered around the pixel). Some examples of thresholded images are given in Fig. 4.
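A minimal sketch of this local thresholding, based on scikit-image's threshold_sauvola, is shown below; the window size is an assumed value, as it is not stated in the text.

```python
import numpy as np
from skimage.filters import threshold_sauvola

def sauvola_binarize(image: np.ndarray, window_size: int = 25, k: float = 0.2) -> np.ndarray:
    """Binarize an image with the per-pixel Sauvola threshold of Eq. (3)."""
    thresholds = threshold_sauvola(image, window_size=window_size, k=k)
    return (image > thresholds).astype(np.uint8)  # 1 above the local threshold, 0 otherwise
```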

Fig. 4 Filtered images obtained by Sauvola Thresholding for various sample images: (a) benign (b) malignant (c) normal

Fig. 5 Sketch of the augmentation process by a GAN

4.3 Augmentation

4.3.1 General description

During some preliminary experiments, a classification accuracy between 94% and 99% was obtained on the training set, and between 65% and 71% on the test set. Analysing this discrepancy between the training and test results, it was concluded that the number of images in each set has a major impact on the output of the method.

Augmentation is one of the main methods used to create new input samples by manipulating the original data. There are two frequent situations where the use of augmentation should be considered: an unbalanced dataset or a small dataset. As observed in the input of the algorithm, both situations are encountered here.

Generally, augmentation is limited to an approach that flips, rotates, or randomly changes the hue, saturation, brightness, and contrast of an image. This augmentation procedure is simple and can be done without much effort. The disadvantage of these techniques is that they do not introduce new synthetic data into the model, but only present the same samples in a different form. Therefore, the model already knows these samples, and the impact on the result is limited.

Generating new realistic data is a difficult task that involves learning to imitate the original distribution of the available dataset. For this purpose, it is possible to use a generative model, such as a Generative Adversarial Network (GAN) (Goodfellow et al., 2014), able to create new and sufficiently realistic images from an existing dataset. This technique was chosen because it can generate better synthetic data samples, which may improve the performance of the model. A GAN is composed of two important parts: the Generator and the Discriminator (Baeldung, 2022).

  • The Generator is a CNN that learns to create new plausible data. It receives as input a random vector of fixed length and learns to produce samples that imitate the distribution of the original dataset. Then, the generated samples become negative examples for the discriminator.

  • The Discriminator is a CNN that learns to distinguish the synthetic data of the Generator from the real data. It receives a sample as input and classifies it as “real” (it comes from the original dataset) or “synthetic” (it comes from the Generator). The Discriminator penalizes the Generator for producing implausible samples.

Thus, the Discriminator and the Generator play a “game” with two participants, in which the Generator tries to mislead the Discriminator (to classify the synthetic samples as real) – see Fig. 5.

4.3.2 Augmentation Model

  1. Network architecture:

    • Generator architecture: The Generator is architecturally configured to progressively upscale input latent vectors into higher-resolution images. This is achieved through a series of upsampling blocks, each consisting of transposed convolutional operations, batch normalization, and leaky rectified linear unit (LeakyReLU) activations. This series of operations systematically increases the spatial dimensions while concurrently decreasing the depth of feature maps, culminating in a high-fidelity image representation. The final output layer employs a hyperbolic tangent (Tanh) activation function to ensure the pixel values of the generated images are normalized.

    • Discriminator architecture: The Discriminator is designed to perform the inverse operation of the Generator. It progressively downscales the input images into more abstract representations. This is facilitated through a series of downsampling blocks, each comprising convolutional operations, layer normalization, and LeakyReLU activations. These operations systematically reduce the spatial dimensions while increasing the depth of feature maps, allowing the network to assess the features of the input images.

    Fig. 6 Images generated using GAN - (a) benign, (b) malignant, (c) normal

  2. Block components:

    • Upsampling blocks in the Generator: Each upsampling block employs a transposed convolutional layer to increase the feature map’s spatial dimensions, followed by batch normalization to stabilize the learning process and LeakyReLU activation to introduce non-linearity.

    • Downsampling blocks in the Discriminator: Each downsampling block utilizes a convolutional layer to reduce the feature map’s spatial dimensions, followed by layer normalization for effective re-centering and scaling of the activations, and LeakyReLU activation to introduce non-linearity.

  3. Training parameters: In the training phase, the Wasserstein loss function (Adler & Lunz, 2018) is employed. Both the Generator and the Discriminator are optimized using the Adam optimizer with a learning rate of 0.0001 and beta parameters set to (0.0, 0.9), ensuring a balanced optimization trajectory. The training is conducted over 800 epochs, a duration carefully chosen to allow sufficient model convergence. A ratio of five Discriminator updates for every Generator update is maintained to preserve the adversarial balance, a critical aspect for the success of GAN training. Additionally, a gradient penalty with a weight term of 10 is applied. A minimal sketch of this training configuration is given after this list.
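The following TensorFlow sketch illustrates the stated configuration (Wasserstein critic loss with a gradient penalty of weight 10, Adam with learning rate 0.0001 and betas (0.0, 0.9), five critic updates per generator update). It is a generic WGAN-GP recipe under these assumptions, not the authors' exact code.

```python
import tensorflow as tf

GP_WEIGHT = 10.0      # gradient penalty weight, as stated above
CRITIC_STEPS = 5      # discriminator updates per generator update

def gradient_penalty(discriminator, real, fake):
    """Push the critic's gradient norm towards 1 on random interpolations
    between real and generated images (WGAN-GP)."""
    batch = tf.shape(real)[0]
    alpha = tf.random.uniform([batch, 1, 1, 1], 0.0, 1.0)
    interpolated = alpha * real + (1.0 - alpha) * fake
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = discriminator(interpolated, training=True)
    grads = tape.gradient(scores, interpolated)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))

def critic_loss(real_scores, fake_scores, gp):
    # Wasserstein critic loss plus the weighted gradient penalty term
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores) + GP_WEIGHT * gp

gen_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.0, beta_2=0.9)
disc_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.0, beta_2=0.9)
```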

In Fig. 6, augmented images generated by generative adversarial networks, along with their corresponding masks, can be observed.

Fig. 7 U-Net architecture (Ronneberger et al., 2015)

4.4 Segmentation

4.4.1 General description

U-Net is a convolutional neural network that has been designed for biomedical image segmentation tasks (Ronneberger et al., 2015). It is a fully convolutional network with a modified architecture that allows it to work with fewer images in the training set while still producing accurate results. The diagram of the network layers is shown in Fig. 7.

The U-Net was trained to predict the masks of the tumours within the ultrasound images, utilizing the reference masks already available in the dataset as ground truth. This segmentation model was selected based on indications of its potential in previous research (Zhuang et al., 2019). However, those authors applied a flavour of UNet only on a balanced dataset, whereas the dataset involved in the current experiments is not balanced.

4.4.2 Segmentation Model

  1. Network architecture: The architecture of the U-Net initially captures a broad context of the input image through its contracting path, comprised of convolution layers that increase the depth of the network while reducing the spatial dimensions of the image, effectively expanding the feature representation. This expansion is achieved by applying a series of 3x3 convolutions, each followed by a rectified linear unit (ReLU), and 2x2 max pooling at each level, which doubles the number of feature channels while halving the image dimensions. In the expansive path, the process is reversed. Here, the network performs up-convolutions (also known as transposed convolutions or deconvolutions), which increase the spatial dimensions of the feature maps. These upsampled features are then concatenated with the corresponding feature maps from the contracting path, ensuring that fine-grained details are carried through to the final layers. This concatenation helps preserve important spatial information that might be lost due to pooling operations. As a result, the expansive path gradually decreases the number of feature channels while restoring the spatial dimensions, leading to the final segmentation map. Dropout is incorporated throughout the network to regularize the model and prevent overfitting, ensuring that the model generalizes well to new data.

  2. Training parameters: A small batch size of 16 is chosen, taking into consideration the relatively limited number of ultrasound images available. The model is trained over 50 epochs, a duration deemed sufficient to thoroughly learn the features present in the medical images. Each image is resized to a uniform dimension of 256x256 pixels, a necessary step as the model architecture is designed to process square images. The Adam optimizer is employed for its proven efficiency in handling sparse gradients, a common characteristic of binary masks, where the region of interest, mapped with ones, is significantly smaller than the background, mapped with zeros. The learning rate is fixed at 0.003, striking a balance between rapid convergence and stability. The loss function used is Binary Crossentropy, a commonly utilized metric in binary segmentation models. It measures the similarity between the predicted mask and the actual mask as a probabilistic distribution, effectively enabling the model to classify each pixel into one of two categories: zero for the background or one for the region of interest. A compact sketch of this configuration is given after this list.
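For illustration, a compact Keras sketch with the stated training parameters (Adam, learning rate 0.003, Binary Crossentropy, 256x256 inputs) is given below; it uses only two resolution levels for brevity, whereas the actual network is deeper, and the dropout placement and rate are assumptions.

```python
from tensorflow.keras import layers, models, optimizers

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU, as in the contracting/expansive paths
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    return layers.Conv2D(filters, 3, activation="relu", padding="same")(x)

def build_unet(input_shape=(256, 256, 1)):
    inputs = layers.Input(shape=input_shape)
    # Contracting path: convolutions + 2x2 max pooling
    c1 = conv_block(inputs, 64)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, 128)
    p2 = layers.MaxPooling2D(2)(c2)
    # Bottleneck with dropout for regularization (rate assumed)
    b = conv_block(layers.Dropout(0.3)(p2), 256)
    # Expansive path: up-convolutions + skip connections
    u2 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.concatenate([u2, c2]), 128)
    u1 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.concatenate([u1, c1]), 64)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)  # per-pixel mask
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.003),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_unet()
# model.fit(x_train, y_train, epochs=50, batch_size=16)
```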

4.5 Classification

4.5.1 General description

Convolutional Neural Networks (CNNs) are leveraged for classification to provide an analytical perspective similar to human visual perception. Just as the human eye deconstructs an image into smaller segments for analysis, a CNN applies a similar strategy through three fundamental steps: Convolution, Max Pooling, and Flattening.

Fig. 8 Basic principle of MaxPooling (Karpathy & Li, 2015)

Convolution layers act as feature detectors from the input image, while Max Pooling layers reduce the spatial size of these representations, as illustrated in Fig. 8. This process not only diminishes computational complexity but also helps in achieving translational invariance in feature detection. Flattening, depicted in Fig. 9, converts the pooled feature maps into a one-dimensional array, laying out the extracted features end-to-end. This flattened array then feeds into the dense layers of the network, where classification decisions are made based on the presence of learned features. These steps ensure that CNNs can process and interpret images effectively, leading to accurate image classifications.

Fig. 9 Basic idea of Flattening (AI, 2021)

4.5.2 Classification Model

  1. Network architecture: The classification model is structured as a sequential convolutional neural network, designed to process and classify 400x400 pixel ultrasound images. It starts with convolutional layers, each followed by max pooling to reduce dimensionality and dropout to prevent overfitting. The convolutional layers have 16, 32, and 64 filters, respectively, each employing the ReLU activation function for non-linearity. After convolution and pooling, the data is flattened and passed through a dense layer with 512 neurons, also activated by ReLU. The network’s final layer is a dense layer with 3 neurons, corresponding to the number of image classes, using the softmax activation function to output class probabilities.

  2. Training parameters: For training, the Adam optimizer is used due to its effectiveness in managing learning rates and enabling rapid convergence. The Categorical Crossentropy loss function guides the training, measuring the difference between the predicted class probabilities and the actual class distribution. The model trains over 30 epochs with a batch size of 32. This setup, including a limited number of epochs and a modest batch size, is chosen considering the dataset’s size and the goal of achieving efficient and effective training without overfitting. A sketch of this architecture and setup follows the list.
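The following Keras sketch matches the description above; the convolution kernel sizes and dropout rates are assumptions, as they are not stated in the text.

```python
from tensorflow.keras import layers, models

def build_classifier(input_shape=(400, 400, 1), num_classes=3):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Dropout(0.25),      # dropout rate assumed
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_classifier()
# model.fit(x_train, y_train, epochs=30, batch_size=32)
```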

5 Results

Two scenarios were investigated to identify the benign and malignant tumours:

  • a sequential approach - two models, one for segmentation and another one for classification, trained independently

  • a unified approach - an end-to-end model, composed of a segmentation block followed by a classification block, trained simultaneously.

5.1 Dataset Specification

The experiments were conducted on two distinct datasets. The first dataset consists of the original 780 ultrasound images from the Breast Ultrasound Images (BUSI) dataset, which includes 437 benign, 210 malignant, and 133 normal images. The second dataset expands on the first by including augmented images, resulting in a total of 3000 images, with each category (malignant, benign, normal) equally represented by 1000 images.

For both datasets, an 80/20 split was implemented for training and testing purposes. This split resulted in the following sample distributions:

  • For the first dataset, out of the total 780 images, 624 images were utilized for training and 156 images for testing.

  • For the second dataset, from the total of 3000 images, 2400 images were allocated for training while 600 were set aside for testing.
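For reference, such an 80/20 split can be reproduced with scikit-learn as sketched below; whether the original split was stratified by class is an assumption here.

```python
from sklearn.model_selection import train_test_split

# images: array of ultrasound images; labels: benign / malignant / normal
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42)
```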

5.2 Metrics

For segmentation, the Dice coefficient was chosen as the evaluation metric, considering the need to assess the similarity between the mask produced by the algorithm and the real mask from the initial dataset. The Dice coefficient is computed as twice the area of overlap divided by the total number of pixels in the two masks. Its value lies in the range [0, 1]; the closer this value is to 1, the greater the similarity between the two images.
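In symbols, for a predicted mask \(A\) and a ground truth mask \(B\), the metric can be written as:

$$\begin{aligned} \text {Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \end{aligned}$$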

Accuracy was utilized as an evaluation metric in the classification task. Accuracy, the percentage of the image labels that are classified correctly, has values in the range [0%, 100%]. The closer the value is to 100%, the more performant the classification algorithm is.

Additionally, the inference time was calculated for both segmentation and classification across three classes: benign, malignant, and normal. This represents the time required for the algorithm to apply the trained neural network model to new input data.

Table 3 The segmentation results in the case of two samples (one benign and one malignant)

Without formally analysing the computational complexity of the proposed algorithms, empirical processing times can be provided as a practical indicator of computational effort: the augmentation phase took approximately 3 hours for the entire dataset, segmentation training roughly 1 hour, classification training about 30 minutes, and end-to-end model training approximately 1-2 hours. All phases were executed on a P100 GPU, which reflects the practical computational requirements.

5.3 Experiment 1 - Segmentation

Initially, two segmentation models were trained, SegModel-A and SegModel-B, both based on the U-Net architecture. SegModel-A was trained using only the original images, while SegModel-B was trained using both original and augmented images.

The investigated models were trained using a variety of hyperparameters that were carefully chosen to optimize performance: 50 epochs, batch size of 16, Adam optimizer with a learning rate of 0.003, and the Binary Cross-Entropy loss function.

Table 4 Inference time - segmentation. The second column corresponds to SegModel-A, while the third column corresponds to SegModel-B

Segmentation accuracy in this study is defined as the proportion of pixels correctly classified as tumour or non-tumour in the segmentation output relative to the reference standard, which is the ground truth mask. Although accuracy is a common metric, it is acknowledged that in the context of medical image segmentation, where the region of interest such as a tumour might occupy a relatively small portion of the image, accuracy alone may not be sufficient to capture the effectiveness of the model. Therefore, alongside accuracy, the Dice coefficient is employed, as it is more indicative of the model’s performance by measuring the overlap between the predicted segmentation and the ground truth masks.
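Both metrics can be computed directly from binary masks, as in the following NumPy sketch (the function names are illustrative):

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, truth: np.ndarray) -> float:
    """Proportion of pixels labelled identically in both binary masks."""
    return float(np.mean(pred == truth))

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Overlap metric 2|A ∩ B| / (|A| + |B|) for binary masks."""
    intersection = np.sum(pred * truth)
    return float((2.0 * intersection + eps) / (np.sum(pred) + np.sum(truth) + eps))
```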

After training on benign and malignant BUS images, a segmentation accuracy of 94% was achieved by SegModel-B, and likewise 94% when training only on the original dataset (SegModel-A). However, it was noticed that the masks predicted by the model trained only on the original dataset (SegModel-A) are much more precise than those predicted by the model trained on original and augmented images (SegModel-B). Table 3 illustrates this difference for most of the segmentations estimated by these models. Additionally, a Dice score of 0.8911 was obtained for SegModel-B and 0.9015 for SegModel-A. Since the tumour occupies a very small area of the image, identifying it within a more restricted area does not radically change the Dice score. Therefore, the two values obtained are close and do not reflect the fact that the augmented segmentation is less precise.

In Table 3, the first three rows of images contain the original BUS images, the ground truth masks, and the predictions in the case of a benign tumour, while the next three rows contain the original BUS images, the ground truth masks, and the predictions in the case of a malignant tumour. The first column of images refers to the model trained on original and GAN-based augmented images (SegModel-B), while the second column refers to the model trained on original images only (SegModel-A). The Dice coefficients at the top refer to the average segmentation performance obtained on the test dataset.

The inference time for image segmentation was computed for both models. The inference time represents the time required for the algorithm to apply the trained model to new input data. The inference results were obtained using 10 random images from the augmented dataset (first row from Table 4) and 10 random images from the original BUSI-Dataset (W et al., 2020) (second row from Table 4) and computing their average inference times.

Fig. 10 Training and validation process - ClassifModel-A (using only original images)

5.4 Experiment 2 - Classification

Figure 10 presents the training and validation progress of ClassifModel-A, which is trained solely using original images. Throughout the 30 epochs, there is an observable trend where training accuracy significantly improves, and training loss decreases, reflecting the model’s capacity to learn effectively from the dataset. However, the validation accuracy and loss demonstrate some volatility, indicating the model’s challenge in generalizing to new data.

In contrast, Figure 11 showcases the enhanced performance of ClassifModel-B, which benefits from a richer training dataset that includes both original and synthetic images generated by a GAN. The training curve for ClassifModel-B reveals a steadier and more consistent improvement in accuracy and a more substantial reduction in loss, both for training and validation phases. This suggests that the inclusion of GAN-synthesized images contributes positively to the model’s ability to generalize, leading to better performance when compared to ClassifModel-A. The smoother convergence of ClassifModel-B, as evidenced by the less erratic and generally higher validation accuracy, underlines the value of diversifying the training dataset with additional synthetic data.

Table 5 Classification results on test set
Fig. 11 Training and validation process - ClassifModel-B (using both original and synthetic images)

After training the previously described CNN classifier (see Section 4.5) on the BUSI-Dataset (W et al., 2020), accuracy was used as an evaluation metric on test images. The results are presented in Table 5.

Furthermore, the inference time for classification in one of the three classes: benign, malignant, and normal, was computed. This represents the time required for the algorithm to apply the trained neural network model to new input data.

The columns of Table 6 represent the two classification models. The original model (ClassifModel-A) was trained on the original BUSI-Dataset (W et al., 2020), while the augmented model (ClassifModel-B) was trained on the original dataset combined with the augmented images. The inference results were obtained using 10 random images from the augmented dataset (first row of Table 6) and 10 random images from the original BUSI-Dataset (W et al., 2020) (second row of Table 6) and computing their average inference times.

Table 6 Inference time - classification. The second column corresponds to ClassifModel-A, while the third column corresponds to ClassifModel-B
Fig. 12 Architecture of the end-to-end model

5.5 Experiment 3 - Segmentation and Classification

To calculate the final accuracy of the algorithm, the segmentation model was merged with the classification model to obtain the end-to-end model. Below is the layer-wise representation of the end-to-end model, where ’Functional’ is the segmentation model and ’Sequential’ is the classification model (see Fig. 12). Again, two end-to-end models were trained: one using the original images only and another using both original and synthetic images; their performance is presented in Table 7.
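A minimal Keras sketch of this composition is given below; the intermediate layers connecting the two blocks are not specified in detail in the text, so a simple resizing layer is used here as a placeholder adapter.

```python
from tensorflow.keras import layers, models

def build_end_to_end(seg_model, clf_model, input_shape=(256, 256, 1)):
    """Chain the segmentation block ('Functional') and the classification
    block ('Sequential') into a single model."""
    inputs = layers.Input(shape=input_shape)
    mask = seg_model(inputs)               # predicted tumour mask
    x = layers.Resizing(400, 400)(mask)    # placeholder adapter to the classifier's input size
    outputs = clf_model(x)                 # benign / malignant / normal probabilities
    return models.Model(inputs, outputs)
```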

Table 7 illustrates the performance of the end-to-end model, which is composed of two components: the segmentation and classification models. The classification models are represented by the rows, while the columns display the segmentation models. Each of these models has two scenarios: training on either an augmented or original dataset. For example, the result of 68.7% was obtained in the test process of the end-to-end model using the classification model trained on the original BUSI (W et al., 2020) dataset and the segmentation model trained on the augmented dataset.

The results of the end-to-end model are slightly inferior to those obtained in the sequential scenario, where the input of the classification model is the real ground truth mask. However, this result is a consequence of the fact that the predicted mask overlaps the real one in a proportion of only about 90%. Table 8 provides a comparative analysis of studies employing the end-to-end approach, encompassing the research detailed in this article as well as findings from other studies. It focuses on aspects such as the datasets utilized, the techniques implemented, and the metrics used to assess performance. This comparison seeks to underscore the variety of approaches in the field, illustrating how different strategies influence the overall performance of the models.

6 Discussion

In the current study, an end-to-end model for breast cancer identification in BUS images has been developed, aiming to address the well-known efficiency issue introduced by the pipeline of detection and characterization of breast cancer in images (RQ\(_1\)). The study also investigated the impact of dataset quality and size on the training process (RQ\(_2\)).

Regarding the first RQ, the obtained results revealed that the end-to-end approach performed best (both in terms of quality and speed) across all test images. Being an end-to-end system, the training of the decision core algorithm benefits simultaneously from both cost functions that measure the quality of predictions in terms of lesion segmentation and lesion discrimination. Furthermore, the entire learning procedure behind the AI algorithm is agnostic to the input type or size. The current results are obtained on 2-dimensional B-mode ultrasound images, with new experiments planned on SWE ultrasound images and 3D tomosynthesis images. The ground truth data required by such a system must include two annotations for every breast image: the location of the lesion and its type; to the best of our knowledge, such datasets are not available apart from the investigated one. The absence of other doubly annotated datasets limits the possibility of training a performant lesion detector/classifier. The augmentation methods helped to overcome this constraint: the newly generated data, similar to the original samples, influenced the performance of the intelligent system. Furthermore, by using an augmentation method, it is possible to balance the dataset so as to obtain a uniform distribution of samples over lesion position and type.

In response to the second RQ, it was observed that the manual generation of new data (e.g., by rotating or translating an image) positively impacts the quality of breast cancer identification, while the automatic generation (e.g., by using a Generative Adversarial Network) was not as useful. In the latter scenario, better results could be obtained by finer tuning of the GAN's parameters.

Furthermore, when the proposed model is compared with others from the recent literature, some common elements can be noticed, but also some new features. In Sahu et al. (2023, 2024), the authors proposed a hybrid framework for training two stand-alone classification models, so the inference stage is considerably more expensive. The advantage of the system proposed in this paper is precisely that a single (end-to-end) model learns everything, so the inference is faster as well (it saves time). In addition, those authors framed the issue as a binary classification problem, whereas the proposed model advances this approach by simultaneously predicting both the class and the location of the lesion. This dual-prediction model offers computational benefits, as it determines both outcomes (the lesion's location and type) within a single computational process. Moreover, the proposed model is theoretically capable of classifying lesions into multiple categories, not just two.

The current results have several limitations. First, all the data used for validation was provided by retrospective and single-center medical studies. Therefore, a prospective multi-center analysis should be carried out to investigate the power of predictive models. Second, only ultrasound images have been included in the test data. Other medical image modalities (such as mammographic images or tomosynthesis images) could be worth exploring.

Table 7 Accuracy comparison on 4 end-to-end models
Table 8 Comparison of end-to-end models trained on the BUSI dataset

The design and validation of a novel end-to-end model specifically tailored to BUS images indicate that the results of this study mark an important advance in the field of breast cancer diagnosis. Compared to the recent literature, the proposed model offers several advantages: it is a single end-to-end model, learning both segmentation and classification simultaneously; its inference is faster, saving time compared to hybrid frameworks with separate models; and it predicts both the class and the location of the lesion, offering flexibility for classifying lesions into multiple categories.

7 Conclusions and Future Work

In this work, a novel approach was proposed for breast cancer recognition; it leverages Generative Adversarial Networks (GANs) to address limited and imbalanced public datasets in breast cancer diagnosis. A systematic investigation analysed the impact of data quality and quantity on the performance of the proposed end-to-end model by prioritizing computational efficiency while maintaining accuracy and by considering the sensitive nature of breast cancer diagnosis. A rigorous evaluation of the proposed end-to-end model was conducted by validating its performance on an established benchmark dataset.

The performed experiments addressed the objectives established for this study. The findings suggest that ultrasound imaging can serve as a valuable adjunct to mammography in breast cancer screening.

Based on the conducted analysis, it was found that an end-to-end model is highly effective for analysing ultrasound images of the breast. By incorporating both segmentation and classification into a single pipeline, highly accurate results were achieved in identifying and diagnosing breast lesions. Compared to traditional methods of breast cancer screening, an end-to-end model offers several advantages, including improved speed and efficiency, as well as the ability to automatically learn and adapt to new data.

Overall, the obtained results suggest that an end-to-end model is a highly promising approach for ultrasound imaging in breast cancer screening. By continuing to refine and develop this methodology, the potential exists to further improve the accuracy and efficacy of breast cancer diagnosis, ultimately leading to better outcomes for patients. The current accuracy of 86% on the test set, while not clinically sufficient, serves as an encouraging baseline for this proof-of-concept study. The segmentation model has performed well, with a Dice coefficient of 0.90, showing promise in the accurate delineation of regions of interest. Additionally, the successful generation of synthetic ultrasound images is a significant achievement that will support the training of more advanced classification models in future work. A classification model with improved performance is recognized as necessary and is identified as a principal focus for ongoing development. Given the encouraging outcomes of the segmentation model and the creation of synthetic images, considerable enhancements are expected in future versions of the comprehensive model.

Reflecting on the work of Sahu et al. (2023), it becomes evident that the journey of breast cancer detection and diagnosis is continuously evolving. Looking ahead, the future of this domain holds promising avenues:

  • Exploring GANs for data augmentation: Advancing data augmentation techniques using GANs, especially those able to capture textural features, could significantly enhance the discrimination between benign and malignant lesions.

  • Improving model accuracy: Exploring different neural network architectures and fine-tuning hyperparameters might yield improvements in model accuracy.

  • Dataset diversification: Expanding the dataset to include more diverse images, such as mammography, MRI, or CT scans, and images from different hospitals, countries, and ethnicities to make it more representative.

  • Integration with multiple modalities: Incorporating additional data types such as clinical data, genomics, or proteomics alongside medical imaging might offer a more holistic approach to diagnosis.

  • Deployment in clinical settings: The development of a user-friendly interface that allows radiologists to upload medical images and receive automated predictions, coupled with integration into Electronic Health Records (EHRs), could streamline the diagnostic process, making it more efficient and accessible.