Introduction

Breast cancer is a leading cause of mortality for women globally. According to World Health Organization (WHO) data, 2.3 million women were diagnosed with breast cancer in 2020, and 685,000 died from the disease [1]. When breast cancer is identified at an early stage, before it has grown significantly or spread, the chances of effective treatment are higher. The American Cancer Society reports that for breast cancer caught in its early, localized phase, the 5-year relative survival rate is an impressive 99% [2]. The most reliable strategy for detecting breast cancer early is regular screening examinations. Screening consists of tests and exams aimed at identifying disease in individuals without any manifest symptoms. Routine mammograms are pivotal for the early identification of breast cancer, maximizing the chances of successful treatment [2].

Mammograms are performed with a machine designed specifically to image breast tissue. The machine uses lower-dose x-rays than those used to examine other regions of the body, such as the lungs or bones. Two plates on the mammography machine compress, or flatten, the breast to spread the tissue apart, producing a higher-quality image while using less radiation [2]. Mammograms of each breast are taken from two different views: mediolateral oblique (MLO) and craniocaudal (CC). The CC view is a top-down view of the breast, whereas the MLO view is taken from an angled perspective. Radiologists prefer the MLO view because it shows more of the breast features in the upper outer quadrant [3]. While mammograms are universally employed regardless of the severity of breast cancer, breast magnetic resonance imaging (MRI) is typically reserved for detecting more advanced breast cancer; it can sometimes yield misleading results, indicating cancer even when none is present [3].

Diagnosing tumors is a time-consuming process and often poses challenges for radiologists interpreting medical images due to the interference of noise, artifacts, and intricate structures [4]. This challenge is intensified by a worldwide scarcity of radiologists and medical professionals skilled in analyzing screening data, especially in under-served areas and developing nations [5]. Swift patient care encompassing screening, diagnosis, and treatment is crucial, as time plays a pivotal role in saving lives when it comes to breast cancer. Beyond the time aspect, there are instances where mammography might misread certain tumors, leading to false negatives. This can delay the detection and subsequent treatment of cancer. To address this, Computer Aided Diagnosis (CAD) was developed to serve as a supplementary opinion to the radiologist’s judgment.

The use of imaging technology is crucial for the early diagnosis of breast cancer, a process that encompasses phases from screening and early detection to subsequent diagnosis and treatment. Various modalities, such as mammography, contrast-enhanced mammography, microwave imaging, optical imaging, ultrasonography, magnetic resonance imaging (MRI), and nuclear medicine, are employed in the detection of abnormalities [6,7,8,9]. The integration of intelligent systems capable of promptly and accurately identifying and diagnosing abnormalities is increasingly vital. Deep learning, a subset of machine learning algorithms, operates directly on images, autonomously selecting optimal features without human intervention. These algorithms are rapidly being adopted in medical image processing, especially in mammography. They enhance traditional computer-aided diagnosis systems, aiming to maximize the performance of radiologists [10,11,12].

We used DICOM (Digital Imaging and Communications in Medicine) images in our study, which offer several advantages over PNG (Portable Network Graphics) images when used in deep learning applications. Firstly, DICOM images preserve the highest quality of pixel data, ensuring that no critical information is lost during image processing. This high-quality representation enhances the performance and accuracy of deep learning algorithms, leading to more reliable results [13]. Additionally, DICOM files contain extensive metadata relevant to the patient, study, and other contextual information. This rich set of metadata provides valuable additional information for analysis and interpretation, enabling more comprehensive and accurate deep learning models [14]. By leveraging these benefits, DICOM images prove to be superior to PNG images in deep learning applications for radiological studies.
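As a minimal illustration of these advantages, the sketch below reads a DICOM file with pydicom (our assumption; the paper does not name its DICOM reader) and accesses both the full bit-depth pixel data and the accompanying metadata. The file name is purely illustrative, and the tags shown are standard DICOM attributes whose presence depends on the file.

```python
import pydicom

# Load a DICOM mammogram; the path is illustrative only.
ds = pydicom.dcmread("mammogram.dcm")
pixels = ds.pixel_array  # full bit-depth pixel data, unlike an 8-bit PNG export
# Standard DICOM metadata tags travel with the image (presence depends on the file):
print(ds.Modality, ds.StudyDate, ds.PatientID)
```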

Before implementing an intelligent system, it is crucial to pre-process mammographic scans effectively to improve the quality of the imaging outputs [5]. One of the key factors affecting image quality is noise; any unwanted information in an image is referred to as noise, and mitigating its detrimental impact is a critical step in biomedical image processing [15]. Image processing techniques, such as image enhancement, restoration, analysis, and compression, can be applied to the input data to improve performance and reduce noise and distortion during analysis [16].

Mammography has long been viewed as a potent method for screening breast cancer. Yet, its effectiveness is compromised in certain cases due to limitations like reduced sensitivity towards dense breasts and limited contrast features [17]. Breast density is determined by the ratio of fibrous and glandular tissue to fatty tissue in the breast. A higher proportion of fibrous and glandular tissue indicates denser breasts. It is common for about half of women to have dense breasts, and women with dense breasts face a marginally increased risk of breast cancer. The presence of dense tissue complicates tumor detection in mammograms, as both tumors and fibrous/glandular tissue appear white, potentially obscuring concerning findings. There is ongoing debate among experts regarding the additional tests needed for women with dense breasts who do not have elevated breast cancer risk due to factors like genetics or family history [2]. To improve tumor detection accuracy, image contrast enhancement becomes essential.

In our study, we employed automated mass segmentation in full-field mammograms to address the lack of significant mass-boundary information in mass detection. This approach offers practical benefits for detecting and diagnosing breast cancer by eliminating the need for manual extraction of regions of interest (ROIs) by radiologists; automating this process minimizes time-consuming and tedious tasks, making detection and diagnosis more efficient. Building on the advantages of artificial intelligence (AI) and deep learning techniques for automatic breast-screening tasks, we propose a novel framework for automatic breast cancer segmentation that identifies lumps for effective mass diagnosis in mammography. The proposed framework achieves desirable segmentation results through several pre-processing steps: removing artefacts, applying the Li algorithm to minimize the cross-entropy between the foreground and the background, enhancing contrast using contrast-limited adaptive histogram equalization (CLAHE), normalizing, and median filtering. Effective augmentation approaches are used to address the issues of over- and under-fitting. Afterwards, a convolutional encoder-decoder network based on U-Net is used to accurately identify suspicious masses. The contributions of this work are summarized as follows:

  • Various pre-processing steps are performed to accurately segment the breast masses to achieve desired results.

  • The proposed framework is tested on two different datasets, INbreast and CBIS-DDSM.

  • Extensive comparisons are presented between the proposed model and the state-of-the-art techniques.

Related works

Over the past few years, deep learning methodologies have surged in prominence within the domain of medical imaging [18,19,20], particularly in breast cancer diagnosis [21]. Numerous research endeavors have showcased the prowess of deep learning algorithms in pinpointing and categorizing breast abnormalities across mammograms, breast thermograms [22], ultrasound images [23], and MRI scans [24]. A notable strength of deep learning frameworks is their capacity to assimilate vast data volumes, enhancing their precision and adaptability. Further, the concept of transfer learning has gained traction; this technique involves priming a model on an expansive dataset before refining it with a smaller, task-specific dataset. Predominantly, the encouraging outcomes from such research underscore the transformative potential of deep learning in refining the diagnostic accuracy for breast cancer, setting the stage for improved patient prognoses. As we delve into related works, the application of deep learning models for breast cancer segmentation stands out as a critical area of exploration.

Sun et al. [25] presented the Attention-guided Dense Upsampling segmentation network (AUNet). This technique aims to harness the most pertinent details from both high-level and low-level feature sets. In their approach, they employed bilinear upsampling followed by a convolution layer; a batch normalization layer was then used as a substitute for the dense upsampling convolution. Notably, they refrained from applying any image augmentation or processing techniques. Their evaluations, employing 5-fold cross-validation, yielded a Dice score of 79.1% for INbreast and 81.8% for CBIS-DDSM. However, cross-validation can sometimes result in overfitting on validation sets, especially when the fold count is small. In contrast, our proposed method achieved a Dice score of 87.98% for CBIS-DDSM and 85.61% for INbreast. Our training exclusively utilized the CBIS-DDSM dataset, and testing was conducted on 20% of CBIS-DDSM and all mass images from the INbreast dataset. This suggests that our strategy potentially offers better generalization across varied mammogram datasets.

Hou et al. [26] introduced MTLNet, an attentive multi-task learning network for breast cancer segmentation. By integrating group convolution and attention mechanisms, the model optimizes feature-learning, reducing redundancy. Their model’s effectiveness was validated through a five-fold cross-validation. However, incorporating regularization techniques like dropout and weight decay could potentially enhance its performance on unseen data, suggesting avenues for future refinement.

Yan et al. [27] proposed an advanced deep detection system for automated mass localization that leverages a multi-scale fusion technique. For accurate mass delineation, they employed a convolutional encoder-decoder framework enriched with hierarchical and dense skip linkages. By benchmarking various architectures such as U-Net, CGAN, cascaded U-Net, and v19U-Net++, they observed that the collective average Dice score, when trained on both CBIS and INbreast, amounted to 80.44% for INbreast test visuals.

Li et al. [28] introduced a design where a densely connected convolutional network serves as the encoder and a U-Net with integrated attention gates functions as the decoder. The encoder’s output is transformed into a gating signal vector to filter out irrelevant and noisy responses. Their method attained an F1 score of 82.24% on the DDSM dataset. In contrast, our model recorded an F1 score of 87.98% on the CBIS-DDSM dataset. While they relied solely on the DDSM dataset for both training and testing, there’s a potential benefit in validating their approach across diverse datasets.

In their research, Zeiser et al. [29] utilized the U-Net framework to identify masses in mammograms, enhancing their approach with various data augmentation techniques such as extraction from nine specific regions of interest, image zoom, and horizontal image inversion. Using the DDSM dataset for training and testing, they reported a Dice score of 79.39%, emphasizing the synergistic benefits of U-Net combined with data augmentation for mammogram analysis. Yet, their model encountered difficulties discerning masses with densities similar to the surrounding breast tissue. Their exclusive dependence on the DDSM dataset for training (70%), validation (10%), and testing (20%) suggests the necessity for broader dataset validations. Interestingly, while they trained their model for 140 epochs, our proposed approach, trained for only 100 epochs, secured a Dice score of 87.98% on the CBIS-DDSM dataset.

Exploring different avenues, Vidal et al. [30] and Singh et al. [31] introduced unique lesion extraction and segmentation techniques. Vidal et al. combined multiple input-connected U-Net models in an ensemble approach, testing on the TCGA-BRCA database. Singh et al., on the other hand, employed max-mean and least-variance models, utilizing morphological operations and image gradient methods for tumor detection.

Further adding to the diversity of methodologies, Khan et al. [32] segmented breast cancer images using color thresholding and unsupervised segmentation, while another study [33] used a region-growing approach with dragonfly optimization for seed point determination in cancer detection. Wang et al. [34] presented a model combining principal component analysis with U-Net for hyperspectral image segmentation, achieving an impressive accuracy of 87.14%.

For MRI and ultrasound image analysis, Rahman et al. [35] and Byra et al. [36] contributed novel techniques. Rahman et al. enhanced MRI image quality using mean and Gaussian filters, while Byra et al. introduced a Selective Kernel U-Net for ultrasound image segmentation. Lastly, Ilesanmi et al. [37] improved breast ultrasound segmentation using contrast-constrained histogram equalization.

The integration of machine intelligence in supply chain management, as discussed by Myvizhi et al. [38], parallels the use of deep learning in medical imaging. Similarly, the challenges and opportunities in sustainable transportation outlined by Nabeeh et al. [39] reflect the balance of efficiency and accuracy in healthcare, akin to how advanced algorithms guide breast cancer segmentation. Additionally, Sallam’s et al. [40] insights into IoT in supply chain management are comparable to the use of interconnected technologies in healthcare, highlighting shared concerns like data security and interoperability.

In summary, these studies collectively underscore the importance of AI and ML in specialized domains and their role in enhancing the precision and effectiveness of medical applications like breast cancer segmentation.

Table 1 Pros and limitations of some breast cancer segmentation models

Materials and methodology

The suggested workflow for the segmentation process is illustrated in Fig. 1, with each stage represented as individual blocks.

Fig. 1 Working flow of the entire segmentation process

Datasets

INbreast and CBIS-DDSM are two open datasets used to validate the proposed model. CBIS-DDSM [43, 44], a recently updated version of the DDSM (Digital Database for Screening Mammography), includes images in DICOM format. The CBIS-DDSM dataset is extensive, with a size surpassing 160 GB and image resolutions of roughly 4000 \(\times \) 6000 pixels. Additionally, it encompasses pixel-level annotations for masses and specific lesion pathology labels. Each view from CBIS-DDSM, including the mediolateral oblique (MLO) and craniocaudal (CC) views, was used as a separate image. In our study, the binary segmentation masks from the CBIS-DDSM dataset were used to extract the ROIs. The dataset, comprising a total of 1588 images, was divided into training and testing sets according to the BIRADS category. For this division, 20% of the cases, amounting to 358 images, were allocated to the testing set, with the remaining cases used for training. In Fig. 2, the top left subplot illustrates the distribution of BIRADS categories across the dataset, segmented into 'Test' and 'Training' groups. The top right subplot displays the distribution of cases by breast side. The bottom left subplot shows the distribution of breast density categories, a critical factor in mammographic diagnosis. Finally, the bottom right subplot presents the count of the different mammographic image views (CC and MLO), which are essential for a comprehensive evaluation of breast tissue. Together, these visualizations offer a multifaceted view of the dataset's composition, with a clear distinction between testing and training instances, facilitating an understanding of the data's structure and potential biases.

The INbreast dataset [45, 46] comprises 410 mammographic images sourced from 115 individuals at the Breast Centre of St. John's Hospital in Porto, Portugal. Ninety of these cases are from women for whom both breasts were imaged. However, only the 107 images with masses and precise expert delineations are used as the testing set. The images in this database include two views, MLO and CC, are stored in DICOM format, and measure 3328 \(\times \) 4084 or 2560 \(\times \) 3328 pixels. In Fig. 3, the top left panel displays the distribution of BIRADS assessment categories, indicating the range of breast cancer risk as interpreted by radiologists. The top right panel shows the distribution of the mammographic images by breast side, denoting the prevalence of cases in either the left or right breast. The bottom left panel represents the distribution of breast tissue density, with categories ranging from A (almost entirely fatty) to D (extremely dense). The bottom right panel illustrates the distribution between the two types of mammographic views, CC and MLO, used in the screening process. These distributions provide insight into the dataset's composition and can help in understanding the variety of cases analyzed in the study.

Fig. 2 Comparative analysis of CBIS-DDSM dataset characteristics by dataset type

Fig. 3 Overview of INbreast dataset characteristics

Image pre-processing

In this section, we illustrate how to improve the quality of mammograms by removing background artefacts and enhancing image contrast, as illustrated in Fig. 4, so that a model can learn features and patterns more easily from the enhanced information. Achieving strong performance from any model requires removing artefacts first. A visual inspection of the raw mammograms, as shown in Fig. 5, reveals the following flaws:

  1. Bright white borders or corners.

  2. Artefacts floating around, such as letters and labels.

  3. The orientation in which the breasts face varies.

  4. Low sensitivity to dense breasts.

  5. The images have no fixed size.

Fig. 4 Pipeline diagram of image pre-processing

Fig. 5 Sample images of the CBIS-DDSM dataset

Normalize images

Normalizing data is essential to ensure a uniform distribution for every input parameter, in this context each pixel, and it helps expedite the model's training convergence. Normalization is achieved by subtracting the mean from each pixel and then dividing by the standard deviation, resulting in data distributed around a Gaussian curve centered at zero. However, since image inputs require positive pixel values, the adjusted data is rescaled to fall within the [0, 1] range.
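A minimal NumPy sketch of this normalization; the function name and the small epsilon guard against division by zero are our own additions:

```python
import numpy as np

def normalize(image: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance standardization, then rescaling to [0, 1]."""
    image = image.astype(np.float32)
    image = (image - image.mean()) / (image.std() + 1e-8)   # centered Gaussian distribution
    return (image - image.min()) / (image.max() - image.min() + 1e-8)  # rescale to [0, 1]
```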

Right orientation mammogram

The orientation in which the breasts face varies: some face left, while others face right. This is a problem since it may complicate image preprocessing. We must first right-orient the images so the algorithm can generalize across all mammograms, as in Fig. 6. We simply compare the number of nonzero pixels in each half of the image to detect left-oriented breasts. This is a very basic method of detecting orientation, and it works because the background pixels are completely black, giving us a sense of where the breast lies in each half of the image.
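A sketch of this orientation check, assuming the target convention is a left-facing breast (the paper does not state which side it standardizes on):

```python
import numpy as np

def right_orient(image: np.ndarray) -> np.ndarray:
    """Flip the mammogram horizontally if the breast sits on the right half."""
    mid = image.shape[1] // 2
    left_pixels = np.count_nonzero(image[:, :mid])
    right_pixels = np.count_nonzero(image[:, mid:])
    # The background is fully black, so the half with more nonzero pixels holds the breast.
    return np.fliplr(image) if right_pixels > left_pixels else image
```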

Fig. 6 Right orientation mammogram

Removing artefacts

Unwanted items or areas can arise in images by accident, as demonstrated in Fig. 5, with artefacts floating around in the background. In practice, radiologists use these artefacts to distinguish between the left and right breasts and to determine the scan's orientation; for example, "LCC" stands for "left breast, CC orientation". Also, some of the margins have bright white borders or corners. Because any model may learn these borders as a feature of the images, they can generate an arbitrary edge. Removing this border is critical, since the tumor's intensity and that of the border are nearly identical, potentially affecting the performance of any model.

Thresholding algorithms

Masks are the most common way to remove or select certain areas of an image. An image is transformed into a binary image based on a pixel-intensity threshold in order to isolate the particular parts that are of interest to us. We use an image thresholding technique to construct a mask. Masks are typically formed by performing one or more logical operations on an image.

The most basic thresholding method employs a manually defined image threshold. An automated threshold method, on the other hand, determines the value more consistently than the human eye and can easily be repeated. We experiment with several thresholding algorithms to see which strategies work best.

As shown in Fig. 7, the Li and triangle techniques appear to perform well on our sample image; the other outcomes in this scenario are significantly worse. To transform our image into a binary image, we apply Li thresholding.
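One way to reproduce this kind of comparison is scikit-image's try_all_threshold, which applies several global thresholding algorithms (including Li and triangle) to a single image, assuming scikit-image is available. The stand-in sample image below would be replaced by a pre-processed mammogram:

```python
import matplotlib.pyplot as plt
from skimage import data
from skimage.filters import try_all_threshold

image = data.camera()  # stand-in image; a mammogram would be used in practice
fig, axes = try_all_threshold(image, figsize=(8, 6), verbose=False)
plt.show()
```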

Li Thresholding algorithm

In 1993, Li and Lee introduced an innovative technique to ascertain the ideal threshold to differentiate the image’s foreground from its background. They aimed to reduce the cross-entropy between the two means, which consistently provided optimal thresholding results. Before 1998, the method of determining the best threshold involved evaluating all potential thresholds and opting for the one yielding the least cross-entropy. However, Li and Tam later developed an iterative method, rooted in the gradient of the cross-entropy function, to expedite the identification of this threshold [47].

Mathematically, consider T as the threshold, and F(T) and B(T) as the respective means of the image’s foreground and background. The iterative equation to deduce the optimal threshold can be depicted as:

$$\begin{aligned} T_{n+1} = \frac{F(T_n) + B(T_n)}{2} \end{aligned}$$
(1)

Here, \(T_n\) stands for the threshold during the nth iteration. This equation progressively adjusts the threshold by computing the mean of the current foreground and background values. The iterations persist until the threshold stabilizes at a value that optimally reduces the cross-entropy function.
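A direct NumPy transcription of the iteration in Eq. (1); the initial guess and convergence tolerance are our own choices, and in practice scikit-image's threshold_li offers a vetted implementation of the Li method:

```python
import numpy as np

def li_iterative_threshold(image: np.ndarray, tol: float = 0.5) -> float:
    """Iterate T_{n+1} = (F(T_n) + B(T_n)) / 2 until the threshold stabilizes."""
    t = float(image.mean())  # initial guess
    while True:
        foreground = image[image > t]   # pixels above the current threshold
        background = image[image <= t]  # pixels at or below it
        t_next = (foreground.mean() + background.mean()) / 2.0
        if abs(t_next - t) < tol:
            return t_next
        t = t_next
```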

Fig. 7 Thresholding algorithms

Applying mask

The label function of the SciPy library returns an array in which each object in the input is assigned an integer index. It returns a tuple containing the array of object labels and the number of objects detected, unless the output parameter is specified, in which case only the number of objects is returned. A structuring element determines how the objects are connected. To provide two-way connections, the structuring matrix must be centrosymmetric, i.e., symmetric about its center [48].

Any non-zero input value is considered to be part of an object. A structuring element with squared connectivity equal to one is used. By experimenting with different label values to select breast pixels, we sample the label at half of the first dimension and the left quarter of the second dimension of the mammogram image, and set any pixel not connected to that label to zero. Figure 8 shows the result of removing artefacts after applying the mask.
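A sketch of this masking step with scipy.ndimage.label; the sampling point for the breast label follows the description above, and the helper name is hypothetical:

```python
import numpy as np
from scipy import ndimage

def keep_breast_region(binary: np.ndarray) -> np.ndarray:
    """Zero out every connected component except the one containing the breast."""
    structure = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]])  # squared connectivity of one, centrosymmetric
    labels, num_objects = ndimage.label(binary, structure=structure)
    # Sample the label at half the first dimension and the left quarter of the second.
    breast_label = labels[binary.shape[0] // 2, binary.shape[1] // 4]
    return labels == breast_label
```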

Fig. 8 After removing artefacts

Enhancing contrast

To improve tumor detection accuracy, image contrast enhancement becomes essential. The Contrast Limited Adaptive Histogram Equalization (CLAHE) technique augments local contrast in specific image areas. Compared to alternative Histogram Equalization techniques, CLAHE not only evens out the histogram but also maximizes entropy, making it particularly apt for medical imaging [17]. However, CLAHE’s efficacy largely hinges on the clip limit, governing the amplification of noise in images. An improperly set clip limit might degrade image quality and introduce noise [49].
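A hedged OpenCV sketch of CLAHE; the clip limit and tile grid shown are common defaults rather than the paper's settings, and the 8-bit rescaling is our own preparatory step:

```python
import cv2
import numpy as np

def enhance_contrast(image: np.ndarray, clip_limit: float = 2.0) -> np.ndarray:
    """Apply CLAHE after rescaling to 8-bit; clip_limit caps noise amplification."""
    img8 = cv2.normalize(image, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    return clahe.apply(img8)
```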

Data augmentation

To combat overfitting, we employed data augmentation methods to expand our training dataset. It is crucial to select appropriate augmentations that generate realistic images beneficial for our task.

In this study, we created augmented images by randomly flipping along the horizontal and vertical axes with 50% probability each and randomly adjusting brightness by up to 30%. The binary images representing the mass segmentation masks were cropped, resized, and transformed to match their corresponding mammograms.
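An eager-mode TensorFlow sketch of these augmentations; applying identical flips to image and mask while restricting brightness changes to the image is our reading of the setup:

```python
import tensorflow as tf

def augment(image: tf.Tensor, mask: tf.Tensor):
    """Random flips (p = 0.5 per axis) shared by image and mask; brightness on image only."""
    if tf.random.uniform(()) < 0.5:
        image = tf.image.flip_left_right(image)
        mask = tf.image.flip_left_right(mask)
    if tf.random.uniform(()) < 0.5:
        image = tf.image.flip_up_down(image)
        mask = tf.image.flip_up_down(mask)
    image = tf.image.random_brightness(image, max_delta=0.3)  # up to ±30% brightness shift
    return tf.clip_by_value(image, 0.0, 1.0), mask
```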

Network architecture

In 2015, Ronneberger et al. [50] introduced the U-Net architecture for semantic segmentation, aimed specifically at biomedical image processing. As depicted in Fig. 9, U-Net is composed primarily of an encoding and a decoding block. The encoder, through convolutional and max pooling layers, extracts image features. After feature acquisition in the encoder, the decoder upsamples these feature maps, integrates them with skip pathways from the encoder, and processes them through two 3 \(\times \) 3 convolution layers, each followed by a ReLU activation function. A final convolution layer then assigns a probability to every pixel, and a pixel-wise sigmoid classifier refines the output. This design allows U-Net to utilize features learned at all depths, generating a full-resolution segmentation map, since initial CNN layers often grasp basic features while deeper layers capture more advanced ones.

Fig. 9 U-Net architecture

For our decoder, we employed transposed convolution layers, enlarging the feature maps while halving the channel count. After the convolution layers, we integrated 0.5 dropout for enhanced network stability. The output of the encoder's corresponding stage is added to the transposed convolution's output at every stage. Padding is applied so that the output segmentation map matches the input image's dimensions.
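A Keras sketch of one decoder stage as described; the skip connection is merged with an element-wise addition, matching the wording above, though concatenation is the more common U-Net variant, and the function name is our own:

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_stage(x: tf.Tensor, skip: tf.Tensor, filters: int) -> tf.Tensor:
    """Upsample with a transposed convolution, add the encoder skip, then refine."""
    x = layers.Conv2DTranspose(filters, kernel_size=2, strides=2, padding="same")(x)
    x = layers.Add()([x, skip])  # element-wise merge with the encoder feature map
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Dropout(0.5)(x)  # 0.5 dropout for stability, as described above
```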

We employed the Adam optimizer [51] with an initial learning rate of 1e-3 and a momentum term (\(\beta _1\)) of 0.9 to minimize the Dice loss during training. Training spanned 100 epochs, closely monitoring the Dice score on the validation dataset. In each epoch, every image from the training set was used once, with a batch size of 24. The input mammogram images were resized to 224 \(\times \) 224 pixels. Our model's development, training, and testing were carried out using Python 3.9 and TensorFlow.
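A sketch of the Dice loss and optimizer settings described above; the smoothing constant is our own guard against division by zero, and `model`, `train_images`, `train_masks`, and `val_data` in the commented usage are assumed to exist:

```python
import tensorflow as tf

def dice_loss(y_true: tf.Tensor, y_pred: tf.Tensor, smooth: float = 1.0) -> tf.Tensor:
    """1 - Dice score, computed over flattened binary masks."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9)
# model.compile(optimizer=optimizer, loss=dice_loss)
# model.fit(train_images, train_masks, epochs=100, batch_size=24, validation_data=val_data)
```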

Evaluation metrics

In this study, a number of metrics at the region level and the pixel level were utilized to evaluate how well the proposed approach works. Pixel-level measurements include precision, sensitivity, and F1 score. Precision measures the proportion of pixels predicted as mass that actually belong to the ground-truth mass region. Sensitivity (recall) measures the proportion of ground-truth mass pixels that are correctly predicted. The F1 score is the harmonic mean of precision and sensitivity; it factors in both the count of pixels correctly identified as mass and those mistakenly labeled as normal.

True positive (TP) denotes the count of pixels accurately classified as mass. True negative (TN) represents the number of pixels correctly identified as not being part of the mass. False positive (FP) indicates the count of pixels that are mistakenly classified as mass, even though they are normal. False negative (FN) signifies the number of mass pixels incorrectly labeled as normal.

Under-segmentation occurs when the predicted mass is too small; because there are many FN pixels, sensitivity is low in this condition. Over-segmentation occurs when the model segments normal regions as mass; because there are many FP pixels, precision suffers.

$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(2)
$$\begin{aligned} Sensitivity=\frac{TP}{TP+FN} \end{aligned}$$
(3)
$$\begin{aligned} F1\,Score =2\times \frac{Precision\times Recall}{Precision+Recall} = \frac{TP}{TP+\frac{1}{2}(FP+FN)} \end{aligned}$$
(4)
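Equations (2)-(4) translate directly into NumPy for boolean masks; the small epsilon is our own guard against division by zero on empty predictions:

```python
import numpy as np

def pixel_metrics(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-8):
    """Precision, sensitivity and F1 score from boolean ground-truth/prediction masks."""
    tp = np.sum(gt & pred)   # mass pixels predicted as mass
    fp = np.sum(~gt & pred)  # normal pixels predicted as mass
    fn = np.sum(gt & ~pred)  # mass pixels predicted as normal
    precision = tp / (tp + fp + eps)
    sensitivity = tp / (tp + fn + eps)
    f1 = tp / (tp + 0.5 * (fp + fn) + eps)
    return precision, sensitivity, f1
```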

Region-level metrics evaluate the area and boundary of the predicted regions. The regions in the resulting prediction are evaluated using the Hausdorff distance, Jaccard coefficient, and Dice score. The Jaccard coefficient and Dice score calculate how much the predicted mass and the ground truth overlap: a value of 0 means there is no overlap between prediction and ground truth, while a score of 1 denotes a perfect match. The Hausdorff distance is the longest of the distances from a point in one set to the closest point in the other, taken between the ground truth (GT) and the prediction (P), as illustrated in Fig. 10.

$$\begin{aligned} Dice\,score=\frac{2\,|GT\cap P |}{|GT |+|P |} \end{aligned}$$
(5)
$$\begin{aligned} Jaccard\,coeff.=\frac{|GT\cap P |}{|GT |+|P |-|GT\cap P |} \end{aligned}$$
(6)
$$\begin{aligned} Hausdorff\,distance= \max \Bigg \lbrace \max \limits _{gt\in GT}\min \limits _{p\in P} d(gt,p),\ \max \limits _{p\in P}\min \limits _{gt\in GT} d(p,gt)\Bigg \rbrace \end{aligned}$$
(7)
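Similarly, Eqs. (5)-(7) can be computed for boolean masks with NumPy and SciPy; directed_hausdorff works on point coordinates, so the masks are first converted with argwhere:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def region_metrics(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-8):
    """Dice score, Jaccard coefficient and symmetric Hausdorff distance."""
    intersection = np.sum(gt & pred)
    dice = 2.0 * intersection / (np.sum(gt) + np.sum(pred) + eps)
    jaccard = intersection / (np.sum(gt | pred) + eps)
    gt_points, pred_points = np.argwhere(gt), np.argwhere(pred)
    hausdorff = max(directed_hausdorff(gt_points, pred_points)[0],
                    directed_hausdorff(pred_points, gt_points)[0])
    return dice, jaccard, hausdorff
```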
Fig. 10 Hausdorff distance

Results

In this study, a U-Net model was trained on a large dataset of breast mammograms for mass detection. We demonstrated that the proposed model, trained on a completely different dataset (CBIS-DDSM) than the tested dataset (INbreast), could accurately locate masses in full mammograms. The proposed mass detection framework outperformed previous research in the literature in terms of pixel-scale sensitivity and region-scale Dice score.

Dice score evaluation

A more in-depth examination of the model's Dice score is depicted in the empirical cumulative difference plot in Fig. 11. By mapping the empirical cumulative difference of the Dice score, we gain a clearer insight into the alignment of predictions and actual values across the entirety of the CBIS-DDSM dataset. Given that the plot starts at a Dice score of 0.8, it is evident that the U-Net effectively identifies lesions, as no predictions fall below this score. The steep ascent of the U-Net plot further indicates the precision of most of its predictions.

Fig. 11 Dice score empirical cumulative difference plot for the test images from the CBIS-DDSM dataset

Case studies

This section features specific cases that demonstrate the application of the methodology to CBIS-DDSM and INbreast images. The model's prediction outcomes for sample images from both datasets are shown in Figs. 12 and 13. The boundaries of the predicted regions are outlined with red lines, and the boundaries of the ground truth are delineated with blue lines.

Based on both quantitative metrics and visual evaluations, our U-Net model demonstrates exceptional prowess in segmenting tumors within mammographic images. Notably, it is proficient not only in detecting minuscule tumors, with a Dice score of 84.71% and a perfect sensitivity rate of 100% as in Fig. 12a, but also in diagnosing multiple tumors within a single image, as in Fig. 12b. The model's capability shines especially in challenging scenarios: accurately detecting tumors regardless of their proximity to high-density regions or their diminutive size, as depicted in Fig. 12c and d. This is particularly noteworthy considering that radiologists might find these areas demanding and time-intensive to diagnose accurately.

Despite being exclusively trained on the CBIS dataset, with no exposure to INbreast during training, our model showcased a commendable generalization capability by accurately predicting tumors in the INbreast dataset, achieving a Dice score of 85.61%. This accomplishment is particularly significant given the inherent differences in the nature of the two datasets, as illustrated in Fig. 13. The model’s proficiency extends to both large and small tumors. Notably, even for diminutive tumors, as depicted in Fig. 13a, it achieved an impressive Dice score of 89.05%.

Fig. 12 Prediction results for sample test images from the CBIS-DDSM dataset; IDs: a Mass-Test-P00124-RIGHT-CC, b Mass-Test-P00116-RIGHT-CC, c Mass-Test-P00016-Left-MLO, and d Mass-Test-P01378-RIGHT-MLO

Fig. 13 Prediction results for sample test images from the INbreast dataset; IDs: a case-53581406, b case-20587810, c case-50996352, and d case-22670278

Comparison with state-of-the-art methods

In our evaluation, we selected state-of-the-art methods for a comparative analysis, adhering rigorously to the evaluation protocols of the CBIS-DDSM and INbreast datasets. Several of these methods are detailed in the literature review section.

Table 2 Comparison of the proposed and state-of-the-art methods on CBIS-DDSM
Table 3 Comparison of the proposed and state-of-the-art methods on INbreast

The results in Tables 2 and 3 demonstrate that our proposed method significantly surpasses the performance of leading techniques, with marked enhancements across crucial metrics compared to other strategies. For the CBIS-DDSM dataset, our approach achieved the top sensitivity of 90.58%, approximately 5% higher than previously documented methods, along with a noticeable decrease of 0.78 in the Hausdorff distance. For both the CBIS-DDSM and INbreast datasets, our method stands out, registering the highest Dice scores of 87.98% and 85.61%, respectively.

Conclusion

This research introduces a holistic approach for segmenting masses in digital mammography. The process encompasses data augmentation and image pre-processing, where the contrast of the mammograms is improved using CLAHE. Lesions are then identified using a deep supervised U-Net, eliminating the need for the manual feature extraction or parameter selection required by classical machine learning algorithms.

A number of metrics at the pixel and region levels are employed to assess the model's performance. When compared to other state-of-the-art approaches, the proposed architecture produces better segmentation outcomes on masses of widely varying sizes and shapes, with an overall average Dice score of 87.98% for CBIS-DDSM and 85.61% for INbreast. We used 80% of CBIS-DDSM for training and tested on the remaining 20% of CBIS-DDSM and all mass images of the INbreast dataset.

Building upon the findings presented, our research has yielded a model that not only excels in accuracy but also demonstrates the practical viability of deep learning in medical diagnostics. The utilization of advanced image preprocessing techniques, such as CLAHE, has resulted in improved visibility of mammographic features, thus enabling the deep supervised U-Net to perform precise lesion segmentation. This negates the need for manual feature extraction, streamlining the diagnostic process.

The employment of various evaluation metrics at both pixel and region levels has provided a comprehensive understanding of the model’s performance, affirming its superiority over other state-of-the-art methods. The ability of our model to adeptly handle masses of varying sizes and shapes underscores its potential as a valuable tool in clinical settings. Moreover, the training and testing on distinct subsets of the CBIS-DDSM and INbreast datasets have demonstrated the robustness and adaptability of our approach.

An additional advantage of our proposed deep U-Net framework in mammography image analysis is its significant contribution to the early detection of breast cancer. Key clinical benefits include:

  1. Enhanced Diagnostic Accuracy: The framework's capability to discern subtle nuances in mammograms elevates diagnostic precision. This is crucial for identifying minor tissue changes that may be indicative of early-stage breast cancer.

  2. Expedited Analysis Process: Our approach streamlines the image analysis procedure, thereby reducing response times. This acceleration is vital for the early detection of breast cancer, as it allows for quicker intervention.

  3. Reduction in Human Error: The deep U-Net model's automated and precise analysis minimizes the likelihood of human error, which is a critical factor in medical diagnostics. By relying on advanced algorithmic processing, the system ensures a more consistent and reliable interpretation of mammographic images.

Future work

Future efforts will concentrate on further validating and integrating our breast cancer segmentation models into clinical practice. This includes testing with varied datasets for broader applicability and implementing advanced deep learning methods for enhanced performance. Key to our approach is conducting clinical trials with healthcare professionals to ensure the model's efficacy and compliance with healthcare standards, ultimately aiming to embed AI-driven diagnostics into routine clinical workflows for better breast cancer management.