1 Introduction

Female breast cancer has emerged as the leading contributor to cancer incidence worldwide, overtaking lung cancer, and ranks as the fifth leading cause of cancer-related deaths globally [1]. Mammography, the current standard diagnostic technique, relies on X-rays and is therefore unsuitable for women under 40, expectant mothers, and others who should not be exposed to ionizing radiation. When mammography is inappropriate or a more thorough assessment is required, other modalities such as magnetic resonance imaging (MRI) and ultrasound are used. MRI, however, is not recommended for pregnant women or individuals with implants. Medical ultrasonography, in contrast, is less precise and requires specialist professionals for its analysis.

Ultrasound imaging has become a cornerstone in diagnosing breast masses owing to its safety, affordability, and efficiency. However, its efficacy is often contingent upon the expertise of the radiologist conducting the examination. To mitigate the subjectivity inherent in such diagnosis, computer-aided diagnosis (CAD) systems have emerged as invaluable tools. By harnessing intelligent computing technology, CAD systems automatically provide diagnostic insights and enhance diagnostic accuracy.

The main modules of a typical CAD system for the examination of breast ultrasound (BUS) images are image preprocessing, breast lesion segmentation, image feature extraction, and classification. Among these steps, accurate segmentation of lesions from BUS images holds particular significance: it aids in the diagnosis and treatment of breast cancer, thereby contributing to a reduction in mortality rates. However, segmenting BUS lesions remains challenging due to inherent issues such as speckle noise, strong shadows, and the irregularity of breast lesions, including variations in tumor shape and size among patients [2].

By leveraging intelligent computing technology to enhance diagnostic accuracy and efficiency, CAD systems empower healthcare professionals to provide timely and effective interventions, thereby promoting better health and well-being for individuals affected by breast cancer. They also represent innovations that drive progress in healthcare infrastructure and medical research, ultimately supporting sustainable development efforts worldwide. Advancements in CAD systems for breast cancer detection therefore align with the Sustainable Development Goal (SDG) of Good Health and Well-being.

Various techniques based on deep convolutional neural networks (CNNs) have been proposed for the segmentation of breast cancer ultrasound images [3,4,5,6]. However, while CNNs excel at learning abstract data representations that are robust to local image transformations, this abstraction of spatial information can be suboptimal for semantic segmentation. Addressing this, DeepLab [7] introduced Atrous Spatial Pyramid Pooling (ASPP). Subsequently, DeepLabV3 [8] enhanced this approach by integrating multiple parallel atrous convolution branches to capture contextual information across scales. Integrating insights from both U-Net and DeepLabV3, DeepLabV3+ [9] emerged, incorporating a decoder module to restore object boundaries.

Attention-based networks, now prevalent across various computer vision tasks [10], are increasingly being harnessed. Attention improves network utilization by prioritizing salient and informative features without extra supervision, and its efficacy in improving the outcomes of semantic segmentation networks has been well substantiated.

In this study, a modified DeepLabV3+ model incorporating the Convolutional Block Attention Module (CBAM) is proposed. In the proposed model, CBAM has been seamlessly integrated into both the encoder and decoder components of the DeepLabV3+ architecture. This refinement improves the accuracy and robustness of the segmentation procedure, enhancing the overall performance of the CAD system for breast ultrasound lesion segmentation. The main contributions of the proposed work are as follows:

  1. The BUS dataset was first applied to the state-of-the-art DeepLabV3+ architecture for the segmentation of lesions from BUS images.

  2. A modified DeepLabV3+ has been proposed by integrating a CBAM module in both the encoder and the decoder to focus on the more informative features.

  3. A comparative analysis of the original DeepLabV3+ model and the proposed modified DeepLabV3+ model has been carried out on the test dataset using various performance metrics such as specificity, IoU, Dice coefficient, recall and precision.

The remainder of the paper is structured as follows: related work is presented in Sect. 2. Section 3 describes the materials and methods employed in the work. Section 4 elaborates the results and discussion, followed by Sect. 5, which presents the conclusion of the proposed work along with future directions.

2 Related work

Medical diagnosis utilizing CAD has been explored through several methodologies, including conventional image processing techniques, machine learning algorithms, and more advanced approaches involving deep learning and artificial intelligence (AI) [11,12,13]. Currently, various deep learning-based techniques are used to segment tumors from BUS images [14, 15]. Almajalid et al. [16] proposed a technique that improves model accuracy through contrast enhancement and denoising operations. Moon et al. [17] developed a tumor diagnosis system using an image fusion technique that combines different image content representations with an ensemble of CNN architectures on BUS images.

Lyu et al. [18] designed a deep learning model with PAN as the base structure to address the low definition of breast ultrasound images and verified its effectiveness on the BUSI and OASBUD datasets. A multi-scale feature extraction module was proposed that acquires deep local image information while preserving shallow global image information, and an attention module called SCA was added to the PAN-based decoder to draw the model's attention to edges and details and strengthen its segmentation ability. Li et al. [19] proposed the CAM-DLS method, which segments tumors using image-level labels and applies anatomical constraints derived from domain knowledge of breast ultrasound images to reduce the search space for breast tumor segmentation. Vakanski et al. [20] proposed a U-Net based model with attention blocks for the segmentation of tumors in BUS images. Yan et al. [21] proposed an Attention Enhanced U-Net with hybrid dilated convolution (AE U-Net with HDC) for the segmentation of tumors in BUS images; spatial information loss is reduced by applying three groups of HDC with different expansion rates. Other authors have utilized channel attention [22] and spatial attention mechanisms [23, 24] to produce better segmentation results on BUS images. Lei et al. [25] also proposed a segmentation model that integrates various attention mechanism modules. Chen et al. [26] constructed a U-Net model using a hybrid adaptive attention module in place of convolutional operations to improve segmentation accuracy and generalization ability. Erragzi et al. [27] developed a new model, called Ultrasound Network (US-Net), which uses the U-Net architecture with attention gates to segment breast tumors from BUS images. You et al. [28] proposed an effective tumor segmentation method using EfficientUNet, which adopts a step-by-step enhancement strategy combining ResNet18, a channel attention mechanism and deep supervision. Umer et al. [29] proposed a multiscale cascaded convolution with a residual attention-based double decoder network to extract diverse semantic spatial features for breast cancer segmentation.

3 Materials and methods

3.1 Dataset description

This study utilized a publicly available breast ultrasound (BUS) image dataset [30], collected from Baheya Hospital. The dataset is freely accessible on the hospital's website and received approval from its ethics committee. The data was gathered in 2018 and consists of BUS images from 600 female patients aged between 25 and 75. The scanning procedure employed two high-quality ultrasound imaging systems, the LOGIQ E9 and the LOGIQ E9 Agile, using the ML6-15-D Matrix linear probe with transducer frequencies ranging from 1 to 5 MHz. The dataset contains 780 grayscale images with an average size of 500 × 500 pixels. It is a heterogeneous collection with three categories: normal, benign and malignant. Specifically, there are 133 normal images, 437 benign images, and 210 malignant images. Because this research aims to segment breast cancer in ultrasound images, two of the three classes were used, namely the benign and malignant tumor classes. 80% of the samples were assigned to the training set and 20% to the test set, and 10% of the training samples were further set aside as a validation set. The distribution of training, validation and test samples is given in Table 1. Sample images and their corresponding masks for both classes are shown in Fig. 1.
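As a minimal illustration of the data partitioning described above, the following sketch shows how an 80/20 train/test split with a further 10% validation hold-out could be produced with scikit-learn; the path lists, function name and random seed are hypothetical and not taken from the original implementation.

```python
# Hypothetical split of image/mask path lists into train, validation and test sets.
from sklearn.model_selection import train_test_split

def split_dataset(image_paths, mask_paths, seed=42):
    # 20% of all samples go to the test set
    x_trainval, x_test, y_trainval, y_test = train_test_split(
        image_paths, mask_paths, test_size=0.20, random_state=seed)
    # 10% of the remaining training samples form the validation set
    x_train, x_val, y_train, y_val = train_test_split(
        x_trainval, y_trainval, test_size=0.10, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```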

Table 1 Number of samples in train, validation and test set
Fig. 1 Sample images from the dataset. The first and third rows show breast ultrasound images of benign and malignant tumors, respectively; the second and fourth rows show their corresponding ground-truth masks

3.2 Data pre-processing

Before training the model, data pre-processing is essential to ensure that the input images are in an optimal format for effective learning. Pre-processing was applied to both the ultrasound images and the ground-truth masks. First, image normalization was performed to standardize the pixel values between 0 and 1, which simplifies computation and accelerates convergence; this is done by dividing each pixel value by 255. After normalization, the images were resized so that all inputs to the model are of equal size; each image was resized to 256 × 256 pixels. Data augmentation was then applied to the training set to increase its size and improve the generalizability of the model. The following transformations were applied to the training images: horizontal flip, vertical flip, and random rotation between 0 and 45 degrees. Sample transformed images are shown in Fig. 2.
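The preprocessing and augmentation steps above can be summarized in the following sketch. The paper does not state which image-processing library was used; OpenCV and NumPy are assumed here, and the function names are illustrative only.

```python
# Sketch of normalization to [0, 1], resizing to 256x256, and the stated augmentations
# (horizontal/vertical flips, random rotation between 0 and 45 degrees).
import cv2
import numpy as np

def preprocess(image, mask, size=(256, 256)):
    image = cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)
    # assumes 8-bit inputs; masks are binarized to {0, 1}
    return image.astype(np.float32) / 255.0, (mask > 127).astype(np.float32)

def augment(image, mask, rng=np.random):
    if rng.rand() < 0.5:                      # horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.rand() < 0.5:                      # vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    angle = rng.uniform(0, 45)                # random rotation between 0 and 45 degrees
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR)
    mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return image, mask
```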

Fig. 2 Sample images after applying data augmentation. a Benign tumors b Malignant tumors

3.3 DeepLabV3 + model

In this study, a modified version of the DeepLabV3+ model tailored for automatically segmenting tumors in breast ultrasound images is proposed. DeepLabV3+ utilizes Atrous Spatial Pyramid Pooling (ASPP) for effective segmentation [31, 32]. Widely recognized as one of the most efficient semantic segmentation algorithms [33, 34], the DeepLabV3+ network employs an encoder-decoder structure.

The DeepLabV3+ model employs a Residual Neural Network (ResNet) to extract features from the image. Notably, it incorporates an augmented version of ASPP to prevent information loss. The ASPP module in DeepLabV3+ acts as a multi-scale analyzer: it examines the image with multiple filters at different scales simultaneously, capturing information about objects and their surrounding context at various sizes and preventing the loss of important details that can occur with traditional resampling methods. This multi-scale approach helps DeepLabV3+ achieve a better understanding of the image and improves its performance in tasks such as semantic segmentation. The architecture of the DeepLabV3+ model is shown in Fig. 3. As shown in Fig. 3, the encoder section of DeepLabV3+ consists of a backbone network and an ASPP module; to capture semantic information at multiple scales, the ASPP module is attached at the end of the backbone network.

Fig. 3 Original DeepLabV3+ architecture

The ASPP module employs multiple parallel atrous convolutions with different rates, such as 6, 12, and 18, to extract features at various scales. In addition to the atrous convolutions, the ASPP module includes a global average pooling layer that generates image-level features. The outputs of the different atrous convolutions and the image-level features are concatenated and passed through a 1 × 1 convolution layer that reduces dimensionality and combines the multi-scale features into a unified representation. The decoder section of the model aims to recover spatial details and object boundaries. In the decoder phase, as shown in Fig. 3, the features extracted by the encoder are first upsampled using fourfold bilinear interpolation; the upsampled map then allows low-level feature information lost during downsampling to be reincorporated at each channel level. The upsampled features and the low-level features are concatenated to refine the segmentation mask, and a series of 3 × 3 convolutional layers is applied to further refine the features and produce the final segmentation predictions.
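For concreteness, the following sketch shows one common way to realize the ASPP block just described (parallel atrous convolutions with rates 6, 12 and 18, an image-level pooling branch, and a 1 × 1 fusion convolution). The paper does not specify its deep learning framework; PyTorch and the channel sizes shown are assumptions.

```python
# Sketch of an ASPP block: parallel atrous convolutions plus image-level pooling,
# fused by a 1x1 convolution (channel sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU())
        self.atrous = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU())
            for r in rates])
        self.image_pool = nn.Sequential(          # image-level (global average pooling) branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU())
        self.project = nn.Sequential(              # 1x1 fusion of all branches
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [branch(x) for branch in self.atrous]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```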

3.4 Proposed modified DeepLabV3 + architecture

The proposed modified architecture of DeepLabV3+ is shown in Fig. 4. In the proposed architecture, a Convolutional Block Attention Module (CBAM) [35] has been incorporated to increase the model's ability to focus on relevant features and spatial information. In the encoder, as shown in Fig. 4, the CBAM module is placed after the output of the backbone network, while in the decoder it is placed after the 1 × 1 convolution applied to the features obtained from the backbone network. The CBAM module sequentially combines two submodules: a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). The pseudocode for the proposed model is given in Algorithm 1.

Fig. 4 Proposed modified DeepLabV3+ architecture

Algorithm 1 Proposed modified DeepLabV3+ model with CBAM integration

The architecture of the CBAM module is shown in Fig. 5. The CAM consists mainly of global average pooling, global max pooling and a multi-layer perceptron (MLP) module.

Fig. 5 Architecture of the Convolutional Block Attention Module (CBAM)

The outputs obtained from the shared MLP are fused by element-wise summation, and the channel attention features are then obtained through a sigmoid activation function. The channel attention map \(M_{c}\) is defined by Eq. (1).

$$M_{c}(N)=\sigma \left(MLP\left(AvgPool\left(N\right)\right)+MLP\left(MaxPool\left(N\right)\right)\right)$$
(1)

where \(\sigma\) is the sigmoid activation function, \(M_{c}\) represents the 1D channel attention map and \(N\in {\mathbb{R}}^{C\times H\times W}\) denotes the input feature map. The channel-refined feature \(N'\) is then calculated by multiplying the feature weight of each channel with \(N\), as given by Eq. (2).

$$N' = M_{c}\left(N\right)\otimes N$$
(2)

The channel-refined features are then passed to the spatial attention module, where the spatial attention map \(M_{s}\in {\mathbb{R}}^{1\times H\times W}\) is obtained as given by Eq. (3):

$$M_{s}(N')=\sigma \left( f^{Z'}\left(\left[AvgPool\left(N'\right);MaxPool\left(N'\right)\right]\right)\right)$$
(3)

where \(Z'\) denotes the kernel size of the convolutional layer. The final refined feature \(N''\) is obtained after the SAM by element-wise multiplication between the spatial feature weights and \(N'\), as expressed by Eq. (4).

$$N'' = M_{s}(N')\otimes N'$$
(4)
Algorithm 2 CBAM module

The CAM module weights each channel to enhance the learning of specific channels, while the SAM module applies an attention mechanism to the feature map to improve the spatial information. The pseudo-code of the CBAM module is given in Algorithm 2, and an illustrative implementation is sketched below.
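The following sketch implements the CBAM described by Eqs. (1)–(4). PyTorch is assumed, and the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper rather than details stated here.

```python
# Sketch of CBAM: channel attention (shared MLP over average- and max-pooled descriptors)
# followed by spatial attention (convolution over pooled channel maps).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP, realized as 1x1 convolutions
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # global average pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # global max pooling
        return torch.sigmoid(avg + mx)                             # Eq. (1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. (3)

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = self.ca(x) * x                             # Eq. (2): channel-refined feature
        return self.sa(x) * x                          # Eq. (4): spatially refined feature
```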

After processing through the CBAM, the refined features are fed to the ASPP module, which further enhances the multi-scale feature extraction capabilities of the modified DeepLabV3+ model. In the decoder section of the modified model, the features extracted by the encoder are first upsampled using fourfold bilinear interpolation and then concatenated with the low-level features from the backbone network, which have been refined by the second CBAM module, thereby incorporating fine-grained spatial information. Following the concatenation, a series of 3 × 3 convolutional layers is applied to further refine the features, enhancing the segmentation mask. This refined output is then subjected to a final bilinear upsampling operation to produce the segmentation predictions at the original image resolution.
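The placement of the two CBAM blocks in the modified encoder and decoder can be sketched as follows, reusing the ASPP and CBAM classes from the previous sketches. The backbone interface, channel sizes and layer names are hypothetical; this is a schematic of Fig. 4, not the authors' exact implementation.

```python
# Schematic forward pass of the modified DeepLabV3+: one CBAM after the backbone output
# (encoder) and one on the 1x1-projected low-level features (decoder).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModifiedDeepLabV3Plus(nn.Module):
    def __init__(self, backbone, low_level_ch, high_level_ch, num_classes=1):
        super().__init__()
        self.backbone = backbone                      # e.g. a ResNet returning low/high features
        self.cbam_enc = CBAM(high_level_ch)           # CBAM after the backbone output
        self.aspp = ASPP(high_level_ch, 256)
        self.low_proj = nn.Conv2d(low_level_ch, 48, 1, bias=False)
        self.cbam_dec = CBAM(48)                      # CBAM on the projected low-level features
        self.refine = nn.Sequential(                  # 3x3 refinement and prediction head
            nn.Conv2d(256 + 48, 256, 3, padding=1, bias=False), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1, bias=False), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, x):
        low, high = self.backbone(x)                  # low-level and high-level feature maps
        enc = self.aspp(self.cbam_enc(high))          # attention-refined encoder features
        enc = F.interpolate(enc, size=low.shape[2:], mode='bilinear', align_corners=False)
        dec = self.cbam_dec(self.low_proj(low))       # attention-refined low-level features
        out = self.refine(torch.cat([enc, dec], dim=1))
        return F.interpolate(out, size=x.shape[2:], mode='bilinear', align_corners=False)
```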

The proposed model was implemented in Python, with various Python-based libraries such as scikit-learn, NumPy, Matplotlib and pandas. Grid search was used to optimize the model, and the final hyperparameters selected for training are given in Table 2. The model was trained with the Adam optimizer at a learning rate of 0.001; Adam was selected because it adapts the learning rate for each parameter individually and incorporates momentum, which smooths the updates and accelerates convergence. Binary cross-entropy was used as the loss function. Given the available GPU memory, a batch size of 32 was selected.
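A minimal training-loop sketch reflecting these stated hyperparameters (Adam, learning rate 0.001, binary cross-entropy loss, batch size 32) is given below; the data loader and device handling are placeholders, and PyTorch is assumed.

```python
# Training skeleton with the stated hyperparameters; train_loader is assumed to yield
# (image, mask) batches of size 32.
import torch

def train(model, train_loader, num_epochs=50, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.BCEWithLogitsLoss()   # binary cross-entropy applied to raw logits
    for epoch in range(num_epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
    return model
```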

Table 2 Hyperparameter configurations

3.5 Evaluation metrics

The proposed framework was evaluated using several performance metrics, namely precision, recall, specificity, Dice coefficient, and IoU. Precision quantifies the accuracy of the positive predictions made by the segmentation algorithm; high precision means the model is less likely to misclassify healthy tissue as diseased, reducing false positives and unnecessary further testing or treatment. Recall measures the algorithm's ability to correctly identify all positive instances in the image; high recall ensures that most actual tumors are identified, minimizing missed diagnoses (false negatives), which can have serious consequences for patient health. Specificity measures the proportion of true negative predictions among all actual negative instances; it is important in medical imaging because it indicates how well the model identifies healthy tissue as normal. The Dice coefficient is the harmonic mean of recall and precision; it is calculated as the ratio of twice the intersection of the segmented image and the ground truth to the sum of pixels in both images, as given by Eq. (5).

$$\text{Dice Coefficient}=\frac{2\times \text{Intersection of segmented image and ground truth mask}}{\text{Sum of pixels in segmented image and ground truth mask}}$$
(5)

Intersection over Union (IoU) evaluates the similarity between the ground truth and the segmented image as the ratio of the intersection of the two sets to their union, as given by Eq. (6).

$$\text{IoU}=\frac{\text{Intersection of segmented image and ground truth mask}}{\text{Union of segmented image and ground truth mask}}$$
(6)
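These metrics can be computed per image from binary prediction and ground-truth masks as in the following sketch; the function name and NumPy implementation are illustrative.

```python
# Pixel-wise precision, recall, specificity, Dice (Eq. 5) and IoU (Eq. 6) from binary masks.
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # correctly segmented lesion pixels
    fp = np.logical_and(pred, ~gt).sum()      # background predicted as lesion
    fn = np.logical_and(~pred, gt).sum()      # lesion pixels missed
    tn = np.logical_and(~pred, ~gt).sum()     # correctly identified background
    return {
        "precision":   tp / (tp + fp + eps),
        "recall":      tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "dice":        2 * tp / (2 * tp + fp + fn + eps),   # Eq. (5)
        "iou":         tp / (tp + fp + fn + eps),           # Eq. (6)
    }
```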

4 Results and discussion

In this section, we first evaluate the original DeepLabV3+ model as well as the proposed modified DeepLabV3+ model using the selected performance metrics. A comparative analysis of the two models is then presented with respect to precision, recall, specificity, Dice coefficient and IoU. Finally, the performance of the proposed model is compared with state-of-the-art deep learning segmentation methods.

4.1 Performance of the proposed modified DeepLabV3+ model

Figure 6a–e illustrates the variation in precision, recall, specificity, Dice coefficient, and IoU over the 50 training epochs on benign tumor samples. Precision reached 0.943 and 0.9517 on the training and validation sets, respectively. As shown in Fig. 6b, recall improved to 0.956 and 0.964 on the training and validation sets, respectively. Specificity increased to 0.997 and 0.998 on the training and validation sets, respectively, as illustrated in Fig. 6c. The Dice coefficient improved to 0.963 and 0.968 on the training and validation sets, respectively. As illustrated in Fig. 6e, IoU reached 0.86 on the training set and 0.87 on the validation set. The corresponding curves for malignant tumor samples are shown in Fig. 7.

Fig. 6 Performance of the proposed modified DeepLabV3+ model during the training phase on benign tumor samples

Fig. 7 Performance of the proposed modified DeepLabV3+ model during the training phase on malignant tumor samples

4.2 Visual analysis

Figure 8 shows the masks predicted for benign tumor samples. In Fig. 8, the first column shows the original image containing a benign tumor, the second column shows the ground-truth mask, the third column shows the mask predicted by the proposed modified DeepLabV3+ model, and the last column shows the overlay obtained by combining the original image and the predicted mask.

Fig. 8 Performance of the proposed model on the test dataset for benign tumor samples

Similarly, Fig. 9 shows the predicted masks for malignant tumor samples. Visual analysis of the predicted masks for both benign and malignant tumor samples indicates that the proposed model generates masks that closely resemble the original ones (see Table 3).

Fig. 9 Performance of the proposed model on the test dataset for malignant tumor samples

Table 3 Performance comparison of the proposed modified DeepLabV3+ model with the original DeepLabV3+ model

4.3 Comparative analysis with state-of-the-art methods

To verify the efficacy of the proposed model, a comparative analysis with state-of-the-art methodologies in the field has been conducted, as shown in Table 4. The comparison reveals that the proposed methodology outperforms existing approaches in terms of precision, recall, specificity, Dice coefficient and IoU.

Table 4 Comparison of proposed model with state-of-the-art methods

5 Conclusion and future scope

This paper presents a comprehensive study on the segmentation of breast tumors in ultrasound images, which is fundamental for CAD systems in breast cancer detection. In this study, a modified DeepLabV3 + architecture integrated with a CBAM has been introduced to enhance feature extraction and segmentation accuracy. Evaluation of the proposed model demonstrates its outstanding performance across various performance metrics, including precision, recall, specificity, Dice coefficient, and IoU. Compared to the original DeepLabV3 + model, the modified architecture with CBAM integration shows substantial improvements, achieving higher precision, recall, and overall segmentation accuracy for both benign and malignant tumor samples.

As future research, various optimization algorithms can be embedded into the DeepLabV3 + segmentation pipeline to enhance segmentation accuracy. Furthermore, the contributions of different backbone architectures in the encoder part of the segmentation pipeline can be investigated. Also, combining information from multiple imaging modalities, such as ultrasound, MRI, and mammography, could enhance segmentation accuracy.