1 Introduction

Breast cancer, the most common malignancy among women worldwide, has a high incidence rate among women aged 45–59 years. According to the World Health Organization's 2020 statistics, 136.8 out of every 100,000 women in this age group were newly diagnosed with breast cancer, and 37.3 died from it; this incidence is roughly three times that of colorectal cancer, the second most common malignancy in this group [1]. However, according to the American Cancer Society, patients with stage 0 and stage 1 breast cancer have a five-year survival rate of over 99% [2]. Therefore, promoting regular breast screening, especially among those at high risk of breast cancer, supports the WHO's breast cancer control strategy of early detection and early treatment [3]. The American College of Radiology (ACR) published the latest Breast Imaging Reporting and Data System (BI-RADS), 5th Edition [4], in 2013, which provides standard lesion descriptions, lesion risk evaluations, and medical action advice for mammography, sonography, and magnetic resonance imaging of the breast. Based on the pretest probability of pathological examination, the BI-RADS assessment classifies lesions into seven categories that represent the integrative result of breast screening. False positives are common when auxiliary screening tools are used in breast cancer screening, and the biopsy test remains the necessary workup for a definitive breast cancer diagnosis.

It is noteworthy that Category 2 (with a 0% probability of malignancy) represents benign findings such as cysts, for which routine follow-up tracing is recommended. Category 3 (with less than a 2% probability of malignancy) represents probably benign findings such as fibroadenoma, for which short-interval follow-up tracing is recommended. Category 4 and Category 5 (with 2–95% and > 95% probability of malignancy, respectively) represent suspicious and highly suggestive findings, for which a biopsy test is suggested for further confirmation. The use of the BI-RADS standard lesion description can significantly reduce the variability of diagnosis reports, decrease intra-observer variation, and improve inter-observer agreement. In addition, deep learning-based computer-aided diagnosis (CADx) systems are capable of not only reducing the time required for the task but also decreasing inter-observer variability through the use of an objective second opinion [5,6,7,8]. Unlike machine learning-based CADx, which has three stages (lesion detection, segmentation, and classification) [5,6,7,8], deep learning-based CADx uses an end-to-end model to implement the lesion diagnosis task. This approach not only avoids generating nonlesion candidates through traditional image processing methods in the lesion detection stage but also decreases the dependence on handcrafted segmentation features [5].

The problems of gradient vanishing and degradation become more severe as convolutional neural networks (CNNs) grow deeper. To address these issues, the ResNet [9, 10] and DenseNet [11] structures, with skip connections and dense connections, respectively, facilitate both forward propagation and backpropagation while improving model efficiency. Transfer learning is another common way to deal with the scarcity of medical images [12]. The study conducted by Brem et al. [13] highlights the importance of accurately classifying lesion risk levels and appropriately confirming them through pathological examinations during breast screening. Although the use of breast ultrasonography as an auxiliary diagnostic tool can increase the sensitivity of mammography on dense breasts from 47.6% to 76.1%, the false positive rate can remain as high as 2.4%.
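
To make the two connection styles concrete, the following minimal TensorFlow/Keras sketch contrasts a skip connection with a dense connection. It is an illustration only, not the exact blocks defined in [9,10,11]; the identity shortcut assumes `filters` matches the input's channel count.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """ResNet-style block: the input is added back to the convolutional
    output, giving gradients a direct path during backpropagation and
    mitigating vanishing gradients. Assumes `filters` equals the
    input's channel count (identity shortcut)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])  # skip connection
    return layers.Activation("relu")(y)

def dense_block(x, growth_rate, num_layers):
    """DenseNet-style block: each layer receives the concatenation of
    all preceding feature maps, so features (and gradients) are reused
    throughout the block."""
    for _ in range(num_layers):
        y = layers.Conv2D(growth_rate, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, y])  # dense connection
    return x
```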

In this paper, our main contribution is to establish a deep learning-based CADx system for classifying breast ultrasound lesions into three categories, corresponding to routine follow-up tracing (Category 2), short-interval follow-up tracing (Category 3), and biopsy (Category 4/5).

2 Literature Review

2.1 The Diagnosis of Breast Lesions

Mammography, sonography, and magnetic resonance imaging (MRI) are three common imaging methods for breast screening. Mammography, which uses low radiation doses, is the most widely used method for first-line screening. Sonography is often used to detect obscured lesions, especially in dense breasts. MRI is more commonly reserved for screening high-risk patients due to its high cost and time-consuming nature [6, 14]. Computer-aided systems for breast lesion analysis can be classified into two types: computer-aided detection (CADe) systems, which detect the position of lesions, and CADx systems, which aid decision-making for lesion classification. The structure of a CADx system for breast lesion analysis typically includes three steps: lesion detection, lesion segmentation, and lesion classification [5, 6]. Deep learning-based CADx systems, unlike traditional machine learning-based methods, reduce the dependence on handcrafted features and decrease human intervention (and the bias that comes with it) through automatic feature extraction [5, 8, 15].

Since mammography enables the visualization of the entire breast architecture, studies of deep learning-based CADx systems typically involve two stages: lesion localization and malignancy classification. Ribli et al. [16] utilized a Fast R-CNN model with a Region Proposal Network for lesion detection and fully connected layers for classification, achieving an area under the curve (AUC) of 0.95 and 90% sensitivity for malignant lesions. Al-Antari et al. [17] used the YOLO model for lesion detection and ResNet-50 and InceptionResNet-v2 for classification, achieving accuracy rates of 92.55% and 95.32%, respectively. It is noteworthy that these studies were conducted on the INbreast dataset, which consists of digital mammography images; thus, the performances of these models on other datasets and modalities may vary. As for calcified lesions, Rehman et al. [18] not only automatically segmented the region of interest (ROI) of lesions but also utilized depthwise-separable convolutional layers to classify them into benign and malignant lesions, achieving 80% and 87% accuracy before and after model improvement, respectively.

In breast ultrasound, several previous studies have applied CADx systems to the diagnosis of benign and malignant lesions. Han et al. [19] utilized a single model structure, GoogLeNet, and achieved 91.2% mean accuracy. Moreover, they found that the best ROI selection method was based on the distance between the lesion's margin and the ROI's edge, with 180 pixels being the optimal threshold. Qi et al. [20] focused on the double labeling of malignant and solid lesions; they cross-trained the Inception-ResNet-v2 and Inception-v3 models, which resulted in 94.5% accuracy. In addition, Moon et al. [21] utilized fusion imaging and an ensemble model with average prediction to achieve the best accuracy rate (91.1%). These studies demonstrate the potential of breast ultrasound CADx systems in accurately diagnosing breast lesions. Moreover, there is a growing demand for breast MRI CADx systems that handle 3D imaging for more in-depth analyses of breast tissues in clinical settings. Balkenende et al. investigated various methods of breast MRI CADx systems and highlighted 3D image usage and extraction [14]. Zhou et al. [22] used a single 2D MRI image as a unit and extracted the ROI via morphological methods; they then used an ensemble model consisting of two types of DenseNet to distinguish benign breast lesions from malignant ones, achieving 83.7% accuracy.

2.2 Assessment of Ultrasound Images

A thorough review of several studies revealed that the application of machine learning and deep learning in sonography CADx systems is not limited to breast lesions but also extends to diagnosing lesions in other organs [15]. For BI-RADS risk level diagnosis in breast sonography, Huang et al. [23] utilized a two-stage model consisting of a fully convolutional network and a deep CNN to detect lesions and then classify their risk, achieving 99%, 94%, 73%, 92%, and 87% accuracy for BI-RADS Categories 3, 4A, 4B, 4C, and 5, respectively. Moreover, Ciritsis et al. [24] used the sliding window method and four different deep CNNs to hierarchically implement automatic breast lesion classification, obtaining an AUC of 0.84.

Chang et al. [25] used Speckle Reducing Anisotropic Diffusion and K-means clustering to segment lesions and extract features; three classifiers (an SVM, a Random Forest, and a CNN) then distinguished BI-RADS Categories 2–5, obtaining accuracy rates of 80%, 77.8%, and 85.4%, respectively. Furthermore, Xing et al. [26] proposed the BI-RADS Vector-Attention Network, which used a Residual Attention Block to extract features and incorporated an assembled vector of BI-RADS standard descriptors to distinguish between benign and malignant lesions. Similarly, Qian et al. [27] utilized a combined fusion image from three different ultrasound modalities to train a Squeeze-and-Excitation Network for lesion malignancy classification. They also used the transverse and longitudinal views to comprehensively evaluate lesion malignancy, effectively decreasing the false-positive rate of single-view evaluation, and ultimately achieved an AUC of 0.96 (95% CI: 0.91–0.98). These reviewed studies demonstrate that applying deep learning methods to implement ultrasound CADx systems is effective and that combining different models' opinions through ensemble methods can enhance classification efficiency [20, 22, 27].

3 Materials and Methods

3.1 Dataset

Our study was approved by National Yang Ming Chiao Tung University's Institutional Review Board under project number YM108127E. All the ultrasound images used in this paper were captured using an ultrasound device (Terason uSmart 3300, Probe Medical Corporation) with a depth of 4 cm. Each image in the dataset was carefully labeled by a professional radiologist and a physician trained in breast surgery. The dataset contains 569 cases, with a total of 8,026 ROI images: 3,662 images in Category 2 (lesion size: 0.77 ± 0.26 cm), 2,390 images in Category 3 (lesion size: 1.23 ± 0.45 cm), and 1,974 images in Category 4/5 (lesion size: 1.90 ± 0.65 cm).

To prepare the dataset for training and testing, it was split into a training set (80%) and a test set (20%). Before being fed into the model, the images were preprocessed with histogram equalization, and their pixel values were rescaled to 0–1 for normalization; the labels were represented with one-hot encoding. To increase the diversity of the dataset and improve the model's performance, we augmented each image ten times by applying horizontal flipping and translations of up to 12 pixels, resulting in an expanded dataset. Finally, all images were resized to a resolution of 128 × 128 pixels, ensuring dataset uniformity.
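
The following is a minimal Python sketch of this preprocessing and augmentation pipeline. It is illustrative rather than our exact implementation: the interpolation, border handling, and the uniform sampling of the 12-pixel translations are assumptions.

```python
import cv2
import numpy as np

NUM_CLASSES = 3  # Category 2, Category 3, and Category 4/5

def preprocess(image, label):
    """Histogram equalization, resizing to 128 x 128, 0-1 rescaling,
    and one-hot label encoding, as described above."""
    image = cv2.equalizeHist(image.astype(np.uint8))  # expects 8-bit grayscale
    image = cv2.resize(image, (128, 128))
    image = image.astype(np.float32) / 255.0          # rescale to [0, 1]
    onehot = np.eye(NUM_CLASSES, dtype=np.float32)[label]
    return image, onehot

def augment(image, rng):
    """One augmented copy: a random horizontal flip plus a translation
    of up to 12 pixels in each direction (applied ten times per image)."""
    if rng.random() < 0.5:
        image = np.fliplr(image)
    dx, dy = rng.integers(-12, 13, size=2)
    h, w = image.shape[:2]
    m = np.float32([[1, 0, dx], [0, 1, dy]])          # affine translation matrix
    return cv2.warpAffine(image, m, (w, h))
```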

3.2 Deep Learning Models

A server equipped with a graphics processing unit (NVIDIA GeForce GTX 1080 Ti, NVIDIA Corporation) was used for the experiments in our study. We utilized TensorFlow to build the deep learning framework and Tkinter to design our CADx interface. Our study encompassed three types of deep learning architecture: VGG [28], ResNet [10], and DenseNet [11]. These architectures were initialized with ImageNet pretrained models and fine-tuned with breast ultrasound ROI images. During the training process, we employed a mini-batch size of 32 images and used Stochastic Gradient Descent with a learning rate of \({10}^{-5}\) and a momentum of 0.9 as the optimizer. Additionally, to mitigate model overfitting, a weight decay of \({10}^{-5}\) and a dropout rate of 0.5 were incorporated. For the analysis of categorical results, we used several evaluation metrics, including precision, recall, F1-score, and AUC [29]. Furthermore, accuracy and balanced accuracy [30] were also used to evaluate the models' performance.
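
As a concrete reference, the Keras sketch below reproduces this fine-tuning setup for the ResNet-50 variant. The classification head, the replication of grayscale inputs to three channels, and the approximation of weight decay as an L2 kernel regularizer are our assumptions, not details stated above.

```python
import tensorflow as tf

# ImageNet-pretrained backbone; grayscale ultrasound ROIs are assumed
# to be replicated to three channels to match the pretrained stem.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(128, 128, 3))

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
x = tf.keras.layers.Dropout(0.5)(x)                        # mitigate overfitting
out = tf.keras.layers.Dense(
    3, activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(1e-5))(x)  # weight decay (as L2)

model = tf.keras.Model(base.input, out)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

# model.fit(train_images, train_labels, batch_size=32, epochs=...)
```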

3.3 Ensemble Methods

Herein, we combined the three deep learning architectures that showed the best performance to build our ensemble model, using two ensemble methods: average prediction and voting. For the average prediction method, we calculated the average of the three models' predictions for each test sample, and the category with the highest average was selected as the final prediction. For the voting method, we counted the three models' predicted categories, and the category with the most votes was chosen as the final prediction of the ensemble model.

In cases where the first and second models disagreed on the predicted category, the third model's vote determined the final prediction. However, if all three models gave completely distinct predictions, indicating significant disagreement, we chose the medium risk level (BI-RADS Category 3) as the final predicted category of the ensemble model. This decision aimed to mitigate extreme predictions and provide a more conservative assessment in cases of conflicting model outputs.
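
A minimal NumPy sketch of the two combination rules follows; the class index order ([Category 2, Category 3, Category 4/5]) is an assumption made for illustration.

```python
import numpy as np

CATEGORY_3 = 1  # assumed index order: [Category 2, Category 3, Category 4/5]

def average_prediction(probs):
    """probs: (3 models, n samples, 3 classes) softmax outputs.
    The class with the highest mean probability wins."""
    return np.mean(probs, axis=0).argmax(axis=1)

def voting(probs):
    """Majority vote over the three models' argmax predictions.
    With three models and three classes, the only tie is a three-way
    split, which falls back to the medium risk level (Category 3)."""
    votes = probs.argmax(axis=2)            # (3 models, n samples)
    final = np.empty(votes.shape[1], dtype=int)
    for i in range(votes.shape[1]):
        counts = np.bincount(votes[:, i], minlength=3)
        if counts.max() == 1:               # all three models disagree
            final[i] = CATEGORY_3
        else:
            final[i] = counts.argmax()      # majority (2 or 3 votes)
    return final
```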

3.4 Structure of Our Deep Learning-Based Computer-Aided Diagnosis (CADx) System

Based on breast ultrasound raw data and deep learning techniques, we propose a comprehensive CADx system for automated diagnosis and human interpretation. By utilizing the BI-RADS standard description, the system not only improves case follow-up but also supports the continued optimization of the deep learning models; it can be seamlessly integrated with a picture archiving and communication system (PACS) and applied in a routine imaging workflow. The integrated system provides two main functions: a breast ultrasound lesion labeling service and a patient report retrieval and deletion service. Beyond these basic functionalities, the system also supports the analysis and management of the diagnosis report database, including a statistical analysis service and a missing data check service. To address patients' privacy concerns while optimizing deep learning performance, our system provides a de-identification service that ensures the appropriate protection of patient data, adhering to privacy regulations. The system architecture is illustrated in Fig. 1.

Fig. 1
figure 1

The architecture of our deep learning-based breast ultrasound CADx system

4 Results and Discussion

4.1 The Evaluation of the Single Models

Table 1 presents the results of the different single models used in our study. Among these models, the ResNet-50 model achieved the best performance, with 80.7% accuracy and 79.2% balanced accuracy when trained with histogram equalization and data augmentation. It is noteworthy that correctly classifying BI-RADS Category 3, which represents the medium risk level of lesions, posed a challenge for every single model: the precision, recall, and F1-score of this category were relatively lower than those of the other categories, indicating the difficulty of accurately distinguishing lesions in the medium risk category. These results provide valuable insights into the strengths and limitations of the single models, highlighting the challenges associated with the correct classification of lesions in BI-RADS Category 3.

Table 1 Single model results

In our study, we compared the results of the VGG-16, ResNet-50, and DenseNet-121 models and analyzed their Grad-CAM heat maps. Despite the complexity of breast ultrasound images and the limited size of the dataset, we observed interesting variations in the heat maps generated by different models, even when they made the same predictions for a given example. The Grad-CAM [31] heat map results are presented in Fig. 2.

Fig. 2
figure 2

Heat maps of different single models. (A), (E), and (I) are the ROI examples of the BI-RADS Category 2, Category 3, and Category 4/5 lesions, respectively. (B), (F), and (J) are the visualization results of Grad-CAM using the VGG model. (C), (G), and (K) are the visualization results of Grad-CAM using the ResNet model. (D), (H), and (L) are the visualization results of Grad-CAM using the DenseNet model

We found that the ResNet-50 and DenseNet-121 models produced more precise and detailed heat maps than the VGG-16 model did, indicating that these models were able to capture and highlight important regions within the breast ultrasound images with higher accuracy and precision. For example, in Fig. 2(C), the ResNet-50 model accurately detected and emphasized the posterior enhancement feature in a Category 2 lesion example, demonstrating the model's ability to capture subtle and important lesion characteristics. Furthermore, the Grad-CAM heat maps provided insights into the challenges associated with the classification of BI-RADS Category 3 lesions: the variations in heat maps among different models indicate the difficulty of accurately distinguishing and localizing features of lesions in this category.
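
For reference, a minimal Grad-CAM [31] implementation for a Keras functional model is sketched below; the layer-name argument and the omission of upsampling the map back to the input resolution are simplifications.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer, class_index):
    """Minimal Grad-CAM: weight the last convolutional feature maps
    by the pooled gradients of the target class score."""
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(last_conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # pooled gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                               # keep positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()     # normalize to [0, 1]
```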

4.2 The Evaluation of the Ensemble Model

We applied the ensemble methods (average prediction and voting) using the best-performing VGG-16, ResNet-50, and DenseNet-121 models trained with histogram equalization and data augmentation. The results of the ensemble models are presented in Table 2. We observed that the ensemble methods not only improved the overall performance of the model but also increased the classification efficiency for the challenging BI-RADS Category 3. The average prediction and voting methods showed comparable performances; however, the ensemble model with average prediction consistently achieved better AUC values than the voting method did. This can be attributed to the fact that the voting method discards the models' predicted probabilities, yielding coarser scores and thus reduced discernment. To further illustrate the performance, the receiver operating characteristic (ROC) curves of the ensemble methods with average prediction and voting are presented in Figs. 3 and 4, respectively.

Table 2 Results of ensemble models (average prediction and voting)
Fig. 3
figure 3

Receiver operating characteristic (ROC) curve of ensemble models with average prediction

Fig. 4
figure 4

Receiver operating characteristic (ROC) curve of ensemble models with voting
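
The per-category ROC curves in Figs. 3 and 4 can be computed in a one-vs-rest fashion, as in the scikit-learn sketch below. How the voting ensemble is scored is an assumption here: using vote fractions gives each sample a score in {0, 1/3, 2/3, 1}, which illustrates why voting yields coarser ROC curves (and lower AUCs) than averaged softmax probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

CLASS_NAMES = ["Category 2", "Category 3", "Category 4/5"]

def per_class_roc(y_true_onehot, y_score):
    """One-vs-rest ROC curve and AUC per BI-RADS category.
    y_score: averaged softmax outputs (average prediction ensemble)
    or vote fractions (voting ensemble), shape (n samples, 3)."""
    curves = {}
    for c, name in enumerate(CLASS_NAMES):
        fpr, tpr, _ = roc_curve(y_true_onehot[:, c], y_score[:, c])
        curves[name] = (fpr, tpr, auc(fpr, tpr))
    return curves

def vote_fractions(votes):
    """Vote fractions as scores for the voting ensemble.
    votes: (3 models, n samples) class indices; e.g. votes of
    [0, 0, 2] for a sample give scores [2/3, 0, 1/3]."""
    n = votes.shape[1]
    scores = np.zeros((n, 3))
    for m in range(votes.shape[0]):
        scores[np.arange(n), votes[m]] += 1 / votes.shape[0]
    return scores
```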

4.3 Comparisons with a State-of-the-Art Study

We compared the results of our study to the method of Hejduk et al. [32]; the comparison is presented in Table 3. Our study achieved a 79.7% accuracy rate and 77.6% balanced accuracy, outperforming Hejduk et al.'s study in both the best single-model performance and the best ensemble-model performance. It is noteworthy that our study differed from theirs in its classification approach: while Hejduk et al. focused on the binary risk classes of BI-RADS Category 2/3 and BI-RADS Category 4/5, ours adhered to the BI-RADS medical recommendations, namely routine follow-up for BI-RADS Category 2, short-interval follow-up for BI-RADS Category 3, and a biopsy for BI-RADS Category 4/5. This alignment with practical application scenarios adds relevance and usability to our study's findings.

Table 3 Comparison with a state-of-the-art study

4.4 Implementation of Our Deep Learning-Based Computer-Aided Diagnosis System (CADx)

To promote the development of computer-aided diagnosis systems, we have also opened our computer-aided diagnosis system for research use. Our CADx system implements communication interfaces based on the DICOM DIMSE-C services to ensure seamless integration with PACS. The de-identification module follows the guidelines of the DICOM Basic Attribute Confidentiality Profiles [33] to de-identify breast ultrasound DICOM objects and derivative patient reports, maintaining attribute confidentiality and privacy throughout the diagnostic workflow. As shown in Fig. 5, the main page of our CADx system comprises five areas: (A) the Setting Box, (B) the DICOM Pixel Data Display Area, (C) the Control Box, (D) the Patient Report Display Area, and (E) Other Areas. The full source code of the proposed CADx system and the trained AI model are publicly available on GitHub [34].

Fig. 5
figure 5

The main page of our deep learning-based breast ultrasound CADx system
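
As an illustration of the de-identification step, the pydicom sketch below blanks a few identifying attributes and removes private tags. It is a minimal sketch, not our production module: the attribute list is far shorter than the full Basic Profile in DICOM PS3.15, and real deployments must also handle UIDs, dates, and burned-in annotations.

```python
import pydicom

# A few attributes covered by the DICOM Basic Application Level
# Confidentiality Profile; this list is illustrative, not exhaustive.
IDENTIFYING_TAGS = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "InstitutionName", "ReferringPhysicianName",
]

def deidentify(in_path, out_path):
    """Blank identifying attributes, drop private tags, and mark the
    object as de-identified before saving."""
    ds = pydicom.dcmread(in_path)
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            ds.data_element(tag).value = ""
    ds.remove_private_tags()
    ds.PatientIdentityRemoved = "YES"            # (0012,0062)
    ds.DeidentificationMethod = "Basic Profile"  # (0012,0063)
    ds.save_as(out_path)
```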

In the routine workflow of breast ultrasound screening, sonographers or physicians conduct real-time ultrasound scanning and select key images that are subsequently uploaded to PACS. Physicians then retrieve these key images from PACS and review and interpret them for reporting. The proposed CADx system assists physicians in interpreting images and generating standardized reports, aiding the classification of BI-RADS assessment classes and facilitating statistical analysis and data reuse through standardized data fields that adhere to the BI-RADS medical guidelines. This workflow helps physicians produce BI-RADS-classified reports; however, it does not provide real-time assessment during screening.

To bring this AI model, decoupled from the CADx system, into real-time ultrasound scanning in clinics, it could be embedded in ultrasound devices equipped with higher-speed computing units, enabling real-time detection during examinations. Alternatively, the CADx could be deployed as a cloud service or a cloud PACS with high network bandwidth so that AI results can be retrieved promptly. Furthermore, integrating image acquisition information such as echo patterns and posterior features into the device-integrated AI model in a multi-modality fashion could further enhance the interpretation of BI-RADS features in the future.

5 Conclusions

Our study contributes to the field of breast ultrasound lesion classification by employing deep learning models. We achieved competitive performance in classifying breast ultrasound lesions into the ACR BI-RADS assessment classes of Category 2, Category 3, and Category 4/5, and we developed an open-source computer-aided diagnosis system to facilitate the practical application of our research. Our study focused on achieving accurate and reliable classification results, which are crucial for determining the risk level of breast lesions and providing appropriate medical recommendations. Among the deep learning models we evaluated, ResNet-50 yielded the best single-model performance, with 80.7% accuracy and 79.2% balanced accuracy. Furthermore, we proposed ensemble models and found that the average prediction approach further improved the classification performance, to 81.8% accuracy and 80.2% balanced accuracy.

Our results are comparable to those of Hejduk et al. [32], which signifies the quality and competitiveness of our findings in the context of breast lesion classification. We also found that training models with histogram equalization and data augmentation led to notable improvements in accuracy (3–5%), and that data augmentation and ensemble methods enhanced all models' performance in classifying BI-RADS Category 3 lesions. Our experimental results demonstrated excellent classification performance (AUC > 0.9) for both BI-RADS Category 2 and Category 4/5 and good performance (0.8 < AUC < 0.9) for BI-RADS Category 3, indicating the reliability and efficacy of our findings in differentiating lesion categories.

To further improve our study, we recommend expanding the dataset to encompass a broader range of ROIs, incorporating additional ACR BI-RADS standard lesion descriptions such as shape, margin, and echo pattern features for improved classification, and standardizing the ROI by quantifying the surrounding tissues of lesions [19].