1 Introduction

Breast cancer is distinguished by its high incidence and mortality rate, which poses a severe threat to women's health worldwide [1]. The molecular subtype of breast cancer is essential for identifying substantially varied clinical phenotypes, treatment responses, and outcomes [2]. Triple-negative breast cancer (TNBC) tends to be more aggressive and resistant to common treatments, with a high recurrence rate and poor prognosis [3, 4]. To develop an appropriate therapy and improve the prognosis of TNBC patients, it is crucial to distinguish TNBC from the other three subtypes. In clinical practice, the molecular subtype can only be determined with certainty through surgical resection, because tumor heterogeneity can lead to multiple molecular subtypes coexisting within one tumor, and the samples captured by core biopsy constitute only a small portion of the whole lesion, which is not always representative of heterogeneous tumors. Additionally, for advanced patients, neoadjuvant therapy plays an essential role in the treatment plan, and the molecular subtype of a breast lesion may change after treatment [5]. More importantly, it has been shown that the biological characteristics of residual lesions after neoadjuvant therapy have a greater impact on prognosis than the characteristics of the primary tumor [6], and conversion to triple-negative status after neoadjuvant therapy is an independent risk factor for poor prognosis [5]. In such cases, another biopsy is required before further therapy. It is therefore of great value to develop a noninvasive, accurate, and efficient approach to determine the molecular subtype of breast cancer.

Medical imaging plays an essential part in the assessment of breast cancer, as the primary tool to detect and diagnose lesions. Nevertheless, with the permeation of deep learning into the field of medical imaging, the role of imaging is rapidly evolving from merely providing diagnostic information to driving the advancement of personalized precision medicine [7]. Deep learning (DL) has exhibited promising performance on a range of diagnostic and predictive tasks on medical images [8,9,10]. A multicenter study achieved satisfactory performance (AUC of 0.91) in differentiating three breast cancer molecular subtypes on MRI using deep learning algorithms [11]. Another study developed a deep learning mammography-based model that identified women at high risk of breast cancer [8]. These advances motivate the use of deep learning for molecular subtype determination from medical images.

Ultrasound (US) is a common imaging modality that uses sound waves to produce images of body structures [12, 13], including the breast, thyroid, muscles, joints, vessels, and internal organs. The images can provide valuable information for diagnosing and directing treatment of diseases. It is a preferred medical imaging method for breast cancer and has the highest adoption rate in Asian countries, owing to its noninvasiveness, convenience, and high sensitivity to breast nodules in dense breasts [14, 15]. However, the acquisition of US images is prone to discrepancies between operators. Artifacts arising during US image acquisition, such as noise, speckle, and signal attenuation, can make it difficult for radiologists to identify disease. More importantly, because of variability in equipment and grayscale adjustments, the size, format, and grayscale of the captured images also vary. There is no standardized method to acquire images, which poses challenges for applications using US images. Only a few deep learning studies use breast ultrasound as a modality, compared to mammography and MRI, and most of them focus on developing deep learning approaches to assist the detection, segmentation, and diagnosis of breast cancer [16,17,18,19]. Few studies have explored distinguishing molecular subtypes solely from raw ultrasound images, and the reported performance varies widely across molecular subtypes [20,21,22,23].

In this study, we aimed to develop a fully automated deep learning-based system for molecular subtype determination solely from breast cancer ultrasound images and to evaluate the ability of the system to distinguish TNBC, which has the poorest prognosis, from the other subtypes (luminal A, luminal B, and HER2-positive).

2 Materials and methods

This retrospective study was approved by the Institutional Review Board of Peking Union Medical College Hospital (Number: JS-1987), and written informed consent was obtained from all the participants.

2.1 Study cohorts and datasets

The dataset was collected at Peking Union Medical College Hospital and consists of 145 female breast cancer patients without a breast cancer history who underwent ultrasound examination by a single radiologist between April 2018 and March 2019. Exclusion criteria were the following: (1) preoperative intervention (neoadjuvant therapy (NAT) or biopsy) performed before ultrasound examination. Biopsy affects the tumor morphology by removing a part of the tumor, and NAT results in changes in tumor size, morphology, and even clonal composition; these changes do not reflect the true condition of the tumor. (2) Multiple malignant lesions. Patients with multiple breast cancers have significantly poorer disease-free survival than those with a single tumor [24], and tumor multiplicity is used as an independent factor for subclassifying breast cancer, so such cases were excluded from this study. (3) Incomplete clinical or pathological information. The patients were divided into train, validation, and test sets randomly at a ratio of 8:1:1. A flowchart summarizing these steps is shown in Fig. 1.

Fig. 1 Flowchart of the procedure in this study

For each patient who underwent US examination, two or more US images were captured using ACUSON S2000 (Siemens) and EPIQ7 (Philips) machines with linear probes (3–12 MHz) by an experienced radiologist (with 11 years of experience) and reviewed by two experienced radiologists (with 6 and 4 years of experience, respectively) to confirm the index lesion. The index tumor images were captured from multiple angles, including at least longitudinal and transversal sections. US tumor size was measured at the longitudinal section. Both grayscale and color Doppler images were included.

2.2 Subtype labeling

The St. Gallen International Breast Cancer Conference proposed the estrogen receptor (ER), the progesterone receptor (PR), the human epidermal growth factor receptor type 2 (HER2) protein, and the Ki-67 proliferation index as the main receptor indicators for the molecular subtyping of breast cancer [25]. These factors are strongly associated with the prognosis and outcome of breast cancer patients [26,27,28,29,30]. Accordingly, each patient underwent surgical excision, and the status of these four biological markers was evaluated via immunohistochemical (IHC) staining.

The molecular subtype was labeled by applying the standard criteria to the biological marker status, referring to the surrogate definition of intrinsic molecular subtypes of breast cancer from the St. Gallen International Breast Cancer Conference [25]: luminal A (ER (+) or PR (+), Ki-67 low (< 14%), and HER2 (−)), luminal B (ER (+) or PR (+), Ki-67 high (≥ 14%) and/or HER2 (+)), HER2-positive (ER (−), PR (−), and HER2 (+)), and triple-negative (ER (−), PR (−), and HER2 (−)).

Note that ER or PR is considered positive when the percentage of stained cells is > 1%. For HER2, an IHC score of 0 or 1+ is considered negative, and a score of 3+ is considered positive. If the IHC score is 2+, fluorescence in situ hybridization (FISH) is further performed, and HER2 is considered positive if the ratio of HER2 gene signals to chromosome 17 probe signals (HER2/CEP17 ratio) is ≥ 2.0 or the average number of HER2 signals per cell is ≥ 6.0.

Given the poorer prognosis of triple-negative cases, we grouped the non-triple-negative (luminal A, luminal B, and HER2-positive) cases and focused on the binary classification task of identifying triple-negative cases.
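For concreteness, the labeling rules above can be expressed as a short decision procedure. The following Python sketch uses hypothetical function and argument names; the thresholds are exactly those stated above, and for IHC 2+ cases the FISH measurements must be supplied.

```python
def her2_status(ihc_score, her2_cep17_ratio=None, her2_signals_per_cell=None):
    """HER2 status from the IHC score, with reflex FISH for equivocal (2+) cases."""
    if ihc_score in (0, 1):       # IHC 0 or 1+ is negative
        return False
    if ihc_score == 3:            # IHC 3+ is positive
        return True
    # IHC 2+: positive if HER2/CEP17 ratio >= 2.0 or mean HER2 signals/cell >= 6.0
    return her2_cep17_ratio >= 2.0 or her2_signals_per_cell >= 6.0


def molecular_subtype(er_pct, pr_pct, her2_positive, ki67_pct):
    """Surrogate St. Gallen subtype from the four marker statuses."""
    hr_positive = er_pct > 1.0 or pr_pct > 1.0   # ER/PR positive if > 1% stained cells
    if hr_positive:
        if ki67_pct < 14.0 and not her2_positive:
            return "luminal A"
        return "luminal B"                        # Ki-67 high and/or HER2 (+)
    return "HER2-positive" if her2_positive else "triple-negative"


# Binary label used in this study: triple-negative (1) vs. rest (0)
subtype = molecular_subtype(er_pct=0.0, pr_pct=0.0, her2_positive=False, ki67_pct=40.0)
label = int(subtype == "triple-negative")   # -> 1
```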

2.3 Image preprocessing

First, images were resized and cropped to a uniform model input size, which also mitigates the presence of noise (e.g., black bands) in the outer parts of the images. Then, we preprocessed the US images to deal with the problem of intensity heterogeneity in ultrasound, whereby tumor tissue of the same nature (e.g., cancerous tissue) appears with varying pixel intensity across images, depending on the settings used by the ultrasound machine operator. This is a common problem and a major challenge in automated inference from ultrasound images. We transformed the images using adaptive histogram equalization [31]. This contrast enhancement method adjusts the intensity of pixels across an image to normalize the local histograms of pixel values (Fig. 2): it computes multiple histograms, one for each section of the image, and uses them to balance the intensity values of the image. We assessed several other approaches to counter the intensity heterogeneity, including standard dataset-level normalization (mean centering and standard scaling) and image-level normalization, and found adaptive histogram equalization to be the most effective.
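The paper does not tie the preprocessing to a specific implementation; a minimal sketch using OpenCV, with the contrast-limited variant of adaptive histogram equalization (CLAHE), might look as follows. The center-crop strategy, clip limit, and tile grid size are illustrative assumptions.

```python
import cv2

def preprocess_frame(path, size=224):
    """Load one ultrasound frame, crop away the noisy border region,
    resize to the model input size, and apply adaptive histogram equalization."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    m = min(h, w)
    top, left = (h - m) // 2, (w - m) // 2
    img = img[top:top + m, left:left + m]            # center crop to a square
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    # CLAHE: one histogram per tile, equalized tile-wise with interpolation
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)
```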

Fig. 2 Image processing steps. Images are resized and cropped to 224 × 224 pixels, converted to grayscale, and then transformed by adaptive histogram equalization

2.4 Model architecture

For automatic classification, a VGG-based model was employed to distinguish triple-negative tumors from other tumors using pixel information presented in US images, given the popularity and success of the VGG model in the medical field [32]. The model uses solely US images as input, without a delineated region of interest (ROI), to predict the molecular subtype of a patient. Fig. 3 shows the components of the system and the architecture of the model.

Fig. 3 End-to-end system and architecture of the model. The VGG-19 architecture is modified to perform binary classification, preceded by adaptive histogram equalization

We alter the VGG-19 architecture [33] to perform binary classification of raw ultrasound images, end-to-end, into the triple-negative class or the rest class. For each patient, each image is classified independently of the patient's other images, which allows us to assess the model's generalization across different angles of capture and across grayscale and color Doppler variations. To this end, we ensure that all images of a patient appear in one and only one of the image sets used while training the model.
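As a reference, the modification can be sketched in PyTorch (the framework choice and the exact head replacement are our assumptions; the text specifies only that VGG-19 is adapted for binary classification). The 4096-dimensional penultimate layer is the one later visualized with t-SNE.

```python
import torch.nn as nn
from torchvision import models

def build_model(num_classes=2):
    """VGG-19 with its 1000-way ImageNet head replaced by a 2-way classifier.
    Grayscale frames are assumed replicated to three channels at load time
    to match VGG's expected input."""
    model = models.vgg19(weights=None)                  # no pre-training (Sect. 2.5)
    model.classifier[6] = nn.Linear(4096, num_classes)  # TN vs. rest
    return model
```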

2.5 Model training

As opposed to most other studies that employ such models in the medical field, we do not pre-train the model on a larger, unrelated dataset but instead train it from scratch. We do so because all available pre-trained models are trained on datasets of natural images such as ImageNet, whose pixel distribution differs fundamentally from that of ultrasound images.

To tune the model, we adopt a train, validation, and test setup. The model is trained to learn visual patterns for each of the classes on the train set (80%) using the cross-entropy loss with standard backpropagation [34, 35]; the validation (10%) and test (10%) sets are left out of sample, i.e., not seen during training. We report the out-of-sample performance as measured on the test set. Note that the data is partitioned in the space of patients rather than images, so that the distinct images of a patient all appear in one and only one of these three sets.
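A patient-level 8:1:1 split of this kind can be sketched with scikit-learn's GroupShuffleSplit (our choice of tool; labels and patient_ids are assumed to be arrays parallel to the images):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(labels, patient_ids, seed=0):
    """8:1:1 split performed over patients, not images: all images of a
    patient land in exactly one of the train/validation/test sets."""
    labels, patient_ids = np.asarray(labels), np.asarray(patient_ids)
    idx = np.arange(len(labels))
    holdout = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(holdout.split(idx, labels, groups=patient_ids))
    half = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(half.split(rest_idx, groups=patient_ids[rest_idx]))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```

Class stratification is handled separately by the cross-validation procedure described next.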

2.6 Performance evaluation

The model's performance was measured using stratified k-fold cross-validation (four-fold in our experiments). Each fold is a random partition of the dataset, and in each fold, the proportion of each of the two classes in the three subsets matches that in the overall dataset. To further counter the class imbalance, we under-sample the dominant class (rest) at each epoch.
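One way to realize stratified, patient-level folds together with the per-epoch undersampling is sketched below (our construction; StratifiedGroupKFold requires scikit-learn >= 1.0, and the helper names are hypothetical):

```python
import random
from collections import defaultdict

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def stratified_patient_folds(labels, patient_ids, k=4, seed=0):
    """k stratified partitions in which a fold never splits a patient's images."""
    sgkf = StratifiedGroupKFold(n_splits=k, shuffle=True, random_state=seed)
    return list(sgkf.split(np.zeros(len(labels)), labels, patient_ids))

def undersample_epoch(train_indices, labels):
    """Draw one epoch's sample, subsampling the dominant ('rest') class
    down to the size of the minority (triple-negative) class."""
    by_class = defaultdict(list)
    for i in train_indices:
        by_class[labels[i]].append(i)
    n_min = min(len(v) for v in by_class.values())
    epoch = [i for v in by_class.values() for i in random.sample(v, n_min)]
    random.shuffle(epoch)
    return epoch
```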

We use the Adam optimizer [36] but find that manually reducing the learning rate during training gains a few percentage points of performance. We cap the number of training epochs at 30 and apply an early-stopping scheme. The early-stopping epoch and the hyper-parameters are set according to performance on the validation set. The loss curves and the evolution of the model's classification performance across training epochs (learning curves) are shown in Fig. 4.
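A minimal training loop consistent with this description might look as follows. The initial learning rate, the epoch of the manual learning-rate drop, and the early-stopping patience are illustrative assumptions; the 30-epoch cap, Adam, cross-entropy, and validation-based stopping come from the text.

```python
import torch

def train(model, train_loader, val_loader, device="cuda",
          max_epochs=30, patience=5, lr=1e-4, drop_at=15):
    """Cross-entropy training with Adam, one manual learning-rate drop,
    and early stopping on the validation loss."""
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, stale = float("inf"), 0
    for epoch in range(max_epochs):
        if epoch == drop_at:                      # manual learning-rate reduction
            for group in optimizer.param_groups:
                group["lr"] *= 0.1
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x.to(device)), y.to(device)).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader)
        if val_loss < best_val:                   # keep the best checkpoint
            best_val, stale = val_loss, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            stale += 1
            if stale >= patience:                 # early stopping
                break
```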

Fig. 4 Performance of the model. Loss (the cross-entropy training objective) (a), F1-score (harmonic mean of the model's precision and recall) (b), and accuracy (proportion of classifications the model made correctly) (c) at each epoch on the train, validation, and test sets of one of the four partitions

To assess the performance of the model, the area under the receiver operating characteristic curve (AUC) with a 95% confidence interval (CI) is reported, along with accuracy, sensitivity, specificity, and F1-score with 95% CIs. In addition, the numbers of true-positive, false-positive, true-negative, and false-negative findings of the model are reported in a confusion matrix. These statistical analyses were performed using SPSS software (version 25.0, IBM, Armonk, NY, USA).
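The statistics were computed in SPSS; for reference, an equivalent computation can be sketched in Python with scikit-learn. The bootstrap procedure for the 95% CI and the 0.5 decision threshold are our assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5, n_boot=1000, seed=0):
    """Test-set metrics with a bootstrap 95% CI for the AUC."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):                          # resample cases with replacement
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():   # need both classes for an AUC
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    low, high = np.percentile(aucs, [2.5, 97.5])
    return {"auc": roc_auc_score(y_true, y_score), "auc_95ci": (low, high),
            "sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / len(y_true), "f1": f1_score(y_true, y_pred)}
```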

To probe the interpretability of the model, t-SNE (t-distributed Stochastic Neighbor Embedding) analysis and saliency maps were used to visualize the features learned by the model and the areas of the images that are most suggestive of triple-negative breast cancer to the model.
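For illustration, the 4096-dimensional activations can be collected from the VGG-19 sketch above and projected with scikit-learn's t-SNE as follows (the truncation of the classifier head and the perplexity value are our assumptions; model and loader are assumed defined):

```python
import numpy as np
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def last_hidden_features(model, loader, device="cuda"):
    """Collect the 4096-d activations of the model's last hidden layer by
    replaying VGG-19's forward pass without its final classification layer."""
    trunk = torch.nn.Sequential(model.features, model.avgpool, torch.nn.Flatten(),
                                *list(model.classifier.children())[:-1])
    trunk.to(device).eval()
    return np.concatenate([trunk(x.to(device)).cpu().numpy() for x, _ in loader])

# features: (n_images, 4096); xy: (n_images, 2) coordinates for the scatter plot
features = last_hidden_features(model, loader)
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
```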

3 Results

3.1 Patients

A total of 145 patients (mean age, 51.66 ± 11.13 years; range, 29–82 years) were enrolled in this study. Baseline information on the study population is detailed in Table 1. The dataset was partitioned into train, validation, and test cohorts comprising, in the case of Partition 1, 115 (79.31%), 15 (10.34%), and 15 (10.34%) breast cancer patients, respectively. There was no significant difference among the three cohorts with respect to age (P = 0.667), US tumor size (P = 0.178), histological type distribution (P = 0.351), or molecular subtype (P = 0.840). The molecular subtypes occur in the dataset in proportions comparable to their incidence in the broader population [37].

Table 1 Baseline information for the train, validation, and test sets

3.2 Model performance

The effectiveness of the model at distinguishing triple-negative tumors from the other three molecular subtypes was assessed using four-fold cross-validation. We report in Table 2 the median of metrics across the four partitions. The model reaches an AUC of 0.86 (95% CI: 0.64, 0.95), a sensitivity of 86%, a specificity of 86%, and an F1-score of 0.74 on the test set.

Table 2 Classification performance on the test set across the four partitions

The metrics are best understood via the confusion matrices in Fig. 5, which show the exact number of correct and incorrect classifications for each class and each dataset partition.

Fig. 5 Confusion matrices of the classification results. Rows are the true class (ground truth) of the test examples, and columns are the predicted class

3.3 Feature visualization

We visualize in Fig. 6 the internal features learned by the model using t-SNE. Each color corresponds to a class, and each dot represents a breast ultrasound image from the dataset, projected from the 4096-dimensional output of the model's last hidden layer into two dimensions. Two clusters of dots can be identified, exhibiting class separation. The triple-negative cluster lies on the edge of the cloud of dots, highlighting that triple-negative cases are visually distinguishable, to the extent that the model learns a high-level representation in which triple-negative cases are separable from the other cases (luminal A, luminal B, and HER2-positive).

Fig. 6 Visualization of the model's internal features. a, b Cropped original images and saliency maps of four triple-negative patients (a) and four non-triple-negative patients (b), where highlighted pixels are those with greater influence on the model's classification decision. c t-SNE visualization of convolutional features in the last hidden layer of the trained model, exhibiting class separation in internal high-level feature representations

Furthermore, we produce saliency maps to understand which visual features of breast tumors, as seen on the ultrasound images, the model uses to classify images (Fig. 6). Saliency maps highlight the pixels of an input image that most influence the model's classification decision, for images randomly sampled from the dataset for each of the two classes. The maps are computed by taking each pixel's gradient with respect to the model's loss function, so highlighted pixels are those with greater influence on the classification decision. The maps show that the CNN focuses on the most predictive part of the image. TN saliency maps display higher brightness on the hypoechoic lesions, suggesting that the model relies more heavily on information from the tumor tissue and margin in TN cases. In contrast, in saliency maps of the other subtypes (non-TN), brightness is more uniform across both lesion and background areas, indicating that the model indifferently uses information from the lesion and the neighboring tissue.
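The computation described here corresponds to vanilla gradient saliency; a minimal PyTorch sketch (function name and normalization are ours) could be:

```python
import torch
import torch.nn.functional as F

def saliency_map(model, image, label, device="cuda"):
    """Vanilla gradient saliency: per-pixel gradient of the loss w.r.t. the input."""
    model.to(device).eval()
    x = image.unsqueeze(0).to(device).requires_grad_(True)   # add batch dimension
    loss = F.cross_entropy(model(x), torch.tensor([label], device=device))
    loss.backward()
    # One heat value per pixel: maximum absolute gradient across input channels
    sal = x.grad.detach().abs().max(dim=1).values.squeeze(0)
    return (sal / sal.max()).cpu().numpy()        # normalized to [0, 1] for display
```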

4 Discussion

Triple-negative breast cancer often occurs at a young age and presents the highest degree of malignancy and invasiveness. Unfortunately, endocrine therapy and targeted therapy cannot benefit these patients, who may rely only on chemotherapy [2, 4, 38]. Therefore, identifying triple-negative breast cancer is key to guiding the selection of clinical pathways.

In this study, we present a model that automatically distinguishes triple-negative breast cancer from the other molecular subtypes in a noninvasive, comprehensive manner. The model is trained on US images, without any histopathological information as predictive input, and achieves an area under the receiver operating characteristic curve (AUC) of 0.86, a sensitivity of 85.7%, and a specificity of 86.3%, which is promising for the task of preoperatively predicting the biological behavior of TNBC. Our approach demonstrates the potential of CNN models to automatically identify triple-negative patients from US images preoperatively and can assist in making more appropriate treatment decisions.

Although distinguishing breast cancer molecular subtypes from US images is a relatively new research field, previous studies have found differences between TN and non-TN tumors in visual US features: TN tumors are more likely to have a circumscribed margin but less likely to present calcifications and an echogenic halo [39,40,41,42,43]. Previously, ultrasonic features of invasive breast ductal carcinoma were extracted and selected using machine learning methods, and those features demonstrated a strong correlation with receptor status and molecular subtypes [44]. It has also been reported that some features mined from ultrasonic imaging could distinguish TNBC from fibroma [45]. Recently, the use of deep learning has helped advance the automated classification of molecular subtypes of breast cancer. In the existing literature, only four other studies tackle the determination of breast cancer molecular subtypes solely from raw ultrasound images. One study developed three deep learning models that determine the molecular subtype from multi-modal US images: a monomodal model (grayscale US), a dual-modal model (grayscale US and color Doppler), and a multimodal model (grayscale US, color Doppler, and shear-wave elastography, SWE) [22]. Two other studies first performed benign-malignant identification and later inferred the molecular subtype separately [21, 23], while another study put more emphasis on the task of discriminating luminal from non-luminal cases [20]. However, the predictive ability for triple-negative breast cancer varied widely across studies (accuracy range of 53.19–97.02%) and could be further improved. In contrast with previous studies, we conducted a direct discriminative prediction of TN versus non-TN. We employed VGG-19, a convolutional neural network architecture different from those of previous studies (ResNet50 for the first two [36] and Xception for the third [6]). Our model achieves superior performance when considering grayscale US and color Doppler, with an AUC of 0.86 and an F1-score of 0.74. We also used a method for standardizing ultrasound images, in which images are preprocessed to eliminate the effect of intensity variability, which benefits the model's generalization ability.

To understand how the model learned from the input data to discriminate TN from the other subtypes, two analytical methods were employed to visualize the model's learned internal features. Feature visualization is needed in part to confirm that the model indeed focuses on US features associated with triple-negative cases rather than on irrelevant parts of the image. First, the t-SNE analysis shows that in the learned feature space, TN and other cases are separable. Second, the saliency maps are an intuitive reflection of the different weights the model gives to visual features in US images. For TN lesions, the model gives greater importance to pixel information from the tumor tissue and margin, as seen in the higher brightness of hypoechoic lesions on TN saliency maps. For lesions of other types, however, the model indifferently uses information from the lesion and the neighboring tissue, as seen in the more uniform brightness across both lesion and background areas on saliency maps of the other subtypes (non-TN). This is consistent with previous findings that under grayscale ultrasound, triple-negative lesions tend to have more circumscribed margins and can be clearly distinguished from surrounding tissue, whereas non-TN lesions are typically less differentiated from surrounding tissue and have lower contrast and more irregular shape [39, 40].

Our study presents several limitations. First, the sample size is small and the data were collected from a single center, so the predictive ability of the model needs to be validated on external data in a multi-center setup, a necessary step toward clinical use. Second, in this study, we collected US images presenting at least the largest-diameter section and the orthogonal section; it remains to be determined whether the predictive ability of the model would be significantly affected by including additional US images of a single index lesion. Third, while our study focuses on the binary problem of identifying triple-negative cases, future research should tackle the four-way breast cancer molecular subtyping task. Fourth, benchmarking several deep learning models (including non-CNN ones, e.g., Vision Transformers [46]) would help identify the architecture best suited for the task and could yield performance gains.

5 Conclusion

An end-to-end deep learning approach was proposed to identify, in raw ultrasound images, triple-negative breast cancer, which is characterized by its poor prognosis, a task that radiologists are not able to perform. The approach is noninvasive and automated: it does not use any histopathological information from biopsy or surgery as predictive input and does not rely on manually crafted features such as a region of interest or radiomics. The system can serve as a prospective decision-making tool for clinicians enacting treatment plans and assessing prognosis.