Introduction

Breast cancer increases in stage and severity as it metastasizes to axillary lymph nodes [1]. Lymph node involvement increases the risk of recurrence and acts as a prognostic indicator, with the survival rate of node-positive patients being up to 40% lower than node-negative patients [2,3,4,5,6]. As a result, lymph node status is critical for diagnosis, prognosis, and monitoring of treatments [7].

Although lymph node management has become less invasive with the use of sentinel lymph node (SLN) biopsy as opposed to full axillary lymph node dissection, significant side effects including shoulder dysfunction, lymphedema, and nerve damage were still observed in as much as one-fourth of patients [8, 9]. Moreover, studies have reported > 70% of biopsied SLNs are negative [8], indicating that such procedure is unbeneficial and potentially harmful to a significant amount of breast cancer patients. Accurate non-invasive assessment of nodal involvement therefore is valuable in cancer staging, surgical risk, and financial cost reduction.

Breast cancer is an area of peaked interest for the combination of radiomics and artificial intelligence, with clinical impact possible as both a diagnostic and prognostic tool [10]. One such task is the development of a predictive model for non-invasive staging of the axillary lymph nodes as an alternative to SLN biopsy. Nomograms and radiomic pipelines have been used to predict SLN status with promising results [9, 11,12,13,14,15,16,17,18,19]. However, conventional radiomics (CR) has several disadvantages. For instance, the robustness of the conventional hand-crafted radiomic features is variable based on changing parameters, including pixel size, region-of-interest (ROI) delineation, and signal-to-noise ratio [20]. Deep learning has the potential to serve as a more powerful tool to overcome these issues as shown in several studies [21,22,23,24,25,26]. Moreover, deep learning is capable of learning high-level and task-adaptive image features [27]. It enables direct feature extraction from multiple levels without explicit definition and can provide a higher level of feature abstraction [28]. However, deep learning requires a large training data size to obtain a generalizable and functional classification model. Fortunately, studies have demonstrated that initial features extracted by deep learning network are largely similar to CR, since they both detect edges, ripples, and various other textures prior to observing more complex features [29,30,31]. Thus, it is possible to use features identified by a pre-trained deep learning network as an alternative to hand-crafted features used in CR.

The purpose of this study was to develop a DLB feature prediction model for preoperative prediction of SLN metastasis and compare its predictive performance to state-of-the-art CR. Specifically, this study aimed to compare the generalizability of CR vs DLB features in an independent testing set of dissimilar resolution.

Methods

Figure 1 shows the general pipeline used in this work.

Fig. 1
figure 1

Schematic representation of pipeline for feature extraction, reduction, and model creation. The CR pipeline and the pipeline using deep learning-based (DLB) features are only different in their feature extraction step. All other steps remain identical. LASSO: Least absolute shrinkage selection operator; ROC: Receiver operating characteristic

Study population

The dataset used in this study is an expansion of that described in previous publication [13]. Briefly, data for this institutional review board-approved retrospective study collected images from June 2013 to June 2017. Inclusion criteria were patients that had (1) preoperative dynamic contrast enhanced (DCE)-magnetic resonance imaging (MRI), (2) diagnosis of invasive breast cancer by histopathology, (3) SLN biopsy result, and (4) no neoadjuvant chemotherapy. Exclusion criteria were patients that had (1) no SLN biopsy result, (2) very small tumor ROI (less than 64 voxels), or (3) MRI after neoadjuvant chemotherapy. After inclusion/exclusion criteria, a sample of 198 patients (67 positive SLNs and 131 negative SLNs) was used in this study. Of those 198 subjects, 163 had an in-plane resolution of 0.7 × 0.7 mm2; that 163 subject cohort was randomly divided into two independent subsets: a training set (approximately 67%, 109 patients with 37 positive SLNs) and a validation set (approximately 33%, 54 patients with 18 positive SLNs).

The remaining 35 subjects (35 patients with 12 positive SLNs) with a different in-plane resolution (0.78 × 0.78 mm2) were treated as an independent testing set with dissimilar resolution to test the generalizability of the predictive models for imaging data acquired with slightly different resolution. Given that radiomics has been shown to have limited generalizability, an independent testing set of dissimilar resolution will more rigorously assess this potential of the predictive model.

Clinical data collected for this study included whether the tumor was confined to the upper inner quadrant, multifocality, age, pathological type, tumor grade, molecular subtype, and lymphovascular invasion.

MRI examination

The MRI examinations were all performed using a dedicated 8-channel breast coil on 1.5 T GE Signa (GE Healthcare, Wauwatosa). The sequence of interest in this study was the DCE series; sagittal VIBRANT multiphase sequence was acquired with the following parameters: repetition time (TR) = 4.46–7.80 ms; echo time (TE) = 1.54–4.20 ms; flip angle = 10°; matrix = 256 × 256; slice thickness = 2 mm. I.V. contrast agent was Magnevist (Schering, Berlin), injected at a dose of 0.2 mL/kg at a rate of 2 mL/s, followed by 20 mL saline flush. Five phases were acquired: one pre-contrast and four post-contrast images. Patients with pixel sizes of 0.7 × 0.7 mm2 were split into training and validation cohorts. Patients with pixel sizes of 0.78 × 0.78 mm2 were separately analyzed in an independent testing set. This analysis allows for the clinically practical reality that it is knowingly difficult to standardize pixel size, which may need to be adjusted based patient specific variable (e.g., size of the patient).

Map calculation

Reducing the effect of varying TR and TE, three ratio maps were used: wash-in maps ((S1-S0)/S0) × 100%, wash-out maps ((S1-S4)/S1)) × 100%, and signal enhancement ratio (SER) maps ((S1-S0)/(S4-S0)) × 100%, where S0, S1, and S4 are the pre-contrast, first post-contrast, and fourth (the last) post-contrast images, respectively. These maps are independent of the original MR signal intensity and capture the behavior of contrast enhancement in the tissue. Representative image and calculated kinetic maps are shown in Fig. 2.

Fig. 2
figure 2

Representative image of pre- and post-contrast images of an in-breast tumor and the calculated Wash-In, Wash-Out, and SER maps. Equations for map calculation: Wash-in map ((S1-S0)/S0) × 100%, Wash-out map ((S1-S4)/S1)) × 100%, SER map ((S1-S0)/(S4-S0)) × 100%, where S0, S1, and S4 are the pre-contrast, first postcontrast, and fourth (the last) post-contrast images, respectively

Segmentation

ROIs of the tumor were manually drawn on the first post-contrast image by a radiologist with 11 years of experience. We noted that although manually drawn ROIs can be subjective, an automated convolutional neural network (CNN)-based segmentation was shown to be comparable in radiomics task-based assessment within this cohort [32]. The original ROI was dilated by 4 mm using Matlab v2017b (MathWorks, Natick). This resulted in two regions of interest: one intratumoral ROI and one peritumoral region (0–4 mm). These regions are shown in Fig. 3.

Fig. 3
figure 3

Representative image of intratumoral ROI and peritumoral ROI. Intratumoral ROI was drawn by a radiologist on the first post-contrast image. The peritumoral ROI was generated by dilating the intratumoral ROI by 4 mm

Feature extraction

CR features

Shape features, first order (histogram) features and second order texture features [grey-level co-occurrence matrix (GLCM), neighborhood grey-level different matrix (NGLDM), grey-level run-length matrix (GLRLM), grey-level zone-length matrix (GLZLM)] were extracted following the image biomarker standardization initiative standard [33] using LifeX 3.42 [34]. Laws features [35] were extracted using in-house software written in Matlab v2017b (MathWorks, Natick). Image was quantized to 128 grey levels, and absolute resampling was performed (for intratumoral ROIs, wash-in map: 0 ∼ 640%; wash-out map: − 156 ∼ 100%; SER map: − 1280 ∼ 1280%; for peritumoral ROIs, wash-in map: 0 ∼ 640%, wash-out map: − 540 ∼ 100%, SER map: − 1280 ∼ 1280%). A total of 105 features were extracted. Summary of features is included in Supplemental Table 1. Shape features were only calculated for intratumoral ROI; the remainder of the features were calculated for both intratumoral and peritumoral ROI.

DLB features

VGG-16 [36], a pre-trained CNN architecture that is 16 layers deep, was utilized for DLB feature extraction. A schematic representation of the network is shown in Fig. 4.

Fig. 4
figure 4

Schematic representation of adapted VGG-16 network for DLB feature extraction. Each layer is defined by corresponding color legend at the bottom of the figure. Input was a 3D volume for a single slice of the three kinetic maps. Features were extracted from the last fully connected layer (yellow), resulting in a feature vector of 1000 features for each 2D slice. Conv: Convolutional layer; ReLu: Rectified linear unit

The image was multiplied by the binary mask of either the intratumoral ROI or peritumoral ROI such that the regions outside of the RIOs were set to zero. Absolute resampling, similar to above, was performed (for intratumoral ROIs, wash-in map: 0 ∼ 640%; wash-out map: − 156 ∼ 100%; SER map: − 1280 ∼ 1280%; for peritumoral ROIs, wash-in map: 0 ∼ 640%, wash-out map: − 540 ∼ 100%, SER map: − 1280 ∼ 1280%). Data were then normalized on a scale from 0 to 1, with 0 being the lowest value and 1 being the highest value referenced above. Then, the resultant image was multiplied by 255 to match the 0–255 range expected by VGG-16.

VGG-16 has a predefined input structure of 224 × 224 × 3. Each 2D slice of our dataset was cropped to 224 × 224. A 3D volume comprised of the Wash-In, Wash-Out, and SER maps for each slice was inputted.

Matlab was used to import VGG-16. The model was not retrained; instead, all layers remained frozen (weights remained the same), and only activations from the last fully connected layer (fc8) were extracted. These were exported as rows, in which each 2D slice had a single row of 1000 features. Given that each ROI has multiple slices, the value in each column was averaged across all slices for that subject.

Feature reduction

From this point forward, the pipelines for the CR and DLB features are separate and identical.

Standard z-score normalization was used on the training set; z-score is value minus training set mean divided by training set standard deviation. The validation and testing sets were also normalized using the training set mean and standard deviation. Training set was rebalanced using an adaptive synthetic sampling approach; this improves class balance by the creation of new samples from the minority group [37].

Given the high dimensionality of all the extracted features, several steps were performed to remove redundant or non-informative features. Firstly, Mann-Whitney U-test was used to find significantly different features between SLN positive and SLN negative groups; a range of p-value thresholds were tested (0.001, 0.005, 0.01, 0.05). Secondly, groups of highly correlated radiomic features were identified (Spearman ρ) and only one representative feature was selected from each correlated group; similarly, several ρ-value thresholds were tested (0.75, 0.80, 0.85, 0.90, 0.95). Finally, an optional step of principal component analysis (PCA) was performed to further reduce the feature space; several number of PCA components were tested (20, 40, 60, 80, 100). The optimal thresholds for feature reduction were chosen as those that resulted in the highest validation accuracy of the average of 100 random seeds.

Feature selection and model creation

The remaining features from the reduction process combined with the clinical features were the input for feature selection process. A logistic regression model was used for the prediction task. The selection of important predictors was performed in the training set using the Least Absolute Shrinkage Selection Operator Regression (LASSO) [38] with 3-fold cross-validation. The selected model was that of minimum cross-validation error plus one standard deviation. To avoid overfitting, the maximum number of the selected features was restricted to 10. These features were then used to establish logistic regression models to predict SLN metastasis. The optimal threshold of the receiver operating characteristic analysis was determined by maximizing the Youden index (YI) in the training set, where the YI is defined as sensitivity + specificity - 1. This threshold was applied to the independent validation and testing datasets. Predictive performance measures tabulated included area under the curve (AUC), sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and accuracy. To avoid the model optimization becoming stuck in a local minimum, the LASSO procedure was repeated 100 times with different seeds. The cross-validation results across all folds were averaged; the model that achieved the highest accuracy in the training set was selected as the prediction model. Additionally, the training set was shuffled each iteration to randomize the cross-validation within the training set, while the independent validation and testing set remained the same.

Results

Model incorporating peritumoral region

The primary analysis for this study was the model incorporating intratumoral plus peritumoral (4 mm) features, given that it has been shown to outperform intratumoral features alone in a previous publication [13].

For CR model, a total of 157 features (146 radiomic and 11 clinical) were included in the 3-fold cross-validation LASSO feature selection process. The optimal feature reduction parameters were a ρ-value threshold of 0.95, a p-value threshold of 0.05, and no PCA. We noted that PCA was also performed for the CR feature reduction pipeline but did not improve the predictive performance of the model. Eight features, including 1 clinical, 2 shape, and 5 texture features, were optimized for this model (Table 1).

Table 1 Features chosen by CR pipeline. Features from intratumoral region are denoted by (I) and features from 4 mm peritumoral region are denoted by (P). There was only one clinical feature selected, and the remainder were radiomic features (5 from intratumoral region, 2 from peritumoral region)

For DLB model, the optimal feature reduction parameters were a ρ-value threshold of 0.85, a p-value threshold of 0.001, and a PCA value of 80. After feature reduction, there were 91 features (80 DLB and 11 clinical) inputted into the feature selection process. For this model, 5 features were finally selected, including 2 clinical (tumor grade and lymphovascular invasion) and 3 DLB features.

Predictive performance metrics are shown in Table 2 and Fig. 5 for the CR and DLB pipelines. In the validation set (i.e., the group with the same resolution as the training set), the DLB model outperformed the CR model [accuracy (CR: 80%, DLB: 83%), YI (CR: 0.56, DLB: 0.67), NPV (CR: 86%, DLB: 91%)]. Furthermore, in the independent testing set of dissimilar resolution meant to evaluate the generalizability of the model to dissimilar condition, the DLB model outperformed the CR model in all metrics [accuracy (CR: 71%, DLB: 77%), YI (CR: 0.37, DLB: 0.45), NPV (CR: 78%, DLB: 80%)]. It is noted that we included the performance of the training set for the completeness of the paper; however, it should not be used for comparison due to overfitting concerns.

Table 2 Predictive performance results for intratumoral plus 4 mm peritumoral region for CR and DLB models. Values shown are from the random seed with highest training set accuracy. The DLB pipeline slightly outperformed CR in the validation set of the same resolution as the training set. A larger improvement is seen in the testing set of dissimilar resolution. This indicates the DLB pipeline might be more generalizable and less sensitive to pixel size differences
Fig. 5
figure 5

Predictive performance of feature models including intratumoral and 4 mm peritumoral region with CR and DLB pipeline. The DLB pipeline slightly outperformed CR in the validation set. A larger improvement is seen in the testing set of dissimilar resolution. This indicates the DLB pipeline might be more generalizable and insensitive to pixel size differences. Note that the performance of the training set is not used for comparison due to overfitting concerns. AUC: area under the curve; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; Acc: accuracy; YI: Youden index

Model excluding peritumoral region

As a secondary analysis, models created utilizing clinical features and intratumoral features alone were analyzed.

For CR model, a total of 104 features (93 radiomic and 11 clinical) were fed into the 3-fold cross-validation LASSO feature selection process. The optimal feature reduction parameters were a ρ-value threshold of 0.95, a p-value threshold of 0.05, and no PCA. In logistic model creation, a maximum of 10 features were included. As discussed in Methods, the random seed with the highest training set accuracy was reported. Finally, there were 10 features chosen, including 2 clinical features, 2 shape features, and 6 texture features (Table 3).

Table 3 Features chosen by CR model for intratumoral features only. This model included 2 clinical features and 8 radiomic features

For DLB model, the optimal feature reduction parameters were a ρ-value threshold of 0.75, a p-value threshold of 0.005, and no PCA. After feature reduction, there were 48 features (37 DLB features, 11 clinical) inputted into the 3-fold cross-validation LASSO feature selection process. For this model, 9 features, including 2 clinical features (Tumor grade and Lymphovascular Invasion) and 7 DLB features, were selected.

Predictive performance metrics are shown in Table 4 and Fig. 6 for the CR and DLB models. In the validation set, the DLB pipeline performed similarly compared to the CR pipeline [accuracy (CR: 80%, DLB: 81%), YI (CR: 0.56, DLB: 0.58), NPV (CR: 86%, DLB: 86%)]. Furthermore, in the testing set of dissimilar resolution, a similar trend is seen compared, where DLB features outperformed CR features in some metrics [accuracy (CR: 71%, DLB: 74%), YI (CR: 0.25, DLB: 0.53), NPV (CR: 72%, DLB: 89%)]. Similar to above, the performance of the training set should not be used for comparison due to overfitting concerns.

Table 4 Predictive performance results for intratumoral region for CR and DLB models. Values shown are from the random seed with highest training set accuracy. The DLB model performed similarly to CR in the validation set. In the testing set, DLB model outperformed CR model in numerous metrics, including NPV, accuracy, and YI. This indicates the DLB model might be more generalizable and less sensitive to pixel size differences
Fig. 6
figure 6

Predictive performance of features including intratumoral region with CR and DLB models. The DLB model performed similarly to CR in the validation set. In the testing set, the DLB model outperformed CR in numerous metrics, including NPV, accuracy and YI. This indicates the DLB model might be more generalizable and less sensitive to pixel size differences. Note that the performance of the training set is not used for comparison due to overfitting concerns. AUC: area under the curve; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; Acc: accuracy; YI: Youden index

Discussion and conclusions

The results of our study showed that the predictive performance of the DLB model outperformed the CR model in several metrics. More interestingly, these improvements were seen particularly in the independent testing set with dissimilar resolution. This could indicate that DLB features are less sensitive to varying conditions (e.g., pixel size changes) and ultimately result in a more generalizable model.

SLN status prediction has been explored using nomograms; examples include those developed by Memorial Sloan Kettering Cancer Center and MD Anderson, which include age, tumor characteristics (size, grade, type, focality, location), lymphovascular invasion and hormone receptors to predict the likelihood of a diseased, positive SLN. These nomograms have been shown to have a moderate predictive performance [11, 12]. Imaging studies have investigated the use of non-invasive quantitative imaging radiomic biomarkers along with clinical data for prediction of SLN status, with promising results (AUC > 0.8) [9, 13,14,15,16,17,18,19]. Specifically, our previously published work has validated a radiomic pipeline of peritumoral region in combination with intratumoral region for the prediction of SLN metastasis [13]. The peritumoral region is of interest because tissue surrounding the tumor may contain valuable information such as angiogenic-lymphangiogenic factors and tumor infiltrating lymphocytes, which have been shown to be related to treatment response [39].

Predictive models based on CR features can be disadvantageous because radiomic features have been shown to be sensitive to changing parameters, such as pixel size alteration [20]. Numerous studies have shown DLB models to outperform CR pipelines. Using MRI, CNN performance has been compared and shown to outperform radiomics for the purposes of breast lesion classification [23] and gene mutations in low grade gliomas [24]. Additionally, the combination of the CR and DLB features was superior in the survival and classification prediction of high-grade gliomas [25, 26]. Our study took one step further, creating an independent testing set of dissimilar resolution, in attempt to identify a particular condition in which DLB features may outperform CR features. Note that the goal of this work is to compare DLB and CR features. Also, given that the radiomic model has shown to outperform the model using clinical characteristics alone in previous study [13], the models presented here are not compared with the clinical only model.

Deep learning can utilize CNNs to extract features by applying convolution layers composed of small-sized fields or kernels, followed by pooling layers to reduce the size of the resultant feature space. Optimizing the kernel to be applied to the image to extract meaningful features is the purpose of training the network. Consequently, the performance of a CNN is dependent on the amount of data it has available to train on. This results in the CNN being more ‘data hungry’ and typically needing a very large sample size, on the order of millions of images, to have optimal performance [40]. Even if collaborating with multiple institutions, acquiring millions of medical images is frequently not feasible, especially in the instances of rare diseases. One way to address this limitation is the utilization of transfer learning. Transfer learning involves the implementation of a pre-trained network that is further fine-tuned with samples specific to a desired task. Transfer learning has been explored for prediction of lymph node metastasis in patients with cervical cancer using MRI (AUC: 0.9) and breast cancer using CT (AUC: 0.8) with promising results [41, 42]. In this work, we used a pretrained VGG model that was originally optimized to classify 1000 types of objects. Thus, the features extracted from the VGG model are expected to be robust as the features were used to differentiate similar shaped objects such as tow trucks and cars, and animals such as ladybugs and crabs. Transfer learning models that further train VGG with new medical imaging data can further manipulate the way features are extracted; whereas our proposed method doesn’t rely on refining the original VGG network and directly utilizes the robust features that have been trained in its original task. Our proposed model uses the last fully connected layer in VGG prior to classification, allowing our DCE images to fully propagate throughout the network for feature extraction. This approach is more readily available compared to transferred learning approach and can be directly incorporated into more traditional radiomics pipeline. Particularly that for studies with relatively small sample size, it is expected to be less prone to overfitting.

Future directions of this study include fine-tuning of the network directly for classification instead of feature extraction [40]. Additionally, incorporation of additional features such as CoLlAGe and wavelet features could be performed for the radiomics pipeline; specifically, it has been shown that these directional-based features have correlated with tumor microenvironment as seen on pathology, such as orientation and densely packed tumor infiltrating lymphocytes [39, 43]. Moreover, other evaluations using different parameters other than in-plane resolution could be performed to further evaluate the generalizability. In general, there also exists the possibility to look specifically at the lymph nodes. This study analyzed the in-breast tumor to predict metastasis to the lymph node. To date, efforts to analyze axillary lymph nodes on MRI has been limited by small axillary lymph node sizes, breast coil sensitivity in regions in the axilla, and exclusion of a majority of the axilla from field of view. Although nodal morphological features on MRI are predictive of malignancy [2, 44], predictive value of contrast enhancement and morphological criteria, such as size and shape of lymph nodes, was found to be controversial [45, 46]. Furthermore, application of this model to more advanced stage breast cancer (e.g., patients undergoing neoadjuvant chemotherapy) or other cancer types (e.g., head and neck cancer) is another potential avenue of study.