1 Introduction

Breast cancer is the most frequent cancer among women in the UK. While most breast cancer diagnoses occur in women over 50, younger women can also develop the disease. About 1 in 8 women are diagnosed with breast cancer during their lifetime, but there is a good chance of recovery if it is detected early [34]. Therefore, many researchers focus on detecting breast cancer at an early stage with computer-aided (CAD) systems. CAD systems can be subdivided into computer-aided detection (CADe) systems and computer-aided diagnosis (CADx) systems. CADe systems are mainly used to extract regions of interest (ROIs) from medical images for further analysis. CADx systems then extract features from the obtained ROIs and predict severity based on these features. There are several challenges to the early detection of breast cancer by CAD systems. Firstly, compared to natural images, mammograms usually have higher resolutions and larger sizes, which is demanding for both the hardware and the diagnostic algorithms. Secondly, the anatomical structures of the organs and tissues in mammograms are more difficult to recognize and detect than objects in natural images. Traditional CAD systems cannot easily capture the features of these anatomical structures.

In recent years, with the development of deep learning, deep learning-based CAD systems have achieved great results in addressing the existing challenges of CAD systems. However, deep learning-based CAD systems also have some limitations for breast mass classification tasks. Firstly, the performance of a deep learning model relies heavily on how well it differentiates malignant masses from benign masses. The main differences between benign and malignant masses are the shape and margins of the masses [1]. The shape of a mass can be irregular, round, or lobular; benign masses are more likely to have circumscribed oval and round shapes, while malignant masses tend to have irregular shapes. The margins of the masses can also be subdivided into categories, including microlobulated, obscured, ill-defined, and spiculated. Microlobulated margins describe a scalloped appearance of the mass that is distinct from the breast tissue. Obscured margins are margins that are partially hidden by adjacent tissue, so the mass may fail to be differentiated from the breast tissue. Ill-defined margins are margins that are indistinct from the breast tissue, which can be caused by low image contrast and high breast density. Spiculated margins appear as lines radiating from the mass and are usually seen in malignant breast masses. However, malignant masses can, with low probability, also have circumscribed margins, in which case follow-up examinations are required to distinguish them. The second limitation is that deep learning models need substantial computational power, and further improvements to well-trained models are hard to obtain even when powerful computational resources are available.

Given the challenges and limitations mentioned above, it is still of great value to design a deep learning-based CADx system for breast mass classification, as such a system can advise radiologists in a relatively short time and facilitate the diagnosis procedure instead of forcing patients to go through painful tissue extraction for biopsy. Aiming to develop a breast mass classification system with promising performance, we propose a novel deep feature-based model called DF-dRVFL. The main contributions of this study can be summarized as follows:

  1.

    We developed a high-performance CADx system for breast mass classification based on DF-dRVFL for mammography images. The developed system works on ROIs extracted by a previously developed breast mass detection system. It consists of three components: model training, feature extraction, and feature classification. In model training, we transferred state-of-the-art deep models that were pre-trained on ImageNet instead of training the models from scratch. We first removed the top layers of the deep models, as they were initially trained for 1,000-class classification, and added new fully connected layers for the classification task here. We also added dropout layers to prevent the trained models from overfitting. After training, we extracted the features from the trained models as the input of our classifier DF-dRVFL. The experimental results on the public DDSM dataset showed the effectiveness of the developed system and therefore validated the plausibility of combining breast mass feature extraction with classification.

  2.

    We developed a novel strategy for breast mass classification by introducing a novel hybrid deep learning-based model. In this study, we proposed VGG19-DF, which deploys a trained deep learning model as a feature extractor instead of relying on hand-crafted features. As mentioned before, the differences between benign masses and malignant masses mainly lie in the shapes and the margins. However, hand-crafted features describing these properties are not reliable, and they can mislead the diagnosis results. Instead, we proposed to use deep features that are extracted by a trained deep learning model for mass classification. Compared to hand-crafted features, deep features are more representative and robust, and the models trained with deep features are likely to show higher performance.

  3.

    We found an efficient method for improving breast mass classification performance at a low computational cost. If well fine-tuned, the performance of deep learning models can be improved step by step. However, the process of optimizing the hyper-parameter settings can be lengthy. Therefore, there is an unmet need to improve the performance of the classifiers at a minimal cost. To this end, we proposed to introduce DF-dRVFL for fast performance improvement. Experiments on the public dataset DDSM showed that the novel classifiers could be trained on a dataset of more than 2,000 samples within only a few seconds. More importantly, the performance of the classifiers was also improved. Considering that breast mass classification is only a special case of classification, we believe the proposed method for efficient performance improvement can also be extended to other scenarios.

The remainder of this paper is arranged as follows. In Section 2, we briefly revisit the related works. We then present the details of the developed system in Section 3. As mentioned before, we utilized deep learning models as feature extractors in DF-dRVFL; however, some adjustments have to be made to adapt the models to the classification task. The classification task is implemented by a novel classifier called deep random vector functional link network (dRVFL). For comparison, we also take other machine learning models, including ELM, RVFLN, and spiking neural network (SNN), as the classifiers. We present the details of these machine learning models in Section 3 and examine their overall classification performance in Section 4. At the beginning of the experiment section, we introduce the details of the dataset used in this paper. We then move to parameter settings, followed by the ablation for model refinement. The experiment results are shown in the last part of that section. We then discuss some related issues in Section 5 and end this paper with the conclusion and future work in Section 6. The abbreviations used in this paper are listed in Table 1.

Table 1 Symbols and meaning

2 Related works

Breast masses can generally be classified into two categories, benign and malignant. The main difference between the two is that a malignant mass is cancerous and may lead to death if no timely treatment is applied, while a benign mass is milder and non-cancerous. Mammography is a useful modality for diagnosing breast masses. It helps doctors diagnose breast masses at an early stage and treat patients in time to avoid death [31, 54]. Before CAD systems were proposed, only well-trained radiologists and doctors could manually classify breast masses. Traditional CAD systems include multiple steps such as pre-processing, segmentation, feature extraction, feature selection, and final classification [19]. Compared with deep learning-based CAD systems, traditional CAD systems are narrow and brittle [36]. In contrast, deep learning-based methods integrate multiple modules into deep convolutional networks with higher performance. Considering the advantages of deep learning-based methods, we mainly review the related works implemented via deep learning. Moreover, there are also some works on discriminative feature learning that are highly related to our work [4, 48].

Since deep learning has achieved great success in many domains, more researchers are trying to integrate deep learning methods into CAD systems. For instance, Dhahri et al. [5] used Tabu search to extract the feature maps and then applied the K-Nearest Neighbors algorithm to classify the breast lesions. The proposed method was evaluated on the BIDMC-MGH dataset with an AUC of 95% and on the WDBC dataset with an accuracy of 98.24%. Li et al. [26] proposed a DenseNet-II model, based on DenseNet with modifications, to classify benign and malignant mammograms. It was tested on a dataset from the First Hospital of Shanxi Medical University and achieved an average accuracy of 94.55%. Zhang et al. [55] designed a novel model based on feature fusion for mass classification. This model was evaluated on the CBIS-DDSM dataset and achieved a receiver operating curve value of 0.97, an accuracy of 94.30%, and a specificity of 97.19%. Another work using a GAN-based method was proposed by Muramatsu [32]. In this work, additional synthetic data from a cycle GAN were used, and the method was evaluated on the DDSM dataset with an accuracy of 81.4%. Khan et al. [20] implemented multi-view feature fusion to further improve the performance of a CNN in classifying malignant and benign masses. The proposed method reached an AUC of 0.84 on the CBIS-DDSM dataset.

There are some works based on transfer learning. In [25], the authors proposed to apply transfer learning based on AlexNet for breast mass classification. To validate the contribution of mass context to the overall performance of the deep learning models, the authors extracted breast mass patches that were marginally larger than the breast mass and patches that were twice the size of the mass bounding box. Different data augmentation methods, including rotation, cropping, and flipping, were applied to enlarge the training set by a factor of 25, which can alleviate data scarcity. The dataset comprised 1820 images from DDSM, of which 80% were assigned to the training set, and the remaining 20% were randomly and evenly split into the validation set and the testing set. The model based on GoogLeNet, however, showed the best performance, with an accuracy of 0.929. Similar conclusions were drawn by the authors of [17]. In another AlexNet-based work [41], the authors proposed to transfer AlexNet for feature extraction while taking a support vector machine (SVM) as the classifier. The experiments on the DDSM dataset showed that the accuracy of the transferred AlexNet was only 71.01%, while the accuracy increased to 79% when the SVM was fed with the features extracted by the transferred AlexNet. Another work [23] compared the performance of multiple models, including AlexNet, VGG16, ResNet50, InceptionV3, and DenseNet121, on DDSM. It was concluded that VGG16 performed best with an area under the curve (AUC) of 0.82 when no data augmentation was applied. DenseNet121, however, turned out to be the best-performing model with an AUC of 0.91 when data augmentation was applied. Ensemble learning is another widely applied technique that aims at improving classification performance by combining the classification results from multiple models. Compared to single model-based methods, ensemble learning-based methods enjoy higher stability and robustness. In [42], the authors proposed an ensemble of AlexNet-based models for better classification performance. The three models with the best results were selected, and the final decision was based on their average probability. Experiments showed that the performance of individual models ranged from 75% to 77%, while the combined result was over 80%.

While some single-stage methods rely heavily on human intervention, multiple-stage methods that require less human intervention are in greater demand. Compared to single-stage methods, multiple-stage methods aim at extracting useful features in the early stages that may benefit the classification in the later stage. In [19], the authors developed a novel breast mass classification method based on weighted association rule mining (WARM). Initially, mammograms were pre-processed for contrast enhancement, and the pectoral muscle was removed. Each processed mammogram was divided into non-overlapping blocks, within which the sum average feature was calculated. The block with the highest sum average feature was selected as the seed of region growing for breast mass segmentation, so that the breast mass patch could then be extracted. The latent rules between the extracted patches and the target classes were then evaluated. A patch was classified as benign if the rules were more inclined to benign, and vice versa. The experiment on the Mammographic Image Analysis Society (mini-MIAS) dataset showed that the proposed method achieved 95.15% accuracy on the testing set, which outperformed other methods significantly. Another work that integrated breast mass segmentation and classification can be found in [40]. In this work, a novel breast mass segmentation method called adaptive fuzzy C-means clustering was introduced. Two feature extraction methods, gray-level run-length matrix (GLRM) and gray-level co-occurrence matrix (GLCM), were deployed to extract features from the extracted patches. Finally, the extracted features were classified by a recurrent neural network (RNN) that was optimized by a novel optimization algorithm called Average Fitness New Updating-based GrassHopper. The proposed method was evaluated on mini-MIAS. In terms of segmentation accuracy, sensitivity, \(F1_{score}\), and Matthews Correlation Coefficient (MCC), the proposed method turned out to be the best compared to other state-of-the-art methods. For the classification task, the authors compared the proposed method with traditional machine learning techniques, such as decision tree and SVM, as well as a deep convolutional neural network (DCNN) model. The experiment results showed that the proposed method outperformed the other methods on almost all evaluation metrics.

Table 2 Related works

We summarize these works in Table 2. As can be seen, some works were evaluated on small testing sets, where only tens or thousands of images were used. Another issue is the performance of existing methods. Although different deep learning models may perform differently on the classification task and the best model can be chosen for it, improving performance at a minimal cost remains a problem. Therefore, we developed a novel hybrid framework, DF-dRVFL, for breast mass classification in this work.

3 Methodology

In this work, the developed DF-dRVFL mainly consists of two modules: a deep feature extractor, VGG19-DF, and a classifier, dRVFL. Initially, we used several deep learning models, namely VGG19, ResNet101, InceptionV3, DenseNet201, and IncepresNetv2 [11, 14, 47, 50, 51], as transfer learning backbones to classify the breast mass images. After comparing the performance of these transfer learning-based models, we chose VGG19, which performed best, as the backbone of our new feature extractor, VGG19-DF. We then designed a novel classifier called dRVFL that uses the features generated by VGG19-DF to achieve better performance than the original classifier. To evaluate the proposed classifier, we also tested other classifiers, including ELM, RVFLN, and SNN, on the same dataset for comparison. In the remainder of this section, we describe the details of VGG19-DF in Section 3.1 and the process of choosing the best classifier in Section 3.2.

3.1 VGG19-DF: deep feature extractor

To construct the deep feature extractors for the following classification task, we deployed transfer learning for efficient model acquisition. We first adjusted the architectures of the pre-trained deep learning models by removing the top layers that were initially designed for the 1000-class classification task. On top of the fully connected layer for 1000-class classification, we added several stacked layers, including two dropout layers and three fully connected layers. If we denote the dropout layer as DropL and the fully connected layer with X neurons as FCX (for example, FC24 means a fully connected layer with 24 neurons, and FC64 is a fully connected layer with 64 neurons), the architecture of the adjusted top layers can be represented as FC1000-DropL-FC64-DropL-FC24-FC2-Prob, where Prob stands for the classification layer that takes softmax as the activation function. Note that a large dropout probability may slow the convergence. Therefore, we set the dropout probability to 0.3 for faster convergence. The architecture of the deep feature extractor based on VGG19 can be seen in Fig. 1. The adjusted deep models are then trained on the training set for feature learning, after which deep features can be extracted from the trained models. Note that deep features can be obtained from different depths of the deep learning models. Empirically, we extract the features from the fully connected layers, as shown in Fig. 1c, and more details will be presented in the experiment section.
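The adjusted top layers described above can be sketched as a plain forward pass. The snippet below is a minimal NumPy illustration, not the trained model: it assumes ReLU activations between the fully connected layers, inverted dropout, and randomly initialized weights, none of which are specified in the text, and all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths following FC1000-DropL-FC64-DropL-FC24-FC2-Prob.
SIZES = [1000, 64, 24, 2]
DROP_P = 0.3  # dropout probability stated in the text

# Hypothetical random weights; in practice these are learned during fine-tuning.
weights = [rng.normal(0.0, 0.05, (m, n)) for m, n in zip(SIZES[:-1], SIZES[1:])]
biases = [np.zeros(n) for n in SIZES[1:]]


def dropout(x, p, training):
    """Inverted dropout: zero units with probability p and rescale the rest."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def head_forward(fc1000_out, training=False):
    """Return (FC64 features, FC24 features, class probabilities)."""
    x = dropout(fc1000_out, DROP_P, training)
    fea64 = np.maximum(x @ weights[0] + biases[0], 0.0)  # FC64 (ReLU assumed)
    x = dropout(fea64, DROP_P, training)
    fea24 = np.maximum(x @ weights[1] + biases[1], 0.0)  # FC24 (ReLU assumed)
    logits = fea24 @ weights[2] + biases[2]              # FC2
    return fea64, fea24, softmax(logits)                 # Prob


batch = rng.normal(size=(4, 1000))  # stand-in for FC1000 outputs of 4 ROIs
fea64, fea24, prob = head_forward(batch)
print(fea64.shape, fea24.shape, prob.shape)
```

Returning the intermediate FC64 and FC24 activations alongside the probabilities mirrors the idea that deep features can be read out at different depths of the adjusted head.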

Fig. 1 The architecture of the feature extractor before and after adjustment. The dashed lines in Fig. 1c indicate the possible paths for obtaining deep features

3.2 Design of classifiers

The second component of DF-dRVFL is a classifier called dRVFL. The choice of dRVFL is based on an exploration of the performance of multiple classifiers on the breast mass classification task. Instead of deploying traditional feature classifiers such as decision tree (DT) and SVM, we explored four classifiers that differ in their architectures and underlying algorithms: ELM, RVFLN, dRVFL, and SNN. By comparing the performance of these classifiers, we found that the dRVFL-based model achieved the best performance. Therefore, we named the developed model DF-dRVFL.

3.2.1 ELM classifier

The ELM algorithm was proposed to solve the slow training problem of traditional feed-forward neural networks (FFNN). The slow training can be traced back to the iterative training required by gradient-based learning algorithms. Instead of training the networks iteratively, ELM randomly chooses the nodes in the hidden layer of a single hidden layer feed-forward neural network (SHFN) and then determines the output weights analytically. By doing so, the training time is significantly reduced while good generalization performance is retained, although the architecture of the neural network remains unchanged.

Consider a series of observed samples X with desired outputs Y, denoted as \(\left( {\textbf {x}}_{i},{\textbf {y}}_{i}\right) \), where \({\textbf {x}}_{i}=\left[ x_{i1},\cdots ,x_{iu}\right] ^{T} \in \mathbb {R}^{u}\) and \({\textbf {y}}_{i}=\left[ y_{i1},\cdots ,y_{iv}\right] ^{T} \in \mathbb {R}^{v}\). If we denote the number of observations as N, the activation function as s(x), and the number of hidden nodes as h, then the output \({\textbf {O}}\) can be modeled by

$$\begin{aligned} {\textbf {o}}_{j}=\sum _{i=1}^{h}\alpha _{i}s({\textbf {w}}_{i}\cdot {\textbf {x}}_{j}+th_{i}) \end{aligned}$$
(1)

where \({\textbf {o}}_{j}\) is the network output for the jth sample, \(j=1,\cdots ,N\), \({\textbf {w}}_{i}=\left[ w_{i1},\cdots ,w_{iu}\right] ^{T}\) is the weight vector between the input nodes and the ith hidden node, \(\alpha _{i}=\left[ \alpha _{i1},\cdots ,\alpha _{iv}\right] ^{T}\) is the weight vector between the ith hidden node and the output nodes, and \(th_{i}\) stands for the threshold value of the ith hidden node. The operation \(\cdot \) means the inner product of \({\textbf {w}}_{i}\) and \({\textbf {x}}_{j}\). Ideally, the SHFN can approximate the expected output of these N samples with zero error, that is,

$$\begin{aligned} {\textbf {y}}_{j}=\sum _{i=1}^{h}\alpha _{i}s({\textbf {w}}_{i}\cdot {\textbf {x}}_{j}+th_{i}) \end{aligned}$$
(2)

Equation (2) can also be written as \({\textbf {Y}}={\textbf {B}}\alpha \), where \({\textbf {B}}\) is called the hidden layer output matrix. The corresponding cost function can be expressed as

$$\begin{aligned} E=\sum _{j=1}^{N}\left( \sum _{i=1}^{h}\alpha _{i}s({\textbf {w}}_{i}\cdot {\textbf {x}}_{j}+th_{i})-{\textbf {y}}_{j}\right) ^2 \end{aligned}$$
(3)
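For reference, the matrix form \({\textbf {Y}}={\textbf {B}}\alpha \) of (2) can be written out explicitly. The layout below is a sketch in the symbols defined above, assuming the convention common in the ELM literature in which rows index the N samples and columns index the h hidden nodes:

$$\begin{aligned} {\textbf {B}}=\begin{bmatrix} s({\textbf {w}}_{1}\cdot {\textbf {x}}_{1}+th_{1}) & \cdots & s({\textbf {w}}_{h}\cdot {\textbf {x}}_{1}+th_{h}) \\ \vdots & \ddots & \vdots \\ s({\textbf {w}}_{1}\cdot {\textbf {x}}_{N}+th_{1}) & \cdots & s({\textbf {w}}_{h}\cdot {\textbf {x}}_{N}+th_{h}) \end{bmatrix}_{N\times h},\quad \alpha =\begin{bmatrix} \alpha _{1}^{T} \\ \vdots \\ \alpha _{h}^{T} \end{bmatrix}_{h\times v},\quad {\textbf {Y}}=\begin{bmatrix} {\textbf {y}}_{1}^{T} \\ \vdots \\ {\textbf {y}}_{N}^{T} \end{bmatrix}_{N\times v} \end{aligned}$$

Under this convention, the jth row of \({\textbf {B}}\alpha \) reproduces the sum on the right-hand side of (2) for sample j.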

To minimize the error with gradient-based algorithms, the parameters, including \({\textbf {W}}\) (the vector form of \({\textbf {w}}_{i}\)), \(\varvec{\alpha }\), and the thresholds \({\textbf {b}}\), are iteratively updated. The iterative updating of \({\textbf {W}}\) can be denoted as:

$$\begin{aligned} {\textbf {W}}_{n}={\textbf {W}}_{n-1}-\beta \frac{\partial E({\textbf {W}})}{\partial {\textbf {W}}} \end{aligned}$$
(4)

where \(\beta \) is the learning rate. Back propagation is usually used as the learning algorithm so that errors can be propagated backward for parameter optimization. However, back propagation has some issues. The first one is the choice of \(\beta \). If it is too small, the learning algorithm takes much longer to converge, whereas a large \(\beta \) may lead to instability or even divergence. Local minima and the time-consuming gradient-based learning are other perplexing issues. The difference between ELM and back propagation mainly lies in the method for parameter updating. The main procedures of ELM can be seen in Algorithm 1. As can be seen, the algorithm involves only several matrix multiplication operations, so the training time is much shorter than that of gradient-based algorithms. The architecture of ELM is shown in Fig. 2. For ELM, the number of neurons in the hidden layer is the key factor that determines the overall performance.
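The sensitivity of the update rule in (4) to the learning rate \(\beta \) can be illustrated on a toy problem. The sketch below assumes a one-dimensional cost \(E(w)=w^{2}\), chosen purely for illustration and unrelated to the actual network; the function name and the three values of \(\beta \) are hypothetical.

```python
def gradient_descent(beta, w0=1.0, steps=50):
    """Minimize E(w) = w**2 with the update w <- w - beta * dE/dw."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # dE/dw for the toy cost
        w = w - beta * grad
    return w


# A small rate converges slowly, a moderate rate converges quickly,
# and a rate that is too large makes the iterates grow without bound.
for beta in (0.01, 0.4, 1.1):
    w_final = gradient_descent(beta)
    print(f"beta={beta}: |w| after 50 steps = {abs(w_final):.3e}")
```

For this cost, each step multiplies w by \(1-2\beta \), so the iterates shrink only when \(|1-2\beta |<1\), which is why the largest rate diverges.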

Algorithm 1 Parameter updating of ELM.
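As a rough illustration of the ELM training procedure, the sketch below assumes that Algorithm 1 follows the standard ELM recipe: hidden weights and thresholds are drawn at random and kept fixed, and the output weights are obtained with the Moore-Penrose pseudoinverse. The sigmoid activation, the uniform sampling range, the toy data, and all names are assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit(X, Y, h, rng):
    """Train an ELM: random hidden layer, analytic output weights."""
    u = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(u, h))  # random input weights, never updated
    th = rng.uniform(-1.0, 1.0, size=h)      # random thresholds, never updated
    B = sigmoid(X @ W + th)                  # hidden layer output matrix (N x h)
    alpha = np.linalg.pinv(B) @ Y            # least-squares output weights
    return W, th, alpha


def elm_predict(X, W, th, alpha):
    return sigmoid(X @ W + th) @ alpha


rng = np.random.default_rng(0)
# Toy two-class data standing in for 24-dimensional deep features (hypothetical).
X = np.vstack([rng.normal(-1.0, 1.0, (100, 24)), rng.normal(1.0, 1.0, (100, 24))])
labels = np.repeat([0, 1], 100)
Y = np.eye(2)[labels]                        # one-hot targets

W, th, alpha = elm_fit(X, Y, h=50, rng=rng)
pred = elm_predict(X, W, th, alpha).argmax(axis=1)
train_acc = (pred == labels).mean()
print(f"training accuracy: {train_acc:.3f}")
```

Training reduces to one matrix product and one pseudoinverse, with no iterative updates, which is the source of the speed advantage over gradient-based learning.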

Fig. 2 The architecture of extreme learning machine

3.2.2 RVFLN classifier

RVFLN, first proposed in [39], is another training-efficient classifier. It showed that the weights connecting the input layer with the hidden layer can be randomly generated and then kept fixed during the training stage. Also, on a bounded finite-dimensional set, RVFLN has been proven to be a universal approximator for continuous functions with a closed-form solution [15]. The architecture of RVFLN can be seen in Fig. 3. The neurons between the input layer and the output layer are called enhancement nodes.

Fig. 3 The architecture of random vector functional link net

As can be seen, the output layer receives both the original input \({\textbf {X}}\) from the input layer and the transformed features \({\textbf {H}}\) from the hidden layer. Like ELM, the weights connecting the input layer with the hidden layer are randomly generated and fixed during the training stage. Only the output weights \(\alpha _{s}\) need to be computed, so the optimization problem can be represented as:

$$\begin{aligned} Obj=\mathop {min}\limits _{\alpha _{s}}||B\alpha _{s}-Y||^{2}+\eta ||\alpha _{s}||^{2} \end{aligned}$$
(6)

where \({\textbf {B}}=\left[ {\textbf {H}}\ {\textbf {X}}\right] \) is the concatenation of the features from the hidden layer and the input layer, and \(\eta \) is the regularization parameter. Equation (6) can be solved via either the Moore-Penrose pseudoinverse (when \(\eta =0\)) or ridge regression (when \(\eta \ne 0\)). If the Moore-Penrose pseudoinverse is used, the solution will be (5) in Algorithm 1. When ridge regression is used, the closed-form solution will be \(\alpha _{s}=({\textbf {B}}^{T}{\textbf {B}}+\eta {\textbf {I}})^{-1}{\textbf {B}}^{T}{\textbf {Y}}\) in the primal space or \(\alpha _{s}={\textbf {B}}^{T}({\textbf {B}}{\textbf {B}}^{T}+\eta {\textbf {I}})^{-1}{\textbf {Y}}\) in the dual space. The main computational cost comes from the matrix inversion, which can be reduced by choosing the primal or the dual solution according to whether the feature dimension or the number of samples is smaller.
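The ridge-regression solution of (6) can be checked numerically. The sketch below assumes the standard ridge identity, in which the primal form inverts a matrix whose size is the feature dimension of \({\textbf {B}}\) and the dual form inverts the N-by-N matrix \({\textbf {B}}{\textbf {B}}^{T}+\eta {\textbf {I}}\). The tanh activation, the random data, and the problem sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, u, h, v = 40, 8, 16, 2   # samples, input dim, enhancement nodes, outputs
eta = 0.1                   # regularization parameter

X = rng.normal(size=(N, u))
Y = rng.normal(size=(N, v))

W = rng.uniform(-1.0, 1.0, size=(u, h))  # random hidden weights, kept fixed
H = np.tanh(X @ W)                       # enhancement-node features (tanh assumed)
B = np.hstack([H, X])                    # B = [H X], including the direct link

d = B.shape[1]
# Primal form: solve a (d x d) system, cheaper when d < N.
alpha_primal = np.linalg.solve(B.T @ B + eta * np.eye(d), B.T @ Y)
# Dual form: solve an (N x N) system, cheaper when N < d.
alpha_dual = B.T @ np.linalg.solve(B @ B.T + eta * np.eye(N), Y)

print(np.allclose(alpha_primal, alpha_dual))
```

Both forms give the same output weights, so the choice between them only affects the size of the linear system that has to be solved.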

3.2.3 dRVFL classifier

Based on the original RVFLN, we employ an improved version of RVFLN, called dRVFL [46], as the classifier. In dRVFL, there are usually several stacked hidden layers between the input layer and the output layer. Like RVFLN, the input is concatenated with the outputs of the hidden layers before the concatenated features go into the output layer. The weights between the input layer and the first hidden layer, as well as the weights between hidden layers, are randomly generated. By doing so, the optimization process is much more efficient than back propagation-based optimization. The architecture of dRVFL can be seen in Fig. 4.

Fig. 4 The architecture of deep random vector functional link net

Given the input X, the activation function for each hidden layer as \(s(\cdot )\), and no bias term considered, the output of the first hidden layer can be denoted as:

$$\begin{aligned} {\textbf {H}}^{1}=s({\textbf {XW}}_{1}) \end{aligned}$$
(7)

Similarly, the output of the Lth hidden layer can be denoted as:

$$\begin{aligned} {\textbf {H}}^{L}=s({\textbf {H}}^{L-1}{} {\textbf {W}}_{L}) \end{aligned}$$
(8)

where \(\varvec{W}_{1}\) and \(\varvec{W}_{L}\) are the weight matrices between the input and first hidden layer and the \(L-1\)th hidden layer and the Lth hidden layer, respectively. Similar to parameters in ELM, these parameters are randomly generated and will not be updated during the training session. Then the input features fed to the output layer can be expressed as:

$$\begin{aligned} \varvec{B}=\left[ \varvec{H}^{1} \cdots \varvec{H}^{L} \varvec{X}\right] \end{aligned}$$
(9)

The output can then be obtained via

$$\begin{aligned} \varvec{Y}=\varvec{B}\varvec{\alpha }_{s} \end{aligned}$$
(10)

The output weights \(\varvec{\alpha }_{s}\) can then be calculated via the Moore-Penrose pseudoinverse, as shown in (5).
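Equations (7) to (10) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the tanh activation, the uniform weight range, the layer width, the toy data, and all names are assumptions, and the output weights are obtained with the pseudoinverse under the assumption that (5) is the usual least-squares solution.

```python
import numpy as np


def drvfl_features(X, weights, act=np.tanh):
    """Build B = [H^1 ... H^L X] from fixed random weight matrices."""
    feats, H = [], X
    for W in weights:
        H = act(H @ W)   # H^l = s(H^{l-1} W_l), no bias term
        feats.append(H)
    return np.hstack(feats + [X])


def drvfl_fit(X, Y, n_hidden, n_layers, rng):
    sizes = [X.shape[1]] + [n_hidden] * n_layers
    weights = [rng.uniform(-1.0, 1.0, size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]
    B = drvfl_features(X, weights)
    alpha = np.linalg.pinv(B) @ Y   # output weights via the pseudoinverse
    return weights, alpha


rng = np.random.default_rng(0)
# Toy two-class data standing in for 24-dimensional deep features (hypothetical).
X = np.vstack([rng.normal(-1.0, 1.0, (100, 24)), rng.normal(1.0, 1.0, (100, 24))])
labels = np.repeat([0, 1], 100)
Y = np.eye(2)[labels]

weights, alpha = drvfl_fit(X, Y, n_hidden=32, n_layers=3, rng=rng)
B = drvfl_features(X, weights)
pred = (B @ alpha).argmax(axis=1)
train_acc = (pred == labels).mean()
print(B.shape, f"training accuracy: {train_acc:.3f}")
```

With three hidden layers of 32 nodes and a 24-dimensional input, the concatenated feature matrix has 3 x 32 + 24 = 120 columns, and only the output weights are fitted.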

3.2.4 SNN classifier

SNNs were first proposed by Maass [29]. An SNN is a type of artificial neural network (ANN) that simulates the behavior of biological neurons. SNNs are characterized by their ability to model the dynamics of neural firing through spike-time coding, which is based on the timing of individual action potentials rather than their frequency. Due to these features, SNNs can integrate information from different aspects such as time, frequency, and phase [8, 13, 16, 18, 52]. There are some works on deploying SNNs in computer vision tasks. Masquelier and Thorpe [30] used SNNs to model the behavior of the visual cortex in response to naturalistic visual stimuli. Kheradpisheh et al. [21] designed an SCNN to extract edges, and Panda and Roy [38] suggested a convolutional auto-encoder learning method for SNNs. Moreover, an SNN-based object detection system, Spiking-YOLO, was developed that converges 2.3 to 4 times faster than previous SNN methods [22]. Another study demonstrated the effectiveness of SNNs for gesture recognition, achieving accuracy comparable to that of traditional neural networks while using significantly less energy [45]. Overall, SNNs offer a promising approach to energy-efficient, event-driven processing of spatiotemporal information in computer vision tasks.

In our framework, we design the SNN model by adjusting the model presented in [37]. Originally, that model was evaluated on the MNIST dataset by flattening the images into vectors. In our classification task, we instead feed the model with the features extracted by our trained deep feature extractors.

The overall flow chart for model training is shown in Fig. 5.

Fig. 5 The overall flowchart of the DF-dRVFL

4 Experiment

In this section, we present the experiments of this study. To begin with, we briefly introduce the dataset used in this study, followed by the settings of the experiment, which include the hardware configuration, parameter settings, and so forth. Subsequently, we compare the feature extractors by evaluating the classification performance of the deep CNN models. After determining the feature extractor for the following classifiers, we explore the optimal architectures of the classifiers via ablation experiments to obtain the best-performing model. Finally, we compare our method with traditional machine learning-based methods and state-of-the-art methods at the end of the section.

4.1 Dataset

The images used in this study for training and evaluation come from the DDSM dataset [12, 24]. Note that we simply aimed to evaluate the performance of the developed breast mass classification system here. Therefore, we used ROIs that were manually cropped from the full-field mammograms. We used five-fold cross-validation to train and evaluate the models on the mass-level dataset of the DDSM, which can be found at https://www.kaggle.com/datasets/skooch/ddsm-mammography. The detailed information of each randomly partitioned subset can be seen in Table 3. As can be seen, the numbers of benign masses and malignant masses are roughly the same.
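The random partitioning used for five-fold cross-validation can be sketched as follows. This is an illustrative split only: the sample count of 2,000, the seed, and the function name are hypothetical, and the sketch does not reproduce the actual subsets listed in Table 3.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)                   # random partition
    folds = [idx[k::n_folds] for k in range(n_folds)]  # near-equal folds
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


splits = list(five_fold_indices(2000))
for k, (train, val) in enumerate(splits):
    print(f"fold {k}: {len(train)} training, {len(val)} validation samples")
```

Each sample appears in exactly one validation fold, so averaging the metrics over the five runs uses every sample once for evaluation.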

Table 3 Patch-level dataset from DDSM for model training and validation

4.2 Settings of the experiment

In this section, we introduce the configurations of the experiment, including the hardware environment, the hyper-parameter settings for the feature extractor, and the architectural settings of the feature classifiers. For the hardware configuration, we used the SPECTRE High-Performance Computing Facility provided by the University of Leicester for model training and validation. The memory of the facility is 16 GB. For the deep learning models trained for the breast mass classification task, we used the parameters listed in Table 4. Another factor that affects the overall classification performance is the architecture of the classifiers. Both ELM and RVFLN have only one hidden layer. For dRVFL, we set the number of hidden layers to three by default because deeper networks are more likely to overfit. As for the number of neurons in the classifiers, we determine them via ablation experiments.

Table 4 The setting of hyper-parameters for deep models

4.3 Performance of feature extractors

In this study, we employed state-of-the-art deep learning models, including VGG19, ResNet101, InceptionV3, DenseNet201, and IncepresNetv2 [11, 14, 47, 50, 51]. The performance of these deep models is mainly determined by the number of learnable parameters and the architectures, where the architecture of a model can be characterized by the number of connections between its layers. We list the numbers of parameters and connections in Table 5, where \(Rate_{CL}\), the ratio of the number of connections to the number of layers, indicates the architectural complexity of the deep models. As can be seen, VGG19 is the most complicated model in terms of the number of learnable parameters. DenseNet201, however, is the most complicated if we consider \(Rate_{CL}\), while VGG19 becomes the simplest one.

Table 5 Architectural details of deep models

To evaluate the performance of the trained deep models, we used the same evaluation metrics, including sensitivity, specificity, accuracy, precision, \(F1_{score}\), and AUC. As mentioned before, we carried out five-fold cross-validation on DDSM and averaged the evaluation metrics over the five individually trained deep learning models. The results are shown in Table 6. The corresponding ROCs can be seen in Fig. 6.
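As a minimal sketch of how the threshold-based metrics above can be computed per fold and then averaged, the following assumes that malignant masses are the positive class and that each fold is summarized by its confusion-matrix counts. The function names and the counts are hypothetical and are not results from this study; AUC is omitted because it requires the continuous scores rather than the counts.

```python
from statistics import mean, stdev


def binary_metrics(tp, fp, tn, fn):
    """Compute the threshold-based metrics for one fold from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),           # harmonic mean of precision and sensitivity
    }


def summarize_folds(fold_counts):
    """Return {metric: (mean, sample standard deviation)} over all folds."""
    per_fold = [binary_metrics(*counts) for counts in fold_counts]
    summary = {}
    for name in per_fold[0]:
        values = [m[name] for m in per_fold]
        summary[name] = (mean(values), stdev(values))
    return summary


# Hypothetical (tp, fp, tn, fn) counts for five folds, for illustration only.
folds = [(80, 20, 75, 25), (82, 18, 77, 23), (78, 22, 74, 26),
         (81, 19, 76, 24), (79, 21, 73, 27)]
summary = summarize_folds(folds)
for name, (avg, sd) in summary.items():
    print(f"{name}: {100 * avg:.2f} +/- {100 * sd:.2f} %")
```

Reporting the standard deviation alongside the mean makes it possible to judge whether differences between models exceed the fold-to-fold variation.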

Table 6 Performance of the deep learning models toward breast mass classification
Fig. 6 ROCs of deep models on DDSM

As can be seen from the above table and figure, VGG19 performed best among all of the models in terms of sensitivity, accuracy, AUC, and other metrics. Therefore, we believe VGG19 is the most suitable feature extractor for the following feature classifiers. Note that the AUCs of VGG19 are stable, varying only slightly around 0.88. VGG19 may benefit from its large number of parameters and its straightforward architecture. However, the number of learnable parameters is not the only factor that determines the overall classification performance. For example, InceptionV3 has fewer parameters than ResNet101, yet it greatly outperformed ResNet101 overall. This conclusion does not deny the potential of the other deep models, but more fine-tuning would be needed for them to perform better. For the experiment, we take the trained VGG19 model as the backbone of our feature extractor VGG19-DF.

4.4 Model ablation

Although we have successfully trained the deep models for feature extraction, the final performance depends not only on the quality of the feature extractor but also on how the feature extractor is used and on the architectures of the following classifiers. In this section, we carry out ablation experiments to determine the best configurations of the feature classifiers and the best integration of the feature extractor and feature classifiers. The output size of the feature extractor can vary from 1000 to 2. However, a large output size unnecessarily increases the computational cost, while a small output size suffers from significant information loss. Therefore, we take the deep feature outputs from the FC24 layer and the FC64 layer as the features for the classifiers. For simplicity, we denote the feature representation generated from FC24 as \(Fea_{24}\) and the feature representation generated from FC64 as \(Fea_{64}\). Concatenating \(Fea_{24}\) and \(Fea_{64}\) gives the concatenated feature \(Fea_{24+64}\). Inspired by the work in [28], we believe the classifiers may gain extra benefit by learning from multiple-level features. We therefore consider the performance of the classifiers under these three situations.
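The concatenation step described above can be sketched as follows. The arrays are random stand-ins for the FC24 and FC64 activations, and the batch size and variable names are assumptions made for illustration; the order of concatenation is likewise assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches = 4  # hypothetical number of patches in a batch

# Random stand-ins for the activations of the FC24 and FC64 layers.
fea_24 = rng.standard_normal((n_patches, 24))
fea_64 = rng.standard_normal((n_patches, 64))

# Concatenate along the feature axis so each patch gets one 88-dimensional vector.
fea_24_64 = np.concatenate([fea_24, fea_64], axis=1)
print(fea_24_64.shape)  # (4, 88)
```

Each row of `fea_24_64` keeps the FC24 features in the first 24 columns and the FC64 features in the remaining 64, so the classifier sees both levels at once.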

4.4.1 \(Fea_{24}\)-based mass classification

We first explored the performance of the classifiers, i.e., ELM, RVFLN, and dRVFL. To determine the optimal number of hidden nodes for ELM, we varied the number from 40 to 1000 and summarized the results in Table 7. As can be seen from Table 7, the ELM model with 1000 hidden nodes gives the highest overall accuracy compared to ELMs with other numbers of hidden neurons. However, the highest AUC is given by the ELM models with only 40 or 80 hidden nodes. Also, the performance of these models is very close to that of the trained VGG19 models, and some of them performed even worse. Therefore, ELMs may not be suitable as the feature classifiers.

Table 7 Performance of the extreme learning machine with the varied numbers of hidden neurons based on deep feature \(Fea_{24}\)(unit:%)
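For readers unfamiliar with ELM, the sketch below shows the general idea of the classifier whose hidden-node count is varied above: the hidden weights are random and fixed, and only the output weights are solved in closed form. It is a generic illustration, not the implementation used in this study; the sigmoid activation, the ridge term `reg`, and the synthetic 24-dimensional data are all assumptions.

```python
import numpy as np


class ELM:
    """Single-hidden-layer ELM with random hidden weights and ridge-solved output weights."""

    def __init__(self, n_hidden, reg=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.reg = reg  # ridge regularization strength (assumed value)
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the random projection.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.W = self.rng.uniform(-1, 1, (X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1, 1, self.n_hidden)
        H = self._hidden(X)
        T = np.eye(2)[y]  # one-hot targets for the two classes
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)  # closed-form output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


# Synthetic, well-separated two-class data standing in for 24-dimensional deep features.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
X = rng.standard_normal((500, 24)) + (2 * y[:, None] - 1)

model = ELM(n_hidden=40).fit(X[:400], y[:400])
acc = np.mean(model.predict(X[400:]) == y[400:])
print(f"held-out accuracy: {acc:.2f}")
```

Because training reduces to one linear solve, sweeping `n_hidden` over a range of values is cheap compared with retraining a deep network.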

For RVFLN, the number of enhancement nodes controls the overall performance of the models. Similarly, we varied the number of enhancement nodes to determine the best model, and the results can be seen in Table 8. As can be seen, the RVFLN model with 100 enhancement nodes is the best model compared to RVFLNs with other numbers of enhancement nodes in terms of precision, \(F1_{score}\), and accuracy. Generally, all RVFLN models performed similarly to each other and outperformed ELM when the input feature is \(Fea_{24}\). Besides, the RVFLN models with different numbers of enhancement nodes also showed close or even better performance compared to the VGG19 models. Because RVFLN outperformed the previous ELM models, we believe RVFLN models are preferable as classifiers for breast mass classification tasks.

Table 8 Performance of the random vector functional link with the varied number of enhancement nodes based on deep feature \(Fea_{24}\)(unit:%)
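The role of the enhancement nodes can be illustrated with a generic RVFL sketch. Unlike ELM, the output layer sees both the raw input (the direct link) and the randomly generated enhancement nodes. The tanh activation, the ridge term, the function names, and the synthetic data are assumptions for illustration and are not taken from the original implementation.

```python
import numpy as np


def rvfl_fit(X, y, n_enhancement, reg=1e-3, seed=0):
    """Fit an RVFL network: random enhancement nodes plus a direct input-output link."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X.shape[1], n_enhancement))
    b = rng.uniform(-1, 1, n_enhancement)
    H = np.tanh(X @ W + b)            # enhancement nodes (random, never trained)
    D = np.hstack([X, H])             # direct link: raw features join the enhancement nodes
    T = np.eye(2)[y]                  # one-hot targets for the two classes
    beta = np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ T)
    return W, b, beta


def rvfl_predict(model, X):
    W, b, beta = model
    D = np.hstack([X, np.tanh(X @ W + b)])
    return np.argmax(D @ beta, axis=1)


# Synthetic two-class data standing in for 24-dimensional deep features.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
X = rng.standard_normal((500, 24)) + (2 * y[:, None] - 1)

model = rvfl_fit(X[:400], y[:400], n_enhancement=100)
acc = np.mean(rvfl_predict(model, X[400:]) == y[400:])
print(f"held-out accuracy: {acc:.2f}")
```

The direct link means the network can always fall back on a linear model of the input features, which is one plausible reason for RVFL-type classifiers being less sensitive to the number of random nodes than ELM.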

For dRVFL, the number of hidden layers and the number of neurons within each layer jointly determine the final performance. As we decided to set the number of layers to three, we varied the number of neurons within each hidden layer, keeping the number of hidden neurons the same in every layer. To avoid overfitting, we kept the number of hidden neurons relatively small, ranging from 6 to 24 with an interval of 6. The results can be seen in Table 9. As can be seen, the dRVFL model with six hidden neurons per layer is the best model compared to dRVFL models with other numbers of hidden neurons in terms of sensitivity, accuracy, and AUC. Note that the average accuracy has been improved to \(81.60\%\). Also, compared to ELM and RVFLN models, dRVFL models showed consistent performance on DDSM while possessing higher AUCs, which indicates the suitability of these models as classifiers.

Table 9 Performance of the deep random vector functional link nets with the varied number of enhancement nodes based on deep feature \(Fea_{24}\)(unit:%)
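A compact way to picture the dRVFL configuration above (three hidden layers with the same number of neurons) is the sketch below. It follows one common formulation in which each random hidden layer feeds the next and the output layer is solved on the concatenation of the input and all hidden layers. Whether this matches the exact variant used in this study is an assumption, as are the tanh activation, the ridge term, the function names, and the synthetic data.

```python
import numpy as np


def drvfl_fit(X, y, n_layers=3, n_nodes=6, reg=1e-3, seed=0):
    """Fit a deep RVFL: stacked random layers, one closed-form output layer."""
    rng = np.random.default_rng(seed)
    layers, feats, h = [], [X], X
    for _ in range(n_layers):
        W = rng.uniform(-1, 1, (h.shape[1], n_nodes))
        b = rng.uniform(-1, 1, n_nodes)
        h = np.tanh(h @ W + b)        # output of this random hidden layer
        layers.append((W, b))
        feats.append(h)
    D = np.hstack(feats)              # input plus every hidden layer feed the output layer
    T = np.eye(2)[y]                  # one-hot targets for the two classes
    beta = np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ T)
    return layers, beta


def drvfl_predict(model, X):
    layers, beta = model
    feats, h = [X], X
    for W, b in layers:
        h = np.tanh(h @ W + b)
        feats.append(h)
    return np.argmax(np.hstack(feats) @ beta, axis=1)


# Synthetic two-class data standing in for 24-dimensional deep features.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
X = rng.standard_normal((500, 24)) + (2 * y[:, None] - 1)

model = drvfl_fit(X[:400], y[:400], n_layers=3, n_nodes=6)
acc = np.mean(drvfl_predict(model, X[400:]) == y[400:])
print(f"held-out accuracy: {acc:.2f}")
```

With 24 input features and three layers of six neurons, the output layer solves for only 42 weights per class, which illustrates why such small hidden sizes keep the risk of overfitting low.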

For SNN, the number of hidden neurons in the hidden layer determines the final performance of the model, so we varied this number to obtain the candidate models. The results can be seen in Table 10. Note that AUC is not applicable to the SNN models, so we only calculated sensitivity, specificity, precision, \(F1_{score}\), and accuracy. As can be seen, the network with 40 hidden neurons is the best model compared to other SNNs with different numbers of hidden nodes in terms of accuracy. However, the overall performance of SNNs is worse than that of the trained VGG19 and the other classifiers. Also, the standard deviations of the SNN results increase with the number of hidden nodes. Therefore, SNN may not be suitable for the classification task here.

Table 10 Performance of the spiking neural network with the varied number of hidden nodes based on deep feature \(Fea_{24}\)

4.4.2 \(Fea_{64}\)-based mass classification

We then explored the best models of the classifiers, i.e., ELM, RVFLN, and dRVFL, based on \(Fea_{64}\). Similarly, we varied the number of hidden nodes for ELM from 40 to 1000 and summarized the results in Table 11. As can be seen from Table 11, the ELM with 80 hidden nodes performed best compared to ELMs with other numbers of hidden neurons in terms of accuracy. Also, the ELM models with large numbers of hidden nodes failed to perform better than the ELM models with small numbers of hidden nodes, which suggests that the number of hidden nodes must be chosen carefully. Interestingly, the ELMs trained on \(Fea_{64}\) showed consistently better performance than the ELMs trained on \(Fea_{24}\). Therefore, we may roughly conclude that \(Fea_{64}\) is more representative than \(Fea_{24}\).

Table 11 Performance of the extreme learning machine with the varied number of hidden neurons based on deep feature \(Fea_{64}\)

Similarly, we varied the number of enhancement nodes to determine the best model for RVFLN. The results can be seen in Table 12. As can be seen, the RVFLN model with 400 enhancement nodes is the best model compared to RVFLNs with other numbers of enhancement nodes. The RVFLN models trained on \(Fea_{64}\) also performed better than the RVFLN models trained on \(Fea_{24}\), which further supports the representativeness of \(Fea_{64}\).

Table 12 Performance of the random vector functional link with the varied number of enhancement nodes based on deep feature \(Fea_{64}\)

For dRVFL, we varied the number of hidden neurons in each hidden layer while keeping the number of hidden neurons the same across the hidden layers. The results can be seen in Table 13. As can be seen, the model with 12 neurons in each of the three hidden layers is the best model compared to dRVFLs with other numbers of hidden neurons. Note that the average accuracy reached \(81.55\%\), which means an average difference of 0.22 between the dRVFL models and the trained VGG19s. Compared with the models trained on \(Fea_{24}\), these models performed even better and should be considered as the candidate classifier.

Table 13 Performance of the deep random vector functional link with the varied number of enhancement nodes based on deep feature \(Fea_{64}\)

For SNN, we varied the number of hidden neurons to obtain the best candidate model. The corresponding results can be seen in Table 14. As can be seen, the SNN model with 40 hidden neurons is the best model compared to other SNNs with different numbers of hidden nodes. The comparison between SNN models trained on \(Fea_{24}\) and on \(Fea_{64}\) showed that the models trained on \(Fea_{64}\) are more desirable, though the overall performance of the SNN models is still worse than that of the other classifiers.

4.4.3 \(Fea_{24+64}\)-based mass classification

Inspired by another work, we believe it worthwhile to concatenate the features from different levels for model training. Starting from the previous classifiers, we slightly changed their architectures, aiming to obtain the best models based on the concatenated feature \(Fea_{24+64}\). Similarly, we first varied the number of hidden nodes for ELM from 40 to 1000 and listed the results in Table 15. As can be seen from Table 15, the ELM model with 200 hidden neurons performed best compared to ELMs with other numbers of hidden neurons. Compared to the ELM models that were trained on \(Fea_{24}\) and on \(Fea_{64}\), the models trained on \(Fea_{24+64}\) showed similar performance.

Table 14 Performance of the spiking neural network with the varied number of hidden nodes based on deep feature \(Fea_{64}\)
Table 15 Performance of the extreme learning machine with the varied number of hidden neurons based on deep feature \(Fea_{24+64}\)
Table 16 Performance of the random vector functional link with the varied number of enhancement nodes based on deep feature \(Fea_{24+64}\)

Similarly, we varied the number of enhancement nodes to determine the best RVFLN model. The results can be seen in Table 16. As can be seen, RVFLN with 40 enhancement nodes is the best model compared to other RVFLN models. It also showed the highest specificity, precision, \(F1_{score}\), and accuracy, which supports the effectiveness of the model. However, the RVFLN models trained on \(Fea_{24+64}\) perform quite similarly to the RVFLN models trained on \(Fea_{24}\) and \(Fea_{64}\), and therefore the benefit of concatenating features from different levels is minimal here.

For dRVFL models, we varied the number of hidden neurons in each hidden layer while keeping the number of hidden neurons the same across the hidden layers. The classification results can be seen in Table 17, while the corresponding ROCs can be seen in Fig. 7. As can be seen, the dRVFL model with 12 hidden neurons was the best model compared to other dRVFL models. Specifically, the overall accuracy has been improved to \(81.71\%\) with an improved sensitivity of \(83.17\%\). Compared with the models trained on \(Fea_{24}\) and \(Fea_{64}\), no significant improvement can be seen, and some models performed even worse. Nevertheless, the dRVFL model with 12 hidden neurons can be taken as the classifier for breast mass classification.

Table 17 Performance of the deep random vector functional link with the varied number of enhancement nodes based on deep feature \(Fea_{24+64}\)
Fig. 7 ROCs of dRVFLs trained with \(Fea_{24+64}\) on DDSM

For SNN models, we then varied the number of hidden neurons. The corresponding results can be seen in Table 18. As can be seen, the SNN model with 40 hidden neurons is the best model compared to other SNNs with different numbers of hidden nodes. Notably, the SNN model with 1000 hidden neurons experienced the greatest decline in performance. The reason could be overfitting caused by the large number of parameters introduced by 1000 neurons. Compared with the SNN models trained on \(Fea_{24}\) and on \(Fea_{64}\), the models trained on \(Fea_{64}\) still have the best performance. Nevertheless, the SNN models are likely to provide results with high sensitivity. So far, all experiments showed that the models trained on \(Fea_{64}\) are more likely to perform better than the models trained on \(Fea_{24}\) and \(Fea_{24+64}\), although the model with the best performance was obtained after being trained on \(Fea_{24+64}\). Moreover, dRVFL models showed overall higher performance than the other classifiers throughout the different settings. Therefore, we believe that the combination of the VGG19-based feature extractor and the dRVFL-based classifier is the most reasonable choice.

Table 18 Performance of the spiking neural network with the varied number of hidden nodes based on deep feature \(Fea_{24+64}\)
Table 19 Performance of the traditional classifiers for breast mass classification based on \(Fea_{24}\)
Table 20 Performance of the traditional classifiers for breast mass classification based on \(Fea_{64}\)

4.5 Method comparison

Based on the previous experiments, we validated the effectiveness of the classifier on the breast mass classification task. We first compare the performance of our model with models based on traditional classifiers. Specifically, we examined the performance of representative traditional classifiers, including SVM, KNN, DT, and BDT. The results are listed in Tables 19, 20, and 21. In these tables, \(KNN_{5}\), \(KNN_{10}\), and \(KNN_{20}\) stand for KNN with k set to 5, 10, and 20, respectively. Likewise, \(BDT_{5}\), \(BDT_{10}\), and \(BDT_{20}\) are BDTs with 5, 10, and 20 bags, respectively. As can be seen from Table 19, almost all of the models except \(KNN_{10}\) showed declining performance; \(KNN_{10}\) showed the overall best performance, while SVM gave the highest sensitivity. For KNN models, a larger k does not necessarily mean better performance, so k should be carefully chosen, which is supported by the results here. A similar conclusion holds for the choice of the number of bags in BDT. The comparison between these models showed that KNN models are preferable to DT and BDTs in terms of overall accuracy, while SVM falls between them. The results of BDT justify its effectiveness as an ensemble of DTs. Compared to the novel classifiers trained on \(Fea_{24}\), the traditional models fell further behind the trained VGG19 models.
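The observation that a larger k is not automatically better can be seen in a small, self-contained KNN sketch. The toy two-dimensional points and the function name are made up for illustration and have no relation to the deep features or results in this study.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data: three points of class 0 near the origin, four points of class 1 far away.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = [0, 0, 0, 1, 1, 1, 1]
query = (1, 1)  # clearly closer to the class-0 cluster

print(knn_predict(train_X, train_y, query, k=3))  # 0: the three nearest points are class 0
print(knn_predict(train_X, train_y, query, k=7))  # 1: every point votes, the larger class wins
```

With k equal to the training-set size, the prediction collapses to the majority class regardless of the query, which is the extreme case of choosing k too large.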

For the traditional classifiers trained on \(Fea_{64}\), the overall performance is higher than that of the classifiers trained on \(Fea_{24}\), and the average accuracy of the best model reached \(81.68\%\). Therefore, we believe \(Fea_{64}\) is preferable to \(Fea_{24}\). However, SVM still does not seem to be a suitable classifier for the classification task here, as SVM models again showed declining performance on \(Fea_{64}\). For KNN models, the optimal k appears to be close to 10, as KNN with 10 neighbors performed best, but the exact value remains to be explored. For DT and BDT, the performance of BDT improved while the performance of DT remained low. It is worth noting that the BDT with 10 bags showed competitive performance, though further exploration is needed. For the models trained on \(Fea_{24+64}\), no extra benefit seems to be introduced except for SVM. Therefore, the conclusion that \(Fea_{64}\) is the most representative feature amongst \(Fea_{24}\), \(Fea_{64}\), and \(Fea_{24+64}\) is further supported. Considering the performance of all models trained on the different features, our selected novel classifiers, especially dRVFL, turned out to be the ideal classifiers.

Table 21 Performance of the traditional classifiers for breast mass classification based on \(Fea_{24+64}\)

Because the results in the related works are not replicable due to access limitations, we compared our method to methods validated on the same dataset, DDSM, which is a relatively large-scale public dataset compared to other datasets. The comparison results can be seen in Table 22. In this table, the MultiScaled Deep CNNs proposed by Jing et al. [35] achieved the highest performance, with a sensitivity of 0.97, an accuracy of 0.96, and an AUC of 0.96. However, their work was evaluated on only 150 test images. The same situation applies to the work of Daniel et al. [25]. If we instead compare against the work with the closest number of test samples, that of Ragab et al. [41], which uses 676 test samples, our sensitivity is higher by 7%, our accuracy by 3%, and our AUC by 5%. Compared with Sujata et al. [23], our AUC is higher by 3%, and compared with Rampun et al. [42], our accuracy is higher by at most 2%. In conclusion, our evaluation set is the largest among the listed methods, and our method has competitive accuracy and AUC against the other methods. Considering both the performance of our model and the size of the validation set, we believe our method is the most desirable among all listed methods.

Table 22 Performance of the state-of-the-art methods on DDSM

5 Discussion

In this study, we proposed DF-dRVFL, which includes a deep learning model, VGG19-DF, and a novel classifier, dRVFL, for breast mass classification. Through the experiments, we found that the proposed DF-dRVFL model can reach state-of-the-art performance with less computational power and time. In this section, we discuss the limitations and possible future work based on our proposed DF-dRVFL.

Firstly, the backbone of the deep learning model can be further improved by using more complex and deeper model architectures such as Transformer [53], Ensemble [43], and ConvNeXt [27]. It is also possible to use unsupervised learning-based feature extractors instead. For example, the feature extractor can be implemented with SimCLR [3], SimSiam [2], or BYOL [9]. These unsupervised learning-based feature extractors do not require ground-truth labels and therefore further reduce the training cost.

The second shortcoming is that the performance of the entire DF-dRVFL highly depends on the feature extractor. If the backbone of the feature extractor cannot perform well on the classification task, the improvement brought by the classifiers is of limited value, although there is still an improvement over the feature extractor alone. Conversely, even when a single feature extractor achieves higher performance, our proposed dRVFL can still further improve the overall performance and the training time. The computational cost is also lower than that of the original classifier in the feature extractor.

Another limitation is how to automatically specify the architectures of the classifiers. As shown in Section 4, we used an exhaustive-like method to determine the number of hidden neurons or enhancement nodes. However, the models that showed the highest performance on the classification task cannot be guaranteed to be the best models in the search domain. Therefore, the best method for defining the architecture of the classifiers remains to be explored in the future.
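The exhaustive-like search mentioned above amounts to evaluating every candidate size on a fixed grid and keeping the best one, as in the sketch below. The function names are hypothetical, and `toy_score` is a made-up stand-in for a cross-validated accuracy; it does not reproduce any result from this study.

```python
def exhaustive_search(candidates, evaluate):
    """Score every candidate and return the best one together with all scores."""
    scores = {n: evaluate(n) for n in candidates}
    best = max(scores, key=scores.get)  # ties resolve to the first candidate evaluated
    return best, scores


def toy_score(n_neurons):
    """Made-up score that peaks at 12 neurons, used only to exercise the search."""
    return 1.0 - 0.001 * (n_neurons - 12) ** 2


# Grid used for dRVFL in Section 4: 6 to 24 hidden neurons with an interval of 6.
grid = list(range(6, 25, 6))
best, scores = exhaustive_search(grid, toy_score)
print(grid)  # [6, 12, 18, 24]
print(best)  # 12
```

The sketch also makes the limitation concrete: the search can only return a value that is on the grid, so a better size between two grid points would never be found.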

The last limitation of this model is the performance of the classifiers. We evaluated these classifiers only with their original architectures, so there still exist gaps in performance that can be narrowed. Related techniques can be applied here, such as ensembling different classifiers [6]. There are different algorithms for ensembling, including bagging-based ensembles [10], the random forest algorithm [33], AdaBoost [7], and voting classifiers [44].

Our future work will address the limitations mentioned above by further exploring powerful feature extractors and classifiers for breast mass classification tasks. Firstly, we would like to try more complex and deeper feature extractors, and then implement feature extractors in an unsupervised learning manner in order to save the cost of the labelling process. Secondly, we aim to design an algorithm to automatically determine the numbers of hidden neurons in the proposed classifiers. Thirdly, we will use ensemble-based algorithms to integrate different classifiers to achieve better performance than a single classifier.

6 Conclusion

Breast mass severity classification plays an important role in mammogram-based cancer analysis. In this paper, we developed a novel framework, DF-dRVFL, for breast mass classification in mammography images. The proposed method is a hybrid architecture that uses both a deep learning architecture and novel machine learning models for the classification task. Based on the experimental results on DDSM, the proposed framework showed very high performance. Because training a deep learning model to its best performance is time-consuming, we circumvented this non-trivial task by repurposing the trained deep learning models as the deep feature extractor and introducing novel classifiers instead. After the introduction of the novel classifiers, including dRVFL, the overall training time has been greatly reduced, while the classification performance has also been improved. For the classifiers chosen for this task, we aimed to introduce classifiers that have good generalizability and novel architectures. To fully use the deep features from different levels, we proposed to combine the deep features from different levels of the trained deep learning models as the input for the classifiers. Among all evaluated models, dRVFL, which forms part of our DF-dRVFL model, is the model that benefits most from the concatenated features. Also, the experiments showed that \(Fea_{64}\) might be more suitable for model training, as the models trained on \(Fea_{64}\) generally showed higher performance than the models trained on \(Fea_{24}\). In conclusion, we believe the developed DF-dRVFL model could serve as a promising model for breast mass classification.