Introduction

Worldwide, there were an estimated 19.3 million new cancer cases and almost 10.0 million cancer deaths in 2020. For women, breast cancer is now the most common type of cancer, with an estimated 2.3 million new cases each year1. Breast cancer is a category of disorders in which the cells of the breast multiply uncontrollably, resulting in the formation of a lump in a specific location of the breast2. Invasive ductal carcinoma (IDC) is the most common type of breast cancer, accounting for more than 80% of all cases3. Early detection and screening are critical for effectively preventing breast cancer. Breast cancer screening consists of three procedures: mammography, breast magnetic resonance imaging (MRI), and breast ultrasonography4. If suspicious tissue is detected, physicians extract it via biopsy for further histologic examination. After tissue extraction, three steps are performed prior to histological grading: (1) formalin fixation, (2) paraffin section embedment, and (3) haematoxylin and eosin staining5.

The three primary prognostic markers that determine breast cancer treatment are (1) lymph node (LN) status, (2) tumour size and (3) histological grade6. Multiple studies have shown that the prognostic value of the histological grade is comparable to that of LN status and greater than that of tumour size7,8. It is established that the prediction accuracy for clinical outcomes improves when the histological grade and LN status are applied together9. Frkovic-Grazio and Bracko10 found that the histological grade predicted tumour behaviour accurately, especially for early small tumours. Schwartz et al.11 revealed that high-grade breast cancer patients who underwent mastectomy suffered higher mortality rates and greater axillary lymph node involvement than lower-grade patients. Therefore, the breast cancer grade (IDC grade) is a major indicator of breast cancer outcomes.

The breast cancer grade indicates the tumour’s aggressiveness12. Specifically, pathologists categorize breast cancer using the Nottingham Grading Scheme (NGS), which assigns a grade characterized by three morphological traits of the breast cancer tissue: (1) mitotic count (the number of proliferating tumour cells), (2) nuclear pleomorphism (the overall appearance of the tumour cell), and (3) degree of tubule formation (how well the tumour cells replicate normal glands)5. These characteristics combine to produce a total score that indicates the presence of low-grade (grade 1), intermediate-grade (grade 2), or high-grade (grade 3) breast cancer12. Although manual breast cancer grading remains the gold standard for cancer diagnosis, pathologists' competence can have a considerable impact on results13. Inexperienced pathologists may make incorrect diagnoses14. Manual breast cancer grading is laborious, time-consuming, and subjective, owing to pathologists’ wide intra- and inter-observational variability13. Elmore et al.15 discovered an overall agreement of around 75.3 percent between each pathologist's investigation and the expert consensus–derived reference diagnosis. Additionally, manual grading in low magnification images is susceptible to statistical, distributional, and human errors16.

Automated breast cancer grading approaches have risen in popularity as computer vision technology has advanced. Previous research17,18,19,20 attempted to overcome the manual breast cancer grading system by combining NGS criteria with classic machine learning approaches. Nevertheless, traditional approaches are highly feature-dependent, time-consuming, and expensive to compute. Deep learning methods, by contrast, improve grading efficiency while reducing human workloads21. Wan et al.22 pioneered the use of deep learning by employing a Convolutional Neural Network (CNN) to classify breast cancer grades. Several other studies23,24,25 used a range of deep learning techniques to handle this classification problem. These techniques, however, are complex and necessitate a large amount of computing power. Transfer learning has therefore become increasingly common; for example, several studies26,27 used transfer learning to grade breast cancer. To our knowledge, there is a gap among these studies: no performance comparison of recent pre-trained state-of-the-art CNN architectures (EfficientNetB028, EfficientNetV2B029, EfficientNetV2B0-21k29, ResNetV1-5030, ResNetV2-5031, MobileNetV132, and MobileNetV233) has been reported. Consequently, users lack insight into how these CNN architectures perform in automatic IDC grading. We therefore aim to fill this gap by reporting our findings on an automated IDC grading application employing several CNN architectures, ranging from simple, light-weight CNNs to complicated, heavy-weight CNNs.

The purpose of this work is to examine contemporary CNN architectures in IDC grading through the use of histopathology images. The following are the study's aims, in no particular order:

  1. To review the state-of-the-art CNN architectures adopted in IDC grading.

  2. To conduct a comparative investigation of the performance of seven selected cutting-edge CNN architectures on the Four Breast Cancer Grades (FBCG) Dataset26.

Our work studied seven CNN architectures (EfficientNetB0, EfficientNetV2B0, EfficientNetV2B0-21k, ResNetV1-50, ResNetV2-50, MobileNetV1, and MobileNetV2) in the application of automated IDC grading. We employed the transfer learning technique, which leverages pre-trained CNNs from the TensorFlow Hub (TF Hub) for visual feature extraction; the saved CNNs were trained on the ImageNet dataset. We applied our proposed technique to the Four-Breast-Cancer-Grades (FBCG) dataset. However, our work was accomplished without fine-tuning the pre-trained CNN architectures and without examining the effect of stain normalisation. We summarise our contributions below:

  1. We conducted a performance analysis of seven CNN architectures on IDC grading applications based on the Four Breast Cancer Grades (FBCG) Dataset.

  2. We designed and conducted experiments showing that the EfficientNetV2B0-21k outperformed the other CNN models (balanced accuracy = 0.9666 ± 0.0185, macro precision = 0.9646 ± 0.0174, macro recall = 0.9666 ± 0.0185 and macro F1 score = 0.9642 ± 0.0184 on fivefold stratified cross-validation (CV); balanced accuracy = 0.9524 and macro recall = 0.9524 on the test set) while requiring only low FLOPs (0.72B), parameters (7.1 M), inference time (0.0758 ± 0.0001 s) and training time (0.5592 ± 0.0162 h).

  3. We found that all CNN architectures performed comparatively well in IDC grading applications, with an average balanced accuracy of 0.9361 ± 0.0189 (fivefold stratified CV) and 0.9308 ± 0.0211 (test result).

The following is the structure of this work: the “Related works” section highlights the development of breast cancer grading systems. The “Methodology” section outlines the technique used to compare the performance of seven CNN architectures. The “Results and discussion” section summarises our results from the comparative study. Finally, in the “Conclusion” section, we summarise our findings and discuss future developments.

Related works

This section reviews the history of automated breast cancer grading using histopathology images. These studies are divided into two categories: classic feature-based and deep learning-based (manual feature extraction, end-to-end feature extraction, and transfer learning).

Initially, breast cancer grading was based on the NGS criteria for (1) mitotic count, (2) degree of tubule formation, and (3) nuclear pleomorphism. For example, Dalle et al.17 proposed a multi-resolution technique that incorporated all three NGS criteria to overcome earlier automated breast cancer grading systems that covered only portions of the criteria. The proposed approach was executed in a manner comparable to manual grading. Doyle et al.19 suggested an automated quantitative image analysis method based on spectral clustering and image attributes from the textural and architectural domains. Prior to performing spectral clustering, the authors computed textural and architectural characteristics from the images in order to minimise the dimensionality of the feature set. The suggested technique classified low and high breast cancer grades with 93.3% accuracy when all architectural features were included.

Naik et al.19 outlined an automated gland and nuclei segmentation method for prostate and breast histopathology that integrated three types of image information: (1) low-level information based on pixel values, (2) high-level information based on the correlations between pixels for object detection, and (3) domain-specific information based on the correlations between histological structures. The proposed method achieved 80.52% and 93.33% accuracy for low and high breast cancer grades, respectively, using automated and manually extracted feature sets. Basavanhally et al.20 proposed a multi-field-of-view (multi-FOV) framework for grading ER + breast cancers using entire histopathology slides. The authors used a multi-FOV classifier capable of automatically integrating image features from multiple FOVs of varying sizes to predict the breast cancer grade of the images. For classifying low versus high grades, low versus intermediate grades, and intermediate versus high grades, the approach achieved area under the curve (AUC) values of 0.93, 0.72, and 0.74, respectively. Dimitropoulos et al.34 proposed a method for automatically grading breast cancer by encoding histological images as Grassmann manifold-based Vector of Locally Aggregated Descriptors (VLAD) representations. Additionally, the authors created a new medium-sized breast cancer grading dataset. With the overlapping 8 × 8 patch strategy, the proposed method achieved an average classification accuracy of 95.8%.

Despite their simplicity, these methods are probably obsolete in light of recent advancements in computer vision technology. They are primarily feature-based, focusing exclusively on segmenting and classifying histological primitives. Moreover, they require a greater amount of computational power due to the complexity of the pre-processing steps (segmentation, nuclei separation, and detection) and the absence of heuristics for feature extraction23.

Deep learning based methods

Deep learning is a subset of machine learning inspired by the human brain's ability to recognize patterns. Deep learning approaches train on hierarchical representations to achieve high performance. Prior domain knowledge is inessential since these methods can extract and categorize distinctive features on their own, whereas conventional machine learning approaches require hand-crafted feature extraction. Hence, deep learning techniques, particularly CNNs, have become the de facto standard for medical image classification35. A CNN is a type of deep neural network (DNN) that relies on the correlation of neighbouring pixels. Initially, a CNN uses randomly initialised filters, which are updated during model training; the trained filters are then used to predict the validation and test sets. CNNs have widely succeeded in image recognition problems as automatic feature extractors since they excel at matching the data point distribution in the image. A CNN architecture comprises two types of transformations: (1) the convolution layer (pixels are convolved with a filter, delivering the dot product between the image patch and the filter) and (2) the subsampling layer (max, min, or average pooling, which lowers the data dimensionality). The filter dimension (height × width × depth) and the pooling filter size can be configured based on the network or user requirements. After a combination of convolution and pooling layers, the output is passed through a fully connected layer for final classification36.
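As an illustration of these two transformation types, the following minimal sketch (assuming TensorFlow/Keras; the layer sizes are illustrative and not those used in this study) stacks convolution and pooling layers before a fully connected classifier:

```python
# Minimal sketch of a CNN with convolution, subsampling (pooling), and fully
# connected layers; sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

toy_cnn = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 3)),                          # RGB input image
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),   # convolution layer
    layers.MaxPooling2D(pool_size=(2, 2)),                      # subsampling (max pooling) layer
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(4, activation="softmax"),                      # fully connected layer for final classification
])
```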

Manual feature extraction

Wan et al.22 proposed a method for grading breast cancer in histopathological images by combining multi-level image features at three levels: (1) pixel-level, (2) object-level, and (3) semantic-level features. The method achieved 92% accuracy in distinguishing low from high grades, 77% for low versus intermediate grades, 76% for intermediate versus high grades, and 69% across all breast cancer grades. The multi-level features allow for accurate morphological classification of cancer while also extracting structural information and interpretable high-level concepts from histopathological images. Additionally, the use of cascaded ensembles lowers computational costs. However, the dataset used is relatively small (106 images), and the implemented architecture is inefficient, resulting in a lengthy training period (20 h). As a result, we intend to investigate deep learning methods that incorporate automatic feature extraction.

Automatic feature extraction

Li et al.24 proposed a multi-task deep learning method for breast cancer grading that embeds contrastive constraint as well as classification constraint (SoftMax) in the feature representation learning process. In the representation learning process, the authors combined classification and verification tasks of image pairs. The variances in feature outputs were calculated for different subclasses and within the same subclass. For the breast cancer grading task, the proposed method achieved 93.01% accuracy. Yan et al.37 proposed a nuclei-aware network (NANet) that grades breast cancer in histopathological images with medical intent (attention to nuclei-related features) while learning image feature representations in their entirety. The NANet is divided into two branches: (1) the main branch extracts the feature representation of the entire image, and (2) the guide branch extracts only the feature representation of the segmented nuclei image. In terms of overall breast cancer grading, the proposed model achieved 92.2% accuracy. Senousy et al.23, in contrast to Yan et al.37, proposed an Entropy-Based Elastic Ensemble of deep convolutional network (CNN) models (3E-Net) for breast cancer grading. The proposed method employs multiple CNNs as well as an ensemble-based uncertainty-measure component that selects the most certain image-wise models for the final breast cancer grading. The proposed models' two variations achieved grading accuracy of 96.15% and 99.50%, respectively. Despite their success, CNN deep learning approaches require much computational power and are more complicated than transfer learning techniques. As a result, we intend to research transfer learning techniques in IDC grading applications.

Transfer learning methods

CNNs with transfer learning techniques have become more prevalent in classification tasks. Numerous contemporary approaches make use of fine-tuning to enhance performance38. Transfer learning enhances performance by transferring knowledge from a source domain to a target domain; hence, the dataset required for training in the target domain can be reduced39. Zavareh, Safayari, and Bolhasani27 proposed a method for classifying the Databiox40 using transfer learning (BCNet). The BCNet is composed of three main components: (1) a VGG16 pre-trained model that acts as a feature extractor, (2) a global average pooling layer, and (3) three fully connected dense layers. The BCNet achieved a validation accuracy of 88% and a test accuracy of 72% for breast cancer grading. Similarly, Abdelli et al.26 proposed using transfer learning to grade breast cancer with two distinct types of CNN architectures. In three breast cancer grade datasets, the MobileNetV1 achieved 93.48% accuracy, while the ResNetV1-50 achieved 92.39% accuracy. Additionally, the authors developed a novel dataset strategy (Four-Breast-Cancer-Grades Dataset) by combining two distinct breast cancer datasets to create a new class (grade 0) for breast cancer grading. Both models performed better on the new dataset than on the original dataset; the ResNetV1-50 achieved a higher accuracy of 97.03% than the MobileNetV1.

We discovered that these transfer learning studies26,27 lack comparisons of recent pre-trained state-of-the-art CNN architectures in terms of accuracy, complexity, size, inference time, and training time. Consequently, users lack an understanding of how CNN architectures perform in automated IDC grading. We therefore intend to compare the performance of seven distinct types of CNN architectures for IDC grading applications.

Summary

Early breast cancer research17,18,19 is feature-dependent, requires increased computational power, and lacks feature extraction heuristics. Deep learning methods (CNNs) have evolved rapidly in recent years to excel at histopathological image analysis of breast cancer. Several studies23,24,37 demonstrated that deep learning methods could achieve near-perfect performance in grading breast cancer, on par with state-of-the-art approaches. Transfer learning techniques have become more prevalent in deep learning approaches, owing mainly to the small size of available breast cancer datasets. Abdelli et al.26 and Zavareh, Safayari, and Bolhasani27 used transfer learning to grade histopathological images of breast cancer. However, we discovered that these publications omit performance evaluations of contemporary CNN architectures. As a result, we intend to conduct a comparative analysis of the performance of seven distinct CNN architectures used in IDC grading applications. The methods and datasets used in previous studies on breast cancer grading are summarised in Table 1, and the available databases of breast cancer histological images are summarised in Table 2.

Table 1 This table summarises the methods and datasets adopted by previous studies on breast cancer grading.
Table 2 This table summarises available databases of breast cancer histological images.

Methodology

In this section, we describe the methodology for the comparative analysis of the performance of seven CNN architectures in IDC grading applications, using pre-trained CNNs from the TF Hub for image feature extraction (transfer learning). We adopted the Four-Breast-Cancer-Grades (FBCG) Dataset and fed it into our proposed method, which utilised the seven different pre-trained CNN architectures for feature extraction. Our experiments were conducted on the Google Colaboratory platform with the following specifications: (1) 2.30 GHz Intel(R) Xeon(R) CPU, (2) 12 GB RAM, (3) up to 358 GB disc space, and (4) 12 GB/16 GB Nvidia K80/T4 GPU. For our work, we primarily used the TensorFlow library. Our approach is divided into four stages: (1) image data pre-processing, (2) custom CNN construction (using pre-trained CNNs from TF Hub as feature extractors), (3) model compilation and training, and (4) model evaluation. The stages of our methodology are summarised in Fig. 1. We confirm that all procedures were carried out in accordance with relevant guidelines and regulations.

Figure 1
figure 1

This figure shows the overall flow of our methodology. First, a four-grade dataset (termed the "Four Breast Cancer Grades (FBCG) dataset") is established using the BreaKHis and BCG datasets. The selected seven pre-trained CNN architectures are used to model 80% of the FBCG dataset using a fivefold stratified CV approach on the pre-processed data. After confirming the stability of all the models via CV, a final model is trained using all the training data. The final model is evaluated using a test dataset (the remaining 20% of the FBCG dataset). The receiver operating characteristic curves and training versus validation curves are used to compare and analyse the performance of all the chosen models.

Dataset

The FBCG dataset comprises two datasets: (1) BreaKHis43 and (2) the Breast Cancer Grading (BCG) dataset44. BreaKHis contains 7909 histopathological images of breast cancer obtained from 82 patients at four different magnification factors (40X, 100X, 200X, and 400X), corresponding to four different objective lenses (4X, 10X, 20X, and 40X). The dataset is primarily divided into two categories: benign (2480 images) and malignant (5429 images). Benign and malignant breast tumours can each be further classified into four distinct types: Adenosis (A), Fibroadenoma (F), Phyllodes Tumour (PT), and Tubular Adenoma (TA) for the benign class; and Ductal Carcinoma (DC), Lobular Carcinoma (LC), Mucinous Carcinoma (MC), and Papillary Carcinoma (PC) for the malignant class (see Fig. 2). The term "benign" has historically been used to refer to a lesion that lacks malignant characteristics such as metastasis (spreading from an initial site to a secondary site), significant cellular atypia (appearing abnormal in shape, colour, or size), mitosis (parent cells dividing and growing), and disruption of basement membranes (the thin, dense sheets of specialised extracellular matrix that surround tissues). In general, benign lesions are non-aggressive, grow slowly, have distinct borders, and remain localised. Malignant lesions are frequently locally invasive and have a proclivity to invade distant sites, resulting in death. The images were created using Hematoxylin and Eosin (H&E) stained breast tissue biopsy slides and then processed into a digital RGB format with a resolution of 700 × 460 pixels. BreaKHis is summarised in Table 2, and the distribution of images by class and magnification factor is shown in Table 3.

Figure 2
figure 2

Sample slides of different breast tumour types (stained with H&E) under 40X, 100X, 200X, and 400X magnification factors from BreaKHis for two tumour classes: (a) benign, (b) malignant. Our research considers all histological images from the benign class as “Grade 0”.

Table 3 This table illustrates the image distribution of BreaKHis by class and magnification factor.

Zioga et al.44 published the BCG dataset containing different grades of breast cancer histological images. Each breast carcinoma histological sample was collected in the Department of Pathology at Thessaloniki's "Agios Pavlos" General Hospital, Greece, using a Nikon digital camera equipped with a 40X objective lens (equivalent to a magnification of 400X in the BreaKHis dataset). This dataset contains 300 H&E-stained images with a resolution of 1280 × 960 pixels. The dataset contains three IDC grades corresponding to 21 patients based on their NGS results: grade 1 (107 images), grade 2 (102 images), and grade 3 (91 images) (see examples in Fig. 3).

Figure 3
figure 3

Random samples from each grade in the BCG dataset: (a) Grade 1, (b) Grade 2, (c) Grade 3.

The FBCG dataset26 was created to address the constraints associated with small breast cancer datasets. The FBCG dataset is formed by combining the 400X-magnification benign images (as Grade 0) from BreaKHis with the Grade 1, 2, and 3 images from the BCG dataset. For the experiments, the dataset was divided into a 20% test set and an 80% training set with no overlap. The test set images were chosen through stratification (the first portion of images in each class was selected to form the test set). The distribution of images in the FBCG dataset is summarised in Table 4.

Table 4 This table shows the image distribution of the FBCG dataset.
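A minimal sketch of this stratified first-portion split is shown below, assuming the FBCG image paths and grade labels are held in a pandas DataFrame df with a hypothetical "grade" column:

```python
# Sketch of an 80/20 split that takes the first 20% of images of each grade as
# the test set (the "grade" column name is a hypothetical layout of the FBCG metadata).
import pandas as pd

def split_first_portion(df: pd.DataFrame, test_fraction: float = 0.2):
    test_parts = []
    for grade, group in df.groupby("grade", sort=False):
        n_test = int(round(len(group) * test_fraction))
        test_parts.append(group.head(n_test))      # first portion of each class
    test_df = pd.concat(test_parts)
    train_df = df.drop(test_df.index)              # remaining 80%, no overlap
    return train_df, test_df
```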

Data pre-processing

Pre-processing the data is critical for converting it into a format compatible with the pre-trained CNN architectures. To perform the fivefold stratified CV, we divided the training set into five folds. Stratified fivefold CV ensures that each training fold contains the same proportion of observations with a given label, ensuring that each CNN model is properly trained. The "ImageDataGenerator" class (from keras.preprocessing.image) was used to normalise the images by rescaling them by 1/255 (original images are composed of RGB coefficients ranging from 0 to 255, which are incompatible with CNN models). Then, using the "flow_from_dataframe" method, we applied this normalisation to the training set with the configurations listed in Table 6. The FBCG dataset's image sizes (700 × 460 and 1280 × 960) are large in comparison to the CNN models' input sizes (see Table 5). We noticed that resizing images preserved global characteristics but discarded local characteristics. As a result, the model's performance would be highly dependent on the model's ability to recognise and learn global features46.
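The pre-processing step can be sketched as below, assuming the training file paths and string grade labels are held in a DataFrame train_df with hypothetical "filename" and "grade" columns; target_size and batch_size are illustrative (see Table 6 for the settings actually used):

```python
# Sketch of the fivefold stratified CV split and 1/255 rescaling via
# ImageDataGenerator / flow_from_dataframe.
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)    # map RGB values from [0, 255] to [0, 1]
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(train_df, train_df["grade"])):
    train_gen = datagen.flow_from_dataframe(
        train_df.iloc[train_idx], x_col="filename", y_col="grade",
        target_size=(224, 224), batch_size=32, class_mode="categorical")
    val_gen = datagen.flow_from_dataframe(
        train_df.iloc[val_idx], x_col="filename", y_col="grade",
        target_size=(224, 224), batch_size=32, class_mode="categorical")
    # ... build, train, and evaluate the model for this fold ...
```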

Table 5 This table summarises the seven CNN architectures adopted for the comparative analysis in terms of their main contributions, datasets involved, FLOPs, parameters and input shapes.

Data augmentation

Data augmentation is a standard procedure to address the risk of model overfitting during training by increasing the number of input images in the dataset50. This procedure also assures a fairer comparison between our study results and other published results in the literature. Although the FBCG dataset contains 888 images, it is still relatively small; as a result, model overfitting may occur during training. Thus, we implemented data augmentation by infusing the training samples with artificial diversity via random but realistic transformations. We used the TensorFlow Keras pre-processing layers to augment the data. The data augmentation layers supplement the training data but are disabled during validation and testing. We used three augmentation techniques: (1) random horizontal and vertical flips, (2) random rotation, and (3) random zoom (see Table 6). We used random flipping and rotation because pathologists' ability to examine histopathological images is not affected by rotation angles; we therefore assumed that different rotation angles would not affect the CNN's ability to learn. Additionally, we used random zoom augmentation to simulate the magnification factors found in histopathological images of breast cancer in order to enhance the CNN's generalisation ability.
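The augmentation block can be sketched with the Keras pre-processing layers as follows (assuming TensorFlow ≥ 2.6; the transformation factors are placeholders, see Table 6 for the values used):

```python
# Sketch of the data augmentation block; active only during training.
import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),   # random horizontal and vertical flips
    layers.RandomRotation(0.2),                     # random rotation
    layers.RandomZoom(0.2),                         # random zoom (simulates magnification factor)
], name="data_augmentation")
```

These layers are applied only when the model is called with training=True, so the validation and test data pass through unchanged.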

Table 6 This table summarises the pre-processing, data augmentation, and model compilation details for the standardised framework.

Data balancing

The FBCG dataset is imbalanced (see Table 4). An imbalanced dataset will cause the CNN model to be biased toward predicting the majority class. We used the class weighting technique from the Scikit-Learn Python library to resolve this concern. This technique grants the minority class a higher weight in the model cost function in order to impose a greater penalty on misclassifications of the minority class. As a result, the model can converge on the objective of minimising errors for the minority class51. We used the following equation to determine the weight of each class:

$$W = \frac{N}{{N_{c} \times N_{sc} }}$$
(1)

where \(W=\) class weight. \(N=\) total number of samples. \({N}_{c}=\) number of classes. \({N}_{sc}=\) number of samples in each class.
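Equation (1) corresponds to the "balanced" heuristic of Scikit-Learn's compute_class_weight; a minimal sketch, assuming the training labels are available as the hypothetical column train_df["grade"]:

```python
# Sketch of the class weighting step; the resulting dictionary is later passed
# to model.fit(..., class_weight=class_weights).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = train_df["grade"].astype(int).to_numpy()          # integer grade labels 0-3
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
```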

Transfer learning

CNN approaches only perform well when the models are trained on large and well-annotated datasets. Nevertheless, the FBCG dataset is considered small (888 images). Therefore, we opted for CNNs with the transfer learning technique to address the issue of small datasets (model overfitting). Additionally, transfer learning can reduce model training time and improve model performance39. Transfer learning consists of four components: (1) the source domain (\({\mathrm{D}}_{\mathrm{s}}\)), (2) the target domain (\({\mathrm{D}}_{\mathrm{t}}\)), (3) the source learning task (\({\mathrm{T}}_{\mathrm{s}}\)), and (4) the target learning task (\({\mathrm{T}}_{\mathrm{t}}\)); transfer learning attempts to improve the target predictive function \(f_{\mathrm{t}}(\cdot)\) in \({\mathrm{D}}_{\mathrm{t}}\) with the knowledge in \({\mathrm{D}}_{\mathrm{s}}\) and \({\mathrm{T}}_{\mathrm{s}}\), where \({\mathrm{D}}_{\mathrm{s}}\ne {\mathrm{D}}_{\mathrm{t}}\) or \({\mathrm{T}}_{\mathrm{s}}\ne {\mathrm{T}}_{\mathrm{t}}\)25. Generally, the first few layers of a CNN recognise more generic features (edges and generic shapes), whereas the final few layers recognise problem-specific features. Thus, transfer learning utilises the general features learned in the first few layers on the source dataset and then relearns the specific features of the target dataset in the final few layers. Since the first few layers' features remain relevant to the problem, transfer learning makes the model training process fast and reduces the amount of data required for training39. Therefore, transfer learning enables small datasets to be trained on CNN models with minimal risk of model overfitting.

Transfer learning techniques

Transfer learning entails two distinct methods for customising a pre-trained model:

  1. 1.

    Feature Extraction; this technique leverages a previous network's representations to extract critical features from a new dataset. This is accomplished by superimposing new classifier layers (trained from scratch) on top of the pre-trained model, which is kept frozen and requires no retraining. As a result, previously learned feature representations can be repurposed for the new dataset.

  2. 2.

    Fine-tuning; this technique unfreezes several top layers of a frozen base model (pre-trained model) and then trains the newly added classifier layers along with the unfrozen base model layers. This process "fine-tunes" the base model's specific feature representations (high-order features) to make the representations more applicable for the particular task52.

While fine-tuning the model may improve performance, this technique may induce overfitting. To avoid overfitting, we utilised seven pre-trained CNN architectures (EfficientNets28,29, ResNets30,31 and MobileNets32,33) as feature extractors in this work. Early CNN architectures (LeNet53, AlexNet54, and GoogleNet55) were disregarded as they were considered outdated and no longer state-of-the-art; comparing more recently developed models is therefore more meaningful and inclusive. We utilised each pre-trained CNN architecture in the form of an image feature vector (a dense 1D tensor describing the whole image) hosted in the TF Hub. To apply the feature vector to our work, we employed the "hub.KerasLayer" to integrate the feature vector into our framework. This layer produces a batch of feature vectors corresponding to the input batch. The comparison of the seven CNN architectures is summarised in Table 5.
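As an illustration, a feature vector can be loaded and kept frozen as sketched below; the TF Hub handle shown is indicative of the EfficientNetV2-B0 (ImageNet-21k) feature vector and should be verified against tfhub.dev:

```python
# Sketch of loading a pre-trained image feature vector from TF Hub for feature
# extraction (trainable=False keeps the pre-trained weights frozen).
import tensorflow_hub as hub

feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/efficientnet_v2_imagenet21k_b0/feature_vector/2",
    trainable=False)   # outputs a batch of dense 1D feature vectors
```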

Experimental details

We constructed the IDC grading model using the Keras Functional API by combining data augmentation (described in the Data Augmentation Section), a pre-trained CNN architecture (feature vector), and several new classifier layers. Thus, the final IDC grading model is composed of seven layers: (1) an input layer, (2) a data augmentation layer, (3) the feature vector, (4) a dropout layer with a rate of 0.5, (5) a dense layer of 256 neurons with ReLU activation, (6) a dropout layer with a rate of 0.4, and (7) a dense layer of four neurons (N = number of classes) with the SoftMax activation function.
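A sketch of this seven-layer model using the Keras Functional API is given below, assuming the data_augmentation block and the hub.KerasLayer feature extractor sketched earlier; the input shape depends on the chosen architecture (see Table 5):

```python
# Sketch of the seven-layer IDC grading model built with the Keras Functional API.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 3))            # (1) input layer
x = data_augmentation(inputs)                         # (2) data augmentation (training only)
x = feature_extractor(x)                              # (3) pre-trained feature vector
x = layers.Dropout(0.5)(x)                            # (4) dropout, rate 0.5
x = layers.Dense(256, activation="relu")(x)           # (5) fully connected, 256 neurons, ReLU
x = layers.Dropout(0.4)(x)                            # (6) dropout, rate 0.4
outputs = layers.Dense(4, activation="softmax")(x)    # (7) output layer, N = 4 classes
model = tf.keras.Model(inputs, outputs)
```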

Standardizing model pipelines and hyperparameters

We standardised the model pipelines and hyperparameters to ensure fair comparisons; the standardised framework was inspired by Munien and Viriri46. Initially, the input layer assigned a specific shape to the input data (image resolution). Then, during model training, the data augmentation layer augmented (randomly flipped, rotated, and zoomed) the input data. Subsequently, the input data was fed into a pre-trained CNN model (feature vector) to extract features. The output was then passed through a first dropout layer with a rate of 0.5, a fully connected layer with 256 neurons, a second dropout layer with a rate of 0.4, and a fully connected output layer (4 neurons). Input units that were not set to 0 by the dropout layers were scaled up by 1/(1 − rate) so that the sum over all inputs remained unchanged56. Finally, the dense layer's SoftMax function converted the model output into a vector of class probabilities for the input data. The architecture of our proposed framework is depicted in Fig. 4.

Figure 4
figure 4

This figure illustrates the standardised pipelines for comparison purposes. The grey box represents one of the seven CNN architectures. Table 6 contains the details of the standardised framework and hyperparameters.

Model compiling

We adopted the Adam Optimiser with a learning rate of 0.001. Determining an appropriate learning rate is critical for model training since it affects the time required for the model to converge to local minima. A learning rate that is too high may cause the model to diverge from local minima, whereas one that is too low may impede model training, resulting in increased computational costs57. Thus, we chose a learning rate of 0.001 as the optimal value after undertaking several empirical tests. Correspondingly, we implemented the weighted categorical cross-entropy loss function for the classification task, which required the class weighting technique, and set the metrics parameter to "accuracy". Finally, each fold was trained for 100 epochs. The details of the model compilation are summarised in Table 6. The weighted categorical cross-entropy loss function is described as:

$$\mathrm{WCE} = - w_{j} \log \left( \frac{e^{S_{p}}}{\sum_{j}^{C} e^{S_{j}}} \right)$$
(2)

where \(\mathrm{WCE}=\) weighted categorical cross-entropy. \(w_{j}=\) class weights. \(S_{p}=\) positive class output score. \(S_{j}=\) output scores of the other classes.

Table 7 This table summarises the results acquired from the fivefold stratified CV.
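The compilation and training step can be sketched as below, assuming the model, data generators, and class weights sketched earlier; with the categorical cross-entropy loss, passing class_weight to fit() realises the weighted loss of Eq. (2):

```python
# Sketch of model compilation (Adam, learning rate 0.001) and training for
# 100 epochs with class weighting.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=100,
    class_weight=class_weights)    # weighted categorical cross-entropy (Eq. 2)
```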

Performance evaluation metrics

We used the macro-average technique to evaluate the precision, recall, and F1 score of the seven CNN architectures due to data imbalance. The macro-average method calculates each class metric independently and then averages the results, ensuring that all classes are treated equally. For the accuracy score, we used the balanced accuracy score from Scikit-Learn to calculate the average recall per class. The inference time indicates the average amount of time required for the CNN model to predict a single image. The training time is the period required for the CNN model to complete 100 epochs of training. Finally, we quantified the model's ability to distinguish between classes using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC)58. The following mathematical expressions define the evaluation metrics:

$${\text{Balanced Accuracy}} = \frac{1}{\left| G \right|}\mathop \sum \limits_{i = 1}^{\left| G \right|} \frac{{TP_{i} }}{{TP_{i} + FN_{i} }}$$
(3)
$${\text{Precision }}_{{{\text{macro}}}} = \frac{1}{\left| G \right|}\mathop \sum \limits_{i = 1}^{\left| G \right|} \frac{{TP_{i} }}{{TP_{i} + FP_{i} }}$$
(4)
$${\text{Recall }}_{{{\text{macro}}}} = \frac{1}{\left| G \right|}\mathop \sum \limits_{i = 1}^{\left| G \right|} \frac{{TP_{i} }}{{TP_{i} + FN_{i} }}$$
(5)
$$F1_{{{\text{macro}}}} = 2\frac{{{\text{Precision}}_{{{\text{macro}}}} \times {\text{Recall}}_{{{\text{macro}}}} }}{{{\text{Precision}}_{{{\text{macro}}}} + {\text{Recall}}_{{{\text{macro}}}} }}$$
(6)
$${\text{Inference }}\;{\text{time}} \left( s \right) = \frac{1}{10}\mathop \sum \limits_{i = 1}^{n = 10} \left( {\frac{{T_{f} - T_{in} }}{{N_{s} }}} \right)_{i}$$
(7)
$${\text{Training }}\;{\text{Time}} \left( h \right) = \frac{1}{3600}\mathop \sum \limits_{i = 1}^{n = 100} (T_{t} )_{i}$$
(8)

where

$$G = \left\{ {1, \ldots ,4} \right\} \left( {Number\, of \,classes} \right)$$

\(TP =\) true positive. \(TN =\) true negative. \(FP =\) false positive. \(FN =\) false negative. \(T_{f} =\) final prediction time for all the images in the validation/test set. \(T_{in} =\) initial prediction time for all the images in the validation/test set. \(N_{s} =\) number of validation/test samples. \(T_{t} =\) training time per epoch (in seconds).
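These metrics can be computed with Scikit-Learn as sketched below, assuming y_true and y_pred hold the integer grade labels of the validation/test images and test_images holds the corresponding image batch; the timing shown corresponds to a single run of Eq. (7):

```python
# Sketch of the performance metrics (Eqs. 3-6) and a single inference-time
# measurement (Eq. 7 averages ten such runs).
import time
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support

balanced_acc = balanced_accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

start = time.perf_counter()
model.predict(test_images)
inference_time = (time.perf_counter() - start) / len(test_images)   # seconds per image
```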

Summary

In summary, we used the FBCG dataset to compare the performance of seven different CNN architectures. Our approach was divided into four stages: (1) image data pre-processing, (2) custom CNN construction (using pre-trained CNNs from TF Hub as feature extractor), (3) model compilation and training, and (4) model evaluation. We divided the dataset into 80% training and 20% test sets (see Table 4). The training set was then subjected to the fivefold stratified CV. To pre-process our dataset, we used the "ImageDataGenerator" class and the "flow_from_dataframe" method (see Table 6). Additionally, we used TensorFlow Keras pre-processing layers to augment the data (see Table 6). We implemented the Scikit-Learn Python library's class weighting technique for the unbalanced data. To classify the FBCG dataset, we used seven pre-trained CNN architectures as feature extractors (see Fig. 4 for model framework; see Table 6 for model compiling). Finally, we evaluated each CNN architecture's performance using the following metrics: balanced accuracy, macro precision, macro recall, macro F1 score, inference time, and training time.

Results and discussion

We classified the FBCG dataset into four grades using the selected state-of-the-art pre-trained CNN architectures (EfficientNetB0, EfficientNetV2B0, EfficientNetV2B0-21k, ResNetV1-50, ResNetV2-50, MobileNetV1, and MobileNetV2). Table 7 summarises the performance metrics (balanced accuracy, macro precision, macro recall, macro F1-score, inference time, and training time) of each CNN architecture obtained from the fivefold stratified CV. CV was performed on all the training images to ensure the stability of the models (for the test set results, see Table 8). The EfficientNetV2B0-21k yielded the highest balanced accuracy (0.9666 ± 0.0185), macro precision (0.9646 ± 0.0174), recall (0.9666 ± 0.0185) and F1 score (0.9642 ± 0.0184) among the CNN models. The high performance of the EfficientNetV2B0-21k may be attributable to pre-training on the ImageNet21k dataset, which comprises approximately 12.4 million images and is larger and more diverse than the earlier ImageNet1k. The authors claimed that pre-training on ImageNet21k outperformed pre-training on ImageNet1k48.

Table 8 Breast cancer grading results on the test set using the final retrained model (using all training images).

While MobileNetV2 failed to outperform the other CNN architectures, it has the fewest FLOPs (0.3B) ("FLOPs" here refers to the number of floating-point operations, which indicates the complexity of the model architecture; the higher the number of FLOPs, the more complex the model). Similarly, the MobileNetV1 demonstrated a trade-off between accuracy and complexity with its parameter count (4.2 M) and inference time (0.0424 ± 0.0004 s) (the number of parameters represents the size of the CNN model, whereas the inference time indicates the speed of the CNN model in image prediction). Additionally, the EfficientNetB0 achieved a mediocre performance metric score with the least training time (0.5565 ± 0.0088 h) (the training time is the average training period acquired from the fivefold stratified CV).

In general, the EfficientNetV2B0-21k model outperformed the other CNN models in terms of balanced accuracy, macro precision, recall, and F1 score while being simpler (0.72B FLOPs), smaller (7.1 M parameters) and requiring less inference time (0.0758 ± 0.0001 s) and training time (0.5592 ± 0.0162 h). In comparison to the other CNN architectures, the MobileNetV1 was the fastest (with an inference time of 0.0424 ± 0.0004 s).

For IDC grading purposes, CNN models with greater accuracy are preferred, since determining the best treatment for breast cancer patients requires highly accurate IDC grade classification. Automated IDC grading is most likely deployed in healthcare facilities equipped with high-power, heavyweight workstations. Thus, resource consumption would not be a criterion for selecting the optimal CNN architecture unless IDC grading applications are extended to real-time settings in the future. Other applications, on the other hand, such as smartphone-based skin disease classification59,60, breast cancer detection on mobile devices61, and organ segmentation62,63,64, necessitate compact, low-computational-cost CNNs. In these applications, a lighter (fast and compact) CNN, or one equipped with Minimum Redundancy Maximum Relevance (mRMR) feature selection21,65 to reduce computational time and cost, would be preferred over a more accurate but complex CNN architecture.

All CNN models used in the automated IDC grading application demonstrated a high degree of capability for classifying IDC grades; the EfficientNetV2B0 model achieved the lowest accuracy (0.9076 ± 0.0398), while the EfficientNetV2B0-21k model achieved the highest accuracy (0.9666 ± 0.0185). The average accuracy of the seven CNN models is 0.9361, with a standard deviation of 0.0189. The low standard deviation score indicates only a slight discrepancy between the seven CNN architectures, demonstrating that all examined CNN architectures are capable of accurately classifying IDC grades. Thus, in addition to accuracy, other factors can be considered when selecting the optimal CNN architectures for a particular IDC grading application (such as model complexity, model size and inference time). For instance, in the event of limited resources, a simpler CNN model (such as MobileNetV1) is preferred.

However, not all CNN models are equally capable of predicting IDC grades with a short inference time; MobileNetV1 took the shortest inference time (0.0424 ± 0.0004 s), while ResNetV2-50 took the longest (0.2277 ± 0.0010 s). The average time required for inference is 0.1094 ± 0.0791 s. The large discrepancy indicates that several CNN models (MobileNetV1, MobileNetV2, and EfficientNetV2B0-21k) are capable of achieving high accuracy while requiring minimal inference time. In comparison, certain CNN models (ResNetV1-50, ResNetV2-50) can achieve high accuracy only at the expense of a long inference time. Although IDC grading applications prioritise accuracy over speed, embedded systems such as the Nvidia Jetson TX1, TX2, and Raspberry Pi 3 (B +) require fast and light-weight CNN models. Real-time CNN applications66,67 implement embedded systems with a short inference time, low power consumption, and a small computational cost. As a result, deep learning techniques can be used to implement IDC grading applications.

As shown in Fig. 5, all seven CNN architectures achieved high scores for balanced accuracy, precision, recall, and F1 score (median score > 0.9). As a whole, these CNN models lie within an acceptable score range (> 0.9), except for EfficientNetB0, ResNetV1-50, and MobileNetV2. These findings indicate that the classic CNN models (ResNetV1-50 and ResNetV2-50) are comparable to more recent CNN models (EfficientNetB0 and the EfficientNetV2B0s). Choosing a CNN architecture may therefore not be the main concern in IDC grading; the user's needs (resource availability, hardware type, and cost) should instead be prioritised.

Figure 5
figure 5

The balanced accuracy, macro precision, macro recall, and macro F1 score of the seven CNN model architectures on the FBCG dataset as determined by the fivefold stratified CV are shown in this figure. The median score for each CNN model is indicated by the red lines. The EfficientNetV2B0-21k model achieved the highest maximum score in each metric, while the EfficientNetV2B0 model achieved the lowest minimum score. Except for the EfficientNetV2B0-21k and ResNetV2-50 models, the majority of CNN models scored above the 0.9 median score, were negatively skewed (median closer to the top quartile), and were more dispersed.

According to Fig. 6, complex and heavy-weight CNN models (ResNetV1-50 and ResNetV2-50) may not outperform simpler and light-weight CNN models (EfficientNetV2B0-21k, MobileNetV1, MobileNetV2). The EfficientNetV2B0-21k model achieved the highest accuracy score (0.9666) while requiring only 0.72B FLOPs and 7.1 M parameters. On the other hand, the ResNetV1-50 model achieved a low accuracy score (0.9253) despite having the highest FLOPs (4.1B) and parameters (25.6 M). CNN models with a high FLOP count do not always perform well in IDC grading applications; simpler CNN models can therefore be used to reduce computational costs while maintaining high performance. Similarly, the scatter plot demonstrates that heavy-weight (more parameters) CNN architectures do not always outperform light-weight (fewer parameters) CNN architectures. Despite its large number of parameters (25.6 M), the ResNetV1-50 model achieved a mediocre accuracy score (0.9253). In comparison, the EfficientNetV2B0-21k with 7.1 M parameters outperformed all other CNN models. As a result, it is more cost-effective to choose a light-weight CNN capable of producing relatively high accuracy.

Figure 6
figure 6

The floating-point operations (FLOPs) and parameters versus the balanced accuracy of the seven CNN models on the FBCG dataset from the fivefold stratified CV (FLOPs on the x-axis depict the number of operations in billions, while the radius of each circle represents the number of parameters in millions). The EfficientNetV2B0-21k model scored the highest (0.9666) with relatively low FLOPs (0.72B) and parameters (7.1 M). The ResNetV1-50 model achieved a low accuracy score (0.9253) with the highest FLOPs (4.1B) and parameters (25.6 M). Most of the CNN models scored average accuracy scores between 0.93 and 0.94. Generally, the average accuracy score increases with the FLOPs, except for the EfficientNetV2B0, ResNetV1-50 and ResNetV2-50. There is no evidence that CNN models with more parameters (ResNetV1-50 and ResNetV2-50) are more accurate.

According to Fig. 7, most CNN models (except ResNetV1-50 and ResNetV2-50) can generate predictions in less than 0.1 s. MobileNetV1 predicts outputs the fastest (inference time = 0.0424 s), while ResNetV2-50 predicts outputs the slowest (inference time = 0.2277 s). As a result, MobileNetV1 would be more suitable for real-time applications such as breast cancer detection on mobile devices61 and skin disease classification on smartphones68. However, with a short inference time (0.0758 s), the EfficientNetV2-B0-21k outperformed all CNN models (balanced accuracy = 0.9666). As a result, the EfficientNetV2-B0-21k can provide the best of both worlds (accuracy and inference time). With regards to the training time parameter, all CNN models can be trained in less than 0.6 h. ResNetV1-50 and ResNetV2-50 (heavy-weight) achieved lower accuracy at the expense of increased training time (0.5795 h and 0.5968 h). On the other hand, the EfficientNetV2B0-21k model outperformed all other CNN models (0.9666) despite requiring little training time (0.5592 h). As a result, the EfficientNetV2-B0-21k model is well-suited for applications that require high performance but require little training.

Figure 7
figure 7

This bar chart depicts the inference time (seconds) and training time (hours) for seven CNN models trained on the FBCG dataset using the fivefold stratified CV (a low inference time indicates that the CNN model can predict the result in a short period; a low training time indicates that the CNN model can be trained in a short period). The majority of CNN models (with the exception of ResNetV1-50 and ResNetV2-50) can predict outputs in less than 0.1 s. MobileNetV1 predicts outputs the fastest (inference time = 0.0424 s), while ResNetV2-50 predicts outputs the slowest (inference time = 0.2277 s). All selected CNN models can be trained in 0.6 h.

Table 8 summarises the final breast cancer grading results on the test set using a model retrained with all of the images from the training set. The receiver operating characteristic (ROC) curves, shown in Fig. 8, are generated by computing and plotting the true positive rate against the false positive rate for a binary classifier over a range of threshold values. The area under the curve (AUC) values in the figure show that all models perform nearly equally well, with Grade 0 versus the other grades achieving the highest average AUC. Figure 9 depicts the training versus validation loss curves, showing that the models can be built without obvious signs of overfitting.
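The per-class (one-vs-rest) ROC curves of Fig. 8 can be computed as sketched below, assuming y_test holds the integer grade labels of the test images and y_prob the model's SoftMax probabilities of shape (n_samples, 4):

```python
# Sketch of one-vs-rest ROC/AUC computation for the four grades.
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y_bin = label_binarize(y_test, classes=[0, 1, 2, 3])     # one binary column per grade
for grade in range(4):
    fpr, tpr, _ = roc_curve(y_bin[:, grade], y_prob[:, grade])
    print(f"Grade {grade} vs rest: AUC = {auc(fpr, tpr):.3f}")
```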

Figure 8
figure 8

ROC curves on the test set (20% of the FBCG dataset) for each of the seven chosen comparative CNN models (trained on the training set, i.e. the 80% of the FBCG dataset previously used for the fivefold stratified CV). On average, all the chosen models exhibited the highest performance in identifying Grade 0 and the lowest performance in identifying Grade 1 (except MobileNetV2).

Figure 9
figure 9

Training versus validation loss curves for each of the seven selected comparative CNN models, trained on the training set (80% of the FBCG dataset previously used for the fivefold stratified CV) and validated on the test set (20% of the FBCG dataset). The fluctuations and volatility (noise) in the curves are most likely the result of data augmentation. In general, none of the model curves indicates overfitting, since the validation loss curves are lower than the training loss curves.

Limitation of study

The dataset used in this study was inspired by Abdelli et al.26. As a result, the generated results are only applicable to the FBCG dataset and are comparable only to Abdelli et al.26, who used the same dataset. This research examined seven well-known, state-of-the-art CNN architectures (EfficientNetB0, EfficientNetV2B0, EfficientNetV2B0-21k, ResNetV1-50, ResNetV2-50, MobileNetV1, and MobileNetV2); additional CNN architectures were omitted due to time constraints and limited resources. The methodology involved end-to-end feature extraction via transfer learning using pre-trained CNN architectures; however, we omitted fine-tuning of the CNN architectures from our work. If fine-tuning is performed at the correct location within the model architecture, it can improve the performance of CNNs without inducing overfitting. We also omitted the effect of stain normalisation. Veta et al.69 asserted that the tissue preparation and histology staining processes could introduce colour discrepancies into images, impairing CNN training. However, as demonstrated in the study by Gupta et al.70, useful features and classifiers may obviate the need for stain normalisation.

Challenges

One of the difficulties we encountered in this work was the issue of overfitting. The adopted FBCG dataset is relatively small in comparison to other histopathological breast cancer datasets (e.g. BreaKHis); as a result, overfitting may occur when training more complex CNN architectures. To overcome this obstacle, we augmented the adopted dataset with augmentation layers (random flip, random rotation, and random zoom). Additionally, we included two dropout layers that randomly zero out input units at a specified rate during model training. Dealing with an imbalanced dataset was another difficulty encountered in this work, as it makes the CNN model prone to predicting the majority class. Thus, we applied the class weighting technique, giving the minority class a higher weight in the model cost function in order to impose a greater penalty on misclassifications of the minority class.

Conclusion

In this paper, we compared the performance of seven CNN architectures in an automated IDC grading application. The Four-Breast-Cancer-Grades (FBCG) dataset was classified into four grades using transfer learning: Grade 0, Grade 1, Grade 2, and Grade 3. The results showed that EfficientNetV2B0-21k outperformed all other CNN models in the fivefold stratified CV (balanced accuracy = 0.9666 ± 0.0185, macro precision = 0.9646 ± 0.0174, recall = 0.9666 ± 0.0185, and F1 score = 0.9642 ± 0.0184), despite having low FLOPs (0.72B), parameters (7.1 M), inference time (0.0758 ± 0.0001 s), and training time (0.5592 ± 0.0162 h). The EfficientNetV2B0-21k also achieved the highest balanced accuracy (0.9524) and macro recall (0.9524) on the test set. Similarly, the MobileNetV1 scored the highest balanced accuracy (0.9524), macro precision (0.9545), and macro recall (0.9524) in the test results. All CNN models, however, demonstrated significant capability in the automated IDC grading application, with an average balanced accuracy of 0.9361 ± 0.0189 in the fivefold stratified CV and 0.9308 ± 0.0211 in the test result. Choosing heavy-weight CNNs is not a problem for IDC grading applications, where accuracy is prioritised and computational resources are not a limiting factor. If future IDC grading applications require real-time settings, a smaller and faster CNN (MobileNetV2) would be preferable. For future development, we may expand our work by comparing more recent state-of-the-art CNN architectures. In addition, we may consider a variety of breast cancer histopathological datasets for our comparative performance analysis.