1 Introduction

Skin cancer has been increasing among men and women worldwide for many decades [32]. There were approximately 76,250 new cases of melanoma and approximately 8,790 melanoma-related deaths during 2012 in the United States [51]. In Brazil, an estimated 165,580 new cases of non-melanoma skin cancer were expected for the biennium of 2018–2019 [34]. The rising number of reported cases results from many factors, such as the increased longevity of the population, exposure to the sun, and earlier detection of skin cancer. Dermoscopy, a noninvasive skin imaging technique, is one of the most effective ways for early identification of skin cancer. The appearance of skin lesions in dermoscopic images may change significantly according to the skin condition. In addition, different artifact sources, including hair, skin texture, and air bubbles, may lead to misidentifying the boundary between the skin lesions and the surrounding healthy skin.

Despite the effectiveness of dermoscopic diagnosis for skin cancer, it is very difficult for expert dermatologists to provide an accurate classification of malignant melanoma and benign skin lesions for a large number of dermoscopic images. Hence, it is necessary to build a non-invasive Computer Aided Diagnosis (CAD) system for skin lesion classification. A CAD system generally comprises four main steps: image pre-processing, segmentation, feature extraction, and classification. Note that each step significantly affects the classification performance of the whole CAD system [50]. Therefore, efficient algorithms should be employed in each step to achieve high diagnosis performance.

Various studies examined distinct machine learning methods for diagnosis of different types of cancer [7, 11, 33, 45]. The majority of these studies used classifiers trained on a set of hand-crafted features captured from the images. Most machine learning techniques require high computational time for accurate diagnosis, and their performance depends on the selected features that characterize the cancerous region. Deep learning techniques and Convolutional Neural Networks (CNNs) have become important for automated diagnosis of different types of cancer [3]. Deep learning has achieved impressive results in image classification applications. In image classification tasks, transfer learning [29] and data augmentation [38] are employed to overcome the lack of data and to reduce the computational and memory requirements.

Transfer learning is an effective tool for classification problems with few available datasets like medical imaging applications. Instead of training a CNN from scratch, which would require a huge amount of data and a high computational cost, it is computationally efficient to employ a pre-trained CNN architecture and just fine-tune its performance to accelerate the operation. Various pre-trained CNNs, including AlexNet, Inception, ResNet, and DenseNet [8, 37] were trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset. Some studies employed distinct pre-trained CNNs and reported reasonable diagnosis performance for skin cancer [2, 13, 26, 30, 31, 36].

Data augmentation is a technique for increasing the size of the input data by generating new data from the original data [38]. Image augmentation techniques can be employed to overcome the lack of skin cancer datasets. There are many data augmentation strategies, including rotation, scaling, random cropping, and color modifications. Data augmentation is broadly employed with pre-trained CNN architectures. In [30], the performance of skin lesion classification was investigated with 13 data augmentation techniques trained on three CNNs (Inception-v4, ResNet, and DenseNet), and the results revealed the positive impact of using data augmentation in both training and testing phases. An automatic skin lesion classification system based on the AlexNet CNN architecture was proposed in [13]. The architecture weights were fine-tuned, and the datasets were augmented by fixed and random rotation angles, yielding an average accuracy of 95.91%.

In [26], a deep learning approach based on a neural network ensemble was employed to classify skin lesions using dermoscopic images. The melanoma classification system of [31] was developed for extracting asymmetry, border irregularity, color variation, diameter, and texture of the skin lesion and classifying malignant melanoma and benign skin lesions using a Support Vector Machine (SVM) classifier. In [2], a deep CNN prediction model based on a new regularizer technique was employed for skin lesion classification with an accuracy of 97.49%. In the work of [36], a Gabor wavelet-based CNN model for skin lesion diagnosis was developed by applying Gabor filters to skin images to obtain detailed characteristics of the skin lesions and then modeling these details using deep CNN models. Decision fusion based on the sum rule was used for classifying the skin lesions with an accuracy of 83%.

The majority of existing skin lesion diagnosis systems report reasonable classification results for discriminating malignant melanoma from benign lesions. However, it is still very difficult to develop an efficient automated framework that can provide very accurate diagnosis results while running in real time on low hardware specifications. The aim of this study is to propose an accurate diagnosis system for skin lesion classification in dermoscopic images. The proposed approach is based on employing efficient image processing techniques for pre-processing, data augmentation, and image segmentation, along with investigating different pre-trained CNN architectures for skin lesion classification. The proposed framework is examined on two of the most common dermoscopic image databases: the International Skin Imaging Collaboration (ISIC) 2017 [15] and the MNIST: HAM10000 [17].

The main contributions of this work can be summarized as follows: (1) A new real-time automated CAD system is proposed for skin lesion classification with high performance and low computational complexity. (2) The proposed framework is based on new pre-processing techniques for hair removal and lesion identification, Grab-cut in HSV color space for lesion segmentation, and automatic detection of the ABCD features for discriminating malignant melanoma from benign lesions. (3) Different pretrained CNNs, including VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet, are examined for skin lesion classification. (4) The impact of applying data augmentation to all pretrained CNN models under investigation is examined using several evaluation metrics, including area under the ROC curve, accuracy, sensitivity, precision, F1-score, and computational time, with 5-fold cross-validation. The results reveal the positive influence of data augmentation on the diagnosis performance. Extensive experiments have been performed to evaluate the proposed framework, and the experimental results reveal its superior performance over other existing state-of-the-art diagnosis systems in terms of the classification rate, the sensitivity, the specificity, and the computational efficiency. This demonstrates that the proposed framework represents an efficient tool to assist dermatologists in real-time skin lesion classification.

The remaining parts of this paper are structured as follows. Section 2 presents the techniques used in pre-processing, segmentation, feature extraction, and classification. Section 3 shows the experimental results of the proposed skin lesion classification system using real dermoscopic images, followed by a comparison between the proposed framework and other recent methods. In Section 4, the conclusions are presented.

2 Methodology

The proposed framework for skin lesion classification comprises four main stages: image pre-processing, segmentation, feature extraction, and classification. The structure of the proposed approach is shown in Fig. 1.

Fig. 1 The structure of the proposed skin cancer CAD system

2.1 Image pre-processing

The pre-processing step is used to eliminate artifact sources such as bubbles and hair from skin images. This step is essential before any segmentation or feature extraction to provide accurate diagnosis. A new approach based on morphological filtering is investigated to remove the hair from the skin images [35]. This technique is divided into three main steps. First, the color images are converted into grayscale versions using the weighted process of [47], where the grayscale image is obtained from the RGB color space by the relation: grayscale \(=0.3 R+0.59 G+ 0.11 B\). Second, the hair contour of the grayscale images is identified using the morphological black-hat transformation [52]. Finally, an inpainting function based on the Fast-Marching Method (FMM) [42] is applied using a binary mask. In the current work, the mask is obtained by thresholding: when the pixel value is lower than the threshold, it is set to 0; otherwise it is set to 1.
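The grayscale conversion, black-hat transformation, and thresholding described above can be sketched in plain Python as follows. This is a minimal illustration on nested lists, not the authors' implementation: the 3×3 structuring element, the threshold value of 10, and all function names are assumptions chosen for the example, and the FMM inpainting step is not reproduced.

```python
def to_gray(rgb):
    """Weighted grayscale conversion: 0.3 R + 0.59 G + 0.11 B."""
    return [[0.3 * r + 0.59 * g + 0.11 * b for (r, g, b) in row] for row in rgb]


def _morph(img, k, op):
    """Apply op over a k x k window (max = dilation, min = erosion)."""
    h, w, r = len(img), len(img[0]), k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(op(window))
        out.append(row)
    return out


def black_hat(gray, k=3):
    """Black-hat = morphological closing (dilate, then erode) minus the image."""
    closed = _morph(_morph(gray, k, max), k, min)
    return [[c - g for c, g in zip(c_row, g_row)]
            for c_row, g_row in zip(closed, gray)]


def hair_mask(gray, k=3, threshold=10):
    """Binary mask: 0 where the black-hat response is below the (assumed)
    threshold, otherwise 1."""
    return [[0 if v < threshold else 1 for v in row] for row in black_hat(gray, k)]


# Toy 5x5 bright patch with one dark vertical "hair" in column 2.
patch = [[(20, 20, 20) if x == 2 else (200, 200, 200) for x in range(5)]
         for _ in range(5)]
mask = hair_mask(to_gray(patch))
```

On the toy patch, `mask` marks only the dark column, which is the region an inpainting routine would subsequently fill.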

After the pre-processing step, each pre-processed skin image is augmented into four images by rotating the input image by 0°, 90°, 180°, and 270°. Data augmentation is investigated in the current work to increase the data size, generate new data from the original input data, and overcome the lack of labeled images.
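The four-way rotation can be illustrated with a short sketch. The function names are hypothetical, and the clockwise direction is an assumption, since the text does not state the direction of rotation.

```python
def rotate90(img):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment_by_rotation(img):
    """Return the image rotated by 0, 90, 180, and 270 degrees."""
    views = [img]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views


views = augment_by_rotation([[1, 2],
                             [3, 4]])
```

Each input image yields four views, so a set of N images grows to 4N after augmentation.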

2.2 Grab-cut skin lesion segmentation

After the pre-processing step, the images are segmented automatically using the Grab-cut segmentation technique [22]. In this technique, the boundary regions are identified, i.e., the regions between the lesion and the background and the differences within the lesion. All skin images are converted into HSV color space to separate image intensity from color information, which is a significant step for the adaptive histogram equalization process, as shown in Fig. 2. Using the Gaussian Mixture Model (GMM) [54], the foreground region is extracted from the Grab-cut mask resulting from the adaptive histogram equalization process. Any region outside the Grab-cut bounding rectangle is taken as background. The segmentation results are quantitatively evaluated by determining the overlap with the ground truth. Generally, the Jaccard Coefficient (JC) is the most frequently used evaluation metric in image segmentation analysis. The JC measures pixel similarity between the ground truth and the segmented image to examine which pixels are matched and which are unmatched. It is defined as the intersection over union between the segmentation result and the ground truth [16]. Its value lies between 0 and 1, with 1 showing perfect overlap and 0 indicating no overlap. The segmented images with their JC values are shown in Table 1.

Fig. 2 Conversion from RGB into HSV color space
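As a small illustration of how HSV separates intensity from color, the sketch below uses Python's standard `colorsys` module on single pixels. The helper names are hypothetical, and treating the V channel as the intensity plane passed to histogram equalization is an assumption, since the text does not name the channel.

```python
import colorsys


def rgb_to_hsv_pixel(r, g, b):
    """Convert an 8-bit RGB pixel to (hue, saturation, value), each in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def value_channel(rgb_image):
    """Extract the V (intensity) plane from an image given as rows of RGB tuples."""
    return [[rgb_to_hsv_pixel(*px)[2] for px in row] for row in rgb_image]


h, s, v = rgb_to_hsv_pixel(255, 0, 0)           # pure red
gray_h, gray_s, gray_v = rgb_to_hsv_pixel(128, 128, 128)  # mid gray
```

A gray pixel has zero saturation, so all of its information sits in V, whereas a pure red pixel carries full saturation and full value.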

Table 1 Segmentation results using Grab-cut technique
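The Jaccard Coefficient used to score the segmentation can be computed directly from two binary masks. The following is a minimal sketch; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def jaccard_coefficient(segmented, ground_truth):
    """Intersection over union of two binary masks (lists of rows of 0/1)."""
    intersection = 0
    union = 0
    for s_row, g_row in zip(segmented, ground_truth):
        for s, g in zip(s_row, g_row):
            if s and g:
                intersection += 1
            if s or g:
                union += 1
    # Assumed convention: two empty masks overlap perfectly.
    return intersection / union if union else 1.0


segmented = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
jc = jaccard_coefficient(segmented, ground_truth)
```

Here two pixels are shared and four pixels belong to at least one mask, so the coefficient is 2/4.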

2.3 Feature extraction

To distinguish malignant melanoma from benign skin lesions, the main features of skin lesions should be extracted. The ABCD rule of dermoscopy [28] is investigated in the current study for skin lesion identification. The ABCD features are extracted from the intrinsic physical attributes of skin lesions, including asymmetry, border irregularity, color variation, and dermoscopic patterns. The ABCD rule is used to characterize the geometrical and structural lesion properties and to separate melanoma from benign lesions because of its effectiveness, efficiency, and simplicity of performance and implementation. Table 2 shows the results of ABCD rule of dermoscopy. The four main components of this method can be summarized as follows [19]:

  • Asymmetry A: Axes of asymmetry are obtained by drawing two orthogonal axes crossing at the center of gravity of the lesion. For a given axis, the lesion is judged symmetric or asymmetric based on the symmetry along that axis for all measures, namely color, brightness, and shape. The asymmetry score is zero if there is no asymmetry along either axis. A score of one is given if there is asymmetry along one axis, and a score of two is given in the case of asymmetry along both axes.

  • Border B: The lesion is partitioned into eight parts (slices). A border score is assigned according to the existence of a sharp or gradual cut-off at the periphery of the skin lesion. Each slice receives a score of zero for a regular periphery and a score of one for an irregular boundary, so the border score varies between zero and eight according to the number of slices with irregular periphery. Vector product and inflexion point descriptors are utilized to extract the number of peaks, valleys, and straight lines at lesion edges corresponding to clear irregular and small irregular peripheries, respectively [10].

  • Color C: It is one of the most important discriminators of malignant melanoma from benign lesions. Lesions with at least one of six colors, namely white, red, black, light brown, dark brown, and blue-gray, are probably melanomas. The existence of one suspicious color is confirmed when the number of pixels of that color exceeds 5% of the total number of pixels of the lesion [43]. More than one suspicious color may occur in a lesion, and the color score is increased by one for each existing color. The color score ranges from one to six.

  • Dermoscopic structures D: The existence of five distinct structures, namely network, structureless areas, branched streaks, dots, and globules, within the lesion is investigated. The existence of each structure within the lesion increments the score by one. The minimum structure score is zero and the maximum structure score is five.

Table 2 Feature extraction using ABCD rule of dermoscopy
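The four component scores described in the list above can be tallied as in the following sketch. It assumes the per-axis asymmetry flags, per-slice irregularity flags, per-color pixel counts, and detected structures have already been obtained by earlier image analysis; all function and variable names are hypothetical, and how the scores are combined downstream is not shown.

```python
SUSPICIOUS_COLORS = ("white", "red", "black", "light brown", "dark brown", "blue-gray")
STRUCTURES = ("network", "structureless areas", "branched streaks", "dots", "globules")


def asymmetry_score(asymmetric_axis_1, asymmetric_axis_2):
    """0 if symmetric along both axes, 1 if asymmetric along one, 2 if both."""
    return int(bool(asymmetric_axis_1)) + int(bool(asymmetric_axis_2))


def border_score(irregular_slices):
    """Count the slices (out of eight) whose periphery is irregular."""
    if len(irregular_slices) != 8:
        raise ValueError("expected flags for exactly eight slices")
    return sum(1 for flag in irregular_slices if flag)


def color_score(pixel_counts, lesion_pixels, min_fraction=0.05):
    """Count the colors covering more than 5% of the lesion pixels.
    Note: this returns 0 if no color passes the test, whereas the text
    reports a range of one to six."""
    limit = min_fraction * lesion_pixels
    return sum(1 for c in SUSPICIOUS_COLORS if pixel_counts.get(c, 0) > limit)


def structure_score(present_structures):
    """Count which of the five dermoscopic structures are present."""
    return sum(1 for s in STRUCTURES if s in present_structures)


scores = {
    "A": asymmetry_score(True, False),
    "B": border_score([1, 0, 0, 1, 1, 0, 0, 0]),
    "C": color_score({"white": 30, "red": 100, "black": 600, "light brown": 40}, 1000),
    "D": structure_score({"dots", "globules", "network"}),
}
```

In this example only red and black exceed the 5% limit of 50 pixels, so the color score is 2.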

2.4 Classification with CNN

In this step, the resulting ABCD dermoscopy images are fed into a pretrained deep CNN architecture. Different pretrained CNN architectures, including VGG-16 [39], ResNet50 [18], ResNetX [9], InceptionV3 [41], and MobileNet [14], are investigated for skin lesion classification. Note that designing a new deep CNN from scratch is hard because (i) it needs a huge amount of labeled training data and considerable expertise to guarantee efficient convergence; (ii) the process is time-consuming and tedious; (iii) it may suffer from overfitting problems. Instead of training a CNN from scratch, it is computationally efficient to employ a pre-trained CNN model and just fine-tune its performance to accelerate the operation. All pre-trained CNNs considered in the current study were trained on a huge dataset, namely ILSVRC. They were trained on approximately 1.2 million training images with another 50,000 images for validation and 100,000 images for testing. In this work, we use the weights pretrained on ImageNet to fine-tune the model. This allows the pretrained network to learn a new task faster and more easily than a network trained from scratch using randomly initialized weights.

For all pretrained CNN architectures under investigation, classification results are examined, and the results reveal that the VGG-16 and ResNet50 architectures achieve the best performance. In this work, a batch size of 40/16, a learning rate of \({10}^{-3}\)/\({10}^{-4}\), and a momentum of 0.8/0.9 are, respectively, the best values for yielding the highest diagnosis performance of the ResNet50 and VGG-16 architectures. Detailed layer information for the ResNet50 and VGG-16 architectures is shown in Tables 3 and 4, respectively. The proposed framework is enhanced by combining the linear kernel SVM classifier [4] with both VGG-16 and ResNet50 architectures and investigating their classification results.

Table 3 Summary of the CNN layers for the ResNet50 structure

To validate the diagnosis performance and avoid any bias, the k-fold cross-validation technique is employed [40]. Usually, the value of k is selected a priori and fixed. Generally, increasing the value of k from 2 to 10 decreases the bias, increases the variance of the test error rate estimate, and adds computational burden [25]. Note that the computational time tends to grow exponentially as k increases. Larger values of k are usually employed to use a larger number of samples for training at the expense of degrading the error estimation of the classifier [12]. In the current study, the training data is sufficient to use 5-fold cross-validation, which reduces the variance of the prediction error estimate and minimizes the computational cost. This is critical for practical implementation of the proposed CAD system in real-time applications. Also, obtaining a tight and rigorous estimate of the classifier error allows validating, in a statistical sense, the reliability of the model [12]. Moreover, when employing 5-fold cross-validation, a larger fraction of samples can be kept for the classification phase. For the data under investigation, 5-fold cross-validation represents a good compromise between the percentage of data used for training and the rigor of the estimated error.

Table 4 Summary of the CNN layers for the VGG-16 structure

3 Results

To examine the usefulness of the proposed skin lesion classification approach, 580 images (310 benign and 270 malignant) from the ISIC 2017 database and 350 images (215 benign and 135 malignant) from the MNIST: HAM10000 database are investigated. The data under investigation covers distinct cases of abnormality, patient age, and patient gender, as shown in Table 5. In order to increase the total number of images and allow more data for training and testing, each image is augmented into four images of different directions of rotation, providing a total of 2320 images from the ISIC 2017 database and 1400 images from the MNIST: HAM10000 database. Evaluation of image classification depends mainly on computing four parameters: the number of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP). The classification performance is assessed in terms of Accuracy (\(ACC\)), Sensitivity (\(Se\)) or recall, Precision (\(Pr\)), F1-score, area under the ROC curve (\(AUC\)), and computational time. \(ACC\) is the fraction of correct predictions, \(Pr\) is the fraction of detected malignant cases that agree with the ground truth, and \(Se\) is the fraction of true malignant cases that are identified as malignant. F1-score is the harmonic mean of \(Pr\) and \(Se\). \(AUC\) measures the entire two-dimensional area underneath the ROC curve and gives an aggregate performance evaluation across all possible classification thresholds. The first four metrics are defined as follows:

$$ACC=(TP+TN)/(TP+FP+TN+FN) \tag{1}$$

$$Pr =TP/(TP+FP) \tag{2}$$

$$Se =TP/(TP+FN) \tag{3}$$

$$F1=2 (Pr\times Se)/(Pr+Se) \tag{4}$$

Table 5 Distribution of images taken from the ISIC 2017 and MNIST: HAM10000 databases
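Equations (1) to (4) translate directly into code, as in the sketch below. The function names and the example counts are illustrative assumptions, not results from this study, and the sketch assumes all denominators are non-zero. The `auc` helper uses the standard rank interpretation of the area under the ROC curve: the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one, with ties counted as one half.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, and F1-score from confusion counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)   # Eq. (1)
    pr = tp / (tp + fp)                     # Eq. (2)
    se = tp / (tp + fn)                     # Eq. (3)
    f1 = 2 * (pr * se) / (pr + se)          # Eq. (4)
    return {"ACC": acc, "Pr": pr, "Se": se, "F1": f1}


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (labels: 1 = malignant, 0 = benign)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative counts only.
m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
a = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form of `auc` is quadratic in the number of samples, which is acceptable for a sketch but not for large test sets.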

All images under investigation (2320 augmented images from the ISIC 2017 database and 1400 augmented images from the MNIST: HAM10000 database) are divided into two main partitions: 1120 for training and validation and 280 for testing using the MNIST: HAM10000 database, and 1856 for training and validation and 464 for testing using the ISIC 2017 database. To validate the diagnosis performance and avoid any possible bias, 5-fold cross-validation is utilized, where the training and validation data (1120 from the MNIST: HAM10000 database and 1856 from the ISIC 2017 database) are partitioned into five equal-sized groups; each time, one group is retained as a validation set and the remaining four groups are used as a training set. The classification metrics are computed for each fold on the test set of each database.
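The partitioning described above can be sketched as follows. The 20% test share is inferred from the reported counts (464 of 2320 and 280 of 1400); the function names are hypothetical, and the sketch uses contiguous, unshuffled indices because the text does not specify shuffling or stratification. When the sample count is not divisible by five, the first folds receive one extra sample.

```python
def split_counts(n_images, test_percent=20):
    """Return (train_val_count, test_count) for a hold-out split."""
    n_test = n_images * test_percent // 100
    return n_images - n_test, n_test


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


isic_train_val, isic_test = split_counts(2320)
ham_train_val, ham_test = split_counts(1400)
folds = list(k_fold_indices(isic_train_val, k=5))
```

Every sample appears in exactly one validation fold and in the training set of the other four folds.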

For all pretrained CNN architectures under investigation (VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet), training and validation loss curves along with training and validation accuracy curves are investigated, and the results reveal that the VGG-16 and ResNet50 architectures achieve the best performance. For example, the training progress versus the number of epochs for the ResNet50 architecture is shown in Fig. 3a. The ResNet50 network trains faster and reaches better results than the other CNN networks. Also, it achieves higher validation accuracy than the other CNN architectures, as shown in Fig. 3a. The VGG-16 architecture achieves performance comparable to that of the ResNet50 network. Therefore, a critical comparison is made between the VGG-16 and ResNet50 architectures for skin lesion classification under various parameter regimes.

Fig. 3 Training and validation (a) accuracy and (b) loss versus number of epochs for the ResNet50 architecture

Table 6 shows the 5-fold benign-malignant classification results and standard deviations of the proposed system using the VGG-16 and ResNet50 architectures for the ISIC 2017 database with data augmentation. The 5-fold average classification results of \(AUC\), \(ACC\), \(Pr\), \(Se\), and F1-score are calculated by averaging the 5-fold classification results (see Table 6). The average classification results for each metric were calculated for all CNN architectures under investigation (VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet), showing the better performance of the ResNet50 architecture in terms of all metrics (\(ACC\), \(AUC\), \(Pr\), \(Se\), and F1-score). Similar results are obtained for all CNN architectures using the MNIST: HAM10000 database, and the results show the higher performance of the ResNet50 architecture, as shown in Table 7.

Table 6 Average 5-fold classification results and standard deviation for ResNet50 and VGG-16 using 2320 images from the ISIC 2017 database
Table 7 Average 5-fold classification results and standard deviations for VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet architectures with and without SVM for both ISIC 2017 and MNIST: HAM10000 databases with and without data augmentation

We further enhance the proposed system by combining the kernel SVM classifier with all CNN architectures and investigating their classification results. Table 7 shows the average 5-fold benign-malignant classification results and standard deviations of the VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet architectures with and without SVM for both the ISIC 2017 and MNIST: HAM10000 databases with and without data augmentation. It can be noted that the results of either the VGG-16 or the ResNet50 architecture with SVM are better than those of the standalone VGG-16 or ResNet50. Also, the performance of the ResNet50 architecture combined with SVM is superior to that of the VGG-16 architecture with SVM in terms of average \(ACC\) (99.87% versus 98.66%), \(AUC\) (99.52% versus 97.63%), \(Se\) (98.87% versus 95.49%), \(Pr\) (98.77% versus 96.79%), and F1-score (97.83% versus 95.77%) for the ISIC 2017 database. The same superior performance of the ResNet50 architecture with SVM is achieved for the MNIST: HAM10000 database. Note that the ResNet50 architecture with SVM provides smaller standard deviation values than VGG-16 with or without SVM, showing its reliability and consistency in skin lesion classification. Comparing the classification results with and without data augmentation for both databases reveals that data augmentation increases the size of the data used for training and testing, and hence enhances the classification results in all cases.

3.1 Performance analysis

The proposed ResNet50 CNN architecture combined with SVM for skin lesion classification is compared with other recent systems [1, 2, 5, 6, 13, 19, 20, 23, 24, 30, 31, 36, 46, 48, 49, 53], and the results are shown in Table 8. The average classification results using 5-fold cross-validation are presented to validate the performance of the proposed system when comparing it with other techniques. Table 8 reveals that we investigated a larger number of lesion images than most of the other techniques under comparison. It can be noted that the proposed approach outperforms other diagnosis systems in terms of \(ACC\) (99.87% for ISIC 2017 and 98.97% for HAM10000) and \(AUC\) (99.52% for ISIC 2017 and 97.91% for HAM10000). The proposed system achieves the highest \(Se\) compared with other techniques [1, 2, 13, 19, 20, 24, 31, 46, 48, 49, 53], revealing its effectiveness in classifying patients with skin cancer. Note that sensitivity is a more important evaluation metric than other metrics for dermatologists and clinicians because low sensitivity, i.e., a high rate of false negatives (malignant identified as benign), may result in death. In this study, very few malignant lesions are misclassified, and consequently the proposed approach has superior sensitivity results. Table 8 demonstrates the superior diagnosis results of the proposed CAD system over recent state-of-the-art methods.

Different pretrained CNN architectures, including VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet, were investigated for skin cancer diagnosis, and the results show that the ResNet50 architecture achieves the best performance. A detailed comparison between all these CNN architectures is made, and the results demonstrate the better performance of the ResNet50 architecture with low computational requirements. It is well known that there are various pretrained CNN structures [27], and we have already examined most of them. Deeper CNN models mean more computational burden, which is not practical for real-time diagnosis systems. Also, the use of more layers increases the number of free parameters, which may lead to over-fitting and performance degradation. In the current study, the data variability is smaller than in other image classification applications, and hence deeper CNNs are not computationally effective for real-time diagnosis. The CNN models selected for this study represent a proper trade-off between speed and accuracy.

The overall average computational time of the proposed ResNet50 architecture combined with SVM for skin lesion classification is around 3.23 s on a Kaggle Notebook GPU cloud (2 CPU cores and 13 GB RAM) using Python. Most of the other studies under comparison did not report any computational time analysis. In [53], the computational time was reported for an NVIDIA GTX Titan XP GPU, which was used for training the model in around 30 h with an average test time of 0.02 s per patch. Despite the low computational time of [53] compared to the proposed approach, it has high hardware requirements and provides low classification performance (see Table 8). The proposed approach achieves high diagnosis performance while achieving reasonable computational efficiency using low hardware specifications. Compared with the deep learning model of [21], which achieved a high \(ACC\) of 94.5%, the proposed framework is more computationally efficient in terms of testing time (3.23 s versus 6.9 s). This elucidates the usefulness of the proposed framework in improving the skin lesion classification performance through real-time systems.

Despite the high performance of the proposed skin lesion classification system, there are a few limitations that can be investigated in the future. First, the number of pre-trained CNN architectures under investigation can be increased to incorporate more recent and advanced pre-trained models. Second, the process of extracting comprehensive features such as ABCD features may have some practical limitations, like misclassifying melanoma of homogeneous color and regular shape. Future work on this topic may involve the use of new deep learning techniques for further improvement of artifact removal and structure detection. Also, the hardware design and implementation of the proposed CAD system using high dimensional chaotic circuits [44] can be a goal for future investigation. These points are challenges that will be explored and reported in the near future.

Table 8 Comparison between the proposed diagnosis system and several recent classification methods based on different image processing techniques and CNN architectures using different datasets

4 Conclusions

In this paper, a new CAD system is proposed for automated skin lesion classification using low hardware and software requirements. The proposed approach investigates new pre-processing techniques for hair removal and lesion identification, Grab-cut in HSV color space for lesion segmentation, the ABCD rule for discriminating malignant melanoma from benign lesions, image data augmentation, and pretrained CNN architectures for classification. Different pretrained CNN structures such as VGG-16, ResNet50, ResNetX, InceptionV3, and MobileNet are investigated for skin lesion classification. To investigate the usefulness of the proposed approach, real skin images covering several cases from the ISIC 2017 and HAM10000 databases are examined. The use of data augmentation is investigated for overcoming the problem of data limitation and examining its impact on the performance of benign-malignant classification for all CNN architectures under investigation. The classification performance is assessed using several metrics, including \(AUC\), \(ACC\), \(Se\), \(Pr\), F1-score, and consumed time. Experimental results reveal that the ResNet50 CNN architecture combined with SVM and data augmentation provides very accurate skin lesion classification with an average \(AUC\) of 99.52%, \(ACC\) of 99.87%, \(Se\) of 98.87%, \(Pr\) of 98.77%, and F1-score of 97.83% for the ISIC 2017 dataset, and \(AUC\) of 97.91%, \(ACC\) of 98.79%, \(Se\) of 97.78%, \(Pr\) of 96.86%, and F1-score of 97.45% for the HAM10000 dataset, while requiring low computational time. The results highlight the positive impact of data augmentation on the diagnosis performance. The proposed method is compared with recent skin lesion classification systems, and the results elucidate the superior classification results of the proposed method over the other systems under comparison. This demonstrates that the proposed framework can be used to assist less experienced dermatologists and clinicians in classifying various skin lesions.