1 Introduction

Skin cancer is caused by mutations within the DNA of skin cells, which cause their abnormal multiplication [1, 2]. In the early development of skin cancer, lesions appear on the outer layer of the skin, the epidermis. Not all lesions are caused by malignant tumours, and a diagnosis classifying the lesion as either malignant (cancerous) or benign (non-cancerous) is often reached based on preliminary visual inspection followed by a biopsy. Early detection and classification of lesions is important because early diagnosis of skin cancer significantly improves the prognosis [3].

The visual inspection of potentially malignant lesions carried out using an optical dermatoscope is a challenging task and requires a specialist dermatologist. For instance, according to Morton and Mackie [4], in the case of melanoma, a particularly aggressive type of skin cancer, only about 60–90% of malignant tumours are identified based on visual inspection, and accuracy varies markedly depending on the experience of the dermatologist. As skilled dermatologists are not available globally and for all ethnic and socioeconomic groups, the situation causes notable global health inequalities [5].

For these reasons, machine learning techniques have been widely studied in the literature. Machine learning has the potential to aid automatic detection of skin cancer from dermoscopic images, thus enabling early diagnosis and treatment. Murugan et al. [6] compared the performance of K-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM) classifiers on data extracted from segmented regions of dermoscopic images. Similarly, Ballerini et al. [7] used a KNN-based hierarchical approach for classifying five different types of skin lesions. Thomas et al. [8] used deep learning based methods for classification and segmentation of skin cancer. Lau and Al-Jumaily [9] proposed a technique based on a multi-layer perceptron (MLP) and other neural network models. Chaturvedi et al. [10] used a deep convolutional neural network (CNN) for the classification of skin cancer into different classes. Jinnai et al. [11] used a region-based CNN for efficient detection of skin cancer. Nawaz et al. [12] introduced a deep learning based skin cancer classification technique in which the visual appearance of the image is first improved during data pre-processing, after which the output of a region-based CNN is passed to fuzzy k-means clustering in order to segment the cancer-affected region; the performance of the technique was evaluated on different datasets. A recent review by Chan et al. [13] summarizing many of these studies concluded that while many authors reported better sensitivity and specificity than dermatologists, “further validation in prospective clinical trials in more real-world settings is necessary before claiming superiority of algorithm performance over dermatologists.”

What all the aforementioned methods have in common is that they require large amounts of training data in the form of dermoscopic images together with labels indicating the correct diagnosis. Several authors have proposed approaches to reduce the amount of training data required to reach satisfactory classification accuracy. Hosny et al. [14] describe a method based on transfer learning, which is a way to exploit available training data collected for a different classification task than the one at hand. Hosny’s technique is based on a pre-trained AlexNet network (a specific deep learning architecture proposed by Krizhevsky et al. [15]) that was originally trained to classify images on the commonly used ImageNet dataset, and then adapted to perform skin cancer classification by transfer learning. Similarly, Dorj et al. [16] used a pre-trained AlexNet combined with an SVM classifier. Guo and Yang [17] utilized a ResNet network, another commonly used deep learning architecture by He et al. [18]. Li and Shen [19] use a combination of two deep learning models for the segmentation and classification of skin lesions. Hirano et al. [20] suggested a transfer learning based technique in which hyperspectral data is used for the detection of melanoma using a pre-trained GoogleNet model [21]. The same pre-trained model is used by Kassem et al. [22]. Esteva et al. [23] use a pre-trained Inception V3 model [21].

Another commonly used technique to improve classification accuracy when limited training data is available is ensemble learning [24, 25]. The idea is to combine the output of multiple classifiers, called base-learners, by using their outputs as the input to another model that is called a meta-learner, to obtain a consensus classification. Ensemble learning tends to reduce variability and improve the classification accuracy. Ain et al. [26] proposed a genetic programming (GP) based ensemble learning technique for skin cancer classification. In Ain’s approach, feature extraction is performed using GP, and the extracted feature space is then provided to an ensemble of classifiers for classifying input images into different classes. Mahbod et al. [27] proposed an ensemble-based hybrid technique involving pre-trained AlexNet, VGG16 [28], and ResNet models as feature extractors. Latent features extracted from these models are combined using SVM and logistic regression classifiers as meta-learners.

In addition to the shortage of large quantities of labeled training data, many clinical datasets have severe class imbalance: the proportion of positive cases tends to be significantly lower than that of the negative cases, see [29]. This reduces the amount of informative data points and lowers the accuracy of many machine learning techniques, and may create a bias that leads to an unacceptably high number of false negatives when the model is deployed in real-world clinical applications. To deal with the class imbalance issue, most of the previously reported techniques use data augmentation, i.e., oversampling training data from the minority (positive) class and/or undersampling the majority class. This tends to lead to increased computational complexity as the amount of training data is in some cases increased many-fold, and risks losing informative data points due to undersampling. Qin et al. [30] used a generative adversarial network (GAN) for generating high quality images, which ultimately improved the performance of a deep learning based classifier. Zunair and Hamza [31] proposed a technique that comprises two steps. In the first step, a CycleGAN [32] generative network is trained to do image-to-image translation from negative to positive samples in order to augment the minority (positive) class data. In the second step, the synthetic data along with the original data samples are used as the input to a combination of five (out of 16) layers of the VGG16 network, a pooling layer, and a softmax classification layer.

In this paper, we propose a new technique for skin cancer classification from dermoscopic images based on transfer learning and ensembles of deep neural networks. The motivation behind the proposed technique is to maximize the diversity of the base-learners at various levels during the training of the ensemble to improve the overall accuracy. In our proposed ensemble-based technique, a number of CNN base-learners are trained on input images scaled to different sizes between \(32\times 32\) and \(256\times 256\) pixels. Two of the six base-learners are CNNs pre-trained on another skin cancer dataset that is not part of the primary dataset. In the second step, all the predictions from the base-learners along with the meta-data provided with the input images, including, e.g., the age and gender of the subject, are provided to an SVM meta-learner to obtain the final classification. By virtue of training the base-classifiers on input images of different sizes, the model is able to focus on features in multiple scales at the same time. The use of meta-data further diversifies the information, which improves the classification accuracy.

We evaluate the performance of the proposed technique on data from the International Skin Imaging Collaboration (ISIC) 2020 Challenge, which is highly imbalanced, containing less than 2% malignant samples [33]. Our experiments demonstrate that (i) ensemble learning significantly improves the accuracy even though the accuracy of each of the base-learners is relatively low; (ii) transfer learning and the use of meta-data have only a minor effect on the overall accuracy; (iii) overall, the proposed method compares favourably against all of the other methods in the experiments, even though the differences between the top performing methods fit within a statistical margin of error.

In Sect. 2, we describe the proposed method, including the architecture of the CNN blocks that are used as the base-learners in the ensemble, the transfer learning procedure, as well as the SVM meta-classifier. In Sect. 3, we describe the main dataset and its pre-processing and division into training, validation, and test sets; evaluation metrics are also discussed in the same section. Sect. 4 covers the experimental results in terms of accuracy as well as computational complexity, including additional results that characterize the impact of the ensemble and transfer learning on the performance. Limitations of the study and some open research questions suggested by our study are discussed in Sect. 5, followed by conclusions in Sect. 6.

2 The Proposed Method

The proposed technique is an ensemble-based technique in which CNNs are used as base learners. The base learners are either pre-trained on a balanced dataset collected from the ISIC archive or trained directly on the ISIC 2020 dataset, which is our target dataset. Predicted probabilities of the positive class from all the base learners along with the auxiliary data contained in the metadata associated with the images are used as input to an SVM classifier, which is trained to classify each image in the dataset as positive (malignant) or negative (benign). Figure 1 shows the flowchart of the proposed technique.

2.1 Architecture

In the proposed technique, six base-learners are used. Each of the base learners operates on input data of different dimensions. During training, four base learners, \(\mathrm {CNN}_{32\times 32}\), \(\mathrm {CNN}_{64\times 64}\), \(\mathrm {CNN}_{128\times 128}\), and \(\mathrm {CNN}_{256\times 256}\), are trained from random initial parameters on \(32\times 32\), \(64\times 64\), \(128\times 128\), and \(256\times 256\) input images, respectively. Another two base learners are trained on malignant and benign skin cancer images of sizes \(32\times 32\) and \(64 \times 64\), respectively, which are not part of the ISIC 2020 dataset. We used the F1-measure on the validation data to tune the architectures and the other hyperparameters as explained in more detail in Sect. 3 and “Appendix A”.
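As an illustration only, the six base-learners described above can be written down as a small configuration table. The class name `BaseLearnerSpec`, its fields, and the labels given to the two pre-trained models are hypothetical; the actual layer-by-layer architectures are those listed in the Appendix and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseLearnerSpec:
    """Illustrative description of one CNN base-learner (not the authors' code)."""
    name: str          # label of the base-learner
    input_size: int    # side length of the square RGB input image, in pixels
    pretrained: bool   # True if trained on the separate ISIC archive data

    @property
    def n_input_values(self) -> int:
        # height x width x 3 colour channels
        return self.input_size * self.input_size * 3


BASE_LEARNERS = [
    BaseLearnerSpec("CNN_32x32", 32, False),
    BaseLearnerSpec("CNN_64x64", 64, False),
    BaseLearnerSpec("CNN_128x128", 128, False),
    BaseLearnerSpec("CNN_256x256", 256, False),
    BaseLearnerSpec("CNN_32x32_pretrained", 32, True),   # hypothetical label
    BaseLearnerSpec("CNN_64x64_pretrained", 64, True),   # hypothetical label
]

for spec in BASE_LEARNERS:
    print(spec.name, spec.n_input_values, spec.pretrained)
```

Listing the learners this way makes the intended source of diversity explicit: the models differ in input scale and in the data they were trained on.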

Fig. 1 Block diagram of the proposed technique

2.2 Transfer Learning During Training of the Base-Learners

Transfer learning is used to transfer the knowledge extracted from one type of machine learning problem to another [34, 35]. The domain from which information is extracted is known as the source domain, and the domain where the extracted information is applied is called the target domain. The benefit of transfer learning is that it not only saves the time that is needed to train the network from scratch but also aids in improving the performance in the target domain.

In the proposed technique, the idea of transfer learning is exploited during the training phase of the base-learners. We pre-train some of the CNN base-learners on a balanced dataset collected from the ISIC archive (see Footnote 1). This archive dataset was constructed in 2019, so none of the ISIC 2020 data are included in it. The rest of the base-learners are trained on the ISIC 2020 dataset that comprises the target domain. The introduction of the CNNs pre-trained on balanced data provides a diverse set of predictions, complementing the information coming from the base-learners trained on the ISIC 2020 data. Moreover, since the pre-trained base-learners need to be trained only once instead of re-training them every time we repeat the experiment on random subsets of the ISIC 2020 data (see Sect. 3.2 below), the pre-training saves time.

2.3 SVM as a Meta-classifier

In the proposed technique, an SVM is used as a meta-learner. The predictions from the base learners, in the format of probabilities of the positive class, are fed as input to the SVM along with the metadata. The purpose of using multiple deep learning modules as an ensemble is to ensure that the SVM meta-classifier can benefit from diverse information extracted from the input images by the different base-learners, along with side information contained in the metadata.

We use a radial basis function (RBF) kernel of degree 2; for further details, see “Appendix A”.
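The input to the meta-learner can be sketched as follows. This is a minimal illustration, assuming that each image is represented by one predicted probability per base-learner followed by its encoded metadata values; the function name `build_meta_features` and all numbers in the example are made up for illustration.

```python
from typing import List, Sequence


def build_meta_features(
    base_probs: Sequence[Sequence[float]],
    metadata: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate base-learner outputs and metadata, one row per image.

    base_probs[k][i] is the predicted probability of the positive class
    from base-learner k for image i; metadata[i] holds the encoded
    metadata values of image i.
    """
    n_images = len(metadata)
    for probs in base_probs:
        if len(probs) != n_images:
            raise ValueError("every base-learner must score every image")
    rows = []
    for i in range(n_images):
        row = [probs[i] for probs in base_probs]  # one value per base-learner
        row.extend(metadata[i])                   # followed by the metadata
        rows.append(row)
    return rows


# Hypothetical scores for two images from six base-learners.
base_probs = [
    [0.10, 0.80],
    [0.05, 0.65],
    [0.20, 0.90],
    [0.15, 0.70],
    [0.30, 0.60],
    [0.25, 0.75],
]
# Hypothetical encoded metadata (three values per image).
metadata = [[45.0, 0.0, 2.0], [60.0, 1.0, 5.0]]

X_meta = build_meta_features(base_probs, metadata)
print(len(X_meta), len(X_meta[0]))  # 2 rows, 9 columns
```

Rows of this form could then be passed to an SVM implementation such as scikit-learn's `SVC` with `kernel="rbf"`. Note that in scikit-learn the `degree` parameter of `SVC` is only used by the polynomial kernel, so the exact kernel settings should be taken from “Appendix A”.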

3 Data and Evaluation Metrics

We use the ISIC 2020 Challenge dataset to train and test the proposed method along with a number of benchmark methods, and evaluate their performance with commonly used metrics designed for imbalanced data.

3.1 Dataset and Pre-processing

The dataset used in the proposed technique contains 33,126 dermoscopic images collected from 2056 patients [33]. All the images are labelled using histopathology and expert opinion either as benign or malignant skin lesions. The ISIC 2020 dataset also contains another 10,982 test images without the actual labels, but since we are studying the supervised classification task, we use the labeled data only. All the images are in JPG format and vary in dimensions and shape. We use different input dimensions for different base-learners, so we scale the input images to sizes \(32\times 32\), \(64\times 64\), \(128\times 128\), and \(256\times 256\) pixels (see Footnote 2).

Figure 2 contains example images present in the dataset. Table 1 shows the features in the metadata. Categorical features were encoded as integers in order to reduce the number of parameters in the meta-learner. All the missing values in the metadata are replaced by the average value of the feature in question.
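The two metadata pre-processing steps just described (integer encoding of categorical features and replacement of missing values by the feature average) can be sketched as below. The helper names are hypothetical, the order in which codes are assigned is an assumption, and the example values are invented.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def encode_categorical(
    values: Sequence[Optional[str]],
) -> Tuple[List[Optional[int]], Dict[str, int]]:
    """Map each distinct category to an integer code; keep missing as None.

    Codes are assigned in order of first appearance (an assumption).
    """
    codes: Dict[str, int] = {}
    encoded: List[Optional[int]] = []
    for value in values:
        if value is None:
            encoded.append(None)
            continue
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


def impute_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    mean = sum(observed) / len(observed)
    return [float(v) if v is not None else mean for v in values]


# Invented example: one categorical and one numerical feature.
sex_codes, sex_mapping = encode_categorical(["male", "female", None, "female"])
sex_feature = impute_mean(sex_codes)
age_feature = impute_mean([45, None, 60, 75])
print(sex_mapping, sex_feature, age_feature)
```

Integer encoding keeps each categorical feature in a single column, which is consistent with the stated aim of keeping the number of meta-learner parameters small.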

The ISIC 2020 data set is highly imbalanced because out of the total 33,126 images (2056 patients), only 584 images (corresponding to 428 distinct patients) are malignant. The division of the data into training, validation, and test sets is described below in Sect. 3.2.
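The degree of imbalance follows directly from the counts above. The short calculation below also shows why plain accuracy is uninformative here: a trivial classifier that labels every image as benign would be correct for all benign images while detecting no malignant case.

```python
n_total = 33_126      # labelled images in ISIC 2020
n_malignant = 584     # malignant images

prevalence = n_malignant / n_total
always_benign_accuracy = (n_total - n_malignant) / n_total
always_benign_recall = 0 / n_malignant  # no malignant image is detected

print(f"malignant share:        {prevalence:.2%}")
print(f"always-benign accuracy: {always_benign_accuracy:.2%}")
print(f"always-benign recall:   {always_benign_recall:.2%}")
```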

Fig. 2 Benign versus malignant images

Table 1 Metadata

3.2 Division of the Data and Hyperparameter Tuning

We use the validation set method to divide the data into three parts. As illustrated in Fig. 3, 10% of the total data D is kept as test data, \(D_\mathrm {Test}\), which is not used in the training process. The other 90% is further split by using 90% of it as training data, \(D_\mathrm {T}\), and the final part as validation data, \(D_\mathrm {V}\), which is used to tune the hyperparameters of each of the methods. Since the ISIC 2020 dataset contains multiple images for the same patient, we require that input images from a given individual appear only in one part of the data (\(D_\mathrm {T}\), \(D_\mathrm {V}\), or \(D_\mathrm {Test}\)) (see Footnote 3). The validation data is used to adjust the hyperparameters in each of the methods in the experiments by maximizing the F1-score (see Sect. 3.3 below). Hyperparameter tuning was done manually, starting from the settings proposed by the original authors (when available) in the case of the compared methods, and adjusting them until no further improvement was observed. Likewise, the neural network architectures of the CNN base-learners used in the proposed method were selected using the same procedure as for the other hyperparameters. Tables 5, 6, 7, 8, 9 and 10 in the “Appendix” show the details of the architectures of the CNN base-learners as well as the hyperparameters of the SVM meta-learner.

Fig. 3 Division of the data into training, \(D_\mathrm {T}\), validation, \(D_\mathrm {V}\), and test, \(D_\mathrm {Test}\), sets

We used the training and validation data from one random train-validation-test split to tune all hyperparameters. We then used the obtained settings in 10 new random repetitions with independent splits to evaluate the classification performance in order to avoid bias caused by overfitting.
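A patient-level split of the kind described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the fractions are applied to patients instead of images (so the image proportions are only approximately 81/9/10), no stratification by class is performed, and the function name `split_by_patient` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_by_patient(
    patient_ids: Sequence[str],
    seed: int,
    test_frac: float = 0.10,
    val_frac: float = 0.10,
) -> Tuple[List[int], List[int], List[int]]:
    """Return image indices (train, validation, test).

    All images of a given patient end up in exactly one of the three sets.
    """
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)          # a new seed gives a new random split
    rng.shuffle(patients)

    n_test = round(test_frac * len(patients))
    n_val = round(val_frac * (len(patients) - n_test))
    test_patients = set(patients[:n_test])
    val_patients = set(patients[n_test:n_test + n_val])

    train, val, test = [], [], []
    for index, pid in enumerate(patient_ids):
        if pid in test_patients:
            test.append(index)
        elif pid in val_patients:
            val.append(index)
        else:
            train.append(index)
    return train, val, test


# Toy data: 450 images from 100 hypothetical patients.
ids = [f"P{i % 100:03d}" for i in range(450)]
# Ten independent repetitions, one split per seed.
splits = [split_by_patient(ids, seed=s) for s in range(10)]
train_idx, val_idx, test_idx = splits[0]
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting at the patient level prevents near-duplicate images of the same lesion or patient from appearing in both the training and the test data, which would otherwise inflate the measured performance.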

3.3 Evaluation Metrics

As is customary in clinical applications with imbalanced datasets, we use the F1-measure, the area under the ROC curve (AUC-ROC), and the area under the precision–recall curve (AUC-PR) as evaluation metrics; see, e.g., [29]. The F1-measure is the harmonic mean of precision and recall (see definitions below), which is intended to balance the risk of false positives and false negatives. To evaluate the optimal F1-value, we set in each case the classification threshold to the value that maximizes the F1-measure.

AUC-ROC and AUC-PR both characterize the behavior of the classifier over all possible values of the classification threshold. AUC-ROC is the area under the curve obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at different values of the classification threshold, whereas AUC-PR is the area under the precision–recall curve. While both measures are commonly used in clinical applications, according to Davis and Goadrich [37] and Saito and Rehmsmeier [38], AUC-PR is the preferred metric in cases with imbalanced data where false negatives are of particular concern.
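To make the threshold-free nature of AUC-ROC concrete, the sketch below uses the well-known equivalence between AUC-ROC and the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The function name and the toy scores are illustrative; in practice a library routine would normally be used.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC computed by comparing every positive/negative score pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as one half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(toy_labels, toy_scores))  # 0.75
```

Because the value depends only on the ranking of the scores, it is unaffected by the choice of classification threshold.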

The used metrics are defined as follows:

$$\text{Precision} = \frac{T_P}{T_P+F_P} \tag{1}$$
$$\text{Recall} = \mathrm{TPR} = \frac{T_P}{P} \tag{2}$$
$$\mathrm{FPR} = \frac{F_P}{N} \tag{3}$$
$$\text{F1-measure} = 2 \times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}, \tag{4}$$

where \(T_P\) is the number of true positives (positive samples that are correctly classified by the classifier), \(F_P\) is the number of false positives (negative samples incorrectly classified as positive), and P and N are the total number of positive and negative samples, respectively.
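The definitions in Eqs. (1)–(4), together with the selection of the F1-maximizing threshold mentioned above, can be illustrated with the following sketch. The function names are hypothetical, a sample is predicted positive when its score is at least the threshold, and precision is set to zero when nothing is predicted positive; both are conventions assumed here.

```python
from typing import Dict, Sequence, Tuple


def metrics_at(labels: Sequence[int], scores: Sequence[float],
               threshold: float) -> Dict[str, float]:
    """Precision, recall (TPR), FPR and F1 at one classification threshold.

    Assumes that both classes are present in `labels`.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    p = sum(1 for y in labels if y == 1)   # total positives
    n = len(labels) - p                    # total negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0   # Eq. (1)
    recall = tp / p                                       # Eq. (2)
    fpr = fp / n                                          # Eq. (3)
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


def best_f1_threshold(labels: Sequence[int],
                      scores: Sequence[float]) -> Tuple[float, float]:
    """Scan all observed scores and return (threshold, F1) with maximal F1."""
    best_t, best_f1 = min(scores), -1.0
    for t in sorted(set(scores)):
        f1 = metrics_at(labels, scores, t)["f1"]
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy example with two negative and two positive samples.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
threshold, f1 = best_f1_threshold(toy_labels, toy_scores)
print(threshold, round(f1, 3))  # 0.35 0.8
```

In the toy example, the threshold 0.35 yields two true positives and one false positive, i.e., precision 2/3 and recall 1, which gives the maximal F1-value of 0.8.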

4 Experimental Results

The deep learning models were implemented using Keras version 2.2.4 and TensorFlow version 1.14.0. The other machine learning methods and preprocessing methods were implemented in Python 3.0 and scikit-learn version 0.15.2.

4.1 Computational Cost

We carried out all the experiments on a high-performance computing cluster using a maximum of four Intel Xeon Gold 5230 CPUs, two Nvidia Volta V100 GPUs, and 300 GB of memory for each method. Precise timing comparisons are not straightforward due to variable load on the cluster, but the relative differences are large enough to draw the following qualitative conclusions.

The conventional methods (KNN, RF, MLP, SVM) were the fastest, requiring up to about 1 hour to complete one training and testing cycle. The method proposed by Esteva et al. [23] using an undersampled and augmented version of the ISIC 2020 dataset used 3.1 hours, while our proposed method used 5.2 hours on the full ISIC 2020 dataset. The method by Mahbod et al. [27] was clearly the slowest and took over 72 hours to complete one training and testing cycle even on an undersampled and augmented version of the ISIC 2020 dataset. However, it is worth noting that once the model has been trained, which only needs to be done when the training data is updated, processing new test inputs takes a negligible amount of time compared to the training times (except for the KNN method for which there is no training stage).

4.2 Main Experimental Results

Table 2 and Fig. 4 show a comparison of the proposed technique with four non-deep learning classifiers (KNN, RF, MLP, SVM) and three selected deep learning based techniques (see Footnote 4); see “Appendix B” for the most important parameters of the benchmark methods. In each of the benchmark methods except those by Esteva et al. [23] and Mahbod et al. [27], the \(32\times 32\) pixel RGB input images (altogether 3072 input features) along with the auxiliary information in the metadata (additional 3 input features) were used as the input (see Footnote 5).

In the case of the methods by Esteva et al. [23] and Mahbod et al. [27], we apply downsampling of the majority class to balance the two classes, and use the same data augmentation procedures as described in the original articles; for details, see “Appendix B”.
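Downsampling of the majority class can be sketched as below. This is a generic illustration, not the exact procedure of “Appendix B”: the function name, the 1:1 target ratio, and the toy labels are assumptions.

```python
import random
from typing import List, Sequence


def undersample_majority(labels: Sequence[int], seed: int,
                         ratio: float = 1.0) -> List[int]:
    """Return the indices to keep after random undersampling.

    All positive (label 1) samples are kept, together with a random subset
    of negative (label 0) samples of size ratio * (number of positives).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), round(ratio * len(pos)))
    kept_neg = rng.sample(neg, n_keep)   # sampling without replacement
    return sorted(pos + kept_neg)


# Toy data: 5 positive and 95 negative samples.
toy_labels = [1] * 5 + [0] * 95
kept = undersample_majority(toy_labels, seed=0)
print(len(kept), sum(toy_labels[i] for i in kept))  # 10 5
```

Undersampling reduces the training cost but discards most of the negative samples, which is the trade-off noted in the introduction.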

The proposed technique achieves average F1, AUC-PR, and AUC-ROC values of 0.23, 0.16, and 0.87, respectively, which are the highest among all of the compared methods. However, the differences between the top performing methods are within the statistical margin of error (see Footnote 6). A more detailed visualization of the ROC and PR-curves is shown in Figs. 5, 6 and 7.
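The 95% confidence intervals reported with Table 2 and Fig. 4 are based on the t-distribution with \(n-1=9\) degrees of freedom. A minimal sketch of such an interval is given below; the function name is hypothetical and the ten scores are invented solely to exercise the code, so they do not correspond to any result in this paper.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical value of Student's t-distribution with 9 degrees
# of freedom; it is only appropriate for n = 10 repetitions.
T_CRIT_95_DF9 = 2.262


def mean_with_ci(scores: Sequence[float],
                 t_crit: float = T_CRIT_95_DF9) -> Tuple[float, float]:
    """Return (mean, half-width) of the confidence interval mean ± half-width."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)            # sample standard deviation
    half_width = t_crit * std / math.sqrt(n)
    return mean, half_width


# Invented scores from ten hypothetical repetitions.
toy_scores = [0.5, 0.6, 0.7, 0.5, 0.6, 0.7, 0.5, 0.6, 0.7, 0.6]
mean, half_width = mean_with_ci(toy_scores)
print(f"{mean:.3f} ± {half_width:.3f}")
```

Two methods whose intervals overlap substantially cannot be reliably ranked on the basis of ten repetitions alone, which is the sense in which the differences between the top performing methods are described as being within the margin of error.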

Table 2 Comparison of the proposed method with seven other methods in terms of three evaluation metrics (F1-measure, AUC-PR, AUC-ROC). The table shows the average score over \(n=10\) random repetitions ± 95% confidence intervals

Fig. 4 Classification accuracy of the proposed method and seven other methods measured by three evaluation metrics (F1-measure, AUC-PR, AUC-ROC). The scores are averages over \(n=10\) independent repetitions. Error bars are 95% confidence intervals based on the t-distribution with \(n-1=9\) degrees of freedom

Fig. 5 Left: ROC curves for the proposed method. Right: PR-curves for the proposed method. Both panels show the curves for ten independent runs (light blue curves), the average curve (bold blue line), and an interval showing the standard deviation of the curve (gray region)

Fig. 6 Left: Comparison of average ROC curves with three other deep learning methods. Right: Average PR-curves

Fig. 7 Left: Comparison of average ROC curves with non-deep learning methods. Right: Average PR-curves

4.3 Performance Gain from Ensemble Learning

The proposed technique consists of two steps: in the first step, the base-learners are trained, and in the second step, the meta-learner is trained on top of the base-learners. Table 3 shows the performance of each of the base-learners individually, which can be compared with the performance of the resulting SVM meta-classifier that combines the base-learners' outputs as the ensemble classification. Out of the six base-learners, four are trained from scratch on the ISIC 2020 dataset, while the remaining two are pre-trained on skin cancer images that are not part of the ISIC 2020 dataset. The performance comparison shows that even though the accuracies of each of the base-learners individually are quite low, the meta-classifier performs markedly better. This suggests that the base-learners succeed in providing a diverse set of inputs to the meta-learner, thus significantly improving the overall performance of the ensemble over any of the base-learners.

Table 3 Performance of the base-learners and the full ensemble (the SVM meta-classifier), showing the significantly better performance by the ensemble compared to the individual base-learners

4.4 Significance of Transfer Learning and Meta-data

To evaluate the impact of using pre-trained models and that of the metadata on the classification accuracy in the proposed method, we also evaluated the performance with either one of these components disabled. Table 4 shows the performance comparison of the proposed technique to a version where the pre-trained CNNs are disabled, and one where the metadata is not included as auxiliary data for the meta-learner. As seen in the table, excluding the pre-trained CNNs does not significantly affect the performance. The exclusion of the metadata led to somewhat inferior performance, but here too, the differences are relatively minor and are within the statistical margin of error. Further research with larger datasets and richer metadata is needed to confirm the benefits.

Table 4 Performance comparison of the proposed technique without pre-trained base-learners and meta-data

5 Limitations and Open Questions

As with all empirical work, the results from our experiments are subject to various biases, most notably dataset bias, so that the results cannot be directly transferred to other datasets. Further work and experiments on other high-quality, carefully curated datasets are necessary to validate our results. While we have made an effort to avoid overfitting by using a carefully planned training, validation, and testing procedure, the need for architecture optimization and hyperparameter tuning poses challenges to reproducibility. This issue is further amplified by the fluidity of the line between method and preprocessing, especially in deep learning, which makes it hard to carry out fair head-to-head comparisons. Further work towards reproducibility standards is acutely needed [40].

Our experimental setup includes only a limited set of baseline and benchmark methods, and it is likely that better performing methods exist or, quite certainly, will become available. The purpose of our work is not only to propose accurate skin cancer detection tools but, more importantly, to study generic techniques that can be used in combination with any existing or future machine learning methods. We hope that they will eventually find their way into clinical work, overcoming some of the current limitations [13] and helping to reduce the global health inequalities due to the limited availability of qualified dermatologists [5].

Additional open questions and promising research directions include: scaling to massive datasets created by extensive data augmentation; exploring the importance of metadata (on which our results, where the improvement was limited, should not be taken to be conclusive); and the use of enriched imaging data including hyperspectral images [41].

6 Conclusions

We proposed an ensemble-based deep learning approach for skin cancer detection based on dermoscopic images. Our method uses an ensemble of CNNs trained on input images of different sizes along with metadata. We present our results on the ISIC 2020 dataset which contains 33,126 dermoscopic images from 2056 patients. The dataset is highly imbalanced with less than 2% malignant cases. The impact of ensemble learning was found to be significant, while the impact of transfer learning and the use of auxiliary information in the form of metadata associated with the input images appeared to be minor. The proposed method compared favourably against other machine learning based techniques including three deep learning based techniques, making it a promising approach for skin cancer detection especially on imbalanced datasets. Our research expands the evidence suggesting that deep learning techniques offer useful tools in dermatology and other medical applications.