Transfer Learning with Ensembles of Deep Neural Networks for Skin Cancer Detection in Imbalanced Data Sets

Early diagnosis plays a key role in prevention and treatment of skin cancer. Several machine learning techniques for accurate detection of skin cancer from medical images have been reported. Many of these techniques are based on pre-trained convolutional neural networks (CNNs), which enable training the models based on limited amounts of training data. However, the classification accuracy of these models still tends to be severely limited by the scarcity of representative images from malignant tumours. We propose a novel ensemble-based convolutional neural network (CNN) architecture where multiple CNN models, some of which are pre-trained and some are trained only on the data at hand, along with auxiliary data in the form of metadata associated with the input images, are combined using a meta-learner. The proposed approach improves the model’s ability to handle limited and imbalanced data. We demonstrate the benefits of the proposed technique using a dataset with 33,126 dermoscopic images from 2056 patients. We evaluate the performance of the proposed technique in terms of the F1-measure, area under the ROC curve (AUC-ROC), and area under the PR-curve (AUC-PR), and compare it with that of seven different benchmark methods, including two recent CNN-based techniques. The proposed technique compares favourably in terms of all the evaluation metrics.


Introduction
Skin cancer is caused by mutations within the DNA of skin cells, which causes their abnormal multiplication (Armstrong & Kricker, 1995; Simões et al., 2015).
In the early development of skin cancer, lesions appear on the outer layer of the skin, the epidermis. Not all lesions are caused by malignant tumours, and a diagnosis classifying the lesion as either malignant (cancerous) or benign (non-cancerous) is often reached based on preliminary visual inspection followed by a biopsy. Early detection and classification of lesions is important because early diagnosis of skin cancer significantly improves the prognosis (Bray et al., 2018).
The visual inspection of potentially malignant lesions carried out using an optical dermatoscope is a challenging task and requires a specialist dermatologist. For instance, according to Morton & Mackie (1998), in the case of melanoma, a particularly aggressive type of skin cancer, only about 60-90 % of malignant tumours are identified based on visual inspection, and accuracy varies markedly depending on the experience of the dermatologist. As skilled dermatologists are not available globally and for all ethnic and socioeconomic groups, the situation causes notable global health inequalities (Buster et al., 2012).
Due to the aforementioned reasons, machine learning techniques are widely studied in the literature. Machine learning has the potential to aid automatic detection of skin cancer from dermoscopic images, thus enabling early diagnosis and treatment. For example, Li & Shen (2018) use a combination of two deep learning models for the segmentation and classification of skin lesions (see also Murugan et al., 2019). Hirano et al. (2020) suggested a transfer learning based technique in which hyperspectral data is used for the detection of melanoma using a pre-trained GoogLeNet model (Szegedy et al., 2015). The same pre-trained model is used by Kassem et al. (2020). Esteva et al. (2017) use a pre-trained Inception V3 model (Szegedy et al., 2016).

Another commonly used technique to improve classification accuracy when limited training data is available is ensemble learning; see Dietterich et al. (2002) and Polikar (2012). The idea is to combine the outputs of multiple classifiers, called base-learners, by using their outputs as the input to another model, called a meta-learner, to obtain a consensus classification. Ensemble learning tends to reduce variability and improve classification accuracy. Mahbod et al. (2019) proposed an ensemble-based hybrid technique involving pre-trained AlexNet, VGG16 (Simonyan & Zisserman, 2015), and ResNet models as base-learners. The output obtained from these models is combined using SVM and logistic regression classifiers as meta-learners.
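The stacking idea described above can be illustrated with a minimal scikit-learn sketch, in which simple classifiers stand in for the CNN base-learners; the synthetic data, models, and parameters are illustrative, not those used in the paper.

```python
# Minimal sketch of stacking: base-learners' outputs feed a meta-learner.
# Simple scikit-learn classifiers stand in for the CNN base-learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data (roughly 10 % positives)
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ensemble = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("logreg", LogisticRegression(max_iter=1000))],
    final_estimator=SVC(),          # SVM meta-learner, as in the paper
    stack_method="predict_proba",   # combine predicted class probabilities
)
ensemble.fit(X_tr, y_tr)
print(ensemble.score(X_te, y_te))
```

The meta-learner sees only the base-learners' predicted probabilities, so it can learn which base-learner to trust in which region of the input space.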
In addition to the shortage of large quantities of labeled training data, many clinical datasets have severe class imbalance: the proportion of positive cases tends to be significantly lower than that of the negative cases; see He & Garcia (2009). This reduces the amount of informative data points and lowers the accuracy of many machine learning techniques, and may create a bias that leads to an unacceptably high number of false negatives when the model is deployed in real-world clinical applications. To deal with the class imbalance issue, most of the previously reported techniques use data augmentation, i.e., oversampling training data from the minority (positive) class and/or undersampling the majority class. This tends to increase computational complexity, as the amount of training data is in some cases increased many-fold, and risks losing informative data points due to undersampling. We evaluate the performance of the proposed technique on data from the International Skin Imaging Collaboration (ISIC) 2020 Challenge, which is highly imbalanced, containing less than 2 % malignant samples (Rotemberg et al., 2021). Our experiments demonstrate that (i) ensemble learning significantly improves the accuracy even though the accuracy of each of the base-learners is relatively low; (ii) transfer learning and the use of metadata have only a minor effect on the overall accuracy; and (iii) overall, the proposed method compares favourably against all of the other methods in the experiments, even though the differences between the top performing methods are within the statistical margin of error.

The Proposed Method
The proposed technique is an ensemble-based technique in which CNNs are used as base-learners. The base-learners are either pre-trained on a balanced dataset collected from the ISIC archive or trained only on the ISIC 2020 dataset. Predictions from all the base-learners, along with the auxiliary data contained in the metadata associated with the images, are used as input to an SVM classifier, which finally classifies each image in the dataset as positive (malignant) or negative (benign). Figure 1 shows the flowchart of the proposed technique. The technique also relies on transfer learning, in which information extracted in one domain is applied in another. The domain from which information is extracted is known as the source domain, and the domain where the extracted information is applied is called the target domain. The benefit of transfer learning is that it not only saves the time needed to train the network from scratch but also aids in improving the performance in the target domain.
In the proposed technique, the idea of transfer learning is exploited during the training phase of the base-learners. We pre-train some of the CNN base-learners on a balanced dataset collected from the ISIC archive and available on Kaggle.1
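The pre-train-then-fine-tune step can be sketched as follows with Keras; the random toy data, tiny architecture, and single training epoch are stand-ins for the actual ISIC datasets and CNN architectures used in the paper.

```python
# Hedged sketch of transfer learning: pre-train a small CNN on a "source"
# dataset, then freeze its convolutional base and train a new classification
# head on the "target" data. Toy random data, not the actual ISIC images.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 32, 32, 3)).astype("float32")  # source images
ys = rng.integers(0, 2, 200)                              # source labels
Xt = rng.normal(size=(100, 32, 32, 3)).astype("float32")  # target images
yt = rng.integers(0, 2, 100)                              # target labels

base = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(8, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
])
source_model = keras.Sequential([base, keras.layers.Dense(1, activation="sigmoid")])
source_model.compile(optimizer="adam", loss="binary_crossentropy")
source_model.fit(Xs, ys, epochs=1, verbose=0)   # "pre-training" on the source domain

base.trainable = False                          # freeze the pre-trained layers
target_model = keras.Sequential([base, keras.layers.Dense(1, activation="sigmoid")])
target_model.compile(optimizer="adam", loss="binary_crossentropy")
target_model.fit(Xt, yt, epochs=1, verbose=0)   # fine-tune the head on the target domain
```

Freezing the base means only the new head's parameters are updated on the target data, which is what makes pre-training useful when the target dataset is small.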

Data and Evaluation Metrics
We use the ISIC 2020 Challenge dataset to train and test the proposed method along with a number of benchmark methods, and evaluate their performance with commonly used metrics designed for imbalanced data.

Dataset and pre-processing
The dataset used in the proposed technique contains 33,126 dermoscopic images collected from 2056 patients (Rotemberg et al., 2021).

Division of the data and hyperparameter tuning
We use the validation set method to divide the data into three parts, as illustrated in Fig. 3. We used the training and validation data from one random train-validation-test split to tune all hyperparameters. We then used the obtained settings in 10 new random repetitions with independent splits to evaluate the classification performance, in order to avoid bias caused by overfitting.
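The evaluation protocol above can be sketched as follows; the classifier and synthetic imbalanced data are illustrative stand-ins for the actual models and the ISIC 2020 data.

```python
# Sketch of the evaluation protocol: hyperparameters are tuned once on a
# single split, then the fixed settings are re-evaluated on fresh
# independent random splits to avoid an optimistic (overfitted) estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

scores = []
for seed in range(10):  # 10 independent train-validation-test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, stratify=y_train, random_state=seed)
    # Hyperparameters are fixed here; X_val/y_val would be used for tuning.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(f1_score(y_test, clf.predict(X_test)))
print(np.mean(scores), np.std(scores))
```

Reporting the mean and spread over the 10 repetitions is what later enables the confidence intervals used in the comparisons.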

Evaluation metrics
As is customary in clinical applications with imbalanced datasets, we use the F1-measure, the area under the ROC curve (AUC-ROC), and the area under the precision-recall curve (AUC-PR) as evaluation metrics; see, e.g., He & Garcia (2009). The F1-measure is the harmonic mean of precision and recall (see definitions below), which is intended to balance the risk of false positives and false negatives. To evaluate the optimal F1-value, we set in each case the classification threshold to the value that maximizes the F1-measure.
AUC-ROC and AUC-PR both characterize the behavior of the classifier over all possible values of the classification threshold. AUC-ROC is the area under the curve of the true positive rate (TPR) against the false positive rate (FPR) at different values of the classification threshold, whereas AUC-PR is the area under the precision-recall curve. While both measures are commonly used in clinical applications, according to Davis & Goadrich (2006) and Saito & Rehmsmeier (2015), AUC-PR is the preferred metric in cases with imbalanced data where false negatives are of particular concern.
The used metrics are defined as follows:

precision = TP / (TP + FP),    recall = TPR = TP / P,
F1 = 2 × precision × recall / (precision + recall),    FPR = FP / N,

where TP is the number of true positives (positive samples that are correctly classified by the classifier), FP is the number of false positives (negative samples incorrectly classified as positive), and P and N are the total numbers of positive and negative samples, respectively.
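These metrics can be computed directly with scikit-learn; the label and score vectors below are toy values for illustration.

```python
# Sketch of the evaluation metrics: AUC-ROC, AUC-PR, and the F1-measure
# at the classification threshold that maximizes it.
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])          # toy labels
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.6, 0.7,
                    0.8, 0.4, 0.55])                         # toy scores

auc_roc = roc_auc_score(y_true, y_score)
auc_pr = average_precision_score(y_true, y_score)            # area under PR curve

# F1 over all thresholds; report the maximum (the "optimal F1-value").
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best_f1 = f1.max()
print(auc_roc, auc_pr, best_f1)
```

Sweeping the threshold via `precision_recall_curve` is what "setting the classification threshold to the value that maximizes the F1-measure" amounts to in practice.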

Experimental Results
All the computations were done on the Puhti supercomputer, an Atos BullSequana X400 cluster comprised of Intel CPUs.

Main experimental results
Table 2 and Figure 4 summarize the main results. The proposed technique achieves average F1, AUC-PR, and AUC-ROC values of 0.23, 0.16, and 0.87, respectively, which are the highest among all of the compared methods. However, the differences between the top performing methods are within the statistical margin of error.5 A more detailed visualization of the ROC and PR curves is shown in Figs. 5-7.

Performance gain from ensemble learning
The proposed technique is comprised of two steps: in the first step, the base-learners are trained, and in the second step, the meta-learner is trained on top of the base-learners.
4 The computational cost of training the SVM classifier prohibited the use of the higher-resolution images.
5 Following Berrar & Lozano (2013), we present comparisons in terms of confidence intervals instead of hypothesis tests ("We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values."). We calculate 95 % confidence intervals based on the t-distribution with n − 1 = 9 degrees of freedom as µ ± 2.26216 × σ/√n, where µ is the average score, σ is the standard deviation of the score, and n = 10 is the sample size (number of repetitions with independent random train-validation-test splits).

Table 3 shows the performance of each of the base-learners individually, which can be compared with the performance of the resulting SVM meta-classifier that combines the base-learners' outputs into the ensemble classification. Out of the six base-learners, four are trained from scratch on the ISIC 2020 dataset, while the remaining two are pre-trained on skin cancer images that are not part of the ISIC 2020 dataset. The comparison shows that even though the accuracy of each of the base-learners individually is quite low, the meta-classifier performs markedly better. This suggests that the base-learners succeed in providing a diverse set of inputs to the meta-learner, thus significantly improving the overall performance of the ensemble over any of the base-learners.
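The confidence-interval computation used for all reported scores can be sketched as follows; the score vector is a toy example, and SciPy is assumed to be available for the t-quantile.

```python
# Sketch of the 95 % confidence interval used in the comparisons:
# mean ± t * std / sqrt(n), with n - 1 = 9 degrees of freedom.
import math
from scipy import stats

scores = [0.21, 0.25, 0.22, 0.24, 0.23, 0.26, 0.20, 0.23, 0.22, 0.24]  # toy F1 scores
n = len(scores)
mu = sum(scores) / n
sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / (n - 1))  # sample std
t = stats.t.ppf(0.975, df=n - 1)       # ≈ 2.26216 for 9 degrees of freedom
half_width = t * sigma / math.sqrt(n)
print(mu, mu - half_width, mu + half_width)
```

The constant 2.26216 quoted in footnote 5 is exactly this 97.5 % quantile of the t-distribution with 9 degrees of freedom.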

Significance of transfer learning and meta-data
To evaluate the impact of using pre-trained models and that of the metadata on the classification accuracy in the proposed method, we also evaluated the performance with either one of these components disabled. Table 4 shows the performance comparison of the proposed technique to a version where the pre-trained CNNs are disabled, and one where the metadata is not included as auxiliary data for the meta-learner. As seen in the table, excluding the pre-trained CNNs does not significantly affect the performance. The exclusion of the metadata led to somewhat inferior performance, but here too, the differences are relatively minor and within the statistical margin of error. Further research with larger datasets and richer metadata is needed to confirm the benefits.

Conclusions
We proposed an ensemble-based deep learning approach for skin cancer detection based on dermoscopic images. Our method uses an ensemble of convolutional neural networks (CNNs) trained on input images of different sizes, along with metadata. We present results on the ISIC 2020 dataset, which contains 33,126 dermoscopic images from 2056 patients. The dataset is highly imbalanced, with less than 2 % malignant samples. The impact of ensemble learning was found to be significant, while the impact of transfer learning and the use of auxiliary information in the form of metadata associated with the input images appeared to be minor. The proposed method compared favourably against other machine learning based techniques, including three deep learning based techniques, making it a promising approach for skin cancer detection, especially on imbalanced datasets. Our research expands the evidence suggesting that deep learning techniques offer useful tools in dermatology and other medical applications.
Several authors have compared the performance of K-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM) classifiers on data extracted from segmented regions of dermoscopic images. Similarly, Ballerini et al. (2013) used a KNN-based hierarchical approach for classifying five different types of skin lesions. Thomas et al. (2021) used deep learning based methods for classification and segmentation of skin cancer. Lau & Al-Jumaily (2009) proposed a technique based on a Multi-Layer Perceptron (MLP) and other neural network models. A recent review by Chan et al. (2020) summarizing many of these studies concluded that while many authors reported better sensitivity and specificity than dermatologists, "further validation in prospective clinical trials in more real-world settings is necessary before claiming superiority of algorithm performance over dermatologists." What all the aforementioned methods have in common is that they require large amounts of training data in the form of dermoscopic images together with labels indicating the correct diagnosis. Several authors have proposed approaches to reduce the amount of training data required to reach satisfactory classification accuracy. Hosny et al. (2020) describe a method based on transfer learning, which is a way to exploit available training data collected for a different classification task than the one at hand. Hosny's technique is based on a pre-trained AlexNet network (a specific deep learning architecture proposed by Krizhevsky et al. (2012)) that was originally trained to classify images on the commonly used ImageNet dataset, and then adapted to perform skin cancer classification by transfer learning. Similarly, Dorj et al. (2018) used a pre-trained AlexNet combined with an SVM classifier. Guo & Yang (2018) utilized a ResNet network, another commonly used deep learning architecture, by He et al. (2016).
In this paper, we propose a new technique for skin cancer classification from dermoscopic images based on transfer learning and ensembles of deep neural networks. The motivation behind the proposed technique is to maximize the diversity of the base-classifiers at various levels during the training of the ensemble, in order to improve the overall accuracy. In our proposed ensemble-based technique, a number of CNN base-learners are trained on input images scaled to different sizes between 32 × 32 and 256 × 256 pixels. Two out of the six base-learners are CNNs pre-trained on another skin cancer dataset that is not part of the primary dataset. In the second step, all the predictions from the base-learners, along with the metadata provided with the input images (including, e.g., the age and gender of the subject), are fed to an SVM meta-learner to obtain the final classification. By virtue of training the base-classifiers on input images of different sizes, the model is able to focus on features at multiple scales at the same time. The use of metadata further diversifies the information, which improves the classification accuracy.
In the proposed technique, six base-learners are used. Each of the base-learners operates on input data of different dimensions. During training, four base-learners, CNN32×32, CNN64×64, CNN128×128, and CNN256×256, are trained from random initial parameters on 32 × 32, 64 × 64, 128 × 128, and 256 × 256 input images, respectively. The other two base-learners are trained on malignant and benign skin cancer images of sizes 32 × 32 and 64 × 64, respectively, which are not part of the ISIC 2020 dataset. After training all six base-learners, the predictions from all of them, along with the metadata, are fed into an SVM classifier that functions as the meta-classifier. The SVM is trained on the training data and used to finally classify each of the test images as malignant or benign. For both the base and the meta classifiers, the validation data is used to adjust hyperparameters, as explained in more detail below.
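The second step can be sketched as follows, assuming the six base-learner malignancy probabilities are already available; random stand-in values are used for the base-learner outputs, the metadata, and the labels.

```python
# Hedged sketch of the second step: each base-learner's predicted
# malignancy probability, concatenated with the image metadata, forms the
# input vector to the SVM meta-classifier. Toy numbers throughout.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
y = (rng.random(n) < 0.1).astype(int)        # ~10 % malignant toy labels
# Stand-ins for the six base-learner probabilities, correlated with y:
base_preds = np.clip(y[:, None] * 0.6 + rng.random((n, 6)) * 0.4, 0, 1)
metadata = rng.random((n, 3))                # e.g. age, sex, site (encoded)

X_meta = np.hstack([base_preds, metadata])   # 6 + 3 features per image
svm = SVC(kernel="rbf").fit(X_meta[:150], y[:150])  # train the meta-classifier
labels = svm.predict(X_meta[150:])           # final benign/malignant calls
print(labels[:10])
```

The meta-learner thus sees only a 9-dimensional summary per image, which keeps its training cheap compared to working on raw pixels.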

Figure 1: Block diagram of the proposed technique.

This archive dataset was constructed in 2019, so none of the ISIC 2020 data are included in it. The rest of the base-learners are trained on the ISIC 2020 dataset, which comprises the target domain. The introduction of the CNNs pre-trained on balanced data provides a diverse set of predictions, complementing the information coming from the base-learners trained on the ISIC 2020 data. Moreover, since the pre-trained base-learners need to be trained only once, instead of being retrained every time we repeat the experiment on random subsets of the ISIC 2020 data (see Sec. 3.2 below), the pre-training saves time.

Figure 3: Division of the data into training (D_T), validation (D_V), and test (D_Test) sets.
For the implementation of the deep learning models, we use Keras version 2.2.4 and TensorFlow version 1.14.0. The other machine learning and preprocessing methods were implemented in Python 3 with scikit-learn version 0.15.2. All the source code needed to replicate the experiments will be released together with the published version of this paper.
Table 2 and Figure 4 show a comparison of the proposed technique with four non-deep-learning classifiers (KNN, RF, MLP, SVM) and three selected deep learning based techniques; see Appendix B for the most important parameters of the benchmark methods. In each of the benchmark methods, except those by Esteva et al. (2017) and Mahbod et al. (2019), the 32 × 32 pixel RGB input images (altogether 3072 input features) along with the auxiliary information in the metadata (an additional 3 input features) were used as the input.4

Figure 4: Classification accuracy of the proposed method and seven other methods measured by three evaluation metrics (F1-measure, AUC-PR, AUC-ROC). The scores are averages over n = 10 independent repetitions. Error bars are 95 % confidence intervals based on the t-distribution with n − 1 = 9 degrees of freedom.

Figure 5: Left: AUC-ROC curves for the proposed method. Right: AUC-PR curves for the proposed method. Both panels show the curves for ten independent runs (light blue curves), the average curve (bold blue line), and an interval showing the standard deviation of the curve (gray region).

Table 1: Metadata features.

All the images are labelled, using histopathology and expert opinion, as either benign or malignant skin lesions. The ISIC 2020 dataset also contains another 10,982 test images without the actual labels, but since we are studying the supervised classification task, we use the labeled data only. All the images are in JPG format of varying dimensions and shape. We use different input dimensions for different base-learners, so we scale the input images to sizes 32 × 32, 64 × 64, 128 × 128, and 256 × 256 pixels.2 Figure 2 contains example images present in the dataset. Table 1 shows the features in the metadata. Categorical features were encoded as integers in order to reduce the number of parameters in the meta-learner. All missing values in the metadata are replaced by the average value of the feature in question. The ISIC 2020 dataset is highly imbalanced: out of the total 33,126 images (2056 patients), only 584 images (corresponding to 428 distinct patients) are malignant. The division of the data into training, validation, and test sets is described below in Sec. 3.2.
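The metadata preprocessing described above can be sketched with pandas; the column names and values below are illustrative, not the exact ISIC 2020 schema.

```python
# Sketch of the metadata preprocessing: categorical features encoded as
# integers, missing values replaced by the feature mean.
import pandas as pd

meta = pd.DataFrame({
    "age": [45, None, 60, 35],
    "sex": ["male", "female", "female", None],
    "site": ["torso", "head/neck", "torso", "lower extremity"],
})

for col in ["sex", "site"]:
    codes = meta[col].astype("category").cat.codes  # integers; -1 marks missing
    meta[col] = codes.where(codes != -1)            # keep missing as NaN for now
meta = meta.fillna(meta.mean())                     # impute with feature means
print(meta)
```

Encoding categoricals as single integers (rather than one-hot vectors) keeps the meta-learner's input dimension small, which matches the paper's stated goal of reducing the number of parameters in the meta-learner.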

Table 2: Comparison of the proposed method with seven other methods in terms of three evaluation metrics (F1-measure, AUC-PR, AUC-ROC). The table shows the average score over n = 10 random repetitions ± 95 % confidence intervals based on the t-distribution with n − 1 = 9 degrees of freedom.

Table 3: Performance of the base-learners and the full ensemble (the SVM meta-classifier).

Table 4: Performance comparison of the proposed technique without pre-trained base-learners and without metadata.