An ensemble-based convolutional neural network model powered by a genetic algorithm for melanoma diagnosis

Melanoma is one of the main causes of cancer-related deaths. The development of new computational methods as an important tool for assisting doctors can lead to early diagnosis and effectively reduce mortality. In this work, we propose a convolutional neural network architecture for melanoma diagnosis inspired by ensemble learning and genetic algorithms. The architecture is designed by a genetic algorithm that finds optimal members of the ensemble. Additionally, the abstract features of all models are merged and, as a result, additional prediction capabilities are obtained. The diagnosis is achieved by combining all individual predictions. In this manner, the training process is implicitly regularized, showing better convergence, mitigating the overfitting of the model, and improving the generalization performance. The aim is to find the models that best contribute to the ensemble. The proposed approach also leverages data augmentation, transfer learning, and a segmentation algorithm. The segmentation can be performed without training and with a central processing unit, thus avoiding a significant amount of computational power, while maintaining its competitive performance. To evaluate the proposal, an extensive experimental study was conducted on sixteen skin image datasets, where state-of-the-art models were significantly outperformed. This study corroborated that genetic algorithms can be employed to effectively find suitable architectures for the diagnosis of melanoma, achieving in overall 11% and 13% better prediction performances compared to the closest model in dermoscopic and non-dermoscopic images, respectively. Finally, the proposal was implemented in a web application in order to assist dermatologists and it can be consulted at http://skinensemble.com.


Introduction
Melanoma is the most serious form of skin cancer that begins in cells known as melanocytes. Melanoma has an increasing incidence, where just in Europe were estimated 144,200 cases and 20,000 deaths in 2018 [1], whereas in USA, 106,110 new cases of invasive melanoma will be diagnosed (62,260 in men and 43,850 in women) and 7180 deaths are expected in 2021 (4600 men and 2580 women) [2]. The lesion is first diagnosed through an initial clinical screening, and then potentially through a dermoscopic analysis, biopsy and histopathological examination [3]. Despite the expertise of dermatologists, early diagnosis of melanoma remains a challenging task since it is presented in many different shapes, sizes and colors even between samples in the same category [4]. Providing a comprehensive set of tools is necessary for simplifying diagnosis and assisting dermatologists in their decision-making processes [3].
Several automated computer image analysis strategies have been used as tools for medical practitioners to provide accurate lesion diagnostics, including descriptor-based methods [5,6] and convolutional neural networks (CNNs) [3,7,8]. Descriptor-based methods require the previous extraction of handcrafted features [9], which rely on the expertise of dermatologists and introduce a margin of error. By contrast, CNN models can automatically learn highlevel features from raw images [3], thus allowing for the development of applications in a shorter timeframe. Furthermore, Nasr-Esfahani et al. [10] showed that CNN models can overcome handcrafted features-based methods. The authors obtained 7% and 19% better sensitivity performance compared to color-based descriptor and texturebased descriptor, respectively. Recently, Brinker et al. [11] demonstrated that CNN models can match the prediction performance of 145 dermatologists.
CNN models have shown to be effective in solving several complex problems [12,13]. However, they still present several issues which hamper their accuracy in diagnosing skin conditions. CNN models can learn from a wide variety of nonlinear data points. As such, they are prone to overfitting on datasets with small numbers of samples per category, thus attaining a poor generalization capacity. It is noteworthy that so far, most of the existing public skin datasets only encompass a few hundreds or thousands of images. This can limit the learning capacity of CNN models. On the other hand, CNN models are sensitive to some characteristics in data, such as large inter-class similarities and intra-class variances, variations in viewpoints, changes in lighting conditions, occlusions, and background clutter [14]. These days, the majority of skin datasets are made up of dermoscopic images reviewed by expert dermatologists. However, bear in mind that there is an increased tendency to collect images taken by common digital cameras [15]. The above can reduce invasive treatments and their associated expenses in addition to augmenting the development of modern tools for cheaper and better melanoma diagnoses. Finally, CNN models are approximately invariant with regard to small translations to the input, but they are not rotation, color or lighting-invariant [16,17]. Invariance is an important concept in the area of image recognition. It means that if you take the input and transform it, the representation you get is the same as the representation of the original. Due to the vast variability in the morphology of moles, this is an important issue to resolve in order to attain a more effective melanoma diagnosis, thus allowing a model to detect rotations or changes to proportion and adapt itself in a way that the learned representation is the same.
Several techniques can be applied to overcome some of these issues, but the most proven include data augmentation [18], transfer learning [3], ensemble learning [19] and, more recently, generative adversarial networks [20] and multi-task learning [21]. However, in most cases, researchers rely on their expertise to select which techniques to apply, and there is no specific pattern to follow that will definitively produce a model with a high level of reliability. Furthermore, most research does not follow a standard experimental study and only includes a limited number of datasets.
As a consequence of the above, in this work, a novel approach for diagnosing melanoma via the use of images is proposed. First, the proposal is inspired by ensemble learning and is built via a genetic algorithm. The genetic algorithm is designed to find the optimal members of such ensembles while considering the entire training phase of the possible members. In addition, the abstract features of all models in the ensemble are merged, resulting in an additional prediction. Next, these individual predictions are combined and a final diagnosis is made. This approach can be seen as a double ensemble, in which all predictive components are double related. The aim is to find the models that best contribute to the ensemble, rather than the individual level. As a result, the state of each CNN model that best trains, generalizes and suits with the other CNN models is selected [22]. In this manner, the training process is implicitly regularized, which has shown better convergence, mitigates the overfitting of the model and improves the generalization performance [19,23]. To the best of our knowledge, this is the first attempt at intelligently constructing ensembles of CNN models by following a genetic algorithm to better solve the challenge of diagnosing melanoma through the use of images. Second, a novel lesion segmentation method, which is capable of efficiently obtaining reliable segmentation masks without prior information through the use of just a CPU, is applied. Bear in mind that state-of-the-art biomedical segmentation methods commonly require a prior training stage and the use of GPUs. In addition, common techniques such as transfer learning and data augmentation have been applied to further improve performance. To evaluate the suitability of this proposal, an extensive experimental study was conducted on sixteen melanoma-image datasets, enabling a better analysis of the effectiveness of the model. The results showed that the proposed approach achieved promising results and was competitive compared to six state-of-the-art CNN models which have previously been used for diagnosing melanoma [3,10,[24][25][26]. These works were summarized in Pérez et al. [8], and the most relevant are explained in the next section. To perform a fair comparison, the above architectures were assessed in Sect. 5.5 by using the same large number of datasets and a quantification of the results was included. Finally, a web application was developed in order to support dermatologists in decision making.
The rest of this work is organized as follows: Sect. 2 briefly presents the state of the art in solving melanoma diagnosis problems mainly by using CNN models; Sect. 3 describes the proposed ensemble convolutional architecture; Sect. 4 describes the genetic algorithm application; the analysis and discussion of the results are portrayed in Sect. 5; and finally, Sect. 6 outlines conclusions and future works.

Related works
The popularity of CNN models increased when AlexNet [27] won the well-known ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, reducing the top-5 labels error rate from 26.1% to 15.3%. Since then, CNN models are widely applied in image classification tasks [28][29][30]. However, most authors apply other sophisticated techniques over CNN models to achieve a better performance in melanoma diagnosis, such as segmentation [31], data augmentation [18], transfer learning techniques [3], and CNN-based ensembles [19].
Data augmentation is employed to add new data to the input space, which helps reducing overfitting [32] and obtaining transformation-invariant models [17]. This technique is usually performed by means of random transformations [33]. Also, most of the datasets available for melanoma diagnosis lack of balance between the categories, so data augmentation can help to tackle the imbalance issue [34]. For example, Esteva et al. [3] showed the suitability of CNN models as a powerful tool for melanoma diagnosis. The authors augmented the images by a factor of 720Â using basic transformations, such as rotation, flips and crops. Also, they compared the performance of one CNN model to 21 board-certified dermatologists on biopsy-proven clinical images. The results showed that the CNN model achieved a performance on par with experts. Pérez et al. [8] showed that 12 CNN models achieve better performance when applying data augmentation. For example, Xception [35] increased its average performance by 76%.
Transfer learning has been successfully applied in image classification tasks [3,30,36,37]. For example, Shin et al. [30] applied it specifically in thoraco abdominal lymph node detection and interstitial lung disease classification, in both cases the weights obtained after training with Ima-geNet were beneficial. Esteva et al. [3] used Google's InceptionV3 [38] architecture pretrained on ImageNet. The authors removed the final classification layer and then they re-trained with 129,450 skin lesions images. Pérez et al. [8] studied the impact of applying transfer learning in skin images, where MobileNet [39] increased its average performance by 42%.
On the other hand, ensemble learning [40] has also shown to be effective in solving complex problems [41,42]. The combination of several classifiers built from different hypothesis spaces can reach better results than a single classifier [43]. For example, Mahbod et al. [19] proposed an ensemble-based approach where two different CNN architectures were trained with skin lesion images. The results showed to be competitive compared to the state-of-the-art methods for melanoma diagnosis. Harangi et al. [23] proposed an ensemble composed by the wellknown CNN architectures AlexNet [27], VGG and Incep-tionV1 [44]. The ensemble was assessed on the International Symposium on Biomedical Imaging (ISBI) 2017 challenge and obtained very competitive results. However, the ensemble models are usually built trusting in prior knowledge.
In order to achieve better performance, not only new strategies to train the models have been developed, but also efforts have been made to improve the input data. Skin lesion segmentation plays an important role in melanoma diagnosis. It isolates the region of interest and significantly improves the performance of the model. This is a highly complex task, and it is important because some areas not related to the lesion can lead CNN models to misclassify samples. However, some authors decided not to use some of these preprocessing techniques. Mahbod et al. [19] ignored complex preprocessing steps, as well segmentation methods, but applied basic data augmentation techniques to prevent overfitting. Nevertheless, if irrelevant information is removed from the image, the models could be able to achieve better performance. Since 2016, The International Skin Imaging Collaboration 1 (ISIC) project annually organizes a challenge in which more than 180 teams have already participated. From 2016 to 2018, there was a special task about lesion segmentation. In ISIC-2016, considering the top performances, the use of segmentation obtained 8% better sensitivity compared to its non-use. Several segmentation methods can be found. For example, Ronneberger et al. [45] designed a CNN model (U-Net) for biomedical image segmentation. The model relies on data augmentation to use the available labeled images more efficiently. The U-Net architecture achieved high performance on different biomedical segmentation applications, such as neuronal structures in electron microscopic recordings and skin lesion segmentation [46,47]. However, U-Net requires a costly training process using GPU and more important, a prior knowledge from data already segmented by expert dermatologists. Furthermore, Alom et al. [48] proposed a recurrent residual U-Net model (R2U-Net). The model was tested on blood vessel segmentation in retinal images, skin cancer segmentation, and lung lesion segmentation. The results showed better performance compared to the U-Net model. Huang et al. [31] proposed a new segmentation method based on end-to-end object scale-oriented fully convolutional networks. The authors achieved 92.5% of sensitivity and outperformed all CNN models in their study, which was 1.4% better compared to the winner in ISIC-2016.
Considering the above, it would be interesting to design a deep learning model that combines features from different approaches such as segmentation, data augmentation, transfer learning and ensemble learning. We hypothesized that evolutionary optimization methods could be an effective approach to find the optimal combination of CNN models [49,50], considering all states from all the models. Evolutionary methods have proven to be useful in solving many complex problems, finding and optimizing architectures of neural networks [51], and mining imbalanced data [52]. Regarding segmentation methods, in this work it is applied an extension of the Chan-Vese segmentation algorithm [53] and it is evaluated using specialized datasets. To augment data, it is important to perform a data augmentation both on training and test phases [18], which can increase the performance significantly.

Ensemble-based convolutional architecture
First of all, the source datasets were preprocessed following an extension of the Chan-Vese segmentation algorithm, as shown in Fig. 1. This algorithm is designed to segment objects without clearly defined boundaries and is based on techniques of curve evolution, Mumford-Shah [54] functional for segmentation and level sets. Chan-Vese has been previously applied in skin image segmentation [55], being an effective starting point to accurately segment skin images. First, Chan-Vese is applied on each input image and as a result, a mask with positive and negative values is obtained. After that, the positive pixels within 40% of the center are selected (cluster of pixels P). After using several recognized skin lesion detection applications such as SkinVision 2 , we realized that most of them demand that images must be centered in the lesion in order to perform an accurate diagnosis 3 [56]. In addition, after reviewing a large number of skin image datasets, it was noticed that most images are centered in the lesion, which is an advantage when applying segmentation. Third, all positive clusters (Q) that intersect P are merged, obtaining a new segmentation mask M, . Finally, the segmented image is obtained after applying the mask M on the original image. In Sect. 5, the proposed segmentation method is compared to other state-of-the-art biomedical segmentation methods to corroborate its effectiveness.
Next, the segmented images are used as input to the architecture described below.
Let us say U is a model with m independent CNN models, which learn the representations from the same feature space, and an extra prediction block (stacking of dense layers) that yields an extra prediction. The prediction block is obtained by first concatenating all the representations learned by the individual CNN models. Figure 2 shows a specific example of the proposal. An image i is passed as input to the jth CNN model of U, and a chromosome indicates which tuples of CNN models and their weights determine the ensemble's architecture. Each CNN outputs the learned representation (r j i ) and a partial predictionô j i for the label of this sample by considering the weights required in the chromosome, e.g., the weights from DenseNet trained until epoch 33 (g 1 ¼ 33). Thereafter, the representations r j i learned by each CNN model are flattened, then concatenated and then passed to the prediction block of the model. A late fusion approach is used to concatenate all the representations learned by the CNN models ( Fig. 2c), which has proven to obtain better performance compared to merely combining the individual predictions [57]. Nevertheless, in Sect. 5.5 the prediction block is assessed versus its non-use. Thereafter, we freeze the weights from the individual CNN models and only the prediction block is trained during 20 epochs, providing an additional prediction (ô mþ1 i ) for the label of the sample. In this work, the prediction block is composed of Dense(512 ReLUs) ! Dense(256 ReLUs) ! Dense(128 ReLUs) ! Dense(1 unitsigmoid). Then, U predicts the label of an image by using a soft-voting procedure based on the individual predictions. Regarding efficiency, each CNN model was trained just once and its weights along every epoch of interest were stored in a hard drive.
Once the prediction for a given training sample is computed, the losses produced by each CNN model and the prediction block of U are calculated. The loss produced by the jth CNN model of U on the ith training image (denoted as L j ðiÞ) is computed by means of a binary cross-entropy. The goal is to iteratively minimize the prediction errors on the training samples. Mini-batch Gradient Descent (MGD) [58] can be applied to solve this optimization problem, since the first derivative of the loss function is well-defined. This algorithm consists in randomly splitting the training set in small batches in order to calculate the model error.
The above method has several advantages, such as the model update frequency is higher than batch gradient descent which allows for a more robust convergence, avoiding local minima; batch-based updates provide a computationally more efficient process than stochastic gradient descent and the split in small batches allows the efficiency of not having all training data in memory.

Ensemble model via genetic algorithm
In this work, a genetic algorithm (GA) was designed to find the optimal members of an ensemble model. Next, the different components and steps of the proposed method are explained in detail.

Individuals and chromosome codification
Let say that the population of the GA has q individuals fI 1 ; I 2 ; . . .; I q g, where the j-th individual (I j ) has a chromosome encoded as a list of integer values (from 0 to 150), as shown in Fig. 2a. A chromosome is composed of K genes, where g j k represents the k-th gene of the individual I j . Each gene index k is independent of the others and is related to a specific CNN model. Its value means the epoch of the CNN model to be considered as part of the ensemble; if g j k [ 0, the CNN model is selected; otherwise, it is not selected. Consequently, using this encoding, each individual of the population represents a full solution to the problem.

Fitness function
The fitness function used to evaluate the individual I j can be calculated as where L j k ðiÞ calculates the loss values for the ith image in the CNN model with index k, and L j mþ1 ðiÞ means the loss value of the prediction block; p means the number of times that g j k [ 0 is fulfilled; m represents the total number of CNN models encoded in the chromosome; and n is the total number of images. In summary, for each individual, the fitness function calculates the average loss value between the chosen CNN models and the prediction block; lower average means a more desirable individual.

Creation of the initial population
Maintaining a diverse population, especially in the early iterations of the algorithm, is crucial to ensure a good exploration of the search space. The training epochs (e) and the CNN models (m) determine the total number of possible combinations in the search space. To guarantee the diversity, the chromosome of each individual I j of the population is randomly created, but repeated individuals are not allowed. In this manner, it is possible to avoid the early convergence of the method to local minima. In this work, we have considered an individual as repeated when all genes g j k in the chromosome are identical. For example, let us say g A k and g B k are genes from chromosomes A and B, ; c a late fusion approach is used to merge all the representations respectively; k means the index related to a CNN model (e.g., DenseNet, InceptionV3 and Xception) and m means the maximum number of CNN models in each ensemble. A and B are identical if for every k ¼ 1; 2; :::; m, g A Having said the above, the individuals are repeated if they use a different CNN in any of their genes, i.e., two models are considered different if they have different architectures or the same architecture but with different weights.

Parent selection
The parents are selected by a tournament selection procedure to create the intermediate population [59]. A tournament size equal 2 was used in this work; the smaller the tournament size, the lower the selection pressure and the search space can be widely considered. To achieve this, two individuals are randomly selected. Then, the individuals are compared and the best individual is selected with replacement, i.e., this individual could be selected in further rounds. This process is repeated until the number of individuals is completed. As the generations increase, the algorithm can focus more on promising regions of the search space. Figure 3 shows the genetic operators applied. First, a custom Flat crossover [60,61] was performed with a crossover rate p c , where its general mechanism is adapted for an integer representation scheme. Flat crossover is applied in each pair of genes that are located in the same locus and represents the same CNN model. As a result, a random integer value is chosen from the interval ½g 1 k ; g 2 k . Once the new offspring is generated, an one-point mutator operator is applied with a probability p m . The mutation operator randomly selects a gene, then, the value of epoch is changed by a new value selected from a range of possible epochs, e.g., [0; 150]. Finally, the offspring was generated.

Genetic operators
It is noteworthy that valid individuals are always obtained after performing these operators.

Population update
In this work, a generational elitist algorithm [62] was used to update the population passed from one generation to the next one. As a result, the best individual in the last generation is the best individual of the evolution. To achieve this, the population in each generation keeps all new individuals, as long as the best parent is not better than all the children. In such cases, the best parent replaces the worst child. The worst child is determined by sorting the individuals of the new population according to the fitness value. After replacing the worst child, the new population replaces the previous one. At the end of the algorithm, the best individual in the last generation will be the best ensemble. In this work, the number of individuals generated in each generation is 100 and the population size is kept constant in order to alleviate the computational cost of CNN. However, these values could be tuned depending on the context, including the possibility of increasing/decreasing them. Regarding efficiency, a map with each explored individual and its corresponding fitness value is cached. First, the source images are preprocessed by using the method explained above. Second, given a set of CNN models the initial population of the GA is created. For example, if we have an individual with g j 1 ¼ 50 and another one is g jþz , the last one is obtained after training g j 1 for one more epoch. In general, each CNN model was trained just one time when needed and its weights along every demanded epoch were stored in a hard drive. In the worst case, we train each CNN model 150 epochs once. Third, each ensemble represented by a chromosome is trained for n epochs, bearing in mind that only the prediction block is updated. Then, the fitness of each individual of the current population is calculated by considering the loss obtained in the training images by the members of the ensemble. Four, the parents are selected, then, a crossover and a mutation are performed, and after that a new population is created. Five, the population of the GA is updated. Steps from three to five are repeated until one of the stopping criteria is satisfied. In this work, we have applied the most frequently used stopping criterion which is a specified maximum number of generations. In addition, we stopped the search when an individual had achieved the top performance. Finally, the ensemble model U is obtained.

Experimental study
This section describes the experimental study carried out in this work. First, the datasets and the experimental protocol are portrayed, and then, the experimental results and a discussion of them are presented. Table 1 shows a summary of the benchmark datasets. UDA [66], MSK [67], HAM10000 [65] and BCN20000 [63] datasets are included in the ISIC repository. The images are composed strictly of melanocytic lesions that are biopsyproven and annotated as malignant or benign. Also, images from BCN20000 would be considered hard-to-diagnose and had to be excised and histopathologically diagnosed. PH2 4 [69] comprises dermoscopic images, clinical diagnosis and the identification of several dermoscopic structures. The Dermofit Image Library 5 [64] gathers 1,300 focal high-quality skin lesion images under standardised conditions. Each image has a diagnosis based on expert opinion and like PH2, it includes a binary segmentation mask that denotes the lesion area. MED-NODE 6 [68] collects 170 non-dermoscopic images from common digital cameras; this type of image is very important to prove the models with data from affordable devices. SDC-198 7 [70] contains 6,584 real-world images from 198 categories to encourage further research and its application in real-life scenarios. Finally, DERM7PT 8 [21] is a benchmark dataset composed of clinical and dermoscopic images, allowing to assess how different it is to use dermoscopic images versus images taken with digital cameras.

Datasets
Only the images labeled as melanoma and nevus were considered, being in total 36,703 images. Most datasets present a high imbalance ratio (ImbR), up to ten times in the case of MSK-3, commonly hampering the learning process. The intra-class (IntraC) and inter-class (InterC) metrics show the average distances between images belonging to different classes, as well as between images belonging to the same class. Both metrics were computed using the Euclidean distance; each image i was represented as a vector. Then, the ratio (DistR) between these metrics showed that both distances are similar, which commonly indicates a high degree of overlapping between classes. Finally, the silhouette score (Silho) [71] was calculated, representing how similar an image is to its own cluster compared to other clusters. The results indicated that images were not well matched to their own cluster, and even samples belonging to different clusters are close in the feature space.

Experimental settings
Firstly, the proposed segmentation algorithm was applied over PH2 and DERM-LIB, where both datasets were manually segmented by expert dermatologists and they are commonly used as benchmarks due its quality [64,69]. The aim was to find how much the method is close to expert segmentation. Also, we compared the segmentation algorithm to U-Net and R2U-Net, which are CNN architectures designed for biomedical image segmentation. Bear in mind that the above CNN architectures need prior training. In order to obtain the segmentation mask of an image X, the CNN architectures were trained during 150 epochs with 10% of images as validation set and the rest as training set. The best model obtained during validation was applied on the image X and a segmentation mask was obtained. The above procedure was applied in both PH2 and DERM-LIB datasets.
The second phase aimed at analyzing the GA's hyperparameters on the model's performance. The model was dubbed as Genetic Algorithm Programming-based Ensemble CNN model (GAPE-CNN). It should be noted that the main aim of this work was to perform a preliminary study to assess the effectiveness that can be attained by using the proposed architecture in melanoma diagnosis. Consequently, only the hyperparameter values listed in Table 2 values were considered, because including more settings requires high computational resources during training CNN models. Three different number of generations were evaluated, where larger values can lead to a larger number of possible ensembles, but increasing the training cost. Three crossover probabilities were tested, where higher values increase the probability that the genetic information of parents can be combined to generate new individuals. Three mutation probabilities were also evaluated, where higher values allow to escape from local minima and add more exploration of the search space.
In the third phase, the proposal was compared to the following state-of-the-art CNN models that have previously been used in melanoma diagnosis: InceptionV3 [3], DenseNet [25], VGG16 [10], MobileNet [24], Xception [26] and NASNetMobile [72,73]. Table 2 shows the configuration used to train all the models: the learning rate ðaÞ was equal to 0.01 and it was reduced by a factor of 0.2 if an improvement in predictive performance was not observed during 10 epochs; the weights of the networks were initialized using Xavier method [74] in those cases where transfer learning was not present, e.g., the prediction block and baseline CNN models; a batch of size 8 was used due the medium size of the used datasets and the models were trained along 150 epochs. Mini-batch gradient descent was used for training the models, which is one of the most used optimizers for training CNNs. Despite its simplicity, it performs well across a variety of applications [75] and has been successfully applied for training networks in melanoma diagnosis [7,18,76]. In this work, a tuning process was not carried out and so the results could not be conferred to an over-adjustment. The datasets utilized in this work correspond to binary classification problems, so the cost function used for training the models was defined as the average of the binary cross-entropy along all training samples.
Data augmentation technique was mainly applied to tackle the imbalance problem in melanoma diagnosis by applying and combining random rotation-based, flip-based and crop-based transformations over the original images. Bear in mind that color-based transformations were not considered in order not to alter the color space, which is important for the diagnosis of melanoma [77]. As a result, the changes performed during data augmentation do not change the labels of the samples. The data augmentation process was assessed in both training and test data, which can increase the performance significantly [18]. After splitting a dataset into training and test sets, training data were balanced by creating new images until the number of melanoma images was equal to the normal ones, and the generated training images were considered as independent from the original ones. On the other hand, test data were expanded by randomly augmenting each test image at least ten times, but the generated images remained related to the original ones. Consequently, given an original test image X, the classes' probabilities for X and its related set of images S X were averaged to yield the final prediction; so any CNN model performs like an ensemble one, where the final probabilities for a test image was computed using a softvoting strategy.

Evaluation process
In order to evaluate quantitatively the segmentation method, several performance metrics are considered, including accuracy (ACC), F1-score, Dice coefficient (DC), Jaccard similarity (JS), and the U-Net score (UNS), which is the average between DC and JS. Accuracy and F1score are calculated using Eqs. 2, 3 and Dice coefficient and Jaccard similarity are calculated using Eqs. 4 and 5.
where TP, FP, TN and FN represent well-selected pixels, mis-selected pixels, well-discarded pixels and mis-discarded pixels, respectively; GT and SR mean the ground truth pixels and the segmentation result, respectively. The comparison between the segmentation methods is performed using UNS, summarizing DC and JS. The above metrics have been used before in the segmentation task of the ISIC-2018 contest.
Regarding the evaluation process of the CNN models, a 3-times 10-fold cross validation process was performed on the datasets, and the results were averaged across all fold executions. In each fold, Matthews Correlation Coefficient (MCC) was used to measure the predictive performance of the models. MCC is widely used in Bioinformatics as a performance metric [78], and it is specially designed to analyze the predictive performance on unbalanced data. MCC is computed as: where t p , t n , f p , and f n are the number of true positives, true negative, false positives, and false negatives, respectively. MCC value is always in the range ½À1; 1, where 1 represents a perfect prediction, 0 indicates a performance similar to a random prediction, and -1 an inverse prediction. Finally, non-parametric statistical tests were used to detect whether there was any significant difference in predictive performance. Friedman's test [79] was conducted in cases where a multiple comparison was carried out, Hommel's post hoc test [80] was employed to perform a multiple comparison with a control method, Shaffer post hoc test [81] was employed to perform pairwise comparisons, and finally Wilcoxon Signed-Rank test [82] was performed in those cases where only two individual methods were compared. All hypothesis testing was conducted at 95% confidence.

Software and hardware
The experimental study (training and test) was executed with Ubuntu 18.04, four GPUs NVIDIA Geforce RTX 2080-Ti with 11 GB DDR6 each one and four GPUs NVIDIA Geforce RTX 1080-Ti with 11 GB DDR5X each one. All the experiments were implemented in Python v3.6, and the CNN models were developed by using Keras  [84] as backend. In addition, the proposals were implemented in a web-based application in order to assist dermatologists in decision making. Amazon web services 9 were used in deployment time as platform, providing secure and resizable compute capacity in the cloud. Regarding efficiency, we converted all tensorflow models to Tensorflow Lite models. Tensorflow Lite is a lightweight library for deploying models using a minimum of computational resources. Table 3 shows the minimum amount of hard drive space, RAM and inference time required by each CNN model. The time was measured when processing one sample. Bear in mind that these values were analyzed in deployment time by using CPU and not GPU. As expected, MobileNet was the lightest-weight model which required the least amount of resources. In addition, the highest difference between using or not Tensorflow Lite is regarding RAM-this technology saves more than ten times compared to default deployment. As a result the models only needed 510 MB of RAM and one CPU core to perform inference. In this work was selected basic Amazon EC2 t2.micro, with one GB of RAM and one CPU core ($5:26/month), saving $36:72=month compared to Amazon EC2 t2.large ($41:98/month), which is the closest architecture capable of supporting more than five GB of RAM.

Results and discussion
In this section, the main results are presented. Additional material for supporting this work can be found at the available web page 10 . Table 4 shows the segmentation performance obtained from a small sample (594 images in total). Both datasets contain high-quality images and manual segmentation performed by expert dermatologists. Each image was evaluated applying UNS and other metrics. The proposed segmentation method obtained the best average performance in both datasets with 81% and 80% UNS in PH2 and DERM-LIB, respectively. In addition, Fig. 5 shows Friedman's test ranking, where the proposed method outperformed both U-Net and R2U-Net. Then, Friedman's test rejected the null hypothesis with a p-value \ 2.2E-16 in DERM-LIB dataset. The proposal was ranked first, and afterward, the Shaffer's post hoc test was conducted, where the proposal achieved significantly better performance compared to U-Net and R2U-Net. In addition, U-Net significantly outperformed R2U-Net. On the other hand, in PH2 no significant differences were encountered between the three methods, the Friedman's test did not reject the null hypothesis with a p-value equal to 6.476E-1. (The test was conducted with two degrees of freedom, and the Friedman's statistic was equal to 86.898E-2.) However, the proposal was ranked first and it attained 111% and 74% less variance compared to U-Net and R2U-Net, respectively. In this way, the CNN models can focus on relevant pixels, so easing the learning of better abstract and discriminative features for melanoma diagnosis. Also, the proposal did not require prior training, which is a clear advantage compared to those based on CNN models. In addition, the proposed segmentation method can be used with only a CPU, avoiding a significant amount of computational power for training like those using GPU. Finally, R2U-Net obtained a better performance compared to U-Net in PH2, but the opposite occurred in DERMLIB. A larger number of datasets are needed to validate the differences between both methods regarding skin lesion diagnosis. In addition, we corroborated that the CNN models trained with segmented data were able to achieve better performance. All models achieved the best performance using segmented data, and on average the best ones were MobileNet, DenseNet, and InceptionV3, in that order. Despite MobileNet was designed with efficiency in mind, it managed to overcome more complex CNN models. Nevertheless, DenseNet and InceptionV3 surpassed MobileNet in BCN20000, which is the largest and one of the more complex datasets. Skin image datasets are commonly small, which is where MobileNet usually achieved its best performance [85,86]. Although these results could be also caused by the use of default hyperparameters, all CNN models shared the same condition in order to avoid advantages between them. On the other hand, the number of epochs could be another cause. This is traditionally large The best value for each column is highlighted in bold typeface, e.g., MobileNet is the one that consumes less RAM, less hard drive space, and needs less inference time (the aim is to minimize all metrics) when training CNN models, often hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized. It is common to find examples in the literature and in tutorials where the number of epochs is set to 500, 1000, and larger. However, in this work we used 150 epochs, mainly because of the number of available images, and even then all models and the proposal achieved competitive performances. UDA-2 was the most challenging dataset. On average, the models achieved only 44% MCC; UDA-2 has the lowest Silhouette value (0.020), indicating a high overlapping level between classes and increasing the difficulty. The overall best performance was achieved in DERM-LIB, PH2 and HAM10000 datasets with a 90%, 85% and 79% MCC, respectively. The above datasets have the three highest Silhouette values, meaning that they have a low overlapping level between images, which makes easier the task. Table 5 and Fig. 6 summarize the results obtained after training the six CNN models with the segmentation method. The models in Fig. 6 are ordered from left to right according to the ranking computed by Friedman's test, where the proposed segmentation method achieved the best performance in all datasets, followed by TDA. Friedman's test rejected the null hypothesis on all CNN models. It was observed that overall all CNN models using the segmented data presented better results compared to its non-use, proving to be suitable for melanoma diagnosis. Shaffer's post hoc test found that InceptionV3 and Xception significantly improved their performance when segmented data was used compared to the baseline and transfer-learning combined with data augmentation. All baseline CNN models were significantly surpassed by the other techniques. Results showed the proposed segmentation method was able to improve the performance in all CNN models considered in this work. In the following sections, all CNN models are compared using segmented data, which already proved to achieve the best performance.

Analyzing the impact of three main hyperparameters
First of all, the advantages of using the extra prediction block were analyzed. The best epochs from each CNN model were combined and all possible ensembles were evaluated. In the end, 57 ensembles were obtained after discarding the empty set and the one-element sets. Then, the best one was selected as baseline (BL) and included in the comparison. In addition, it was obtained another baseline model applying the same above procedure by simply combining the predictions of the CNN models without using the prediction block (BLs). Table 6 shows the average MCC values on test sets comparing BL versus BLs. Results show that models using the extra prediction block obtained the best performance in all datasets compared to BLs. Finally, the Wilcoxon's test rejected the null hypothesis with a p-value equal to 2.189E-4, confirming the benefit and effectiveness of using the prediction block for combining the features from the individual CNN models. Henceforth, the rest of the experimental study was executed using the proposal architecture. Table 7 shows the average MCC results attained by the proposed genetic algorithm using different values of g, p c and p m . For each dataset, the best MCC value is highlighted in bold typeface. Overall, all models obtained a high performance, being the lowest and the highest average performance 96.9% and 97.8%, respectively. The models attained their best performance in MSK-3, PH2, DERM-LIB and SDC-198 datasets. Regarding the parameters that  The null hypothesis was rejected with adjusted p-value \ 2.2E-16 Table 5 Average MCC values obtained by using six state-of-the-art CNN models. The best MCC values by dataset were highlighted in bold typeface. ''B'' means a baseline performance (i.e., none technique was applied). ''TDA'' and ''STDA'' represent the model applying transfer learning and data augmentation, and combining segmented images with transfer learning and data augmentation, respectively; data augmentation was applied both in training and test sets.

Dataset
control the GA, the results showed the higher the number of generations, the better was the predictive performance. A higher number of generations could imply more exploration and higher possibilities to create offspring with a more accurate set of CNN models. Furthermore, the best results peaked when using a high mutation probability, denoting that a higher mutation probability could imply a better exploration of good sets of epochs. Finally, it was observed that 250 generations are good enough to obtain a performance which is significantly similar to the top performance obtained by using 400 generations. The Rank column shows the average ranking computed by Friedman's test, and this ranking found that g ¼ 400, p c ¼ 90% and p m ¼ 30% were the best settings for the GA, obtaining on average the best predictive performance. The Friedman's test rejected the null hypothesis with a p-value equal to 8.771E-15; Friedman's statistic was equal to 124.28 with 26 degrees of freedom. Afterward, the Hommel's post hoc test was conducted by considering the GA as the control method, which significantly outperformed 50% of all other considered configurations. Next, the best GA is compared to several state-of-the-art CNN models that have previously been used in melanoma diagnosis and at the same time are the baseline to build the ensemble. Table 8 shows the fold changes between the proposed model and each of the state-of-the-art CNN models. The BL ensemble overcame all individual state-of-the-art models, except in DERM-LIB, where DenseNet201 and MobileNet surpassed it by a small margin. It should be noticed that these models are the same considered by the GA to build the ensemble, so this comparison plays another role, which is to corroborate that ensemble learning is more suitable for melanoma diagnosis tasks compared to individual models. The results were very promising since the proposal GAPE-CNN achieved the highest MCC values in all datasets. It is noteworthy that the proposal achieved a predictive performance 165% and 130% higher than Xception and InceptionV3 on UDA-2 dataset, respectively. Also, it achieved high performance in the two largest datasets. The best average predictive performance was observed on DERM-LIB and PH2 datasets, where it was obtained 98% and 96% MCC values, respectively. The lowest overall predictive performance was observed on the dataset UDA-2 with 56% of MCC. However, the proposal was at least 62% and 21% better than all the individual CNN models and the BL, respectively. Overall, the worst performance was attained by VGG16. Table 9 summarizes the average MCC values on test sets comparing the baseline versus GAPE-CNN. The proposal surpassed BL in all datasets, and finally, the Wilcoxon's test rejected the null hypothesis with a p-value equal to 2.189E-4, confirming the benefit and effectiveness of using genetic algorithms for learning the set of CNN models to build an ensemble. Each independent CNN model provides a partial prediction, which is aggregated to yield a final decision. The architecture follows an ensemble approach, which has demonstrated to be an effective way to improve the learning process in many real-world problems [87]. Furthermore, the proposed model applies transfer learning and data augmentation techniques. Data are augmented not only at training stage, but also at test stage, thus allowing to attain a better predictive performance. The data augmentation process on both phases has shown to be an effective way to improve melanoma diagnosis [88], and also it is an excellent approach to cope with the imbalance data issue, and the high inter-and intra-class variability present in most skin image datasets. Figure 7 shows the average performance attained by the models by grouping the datasets in dermoscopic and nondermoscopic ones. The results showed that the proposed architecture attained the best performance whatever the type of image, denoting the effectiveness of the approach. All CNN models attained their best performance using dermoscopic images by a slight margin. NASNetMobile was the architecture most benefited from using dermoscopic images, with an improvement of about 8%. The biggest improvement was found comparing with VGG; the proposal outperformed VGG considering dermoscopic and non-dermoscopic images in 31% and 30%, respectively. The second best performance behind the proposal was achieved by the BL in both types of images; the proposal surpassed it by 5% in both types of images. To sum up, the results obtained through the experimental study revealed that GAPE-CNN was effective for diagnosing melanoma, attaining better predictive performance with respect to the state-of-the-art models.

Conclusions
It is clear that automatic melanoma diagnosis via deep learning models is a challenging task, mainly due to a lack of data and differences even between samples from the same category. We addressed these problems via a series of contributions. First, to preprocess data, we validated and applied an extension of the Chan-Vese algorithm. The segmentation masks indicated that the proposal achieved a better performance compared to state-of-the-art segmentation methods. Also, the results showed that all CNN models improved their performance by using segmented data. Second, the training and testing data were enriched using data augmentation, reducing overfitting and obtaining transformation-invariant models. Third, we increased the performance by applying transfer learning from the pretrained ImageNet. Also, transfer learning alleviated the requirement for a large number of training data. Finally, to further improve the discriminative power of CNN models, we proposed a novel ensemble method based on a genetic algorithm, which finds an optimal set of CNN models. An extensive experimental study was conducted on sixteen image datasets, demonstrating the utility and effectiveness of the proposed approach, attaining a high predictive performance even in datasets with complex properties. Results also showed the proposed ensemble model is competitive with regard to state-of-the-art computational methods.   The % columns represent the fold changes after comparing the proposed model with the others, e.g., GAPE-CNN attained a 21% of improvement over VGG16 on BCN20000 dataset. The best MCC values for each dataset are highlighted in bold typeface Future works will conduct more extensive experiments to validate the full potential of the proposed architecture, for example by considering a wide set of hyperparameters to be tuned as well as a larger number of datasets. Finally, it is noteworthy that our approach benefits from combining different CNN models-abstract features are combined in the extra prediction block, and individual predictions from the ensemble and the mentioned block are combined to obtain a diagnosis. This approach is not restricted to melanoma diagnosis problems and could be applied on other real-world problems in the future.
Ackowledgements This research was supported by the i-PFIS contract no. IFI17/00015 granted by Health Institute Carlos III of Spain. It was also supported by the University of Cordoba and the European Regional Development Fund, project UCO-FEDER 18 REF.1263116 MOD.A.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.