1 Introduction

Nowadays, cancer is considered one of the killing diseases worldwide [28]. Over the next fifty years, it is predicted that the total number of cases of cancer will double [60]. However, early detection of cancer helps in minimizing the number of deaths caused by it [58]. Skin cancer is considered the most popular form of cancer. In addition, melanoma is the most aggressive type of Skin Lesion (SL). It is a malignant tumor that develops from the pigment-containing cells called melanocytes [21]. Also, it has the most rapidly increasing mortality rate among skin cancers. According to the Skin Cancer Foundation, over 9,500 people in the US are diagnosed with skin cancer daily. For every hour, skin cancer causes the death of 2 people [58]. Additionally, in the United States, about 197,700 new melanoma cases are estimated to be diagnosed and about 7,650 people are expected to die of it in the year 2022. According to the statistics [58], the risk of getting melanoma for whites is about 2.6%, for blacks about 0.1%, and for Hispanics about 0.4%. Moreover, 90% of the deaths associated with cutaneous tumors are caused by melanomas [30]. Melanomas occur in different shapes, colors, and sizes. This is the reason behind the difficulty to deliver a comprehensive set of warning signs. Since detecting this type of cancer early is so crucial, the common signs, symptoms, and early detection strategies are discussed. Many moles, growths, and brown spots on the skin are harmless but it is not always the case. The ABCDEs and the Ugly Duckling sign can help in detecting melanoma. The first five letters of the alphabet (i.e., ABCDE) [62] can be explained as follows:

  • A is for Asymmetry: Most melanomas are asymmetrical (i.e., the two halves don’t match).

  • B is for Border: The borders usually are uneven and may have notched or scalloped edges.

  • C is for Color: Multiple colors in melanomas are a warning sign. The moles that have different shades of brown, tan, or black tend to be melanomas.

  • D is for Diameter: If a lesion is the size of an eraser (i.e., about 6 mm in diameter) or larger, it tends to be a melanoma.

  • E is for Evolving: Any change in shape, size, color, or elevation of a spot on the skin, is considered a warning sign of melanoma.

Another warning sign of melanoma is the Ugly Duckling [59]. The concept behind this strategy is based on the thought that most normal moles resemble one another, whereas melanomas stand out like ugly ducklings in comparison. Several factors increase the risk of developing melanoma. These factors include fair skin, excessive ultraviolet (UV) light exposure, the existence of many moles or unusual moles, a family history of melanoma, and a weakened immune system [22].

As mentioned before, accurate recognition of melanoma in its early stages can increase the survival rate significantly. Nevertheless, the manual detection of melanoma is time-consuming, complicated, and error-prone [3]. Additionally, it suffers from inter-observer variations and creates a huge demand for well-trained specialists. Manual scanning is considered a challenging task for many reasons [2]. Dermoscopy images might contain hair, blood vessels, lubricants, bubbles, as well as other disturbances that make the identification difficult. Low contrast between the outer tissue and the SL region makes the segmentation difficult. Different sizes, forms, and colors of SL might restrict the effectiveness of approaches in obtaining greater levels of accuracy.

Moreover, manual detection of melanoma is difficult for dermatologists, hence, it is worthwhile to develop a reliable automatic system for melanoma detection, advancing the accuracy and efficiency of dermatologists and pathologists. Computer-Aided Diagnostic (CAD) doctors save time and diagnose skin cancer more accurately. Recently, many publications based on Convolution Neural Network (CNN) are proposed to deliver fully automated systems [1, 18, 31]. Generally, the automatic detection can be classified into five steps (i.e., preprocessing, segmentation, feature extraction, classification, and deployment stages) [9]. In preprocessing, various image preprocessing techniques (e.g., color space correction, contrast augmentation, and noise removal) are applied to enhance skin cancer diagnosis [40]. Then, image segmentation is applied to extract regions of interest by segmenting cancerous areas from healthy areas [63]. It is used to provide accurate detection by isolating healthy tissue before extracting characteristics from lesions. Typically, the feature extraction is performed after the segmentation. This step lowers the size of data to make it more manageable. Moreover, preserving relevant data will make processing data faster and easier. Also, the detecting accuracy will significantly rise when the feature extraction is executed properly [49]. The purpose of the classification is to allocate a region of interest (ROI) from a picture of a specific class. Many melanoma detection reports employed feedforward and recurrent NNs. Different machine and deep learning approaches had been proposed to perform the task of skin cancer detection and segmentation (e.g., Fuzzy C-means [56], support vector machines SVM [20], Deep Neural Networks [24], and Recurrent Neural Networks [54]). The traditional machine learning (ML) methods require extracting the features manually [19]. Then, classification and segmentation methods are utilized to classify and segment the tumor using those features. Deep learning (DL) is a popular ML research branch that can effectively capture complex relations [8, 12].

The current study focuses on melanoma skin cancer detection. The classification is accomplished using 5 pre-trained convolution neural network (CNN) models. They are VGG16, VGG19, DenseNet169, DenseNet201, and MobileNet. Additionally, the Harris Hawks Optimization (HHO) algorithm is used to accomplish to gain the performance metrics of the state-of-the-art (SOTA). Segmentation is performed in two ways, namely manual segmentation and automatic segmentation. The manual segmentation is used when the dataset has no masks to use while the automatic segmentation approach is used, using U-Net models, to build an adaptive segmentation model. A novel manual segmentation approach called “HMB-MAS” is proposed in the current study.

This study proposes a hybrid approach using deep learning-based algorithms and Harris Hawks optimization algorithm (HHO) for skin lesion detection, classification, and segmentation. HHO algorithm has been used to optimize the hyperparameters of the applied models to enhance the performance of these models. Deep learning (DL) has been integrated with HHO to make it more robust and adaptive utilizing a proper balance between exploration and exploitation of the search space.

The main contributions of the current study can be recapped in the next points:

  • Proposing a hybrid approach using Harris hawks optimization algorithm and deep learning approach for melanoma detection.

  • Proposing a novel manual segmentation approach called “HMB-MAS” for extracting regions of interest.

  • Applying transfer learning using 5 pre-trained CNN models.

  • Applying Harris hawks’ optimization algorithm to optimize the hyperparameters and acquire the optimal configurations for each model.

  • Reporting the state-of-the-art performance metrics and comparing the current performance with the reported related studies.

The rest of this paper is organized as follows: A brief review of some publications introduced for melanoma detection is discussed in Section 2. The proposed methodology is described in detail in Section 3. Section 4 shows the experiments and reports the results. Conclusion, limitation, and future works are presented in Section 5.

2 Related studies

Recently, many pieces of research to detect melanoma automatically based on deep learning techniques are conducted. Several approaches and techniques were proposed in publications to aid in computer- aided diagnosis. Some of these are retrieved in terms of the newness and the highest performance. The discussed studies can be classified into A) Deep Learning-based techniques and B) hybrid techniques.

2.1 Deep learning-based techniques

Recently, a shift from the handcrafted feature vector to extract the features from the input automatically by the computer has become prevalent. This concept is the idea of numerous DL-based algorithms. Vani et. al. [64] presented a new technique to detect melanoma as well as a prediction tool. The suggested technique utilized DL techniques to predict the lesion of the afflicted region. Additionally, multiple performance metrics such as precision, accuracy, recall, and F1-score are used to assess the proposed approach. For improving the quality of images, pre-processing algorithms were applied. The segmenting between normal and diseased skin areas was done using an active contour segmentation procedure. The detection was done using Self-Organizing Map (SOM) and CNN classifiers. The suggested approach showed significant efficiency in melanoma detection with improved accuracy of 90% using a randomly generated collection of 500 pictures, 350 pictures as the training dataset, and 150 pictures as the validation dataset. To address the task of skin lesion segmentation, Wu et. al. [65] developed a Feature Adaptive Transformers Network (FAT-Net). It is an effective and innovative two-stream net with a feature adaptation transformer. Unlike the typical CNN-based encoders, the proposed transformer encoder achieves segmentation using a new sequence-to-sequence predicting technique. Testing is performed on four public datasets (i.e., the International Skin Imaging Collaboration (ISIC) 2016, ISIC 2017, ISIC 2018, and PH2 datasets). Also, extensive experiments were performed to verify the achieved accuracy. Their reported experiments and results showed the superiority of the proposed FAT-Net in terms of accuracy and testing speed.

A modified U-Net approach for segmenting skin lesions in medical imagery to perform melanoma detection is presented by Anand et. al. [5]. The PH2 dataset was used for validating the proposed network. The suggested U-Net was tested by utilizing the Stochastic gradient descent (SGD), Adadelta, and Adam optimization techniques. The suggested approach achieved an accuracy of 96.27%, a Jaccard Score of 96.35%, and a Dice Coefficient of 89.01%. Agrahari et. al. presented a multi-class classification for skin cancer [4] with high efficiency equivalent to a dermatologist. The model is constructed using a pre-trained MobileNet network. The ISIC dataset HAM10000 (Human Against Machine with 10000 training images) was used for training and evaluating the proposed method. A category accuracy of 80.81%, a top-2 accuracy of 91.25%, and a top-3 accuracy of 96.26% were reported in detecting skin lesions.

Nersisson et. al. in [46] proposed a You Only Look Once (YOLO) based-Convolutional Neural Network (CNN) technique to detect skin lesions. The characteristics such as texture and color information collected from the lesion region were combined with the features acquired from the CNN. Then, these characteristics were sent to a Fully Connected Deep Neural Network (FCDNN) that was trained with the international symposium on biomedical imaging (ISBI) Melanoma dataset. The experimental results showed that the suggested approach could enhance the classification accuracy of skin lesions compared to state-of-the-art techniques. The performance improvements included accuracy of 94%, a precision of 85%, recall of 88%, and AUC of 95%.

Kaur et. al. [42] proposed a Deep CNN (DCNN) with an efficient melanoma classification technique to detect benign and malignant melanoma. Dermoscopy images were gathered from the International Skin Imaging Collaboration datastores (e.g., ISIC 2016, ISIC2017, and ISIC 2020). Accuracy, recall, specificity, precision, and F1-score were used to assess the classifier performance. For the ISIC 2020, ISIC 2017, and ISIC 2016 datasets, the suggested classifier achieved an accuracy rate of 90.42%, 88.23%, and 81.41%, respectively. A novel Residual DCNN for melanoma detection is proposed by Hosny and Kassem [34]. Six well-known melanoma datasets (e.g., PH2, ISIC2016, ISIC2018, ISIC2017, MED-NODE (MElanoma Diagnosis from NOn-DErmoscopic images), and DermIS and Quest) were used to train and evaluate the proposed model. Three separate experiments were used to evaluate the proposed neural network accuracy. The first one was conducted using the original dataset pictures with no segmentation or pre-processing. For the second one, the segmented pictures were used to evaluate the suggested model. Finally, the output training model of the second experiment is preserved and utilized as a pre-trained model in the third one. The suggested RDCNN surpassed the current DCNN in terms of overall performance.

Shorfuzzaman et. al. [55] presented a CNN-based stacked ensemble approach for the early detection of melanoma. The transfer learning principle was applied in stacking ensemble learning, where many CNN sub-models were constructed. The classification results were generated by a new model called a meta-learner, which incorporated all the sub-model results. Experimental results demonstrate the performance of the proposed approach with an accuracy of 95.76%, a sensitivity of 96.67%, and an AUC of 95.7%. Even though several publications have focused on identifying melanoma, Kumar et. al. [44] narrows the scope in determining the levels of skin cancer by applying DL techniques. Their proposed algorithm enhanced the accuracy of skin lesion detection and could provide suitable therapy based on the cancer level. State-of-the- art methods such as SVM, Random Forest (RF), and Artificial NN (Neural Networks), as well as the suggested fusion-based DL approach, were tested. Different evaluation criteria such as Accuracy, Mean Square Error (MSE), Precision, Peak Signal to Noise Ratio (PSNR), and Recall were used to evaluate the suggested approach and monitor its validity. The proposed approach achieved an accuracy of 97%. Elansary et. al. [26] presented a CNN-based skin cancer classification method. The proposed model was tested using ISIC 2020. Furthermore, the dataset was extremely imbalanced, with fewer than 2% aggressive instances. For addressing imbalanced data, random oversampling and data augmenting were applied. Furthermore, to construct a classifier that could learn from all classes identically while focusing more on the minority class, the class weight approach was utilized to assign a weight value to each category. EfficientNet-B6 was proposed for melanoma detection. The reported findings revealed that the recommended system accuracy rate was 97.84%.

2.2 Hybrid techniques

Generally, combining two different machine learning techniques resulted in a hybrid one. For example, a hybrid classification model can consist of one unsupervised learner to preprocess the training data and one supervised classifier to learn the clustering result. Additionally, many hybrid techniques integrate CNN with different classifiers to enhance melanoma detection accuracy. A structured scheme for analyzing and assessing the possibilities of melanoma is proposed by Srividhya [61]. The proposed approach performs skin lesion segmentation and feature extraction followed by good performance machine learning techniques to accomplish intensity level adjustment during the pre- processing phase. The correlation parameters were significant in determining the success of recognizing skin tumors and acted as a metric to define the distinct feature set for training CNN to detect melanoma. Sensitivity and Identifying Efficiency (IE) were determined to be 93.3% and 95%, respectively. Gazioglu and Kamasak [31] aimed to study the impact of exterior items (i.e., ruler and hair) and visual quality (i.e., noise, blurring, and brightness) in influencing the result of melanoma detection. Four pre-trained CNN models, namely, AlexNet, ResNet50, DenseNet121, and VGG16, were employed by the authors. DenseNet obtained the best accuracy in blurred and noisy datasets. Accuracies of 89.22% for hair set, 86% for ruler set, and 88.81% for none set were reported.

Cao et. al. [23] proposed a Mixed Skin Lesion MSL image using Mask R-CNN (MSLP-MR) model to perform melanoma detection. Furthermore, they established a Mask-DenseNet melanoma detection system. Their approach merged the concept of ensemble learning with MSLP-CNN to add mask segmentation and integrate many classifiers for weighted predictions. Their proposed approach was validated using the ISIC dataset. The reported accuracy was 90.61%, sensitivity was 78%, specificity was 93.43%, and the Area Under the Curve (AUC) was 95.02%.

In Ilkin et. al. [38], the SVM classifier and a bacterial colony optimization algorithm were integrated to create a hybrid classification to detect melanoma accurately. The proposed technique was tested on two separate datasets, ISIC and PH2. According to ISIC and PH2 results, AUC values of 98%, and 97%, respectively, were achieved. A Fully Transformer Network (FTN) was proposed by He et. al. [32] to learn long-range context data for skin lesion diagnosis. FTN is a hierarchy transformer that uses the Spatial Pyramid Transformer to compute features. Since it integrated a spatial pyramid pooling module with multi-head attention, it requires computing costs. Training and testing were done on ISIC 2018 dataset. Comparing FTN with state-of-the-art techniques, FTN was more efficient and capable of achieving superior results. Patil and Bellary [47] presented two approaches for categorizing the phases of melanoma. In the first approach, melanoma cancer was classified into two stages, namely, stage 1 and stage 2. In the second one, melanoma was classified into three stages, namely, stage 1, stage 2, and stage 3. The suggested framework utilized a CNN method with a loss function of Similarity Measure for Text Processing. Experiments with several loss functions were presented and compared to the suggested loss function. The reported results showed that the suggested technique outperformed many different loss functions.

An approach for classifying skin lesions as either benign or malignant was provided by Sayed et. al. [53]. The ISIC 2020 dataset was used to evaluate the suggested framework. Their research provided a strategy to overcome the extreme category imbalance that existed in the used dataset based on data augmentation and random over-sampling. In addition, a novel hybrid form of CNN architecture with bald eagle search optimization was suggested. The optimization method identified the best settings for the hyperparameters of SqueezeNet. The suggested melanoma cancer predictive algorithm had a 98.37% accuracy rate, 96.47% specificity, 100% sensitivity, 98.40% f-score, and 99% AUC. Gaonkar et. al. [29] proposed a hybrid technique using SVM and Radial Basis Function Network (RBFN) to minimize the complexity and processing time of the classification process. Three main steps (i.e., lesion segmentation, extraction of features, and classification) were applied. SVM and RBFN achieved an accuracy of 87% and 91%, respectively. Specificity and Sensitivity for SVM were 82% and 92%, respectively. While for RBFN, specificity and sensitivity were 90% and 93% respectively.

2.3 Plan of solution

Skin melanoma detection and segmentation is an important and difficult task in medical imaging applications. In the current study, different CNN architectures are utilized to perform the task of skin melanoma classification and segmentation. After evaluating these architectures, transfer learning and Harris Hawks optimization algorithm are used to tone and optimize training parameters and hyperparameters. Finally, different experiments using different performance metrics are performed to report the best architectures.

3 Methodology and suggested approach

The proposed framework for the classification and segmentation phases is shown in Fig. 1. As mentioned in the introduction, this study proposes a hybrid algorithm to perform melanoma detection using different CNN architectures (i.e., VGG16, VGG19, DenseNet169, DenseNet201, and MobileNet) and a metaheuristic optimizer named Harris Hawks optimization algorithm (HHO). In summary, the images are accepted by the input layer. Then, dataset augmentation, scaling, and balancing are employed to preprocess the images. The images can be classified after using the proposed models. The transfer learning and meta- heuristic optimization phase occur. In the following subsections, a discussion about these phases is presented.

Fig. 1
figure 1

The proposed framework for classification and segmentation

3.1 Materials

In the current study, two public datasets are acquired and used from Kaggle (a public dataset repository). For each dataset, a summary of the number of classes, number of images, and the number of images per class are discussed in Table 1. Additionally, Fig. 2 shows samples from the used datasets. The first dataset is named “Melanoma Skin Cancer Dataset of 10000 Images” [39]. It is composed of 10,605 images and partitioned into 2 classes. The second one is named “Skin Cancer ISIC” [41]. It is composed of 2,248 images and partitioned into 9 classes.

Table 1 Summarization of the utilized datasets
Fig. 2
figure 2

Samples from the utilized datasets

3.2 Data scaling

The pre-processing phase is responsible for improving the images and removing the unwanted parts from them [14]. Generally, when dealing with DL neural networks, data scaling is a suggested pre-processing step [7]. It is used to translate data into a certain scale, such as [0: 100] or [0: 1]. It is typically employed when procedures are based on how widely apart data points are. The current study employs five scaling approaches. Normalization, standardization, max–min scaling, max-absolute scaling, and robust scaling are the five methods. Normalization rescales the dataset from its original range so that all values are now ∈ [0, 1] range. As stated in Eq. 1, a value can be normalized. The value 255 is commonly used as the maximum pixel value for grayscale or RGB (in each channel) images, and so max(input) = 255. Standardization is the process of rescaling the value distribution so that the mean of the observations becomes 0 and the standard deviation becomes 1. This can be accomplished by removing the average (or centering) of the data. It is beneficial and necessary when the dataset contains input values of varying scales. It is assumed that the observations have a correct mean and standard deviation and fit the Gaussian distribution (i.e., bell curve). As stated in Eq. 2, a value is standardized. The min–max normalization process entails rescaling the feature range to scale it in [0, 1] or [− 1, 1]. The target range is determined by the type of data. The scale of a value is indicated in Eq. 3. Finding the absolute maximum value of the feature in the dataset and dividing all the values in the column by that maximum value is the max-absolute scaling approach. When this is applied to all numerical columns, all values will fall between -1 and 1. The technique’s key disadvantage is that it is susceptible to outliers. As stated in Eq. 4, a value is scaled. Robust scaling is used to scale the data according to the quantile range (i.e., IQR) by removing the median. The IQR ranges between the 1st quartile (i.e., 25th quantile) and the 3rd quartile (i.e., 75th quantile). This method uses statistics and is robust to outliers. As stated in Eq. 5, a value can be scaled. Determining which method to use is applied during the hyperparameters optimization process [15].

$$output=\frac{input}{\mathrm{max}(input)}$$
(1)
$$output=\frac{input-mean}{\mathrm{std}}$$
(2)
$$output=\frac{input-\mathrm{min}(input)}{\mathrm{max}\left(\mathrm{input}\right)-\mathrm{min}(\mathrm{input})}$$
(3)
$$output=\frac{input}{|\mathrm{max}(input)|}$$
(4)
$$output=\frac{input-median}{{\mathrm{Q}}_{3}-{Q}_{1}}$$
(5)

3.3 Segmentation phase using “HMB-MAS”

The current study utilized the manual and automatic segmentation approaches. The manual segmentation is used when the dataset has no masks to use while the automatic segmentation approach is used, using U-Net models, to build an adaptive segmentation model. Figure 3 shows the suggested segmentation approach named “HMB-MAS”. The doctors and specialists can take part in the process and interpret the output masks to withdraw the wrongly masked images. Figure 4 shows the flow of the manual segmentation process. The process starts by accepting the image and applying linear contrast correction. Simply, it can be considered as a min–max scaling to the input image.

Fig. 3
figure 3

The processes of manual and automatic segmentation, named “HMB-MAS”, utilized in the current study

Fig. 4
figure 4

The process of manual segmentation utilized in the current study

After that, the image is converted to the gray scale domain and Gaussian blurring is applied using a kernel sized (5 × 5). Then, the adaptive threshold takes place using a Gaussian-weighted sum of the neighborhood values minus C where C is set to 3. The size of the neighborhood area is set to 125. The pixels above the threshold are set to be black and white otherwise. The image is eroded and dilated after that using 3 iterations. The contours are retrieved and sorted according to the area size in descending order and the largest contour is selected. A final step is to apply a bitwise AND between the mask and the corrected image to extract the region of interest (ROI). Figure 5 shows three examples from different images. For the automatic segmentation process, the end-to-end U-Net is utilized in it [51]. U-Net consists of a set of decoders following a set of encoders [37, 50]. The used configurations are (1) hidden activation function is GeLU, (2) VGG19 backbone, (3) five filter blocks with sizes [64, 128, 256, 512, 1024], (4) Sigmoid output activation function, (5) Adam optimizer, (6) binary cross entropy loss function, (7) batch size of 4, (8) 10 epochs, (9) applying batch normalization, and (10) freeze the backbone without fine-tuning.

Fig. 5
figure 5

Example from three images after applying the manual segmentation process using “HMB-MAS”

3.4 Classification phase

The classification phase is qualified for classifying the images into their categories [10, 11]. In the current phase, five architectures are used to perform the task of melanoma detection (i.e., VGG16 [57], VGG19 [57], DenseNet169 [36], DenseNet201 [36], and MobileNet [35]). The following subsections describe each architecture briefly.

3.4.1 VGG model architecture

Recently, VGG architectures are popular pre-trained models because of their simplicity [19]. The VGG architecture is formed up of both convolution and pooling layers incorporated together. For VGG16, 13 convolution layers, 5 max-pooling layers, and 3 dense layers with two 4096 sized layers are included. Hence, the depth of the network is 16 layers without the problem of vanishing gradients. The activation function used by all the hidden layers is non-linear ReLU whereas a SoftMax function is used by the final layer. Differently from VGG16, the VGG19 network depth is 19 layers. There are 16 convolution layers, 5 MaxPool layers, and 3 dense layers with two 4096 sized layers [57]. Like VGG16, the ReLU activation function is utilized by all the hidden layers of VGG19 utilize and the SoftMax function is used by the final layer.

3.4.2 DenseNet model architecture

DenseNet is a CNN architecture that links each layer to every other layer in a feed-forward fashion. Traditional CNN with L layers has L connections (i.e., one between each layer and its subsequent layer). DenseNet model has L × (L + 1) connections [36]. In each layer, the feature maps produced by the prior layers are used as inputs. DenseNets reduce the problem of the vanishing gradient, motivate feature reuse, support feature spread, and significantly reduce the number of parameters. The DenseNet-169 and DenseNet-201 models are two of the DenseNet group of models designed to perform image classification. They have been pre-trained on the ImageNet image database. The main difference between these two models is the size and accuracy of the model. DenseNet-201 is 201 layers deep while DenseNet-169 is 169 layers deep. Hence, the DenseNet-201 is larger at over 77 MB in size while the densenet-169 model is 55 MB in size.

3.4.3 MobileNet model architecture

MobileNet is designed to be aware of the restricted resources for an on-device, in addition, to efficiently improving accuracy. To solve the problem of the resource limitations of the computing devices, MobileNet has low-latency and low-power models [35]. MobileNet applies separable filters (i.e., a mixture of a depth- and a point-wise convolution). Filters with a size of 1 × 1 are operated for minimizing the computational overheads of the normal convolution operation. Hence, the network is lighter concerned with the size and computational complexity. The MobileNet includes 4.2 million parameters with input of size 224 × 224 × 3.

3.5 Learning and optimization

In this study, different deep learning training hyperparameters (i.e., (1) batch size, (2) dropout, (3) the pre-training TF model learn ratio), (4) the parameters (i.e., weights) optimizer, (5) the scaling technique, (6) Is augmentation needed to be applied or not, (7) rotation range, (8) width shift range, (9) height shift range, (10) shear range, (11) zoom range, (12) horizontal flipping, (13) vertical flipping, and (14) brightness range, require to be toned to acquire the SOTA performance. The last eight hyperparameters are optimized in the case of the sixth hyperparameter is true.

If the sixth hyperparameter is false, only 6 hyperparameters are required to be optimized. So, using the grid search approach will lead to O(n6) running complexity. For that, the meta-heuristic optimization approach using the Harris Hawks optimization algorithm (HHO) is followed among several other optimization techniques [6, 13, 25, 27, 43, 45, 48, 52].

3.5.1 Harris hawks optimization algorithm

Harris Hawks Optimization Algorithm (HHO) is a population-based optimization algorithm based on what is called surprise pounce (i.e., the hunting technique and collaborative behavior of Harris’s hawk in nature) [33]. In this strategy, the hawks are attacking collaboratively from many directions to pounce on the prey. Based on the escaping patterns of the victim, Harris’s hawk exposes several chase styles. The exploration and exploitation strategies proposed by HHO are motivated by exploring the prey, the surprise pounce, and Harris’s hawk’s unique attacking technique [16]. Initially, the initial population-based chaotic maps, all parameter values, objective function, and the search space are defined.

3.5.2 Exploration phase

In the exploration phase, all Harris’ hawks are candidate solutions. In each iteration, for all possible solutions, the fitness value is computed based on the intended prey. the exploration performances of the Harris hawks in the search space are mimicked by applying two approaches specified in Eq. 6 where X (t + 1) is the position of each candidate in the next iteration, Xrabbit is the position of the prey, and Xrand is the currently chosen random. X (t) is the current position vector of Hawks, UB and LB are the upper and lower bounds of variables, the Xm is the average number of the solutions, and the r1, solution r2, r3, r4 and q are random scaled factors within [0, 1] that is updated in each iteration.

This approach is used to generate the positions of Hawks within (UB − LB) bounds utilizing two rules. The first one is to construct the solutions using the randomly selected hawks from the current population. The other is to construct the solutions using the average position of Hawks, the prey location, and random scaling factors. Once the value of r4 is close to 1, the randomness of the rule will be increased. In this rule, a randomly scaled movement length is added to LB. This is applied to explore different areas of the feature space using more diversification techniques. The average position of hawks is formulated in Eq. 7 where Xm(t) is the current average solutions number, N represents all possible solutions, and Xi(t) indicates the current location of each solution that is generated based on chaos theory. Usually, in Eq. 6, the first rule is utilized when the information from the random hawks is used to hunt the prey. While the second one is used when the best solution is shared by all hawks and the best one is employed.

$$x\left(t+1\right)=\left\{\begin{array}{l}{x}_{rand}\left(t\right)-{r}_{1}\times \left|{x}_{rand}\left(t\right)-{2\times r}_{2}\times X\left(t\right)\right| q\ge 0.5\\ {(x}_{rabbit}\left(t\right)-{x}_{m}\left(t\right)){-r}_{3}\times (LB+{r}_{4}\times (NB-LB)) q<0.5\end{array}\right.$$
(6)
$${X}_{m}\left(t\right)=\frac{1}{N}\sum_{i=1}^{N}{X}_{i}\left(t\right)$$
(7)

3.5.3 Transition from exploration to exploitation

This step is concerned with the movement of the algorithm from exploration to exploitation phase, based on E (i.e., the prey energy). HHO assumes that the escaping actions gradually decrease the energy. E0 is the initial energy decrease from [1, − 1], represented in Eq. 8 where T demonstrates the maximum iterations, and t is the current one.

$$E=2\times {E}_{0}\left(1-\frac{t}{T}\right)\to {E}_{0} \in \left[-\mathrm{1,1}\right]$$
(8)

3.5.4 Exploitation phase

In the exploitation phase, four approaches are applied to accomplish parameter sets. The position determined by the exploration phase is the base for this phase. The hawks attacking strategy is imitated using four approaches, namely, hard besiege, soft besiege, hard besiege with progressive rapid dives, and soft besiege with progressive rapid dives. These approaches are depending on two variables that define the executed approach (i.e., r and |E|) where r is the escaping probability and |E| refers to the prey escaping energy. When r < 0.5, the possibility for the prey to successfully escape is higher. Contrastly, r ≥ 0.5 refers to unsuccessfully escape.

Soft besieg

For this approach, in which |E|≥ 0.5 and r ≥ 0.5, the prey has some energy to escape; hence, the hawks are softly surrounding the prey making it lose more energy. Soft besiege is represented mathematically in Eqs. 9, 10 and 11 where ∆X (t) indicates the current difference between the prey position vector and the current location, and J represents the prey jump power and r5 is a random variable.

$$X (t + 1) = \Delta X (t) - E \times |J \times Xrabbit - X (t)|$$
(9)
$$\Delta X (t) = Xrabbit - X (t)$$
(10)
$$J = 2 \times (1 - r5) \to r5 \in [0, 1]$$
(11)

Hard besiege

For this strategy, in which |E|< 0.5 and r ≥ 0.5, the prey is tired and has a weak chance of escaping. Hence, the hawk surrounds the prey to perform the surprise pounce. Thus, the position of the hawks is updated using Eq. 12.

$$X (t + 1) = Xrabbit (t) - E \times |\Delta X (t)|$$
(12)

Soft besiege with progressive rapid dives

In this strategy, where r < 0.5 and |E|≥ 0.5, the prey has energy to escape. The hawk smartly moves around the prey and dives patiently before doing the surprise pounce. The position of the hawks is updated using two steps. In the first one, the next move of the prey is estimated so the hawks can move toward them using the Eq. 13.

$$Y = Xrabbit (t) - E \times |J \times Xrabbit (t) - X (t)|$$
(13)

For the second step, the hawk decided whether to dive or not by comparing the possible result and the previous dive. If it is not, the Levy Flight (LF) concept is used by the hawks to produce irregular dive using the Eq. 14 where Dim is the dimension of solutions and S is a vector of size 1 × dim. LF is the Levy Flight that can be calculated using Eq. 15 where β is a constant set to 1.5 and u, v are values randomly set within [0, 1]. σ is calculated using Eq. 16. Thus, updating the positions of hawks with progressive rapid dives can be formulated in Eq. 17 where Y and Z are done using Eqs. (13) and (14), and both refer to the location of the next iteration.

$$z=Y+S\times LF\left(Dim\right)$$
(14)
$$LF\left(x\right)=0.01\times \frac{u\times \sigma }{{|v|}^{\frac{1}{\beta }}}$$
(15)
$$\sigma =(\frac{\left(1+\beta \right)\times \mathrm{sin}\left(\frac{\pi \times \beta }{2}\right)}{(\frac{1+\beta }{2})\times \beta \times {2}^{\left(\frac{\beta -1}{2}\right)})}{)}^{\frac{1}{\beta }}$$
(16)
$$X\left(t+1\right)=\left\{\begin{array}{c}Y,\;\;\;if\;F\left(Y\right)\;<F(X\left(t\right))\\Z,\;\;\;if\;F\left(Z\right)\;<F(X\left(t\right))\end{array}\right.$$
(17)

Hard besieges with progressive rapid dives

In this strategy, where r < 0.5 and |E|< 0.5, there is no energy for prey to escape, and the hawks conduct rapid dives before performing a surprise pounce. The movement of the hawks is formulated in Eq. 17 where Y is calculated Eq. 18 and Z is updated using Eq. 19.

$$Y = Xrabbit (t) - E \times |J \times Xrabbit (t) - Xm(t)|$$
(18)
$$Z = Y + S \times LF(Dim)$$
(19)

3.6 Pseudocode of the proposed model

The Pseudocode of the proposed approach for learning and parameter optimization using HHO is presented in Algorithm 1. The process continually repeats until the stopping condition—which is usually the number of iterations—is met.

figure a

Algorithm 1: Pseudocode of the proposed model

4 Experiments and discussions

The experiments are grouped into two categories: (1) segmentation experiments and (2) optimization, learning, and classification studies. In general, Python is employed for scripting in the current study. The learning and optimization environment is Google Colab (with GPU). Tensorflow, Keras, NumPy, OpenCV, Pandas, and Matplotlib are some of the most often used Python packages [17]. The dataset split ratio is set at 85% (for training and validation) and 15% for testing. Random dataset shuffling is used. For classification, the scans are scaled to (100 × 100 × 3) and for segmentation, to (256 × 256 × 3) in RGB. The configurations of the trials are summarized in Table 2.

Table 2 The utilized experiments configurations

4.1 Segmentation experiments

As discussed earlier, the current study suggested the “HMB-MAS” algorithm for melanoma segmentation using the images only without the input masks. The manual process finds the largest contour but what if the contours are very small? The current study solved that issue by limiting the size of the contour to be 5% of the original image size. In other words, if the largest contour is less than 5% of the original image size, it is ignored and bypassed. The process is applied to the “Skin Cancer ISIC” dataset as it contains different classes. The reported success rate is reported in Table 3. Table 4 shows that the U-Net reported results concerning the “Skin Cancer ISIC” dataset. The used performance metrices are loss, accuracy, Mean Absolute Error, Mean Squared Error, Mean Squared Logarithmic Error, and Root Mean Squared Error. Loss defines the performance of the model in each iteration while accuracy is the ratio of rightly estimated observations to overall observations. Mean Absolute Error, Mean Squared Error, Mean Squared Logarithmic Error, and Root Mean Squared Error are measures of the difference between the actual output and the true values. The first term gives the absolute value of the error while the second term gives the squared mean of the error. Mean Squared Logarithmic Error is the squared mean of the natural logarithm of error and Finally Root Mean Squared Error is the root of the Mean Squared Error. Table 4 shows that the U-Net is optimized well on the data as it reports errors below 20%. Figure 6 shows samples from applying the auto segmentation using U-Net. In it, the green line is generated from the manual segmentation while the blue line is from the U-Net model after learning and optimizing it.

Table 3 The success rate of the manual segmentation process using “HMB-MAS”
Table 4 The reported configurations and performance metrics using the U-Net segmentation model in the "HMB-MAS" suggestion
Fig. 6
figure 6

Samples from applying the auto segmentation using U-Net

4.2 Learning, classification, and optimization experiments

The current subsection presents and examines the learning and optimization experiments performed with the previously described pre-trained CNN models (i.e., VGG16, VGG19, MobileNet, DenseNet201, and DenseNet169) and HHO metaheuristic optimizer. As previously stated, the number of epochs is set to 5. The number of HHO iterations and the population size are both set to ten.

The used performance metrices are accuracy, precision, sensitivity, specificity, F1-score, AUC, cosine similarity, Hinge loss, and Squared hinge loss. Accuracy is the ratio of rightly estimated observations to overall observations while precision is the ratio of the rightly estimated positive observations to the overall estimated positive observations. Sensitivity is the ratio of rightly estimated positive observations to overall observations. F1-score is calculated as the weighted ratio of Recall and Precision. Specificity indicates the number of cases identified as negative from all the real negative cases. AUC is the area under the ROC Curve. Cosine Similarity is the cosine of the angle between 2 vectors to find if they have the same orientation. Hinge loss is the topmost value between 0 and the outcome of comparison between prediction and the true target of prediction subtracted from 1. Squared hinge loss is a square of the hinge loss.

The summarization of the best-reported results related to the performed CNN experiments using the “Melanoma Skin Cancer Dataset of 10000 Images” dataset are given in Tables 5, 6, 7, 8 and 9. For the different CNN models, Table 5 presents the selected hyperparameters, Table 6 presents the selected data augmentation hyperparameters, Table 7 presents the loss and confusion matrix values, Table 8 presents the performance metrics required to be maximized, and Table 9 presents the performance metrics required to be minimized. It shows that the best reported scores from the applied CNN experiments are 97.08%, 98.50%, 95.38%, 98.65%, 96.92% in terms of overall accuracy, precision, sensitivity, specificity, and F1-score, respectively by the DenseNet169 pre-trained model. Applying augmentation is recommended by 4 experiments. The summarization of the best-reported results related to the performed CNN experiments using the “Skin Cancer ISIC” dataset are given in Tables 10, 11, 12, 13 and 14. For the different CNN models, Table 10 presents the selected hyperparameters, Table 11 presents the selected data augmentation hyperparameters, Table 12 presents the loss and confusion matrix values, Table 13 presents the performance metrics required to be maximized, and Table 14 presents the performance metrics required to be minimized. It shows that the best reported the best reported scores from the applied CNN experiments are 96.06%, 83.05%, 81.05%, 97.93%, 82.03% in terms of overall accuracy, precision, sensitivity, specificity, and F1-score, respectively by the MobileNet pre-trained model. Applying augmentation is recommended by 3 experiments. Figure 7 presents a graphical summarization of the confusion matrix results concerning the “Melanoma Skin Cancer Dataset of 10000 Images” dataset. Figure 8 presents a graphical summarization of the confusion matrix results concerning the “Skin Cancer ISIC” dataset. Figure 9 presents the GradCam for five classes from the “Skin Cancer ISIC” dataset using the MobileNet pretrained model. From it, the bright regions represent the promising features.

Table 5 The selected hyperparameters of the different CNN models for “Melanoma Skin Cancer Dataset of 10000 Images” dataset experiments
Table 6 The selected data augmentation hyperparameters of the different CNN models for “Melanoma Skin Cancer Dataset of 10000 Images” dataset experiments
Table 7 The loss and confusion matrix values of the different CNN models for “Melanoma Skin Cancer Dataset of 10000 Images” dataset experiments
Table 8 The performance metrics required to be maximized for the different CNN models for “Melanoma Skin Cancer Dataset of 10000 Images” dataset experiments
Table 9 The performance metrics required to be minimized for the different CNN models for “Melanoma Skin Cancer Dataset of 10000 Images” dataset experiments
Table 10 The selected hyperparameters of the different CNN models for “Skin Cancer ISIC” dataset experiments
Table 11 The selected data augmentation hyperparameters of the different CNN models for “Skin Cancer ISIC” dataset experiments
Table 12 The loss and confusion matrix values of the different CNN models for “Skin Cancer ISIC” dataset experiments
Table 13 The performance metrics required to be maximized for the different CNN models for “Skin Cancer ISIC” dataset experiments
Table 14 The performance metrics required to be minimized for the different CNN models for “Skin Cancer ISIC” dataset experiments
Fig. 7
figure 7

Confusion matrix results concerning the “Melanoma Skin Cancer Dataset of 10000 Images” dataset

Fig. 8
figure 8

Confusion matrix results concerning the “Skin Cancer ISIC” dataset

Fig. 9
figure 9

The GradCam for five classes from the “Skin Cancer ISIC” dataset using the MobileNet pretrained model

4.3 Time complexity

For the CNN models, each model takes about 1 min. As there are 14 optimized hyperparameters using HHO, the total number of iterations is 10, the population size is 10, and the number of epochs is 5, then there are 10 × 10 × 5 = 500 runs for each model. The approximate time is 1 × 500 = 500 min per model. As there are 5 CNN models and 2 datasets, the total approximate time can be 500 × 5 × 2 = 5000 min, approximately 85 h (i.e., 4 days).

4.4 Related studies comparisons

Table 15 shows a comparison between the suggested approach and related studies concerning the used datasets. According to the results presented in this table, the suggested approach could achieve an accuracy of 97.08% using DenseNet169. This accuracy is among the best achieved accuracies by other approaches. Therefore, the proposed approach can is applicable in medical systems.

Table 15 Comparison between the suggested approach and related studies

5 Conclusions, limitations and future work

The automatic detection, classification, and segmentation of skin lesions is an open-ended research area that still requires more searching. This study proposes a methodological approach for categorizing several images of skin lesions into their corresponding classes using the convolution Neural Networks. The DL-based system consists of two phases, namely segmentation phase and classification phase. In the segmentation phase, manual and automatic segmentation approaches are utilized. Manual segmentation is used when the dataset has no masks to use. Additionally, the U-Net model is used to perform the automatic segmentation to build an adaptive segmentation model. For the segmentation, the best-reported scores are 0.15908, 91.95%, 0.08864, 0.04313, 0.02072, 0.20767 in terms of loss, accuracy, Mean Absolute Error, Mean Squared Error, Mean Squared Logarithmic Error, and Root Mean Squared Error, respectively. The metaheuristic HHO optimizer is utilized to perform the optimization of the hyperparameters of the CNN models to fulfill the phase of the learning, classification, and optimization. Here, 5 pre-trained CNN models are used, namely, VGG16, VGG19, DenseNet169, DenseNet201, and MobileNet. The used datasets are collected from two different publicly available sources. After completing the collection process, they are “Melanoma Skin Cancer Dataset of 10000 Images” and “Skin Cancer ISIC” datasets. For the “Melanoma Skin Cancer Dataset of 10000 Images” dataset, the best reported overall accuracy is 97.08% by the DenseNet169 pre-trained model. The floored average TP, TN, FP, and FN were 4.870, 5.424, 74, and 236, respectively. For the “Skin Cancer ISIC” dataset, the best reported overall accuracy is 96.06% by the MobileNet pre-trained model. The floored average TP, TN, FP, and FN were 1.783, 17.236, 364, and 417, respectively. Then, the results were compared with 9 of the mentioned related literature. This showed that this study had outperformed numerous prior works.

The current work shows the possibility of using deep learning models for skin lesion classification and segmentation, the suggested approach suffers from some limitations. The major one is instantaneity, as the classifier training phase is the most time-consuming stage. The causes of this limitation are the high-dimensional features and slow convergence of the boosting algorithm. Additionally, only 5 CNN models and the U-Net are utilized. There was a big challenge to get a suitable accuracy, Due to the lack of datasets in this area. In future studies, the plan is to (1) try other DL- or ML-based techniques, (2) evaluate the system with additional datasets, and (3) work on enhancing the complexity of the segmentation phase.