1 Introduction

Skin diseases are among the most common health problems. Some, such as malignant melanoma, can be fatal, spread easily to other parts of the body, and require early detection [1]. In addition, monkeypox, which also presents with skin lesions, was declared a global public health emergency by the WHO in 2022 [2]. Monkeypox is infectious and can be fatal; therefore, it requires early diagnosis and rapid patient isolation.

Skin lesions have different characteristics, but in the early stage of a disease it can be hard to distinguish them visually. A special device called a dermoscope, which is placed on the skin surface, provides more detailed information by showing the skin surface at a magnified view [3]. In addition, deep layers of the skin can be observed using high-resolution ultrasound imaging [4]. However, this process relies mainly on qualitative, subjective, and time-consuming observations. Thus, there is a critical need for automated, objective, and non-invasive tools for this task.

Developing Computer-aided Diagnosis or Detection (CAD) systems to analyze skin lesions is an emerging research area. Such systems can alleviate the burden and cost of skin cancer screening, and these user-friendly tools can assist dermatologists by reducing the challenges encountered in manual inspection [5]. The analyses may be based on dermoscopic images, clinical images, and patients’ metadata. However, analyzing skin diseases is challenging because of disease-irrelevant image content such as dense hair and other noise, as well as low contrast and border irregularity.

In recent years, deep learning-based approaches have made effective and powerful image analytics possible [6,7,8,9]. These approaches have been widely adopted for medical image analysis applications, including disease diagnosis from medical images. The literature contains binary human monkeypox classification studies and multi-class skin lesion classification studies. One of the goals of the proposed study is to handle the monkeypox and skin disease classification problems with a single system that aims to outperform the results reported in the literature.

Developers usually need a large amount of data to obtain a robust computer vision system, which can then assist the doctor in decision-making. Unfortunately, obtaining large amounts of image data can be hard. To cope with this challenge, a transfer learning-based approach was developed in this study, and a high-performance system was obtained using a small amount of data. To better show the benefits of transfer learning with a small amount of data, the training-from-scratch approach was compared with the transfer learning results.

This study focuses on the classification of skin lesion images. Transformer networks, a popular class of deep learning models, and convolutional neural networks (CNNs) have been used to realize the system. CNNs have useful inductive biases, such as translation equivariance and locality, which allow them to generalize even with small amounts of data. Transformers lack these inductive biases and therefore do not generalize well when trained on insufficient amounts of data. To cope with this, a model pre-trained on larger datasets can be used, as presented in [10]. In the proposed study, a Vision Transformer pre-trained on the ImageNet-21k dataset is fine-tuned for the skin lesion classification task. In addition, several pre-trained CNNs have been trained and tested for this task, and the results have been compared. Then, the models with the best accuracy scores have been combined to produce the final result. The proposed system competes with and even outperforms state-of-the-art algorithms evaluated on the same dataset. The contributions of the study are as follows:

1) A robust vision transformer-based skin lesion analysis system is developed using a small amount of image data.

2) The strengths of the vision transformer model are combined with the strengths of a CNN in an ensemble model. This approach improved the performance.

3) The developed system can help dermatologists recognize skin lesions quickly and cheaply. It can be beneficial for scientists and/or dermatologists in decision-making when dealing with large numbers of patients in a short time.

2 Related works

Earlier studies detected skin diseases by focusing on color, shape, and texture features. Hameed et al. [8] extracted co-occurrence matrix features and classified them with a support vector machine (SVM). Xie et al. [11] extracted color, texture, and border features of skin lesions, reduced the dimension using Principal Component Analysis (PCA), and classified the features using an ensemble of neural networks. Murugan et al. [12] extracted features using Gaussian filters and then classified them using an SVM.

In recent years, deep learning methods have enabled more powerful image analysis, which has also increased the popularity of studies on image-based disease analysis. Detection of skin diseases from images has become an important research and application field, and various studies have been performed using different datasets. In [13], the authors explored various algorithms, including both machine learning and deep learning techniques, and evaluated them on the ISIC dataset. In [14], the authors developed a CNN-based model for malignant melanoma identification and then built a web application around it. In [15], the authors developed a transfer learning-based model by modifying a pre-trained Xception model. They evaluated their model on the HAM10000 dataset and applied data augmentation techniques to overcome the class imbalance in the dataset. They obtained 96.40% accuracy for skin disease classification.

In [16], the authors applied a major modification to the U-Net architecture to improve its performance in skin lesion segmentation. Their method performed better than the basic U-Net, FCN, SegNet, and U-Net++, and matched the performance of state-of-the-art segmentation techniques. In [17], the authors segmented skin cancer using transfer learning and fine-tuning. In [18], the author proposed a CNN-based ensemble framework for early melanoma skin cancer detection, in which multiple CNNs perform the same task based on a transfer learning approach. In [19], the authors added layers such as global average pooling, DropBlock, and batch normalization to the base models to classify unbalanced skin lesion data. Their approach improved the performance of the model compared to fine-tuning with dropout. In [20] and [15], the authors distinguished skin lesions using CNN-based approaches. Srinivasu et al. [21] used the HAM10000 skin lesion dataset to classify skin lesions with MobileNetV2 and Long Short-Term Memory. In [22] and [23], the authors focused on two cancer types and separated these lesions from the PAD-UFES-20 dataset, which originally contains six lesion types. They then classified these lesions using binary image classification. In [24], the authors proposed a novel domain generalization algorithm called environment-aware prompt vision transformer (EPVT) for robust skin lesion recognition. Their method addresses the co-artifacts problem using a domain mix-up strategy and cross-domain learning problems using a domain prompt generator.

The number of studies trying to detect monkeypox lesions from skin images has recently risen due to the increase in human monkeypox disease. In [25, 26], the authors formed a new human monkeypox image dataset. Then they classified the images using deep learning approaches.

Also, in recent literature, some monkeypox classification studies have used the Vision transformer model. Aloraini [27] applied a monkeypox vs. not monkeypox binary classification using a fine-tuned Vision Transformer model. Kundu et al. [28] applied a monkeypox vs. not monkeypox binary classification with Vision Transformer and some other machine learning approaches. Then, the results were compared. Ahsan et al. [29] applied the binary classification with Vision Transformer and three other modified transfer learning models such as VGG and ResNet.

In contrast to previous monkeypox classification studies:

  • This study does not focus on binary monkeypox classification. It focuses on the classification of skin lesions, including human monkeypox.

  • To better show the benefits of the transfer learning-based Vision Transformer approach, the same model has also been trained with randomly initialized weights. In this way, the training-from-scratch approach has been compared with the transfer learning approach.

  • To obtain a robust model, the final decision is made using an ensemble of a deep vision transformer model and a deep CNN model.

In our previous study [30], we combined two datasets to gather monkeypox and other skin lesions and compared different pre-trained CNNs on this dataset. ResNet-18, which showed the best performance, was then converted to a TensorFlow Lite version and used in a mobile application. Unlike the previous study, the proposed study uses the popular vision transformer model and applies an ensemble approach. The proposed study also shows better performance than the previous study.

3 Methodology

In this study, a combined dataset was created from two different skin lesion datasets. Then, using the transfer learning approach, several pre-trained CNN models and the Vision Transformer model were trained and tested for the skin lesion classification task. All of the results were compared, and the vision transformer and DenseNet201 models, which showed the best accuracy scores, were ensembled to make the final decision. The system pipeline is illustrated in Fig. 1.

Fig. 1 General system pipeline

3.1 Vision transformer model for skin lesion classification task

3.1.1 Image preprocessing

Some preprocessing is required to prepare the images for the vision transformer model. Each skin lesion image is divided into a sequence of fixed-size 16 × 16 patches, which are then linearly embedded. A special classification ([CLS]) token is added at the beginning of the sequence so that the image can be classified. After that, absolute position embeddings are added, and the resulting sequence is used as input to the Transformer encoder. This structure is shown in Fig. 2.
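The patch-splitting step can be illustrated with a minimal pure-Python sketch. The 224 × 224 RGB input size, the function name `split_into_patches`, and the nested-list image representation are assumptions made for illustration only; the linear embedding and position embeddings are not shown.

```python
def split_into_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are returned in row-major order, each as a flat list of
    patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# Dummy 224 x 224 RGB image (the input size is an assumption, not from the text).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image)

num_patches = len(patches)          # (224 / 16) ** 2 patches
patch_dim = len(patches[0])         # 16 * 16 * 3 values per patch
sequence_length = num_patches + 1   # one extra position for the [CLS] token
```

Under these assumptions the encoder would receive a sequence of 197 tokens: 196 patch embeddings plus the classification token.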

Fig. 2 Skin lesion classification using vision transformer

3.1.2 Model development

Transformer networks [31] are state-of-the-art approaches that are especially popular in Natural Language Processing (NLP) tasks [32, 33]. These models use “attention” mechanisms to capture long-range dependencies between words in a sentence. Transformer networks are also well suited to parallelization, which facilitates training on large datasets. Their success in NLP has inspired their use for tasks such as object detection [34] and panoptic segmentation [35].

Owing to the computational efficiency and scalability of Transformers, it has become possible to train large models with over 100B parameters [36, 37].

Recently, the vision transformer [10] architecture has been introduced, in which input images are processed as sequences of tokens for image classification. In the vision transformer, the input image is divided into patches, and the sequence of linear embeddings of these patches is used as input. Thus, image patches are treated like tokens in an NLP application.

For classification, an extra learnable “classification token” is added to the sequence. The model then analyzes the relationships between the patch tokens using the attention mechanism, which is used to model dependencies between regions of the image.

The Transformer encoder consists of alternating multi-headed self-attention layers and multi-layer perceptron blocks. Self-attention [31] is a popular neural building block. For each element of an input sequence, it computes a weighted sum over the values of all elements in the sequence, with weights derived from the pairwise similarity between elements.

Multi-headed self-attention (MSA) is an extension of self-attention. In MSA, several self-attention operations, called “heads”, run in parallel, and their outputs are concatenated [10].
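As a rough illustration of the weighted sum described above, the following sketch computes single-head scaled dot-product self-attention in plain Python. For brevity it uses the token vectors directly as queries, keys, and values, which is an assumption; in an actual Transformer these are obtained through learned linear projections, and several such heads run in parallel.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of all value vectors in the sequence.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Three toy 2-D tokens (hypothetical values standing in for patch embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)

# A zero query is equally similar to every key, so it averages the values.
uniform = self_attention([[0.0, 0.0]], tokens, tokens)
```

Each output row is a convex combination of the value vectors, so a token can draw information from any other position in the sequence regardless of distance.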

In this study, fine-tuning is applied to a Vision Transformer pre-trained on the ImageNet-21k dataset, which contains about 14 million labeled images in roughly 21,000 classes. This structure is modified for the skin lesion classification task. To do that, the pre-trained prediction head is removed and a zero-initialized \(D \times N\) feedforward layer is attached, where N is the number of downstream classes and D is the hidden size, which is kept fixed throughout the layers [10]. These operations are also illustrated in Fig. 2.

The modified model therefore consists of a linear layer on top of a pre-trained vision transformer model. The linear layer applies a linear transformation to its input, converting the transformer’s output into a vector of the desired size for tasks such as classification or detection.
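The effect of attaching a zero-initialized head can be sketched as follows. The hidden size D = 768, the zero bias, and the placeholder feature vector are assumptions for illustration; N = 7 matches the number of classes in the combined dataset.

```python
import math


def linear_head(features, weights, bias):
    """Apply a D x N linear layer to a D-dimensional feature vector."""
    num_classes = len(bias)
    return [
        sum(f * weights[d][n] for d, f in enumerate(features)) + bias[n]
        for n in range(num_classes)
    ]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden_size = 768   # D: assumed hidden size, not stated in the text
num_classes = 7     # N: classes in the combined dataset

# Zero-initialized D x N head; the zero bias is also an assumption.
weights = [[0.0] * num_classes for _ in range(hidden_size)]
bias = [0.0] * num_classes

# Placeholder for the encoder output at the classification token.
cls_features = [0.5] * hidden_size

logits = linear_head(cls_features, weights, bias)
probabilities = softmax(logits)   # uniform before any fine-tuning step
```

Before fine-tuning, the new head assigns equal probability to every class, so it starts from a neutral prediction and the pre-trained features are not disturbed by large random outputs at the first updates.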

3.2 Deep transfer learning with CNN for skin lesion classification

The CNN, one of the deep learning approaches, produces high-performance results in image analysis and pattern recognition problems [38,39,40,41,42]. Two criteria are important for a CNN-based system to perform well: appropriate and adequate use of data, and proper model development.

A CNN is formed from various combinations of layers such as convolution, pooling, activation, dropout, and fully connected layers. The last layer can perform classification, regression, etc., depending on the problem. In the convolution layer, input images are convolved with kernels to create feature maps. This process is given in (1) [43].

$$\begin{aligned} S(i,j) = (I*K)(i,j) = \sum _{m} \sum _{n} I(i+m,j+n)K(m,n) \end{aligned}$$
(1)

where I is the 2-D input image, K is the 2-D kernel, S is the 2-D output, (i,j) are the indexes of the output element, and (m,n) are the indexes of the kernel elements. The pooling layer reduces the size of the feature maps; thus, it speeds up the operations and helps prevent over-fitting. Dropout randomly ignores some of the neurons and their connections during training, which also helps prevent over-fitting. ReLU is usually used as the activation function in CNNs. ReLU eliminates negative values, as shown in (2) [43].

$$\begin{aligned} f(x)=\max (0,x) \end{aligned}$$
(2)

where x is the input.
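Equations (1) and (2) can be checked on a small example. The sketch below evaluates (1) exactly as written, that is, without flipping the kernel, and only at positions where the kernel fits entirely inside the image (no padding, stride 1); these boundary choices and the toy values are assumptions for illustration.

```python
def convolve_valid(image, kernel):
    """Eq. (1): S(i,j) = sum_m sum_n I(i+m, j+n) * K(m,n), no padding."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kernel_h)
                for n in range(kernel_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Eq. (2): f(x) = max(0, x), applied element-wise."""
    return [[max(0, value) for value in row] for row in feature_map]


# Toy 3 x 3 image and 2 x 2 kernel (hypothetical values).
image = [
    [3, 1, 0],
    [0, 1, 2],
    [4, 0, 5],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = convolve_valid(image, kernel)   # [[2, -1], [0, -4]]
activated = relu(feature_map)                 # [[2, 0], [0, 0]]
```

For instance, the top-left output is 3·1 + 1·0 + 0·0 + 1·(−1) = 2, and ReLU then sets the two negative responses to zero.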

There are two ways to develop a suitable model. The first is to train the model from scratch. The second is to reuse a network previously developed for a different task, with minor modifications; this approach is called transfer learning. If only a small amount of data is available for the problem, transfer learning provides a great advantage, because the pre-trained network has already learned basic image features such as edges, colors, and shapes, which are important for solving an image analysis problem. To learn the features specific to the problem, additional training is performed starting from the learned filters.

In this study, various pre-trained CNN models have been trained and tested for comparison purposes. These models and their features are listed in Table 1. These networks were pre-trained on over a million images from the ImageNet [44] database.

Table 1 Pre-trained networks’ details

3.3 Ensemble learning using bagging strategy

In this study, a pre-trained deep CNN model and a vision transformer model are combined to obtain a final decision using ensemble learning. By combining the strengths of two different models, a more robust system is obtained.

Ensemble learning offers different strategies. The proposed system uses bagging, which is short for bootstrap aggregating. In this strategy, new sub-datasets are created by taking random samples, with replacement, from the original dataset. A model is trained on each subset, and the models' predictions are combined to create the final prediction [45]. According to [45], bagging reduces overfitting and yields more stable models.
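The two ingredients of bagging, bootstrap sampling and aggregation, can be sketched as below. The text does not state how the predictions are aggregated, so the majority vote used here is an assumption, as are the function names, the toy dataset, and the example votes.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items from the dataset with replacement."""
    return [rng.choice(dataset) for _ in dataset]


def majority_vote(predictions):
    """Aggregate class labels by returning the most frequent one."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)             # fixed seed for reproducibility
dataset = list(range(10))          # stand-in for training sample indexes

# One bootstrap subset per ensemble member (three members assumed here).
subsets = [bootstrap_sample(dataset, rng) for _ in range(3)]

# Hypothetical labels predicted for one test image by the three members.
votes = ["MPX", "ACK", "MPX"]
final_label = majority_vote(votes)
```

Because sampling is done with replacement, each subset has the same size as the original dataset but typically repeats some samples and omits others, which is what gives the ensemble members their diversity.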

4 Experimental results

4.1 Dataset and data preparation

Various public datasets containing different types of skin diseases are available in the literature. One of the motivations of this study is to detect skin lesions, including skin cancer types and monkeypox, which has increased in recent years, and to contribute to patients’ rapid isolation when necessary. For this purpose, experiments were carried out by combining two datasets: PAD-UFES-20 [46] and MSLD [47]. The preparation details of MSLD are explained in [26], and those of PAD-UFES-20 are explained in [48].

The PAD-UFES-20 dataset contains six different skin lesion classes, which are Actinic Keratosis (ACK), Basal Cell Carcinoma (BCC), Melanoma (MEL), Nevus (NEV), Squamous Cell Carcinoma (SCC), and Seborrheic Keratosis (SEK).

MSLD includes two different labels: Monkeypox (MPX) and “Non-monkeypox”. However, the “Non-monkeypox” class includes skin lesion images that belong to two different diseases without their actual labels. Thus, this “others” class has been excluded from the combined dataset. As a result, the combined dataset includes seven different classes. Figure 3 shows some sample images from the combined dataset.

Fig. 3 Some samples from the combined dataset

The combined dataset includes 2298 images from PAD-UFES-20 and 102 images from MSLD. These images have been split into a training set (80%) and a testing set (20%). Data augmentation techniques such as rotation, reflection, translation, and scaling have been applied to the training set to balance the classes. Figure 4 shows the number of samples in the combined dataset before and after data augmentation.

Fig. 4 Distribution of the number of samples in the data set before and after data augmentation
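The 80/20 split described above can be sketched as a simple shuffled split over image indexes. Whether the original split was stratified by class is not stated, so the plain random split, the seed, and the function name below are assumptions.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 2298 PAD-UFES-20 images plus 102 MSLD images, represented by indexes.
image_ids = list(range(2298 + 102))
train_ids, test_ids = train_test_split(image_ids)

num_train = len(train_ids)   # 80% of 2400 images
num_test = len(test_ids)     # 20% of 2400 images
```

With 2400 images in total, this gives 1920 training images, to which augmentation would then be applied, and 480 testing images.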

4.2 Working environment and evaluation metrics

Training and testing were carried out in Colab using an A100 GPU with 15GB capacity. The parameters used in the training step are as follows: the learning rate is 2e-5, the batch size is 10, the weight decay is 0.01, and the evaluation strategy is epoch.

In classification systems, performance is evaluated using different criteria indices. These indices allow users and scientists to evaluate the performance of a classification system further and to understand different aspects of it. Each index has advantages and limitations; therefore, using them together provides a more comprehensive evaluation. In this study, the results have been evaluated using five criteria indices: Accuracy, Recall, Fscore, Precision, and Jaccard. Although accuracy is often used to evaluate overall performance, it may not be sufficient on its own when there are class imbalances [49]. In this case, the Jaccard metric can provide a more accurate measure of performance [50]. Precision evaluates the performance of the model when false positives are significant. Recall measures the performance of the model when false negatives are significant. Fscore evaluates the precision and recall metrics together [51]. The mathematical equations of these metrics are given in (3), (4), (5), (6) and (7). In the equations, TP is true positive, FP is false positive, TN is true negative, and FN is false negative.

$$\begin{aligned} Accuracy = (TP+TN)/(TP+TN+FP+FN) \end{aligned}$$
(3)
$$\begin{aligned} Recall = TP/(TP+FN) \end{aligned}$$
(4)
$$\begin{aligned} Fscore = 2*TP/(2*TP+FN+FP) \end{aligned}$$
(5)
$$\begin{aligned} Precision = TP/(TP+FP) \end{aligned}$$
(6)
$$\begin{aligned} Jaccard = TP/(TP+FP+FN) \end{aligned}$$
(7)
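Equations (3) to (7) translate directly into code. The sketch below evaluates them for a single set of confusion counts; the counts are hypothetical and chosen only to make the arithmetic easy to follow, and how the per-class values are averaged in the multi-class setting is not covered here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Evaluate Eqs. (3)-(7) from true/false positive/negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Eq. (3)
        "recall": tp / (tp + fn),                      # Eq. (4)
        "fscore": 2 * tp / (2 * tp + fn + fp),         # Eq. (5)
        "precision": tp / (tp + fp),                   # Eq. (6)
        "jaccard": tp / (tp + fp + fn),                # Eq. (7)
    }


# Hypothetical counts for one class (not taken from the experiments).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

For these counts, Accuracy = 85/100 = 0.85, Recall = 40/45 ≈ 0.889, Fscore = 80/95 ≈ 0.842, Precision = 40/50 = 0.80, and Jaccard = 40/55 ≈ 0.727. Jaccard ignores true negatives, which is why it is less flattered by a large majority of easy negatives than accuracy is.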

4.3 Comparative results

In this study, several pre-trained models have been trained and tested on the same combined dataset. Table 2 reports the classification results. As seen in the table, the vision transformer model outperforms the other models. The Vision Transformer model has also been trained from scratch to better show the benefits of transfer learning with a small amount of data. When trained from scratch, the model reached an accuracy of 60.49%, compared with 80.66% with transfer learning.

Table 2 Comparison of the vision transformer model with the pre-trained networks

A deep ensemble learning technique has also been applied in this study. As seen in Table 2, the best accuracy scores have been obtained using the DenseNet201 and vision transformer models. Thus, these two models have been combined using the bagging ensemble technique. This approach improved the accuracy to 81.91%.

Figure 5 shows the confusion matrix of the ensemble model. Although there are some mistakes, the proposed system generally produces promising results. The worst performance was obtained for the SCC lesion. These images are low-quality images taken with a mobile phone camera, so system performance may improve with training and testing on higher-quality images.

Fig. 5 Confusion matrix of the ensemble model

Fig. 6 Example results showing the contribution of the ensemble model

Figure 6 shows two samples from the test results. Figure 6(a) is an ACK-type lesion. It is predicted as MPX by DenseNet201 and as ACK by the vision transformer model. The vision transformer model usually performs better when the lesion covers a large portion of the image, because such images have long-range features that vision transformer models can capture. Owing to the vision transformer model, the ensemble model predicted this sample correctly. Figure 6(b) is an MPX-type lesion. DenseNet201 predicted this image correctly, but the vision transformer did not. As seen in the figure, the lesion occupies a small part of the image. DenseNet201 is more successful in classifying such images, where the relevant information is local and concentrated in a specific image region. For this image, the final ensemble model made the correct prediction owing to DenseNet201. As these samples show, the different strengths of DenseNet201 and the vision transformer model allow the final ensemble model to perform well.

The proposed Vision Transformer and CNN-based ensemble model for the skin lesion classification task has also been compared with other studies that used the same PAD-UFES-20 dataset. Table 3 reports the classification results. Although the proposed ensemble-based system classifies seven lesion types, it outperforms other studies, which classify six lesion types, in terms of accuracy, precision, and Fscore. The study in [30] also uses the 7-class PAD-UFES+Monkeypox data, and the proposed system outperforms it in terms of all indices.

Table 3 Comparison of the vision transformer model with the other studies

To provide a fairer comparison with the literature, the Vision Transformer-based system has also been trained and tested using only the 6-class PAD-UFES-20 data. The testing results of this 6-class system are also presented in the comparison table. This system produced better results than [52,53,54,55] in terms of Accuracy, Precision, and Fscore. In terms of Recall, it ranks second behind [52] and is better than the other studies. Skin lesion images are complex, and a strong feature extractor is needed to distinguish them. In this study, the vision transformer model outperforms the other models. This can be explained by the attention mechanism, which can focus on important features. Moreover, with ensemble learning, the system combines the strengths of the DenseNet201 and vision transformer models. Thus, it outperforms the other models in terms of many criteria indices.

5 Conclusion

This paper proposed an automated system that classifies monkeypox and other skin lesion images. For that purpose, two datasets, PAD-UFES-20 and MSLD, have been combined. The combined dataset includes seven types of skin lesions. To classify them, a state-of-the-art vision transformer model has been used. Vision transformer models usually perform well with large amounts of data, but the combined dataset is small; to cope with this, fine-tuning has been applied. The vision transformer model, which was initially trained on ImageNet-21k, a dataset of 14 million labeled images, has been used for the skin lesion classification task. In addition, to obtain a fair comparison, the same system has been trained and tested using only PAD-UFES-20 dataset images. In this experiment, the system outperformed the literature in terms of Accuracy, Precision, and Fscore, and produced comparable results in terms of Recall. Furthermore, some popular pre-trained networks have been trained for the skin lesion classification task using the transfer learning approach, and their test results have been compared. Among these networks, the best accuracy has been obtained using DenseNet201. Finally, an ensemble model has been created using the pre-trained DenseNet201 model and the vision transformer model. All the results have been reported in tables in the “Experimental Results” section. The ensemble model produced 81.91% accuracy, 65.94% Jaccard, 87.16% Precision, 74.12% Recall and 78.16% Fscore scores for 7-class classification. According to these values, the system competes with state-of-the-art models and also outperforms some of them.

The proposed system can assist researchers and doctors while diagnosing skin lesions. It can be helpful in situations that require rapid diagnosis and urgent patient isolation, such as monkeypox.