Vision transformer and CNN-based skin lesion analysis: classification of monkeypox

Yolcu Oztel, Gozde

doi:10.1007/s11042-024-19757-w


Open access
Published: 09 July 2024

Volume 83, pages 71909–71923, (2024)
Multimedia Tools and Applications

Gozde Yolcu Oztel¹


Abstract

Monkeypox is an important health problem. Rapid diagnosis of monkeypox skin lesions, followed by emergency isolation when necessary, is essential. Some other skin lesions, such as melanoma, can be fatal and must also be identified rapidly. However, in some cases it is difficult to distinguish lesions visually. Methods such as dermoscopy and high-resolution ultrasound imaging can be used for better observation, but they often rely on qualitative analysis and are subjective and time-consuming. Therefore, in this study, a quantitative and objective classification tool has been developed to assist dermatologists and scientists. The proposed system classifies seven skin lesions, including monkeypox. The Vision Transformer and several popular deep convolutional networks have been trained with the transfer learning approach, and all results have been compared. Then, the models with the best accuracy scores have been combined using bagging-based ensemble learning to make the final prediction. The proposed ensemble-based system produced 81.91% Accuracy, 65.94% Jaccard, 87.16% Precision, 74.12% Recall, and 78.16% Fscore. On several of these metrics, the system produced results competitive with or better than those reported in the literature.


1 Introduction

Skin diseases are among the most common health problems. Some skin diseases, such as malignant melanoma, can be fatal, can easily spread to other parts of the body, and require early detection [1]. In addition, monkeypox, which also presents with skin lesions, was declared a global public health emergency by the WHO in 2022 [2]. Monkeypox is infectious and can be fatal; therefore, it requires early diagnosis and rapid patient isolation.

Skin lesions have different characteristics, but in the early stage of a disease it can be hard to tell them apart visually. Dermoscopy, which uses a special device placed on the skin surface, provides more detailed information by showing the skin surface at a magnified view [3]. High-resolution ultrasound imaging additionally allows the deep layers of the skin to be observed [4]. However, these examinations rely mainly on qualitative, subjective, and time-consuming observation. Thus, there is a critical need for automated, objective, and non-invasive tools for this task.

Developing Computer-aided Diagnosis or Detection (CAD) systems to analyze the skin lesions is an emerging research area. Owing to these systems, the burden and cost of skin cancer screening are alleviated. In order to reduce the challenges encountered in manual inspection, such user-friendly tools can assist dermatologists [5]. These analyses may be based on examining dermoscopic images, clinical images, and patients’ meta-data. On the other hand, analyzing skin diseases is challenging due to the disease-irrelevant image contents such as dense hairs and other noise, low contrast and border irregularity.

In recent years, deep learning-based approaches have made effective and powerful image analytics possible [6,7,8,9]. These approaches have been widely adopted for medical image analysis applications, including disease diagnosis from medical images. The literature contains some binary human monkeypox classification studies and some multi-class skin lesion classification studies. One of the goals of the proposed study is to handle monkeypox and skin disease classification with a single system that aims to outperform the results reported in the literature.

On the other hand, developers usually need a large amount of data to obtain a robust computer vision system. Unfortunately, obtaining large amounts of image data can be hard. To cope with that challenge, a transfer learning-based approach was developed in this study, and a high-performance system that can assist doctors in decision-making was obtained using a small amount of data. To better show the benefits of transfer learning with a small amount of data, the training-from-scratch approach was compared with the transfer learning results.

This study focuses on the classification of skin lesion images. Transformer networks, a popular class of deep learning models, and convolutional neural networks (CNNs) have been used to realize the system. CNNs have useful inductive biases, such as translation equivariance and locality, so they can generalize even from small datasets. Transformers lack these inductive biases and therefore do not generalize well when trained on insufficient amounts of data. To cope with this, a model pre-trained on a larger dataset can be used, as presented in [10]. In the proposed study, a Vision Transformer pre-trained on the ImageNet-21k dataset is fine-tuned for the skin lesion classification task. Several pre-trained CNNs have also been trained and tested for this task, and the results have been compared. Then, the models with the best accuracy scores have been combined to produce the final result. The proposed system competes with, and on several metrics outperforms, state-of-the-art algorithms evaluated on the same dataset. The contributions of the study are given below:

1) A robust vision transformer-based skin lesion analysis system is developed using a small amount of image data.

2) The strengths of the vision transformer model are combined with the strengths of a CNN using an ensemble model. This approach improved the performance.

3) The developed system can help dermatologists recognize skin lesions quickly and cheaply. It can be beneficial for scientists and/or dermatologists in decision-making when dealing with large numbers of patients in a short time.

2 Related works

In earlier years, the detection of skin diseases was done by focusing on color, shape and texture features. Hameed et al. [8] extracted co-occurrence matrix features, then they classified these features with a support vector machine (SVM). Xie et al. [11] extracted skin lesions’ color, texture, and border features, reduced the dimension using Principal Component Analysis (PCA) and classified them using an ensemble of neural networks. Murugan et al. [12] extracted features using Gaussian filters and then classified them using SVM.

In recent years, deep learning methods have enabled more powerful image analysis, which has also increased the popularity of studies on image-based disease analysis. Detection of skin diseases from images has become an important research and application field, and various studies have been performed using different datasets. In [13], the authors explored various algorithms, including machine learning and deep learning techniques, and used the ISIC dataset for evaluation. In [14], the authors developed a CNN-based model for malignant melanoma identification and then built a web application for this model. In [15], the authors developed a transfer learning-based model by modifying a pre-trained Xception model. They evaluated their model on the HAM10000 dataset and applied data augmentation techniques to overcome the class imbalance in the dataset. They obtained 96.40% accuracy for skin disease classification.

In [16], the authors applied a major modification to the U-Net architecture to improve its performance in skin lesion segmentation. Their method performed better than the basic U-Net, FCN, SegNet, and U-Net++, and matched the performance of state-of-the-art segmentation techniques. In [17], the authors segmented skin cancer using transfer learning and fine-tuning. In [18], the author proposed a CNN-based ensemble framework for early melanoma skin cancer detection, in which multiple CNNs perform the same task based on the transfer learning approach. In [19], the authors added layers such as global average pooling, DropBlock, and batch normalization to the base models to classify unbalanced skin lesion data. Their approach improved model performance compared to fine-tuning with dropout. In [20] and [15], the authors distinguished skin lesions using CNN-based approaches. Srinivasu et al. [21] used the HAM10000 skin lesion dataset to classify skin lesions with MobileNetV2 and Long Short-Term Memory. In [22] and [23], the authors focused on two cancer types and separated these lesions from the PAD-UFES-20 dataset, which originally contains six lesion classes. Then, they classified these lesions using binary image classification. In [24], the authors proposed a novel domain generalization algorithm called environment-aware prompt vision transformer (EPVT) for robust skin lesion recognition. Their method addresses the co-artifacts problem using a domain mix-up strategy and the cross-domain learning problem using a domain prompt generator.

The number of studies trying to detect monkeypox lesions from skin images has recently risen due to the increase in human monkeypox disease. In [25, 26], the authors formed a new human monkeypox image dataset. Then they classified the images using deep learning approaches.

Also, in recent literature, some monkeypox classification studies have used the Vision transformer model. Aloraini [27] applied a monkeypox vs. not monkeypox binary classification using a fine-tuned Vision Transformer model. Kundu et al. [28] applied a monkeypox vs. not monkeypox binary classification with Vision Transformer and some other machine learning approaches. Then, the results were compared. Ahsan et al. [29] applied the binary classification with Vision Transformer and three other modified transfer learning models such as VGG and ResNet.

Unlike previous monkeypox classification studies:

1) This study does not focus on binary monkeypox classification. It focuses on the classification of skin lesions, including human monkeypox.

2) To better show the benefits of the transfer learning-based Vision Transformer approach, the same model has also been trained with randomly initialized weights. In this way, the training-from-scratch approach has been compared with the transfer learning approach.

3) To obtain a robust model, the final decision is made using an ensemble of a deep vision transformer and a deep CNN model.

In our previous study [30], we combined two datasets to gather monkeypox and other skin lesions. Then, we compared different pre-trained CNNs using this dataset. Finally, ResNet-18, which showed the best performance, was converted to a TensorFlow Lite version and used in a mobile application. Unlike the previous study, the proposed study uses the popular vision transformer model and applies an ensemble approach. The proposed study also shows better performance than the previous study.

3 Methodology

In this study, a combined dataset was created from two different skin lesion datasets. Using the transfer learning approach, several pre-trained CNN models and the Vision Transformer model were then trained and tested for the skin lesion classification task. All of the results were compared, and the Vision Transformer and DenseNet201 models, which showed the best accuracy scores, were ensembled to make the final decision. The system pipeline is illustrated in Fig. 1.

3.1 Vision transformer model for skin lesion classification task

3.1.1 Image preprocessing

Some preprocessing is required to prepare the images for the vision transformer model. Each skin lesion image is divided into a sequence of fixed-size 16x16 patches, which are then linearly embedded. A learnable classification token is then added at the beginning of the sequence so that the image can be classified. After that, absolute position embeddings are added, and the resulting sequence is used as input to the Transformer encoder. This structure is shown in Fig. 2.
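The preprocessing steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the author's implementation: the 224x224 input size, the hidden size of 64, the random projection weights, and the function names `patchify` and `embed` are all assumptions introduced here.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, w_proj, cls_token, pos_embed, patch=16):
    """Linearly embed patches, prepend the class token, add position embeddings."""
    tokens = patchify(image, patch) @ w_proj             # (num_patches, hidden)
    tokens = np.concatenate([cls_token, tokens], axis=0) # (num_patches + 1, hidden)
    return tokens + pos_embed

rng = np.random.default_rng(0)
hidden = 64                                   # assumed hidden size, for illustration only
image = rng.random((224, 224, 3))             # assumed input resolution
num_patches = (224 // 16) ** 2                # 14 * 14 = 196 patches

w_proj = rng.normal(0.0, 0.02, (16 * 16 * 3, hidden))    # stand-in for learned weights
cls_token = np.zeros((1, hidden))                        # stand-in for the learnable token
pos_embed = rng.normal(0.0, 0.02, (num_patches + 1, hidden))

sequence = embed(image, w_proj, cls_token, pos_embed)
print(sequence.shape)                         # (197, 64)
```

In a trained model the projection, the classification token, and the position embeddings are learned parameters; random or zero values are used here only to show the shapes involved.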

3.1.2 Model development

Transformer networks [31] are state-of-the-art approaches that are especially popular in Natural Language Processing (NLP) tasks [32, 33]. These models use “attention” mechanisms to capture long-term dependencies between the words in a sentence. Transformer networks are also well suited to parallelization, which facilitates training on large datasets. Their success in NLP has inspired their use for tasks such as object detection [34] and panoptic segmentation [35].

Owing to the computational efficiency and scalability of Transformers, it has become possible to train large models with over 100B parameters [36, 37].

Recently, the vision transformer [10] architecture has been introduced, in which input images are processed as sequences of tokens for image classification. In the vision transformer, the input image is divided into patches, and the sequence of linear embeddings of these patches is used as input. Thus, image patches are treated like tokens in an NLP application.

For classification, an extra learnable “classification token” is added to the sequence. The model then analyzes the relationships among the patch tokens using the attention mechanism, which models dependencies between different regions of the image.

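The Transformer encoder consists of alternating multi-headed self-attention layers and multi-layer perceptron blocks. Self-attention [31] is a popular neural building block. For each element in an input sequence, it computes a weighted sum over the values of all elements in the sequence, where the weights are based on the similarity between the elements' query and key representations.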

Multi-headed self-attention (MSA) is an extension of self-attention. In MSA, several self-attention operations, called “heads”, run in parallel, and their concatenated outputs are projected to form the layer output [10].
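As a hedged illustration of the mechanism described above, the following NumPy sketch implements scaled dot-product self-attention with several heads. The weight matrices are random stand-ins for learned parameters, and the sizes (197 tokens, hidden size 64, 4 heads) and the function name `multi_head_self_attention` are assumptions chosen only for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(z, w_qkv, w_out, heads):
    """z: (n, d) token sequence; returns the (n, d) output and (heads, n, n) weights."""
    n, d = z.shape
    dh = d // heads                                    # per-head dimension
    q, k, v = np.split(z @ w_qkv, 3, axis=-1)          # each (n, d)

    def split_heads(t):
        return t.reshape(n, heads, dh).transpose(1, 0, 2)   # (heads, n, dh)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    # Attention weights: similarity of every query with every key, per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    out = attn @ v                                     # weighted sum of values
    out = out.transpose(1, 0, 2).reshape(n, d)         # concatenate the heads
    return out @ w_out, attn                           # project concatenated outputs

rng = np.random.default_rng(0)
n_tokens, hidden, heads = 197, 64, 4                   # illustrative sizes only
z = rng.normal(size=(n_tokens, hidden))
w_qkv = rng.normal(0.0, 0.02, (hidden, 3 * hidden))
w_out = rng.normal(0.0, 0.02, (hidden, hidden))

output, attention = multi_head_self_attention(z, w_qkv, w_out, heads)
print(output.shape, attention.shape)                   # (197, 64) (4, 197, 197)
```

Each row of `attention[h]` sums to one, so every output token is a convex combination of the value vectors of all tokens, which is what lets a single layer relate distant image patches.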

In this study, fine-tuning is applied to a Vision Transformer pre-trained on the ImageNet-21k dataset, which contains about 14 million labeled images in roughly 21,000 classes. This structure is modified for the skin lesion classification task: the pre-trained prediction head is removed and a zero-initialized $D \times N$ feedforward layer is attached, where $N$ is the number of downstream classes and $D$ is the hidden size, which is kept fixed throughout the layers [10]. These operations are also illustrated in Fig. 2.

The modified model includes a linear layer on top of a pre-trained vision transformer model. Owing to the linear layer, a linear transformation is applied to the input data. The linear layer is used for tasks such as classification or detection by converting the transformer’s outputs to a vector of the desired size.
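A minimal sketch of such a classification layer is given below. The hidden size of 768 is an assumption (the article does not state which Vision Transformer variant is used), the input features are random stand-ins for encoder outputs, and the function name `classify` is hypothetical. The number of classes follows the seven lesion classes of the combined dataset.

```python
import numpy as np

hidden_size = 768     # D: assumed value, not stated in the article
num_classes = 7       # N: seven lesion classes in the combined dataset

# Zero-initialized D x N layer attached on top of the pre-trained encoder.
head_w = np.zeros((hidden_size, num_classes))

def classify(cls_features, weights):
    """Map (batch, D) class-token features to (batch, N) class probabilities."""
    logits = cls_features @ weights
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Random stand-in for the encoder's class-token outputs on a batch of 4 images.
features = np.random.default_rng(0).normal(size=(4, hidden_size))
probs = classify(features, head_w)
print(probs[0])       # uniform: every class has probability 1/7 before fine-tuning
```

Because the layer starts at zero, the model initially assigns equal probability to every class, and fine-tuning moves the weights away from zero as the lesion labels are learned.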

3.2 Deep transfer learning with CNN for skin lesion classification

CNN, one of the deep learning approaches, produces high performance results in image analysis and pattern recognition problems [38,39,40,41,42]. Two important criteria are required for a CNN-based system to perform well. These are appropriate and adequate use of data and proper model development.

A CNN is formed from various combinations of layers such as convolution, pooling, activation, dropout, and fully connected layers. The last layer can be a classification or regression layer, depending on the problem. In the convolution layer, input images are convolved with kernels to create feature maps. This process is applied using (1) [43].

$$\begin{aligned} S(i,j) = (I*K)(i,j) = \sum _{m} \sum _{n} I(i+m,j+n)K(m,n) \end{aligned}$$

(1)

where I is the 2-D input image, K is the 2-D kernel, S is the 2-D output, (i,j) are the indices of the output position, and (m,n) are the indices over the kernel. The pooling layer reduces the size of the feature maps; thus, it speeds up computation and helps prevent over-fitting. Dropout randomly ignores some of the neurons and their connections during training, which also helps prevent over-fitting. ReLU is usually used as the activation function in CNNs. ReLU sets negative values to zero, as shown in (2) [43].

$$\begin{aligned} f(x)=\max (0,x) \end{aligned}$$

(2)

where x is input.
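Equations (1) and (2) can be checked on a small example. The sketch below is a direct, unoptimized NumPy implementation, assuming the sum in (1) is evaluated only at positions where the kernel fits entirely inside the image (no padding, stride 1). The function names and the toy image and kernel are illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Eq. (1): S(i, j) = sum_m sum_n I(i + m, j + n) K(m, n), without padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Eq. (2): f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

image = np.arange(16, dtype=float).reshape(4, 4)     # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                     # toy 2x2 kernel

response = conv2d_valid(image, kernel)   # every entry is I(i, j) - I(i+1, j+1) = -5
feature_map = relu(response)             # negative responses are set to zero
print(response)
print(feature_map)
```

With this kernel every response is negative, so ReLU zeroes the whole feature map; flipping the sign of the kernel gives a response of +5 everywhere, which ReLU leaves unchanged.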

There are two ways to develop a suitable model. The first is training the model from scratch. The second is to reuse a network previously developed for a different task, with minor modifications. This approach is called transfer learning. If only a small amount of data is available for the problem, transfer learning provides a great advantage, because the pre-trained network has already learned basic image features such as edges, colors, and shapes, which are important for solving an image analysis problem. To learn the features specific to the problem, additional training is performed, starting from the learned filters.

In this study, various pre-trained CNN models have been trained and tested for comparison purposes. These models and their features have been listed in Table 1. These networks were trained on over a million images using the ImageNet [44] database.

Table 1 Pre-trained networks’ details


3.3 Ensemble learning using bagging strategy

In this study, a pre-trained deep CNN model and a vision transformer model are combined using ensemble learning to obtain the final decision. By combining the strengths of two different models, a more robust system is obtained.

There are different strategies in ensemble learning. The bagging strategy, short for bootstrap aggregating, is used in the proposed system. In this strategy, new sub-datasets are created by taking random samples from the original dataset. A model is trained on each sub-dataset, and the models' predictions are combined to create the final prediction [45]. According to [45], bagging reduces overfitting and yields more stable models.
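The bagging idea can be sketched on toy data as follows. This is not the author's pipeline: the ensemble members in the article are DenseNet201 and a Vision Transformer, whereas the sketch uses a toy nearest-centroid learner, and averaging class probabilities is an assumption, since the article does not state how the predictions are combined. All function names are hypothetical.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sub-dataset)."""
    return rng.integers(0, n_samples, size=n_samples)

def fit_centroids(x, y, n_classes):
    """Toy base learner: one mean vector per class (stands in for a deep model)."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    score = -dist
    score = score - score.max(axis=1, keepdims=True)
    exp = np.exp(score)
    return exp / exp.sum(axis=1, keepdims=True)

def bagging_predict(x_train, y_train, x_test, n_classes, n_models=5, seed=0):
    rng = np.random.default_rng(seed)
    all_probs = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(x_train), rng)            # resample the training set
        model = fit_centroids(x_train[idx], y_train[idx], n_classes)
        all_probs.append(predict_proba(model, x_test))
    # Aggregate: average the members' probabilities, then pick the best class.
    return np.mean(all_probs, axis=0).argmax(axis=1)

# Toy data: two well-separated 2-D clusters with 20 samples each.
data_rng = np.random.default_rng(1)
x_train = np.concatenate([data_rng.normal(0.0, 0.5, (20, 2)),
                          data_rng.normal(5.0, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
x_test = np.array([[0.0, 0.0], [5.0, 5.0]])

predictions = bagging_predict(x_train, y_train, x_test, n_classes=2)
print(predictions)
```

The toy learner assumes every class appears in each bootstrap sample, which holds for this balanced example but would need guarding on small or highly imbalanced data.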

4 Experimental results

4.1 Dataset and data preparation

The literature offers various public datasets that contain different types of skin diseases. One of the motivations of this study is to detect skin lesions, including skin cancer types and monkeypox, whose incidence has increased in recent years, and to contribute to patients’ rapid isolation when necessary. For this purpose, experiments were carried out by combining two datasets: PAD-UFES-20 [46] and MSLD [47]. The preparation of the MSLD dataset is explained in [26], and the preparation of the PAD-UFES-20 dataset is explained in [48].

The PAD-UFES-20 dataset contains six different skin lesion classes, which are Actinic Keratosis (ACK), Basal Cell Carcinoma (BCC), Melanoma (MEL), Nevus (NEV), Squamous Cell Carcinoma (SCC), and Seborrheic Keratosis (SEK).

MSLD includes two different labels: Monkeypox (MPX) and “Non-monkeypox”. However, the “Non-monkeypox” class contains skin lesion images from two different diseases without their actual labels. Thus, this class has been excluded from the combined dataset. As a result, the combined dataset includes seven different classes. Figure 3 shows some sample images from the combined dataset.

The combined dataset includes 2298 images from PAD-UFES-20 and 102 images from MSLD. These images have been split into a training set (80%) and a testing set (20%). Data augmentation techniques such as rotation, reflection, translation, and scaling have been applied to the training set to balance the classes. Figure 4 shows the number of samples in the combined dataset before and after data augmentation.
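Balancing a training set through augmentation can be sketched as below. The specific choices are assumptions made for illustration, not the article's settings: rotations are restricted to multiples of 90 degrees, translation is a circular shift of at most 4 pixels, scaling is omitted, images are assumed square, and the class sizes in the usage example are invented.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly rotated, reflected, and shifted copy of a square (H, W, C) image."""
    out = np.rot90(image, k=int(rng.integers(0, 4)), axes=(0, 1))   # rotation
    if rng.random() < 0.5:
        out = out[:, ::-1]                                          # horizontal reflection
    dy, dx = rng.integers(-4, 5, size=2)
    # Circular shift: pixels pushed past one border reappear at the opposite one.
    return np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))

def balance(images_by_class, rng):
    """Add augmented copies until every class matches the largest class."""
    target = max(len(images) for images in images_by_class.values())
    balanced = {}
    for label, images in images_by_class.items():
        extra = [augment(images[int(rng.integers(len(images)))], rng)
                 for _ in range(target - len(images))]
        balanced[label] = list(images) + extra
    return balanced

rng = np.random.default_rng(0)
# Hypothetical, deliberately imbalanced training set of 32x32 RGB images.
train = {
    "MEL": [rng.random((32, 32, 3)) for _ in range(3)],
    "MPX": [rng.random((32, 32, 3)) for _ in range(5)],
    "BCC": [rng.random((32, 32, 3)) for _ in range(12)],
}
balanced = balance(train, rng)
print({label: len(images) for label, images in balanced.items()})
```

Augmentation is applied only to the training portion so that the test set keeps the original images and class distribution.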

4.2 Working environment and evaluation metrics

Training and testing were carried out in Colab on an A100 GPU with 15GB capacity. The parameters used in the training step are as follows: the learning rate is 2e-5, the batch size is 10, the weight decay is 0.01, and the evaluation strategy is per epoch.

The performance of classification systems is evaluated using different metrics, which allow users and scientists to understand different aspects of a system. Each metric has advantages and limitations, so using them together provides a more comprehensive evaluation. In this study, the results have been evaluated using five metrics: Accuracy, Recall, Fscore, Precision, and Jaccard. Although accuracy is often used to evaluate overall performance, it may not be sufficient on its own when there are class imbalances [49]. In this case, the Jaccard metric can provide a more accurate measure of performance [50]. Precision evaluates the performance of the model when false positives are significant. Recall measures the performance of the model when false negatives are significant. Fscore evaluates the precision and recall metrics together [51]. The mathematical equations of these metrics are given in (3), (4), (5), (6) and (7). In the equations, TP is true positive, FP is false positive, TN is true negative, and FN is false negative.

$$\begin{aligned} Accuracy = (TP+TN)/(TP+TN+FP+FN) \end{aligned}$$

(3)

$$\begin{aligned} Recall = TP/(TP+FN) \end{aligned}$$

(4)

$$\begin{aligned} Fscore = 2*TP/(2*TP+FN+FP) \end{aligned}$$

(5)

$$\begin{aligned} Precision = TP/(TP+FP) \end{aligned}$$

(6)

$$\begin{aligned} Jaccard = TP/(TP+FP+FN) \end{aligned}$$

(7)
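For a multi-class problem, the quantities in (3) to (7) can be read off a confusion matrix. The sketch below is one possible way to do this and rests on assumptions: accuracy is taken as the overall fraction of correct predictions, the other metrics are computed per class in a one-vs-rest manner, and the article does not state how per-class values are averaged. The 3-class confusion matrix is invented for illustration.

```python
import numpy as np

def per_class_metrics(confusion):
    """confusion[i, j] = number of samples of true class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)                 # correctly predicted samples per class
    fp = confusion.sum(axis=0) - tp         # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp         # belonging to the class, but missed
    return {
        "accuracy": tp.sum() / confusion.sum(),       # overall accuracy
        "recall": tp / (tp + fn),                     # Eq. (4)
        "fscore": 2 * tp / (2 * tp + fn + fp),        # Eq. (5)
        "precision": tp / (tp + fp),                  # Eq. (6)
        "jaccard": tp / (tp + fp + fn),               # Eq. (7)
    }

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [[8, 1, 1],
                 [2, 6, 2],
                 [0, 1, 9]]
metrics = per_class_metrics(toy_confusion)
for name, value in metrics.items():
    print(name, np.round(value, 4))
```

Equations (5) and (7) imply Jaccard = Fscore / (2 - Fscore) for each class, so the Jaccard value is never larger than the Fscore computed from the same counts. A class that is never predicted would cause a division by zero in the precision term and needs separate handling.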

4.3 Comparative results

In this study, several pre-trained models have been trained and tested on the same combined dataset. Table 2 reports the classification results. As seen in the table, the vision transformer model outperforms the other models. The Vision Transformer model has also been trained from scratch to better show the benefits of transfer learning with a small amount of data. The model trained from scratch reached an accuracy of 60.49%, compared with 80.66% for the fine-tuned model.

Table 2 Comparison of the vision transformer model with the pre-trained networks


In this study, a deep ensemble learning technique has also been applied. As seen in Table 2, the best accuracy scores have been obtained using the DenseNet201 and vision transformer models. Thus, these two models have been combined using the bagging ensemble technique. This approach improved the accuracy to 81.91%.

Figure 5 shows the confusion matrix of the ensemble model. Although there are some mistakes, the proposed system generally produces promising results. The worst performance was obtained for the SCC lesion. These images are low-quality images taken with a mobile phone camera. Training and testing with higher-quality images could therefore increase the system's success.

Figure 6 shows two samples from the test results. Figure 6(a) is an ACK-type lesion. It is predicted as MPX by DenseNet201 and as ACK by the vision transformer model. The vision transformer model usually performs better when the lesion covers a large portion of the image, because such images contain long-range features that vision transformer models can capture. As can be seen, the ensemble model predicted this image correctly owing to the vision transformer model. Figure 6(b) is an MPX-type lesion. Although DenseNet201 correctly predicted this image, the vision transformer did not. As seen in this figure, the lesion occupies a small part of the image. DenseNet201 is more successful in classifying such images, in which the lesion is concentrated in a specific image region and local feature information is required. For this image, the final ensemble model made the correct prediction owing to DenseNet201. As these samples show, the final ensemble model performs well because it draws on the different strengths of DenseNet201 and the vision transformer model.

The proposed Vision Transformer and CNN-based ensemble model for the skin lesion classification task has also been compared with other studies that used the same PAD-UFES-20 dataset. Table 3 reports the classification results. Although the proposed ensemble-based system classifies seven lesion classes, it outperforms other studies that classify six lesion classes in terms of Accuracy, Precision, and Fscore. Although [30] also uses the 7-class PAD-UFES+Monkeypox data, the proposed system outperforms it on all metrics.

Table 3 Comparison of the vision transformer model with the other studies


To provide a fairer comparison with the literature, the Vision Transformer-based system has also been trained and tested using only the 6-class PAD-UFES-20 data. The testing results of this 6-class system are also presented in the comparison table. This system produced better results than [52,53,54,55] in terms of Accuracy, Precision, and Fscore. In terms of Recall, it is second only to [52] and better than the other studies. Skin lesion images are complex, and a successful feature extractor is needed to distinguish them. In this study, the vision transformer model outperforms the other models. This can be explained by its attention mechanisms, which can focus on important features. Moreover, with ensemble learning, the system combines the strengths of the DenseNet201 and vision transformer models. Thus, it outperforms the other models on many metrics.

5 Conclusion

This paper proposed an automated system that classifies monkeypox and other skin lesion images. For that purpose, two datasets, PAD-UFES-20 and MSLD, have been combined. The combined dataset includes seven types of skin lesions. To classify them, a state-of-the-art vision transformer model has been used. Vision transformer models usually perform well with larger datasets, but the combined dataset contains only a small amount of data; to cope with this, fine-tuning has been applied. A vision transformer model initially trained on ImageNet-21k, a dataset of 14 million labeled images, has been used for the skin lesion classification task. Also, to obtain a fair comparison, the same system has been trained and tested using only PAD-UFES-20 dataset images. In this experiment, the system outperformed the literature in terms of Accuracy, Precision, and Fscore, and produced comparable results in terms of Recall. Furthermore, some popular pre-trained networks have been trained for the skin lesion classification task using the transfer learning approach, and their test results have been compared. Among these networks, the best accuracy has been obtained using DenseNet201. Finally, an ensemble model has been created using the pre-trained DenseNet201 model and the vision transformer model. All the results have been reported in tables in the “Experimental Results” section. The ensemble model produced 81.91% Accuracy, 65.94% Jaccard, 87.16% Precision, 74.12% Recall, and 78.16% Fscore for 7-class classification. According to these values, the system competes with state-of-the-art models and outperforms some of them.

The proposed system can assist researchers and doctors while diagnosing skin lesions. It can be helpful in situations that require rapid diagnosis and urgent patient isolation, such as monkeypox.

Availability of Data and Materials

The author declares that all data supporting the findings of this study are available within the article.

References

Chatterjee S, Dey D, Munshi S (2019) Integration of morphological preprocessing and fractal based feature extraction with recursive feature elimination for skin lesion types classification. Comput Methods Programs Biomed 178:201–218. https://doi.org/10.1016/j.cmpb.2019.06.018
WHO (2022). https://www.who.int/europe/news/item/23-07-2022-who-director-general-declares-the-ongoing-monkeypox-outbreak-a-public-health-event-of-international-concern Accessed 5 Oct 2022
Zalaudek I, Argenziano G, Stefani AD, Ferrara G, Marghoob AA, Hofmann-Wellenhof R, Soyer HP, Braun R, Kerl H (2006) Dermoscopy in general dermatology. Dermatology 212:7–18. https://doi.org/10.1159/000089015
Mandava A, Ravuri PR, Konathan R (2013) High-resolution ultrasound imaging of cutaneous lesions. Indian J Radiol Imaging 23:269–277. https://doi.org/10.4103/0971-3026.120272
Hasan MK, Ahamad MA, Yap CH, Yang G (2023) A survey, review, and future trends of skin lesion segmentation and classification. Comput Biol Med 155:106624. https://doi.org/10.1016/j.compbiomed.2023.106624
Lugagne J-B, Lin H, Dunlop MJ (2020) Delta: automated cell segmentation, tracking, and lineage reconstruction using deep learning. PLoS Comput Biol 16:1007673. https://doi.org/10.1371/journal.pcbi.1007673
Sahin VH, Oztel I, Oztel GY (2022) Human monkeypox classification from skin lesion images with deep pre-trained network using mobile application. J Med Syst 46:79. https://doi.org/10.1007/s10916-022-01863-7
Hameed N, Hameed F, Shabut A, Khan S, Cirstea S, Hossain A (2019) An intelligent computer-aided scheme for classifying multiple skin lesions. Computers 8:62. https://doi.org/10.3390/computers8030062
Oztel I, Oztel GY, Akgun D (2023) A hybrid lbp-dcnn based feature extraction method in yolo: an application for masked face and social distance detection. Multimed Tools Appl 82:1565–1583. https://doi.org/10.1007/s11042-022-14073-7
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929
Mahersia H, Hamrouni K (2015) Using multiple steerable filters and Bayesian regularization for facial expression recognition. Eng Appl Artif Intell 38:190–202
Classifier S, Renganathan K, Bhaskar V, Vishnuvardhan J (2020) Melanoma skin cancer detection using knn and svm classifier. Elem Educ Online 19:2076–2085. https://doi.org/10.17051/ilkonline.2020.02.696792
Nancy VAO, Prabhavathy P, Arya MS, Ahamed BS (2023) Comparative study and analysis on skin cancer detection using machine learning and deep learning algorithms. Multimed Tools Appl. https://doi.org/10.1007/s11042-023-16422-6
Sivakumar MS, Leo LM, Gurumekala T, Sindhu V, Priyadharshini AS (2023) Deep learning in skin lesion analysis for malignant melanoma cancer identification. Multimed Tools Appl. https://doi.org/10.1007/s11042-023-16273-1
Anand V, Gupta S, Koundal D, Nayak SR, Nayak J, Vimal S (2022) Multi-class skin disease classification using transfer learning model. Int J Artif Intell Tools 31:1. https://doi.org/10.1142/S0218213022500294
Hafhouf B, Zitouni A, Megherbi AC, Sbaa S (2022) An improved and robust encoder-decoder for skin lesion segmentation. Arab J Sci Eng 47:9861–9875. https://doi.org/10.1007/s13369-021-06403-y
Araújo RL, Araújo FHD, Silva RRV (2022) Automatic segmentation of melanoma skin cancer using transfer learning and fine-tuning. Multimed Syst 28:1239–1250. https://doi.org/10.1007/s00530-021-00840-3
Shorfuzzaman M (2022) An explainable stacked ensemble of deep learning models for improved melanoma skin cancer detection. Multimed Syst 28:1309–1323. https://doi.org/10.1007/s00530-021-00787-5
Lakshmi TRV, Reddy CVK (2023) Classification of skin lesions by incorporating drop-block and batch normalization layers in representative cnn models. Arab J Sci Eng. https://doi.org/10.1007/s13369-023-08131-x
Karthik R, Vaichole TS, Kulkarni SK, Yadav O, Khan F (2022) Eff2net: an efficient channel attention-based convolutional neural network for skin disease classification. Biomed Signal Process Control 73:103406. https://doi.org/10.1016/j.bspc.2021.103406
Srinivasu PN, SivaSai JG, Ijaz MF, Bhoi AK, Kim W, Kang JJ (2021) Classification of skin disease using deep learning neural networks with mobilenet v2 and lstm. Sensors 21:2852. https://doi.org/10.3390/s21082852
Medhat S, Abdel-Galil H, Aboutabl AE, Saleh H (2022) Skin cancer diagnosis using convolutional neural networks for smartphone images: a comparative study. J Radiat Res Appl Sci 15(1):262–267. https://doi.org/10.1016/j.jrras.2022.03.008
Alyami J, Rehman A, Sadad T, Alruwaythi M, Saba T, Bahaj SA (2022) Automatic skin lesions detection from images through microscopic hybrid features set and machine learning classifiers. Microsc Res Tech 85:3600–3607. https://doi.org/10.1002/jemt.24211
Yan S, Liu C, Yu Z, Ju L, Mahapatra D, Mar V, Janda M, Soyer P, Ge Z (2023) Epvt: environment-aware prompt vision transformer for domain generalization in skin lesion recognition. arXiv:2304.01508
Ahsan MM, Uddin MR, Farjana M, Sakib AN, Momin KA, Luna SA (2022) Image data collection and implementation of deep learning-based model in detecting monkeypox disease using modified vgg16
Ali SN, Ahmed MT, Paul J, Jahan T, Sani SMS, Noor N, Hasan T (2022) Monkeypox skin lesion detection using deep learning models: a preliminary feasibility study. arXiv:2207.03342
Aloraini M (2024) An effective human monkeypox classification using vision transformer. Int J Imaging Syst Technol 34. https://doi.org/10.1002/ima.22944
Kundu D, Siddiqi UR, Rahman MM (2022) Vision transformer based deep learning model for monkeypox detection. In: 25th International Conference on Computer and Information Technology (ICCIT), pp 1021–1026. https://doi.org/10.1109/ICCIT57492.2022.10054797
Ahsan MM, Alam TE, Haque MA, Ali MS, Rifat RH, Nafi AAN, Hossain MM, Islam MK (2024) Enhancing monkeypox diagnosis and explanation through modified transfer learning, vision transformers, and federated learning. Inf Med Unlocked 45:101449. https://doi.org/10.1016/j.imu.2024.101449
Oztel I, Oztel GY, Sahin VH (2023) Deep learning-based skin diseases classification using smartphones. Adv Intell Syst 5. https://doi.org/10.1002/aisy.202300211
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: 31st Conference on neural information processing systems
Li R, Xiao W, Wang L, Jang H, Carenini G (2021) T3-vis: visual analytic for training and fine-tuning transformers in nlp. In: Proceedings of the 2021 conference on empirical methods in natural language processing: system demonstrations, pp 220–230. https://doi.org/10.18653/v1/2021.emnlp-demo.26
Krichene S, Müller T, Eisenschlos JM (2021) Dot: an efficient double transformer for nlp tasks with tables. ACL-IJCNLP, Findings of the Association for Computational Linguistics
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision
Wang H, Zhu Y, Green B, Adam H, Yuille A, Chen L-C (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. In: European conference on computer vision
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. In: 34th Conference on neural information processing systems
Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, Krikun M, Shazeer N, Chen Z (2020) Gshard: scaling giant models with conditional computation and automatic sharding. In: The international conference on learning representations
Shuvo MMH, Kassim YM, Bunyak F, Glinskii OV, Xie L, Glinsky VV, Huxley VH, Thakkar MM, Palaniappan K (2021) Multi-focus image fusion for confocal microscopy using u-net regression map. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp 4317–4323. https://doi.org/10.1109/ICPR48806.2021.9412122
Hamad A, Bunyak F, Ersoy I (2017) Nucleus classification in colon cancer H&E images using deep learning. Microsc Microanal 23:1376–1377. https://doi.org/10.1017/S1431927617007541
Priya BL, Jayalakshmy S, Idayachandran G, Kumaran S (2022) Performance analysis of semantic segmentation using optimized cnn based segnet. In: 2022 International conference on smart technologies and systems for next generation computing (ICSTSN), pp 1–5. https://doi.org/10.1109/ICSTSN53084.2022.9761293
Mutegeki R, Han DS (2020) A cnn-lstm approach to human activity recognition. In: 2020 International conference on artificial intelligence in information and communication (ICAIIC), pp 362–366. https://doi.org/10.1109/ICAIIC48513.2020.9065078
Kaur G, Sinha R, Tiwari PK, Yadav SK, Pandey P, Raj R, Vashisth A, Rakhra M (2022) Face mask recognition system using cnn model. Neurosci Inf 2:100035. https://doi.org/10.1016/j.neuri.2021.100035
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, Cambridge
Imagenet (2023). http://www.image-net.org/ Accessed 26 June 2023
Ganaie MA, Hu M, Malik AK, Tanveer M, Suganthan PN (2022) Ensemble deep learning: a review. Eng Appl Artif Intell 115:105151. https://doi.org/10.1016/j.engappai.2022.105151
Pacheco AGC, Lima GR, Salomão AS, Krohling B, Biral IP, de Angelo GG, Alves FCR Jr, Esgario JGM, Simora AC, Castro PBC, Rodrigues FB, Frasson PHL, Krohling RA, Knidel H, Santos MCS, do Espírito Santo RB, Macedo TLSG, Canuto TRP, de Barros LFS (2020) Pad-ufes-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief 31:106221. https://doi.org/10.1016/j.dib.2020.106221
Kaggle (2022) Monkeypox Skin Lesion Dataset. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset Accessed 5 Oct 2022
Pacheco AGC, Krohling RA (2020) The impact of patient clinical information on automated skin cancer detection. Comput Biol Med 116:1. https://doi.org/10.1016/j.compbiomed.2019.103545
Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 Fourth international conference on natural computation, pp 192–201. https://doi.org/10.1109/ICNC.2008.871
Atas I (2023) Performance evaluation of jaccard-dice coefficient on building segmentation from high resolution satellite images. Balkan J Electr Comput Eng 11:100–106. https://doi.org/10.17694/bajece.1212563
Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manag 45:427–437. https://doi.org/10.1016/j.ipm.2009.03.002
Khan IU, Aslam N, Anwar T, Aljameel SS, Ullah M, Khan R, Rehman A, Akhtar N (2021) Remote diagnosis and triaging model for skin cancer using efficientnet and extreme gradient boosting. Complexity 2021. https://doi.org/10.1155/2021/5591614
Chen Q, Li M, Chen C, Zhou P, Lv X, Chen C (2022) Mdfnet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification. J Cancer Res Clin Oncol. https://doi.org/10.1007/s00432-022-04180-1
Pacheco AGC, Krohling RA (2021) An attention-based mechanism to combine images and metadata in deep learning models applied to skin cancer classification. IEEE J Biomed Health Inform 25:3554–3563. https://doi.org/10.1109/JBHI.2021.3062002
Haritha D, Sandhya B (2022) Multi-modal medical data fusion using deep learning. In: Proceedings of the 2022 9th international conference on computing for sustainable global development, INDIACom 2022, pp 500–505. https://doi.org/10.23919/INDIACom54597.2022.9763296

Acknowledgements

This study was supported by The Scientific and Technological Research Council of Turkey - Turkish Academic Network and Information Center (TUBITAK-ULAKBIM) in the publication process.

Funding

Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK).

Author information

Authors and Affiliations

Department of Software Engineering, Sakarya University, Serdivan, Sakarya, 54050, Turkey
Gozde Yolcu Oztel

Authors

Gozde Yolcu Oztel

Corresponding author

Correspondence to Gozde Yolcu Oztel.

Ethics declarations

Conflict of Interest/Competing Interests

The author did not receive support from any organization for the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yolcu Oztel, G. Vision transformer and CNN-based skin lesion analysis: classification of monkeypox. Multimed Tools Appl 83, 71909–71923 (2024). https://doi.org/10.1007/s11042-024-19757-w

Received: 19 August 2023
Revised: 26 May 2024
Accepted: 23 June 2024
Published: 09 July 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s11042-024-19757-w
