Introduction

Monkeypox is an infectious disease caused by the monkeypox virus (MPXV), a member of the Orthopoxvirus genus. The virus was first identified in 1958 in laboratory monkeys at a research institute in Copenhagen, Denmark, hence the name monkeypox virus [1]. The first human case was confirmed in 1970 in the Democratic Republic of the Congo, when a child with smallpox-like symptoms was admitted to hospital [2]. The virus transmits to humans through close contact with infected individuals or contaminated objects [3]. Initially it appeared mostly in Africa, but it has recently reached more than 50 countries, with 3,413 confirmed cases and one death [4]. Two clades of the monkeypox virus are known to date: the Central African clade and the West African clade. There is no specific treatment for monkeypox so far; the ultimate solution is the development of a vaccine. Diagnosis is performed mainly with the polymerase chain reaction (PCR) method or by examining skin lesions under electron microscopy. PCR is the most trusted method of virus confirmation and has also been used for COVID-19 diagnosis in recent years. In addition, artificial intelligence (AI)-based techniques could help detect the disease through image processing and analysis.

With the rapid growth of AI models in various domains such as chest X-ray image analysis [5], fruit image analysis [6], and sentiment analysis [7, 8], AI models for medical image analysis have been proposed for the detection of various virus-related diseases [9, 10]. For instance, Madhavan et al. [10] developed a deep learning model (Res-COvNet) based on a transfer learning approach for COVID-19 detection. They employed ResNet-50 [11] to extract features from X-ray images and extended the network with a classification layer. Their proposal achieved a promising accuracy of 96.2% in identifying normal, bacterial pneumonia, viral pneumonia, and COVID-19 cases on X-ray images. Similarly, a review of deep learning models for COVID-19 detection was reported by Bhatt et al. [12]. In addition, facial skin disease detection using deep learning was implemented by Yadav et al. [13].

Besides COVID-19 detection, a few works have used deep learning models to detect other diseases such as chickenpox and herpes. For instance, Sandeep et al. [14] investigated the detection of various skin diseases, namely Psoriasis, Chickenpox, Vitiligo, Melanoma, Ringworm, Acne, Lupus, and Herpes, using deep learning (DL)-based methods. They developed a Convolutional Neural Network (CNN) to classify skin lesions into these eight disease classes and compared their solution against the VGG-16 pre-trained model [15]. Their method achieved a detection accuracy of 78%. Low-cost image analysis for Herpes Zoster Virus (HZV) detection using a CNN was proposed in [16]; this early-detection method produced an accuracy of 89.6% when tested on 1,000 images.

Furthermore, measles detection using a transfer learning approach was implemented by Glock et al. [17]. They achieved a sensitivity of 81.7%, a specificity of 97.1%, and an accuracy of 95.2% using the ResNet-50 model [11] on a diverse rash image dataset. Moreover, a big-data approach for Ebola virus disease detection was proposed in [18] using an ensemble learning approach, combining an artificial neural network (ANN) with a genetic algorithm (GA) for knowledge extraction over big data on the Apache Spark and Kafka frameworks. More recently, Ahsan et al. [19] collected images of the Monkeypox, Chickenpox, Measles, and Normal categories using web mining techniques, with expert verification. They then evaluated a transfer learning approach with the VGG-16 model under two settings [20]: the first classified images into two disease classes, Monkeypox and Chickenpox, without augmentation, whereas the second added data augmentation. They reported an accuracy of 97% when classifying Monkeypox without data augmentation, which decreased to 78% with augmentation.

From existing research on virus-related disease detection using DL methods, we observe that the majority employ a transfer learning approach [10, 17] built on well-established pre-trained DL models. There are few works on Monkeypox virus detection apart from that of Ahsan et al. [20], whose proposal provides encouraging results in this domain. However, it has three main limitations. First, their models handle only binary classification, with limited performance. Second, they consider only the VGG-16 DL model for transfer learning, and thus do not identify the best-performing pre-trained DL methods or their best combinations for optimal performance. Third, their models offer insufficient interpretability, which makes it difficult to establish trustworthiness among health practitioners during mass screening.

To address the aforementioned limitations, we first take 13 pre-trained DL models and fine-tune them with a common approach. Second, we evaluate the performance of each DL model using Precision, Recall, F1-score, and Accuracy averaged over 5 folds. Third, we ensemble the best-performing models to improve overall performance.

The main contributions in this paper are as follows:

  • Propose a common fine-tuned architecture for all 13 pre-trained DL models for Monkeypox detection and compare them;

  • Perform an ablative study to select the best-performing DL models for ensemble learning;

  • Compare the proposed approach with the state-of-the-art methods; and

  • Show the explainability of the best-performing DL model using Grad-CAM [21] and LIME [22].

Materials and methods

Dataset

Herein, we use a publicly available Monkeypox image dataset [19, 20]. The dataset has different sub-folders, including versions with and without augmentation. Given that DL models benefit from augmented images to learn meaningful information more accurately, we use the augmented version in this study. Table 1 shows the number of images per category in the augmented folder.

Table 1 Dataset statistics

Evaluation metrics

We use four widely used performance metrics: Precision (Eq. (1)), Recall (Eq. (2)), F1-score (Eq. (3)), and Accuracy (Eq. (4)).

$$\begin{aligned} P= \frac{TP}{TP+FP}, \end{aligned}$$
(1)
$$\begin{aligned} R= \frac{TP}{TP+FN}, \end{aligned}$$
(2)
$$\begin{aligned} F= 2 \times \frac{P \times R}{P+R}, \end{aligned}$$
(3)
$$\begin{aligned} A = \frac{TP+TN}{TP+TN+FP+FN}, \end{aligned}$$
(4)

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. Similarly, P, R, F, and A represent Precision, Recall, F1-score, and Accuracy, respectively.
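As a sanity check, these metrics can also be computed directly from predicted and true labels; the short sketch below uses scikit-learn's macro-averaged implementations (an assumption, as the text does not name a scoring library).

```python
# Hypothetical illustration: computing the four metrics with scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 3, 1, 0]  # toy ground-truth class indices
y_pred = [0, 1, 2, 2, 1, 0]  # toy model predictions

print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall:   ", recall_score(y_true, y_pred, average="macro"))
print("F1-score: ", f1_score(y_true, y_pred, average="macro"))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```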

Pre-trained DL models

The availability of various DL models trained on a large dataset, ImageNet [23], has enabled significant progress in image classification and computer vision tasks. In particular, when expert-labelled data are scarce, as in biomedical image analysis, the most common approach is to utilise these pre-trained DL models for transfer learning [24]. This helps boost performance in a limited-data setting because transfer learning allows DL models trained on large datasets to transfer their learned knowledge to a small domain-specific dataset.

We choose 13 pre-trained DL models for this study. These pre-trained models range from heavy-weight DL models such as VGG-16 [15], Inception-V3 [28], and Xception [25] to light-weight models such as MobileNet [26] and EfficientNet [27]. The overall training pipeline for these models is shown in Fig. 1. We use the same customisation for all pre-trained models. A brief discussion of each pre-trained DL model is presented in the following subsections.

VGG

The Visual Geometry Group (VGG) at Oxford University developed a Convolutional Neural Network (CNN), popularly known as VGG-16, which was among the top performers in the ImageNet [23] challenge in 2014. It consists of 13 Convolution, 5 Max-pooling, and 3 Dense layers. It is named VGG-16 because it has 16 layers with learnable weight parameters [15]. An extended version of VGG-16, consisting of 16 Convolution layers, 5 Max-pooling layers, and 3 Dense layers, is known as VGG-19.

ResNet

Very deep convolutional neural networks such as VGG-16 and VGG-19 produce promising results on large-scale image classification tasks. However, they are hard to train because of the vanishing gradient problem: the small gradients multiplied while propagating back to earlier layers begin to vanish beyond a certain depth. ResNet addresses this problem by introducing skip connections, which allow some layers in the network to be skipped. The groups of layers that use such skip connections are known as residual blocks (Res-Blocks), the core of the ResNet architecture [11]. Here, we utilise two ResNet architectures: ResNet-50 and ResNet-101. ResNet-50 consists of 48 Convolution layers, 1 Max-pooling layer, and 1 Average-pooling layer, whereas ResNet-101 includes 99 Convolution layers, 1 Max-pooling layer, and 1 Average-pooling layer.
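To make the skip-connection idea concrete, here is a minimal Keras sketch of an identity residual block; it is an illustrative simplification, not the exact bottleneck block used in ResNet-50/101.

```python
# Minimal identity residual block (illustrative sketch, not the exact ResNet block).
from tensorflow.keras import layers

def residual_block(x, filters):
    # Assumes the input already has `filters` channels so shapes match at the Add.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])  # the skip connection: gradients bypass the block
    return layers.Activation("relu")(y)
```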

Inception-V3

The idea of widening the network instead of only deepening it is implemented in the Inception network, developed by a team of researchers at Google [28]. The Inception architecture uses four parallel convolution branches with different kernel sizes at a given depth of the network to extract image features at different scales before passing them to the next layer. Here, we utilise the 48-layer deep Inception-V3 network.

InceptionResNet

With the development of wider and deeper architectures such as the Inception [28] and ResNet [11] networks, researchers exploited the benefit of combining the Inception architecture with residual connections, establishing a model called InceptionResNet. We utilise the InceptionResNet-V2 network, which consists of 449 layers, including Convolution layers, Pooling layers, Batch-normalization layers, and so on.

Xception

Xception is an extreme version of the Inception network, developed by Google in 2017 [25]. Its main idea is to make the convolution operations in the Inception blocks more efficient. This is achieved with a modified depth-wise separable convolution performed in two steps: a point-wise convolution followed by a depth-wise convolution, where the point-wise convolution changes the dimensionality and the depth-wise convolution performs the channel-wise spatial convolution.
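The two-step factorisation can be sketched with Keras primitives as below (Keras also offers a fused SeparableConv2D layer); note that, as described above, Xception applies the point-wise convolution first.

```python
# Depth-wise separable convolution in the point-wise-first order described above (sketch).
from tensorflow.keras import layers

def modified_separable_conv(x, filters):
    x = layers.Conv2D(filters, kernel_size=1)(x)  # point-wise: changes the channel dimension
    x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)  # depth-wise: per-channel spatial filtering
    return x
```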

MobileNet

The idea of depth-wise convolution was further exploited in the MobileNet architecture [26]. In this work, we utilise MobileNet-V2 [29], an extended version of MobileNet-V1. MobileNet-V1 consists of 1 regular Convolution layer, 13 depth-wise separable convolution blocks, and 1 further regular Convolution layer, followed by an Average-pooling layer. MobileNet-V2 adds Expand layers, Residual connections, and Projection layers around the depth-wise Convolution layers, forming what is known as a Bottleneck residual block.

DenseNet

In the DenseNet architecture [30], the idea of skip connections is extended to multiple steps instead of the one-step direct connections used in ResNet, and the block designed around such connections is known as a Dense block. The main components of DenseNet are its connectivity pattern and its Dense blocks. Each layer in DenseNet has a direct connection to all subsequent layers, establishing L(L+1)/2 connections for L layers. Each Dense block consists of Convolution layers with the same feature-map size but different kernel sizes. In this work, we utilise the DenseNet-121 network, which consists of 120 Convolution layers and 4 Average-pooling layers.
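A minimal sketch of this dense connectivity: every layer consumes the concatenation of all preceding feature maps (illustrative only, not the exact DenseNet-121 block).

```python
# Dense connectivity via feature-map concatenation (illustrative sketch).
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    features = [x]
    for _ in range(num_layers):
        y = layers.Concatenate()(features) if len(features) > 1 else features[0]
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        features.append(y)  # every subsequent layer will also see this output
    return layers.Concatenate()(features)
```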

EfficientNet

The expansion of CNNs along the width, depth, and resolution dimensions was attempted somewhat arbitrarily in various deep neural network architectures such as ResNet, DenseNet, Inception, and Xception. A systematic approach for scaling up a CNN with a fixed set of scaling coefficients was proposed in the EfficientNet architecture [27]. The EfficientNet architecture consists of three blocks: stem, body, and final blocks. The stem and final blocks are common to all variants of EfficientNet, while the body differs from one version to another. The stem block consists of input, re-scaling, normalization, padding, convolution, batch-normalization, and activation layers. The body consists of five modules, where each module has depth-wise convolution, batch-normalization, and activation layers. In this study, we use three versions of EfficientNet: EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2. EfficientNet-B0 has 237 layers in total, whereas EfficientNet-B1 and EfficientNet-B2 have 339 layers, excluding the top layer.
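For reference, the compound scaling rule of [27] ties network depth d, width w, and input resolution r to a single coefficient \(\phi \):

$$\begin{aligned} d = \alpha ^{\phi }, \quad w = \beta ^{\phi }, \quad r = \gamma ^{\phi }, \quad \text {subject to } \alpha \cdot \beta ^{2} \cdot \gamma ^{2} \approx 2, \; \alpha , \beta , \gamma \ge 1, \end{aligned}$$

where \(\alpha \), \(\beta \), and \(\gamma \) are constants found by a small grid search on the base network, and increasing \(\phi \) yields the B1, B2, and subsequent variants.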

Fig. 1
figure 1

High-level workflow to train the pre-trained DL models with custom layers. Note that GAP refers to the Global Average Pooling layer and the value in parentheses for each Dense layer indicates its number of units
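As a minimal Keras sketch of this workflow, the snippet below attaches a GAP layer and Dense layers to a pre-trained backbone; the hidden Dense unit count here is a placeholder, since the actual unit values are those given in Fig. 1.

```python
# Illustrative sketch of the common fine-tuning head (Dense unit count is a placeholder).
import tensorflow as tf
from tensorflow.keras import layers, models

backbone = tf.keras.applications.Xception(  # any of the 13 backbones can be swapped in
    include_top=False, weights="imagenet", input_shape=(150, 150, 3))

x = layers.GlobalAveragePooling2D()(backbone.output)  # the GAP block of Fig. 1
x = layers.Dense(128, activation="relu")(x)           # hypothetical unit count
outputs = layers.Dense(4, activation="softmax")(x)    # 4 classes in the dataset

model = models.Model(backbone.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```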

Implementation

We implement our proposed model using Keras [31] in Python [32]. During the implementation, we set the parameters as follows. We first resize each image to 150×150 pixels, as suggested by Sitaula and Hossain [33]. For augmentation, we apply online data augmentation with the following settings: rescale = 1/255, rotation range = 50, width shift range = 0.2, height shift range = 0.2, shear range = 0.25, zoom range = 0.1, and channel shift range = 20. We set the optimizer to ‘Adam’, the batch size to 16, and the initial learning rate to 0.0001. To prevent over-fitting, we utilise learning-rate decay over each epoch, coupled with an early-stopping criterion.
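These settings map directly onto Keras' ImageDataGenerator; the sketch below assumes that generator (rather than, e.g., a tf.data pipeline) and a hypothetical directory layout.

```python
# Online augmentation with the parameters listed above (sketch; directory path is hypothetical).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1 / 255.0,
    rotation_range=50,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.25,
    zoom_range=0.1,
    channel_shift_range=20)

train_gen = train_datagen.flow_from_directory(
    "data/train",  # hypothetical path to the training split
    target_size=(150, 150), batch_size=16, class_mode="categorical")
```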

In our study, we design five random folds (5-fold cross-validation), where each fold uses a 70/30 train/test split, and we report the average performance.
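Because each fold is an independent random 70/30 split, the protocol can be sketched with repeated shuffle splits; `build_model` and `train_and_evaluate` below are hypothetical helpers, and the stratification is our assumption.

```python
# Five random 70/30 splits with averaged scores (sketch; helpers are hypothetical).
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.30, random_state=42)
scores = []
for train_idx, test_idx in splitter.split(image_paths, labels):
    model = build_model()  # hypothetical: returns a freshly fine-tuned DL model
    scores.append(train_and_evaluate(model, train_idx, test_idx))  # hypothetical helper
print("Average accuracy over 5 folds:", np.mean(scores))
```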

Fig. 2
figure 2

Ensemble method between Xception and DenseNet-169 DL models. Note that the Voting block refers to max-voting

Fig. 3
figure 3

Sample train/test plot (fold 1) for accuracy and loss obtained from the fine-tuned Xception DL model

Ensemble approach

To ensemble multiple DL models, we extract the probabilistic outputs from each fine-tuned pre-trained model and perform majority voting (refer to Fig. 2). Each of our fine-tuned DL models achieves a good fit, learning the optimal features during the training and testing process (see Fig. 3).

In this study, we choose the two best-performing fine-tuned models, Xception and DenseNet-169, based on our empirical study (see “Candidate model selection for ensemble”). Let the Xception model produce a probabilistic output vector X and DenseNet-169 a probabilistic output vector D, each with a size equal to the number of classes:

$$\begin{aligned} X=Xception(I), \end{aligned}$$
(5)
$$\begin{aligned} D=DenseNet-169(I), \end{aligned}$$
(6)
$$\begin{aligned} C=\underset{c}{\arg \max }\, [X, D], \end{aligned}$$
(7)

where I is the input image to be classified and C is the index of the class c receiving the highest vote.
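A sketch of Eqs. (5)-(7): each model yields a class-probability vector, and the predicted class is the one receiving the highest score across the stacked outputs (our reading of the max-voting block in Fig. 2).

```python
# Max-voting over the probabilistic outputs of the two models (sketch).
import numpy as np

def ensemble_predict(xception_model, densenet_model, image_batch):
    X = xception_model.predict(image_batch)  # Eq. (5): shape (batch, num_classes)
    D = densenet_model.predict(image_batch)  # Eq. (6)
    stacked = np.stack([X, D])               # shape (2, batch, num_classes)
    # Eq. (7): for each image, pick the class with the highest score across both models.
    return stacked.max(axis=0).argmax(axis=1)
```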

Table 2 Comparison of pre-trained DL models and ensemble approach using averaged Precision, Recall, F1-score, and Accuracy over 5 different folds

Results and discussion

Comparative study of DL models

We compare the proposed approach with the off-the-shelf pre-trained DL models on this dataset using the standard evaluation measures. The results are presented in Table 2. Note that the reported results are the performance averaged over 5 different folds (5-fold cross-validation).

From Table 2, we notice that Xception is the second-best performing method among all contenders and the best among the 13 pre-trained DL methods (Precision: 85.01%, Recall: 85.14%, F1-score: 85.02%, and Accuracy: 86.51%). We believe this is because of Xception's greater ability to extract discriminating information from the virus images with the help of its point-wise and depth-wise convolutions. In addition, our proposed ensemble method is the best-performing method among all contenders, with an accuracy of 87.13%. In terms of the other performance metrics, it yields a Precision of 85.44%, a Recall of 85.47%, and an F1-score of 85.40%. This makes it 0.43% higher in Precision, 0.33% higher in Recall, and 0.38% higher in F1-score than the second-best performing method (Xception). Furthermore, it yields 5.26%, 6.3%, and 6.39% higher Precision, Recall, and F1-score, respectively, than the least-performing DL method (VGG-16). This improvement comes from decision fusion, which combines the decision outcomes from different DL models into the final decision.

In summary, our proposed common custom layers are appropriate for fine-tuning all 13 pre-trained DL models to achieve optimal accuracy on this dataset. This is shown not only by the well-fitted train/test curves but also by the overall performance (from 80.18% to 84.07% for Precision, from 79.17% to 83.74% for Recall, from 70.01% to 83.83% for F1-score, and from 82.22% to 86.06% for Accuracy). Similarly, majority voting proves an effective option in ensemble learning, exploiting the strongest decision for the optimal final classification result.

Table 3 Performance comparison of different DL model combinations using Precision, Recall, F1-score and Accuracy
Fig. 4
figure 4

Confusion matrices for the five folds, (a) to (e), obtained from the best-performing Xception DL model in our study

Candidate model selection for ensemble

We select only those DL models that provide optimal performance in our study. For this, we take the top-5 models and evaluate their combinations for decision fusion. The detailed results are presented in Table 3, from which we observe that the combination of two models (Xception as M1 and DenseNet-169 as M2) provides the best performance among all combinations.

Table 4 Performance comparison of the proposed with the state-of-the-art methods using Precision, Recall, F1-score and Accuracy

Comparative study with state-of-the-art methods

Although there are no well-established published state-of-the-art methods for Monkeypox virus detection, we compare our proposed model with some closely related methods that have been used for COVID-19 detection. For this, we utilise four popular, well-established DL-based methods: bag of deep visual words (BoDVW) [35], multi-scale bag of deep visual words (MBoDVW) [5], attention-based VGG (AVGG) [33], and a convolutional neural network with long short-term memory (CNN-LSTM) [34]. We select the optimal hyperparameters from the corresponding papers to the best of our ability. The results, presented in Table 4, show that the proposed method is superior to these state-of-the-art methods on the well-established evaluation measures. From this result, we believe our method is well-suited to the Monkeypox virus detection problem, whereas the contender methods were designed for chest X-ray images and the COVID-19 detection problem.

Explainability

We show the explainability of the DL model using the Gradient-weighted Class Activation Mapping (Grad-CAM) [21] and Local Interpretable Model-Agnostic Explanations (LIME) [22] visualisation techniques. For this, we use the Xception model, the best-performing single model on the Monkeypox dataset. The outputs are presented in Fig. 5. Grad-CAM uses the gradients of the class score with respect to the feature maps of a selected layer of the network, whereas LIME is a local, model-agnostic approach that generates an interpretation for a specific instance by transforming the input into a series of interpretable local representations.

From Fig. 5, we notice that the outputs obtained from the Xception model clearly locate the discriminating regions used for classification. For instance, Grad-CAM highlights the virus-infected regions in yellow or dark yellow, and LIME encircles the potentially infected regions with superpixels on the map. Note that for Grad-CAM we use the ‘\(block14\_sepconv2\_act\)’ layer of the Xception DL model, and for LIME we set the number of features to 5, the number of samples to 1,000, and the top labels to 4.
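These settings translate into the following sketch; the Grad-CAM routine follows the standard recipe (gradients of the class score with respect to the chosen layer's feature maps), the LIME call uses the stated parameters, and the helper names and the preprocessed `image` array are ours.

```python
# Grad-CAM on 'block14_sepconv2_act' and LIME with the stated settings (sketch).
import numpy as np
import tensorflow as tf
from lime import lime_image

def grad_cam(model, image, layer_name="block14_sepconv2_act"):
    # `image` is assumed to be a preprocessed (150, 150, 3) float array.
    grad_model = tf.keras.models.Model(
        model.input, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        fmaps, preds = grad_model(image[np.newaxis])
        class_score = preds[:, tf.argmax(preds[0])]
    grads = tape.gradient(class_score, fmaps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # pooled gradients, one per channel
    cam = tf.nn.relu(tf.reduce_sum(weights * fmaps[0], axis=-1))
    return (cam / tf.reduce_max(cam)).numpy()        # normalised heatmap

# `model` is the fine-tuned Xception model; `image` is as above.
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, model.predict, top_labels=4, num_samples=1000)
lime_img, lime_mask = explanation.get_image_and_mask(
    explanation.top_labels[0], num_features=5)
```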

Fig. 5
figure 5

Visualisation based on Grad-CAM and LIME for the Xception model

Class-wise study

We study the class-wise performance of the ensemble approach using the confusion matrices shown in Fig. 4. The confusion matrices for all five folds show that our ensemble approach discriminates the images clearly into the four classes. More specifically, our method discriminates the chickenpox and normal images better than the measles and monkeypox images. Furthermore, all chickenpox (66) and normal (111) instances in the test set are recognised correctly by the proposed model, whereas it is still imperfect at separating measles from monkeypox in fold 1 (a). This might be due to similar features being identified by the backbone CNN for the measles and Monkeypox classes, as seen in the Grad-CAM visualisation (Fig. 5).

Conclusion and future works

In this paper, we compared 13 different pre-trained DL models fine-tuned via transfer learning on the monkeypox dataset. Through this comparison with well-established evaluation measures, we identified the best-performing DL models and ensembled them for an overall performance improvement. The evaluation shows that the ensemble approach provides the highest performance (Precision: 85.44%; Recall: 85.47%; F1-score: 85.40%; and Accuracy: 87.13%) for Monkeypox virus detection, while the Xception DL model provides the second-best performance (Precision: 85.01%; Recall: 85.14%; F1-score: 85.02%; and Accuracy: 86.51%).

Our work has two major limitations. First, the dataset is comparatively small, so adding more data could further improve performance. Second, our approach relies on pre-trained DL models, which could be problematic when deploying in a memory-constrained setting. The design of novel light-weight DL models that work with limited resources could therefore be an interesting direction for future work.