Introduction

In December 2019, the first cases of a novel coronavirus disease (COVID-19) were reported in Wuhan, China. The virus is thought to have a zoonotic origin, although its exact source has still not been determined [1,2,3]. The coronavirus, officially known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), causes an infectious disease that has spread to over 215 countries [4]. The disease is highly contagious, which makes it a significant threat to human beings. As per current statistics [5], the number of confirmed COVID-19 cases has crossed 103,338,512, with more than 2,233,559 deaths across the world [6]. People with weak immune systems or pre-existing conditions such as diabetes and hypertension are more prone to the infection [7]. The virus spreads through close contact or through droplets produced when an infected person coughs, sneezes, or talks [8]. At present, diagnosis is confirmed through real-time reverse transcription-polymerase chain reaction (RT-PCR) [9]. However, RT-PCR is time-consuming, vaccine development still involves a long chain of trials from animals to humans, and the virus mutates regularly. As a result, infected persons may go unrecognized, fail to receive suitable treatment, and spread the virus to the healthy population, which is unacceptable in a pandemic situation. It has been observed that X-ray and CT images are sensitive screening tools for COVID-19. A chest X-ray is typically performed first for suspected or confirmed patients through dedicated imaging circuits, and it serves as a discriminating element: the patient’s further diagnosis depends on the clinical situation and the chest X-ray findings. Many recent works that combine elaborated image features with clinical and diagnostic results may help in the early detection of COVID-19 [10,11,12]. Technological advances in machine and deep learning applications for the automatic prediction of COVID-19 have therefore attracted considerable attention [13].

Tulin et al. [14] introduced a work on X-ray image classification using the DarkNet model, the classification backbone of the YOLO object detector, to classify the input images. Another work on X-ray images, called Decompose, Transfer, and Compose (DeTraC), was introduced by Asmaa et al. [15]. DeTraC uses a class decomposition mechanism to deal with class boundaries, achieving an accuracy of 0.931 and a sensitivity of 1. A work by Taban et al. on chest X-ray images applied transfer learning to 12 off-the-shelf CNN architectures [16]. GANs have also been used for COVID-19 disease analysis by Mohamed et al. [17], who employed GAN-based deep transfer learning for three-class classification among COVID-19, Normal, and Pneumonia X-ray images. They utilized the dataset created by Joseph et al. [18] and achieved an accuracy, sensitivity, and F1-score of 0.8148, 0.8148, and 0.8146, respectively. Another work on classifying COVID-19 among the three classes of COVID-19, Normal, and Pneumonia X-ray images was performed by Enzo et al. [19], who used the datasets in [18, 20, 21] together with the Pneumonia images (Footnote 1) and achieved considerable performance. Wang et al. [22] introduced a COVID-19 predictive classification technique using 453 chest CT images and achieved an accuracy of \(82.9\%\). Khaled et al. [23] proposed another work for COVID-19 detection that screens chest X-rays using a transfer learning-based hybrid 2D/3D CNN architecture built on VGG16. Yifan et al. [24] introduced the use of CNNs for COVID-19 classification; their model achieved acceptable recall and precision values of 0.75 and 0.64, respectively, by combining depth-wise separable convolution layers and a spatial pyramid pooling (SPP) module with VGG16. Similar work using the Inception-based Xception model on chest X-ray images was performed by N. Narayan et al. [2].

In a work by Rahmatizadeh et al. [25], a three-step AI-based model was introduced to improve the critical care of patients in the intensive care unit (ICU), where the input evidence includes surgical, paraclinical, personalized medicine (OMICS), and epidemiological data. The model covers assessment, therapy, risk stratification, prognosis, and guidance, and the authors conclude that today’s AI-based technology can effectively support the healthcare system in detecting and managing COVID-19, particularly for patients in the ICU suffering from the disease.

Table 1 Overview of prediction models for diagnosis of COVID-19
Fig. 1

Sample images from the \({\mathcal {D}}_2\) and \({\mathcal {D}}_1\) datasets used for experimentation in this study; a COVID-19 images, b Normal images, and c Pneumonia images

Similarly, many other works on COVID-19 diagnosis have emerged that help experts classify the disease effectively. In Tables 1, 2 and 3, a comprehensive review of the use of deep learning for COVID-19 classification is tabulated. Motivated by the urgent need to develop means of fighting COVID-19, and encouraged by the open-access support of the research community, this work leverages the variational encoding of radiological image features into classification scores for predicting COVID-19-related lung opacities. A deep learning-based stacked pipeline incorporating deep CNNs is proposed: DenseNet and GoogLeNet are used for feature extraction, and a variational environment further extracts the latent space of the extracted features. The model was trained on publicly available COVID-19 datasets for performing diagnostic tests. The main contributions of this paper are as follows:

  • A framework for COVID-19 detection based on X-ray images is proposed, which outperforms state-of-the-art CNN models.

  • We propose a two-step ensemble approach under a variational setting for learning a joint representation of the representations generated by fine-tuned CNNs.

  • We perform extensive experiments on a COVID-19 dataset along with a challenging multi-class benchmark containing X-ray images.

Method

Dataset specification

The 2019 novel coronavirus produces several characteristic symptoms that can aid in detecting COVID-19 in patients, and it has been deduced that COVID-19 patterns can be perceived in both CT and chest X-ray images. Considering the seriousness of the worldwide situation, publicly available datasets were collected and used in this work.

Fig. 2

Sample images from dataset \({\mathcal {D}}_3\) used for the experimentation purpose in the course of this study; a Atelectasis b Effusion, c Infiltration, d Nodule and e Pneumonia

Initially, we have taken the Chest X-ray Images (Pneumonia) dataset [26], represented as \(~{\mathcal {D}}_1\), which contains a total of 5840 images. The training set contains 5216 images, bifurcated into the Normal and Pneumonia classes with 1341 and 3875 images, respectively. Similarly, the test set contains 624 images, subdivided into Pneumonia and Normal with 390 and 234 image samples, respectively.

Table 2 DATASET-1 \(~({\mathcal {D}}_1\)): Class-wise bifurcation of Pneumonia dataset
Table 3 DATASET-2 \(~({\mathcal {D}}_2\)): Class-wise bifurcation of COVID-19 dataset
Table 4 DATASET-3 \(~({\mathcal {D}}_3\)): Class-wise bifurcation of NIH Chest X-ray dataset

Another dataset, \(~{\mathcal {D}}_2\), of COVID-19 images was collected from three COVID-19 sources: the COVID-19 Radiography Database [27], containing approximately 219 images; another dataset [28], containing a total of 69 images with 60 samples for training and 9 for testing; and a third, open-source repository [29], containing a total of 35 COVID-19 images. All the datasets are combined to form a collection of 308 images that is used for experimentation with the proposed framework. In all, the 308 images are segregated into 196 for training and the remaining 112 for testing, approximately a 2:1 ratio.

We have also comprehensively performed experiments on the NIH Chest X-ray images (dataset \(~{\mathcal {D}}_3\), Footnote 2), which contain 14 categories. Out of these, we selected 5 categories, namely Atelectasis, Effusion, Infiltration, Nodule, and Pneumonia, which have a higher correlation among them (tabulated in Table 4). Upon initial experimental trials, we found that the Pneumonia subset of images was limited in number and mixed with other classes. Therefore, to remove this limitation, we amalgamated the Pneumonia images from the \(~{\mathcal {D}}_1\) dataset into the \(~{\mathcal {D}}_3\) dataset. A drawback of dataset \(~{\mathcal {D}}_3\) is that its labels were extracted from radiology reports using natural language processing (NLP), so a single image may carry multiple class labels. To avoid this ambiguity, we selected diseases with minimal overlap and the maximum number of images in the concerned category (see the sketch below). Some sample images from datasets \({\mathcal {D}}_1\), \({\mathcal {D}}_2\) and \({\mathcal {D}}_3\) are shown in Figs. 1 and 2, respectively. From a clinical point of view, we have referenced many literature studies that use the same dataset as our proposed model. This dataset contains only lung images of different patients, without their demographic or other individual information; therefore, only lung chest X-ray images are used in the experiments.
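A hedged sketch of how this data curation might be scripted; the folder layout, file paths, and image extension are our assumptions for illustration, not the authors’ actual pipeline:

```python
import shutil
from pathlib import Path

# Keep only the five NIH categories used in this study and merge the
# Pneumonia images from D1 to compensate for the small NIH Pneumonia subset.
SELECTED = ['Atelectasis', 'Effusion', 'Infiltration', 'Nodule', 'Pneumonia']
d3_root = Path('data/d3_nih_chest_xray')        # assumed per-class folders
d1_pneumonia = Path('data/d1/train/PNEUMONIA')  # assumed D1 layout

for cls in SELECTED:
    (d3_root / cls).mkdir(parents=True, exist_ok=True)

# Copy the D1 Pneumonia samples into the D3 Pneumonia folder.
for img in d1_pneumonia.glob('*.jpeg'):
    shutil.copy(img, d3_root / 'Pneumonia' / img.name)
```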

Network architecture

In this section, we detail the deep learning modules used to estimate the feasibility of diagnosing COVID-19 from chest X-rays. The concept of the stacked architecture of DenseNet and GoogLeNet is explained first; a variational autoencoder then operates on the output of the stacked architecture. The deep modules are trained on the \({\mathcal {D}}_1\) dataset, which provides the set \(\{ X_t, Y_t\}\) of images and their labels for Normal and Pneumonia patients, containing a total of 4265 Pneumonia and 1575 Normal image samples.

DenseNet [30], in its basic architecture, is a deep CNN that efficiently mitigates the vanishing-gradient problem through cross-connections, allowing the reuse of features previously extracted by the network [31]. This advantage is exploited to achieve better performance with less computation. For the binary classification task, two fully connected (FC) layers with 1024 hidden features are added, producing a final output over two classes. To fully utilize the network gradients, a ReLU activation function is placed between the two FC layers; since it does not activate all neurons at once, it helps elevate performance. At the output of the model, a LogSoftmax function is used instead of softmax to provide better numerical stability and gradient optimization (see the sketch below). In the DenseNet network, the convolution layers’ key task is to extract image features, avoiding the errors of manual feature extraction. Each layer consists of a series of feature maps held by a convolution kernel whose weights are learnable parameters; such a kernel is often called a convolution filter.
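A minimal PyTorch sketch of the classification head described above, assuming a torchvision DenseNet-121 backbone and the recent torchvision weights API; everything beyond the layer sizes stated in the text is an illustrative assumption:

```python
import torch.nn as nn
from torchvision import models

# DenseNet-121 backbone pre-trained on ImageNet (assumed variant).
densenet = models.densenet121(weights="IMAGENET1K_V1")

# Replace the original classifier with two FC layers (1024 hidden features),
# a ReLU in between, and a LogSoftmax output, as described in the text.
in_features = densenet.classifier.in_features  # 1024 for DenseNet-121
densenet.classifier = nn.Sequential(
    nn.Linear(in_features, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 2),        # binary task: Normal vs. Pneumonia
    nn.LogSoftmax(dim=1),      # pairs with nn.NLLLoss during fine-tuning
)
```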

Fig. 3

Schematic representation of the proposed framework depicting different modules for the classification of COVID-19 v/s Normal v/s Pneumonia classes

The convolution kernel consists of learnable parameters and is applied, via cross-connections, to the feature maps of the previous layer. The resulting elements are summed, an offset (bias) is added, and the result is passed through a nonlinear ReLU activation to obtain the extracted feature map.

An assortment of different convolution kernels allows the extraction of more complex features, as calculated in Eq. 1.

$$\begin{aligned} {x^{L}_{\beta }}= g\left\{ \sum _{\alpha }^{Z^{L-1}} x_{\alpha }^{L-1}\circledast k_{\alpha \beta }^{L-1} + \gamma _{\beta }^{L} \right\} \end{aligned}$$
(1)

where \(g(\cdot )\) denotes the nonlinear activation function, \(Z^{L-1}\) denotes the number of feature maps input to the convolution kernel selected at layer \(L-1\), \(\circledast \) signifies the convolution operation, and \(\gamma \) represents the offset added to the extracted kernel features. \(L\) denotes the layer index of the convolutional neural network, and \(k_{\alpha \beta }\) is the convolution kernel applied to the input \(x_{\alpha }^{L-1}\), connecting feature map \(\beta \) of the \(L\)th layer with feature map \(\alpha \) of the \((L-1)^{th}\) layer. After training, the output layer, the last FC layer, and the ReLU activation function are removed from the DenseNet model, and the remaining part is used as a feature extractor; the resulting model, \(DenseNet_{1024}\), outputs 1024 features.
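Continuing the sketch above, the trained head can be truncated so that the network emits the 1024-dimensional descriptor \(DenseNet_{1024}\); this is our reading of the text and reuses the hypothetical `densenet` defined earlier:

```python
import torch
import torch.nn as nn


class DenseNetFeatures(nn.Module):
    """Backbone plus the first added FC layer; the ReLU, the last FC layer
    and the LogSoftmax output are removed, leaving a 1024-D feature vector."""

    def __init__(self, trained_densenet):
        super().__init__()
        self.features = trained_densenet.features    # convolutional trunk
        self.fc1 = trained_densenet.classifier[0]    # Linear(1024, 1024)

    def forward(self, x):
        x = self.features(x)
        x = nn.functional.relu(x, inplace=True)
        x = nn.functional.adaptive_avg_pool2d(x, (1, 1))
        x = torch.flatten(x, 1)
        return self.fc1(x)                           # shape: (batch, 1024)


densenet_1024 = DenseNetFeatures(densenet).eval()
```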

GoogleNet [32] is a state-of-the-art deep learning architecture that provides better and more computationally efficient results. The network consists of inception modules, in which the output of the previous layer is connected to the next layer through four parallel branches. First, a \(1\times 1\) convolution is performed on the output of the previous layer, since that output contains useful extracted features that need to be reused; then a \(3\times 3\) and a \(5\times 5\) convolution are performed on the same output.

For dimensionality reduction, a \(1\times 1\) filter is applied before the \(3\times 3\) and \(5\times 5\) convolutions. In addition, a \(3\times 3\) max-pooling layer followed by a \(1\times 1\) convolution helps achieve an optimal sparse structure. The model, pre-trained on the ImageNet [33] dataset, is modified by replacing its last FC layer with two FC layers having 1024 hidden features and a final output of class probabilities, with the activation function placed between the added layers. The network’s output layer uses the LogSoftmax function, and the ‘Adam’ optimizer is used to train the network. The individually trained DenseNet and GoogLeNet are then fused to form an ensemble acting as a feature extractor. The X-ray images of COVID-19, Normal, and Pneumonia constitute dataset \({\mathcal {D}}_2\), which consists of images and their corresponding labels \(\{ X_{c}, Y_{c}\}\). The images are divided into train and test sets, with every image resized to \(224\times 224\). These images are then given as input to the trained GoogLeNet and DenseNet models, each extracting 1024 features, represented as \(GoogleNet_{1024}\) and \(DenseNet_{1024}\), respectively, giving us \(G(c_i)\) and \(D(c_i)\), which contain the features \({\mathcal {F}}_G\) and \({\mathcal {F}}_D\), respectively, as depicted in Eq. 2.

$$\begin{aligned} E(c_i) = D(c_i)\oplus G(c_i), i = 1,2,....X_c \end{aligned}$$
(2)

where \(E(c_i)\) denotes the total extracted output features, formed from \(D(c_i)\) and \(G(c_i)\), and \(\oplus \) denotes the concatenation operation. \({\mathcal {F}}_G\) and \({\mathcal {F}}_D\) are thus concatenated to form the features \({\mathcal {F}}_C\) with a dimension of 2048, which are used for training the variational autoencoder followed by ML-based classification. Before sending the feature vector from the aforementioned ensemble into the variational architecture, \({\mathcal {F}}_C\) is normalized to make it compatible with the autoencoder: the mean \(\mu _C\) and standard deviation \(\sigma _C\) are calculated and used to normalize \({\mathcal {F}}_C\), yielding the features \({\mathcal {F}}_N\) (see the sketch below).
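A sketch of the fusion step in Eq. 2 under the assumption that both backbones have been reduced to 1024-dimensional extractors in the same way (a hypothetical `googlenet_1024` analogous to `densenet_1024`); the function and variable names are ours:

```python
import torch


@torch.no_grad()
def extract_fused_features(images, densenet_1024, googlenet_1024):
    """Eq. 2: E(c_i) = D(c_i) concatenated with G(c_i), then normalized to F_N."""
    f_d = densenet_1024(images)           # F_D, shape (batch, 1024)
    f_g = googlenet_1024(images)          # F_G, shape (batch, 1024)
    f_c = torch.cat([f_d, f_g], dim=1)    # F_C, shape (batch, 2048)

    # Normalize with the mean and standard deviation of the fused features
    # so that F_N is compatible with the variational autoencoder.
    mu_c, sigma_c = f_c.mean(), f_c.std()
    return (f_c - mu_c) / (sigma_c + 1e-8)
```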

Algorithm 1

The concatenated features \({\mathcal {F}}_C\) have a very high dimension, which calls for dimensionality reduction. To learn better distributions and sparse features from \({\mathcal {F}}_C\), we integrate a variational autoencoder. The variational part of the pipeline consists of an encoder-decoder assembly which, rather than producing latent space features directly, provides the distribution of the latent features in the form of its mean and standard deviation:

$$\begin{aligned} \left\{ F_{\mu },F_{\sigma } \right\} \leftarrow VAE_{Encoder}\left\{ F_{N} \right\} \end{aligned}$$
(3)

The reparametrization of \(F_{\mu }\) and \(F_{\sigma }\) is then performed by scaling \(F_{\sigma }\) with a noise term \(\epsilon \), conventionally sampled from a standard normal distribution, and adding the result to \(F_{\mu }\) to give the features \(F_{reparam}\):

$$\begin{aligned} F_{reparam} = F_{\mu }+\epsilon \times F_{\sigma } \end{aligned}$$
(4)

The latent space encoder then yields an output feature space of 100 features. A specialized loss function, \( {\mathcal {L}}_{VAE}\), consists of two factors: one penalizes the reconstruction error, and the other encourages the learned distribution to be similar to our predefined distribution, which is assumed to be Gaussian. The loss function is the sum of the binary cross-entropy loss \({\mathcal {L}}_{BCE}\) and the KL divergence loss \({\mathcal {L}}_{KL}\), given by:

$$\begin{aligned} L_{KL}= -0.5 \times \Sigma (1 + \log (\sigma ^2) - \mu ^2 - \sigma ^2) \end{aligned}$$
(5)

where \(\mu \) and \(\sigma \) denote the mean and standard deviation. In the proposed model (shown in Fig. 3), \(F_{\mu }\) and \(F_{\sigma }\) are substituted for \(\mu \) and \(\sigma \) in Eq. 5 to obtain \(L_{KL}\). In a nutshell, a VAE with \(\mathcal {VAE}_{encoder}\) and \(\mathcal {VAE}_{decoder}\) regularizes the encoding distribution during training, yielding the mean and standard deviation that describe the latent space distribution. Therefore, to increase overall model robustness, \({\mathcal {F}}_N\) is passed through \(\mathcal {VAE}_{encoder}\) to extract \({\mathcal {F}}_{\mu }\) and \({\mathcal {F}}_{\sigma }\), which adequately represent the latent space distribution; a minimal sketch follows.
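A minimal VAE sketch consistent with Eqs. 3–5; the hidden width is our assumption, the latent size of 100 follows the text, and we additionally assume the normalized features are rescaled to [0, 1] so that the BCE reconstruction term is well-defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAE(nn.Module):
    def __init__(self, in_dim=2048, hidden=512, latent=100):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.fc_mu = nn.Linear(hidden, latent)       # F_mu
        self.fc_logvar = nn.Linear(hidden, latent)   # log(F_sigma^2)
        self.dec = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim), nn.Sigmoid(),  # outputs in [0, 1] for BCE
        )

    def reparameterize(self, mu, logvar):
        # Eq. 4: F_reparam = F_mu + eps * F_sigma, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar


def vae_loss(recon, x, mu, logvar):
    # L_VAE = L_BCE (reconstruction) + L_KL (Eq. 5); x assumed scaled to [0, 1].
    bce = F.binary_cross_entropy(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl
```

After training, only the encoder path (`enc` followed by `fc_mu`) is needed to produce \({\mathcal {F}}_{\mu }\) for the downstream classifiers, in line with the text.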

Machine learning based classification At the last stage, we employ ML predictive classifiers: \({\mathcal {F}}_{\mu }\) is used to train a stacked arrangement of three classifiers, namely Support Vector Machine (SVM) [34], Random Forest (RF) [35], and XGBoost [36]. The outputs of this ensemble are then combined by logistic regression (LR) to make the final prediction. The framework outputs \({\hat{Y}}\), the final label assigning the X-ray image to the COVID-19, Normal, or Pneumonia class. The proposed network architecture is described in Algorithm 1, and a hedged sketch of this final stage follows.
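A sketch of the stacked classifier stage using the scikit-learn and xgboost wrappers; the hyperparameters are illustrative assumptions rather than the authors’ settings:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Base learners are trained on F_mu; logistic regression combines their
# predicted probabilities into the final COVID-19/Normal/Pneumonia label.
stack = StackingClassifier(
    estimators=[
        ('svm', SVC(kernel='rbf', probability=True)),
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('xgb', XGBClassifier(n_estimators=200, eval_metric='mlogloss')),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',
)
# Usage: stack.fit(f_mu_train, y_train); y_hat = stack.predict(f_mu_test)
```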

Experiments and results

Experimental setup

Implementation details The efficacy of the network is rigorously investigated using Python 3.8 on a machine with an Intel\(\circledR \) Xeon(R) Gold 5120 CPU @ 2.20GHz \(\times 56\), 93.1 GiB of RAM, and an NVIDIA Quadro P5000 GPU with 16 GB of graphics memory, running Ubuntu 18.04.2 LTS.

Evaluation criteria

For evaluating network robustness, the confusion matrix and the area under the curve (AUC) [37, 38] of the ROC curves are estimated. They provide a detailed understanding of how well the model fits for the final classification, and the AUC indicates how well a classifier distinguishes among the various classes. The model’s performance is measured using the traditional metrics of Accuracy (Ac), Sensitivity (Sen), and Specificity (Spe), as in Eqs. 6, 7 and 8, respectively. The \(F_1\)-score, given in Eq. 9, measures the balance between precision and recall.

ROC-AUC The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), showing the performance of a classification model at all classification thresholds. The area under the ROC curve [37, 38] is an effective measure of the efficacy of ML classifiers.

$$\begin{aligned} Accuracy~(Ac)&= \frac{TP+TN}{TP + FP + FN + TN} \end{aligned}$$
(6)
$$\begin{aligned} Sensitivity~(Sen)&= \frac{TP}{TP + FN} \end{aligned}$$
(7)
$$\begin{aligned} Specificity~(Spe)&= \frac{TN}{TN + FP} \end{aligned}$$
(8)
$$\begin{aligned} F_1-score&= \frac{2*TP}{2*TP + FP + FN} \end{aligned}$$
(9)

TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively (a small per-class computation sketch is given below).
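For reference, the per-class metrics in Eqs. 6–9 can be computed in a one-vs-rest fashion from a confusion matrix; a minimal sketch with our own helper name:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def per_class_metrics(y_true, y_pred, positive_class):
    """Compute Ac, Sen, Spe and F1 with one class treated as positive."""
    y_t = (np.asarray(y_true) == positive_class).astype(int)
    y_p = (np.asarray(y_pred) == positive_class).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_t, y_p, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # Eq. 6
    sensitivity = tp / (tp + fn)                      # Eq. 7
    specificity = tn / (tn + fp)                      # Eq. 8
    f1_score = 2 * tp / (2 * tp + fp + fn)            # Eq. 9
    return accuracy, sensitivity, specificity, f1_score
```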

Fig. 4

Confusion matrix of the baseline results on DenseNet, ResNet and GoogLeNet

Result analysis

To make a fair comparative analysis and select the best CNN models for feature extraction, we first produced baseline results for the deep modules DenseNet, ResNet, and GoogleNet on the dataset \(~{\mathcal {D}}_1\). The baselines were obtained by training the models on the Normal and Pneumonia data, as shown in Fig. 4. The three chosen models are compared using the Area Under the Curve (AUC) of their Receiver Operating Characteristic (ROC) curves, as shown in Fig. 5. From this comparison, the two networks DenseNet and GoogLeNet are selected for further analysis. We observed that the baseline approach of the three deep learning modules did not perform satisfactorily at the training stage (the evaluated metrics are reported in Table 5), and that GoogleNet and DenseNet perform better than ResNet. Therefore, to tackle the issue of appropriate and efficient feature extraction, only DenseNet and GoogLeNet are taken forward.

Keeping the above findings in mind, we extend our experiments to the \(~{\mathcal {D}}_2\) dataset using the stacked architecture of DenseNet and GoogLeNet. The extracted features are then tested using an SVM, whose results, in the form of the confusion matrix and the AUC of the ROC curves, are depicted in Figs. 6 and 7, respectively. The test results show that some of the COVID-19 and Pneumonia samples are closely correlated; consequently, the predicted results in the confusion matrix contain mislabeled samples in both classes. We also observed that models operating on features of such high dimension are computationally expensive, and that the features are spatially correlated with some semantic discrepancies among them. Therefore, for dimensionality reduction and to enrich the process with robust features, a variational autoencoder with better generalization ability is used as the feature recognition segment.

Fig. 5

ROC curves of the baseline results on DenseNet, ResNet and GoogLeNet

Table 5 Baseline training-based performance on the GoogLeNet, ResNet and DenseNet
Fig. 6

Confusion matrix of the DenseNet and GoogLeNet feature extraction modules with SVM-based classification

Fig. 7

ROC curves of the DenseNet and GoogLeNet feature extraction modules with SVM-based classifiers

Table 6 Performance evaluation of ML-based classifiers on X-ray images for classification between COVID-19, Normal, and Pneumonia patients
Fig. 8

Confusion matrix for a XGB, b RF, c SVM, and d Ensemble of XGB, RF and SVM classifiers displaying the final classification among the three classes of COVID-19, Normal and Pneumonia

Variational autoencoder (VAE) To further improve accuracy with fewer dimensions, we need to extract more sparsity among the features, which helps generate a more meaningful latent space; a variational autoencoder (VAE) is used for this purpose. The concatenated features are passed through the VAE to extract a latent space of 100 features, and a non-linear dimensionality reduction technique, t-SNE [39], is then applied for visualization. t-SNE computes pairwise similarities between data instances in both the high- and low-dimensional spaces and optimizes the low-dimensional embedding to match them (a sketch of this visualization step is given below). The resulting latent space visualization is shown in Fig. 9. From the figure, it is inferred that the Normal cases are easily segregated, while the COVID-19 and Pneumonia cases overlap, which can be understood from the observation that pneumonia is present in the early stages of COVID-19.
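A sketch of the 2-D t-SNE projection used for this visualization; the perplexity, array names, and plotting details are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# latent_features: assumed (n_samples, 100) array of VAE latent means.
# labels: assumed integer class labels (0: COVID-19, 1: Normal, 2: Pneumonia).
labels = np.asarray(labels)
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latent_features)

for idx, name in enumerate(['COVID-19', 'Normal', 'Pneumonia']):
    mask = labels == idx
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=10, label=name)
plt.legend()
plt.show()
```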

Fig. 9

Visualization of latent space of VAE with 2 components using t-SNE

Quantitative analysis

We are now in a position to use the latent space distribution of the extracted features, which are further classified using ML-based predictive classifiers. Considering their proven effectiveness, we chose three basic classifiers, namely SVM, RF, and XGBoost, which classify the latent features into their respective classes. Finally, the probabilities predicted by these classifiers are used as meta-features for logistic regression, which assigns the final label to our input images.

Fig. 10

ROC curves for a DenseNet, b GoogLeNet, c SVM, d RF, e XGB and f Ensemble of XGB, RF and SVM, displaying the multi-class classification results on the NIH chest X-ray images

The classification results over dataset \({\mathcal {D}}_2\) for XGB, RF and SVM are visualized in the confusion matrices shown in Fig. 8a–c, respectively; Fig. 8d depicts the classification results for the ensemble of XGB, RF, and SVM classifiers with logistic regression as the final estimator. From these confusion matrices, various evaluation metrics were calculated for each classifier, as shown in Table 6. The achieved Ac using SVM, RF, and XGB is 0.911, 0.902, and 0.893, respectively, with SVM attaining the highest accuracy of the three. The maximum AUC among the three classifiers is achieved by RF with a value of 0.974. The combined ROC curves for the three ML-based classifiers and the final LR estimator are shown in Fig. 11; these curves help identify the most optimal of the four machine learning models. The ensemble of ML classifiers with LR as the final estimator is the most optimal and robust model, with a maximum area under the ROC curve of 0.976. The ensemble classifier gave the best results compared to the individual machine learning classifiers, with an Ac, Sen, Spe, F1-score, and AUC of 0.917, 0.916, 0.958, 0.917 and 0.976, respectively, thereby outperforming the other ML classifiers.

Classification of COVID-19 using CT images

In this study, we have also experimented with Computed Tomography (CT) images of COVID-19.

Dataset description The dataset comprises two categories, namely COVID-19 and Normal. The training and test sets consist of 498 and 200 images, respectively, each containing equal proportions of COVID-19 and Normal images [40]. The results of the experimentation are shown in Table 7. Due to the sparse availability of CT pneumonia images, the deep networks are trained on the CT image samples of COVID-19 and Normal cases only.

Table 7 Performance evaluation of ML-based classifiers for classification between COVID-19, Normal, Pneumonia using CT images

Model performance Table 7 presents the quantitative results of COVID-19 detection on CT images. It can be inferred that the proposed model performs well on the COVID-19 CT images. The final classification accuracy of the ensemble of ML classifiers is \(82\%\), which is satisfactory but lower than the results obtained on the X-ray COVID-19 images reported in Table 6. CT images are known to be more detailed than X-ray images, which aids radiologists for diagnostic purposes; nevertheless, during this pandemic crisis, CT images were considerably harder to obtain. Therefore, the readily available X-ray images are used, and they show better results with the proposed framework.

Table 8 Performance evaluation of ML-based classifiers for multi-class classification on \(({\mathcal {D}}_3)\) dataset
Table 9 Comparative analysis of the AUC for multi-class classification for the proposed model setting

Multi-class classification results on a challenging dataset

To further analyze the viability of our proposed framework, we apply it to the \({\mathcal {D}}_3\) dataset and report the multi-class classification results for Atelectasis, Effusion, Infiltration, Nodule, and Pneumonia image samples. Figure 10 shows the corresponding ROC curves for the different stages on this dataset: Fig. 10a and b show the ROC curves for the initial training of DenseNet and GoogLeNet on the five classes, while Fig. 10c–e show the corresponding final 5-label classification by the SVM, RF, and XGB classifiers, respectively.

The latent space extracted from the VAE is reduced to two components using t-SNE and visualized in Fig. 12. It can be seen that many data points are displaced from the clusters of their labeled classes, which in turn affects the performance of the chosen classifiers, as is clearly visible in Table 8. From the table, it can be observed that the quantitative performance varies greatly, with the proposed pipeline performing superior to the work of Xiaosong et al. [20]. A comparative study of the AUC for multi-class classification over four classes (leaving out Pneumonia) is given in Table 9, where the proposed method outperforms the state-of-the-art results.

Fig. 11

ROC for Machine Learning based classification among the three classes of COVID-19, Normal and Pneumonia

Fig. 12

Visualization of latent space of VAE with 2 components using t-SNE for Dataset \({\mathcal {D}}_3\)

Discussion

Recent developments in DL-based techniques using feature extraction and image processing for COVID-19 classification have opened new opportunities in the field of medical imaging. Automatic classification of COVID-19 among COVID-19, Normal, and Pneumonia image samples is a significant step towards clinical interpretation and treatment planning by CADx systems. The proposed model is divided into image feature extraction using DL models, feature enhancement using a VAE, and final classification using ML-based predictive classifiers. The input images are passed through the CNN models, DenseNet and GoogleNet, to generate features specific to each architecture, which are then concatenated into a single feature vector. This feature vector is fed to a VAE that learns a meaningful latent space capturing the patterns of the input features. The output of the VAE is then used as input to the ML-based predictive classifiers for classification. This multi-stage process enables the network to effectively identify image patterns and helps improve the accuracy and robustness of the model, as summarized in the sketch below.
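Putting the stages together, a hedged end-to-end inference sketch that reuses the hypothetical components introduced in the earlier sketches (`extract_fused_features`, the `VAE` module, and the stacked classifier `stack`):

```python
import torch


@torch.no_grad()
def predict(images, densenet_1024, googlenet_1024, vae, stack):
    """X-ray batch -> fused CNN features -> VAE latent mean -> stacked classifier."""
    f_n = extract_fused_features(images, densenet_1024, googlenet_1024)  # (batch, 2048)
    h = torch.relu(vae.enc(f_n))
    f_mu = vae.fc_mu(h)                            # latent mean, (batch, 100)
    return stack.predict(f_mu.cpu().numpy())       # COVID-19 / Normal / Pneumonia
```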

For comparison, we investigated the existing works tabulated in Table 10. Mohamed et al. [17] demonstrated the use of a GAN with deep transfer learning among COVID-19, Normal, and Pneumonia X-ray image samples, using the dataset created by Joseph et al. [18], and achieved an Ac, Sen, and F1-score of 0.8148, 0.8148, and 0.8146, respectively. Another work on the COVID-19 classification task was performed by Enzo et al. [19], utilizing the datasets [18, 20, 21] together with the Pneumonia images (Footnote 3), and achieved considerable performance. Yifan et al. [24] also introduced the use of CNNs for COVID-19 classification, with fine-tuned recall and precision values of 0.75 and 0.64, respectively. The results produced by the proposed model outperform these state-of-the-art (SOTA) techniques [17], with an enhancement of \(1\%\), \(1\%\) and \(0.7\%\) in Ac, Sen, and F1-score, respectively. Based on the proposed architecture, the method thus improves the overall classification performance.

Table 10 Performance comparison of the proposed classification scheme among COVID-19, Normal and Pneumonia with the state-of-the-art methodologies

Conclusion

This study proposes a novel framework to classify chest X-ray images of patients among the COVID-19, Normal, and Pneumonia classes. There is an urgent need for a COVID-19 classification technique that is cost-effective and practically accurate. This study uses state-of-the-art DL architectures for feature extraction and image processing. The input images are passed through the CNN models, DenseNet and GoogleNet, to generate individual features that are later concatenated. These concatenated feature vectors are sent to a variational autoencoder that learns a meaningful latent space from the features and passes the extracted features to ML-based classifiers, which perform the final predictions and help improve the accuracy and robustness of the model. The proposed study thereby achieves an overall accuracy and AUC of 0.91 and 0.97, respectively. Further, we tested the proposed framework’s scalability and efficacy on a challenging dataset, classifying between Atelectasis, Effusion, Infiltration, Nodule, and Pneumonia, to check the viability and versatility of the framework in related fields of medical image classification. The results on this dataset show significant improvement over state-of-the-art methodologies. Future work may extend this study to datasets with larger numbers of images and additional biological and physical parameters, which will help improve the results and the model’s viability in AI-based detection of COVID-19 and other lung-related diseases.