1 Introduction

The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, originated in Wuhan, China and has affected more than 100 million people worldwide with more than 2 million deaths during January-October 2020. Although the mortality rate has dropped, the pandemic is not over yet. The tests performed for the detection of COVID-19 are the rapid IgM-IgG combined antibody test [32] and the real-time Polymerase Chain Reaction (RT-PCR) test [48]. The RT-PCR test has several limitations: (1) a long time is required to obtain the test results; (2) it is a costly test that requires experts to perform it and analyze the results; and (3) it has a high false-negative rate (sensitivity of 71%) [55]. Although the rapid antibody test can produce results within 15 minutes by detecting the IgG and IgM antibodies simultaneously in human blood, the body may take several days to form these antibodies, so the virus can spread before it is detected. This leads to a very high false-negative rate. Hence, as an alternative, an automated diagnosis tool is required that is both sensitive and specific to COVID-19 and can produce fast predictions.

Fig. 1 Current worldwide statistics of COVID-19: (a) Total cases and deaths in the world (b) Daily new cases and deaths in the world [35]

At the time of writing, there are 108.3 million total COVID-19 cases and 2.38 million total deaths worldwide. The plots for the total cases and daily cases are shown in Fig. 1a, b, respectively. In India, there are more than 10.8 million cases in total and 155,000 deaths as shown in Fig. 2a, and the daily cases and mortality rates are shown in Fig. 2b (all data for the graphs have been collected from the publicly available data by Roser et al. [35]). Due to the acute shortage of RT-PCR test kits, especially in developing countries like India, population-wide screening is not possible, which has led to uncontrolled community spread of the virus. Also, the RT-PCR test is a tedious and time-consuming process, so the use of chest CT-scan images can be an appropriate and viable option for COVID-19 screening.

Fig. 2 Current statistics of COVID-19 in India: (a) Total cases and deaths (b) Daily new cases and deaths [35] (the graph has been formed from the data using Google Sheets)

Computed Tomography or CT-scan is a relatively common test [34] and can be performed more widely. It is much more sensitive (98%) than RT-PCR (71%), as established by Fang et al. [18]. Figure 3 shows two CT images, one of a COVID-19 infected patient and the other of a patient who tested negative. The most common finding from the chest CTs is “ground-glass opacities” scattered throughout the lungs. They represent tiny air sacs, or alveoli, filling with fluid; these appear as a shade of grey in the CT-scan and turn into a white consolidation in more severe cases, as marked by the red circle in Fig. 3a. The disease severity is proportionate to the lung findings, meaning that sicker individuals have more such opacities in one or both lobes of the lungs in chest CT-scans.

Fig. 3 Illustration of chest CT image findings of two patients: (a) COVID-19 positive and (b) COVID-19 negative. The COVID-19 infection’s characteristic “Ground Glass Opacity” has been marked with a red circle in the COVID-19 infected chest CT image

Also, to aid clinicians in COVID-19 screening, automation-based methods that are both reliable and fast need to be developed. Hence, researchers around the world have tried developing Computer-Aided Diagnosis tools for the detection of COVID-19 from chest X-rays or chest CT images. Chest CT images reveal more details than chest X-rays, and hence ET-NET considers chest CT images for the prediction of COVID-19 positive patients. Deep learning [30] is a powerful machine learning tool that uses structured or unstructured data for classification using a complex decision-making process.

The current image classification problem [13] is a supervised learning task. Supervised learning [4, 22] refers to the learning procedure where an algorithm is trained on a labelled dataset, meaning that the true classes of the samples are provided so that the model can tune its parameters based on its error on the training data. Transfer learning is a technique where a deep learning model trained for one task is reused for another, separate task. This method is particularly effective when only a small amount of data is available for the task at hand: the parameters learned on the previous task are loaded and then fine-tuned on the new data.

Ensemble learning allows the fusion of the salient features of multiple base learners, leading to more accurate predictions than the individual models. Such a learning scheme is robust since the variance in the prediction errors is reduced upon ensembling. An ensemble model aims to capture the complementary information from the base models and makes superior predictions. In the present study, the Bagging technique is used as a method to fuse the important aspects of all the transfer learning models considered here to form the ensemble. The Bagging technique is preferred over the Boosting algorithm in the present work, since the dataset available has only a small amount of data, and might lead to excessive overfitting using the Boosting technique. The bagging technique, on the other hand, reduces overfitting and hence is beneficial for the current problem. Thus, the ET-NET or Ensemble Transfer learning Network is proposed in this paper.

2 Literature survey

A vast amount of research is being conducted to help stop the COVID-19 pandemic [5]. However, the existing methods are time consuming and expensive, while also being less accurate. Yang et al. [54] showed that chest CT-scans can serve as an important complement for the diagnosis of COVID-19. They used respiratory samples including nasal and throat swabs, and bronchoalveolar lavage fluid (BALF) to draw comparisons. The accuracy in the detection of COVID-19 was only 88.9% for severe cases and 82.2% for mild cases using sputum samples. Nasal swabs and throat swabs gave even lower accuracies (73.3% for nasal swabs and 60.0% for throat swabs). Table 1 shows some of the recent methods proposed in the literature for the automated diagnosis of COVID-19 from either CT-scan or chest X-ray images.

Table 1 Some recent methods for COVID-19 detection

Several automated frameworks for the screening of COVID-19 infected patients have been proposed since the outbreak of the pandemic, a majority of which have used chest X-ray images [11, 50]. According to clinicians and doctors, CT-scans are more reliable and sensitive than radiograph (X-ray) images, and hence a better input for screening.

Deep learning has been widely used as a computer-aided detection tool for COVID-19 screening like in [31, 57]. Gozes et al. [23] utilized deep learning by fusing two subsystems, one being a 2D slice model and the other a 3D volumetric model for CT image classification. Li et al. [31] developed COVNet for extracting visual features from volumetric chest CT images. The COVNet they developed extracted both 2D local and 3D global features using a ResNet50 backbone and fused the features using a max-pooling layer and employing a final fully connected layer for generating the probability scores.

Zhang et al. [56] proposed a novel deep learning model for utilizing 3D chest CT volumes for the classification of infected patients and localization of swelling regions in the CT-scans. They used a pretrained U-net for segmentation of the 3D CT-scans and fed the 3D segmented chest areas into a deep neural network for forecasting the infection probability. The computation time for the detection of test images in their method is only 1.93 seconds per image. Abdel et al. [1] proposed a semi-supervised meta learning-based lung segmentation model for COVID-19 detection. Karbhari et al. [29] proposed a Generative Adversarial Network (GAN) framework to address the challenge of data scarcity for COVID-19 detection and used the generated data for training a classification model. Das et al. [14] proposed a bi-level classification model that uses pre-trained VGG-19 for feature extraction and then a shallow classifier for the final predictions. Sen et al. [42] and Chattopadhyay et al. [11] proposed deep features extraction and classification framework using meta-heuristics to reduce the feature set dimensionality. Garain et al. [21] developed a Spiking Neural Network-based model for the detection of COVID-19 from CT-scan images.

Most of the previous methods, as shown in Table 1, use a single model for the predictions; in contrast, we propose an ensemble scheme for the detection of COVID-19. Using the complementary information provided by the different base classifiers, based on their confidence scores, enhances the overall performance and robustness by reducing the variation in prediction errors. The ensemble method is a kind of fusion mechanism that uses the outputs or features from more than one model to compute the final prediction for the input [36, 39, 40]. It aims to enhance the performance of the framework beyond the reach of the individual models. Ensemble learning works better than the individual models because of the diversification of the information considered. When more than one model’s opinion is accounted for, less noisy predictions are produced. Hence, such a technique has been employed in the present work. A large variety of ensemble techniques [17, 25, 36, 52] have been proposed in the literature, two of the most popular being Bagging [10, 38] and Boosting [9].

2.1 Motivation and contributions

In light of the current pandemic situation, medical practitioners and healthcare professionals are working tirelessly, fighting the disease. However, the current gold standard method for COVID-19 screening, the RT-PCR test, is slow and tedious, and hence inadequate for population-wide screening, resulting in an uncontrolled number of infected individuals. Several researchers, therefore, are trying to develop systems for faster and more efficient screening of infected patients, which is the primary motivation behind the current paper. Ensemble learning allows the fusion of salient properties of the base classifiers, thus achieving an overall enhanced performance. Such models are robust since computing the ensemble decreases the spread (or dispersion) of the predictions of the base models. That is, the variance in the prediction errors is diminished and complementary information is captured. Figure 4 shows a diagram depicting the overall workflow of the proposed ET-NET model.

Fig. 4 Overall workflow of the proposed ET-NET ensemble classifier model for COVID-19 detection from chest CT-scan images

The contributions of this paper are as follows:

  1. An ensemble-based COVID-19 detection approach has been used that boosts the performance of the individual CNN classifiers: Inception v3 [47], ResNet34 [24] and DenseNet201 [27]. For this, a bagging ensemble technique has been used that averages the decision scores generated by each model for each class of the dataset.

  2. The proposed model, called ET-NET, has been evaluated on a publicly available dataset [45] using 5-fold cross-validation, outperforming the previous state-of-the-art method by 1.56%.

  3. Most of the previous works considered chest X-ray images, which are less sensitive than the lung CT images used in this work. To account for the limited availability of public data, Transfer Learning has been used to generate the decision scores. The ensembling technique helps capture complementary information, thus outperforming the individual models.

CT-scan images have been used, generally requiring no prior segmentation, for the classification of the chest CT-scans into two categories: COVID or Non-COVID.

The rest of the paper has been organized as follows: Sect. 3: Proposed Method, explains in detail the working of ET-NET in the current study; Sect. 4: Results and Discussion, highlights the results obtained by the ET-NET on a publicly available dataset, compares it to existing models and discusses the efficacy of ET-NET and Sect. 5: Conclusions, concludes the findings and contributions of this paper, and discusses the possibilities of future works on the proposed model.

3 Proposed method

Convolutional Neural Networks (CNNs) are preferred for image classification problems since an image is a 2D matrix of pixel intensities, and it helps to examine an image in parts. For example, a 300x300 image can be scanned 3x3 pixels at a time for tasks such as object detection, which is achieved by the convolution operation. The pooling operation [53] helps in dimensionality reduction. CNNs are shift-invariant [49, 58] and have fewer parameters than deep fully connected neural networks, and hence are computationally more efficient even when the network is very deep [19, 20].
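
To make the two operations concrete, the following is a minimal pure-Python sketch of a single-channel convolution (computed as cross-correlation, as is usual in CNN libraries) followed by max pooling. The toy image, the edge kernel and the function names are illustrative assumptions and are not taken from the ET-NET implementation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 5x5 "image" with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 3x3 vertical-edge kernel (hand-picked, not a learned filter).
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)  # 3x3 map, large where the edge lies
pooled = max_pool2d(feature_map, size=2)   # reduced to a 1x1 map
```

The same nine kernel weights are reused at every position of the image, which is why a convolutional layer needs far fewer parameters than a fully connected layer over the same input.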

In the proposed work, three models namely, Inception v3 [47], ResNet34 [24] and DenseNet201 [27] pretrained on ImageNet [15] have been used, which are then fine-tuned using the chest CT-scan dataset. The number of layers and parameters of each deep transfer learning model have been shown in Table 2.

Table 2 Number of layers and parameters in each network

3.1 Inception v3

The characteristic features of the Inception v3 model developed by Szegedy et al. [47] in 2016, are the three types of inception block, which have parallel convolutions. Such modules account for more efficient computation in the deep architecture, while also addressing the overfitting problem. The architecture of the Inception v3 CNN has been illustrated in Fig. 5a.

3.2 ResNet34

The salient feature of Residual Networks or ResNets, developed by He et al. [24] in 2016, is the skip connection, which adds the features from a previous layer directly to the output of the current layer, thereby preserving features from past layers that might be important. ResNet34 is one such network that is 34 layers deep (including one fully connected classification layer), the architecture of which is shown in Fig. 5b.

3.3 DenseNet201

In DenseNets, proposed by Huang et al. [27] in 2017, the input to each layer is the concatenation of the feature maps of all preceding layers. As a result, each layer can be kept compact (that is, with fewer channels), so these networks are efficient in terms of computation and memory while still providing a rich feature representation of the input images. The architecture of DenseNet201 is shown in Fig. 5c.
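
The dense connectivity pattern can be illustrated with a toy sketch in which every channel is reduced to a single number. The helper `toy_layer` and the parameter name `growth_rate` (the number of new channels each layer contributes) are assumptions made for illustration only; a real DenseNet layer applies normalization, activation and convolution to full feature maps.

```python
def toy_layer(inputs, growth_rate):
    """Hypothetical stand-in for a real layer: emits `growth_rate` channels,
    each set to the mean of all input channels."""
    mean = sum(inputs) / len(inputs)
    return [mean] * growth_rate


def dense_block(x, num_layers, growth_rate):
    """Run a toy dense block; returns the output channels and the input width of each layer."""
    features = list(x)      # one scalar per channel
    input_widths = []
    for _ in range(num_layers):
        input_widths.append(len(features))
        new_channels = toy_layer(features, growth_rate)
        features = features + new_channels   # channel-wise concatenation
    return features, input_widths


out, widths = dense_block([1.0, 2.0, 3.0, 4.0], num_layers=3, growth_rate=2)
# widths grows linearly with depth, while each layer adds only `growth_rate` channels.
```

Each layer sees everything produced before it, yet only has to produce a small number of new channels itself, which is the sense in which the layers are compact.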

Fig. 5 Architectures of the three CNN base classifiers: (a) Inception v3, (b) ResNet34, and (c) DenseNet201 used to form the proposed ensemble model called ET-NET

3.4 Loss function

A loss function is a measure of the performance of a deep learning model. The main objective of training a deep learning model is to minimize the error between the predicted and the original labels; the gradients of this error with respect to the model parameters are calculated during backward propagation [12] in a neural network.

In the current study, the cross-entropy loss function is used, which evaluates the performance of a classifier that outputs a matrix of probabilities (each probability value between 0 and 1). Since the present classification problem deals with only two classes, the loss function is called the Binary Cross-Entropy loss function. The cross-entropy loss function is chosen since it performs well for binary classification problems which have a large decision boundary [33]. This loss function also helps curb the vanishing gradient problem, since the logarithm cancels the exponential behaviour introduced by the sigmoid (or softmax) activation function. The logarithm thus avoids saturation of the gradients at extreme values, which is beneficial since large gradients are essential for making significant progress through the iterations.
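
The cancellation mentioned above can be seen in a short derivation. As an assumption for illustration, suppose the output unit is a sigmoid, \(\hat{y} = \sigma (z)\) with pre-activation \(z\), so that \(\partial \hat{y}/\partial z = \hat{y}(1-\hat{y})\), and let \(y \in \{0,1\}\) be the true label. For the binary cross-entropy \(L = -\left[ y\log \hat{y} + (1-y)\log (1-\hat{y})\right] \), the chain rule gives

$$\begin{aligned} \frac{\partial L}{\partial z} = \left( -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \right) \hat{y}(1-\hat{y}) = \hat{y} - y \end{aligned}$$

For comparison, a squared-error loss \(E = \frac{1}{2}(\hat{y}-y)^{2}\) with the same output unit gives

$$\begin{aligned} \frac{\partial E}{\partial z} = (\hat{y} - y)\,\hat{y}(1-\hat{y}) \end{aligned}$$

The extra factor \(\hat{y}(1-\hat{y})\) tends to zero when the sigmoid saturates, even if the prediction is wrong, whereas the cross-entropy gradient stays proportional to the prediction error.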

Suppose for an input x, the true label is y and the predicted label from the classifier is \(\hat{y}\), which is given by Eq. 1, where w is the weight matrix associated with the neural network and b is the bias matrix associated with it. f is the non-linear activation function associated with the layers in the neural network. For the present work, the activation function Rectified Linear Unit or ReLU [3] has been used.

$$\begin{aligned} \hat{y} = f(w^{T}.x+b) \end{aligned}$$
(1)

The ReLU activation function is given as in Eq. 2.

$$\begin{aligned} \mathrm{ReLU}(x) = \max (0,x) \end{aligned}$$
(2)

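Then the loss function L for a single sample is given by Eq. 3, where N denotes the number of classes in the problem (\(N=2\) for the present study), and \(y_{c}\) and \(\hat{y}_{c}\) denote the true label and the predicted probability for class c, respectively.

$$\begin{aligned} L(\hat{y},y) = -\sum _{c=1}^{N}y_{c}\log \hat{y}_{c} \end{aligned}$$
(3)

For m training samples, with the superscript (i) indexing the i-th sample, the cost function is the mean of the per-sample losses, as given by Eq. 4.

$$\begin{aligned} J(w,b) = \frac{1}{m}\sum _{i=1}^{m}L(\hat{y}^{(i)},y^{(i)}) \end{aligned}$$
(4)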

Using the cost function in Eq. 4, the weights and biases associated with the layers in the neural networks are updated.
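
As a small numerical illustration of the loss and cost computation, the sketch below evaluates the cross-entropy for one-hot labels and takes the cost as the mean of the per-sample losses, so that it is non-negative. The function names, the clipping constant `eps` and the two example samples are assumptions for illustration and are not values from the experiments.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-sample loss: -sum over classes of y * log(y_hat); y_true is one-hot."""
    # Probabilities are clipped at `eps` to avoid log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def cost(batch_true, batch_pred):
    """Mean of the per-sample losses over the m samples in the batch."""
    m = len(batch_true)
    return sum(cross_entropy(t, p) for t, p in zip(batch_true, batch_pred)) / m


# Two hypothetical samples; classes are ordered as [COVID, Non-COVID].
labels = [[1, 0], [0, 1]]
probs = [[0.9, 0.1], [0.2, 0.8]]

j = cost(labels, probs)  # small, because both predictions favour the true class
```

A confident wrong prediction, for example `[0.01, 0.99]` for a COVID sample, would contribute a much larger loss than either of the samples above.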

3.5 Ensemble

The ensemble approach adopted for the current study is the bootstrap aggregating or “Bagging” ensemble [8]. This ensemble technique was developed to make machine learning classification algorithms more stable and accurate. Bagging helps to reduce overfitting, in contrast to the Boosting ensemble technique [41], which can aggravate overfitting because each stage of the Boosting algorithm concentrates on the samples misclassified in the previous stage.

In the current study, the Bagging ensemble technique uses the same training set for training the three pretrained models (Inception v3, ResNet34 and DenseNet201) independently and then predicts the class probabilities of the samples in the test set by the fine-tuned models to calculate the average probability score, thus giving equal weightage to all the three classifiers.

Suppose m models (classifiers) numbered as \(1,2,\dots,m\) are used for a classification task of n classes, and the prediction probability scores are denoted by P. The prediction scores for a single image from model i can be expressed as a matrix as in Eq. 5.

$$\begin{aligned} P^{(i)} = \left[ p^{(i)}_1 p^{(i)}_2 ... p^{(i)}_n \right] \end{aligned}$$
(5)

So the final prediction score \(P^{ensemble}\) using the average probability ensemble technique is given by Eq. 6.

$$\begin{aligned} P^{ensemble}= & {} \frac{\sum _{i=1}^{m} P^{(i)}}{m}\\ \nonumber= & {} \left[ \frac{\sum _{i=1}^{m} p^{(i)}_1}{m} \frac{\sum _{i=1}^{m} p^{(i)}_2}{m} ... \frac{\sum _{i=1}^{m} p^{(i)}_n}{m} \right] \\ \nonumber= & {} \left[ p^{\prime }_1 p^{\prime }_2 ... p^{\prime }_n \right] \end{aligned}$$
(6)

Now, the class having the maximum probability out of the values \(p^{\prime }_1, p^{\prime }_2, ... , p^{\prime }_n\) is decided as the predicted class, which is then compared with the true labels to obtain the accuracy. In the current problem, there are 3 models and 2 categories to sort the images into, accounting for \(m=3\) and \(n=2\) in Eqs. 5 and 6.
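
The averaging and class selection described by Eqs. 5 and 6 can be sketched in a few lines. The scores below are invented for illustration (they are not outputs of Inception v3, ResNet34 or DenseNet201), and the function names are assumptions.

```python
def average_ensemble(score_rows):
    """Average the per-class probability vectors of m models for one image."""
    m = len(score_rows)
    n = len(score_rows[0])
    return [sum(row[k] for row in score_rows) / m for k in range(n)]


def predict_class(avg_scores):
    """Index of the class with the highest averaged probability."""
    return max(range(len(avg_scores)), key=lambda k: avg_scores[k])


# Hypothetical scores for one image from m = 3 models, classes [COVID, Non-COVID].
scores = [
    [0.45, 0.55],  # model 1 leans slightly towards Non-COVID
    [0.48, 0.52],  # model 2 leans slightly towards Non-COVID
    [0.95, 0.05],  # model 3 is confident in COVID
]

p_ensemble = average_ensemble(scores)
label = predict_class(p_ensemble)  # 0 = COVID, 1 = Non-COVID
```

In this example a simple majority vote would choose Non-COVID, while averaging the probabilities chooses COVID because it takes the confidence of each model into account.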

4 Results and discussion

In this section, we briefly describe the dataset used for the current study in Sect. 4.1 and the evaluation metrics used for comparing and validating ET-NET in Sect. 4.2. The implementation of the developed methodology and the results thus obtained are described in detail in Sect. 4.3, and the comparison with the existing literature and standard models is made in Sect. 4.5.

4.1 Dataset description

For evaluating the performance of the proposed methodology, the dataset used is publicly available on Kaggle and was developed by Soares et al. [45]. The dataset consists of a total of 2481 CT-scan images unevenly distributed into COVID and Non-COVID categories as shown in Table 3. For the proposed framework, 70% of the images (1736 scans) are used as training data and the remaining 30% (745 scans) are used as testing data.
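
The split sizes quoted above follow from rounding 70% of 2481 down to a whole number of scans. A minimal sketch of such a split is given below; the shuffling, the seed and the function name are assumptions, since the exact splitting procedure (for example, whether it was stratified by class) is not described here.

```python
import random


def train_test_split_indices(n_samples, train_percent=70, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # seed is an arbitrary choice
    n_train = n_samples * train_percent // 100    # integer arithmetic, rounds down
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(2481)
# len(train_idx) and len(test_idx) give the sizes of the two parts.
```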

Table 3 Class-wise distribution of images in the Kaggle dataset

4.2 Evaluation metrics

For evaluating the performance of ET-NET on the binary classification task at hand, metrics such as accuracy, precision, recall (or sensitivity), F1 score and specificity are used. To define these, the terms True Positive, True Negative, False Positive and False Negative first need to be defined.

In a binary classification problem, suppose the two classes are a positive class and a negative class. True Positive (TP) refers to a sample belonging to the positive class, being classified correctly. False Positive (FP) refers to a sample belonging to the negative class, but classified to be belonging to the positive class. Similarly, True Negative (TN) refers to a sample being classified correctly as belonging to the negative class. False Negative (FN) refers to a sample belonging to the positive class, but classified as being part of the negative class. Now the metrics can be defined as follows:

$$\begin{aligned} Accuracy= & {} \frac{TP+TN}{TP+FP+TN+FN}\end{aligned}$$
(7)
$$\begin{aligned} Precision= & {} \frac{TP}{TP+FP}\end{aligned}$$
(8)
$$\begin{aligned} Recall\, (or\, Sensitivity)= & {} \frac{TP}{TP+FN}\end{aligned}$$
(9)
$$\begin{aligned} F1 Score= & {} \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}}\end{aligned}$$
(10)
$$\begin{aligned} Specificity= & {} \frac{TN}{TN+FP} \end{aligned}$$
(11)
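
The five metrics in Eqs. 7 to 11 can be computed directly from the four confusion-matrix counts, as in the sketch below. The counts in the example are invented for illustration and are not results of ET-NET; the function assumes that none of the denominators is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 score and specificity from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # also called sensitivity
    f1 = 2 / (1 / precision + 1 / recall)      # harmonic mean of precision and recall
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


# Hypothetical counts for 100 positive and 100 negative test samples.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```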

4.3 Implementation

The CNN transfer learning models have been trained for 100 epochs, and the loss curves of the models have been shown in Fig. 6. The predictions of the models on the test set images have been saved. The hyperparameters used for training the three models are shown in Table 4.

Table 4 Hyperparameters used for training each model

The probability prediction matrices from the three classifiers have been averaged per sample to get the final prediction scores, from which the predicted results for all the images are obtained.

The class-wise metrics obtained have been shown in Table 5 and the net result has been shown in Table 6. The confusion matrix for the test set has been shown in Fig. 7 and the Receiver Operating Characteristics (ROC) curves of the individual models and the proposed ET-NET are shown in Fig. 8.

Fig. 6 Loss curves obtained using the base pre-trained classifiers after 100 epochs of re-training: (a) Inception v3 (b) ResNet34 and (c) DenseNet201

Table 5 Class-wise evaluation metrics generated by the base classifiers and the proposed ET-NET model on Fold-4 (best fold) of 5-fold cross-validation
Table 6 Evaluation metrics produced by the proposed ET-NET model on 5-fold cross-validation of the dataset
Fig. 7 Confusion matrices of the predictions produced by the proposed ET-NET model on 5-fold cross-validation of the dataset: (a) Fold-1 (b) Fold-2 (c) Fold-3 (d) Fold-4 and (e) Fold-5

Fig. 8 Receiver Operating Characteristics (ROC) curves obtained on (a) 5 folds of cross-validation (b) the test set of Fold-4 (best result) and base CNN classifiers

4.4 Error analysis

ET-NET performs very well for the current classification problem. Examples of correctly classified images from each class are shown in Fig. 9. In both images, a part of the lungs has not been captured properly by the CT-scan, so the images are imperfect. The contrast in Fig. 9a is also too high, while Fig. 9b is hazy. Even with these limitations, ET-NET classified both images correctly, which suggests that the model remains reliable under imperfect imaging conditions and that slightly noisy images do not affect its performance.

Fig. 9 Examples of test cases where the proposed ET-NET model performs correct classification although the images were noisy: (a) COVID case and (b) Non-COVID case

Figure 10 shows one misclassified image from each class of the dataset. Figure 10a belongs to the “COVID” class of the dataset but was classified by ET-NET as “Non-COVID”. The main reason is that the CT-scan depicts a mild COVID condition in which prominent ground-glass opacity has not yet developed in the lung alveoli, so ET-NET was unable to detect the infection at such an early stage. Figure 10b, on the other hand, belongs to the “Non-COVID” class of the dataset, but ET-NET predicted it to be a “COVID” case. One reason is that the quality of the lung CT-scan is poor, since the lung shape has visibly not been captured properly. The other reason might be that the CT-scan is very hazy, unlike the low level of noise present in Fig. 9b.

Fig. 10 Examples of test set images where the proposed ET-NET model fails to produce correct predictions: (a) COVID case and (b) Non-COVID case

4.5 Comparison with existing models

Several transfer learning models have been used for comparing the performance of the proposed approach, which has been shown in Table 7. Table 8 shows the comparison of ET-NET with some existing methods that use the same dataset.

Angelov and Soares [6] extracted features from non-pretrained GoogLeNet [46] and used a Multi-Layer Perceptron (MLP) for final classification. Panwar et al. [37] used the VGG19 [44] transfer learning model and added five more layers ahead and trained the network. Jaiswal et al. [28] used deep transfer learning technique with DenseNet201 [27] for feature extraction and classification.

Table 7 Comparison of ET-NET with some standard deep learning models
Table 8 Comparison of the proposed ET-NET with some existing models in literature on the Kaggle dataset

4.6 Statistical test

McNemar’s test [16] is performed to statistically compare the proposed ET-NET ensemble model with the base CNN classifiers used to form the ensemble and with other standard transfer learning classifiers. McNemar’s test is a non-parametric test for paired nominal data. Table 9 displays the results obtained from McNemar’s test on the Kaggle dataset. The null hypothesis is that the two models being compared perform similarly; the \(p\)-value is the probability of observing a disagreement at least as large as the one measured if this hypothesis were true, so a lower \(p\)-value is desired. To reject the null hypothesis, the \(p\)-value needs to be smaller than \(5\%\); that is, if \(p<0.05\), the two models under consideration can be considered significantly different.

Table 9 Results of the McNemar’s test performed between ET-NET and standard CNN models on the Kaggle dataset: Null hypothesis is rejected for all cases

In Table 9, it can be noted that for every model with which ET-NET is compared, \(p<0.05\), thus rejecting the null hypothesis. So, it can be said that the proposed ensemble model captures complementary information from the constituent base classifiers, thus producing superior results while making the ensemble model markedly dissimilar from the base classifiers.

5 Conclusions

The spread of COVID-19 has collapsed economies of the world and caused numerous deaths, and people are still suffering due to this pandemic situation. Although RT-PCR is used for the screening of COVID-19 patients, it is a tedious process with low sensitivity. ET-NET uses a more sensitive CT-scan based detection using Computer-Aided Diagnosis. Deep transfer learning and an average probability-based ensemble approach have been utilized for the binary classification task, obtaining results superior to existing CT-scan based screening models and achieving an accuracy of 97.73%, which is impressive for the small dataset used. Also, the sensitivity and specificity of the proposed ET-NET are better than those of RT-PCR, and hence it can be used as a reliable and robust COVID-19 detection mechanism. The proposed ET-NET model is also domain-independent, and can be extended to problems in gait detection [2], action recognition [7], etc.

The primary limitation of this method is that the limited availability of data makes it difficult to establish its robustness and generalization ability. Deep learning models essentially perform best with a very large database, but the dataset used in this study has only 2481 images, whereas the more efficient deep learning models need to be trained on larger datasets, depending on the complexity of the problem, for optimal performance. As a result, we had to use transfer learning models that were pretrained on ImageNet, consisting of 14 million images, and then fine-tuned using the chest CT-scan images from this study. Also, other pulmonary diseases like the Middle East respiratory syndrome (MERS) and Chronic obstructive pulmonary disease (COPD) are possible sources of bias for the present work, compared with RT-PCR and IgG-IgM antibody tests. It might also be important to perform segmentation to improve the Non-COVID control group design, which we intend to address in the future.

We aim to perform more experiments once more extensive datasets of chest CT-scans become available and to develop better models for classification. We shall try to use image enhancement techniques to address the limitations mentioned in Sect. 4.4. We may also try more pretrained models to form the ensemble and more sophisticated ensemble approaches like Dempster-Shafer theory, Choquet fuzzy integral or rank-based fusion.