1 Introduction

The coronavirus disease 2019 (COVID-19) is a respiratory disorder whose symptoms range in severity from those of the common cold to fatal pneumonia. COVID-19 is caused by a novel coronavirus known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). SARS-CoV-2 is highly contagious and has an incubation period of 1-14 days. Some carriers show no symptoms at all, while a significant proportion of patients develop minor symptoms such as dry cough, sore throat, headache, fatigue, and sputum production. However, the virus can be fatal if the immune system of the patient is weak [72]. The conditions seen in severe and critical patients include pneumonia, acute respiratory distress syndrome, pulmonary edema and multiple organ failure [15, 17]. In [19], it was stated that approximately 14% of COVID-19 patients experienced severe conditions such as dyspnea, while 5% of patients were in critical condition involving respiratory failure, septic shock, or multiple organ dysfunction. Early diagnosis of COVID-19 and the application of successful treatment are the key factors in reducing complications and mortality in patients with underlying medical conditions such as hypertension, diabetes, cardiovascular disease and asthma [19, 26, 32, 53]. Another important factor related to COVID-19 is the transmission mechanism of SARS-CoV-2. Its primary propagation mechanism has been identified as the spread of respiratory droplets through sneezing and coughing, which can cover a distance of up to 1.8 meters [13]. This highly contagious nature of SARS-CoV-2 puts any person who has had close contact with a patient at very high risk. Although symptomatic people have been identified as the primary source of SARS-CoV-2 transmission, asymptomatic people can also be a risk factor [13]. The higher risk of severe COVID-19 for patients with existing medical conditions or over the age of 60, together with the potential for fast propagation, creates a significant need for fast and accurate diagnosis tools.

As the most common test technique to diagnose COVID-19, real-time reverse transcription-polymerase chain reaction (RT-PCR) is used to detect the presence of viral RNA. In this method, a sample containing a mixture of mucus and saliva is taken with a nasopharyngeal swab and assessed for the presence of the virus. However, RT-PCR suffers from low sensitivity, especially in the early stage [27, 56], and it was reported in [93] that chest radiography performs very well in the early diagnosis of COVID-19. Therefore, complementing nucleic acid testing with chest radiography based diagnosis is believed to have promising potential for the early detection of COVID-19 [41]. Among chest radiography techniques, X-rays and computed tomography (CT) scans are the most commonly used imaging methods for diagnosing thoracic abnormalities. Although CT scans provide finer details of the 3D anatomy of the human body, X-rays are more convenient for differentiating between viral and non-viral pneumonia due to their fast imaging speed, significantly lower cost and lower radiation dose [77]. Furthermore, [39] described the most common manifestations and patterns of lung abnormality on portable chest radiography (CXR) in COVID-19 and argued that CXR will likely be the most commonly utilized method for diagnosis and follow-up of COVID-19 because of the infection control issues related to patient transport to CT suites, the problems experienced in CT room decontamination, and the lack of CT availability in parts of the world. In [8], an experimental CXR scoring system, tested on hospitalized patients with COVID-19 pneumonia, was presented to quantify and monitor the severity and progression of the disease. The authors found that the inter-observer agreement of the developed system was very good and that CXR based scoring is a promising tool for predicting mortality in hospitalized patients with SARS-CoV-2 infection. In light of the advantages of X-ray imaging over CT in the diagnosis and monitoring of COVID-19, we focus on developing an X-ray imaging based automated system with the ability to differentiate viral pneumonia (COVID-19) from non-viral pneumonia and normal controls (no findings).

Computer-aided diagnosis (CAD) has been successfully used as a supporting tool in the diagnostic workflow of radiologists since the 1980s [22]. CAD systems are mostly developed as complementary decision-making tools for physicians due to advantages such as reproducibility and the ability to detect subtle changes that cannot be observed by visual inspection. With respect to X-ray imaging based CAD systems for diagnosing thoracic diseases, recent advances in deep learning have led to breakthrough improvements in discriminating viral from non-viral pneumonia. In [43], a diagnostic tool based on a deep-learning framework for the diagnosis of pediatric pneumonia using chest X-ray images was proposed. In [71], the performance of customized convolutional neural networks (CNNs) in differentiating between bacterial and viral pneumonia in pediatric CXRs was presented. Additionally, various deep learning approaches were successfully employed to diagnose pneumonia and other pathologies in [5, 40, 95]. To detect COVID-19 samples in X-rays, a deep learning architecture employing depthwise convolutions with varying dilation rates, which incorporates local and global features extracted from diversified receptive fields, was presented in [60]. In [87], various deep learning models were utilized for feature extraction and the obtained feature sets were processed using the Social Mimic optimization method; the modified deep features were then given to support vector machines (SVMs) for COVID-19 detection. In [69], a concatenated neural network based on the Xception and ResNet50V2 networks was presented for classifying chest X-ray images into the three categories of normal, pneumonia, and COVID-19 on an unbalanced data-set. In [65], a patch-based CNN approach with a relatively small number of trainable parameters was given for COVID-19 diagnosis. In this method, random patches were cropped from the X-ray images and the final classification result was obtained by majority voting over the inference results at multiple patch locations. In [25], a comparative analysis of recent deep learning models, including VGG16, VGG19, DenseNet201, InceptionResNetV2, InceptionV3, ResNet50, and MobileNetV2, was presented for the detection and classification of COVID-19. An Auxiliary Classifier Generative Adversarial Network based model was employed in [90] to generate synthetic CXR images in order to avoid overfitting and increase the generalization capability of the employed CNNs. In [66], an end-to-end deep learning architecture, an enhanced version of the Darknet-19 model, was employed for multi-class classification (COVID vs. No-Findings vs. Pneumonia).

Although previous studies have shed some light on deep learning-based diagnosis using X-ray images and significant improvements have been obtained, none of the previous works offers a complete solution to the COVID-19 detection problem. Additionally, the COVID-19 outbreak is recent and the content of the public X-ray imaging databases is still growing. Due to this gradual increase in the number of COVID-19 images in public databases, a need has arisen for new algorithms that generalize well to new COVID-19 samples. In this study, we propose a deep features based ensemble learning model, which uses feature and decision level fusion, in order to satisfy the aforementioned needs in COVID-19 diagnosis.

The main contributions of this study are summarized as follows:

  • The proposed learning model was applied to three progressively created public COVID-19 databases in order to measure its generalization capability and reduce the bias that can occur in unbalanced databases.

  • The individual performance of seven powerful deep learning architectures, namely MobileNetV2, VGG16, ResNet50, ResNet101, NasNet, InceptionV3 and Xception, was presented.

  • The same seven deep learning models were employed as feature extractors and the obtained individual deep features were fed to non-linear kernel SVMs with the aim of COVID-19 detection.

  • The deep features extracted by the individual CNNs were concatenated to form a single feature vector (feature level fusion), which was subsequently given to the classifiers.

  • The decisions of the individual classifiers were combined by employing the majority voting scheme (decision level fusion).

  • The experimental results demonstrate the effectiveness and robustness of the proposed ensemble approach in epidemic screening, reaching high overall accuracy values accompanied by high COVID-19 F1-scores, precision and recall values.

The rest of the study is organized as follows: Section 2 introduces the materials and methods, Section 3 presents the experimental results, and Section 4 presents the discussion and conclusion.

2 Materials and methods

In this study, an ensemble of CNNs with decision and feature level fusion is proposed to solve the classification of X-ray images into the COVID-19, No-Findings and Pneumonia classes. Three public X-ray datasets were employed in the experiments to demonstrate the generalization capability of the proposed approach. In the ensemble, a transfer learning layout of seven deep convolutional neural network (CNN) models, initially pre-trained on the large ImageNet image repository, was utilized. The employed deep networks, whose individual classification performances are also given, were MobileNetV2, VGG16, ResNet50, ResNet101, NasNet, InceptionV3 and Xception. In addition, the same seven deep networks were employed as deep feature extractors; the obtained deep features were fused and the resultant concatenated feature vector was fed to non-linear kernel based SVMs to increase the discrimination performance.

2.1 Dataset information

In our study, three databases were constructed in a progressive way, from combinations of three publicly available data-sets, to measure the classification performance and generalization ability of the proposed approach. First, the data-set already used in [66] was employed as the baseline reference database and is named DB1. DB1 consists of 125 COVID-19 images, 500 pneumonia images and 500 normal (no-finding) images. The COVID-19 images of DB1 were taken from a public data-set that is constantly updated by researchers [21]. The remaining 1000 non-COVID X-ray images were taken from the public ChestX-ray8 dataset [92], finalizing DB1 with 1125 X-ray images. Second, 353 new COVID-19 samples, which had been appended to the source data-set by researchers after the publication of [66], were added to DB1 at the date of this study to enable comparison with other state-of-the-art findings. This new database, containing 1478 X-ray images in total, is named DB2. Last, 113 new COVID-19 samples obtained from a different domain were added to DB2 to create a more balanced data-set, more convenient for measuring the performance of the proposed method. These 113 COVID-19 samples were taken from [1], resulting in DB3, which contains 1591 X-ray images in total. In the experiments, a 5-fold cross-validation technique was applied to validate the results on each created dataset, as in [66], [60] and [91]. For each fold, the whole image set was divided into training and testing sets with ratios of 80% and 20%, respectively. In each repetition, a new model was trained using a randomly arranged 80% of the data-set, while testing was carried out on the remaining 20%. This cross-validation procedure is repeated 5 times so that each observation (sample image) is used for testing exactly once. A short summary of the constructed data-sets, with training and testing sizes for each fold, is given in Table 1.
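As a concrete illustration of this protocol, the following minimal sketch sets up a stratified 5-fold split in Python with scikit-learn; the array and file names are hypothetical, not part of the released material.

```python
# A minimal sketch of the 5-fold cross-validation protocol described above,
# assuming `images` and `labels` are NumPy arrays holding the X-ray samples
# and class indices (0: COVID-19, 1: No-Findings, 2: Pneumonia); the file
# names are hypothetical.
import numpy as np
from sklearn.model_selection import StratifiedKFold

images = np.load("db3_images.npy")
labels = np.load("db3_labels.npy")

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(images, labels), start=1):
    # 80% of the data trains a fresh model; the held-out 20% is tested, so
    # every sample is used for testing exactly once across the 5 folds.
    x_train, y_train = images[train_idx], labels[train_idx]
    x_test, y_test = images[test_idx], labels[test_idx]
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
```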

Table 1 The Image Distributions over Classes in Tested Datasets

2.2 Employed deep learning architectures

Traditional machine learning approaches, which consist of sequential sub-steps such as pre-processing, feature extraction, feature reduction/selection and classification, require domain specific expertise to reach satisfactory performance in medical image analysis. Spatial and frequency domain features are the most popular means of obtaining discriminating information from raw images. For example, the Scale-Invariant Feature Transform (SIFT) and Maximally Stable Extremal Regions (MSER) methods are used in the literature [37, 38] as spatial domain interest point extraction techniques, with interest point based features subsequently employed in traditional learning models. For frequency domain feature extractors such as the short-time Fourier Transform (STFT) and the wavelet transform (WT), the parameter selection procedure makes them hard to implement and dependent on user experience. Moreover, such traditional pipelines remain too error prone and inaccurate for a sensitive decision-making process. In contrast, even though the training times of deep learners are relatively long, they are implemented in end-to-end architectures that need little or no extra pre-processing and no manual tuning of feature extractor parameters. Therefore, in order to benefit from the aforementioned superiorities of deep learners, seven CNN models, namely MobileNetV2, VGG16, ResNet50, ResNet101, NasNet, InceptionV3 and Xception, were applied to three public databases with the aim of three-class (COVID, No-Findings, Pneumonia) discrimination of X-ray images in the proposed study.

2.2.1 MobileNetV2

Although higher accuracy can be achieved with deeper and larger networks, such networks are not efficient in terms of size and speed, making them inconvenient for mobile applications. However, the fast and accurate diagnosis of COVID-19 is critical in the current pandemic, making small mobile deep learning solutions preferable. MobileNetV2 [76], an improvement over MobileNetV1, can be a powerful and versatile solution for the mobile diagnosis of COVID-19 due to its proven performance in application areas including medieval writer identification [18], detecting underwater live crabs [12], real-time crowd counting [29] and remote wave gauging [11]. The main characteristic of MobileNetV2 is its use of depthwise separable convolutions, in which a standard 2D convolution is factorized into a depthwise convolution (one spatial filter per input channel) followed by a pointwise 1×1 convolution that mixes channels, instead of a single full-depth kernel bank. As a result, the training phase can be carried out with fewer parameters and less memory, yielding a small and efficient model, as the sketch below illustrates.
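The following minimal Keras sketch contrasts a standard convolution with this depthwise separable factorization; the layer sizes are illustrative, not the exact MobileNetV2 configuration.

```python
# A minimal Keras sketch contrasting a standard convolution with the
# depthwise separable factorization used by MobileNetV2; layer sizes are
# illustrative assumptions, not the exact MobileNetV2 configuration.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 224, 224, 32))  # dummy feature map with 32 channels

standard = layers.Conv2D(64, kernel_size=3, padding="same")
separable = tf.keras.Sequential([
    layers.DepthwiseConv2D(kernel_size=3, padding="same"),  # one 3x3 filter per channel
    layers.Conv2D(64, kernel_size=1, padding="same"),       # 1x1 pointwise conv mixes channels
])

standard(x), separable(x)                                   # build both layers
print(standard.count_params(), separable.count_params())    # e.g. 18496 vs 2432 parameters
```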

2.2.2 VGG16

VGG16 [82] is a large pre-trained CNN developed by the Visual Geometry Group (VGG) at the University of Oxford. VGG16 was the first runner-up in the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014. The VGG16 architecture uses simple 3×3 kernels in its convolutional layers and stacks them in sequence to emulate the effect of larger receptive fields. The implemented VGG16 architecture is composed of 13 convolutional layers followed by 3 fully connected layers. Despite its architectural simplicity, the memory usage and computational cost of VGG16 are dramatically high due to its very large number of parameters.

2.2.3 ResNet50 and ResNet101

The ResNet deep learning models [31], which introduced the concept of "skip connections", are a sub-class of CNNs. In ResNets, some of the convolutional layers are bypassed at a time (hence "skip connections") and batch normalization is applied along with non-linearities (ReLU) [67]. The skip connections enable much deeper networks to be trained and give the network the option to simply copy the activations from ResNet block to ResNet block, preserving information as data passes through the layers [59]. Two members of the ResNet family, ResNet50 and ResNet101, having 49 and 100 convolutional layers respectively, were employed in the proposed COVID-19 diagnosis approach as classifiers and deep feature extractors; a sketch of a residual block is given below.
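```python
# A minimal sketch of a residual block with a skip connection, written with
# the Keras functional API; the actual ResNet50/101 blocks use a three-layer
# bottleneck design, so this is a simplified illustration that assumes `x`
# already has `filters` channels.
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                     # skip connection: keep the input
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                  # bypass: add the input back in
    return layers.ReLU()(y)
```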

2.2.4 NasNet

As a relatively recent network, NASNet [100], whose CNN architecture was designed by another neural network, outperformed the previous state-of-the-art on the ILSVRC 2012 dataset. The NASNet architecture was created using the Neural Architecture Search (NAS) framework, which provides an algorithm for finding optimal neural network architectures [20]. In this algorithm, a controller recurrent neural network proposes architectures intended to perform at a specific level on a particular task, and learns by trial and error to propose better and better models [59].

2.2.5 InceptionV3

In InceptionV3 [84], inception modules, which are repeatedly stacked together to form a large network, are employed as an alternative to sequentially ordered convolution layers. In the inception modules, an asymmetric convolution structure is obtained by using multiple filters of various sizes, resulting in richer spatial features with increased diversity. The usage of inception modules not only significantly reduces the number of parameters, but also increases the recognition ability of the network through multi-scale features [99].

2.2.6 Xception

As an improved version of the inception architecture, Xception [16] uses depthwise separable convolutions, which enable more efficient use of model parameters. In Xception, the standard inception modules are replaced with depthwise separable convolutions (enhanced inception modules) that exploit the depth dimension (the number of channels) as well as the spatial information. These enhanced inception modules result in stronger features that include depth information.

2.3 Transfer learning

When medical images are analyzed using transfer learning, the weights of a deep network learned by training a CNN on a source dataset (for example ImageNet [74]) are transferred to a second CNN, which is then re-trained on labelled samples of the desired medical data set starting from the pre-learned weights. This final training phase is named "fine tuning": certain layers of the pre-trained net can be frozen (the weights of these layers stay fixed) while the remaining layers are fine-tuned to suit the classification problem, with the last fully connected layer replaced to match the new set of classes.
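A minimal sketch of this setup, assuming a Keras InceptionV3 backbone, is given below; the number of frozen layers (100) is an assumption, as the exact freezing depth is not reported.

```python
# A minimal sketch of the transfer-learning setup described above: load
# ImageNet weights, freeze the early layers, and fine-tune the rest together
# with a new 3-class softmax head; freezing the first 100 layers is an
# assumption for illustration.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3), pooling="avg")
for layer in base.layers[:100]:
    layer.trainable = False            # frozen: these weights stay fixed

outputs = layers.Dense(3, activation="softmax")(base.output)  # COVID / No-Findings / Pneumonia
model = models.Model(base.input, outputs)
```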

In our study, the employed CNNs were applied to the COVID-19 data-sets using the transfer learning strategy, in the light of literature findings. In [4], it was noted that the performance of knowledge transfer depends on the dissimilarity between the database on which a CNN is trained and the database to which the knowledge is transferred. The distance between the natural image databases employed for knowledge transfer and COVID-19 data-sets is considerable. However, recent studies show that knowledge transfer can still benefit medical data sets. For instance, in [6], a pre-trained CNN was employed as a feature extractor for chest pathology identification. In [89], pre-trained CNN based features showed improved performance when fused with traditional handcrafted features in a nodule detection system. In addition to their use as feature extractors, knowledge-transferred CNNs can also be employed as the main classification framework with fine-tuning. For example, in [85], it was shown that fine-tuned CNNs performed similarly to or better than CNNs trained from scratch. In that study, pre-trained weights from [51] were transferred using either a shallow tuning strategy, in which the weights of only a few layers were trained, or a deep tuning strategy, in which many layers were trained. The results highlighted that, in contrast to many other computer vision tasks, medical image analysis requires deep tuning of more layers. In another study, it was demonstrated that fine-tuning of pre-trained networks worked better than nets trained from scratch in the analysis of skin lesions [62]. Additionally, in [81], knowledge transfer from natural images was applied to thoraco-abdominal lymph node detection and interstitial lung disease classification, resulting in higher performance than training the CNNs from scratch. Similarly, in [14], a transfer learning strategy was employed to identify the fetal abdominal standard plane, and the approach revealed an improved capability to encode the complicated appearance of the abdominal plane. Due to the aforementioned superiorities of the fine-tuning strategy, seven CNNs, already trained on a natural image database (ImageNet), were fine-tuned in our study to extract deep features from the X-ray samples. These deep features were then employed in the classification of chest X-ray images with individual and ensemble learning models.

2.4 Decision and feature level fusion

In a pattern recognition system, the ultimate goal is to design the best possible classification model for a specific problem, such as COVID-19 detection from X-ray images. Traditionally, various classification models with different theories and methodologies are applied to a specific pattern recognition problem and the best model in terms of performance metrics is chosen. However, it has been observed that the sets of patterns misclassified by the various classifiers do not necessarily overlap, even when one of the models yields the best accuracy. Hence, different classifiers may be harnessed in an ensemble scheme to improve the overall performance by exploiting their possibly complementary information about the patterns to be classified [49]. This type of ensemble learning is called decision level fusion, in which the individual decisions of different models are combined to derive a consensus decision instead of relying on a single decision-making model. Hard-level combination uses the individual output of each classifier after the classifier output probabilities (estimates of the a posteriori probability of each class) are thresholded and mapped into class labels [63]. As a member of the hard-level combination family, the majority voting strategy simply counts the votes received from each classifier and selects the class with the largest number of votes as the consensus decision, as sketched below.
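```python
# A minimal sketch of hard-label majority voting; each row of `votes` holds
# the predicted class indices of three classifiers for one sample, with
# 3 classes in total (ties resolve to the lowest class index).
import numpy as np

def majority_vote(votes):
    # The class receiving the largest number of votes wins for each sample.
    return np.array([np.bincount(v, minlength=3).argmax() for v in votes])

votes = np.array([[0, 0, 2],   # two votes for class 0 -> consensus 0
                  [1, 2, 2]])  # two votes for class 2 -> consensus 2
print(majority_vote(votes))    # [0 2]
```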

As an additional fusion strategy, feature level fusion, in which the sets of features obtained by different feature extractors are combined, has high potential to achieve better classification performance [30, 55, 75, 88]. Feature level fusion generally consists of concatenating several normalized feature subsets into a single feature vector, forming a complete representation of the different views (here, the deep features obtained from the various CNNs). Regarding CNN based feature level fusion, even though the CNN models have different configurations (or architectures), fine-tuning them on the same target database (the COVID-19 database in our study) and concatenating the resulting feature vectors can provide complementary information [23, 70]. A sketch of this concatenation step follows.
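```python
# A minimal sketch of the feature-level fusion step, assuming
# `features_per_cnn` is a list of (n_samples, n_features_i) arrays holding
# the deep features extracted by the seven CNNs for the same samples.
import numpy as np
from sklearn.preprocessing import StandardScaler

def fuse_features(features_per_cnn):
    # Normalize each subset, then concatenate into one fused vector per sample
    # (in a full pipeline, each scaler should be fit on the training fold only).
    normalized = [StandardScaler().fit_transform(f) for f in features_per_cnn]
    return np.concatenate(normalized, axis=1)
```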

2.5 Proposed deep features based ensemble model

In this study, seven CNN models (MobileNetV2, VGG16, ResNet50, ResNet101, NasNet, InceptionV3, and Xception) form the main structure of the proposed framework. In the development of the proposed method, these seven CNN models were first employed as deep feature extractors, as depicted in Fig. 1. As seen in Fig. 1, the three databases were fed to the individual CNNs, already pre-trained on ImageNet [74], for network specific deep feature extraction under a 5-fold cross-validation scheme. The hyperparameters were chosen through a trial-and-error analysis of batch size, number of epochs, and learning rate. Accordingly, the number of training epochs was set to 50 and a batch size of 16 was employed. The learning rate, which controls the speed of convergence, was set to 0.0001, with Stochastic Gradient Descent with momentum used as the optimization technique.
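For concreteness, a minimal sketch of this training configuration follows, assuming the `model` from the fine-tuning sketch above and fold data (`x_train`, `y_train`, `x_test`, `y_test`) from the cross-validation sketch; the momentum value of 0.9 is an assumption, as it is not reported.

```python
# A minimal sketch of the reported training configuration: SGD with momentum,
# learning rate 0.0001, batch size 16, 50 epochs. The momentum value (0.9)
# is an assumption.
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=16, epochs=50,
          validation_data=(x_test, y_test))
```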

Fig. 1
figure 1

The deep feature extraction module

2.5.1 Learning scheme without feature level fusion

Following the deep feature extraction phase, the obtained deep features were fed to a softmax classifier, reproducing the end-to-end learning scheme of classical deep learning. The classical softmax layer of CNNs, a generalization of the logistic sigmoid function capable of mapping deep features onto probability values for discrimination problems with three or more classes, is referred to as the "softmax classifier" in our study. The softmax classifier [45, 58, 80] is employed to measure the discriminating power of the deep features obtained from the individual CNNs (equivalent to classical end-to-end learning with CNNs). The architecture followed to obtain individual CNN predictions is shown in Fig. 2, where the output of the softmax classifier is labelled "Individual Predictions" in the top-middle section.

Fig. 2
figure 2

The multistage learning approach and decision level fusion of individual classifiers. “Fusion 1” refers to the hard-level combination of the individual predictions obtained from RBF and Polynomial kernel based SVMs. “Fusion 2” refers to the hard-level combination of the individual predictions obtained from Softmax function and RBF kernel based SVM. “Fusion 3” refers to the hard-level combination of the individual predictions obtained from Softmax function and Polynomial kernel based SVM. “Fusion 4” refers to the hard-level combination of the individual predictions obtained from Softmax function, RBF and Polynomial kernel based SVMs

In [36] and [86], it was noted that CNNs, which are very good at learning invariant features, may perform worse than SVMs at classification. Conversely, SVMs are very successful at producing optimal decision surfaces from well-behaved feature vectors, while having difficulty representing the variances that occur in image features. In the chest X-ray images used in our study, the areas that characterize the lung consolidation pattern may be located in various parts of the lung and vary in size, resulting in significant variance. Therefore, in addition to individual CNN based learning, a multistage model was presented, in which the CNNs are employed to extract deep features capable of detecting and recognizing lung consolidation patterns, and non-linear SVMs are trained on the deep features learned by the CNNs; its performance was validated on the three databases. This multistage learning approach, which uses CNNs and SVMs in cascade, has been successfully employed in various areas to improve classification performance [28, 57, 83]. In this configuration, the fully-connected activations of each CNN were employed as features (as in Fig. 1) and the obtained deep feature vectors were fed to classifiers (the softmax classifier representing classical end-to-end CNN learning, and SVMs with RBF and polynomial kernels) in a 5-fold validation scheme, as sketched below. Additionally, with the aim of performance improvement, the individual predictions obtained from classical end-to-end CNN learning (deep features fed to the softmax classifier) and from the kernel based SVMs (deep features fed to the SVMs) were fused using the voting approach, following the combinations given in Table 2. The SVM based learning configuration that uses the deep features, together with the applied voting strategy, is presented in Fig. 2.
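```python
# A minimal sketch of the multistage scheme: the fine-tuned CNN's penultimate
# activations serve as deep features, on which RBF and polynomial kernel SVMs
# are trained. `model` and the fold data are assumed from the sketches above;
# the kernel hyperparameters shown are scikit-learn defaults, which the paper
# does not specify.
from sklearn.svm import SVC
from tensorflow.keras import models

extractor = models.Model(model.input, model.layers[-2].output)  # drop softmax head
train_feats = extractor.predict(x_train)
test_feats = extractor.predict(x_test)

svm_rbf = SVC(kernel="rbf").fit(train_feats, y_train)
svm_poly = SVC(kernel="poly", degree=3).fit(train_feats, y_train)
print(svm_rbf.score(test_feats, y_test), svm_poly.score(test_feats, y_test))
```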

Table 2 The detailed presentation of accuracy values obtained from applied individual and ensemble learning scenarios for three data-sets (average accuracy values of 5-folds are given)

2.5.2 Learning scheme with feature level fusion

In the feature level fusion phase, the deep features extracted by the individually employed CNNs were concatenated directly, without any weighting, into a single fused feature vector. Subsequently, the fused feature vector was fed separately to the softmax classifier and to the non-linear SVMs, and the individual predictions of the softmax classifier and the SVMs were obtained as depicted in Fig. 3. Furthermore, to benefit from the possible complementary behaviour of the learning models (softmax classifier, RBF and polynomial SVMs), the obtained individual decisions were fused by majority voting. Thanks to the combined power of feature and decision level fusion, this final approach gave the best performance and was chosen as our proposed method; a compact sketch of the pipeline is given below. The detailed flowchart of the proposed method, including the deep feature extraction module, feature level fusion, multistage learning and decision level fusion, can be seen in Fig. 3.
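```python
# A minimal sketch of the proposed pipeline (Fusion 4): fused deep features
# feed a softmax classifier and two kernel SVMs, whose hard labels are
# combined by majority voting. It reuses `fuse_features` and `majority_vote`
# from the sketches above; `train_feature_sets` and `test_feature_sets`
# (lists of per-CNN feature arrays) are assumptions, and multinomial
# logistic regression stands in for a softmax classifier trained on the
# fused features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

fused_train = fuse_features(train_feature_sets)
fused_test = fuse_features(test_feature_sets)

predictions = np.stack([
    LogisticRegression(max_iter=1000).fit(fused_train, y_train).predict(fused_test),
    SVC(kernel="rbf").fit(fused_train, y_train).predict(fused_test),
    SVC(kernel="poly", degree=3).fit(fused_train, y_train).predict(fused_test),
], axis=1)
consensus = majority_vote(predictions)   # final Fusion 4 decision per sample
```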

Fig. 3
figure 3

The Flowchart of the proposed method employing feature and decision level fusion

3 Experimental results

3.1 Performance of individual classifiers

The individual performance of the seven employed CNN models, plus the results for the concatenated feature vector, can be seen in the first column of Table 2 in terms of accuracy (each reported accuracy value is the mean over 5 folds, with the standard deviation over the folds also given for clarity). The highest accuracy values were obtained as 87.6% and 88.1% for DB1 and DB2 respectively using InceptionV3, while the best classification performance for DB3 was 87.4%, achieved by the Xception net. Conversely, among the poorest individual performances, MobileNetV2 had the worst accuracy for DB1 at 84.2%, while NasNet ended up with accuracies of 84.9% and 85.8% for DB2 and DB3 respectively. The second and third columns of Table 2 show the accuracy values obtained by the multistage learning scheme, which uses non-linear SVM kernels, for the individual deep feature sets and for the concatenated feature vector given in the bottom row group. As seen in column 2, the highest accuracy for DB1 was obtained with the RBF kernel at 87.6%, with no improvement over the softmax classifier. On the other hand, RBF kernel based SVM learning fed with InceptionV3 deep features slightly increased the best accuracy for DB2 to 88.2%, while ResNet50 reached 88.1% for DB3 with the RBF kernel based multistage approach. In addition, columns 4, 5, 6 and 7 report the accuracy values obtained with the decision level fusion strategy, composed of combinations of the softmax classifier and the radial basis function (RBF) and polynomial kernel based SVMs, as highlighted in Table 2.

3.2 Performance obtained by feature and decision level fusion

Regarding the effect of feature level fusion, the bottom row group of Table 2 (named "Feature Level Fusion") and Fig. 4 report the error values obtained on the three COVID-19 databases for the individual softmax classifier based learning models and for the concatenated feature vector. As seen in Fig. 4, the error values obtained with the deep feature vector formed by feature level fusion are significantly lower than the individual softmax classifier performances, reaching 9.8%, 10% and 9.6% for DB1, DB2 and DB3 respectively. As can be understood from Table 2 and Fig. 4, no single deep feature set (extracted by a specific CNN) outperformed the others on all three databases. This indicates a significant need for ensemble learning, which may pave the way to exploiting complementary information. It should also be noted that the error value for DB1 was further reduced by 0.5% when the RBF kernel based multistage learning algorithm was applied.

Fig. 4
figure 4

Classification errors of the individual learning and the feature level fusion schemes when the softmax classifier is employed

The contribution of decision level fusion can be examined using the right side of Table 2 and Fig. 5. In Fig. 5, the conventional classification performance of the softmax classifier (as used in traditional CNN based learning) was chosen as the reference baseline for the seven CNN based deep feature extraction schemes. For comparison, the increments or decrements in accuracy obtained by the multistage SVM based learning and the decision level fusion are shown for each deep feature set and for the concatenated feature vector (obtained by feature level fusion). Inspection of Table 2 shows that the highest accuracy values over all test set combinations were 90.8%, 90.5% and 90.7% for DB1, DB2 and DB3 respectively, obtained when the fourth decision level fusion approach, the majority voting of the hard labels produced by the softmax classifier and the RBF and polynomial SVMs, was employed. As illustrated in Fig. 5, for almost all multistage SVM based learning and decision level fusion cases applied to MobileNetV2 based deep features, an accuracy increase of up to 2.5% was achieved. On the contrary, nearly all accuracy values obtained by decision level fusion applied to Xception and NasNet based deep features were lower than the baseline softmax classifier performance. For the remaining VGG16, ResNet50, ResNet101 and InceptionV3 based deep features, neither a clearly positive nor a clearly negative effect of multistage learning and decision level fusion was observed. For instance, accuracy increases of up to 1% were seen for the InceptionV3 and ResNet50 based scenarios on DB3, while slight improvements were achieved by the VGG16 based scenarios for DB1. However, it should be noted that the proposed Fusion 4 approach provided accuracy increments of up to 0.6% on all data-sets for the feature level fusion case, as shown at the bottom right of Fig. 5.

Fig. 5
figure 5

Accuracy variations relative to the baseline CNN performance when multistage learning and/or majority voting is applied

3.3 Performance comparison by using confusion matrix based metrics

As alternative objective evaluation criteria, confusion matrix based metrics were calculated to show the performance of the proposed approach. To do this, the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values were obtained for each database. The confusion matrices obtained by the Fusion #4 strategy on the 3 databases are given in Fig. 6 for further insight. From the confusion matrices, 5 objective evaluation metrics were calculated as follows:

$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \tag{1} $$
$$ Precision = \frac{TP}{TP+FP} \tag{2} $$
$$ Recall = \frac{TP}{TP+FN} \tag{3} $$
$$ Specificity = \frac{TN}{TN+FP} \tag{4} $$
$$ F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{5} $$
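To make these definitions concrete, the following sketch computes Eqs. (1)-(5) per class from a 3×3 confusion matrix using one-vs-rest counts; it is an illustration of the formulas, not the authors' evaluation code.

```python
# Per-class metrics of Eqs. (1)-(5) from a 3x3 confusion matrix `cm`,
# where cm[i, j] counts samples of true class i predicted as class j.
import numpy as np

def per_class_metrics(cm):
    results = {}
    total = cm.sum()
    for c in range(cm.shape[0]):
        tp = cm[c, c]
        fp = cm[:, c].sum() - tp    # predicted as class c but belonging elsewhere
        fn = cm[c, :].sum() - tp    # class c samples predicted as something else
        tn = total - tp - fp - fn
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        results[c] = {
            "accuracy": (tp + tn) / total,                        # Eq. (1)
            "precision": precision,                               # Eq. (2)
            "recall": recall,                                     # Eq. (3)
            "specificity": tn / (tn + fp),                        # Eq. (4)
            "f1": 2 * precision * recall / (precision + recall),  # Eq. (5)
        }
    return results
```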
Fig. 6
figure 6

Confusion Matrices obtained from Fusion #4 strategy (Left DB1, Middle DB2, Right DB3)

Among these, accuracy indicates the ratio of correctly classified samples over the entire dataset. Precision expresses how precise the learning model is on the predicted positive samples, i.e., how many of the predicted positives are actual positives; it is the key parameter when the cost of FP predictions is high. Recall measures how many of the actual positive samples are captured by the model through positive (TP) labels; it is essential when there is a high cost associated with FN samples. The precision vs. recall behaviour of the COVID-19, pneumonia and no-finding classes, obtained from the majority voted decisions (Fusion #4) of the individual deep feature sets (each obtained by a specific CNN) and of the concatenated feature vector (obtained by feature level fusion), is given in Fig. 7. The precision and recall values obtained with the concatenated feature vector were higher than those of the individual deep feature sets in almost all cases. As presented in Table 2, the highest accuracy values for all three databases were obtained when the Fusion #4 strategy was applied. The precision and recall values for the Fusion #4 strategy are also reported in Table 3 for deeper investigation. In addition to precision and recall, the specificity metric is given in Table 3 to quantify the proportion of negatives correctly identified by the proposed approach. As seen in this table, in almost all classes and databases, the highest precision, recall and F1-scores were obtained for the COVID-19 class, which has the highest priority in our classification problem. Furthermore, the Kappa metric, a statistical measure of inter-annotator agreement for categorical items that compares the observed accuracy with the expected accuracy [61], is given for all databases in Table 3. As mentioned in [52], Kappa values greater than 0.80 indicate almost perfect classification. Hence, the obtained Kappa values (0.845, 0.857 and 0.86 for DB1, DB2 and DB3 respectively) show the success of the proposed Fusion #4 strategy for the COVID-19 diagnosis problem. As a final remark, all the F1-score values, which indicate both how precise the classifier is (what percentage of the samples assigned to a certain class is classified correctly) and how robust it is (what percentage of the samples belonging to a certain class is classified correctly), were quite high for the COVID-19 class, confirming the success of the proposed Fusion #4 strategy.

Fig. 7
figure 7

Obtained precision and recall values of Fusion #4 strategy for each individual CNN and concatenated feature vector

Table 3 The detailed presentation of evaluation metrics for the Fusion #4 strategy

4 Discussion and conclusion

Although RT-PCR is the most common technique for diagnosing COVID-19, chest radiography based approaches have been extensively used as complementary diagnosis tools due to the low sensitivity of RT-PCR, especially in the early stage of COVID-19. X-ray scanning has been preferred as the primary radiography based imaging approach in COVID-19 detection due to its fast imaging speed, low cost and low radiation dose compared to CT. However, the success of X-ray image interpretation strongly depends on the radiologist's experience, and visual inspection of the X-ray images of many patients takes significant time and effort. To increase the objectivity of X-ray image interpretation and decrease the required time and effort, CAD systems have been used as supporting decision mechanisms in the detection of COVID-19 cases. In this respect, several studies employing deep networks as the decision tool have been published recently, as summarized in Table 4. Some of the previous studies treated COVID-19 diagnosis as a binary classification problem. For instance, five pre-trained CNN based models, using binary classification in their last layer, were employed in [64] for COVID-19 X-ray image detection. In [33], deep learning models were introduced to confirm only positive or negative COVID-19 cases, another binary classification approach. Additionally, an approach based on building two models was studied in [10]: the first model detects whether a chest X-ray belongs to a healthy subject or to a generic pulmonary disease patient; in the second phase, the X-ray image is given to another model that detects whether the pulmonary disease is COVID-19. In contrast, in the proposed study we aim to design a COVID-19 detection framework built on a three-class learning model; therefore, studies using binary classification were not added to our comparison table. Regarding the number of employed COVID-19 X-ray samples, although [60], [65], [91], [2], [68] and [96] use enough X-ray samples to train a learning model, the ratio of COVID-19 samples is very low compared to the remaining classes. Most learning models assume balanced class distributions or equal misclassification costs, and their performance can be significantly compromised on imbalanced data sets such as the COVID-19 vs. non-COVID-19 distributions employed in [2, 60, 65, 68, 91, 96]. Therefore, in our study, the employed databases were progressively created, starting from the samples given in [66] as DB1, until minimal imbalance between the classes was achieved in DB3. In addition to the scores obtained in [66], the CNN model studied in [48] was also tested on the same DB1 data-set, and the results of that study were compared with the performance of our proposed approach. As seen in Table 4, our method outperformed [66], [60], [48] and [96] in terms of the accuracy, precision and recall metrics, while providing competitive accuracy compared to [65], [91] and [2]. It should be noted that our approach applied to DB1 uses the same number of image samples and the same cross-validation strategy as [66], while a similar 5-fold cross-validation with a different number of X-ray samples was carried out in [60] and [96]. Furthermore, the precision values obtained by our method were significantly higher than those of [66], [60], [65], [68] and [96], while higher recall was achieved compared to [91] and [2].

Table 4 Performance comparison of related works on COVID-19 detection problem with the proposed method

An approach similar to our study, using deep features obtained from various CNNs and an SVM based classification strategy, was given in [79]. However, in [79] the deep features were fed to SVMs individually, whereas in our proposed study the fused deep features are fed to non-linear SVMs. Additionally, a voting based decision level fusion strategy is also tested on the X-ray data-sets in our approach. As a contribution of this combined feature and decision level fusion, higher precision and recall values than [79] were obtained when the proposed approach was applied to DB1, which has a similar number of COVID-19 image samples to [79].

As seen in Table 2 and Fig. 4, none of the individual learning models significantly outperformed the others. However, accuracy improvements of up to 2.5% were achieved when feature level fusion was applied to the obtained deep features. Examining the multistage learning and decision level fusion approaches, accuracy rises of up to 2% and 0.5% were obtained for the deep features extracted by MobileNetV2 and VGG16 respectively. The supportive effect of SVM usage and majority voting for these two CNNs can be related to their sizes, which make them prone to underfitting and overfitting respectively. As mentioned in [97], small networks such as MobileNetV2 usually suffer from underfitting, while very large models such as VGG16 may have trouble with overfitting [35]. A learner such as the SVM, which is good at producing optimal decision surfaces even when the data are noisy, can then have a positive effect on the classification accuracy, as in our case. On the contrary, the same multistage learning and majority voting strategy did not work well for the deep features obtained by Xception and NasNet, resulting in accuracy reductions. When the architecture of NasNet is examined, it is seen that NasNet was constructed by a neural architecture search based optimization carried out with reinforcement learning. As a result of this process, well-designed scalable convolutional cells are defined in an optimal way, yielding an architecture that tends to produce robust features, as in our case. Similarly, in Xception, the usage of depthwise separable convolutions paves the way for the efficient usage of model parameters, producing stronger features. Hence, cascading SVMs onto the last FC layer of Xception and NasNet and applying majority voting had no supportive effect on classification; the network based discrimination is already sufficient for these two CNNs.

Another important aspect of our proposed system with the Fusion #4 strategy that needs to be discussed is the obtained high precision and recall values. The precision value is directly related to the number of FP samples, and low precision for COVID-19 means a high number of healthy subjects misdiagnosed as COVID-19. Early quarantine of COVID-19 patients is employed as the fundamental disease control strategy across countries [73]. Apart from physical harm, quarantine may cause dramatic psychological effects on mental health. Previous studies reported that the psychological impact of quarantine can vary from immediate effects such as irritability, fear of spreading infection to family members, confusion, anger, loneliness, anxiety, frustration, denial, insomnia, despair and depression, to extreme consequences including suicide [7, 9, 24]. Therefore, frequent FP samples in a COVID-19 detection system may cause significant undesired psychological and social consequences. As seen in Table 3, the proposed system achieved precision values for the COVID-19 class of 100%, 98.29% and 97.13% for DB1, DB2 and DB3 respectively, showing its almost perfect FP reduction performance. The recall metric, which is directly connected to FN samples, is also essential in COVID-19 detection because of the high cost associated with FN samples. Misdiagnosing a COVID-19 patient may have dramatic consequences due to the very easy and fast transmission of SARS-CoV-2: a subject misdiagnosed as normal can spread the disease to his/her close environment in a very short time, creating new patients who are ready to spread the disease further. With our proposed approach, high recall values of up to 97.6%, 96.65% and 97.29% on DB1, DB2 and DB3 respectively were obtained using the Fusion #4 strategy.

Although deep learning approaches have enabled unprecedented breakthroughs in medical image analysis, in CNNs the interpretable modules are sacrificed for uninterpretable ones that achieve higher performance through greater abstraction (more layers) and tighter integration (end-to-end training) [78]. However, in [98], the Class Activation Mapping (CAM) technique, a way of producing visual explanations of the predictions of deep learning models [3], was proposed to make CNNs more transparent and explainable. Using the CAM technique, useful knowledge about the regions on which the predictions rely in the COVID-19 detection problem can be obtained. For example, the failure regions can be visually identified for wrongly classified samples, and the learning models can then be modified in the most fruitful research directions. Moreover, for a deep model that is very strong in diagnosis, the CAM technique can visually identify the lung consolidation patterns as a supportive diagnostic tool for doctors; a gradient based sketch of this idea is given below. In Fig. 8, two CAMs obtained from COVID-19 samples are given for visual validation of the employed CNNs. In the CAMs, the red color highlights the lung regions on which the employed CNN model focuses most (activating around those patterns) during discrimination. In the upper row of Fig. 8, the CAMs obtained with six CNNs (VGG16 excluded, since its CAM could not be represented with the employed approach) are shown for an 83-year-old male with mitral insufficiency, pulmonary hypertension and atrial fibrillation in addition to COVID-19 infection. In this patient, ground-glass opacification (GGO) and consolidation in the right upper lobe and left lower lobe are seen as indicators of COVID-19. InceptionV3 and ResNet50 correctly localized the right upper lobe pattern while missing the left lower lobe; however, Xception successfully detected both pathological regions with high spatial resolution. In the bottom row, the CAMs of a 53-year-old female, whose X-ray contains multifocal patchy opacities in both lungs, are depicted. Because there are three separate opacity patterns, this case is a good example of the effect of feature level fusion of different CNNs. While MobileNet focused strongly on the single right-side pattern, InceptionV3, ResNet50 and Xception showed low activation on the right side. Conversely, ResNet101 and InceptionV3 focused strongly on the upper left-side pattern, while Xception and MobileNet showed significant activity near the lower left-side pattern. Considering the complementary effect of these CAMs, it is clear that the fusion of features obtained by these CNNs has higher discriminating power. In the middle part of Fig. 8, a flowchart explaining how the features obtained from the various CNNs are concatenated is given for further clarity. Additionally, Fig. 9 shows X-ray images belonging to the same patient with bilateral GGO: the image in the upper row was taken on the second day of diagnosis, while the bottom row X-ray was taken on the fourth day. As can be seen, the active regions of each CNN are consistent and do not change dramatically between the second and fourth day images.
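The sketch below illustrates how such maps can be produced for a Keras model in a Grad-CAM style; the gradient based variant and the layer name argument are assumptions for illustration, since the paper follows the CAM technique of [98].

```python
# A Grad-CAM style sketch producing a class activation map from a Keras
# model; the last convolutional layer name must be supplied by the caller.
import numpy as np
import tensorflow as tf
from tensorflow.keras import models

def grad_cam(model, image, last_conv_name, class_idx):
    grad_model = models.Model(
        model.input, [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis])
        score = preds[:, class_idx]                  # score of the class of interest
    grads = tape.gradient(score, conv_out)           # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))     # pool gradients per channel
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    return tf.nn.relu(cam)[0].numpy()                # regions the CNN focuses on
```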

Fig. 8
figure 8

CAM visualizations of two patients obtained by six CNNs (top and bottom rows) and the flow-chart of employed feature level fusion (middle row)

Fig. 9
figure 9

CAM visualizations of the same patient on the second and fourth day of diagnosis

In future research, we aim to focus on the following research paths related to COVID-19 for further improvement. i) A different version of feature level fusion, in which the features obtained from various layers of the same CNN are concatenated, can be employed instead of fusing features from the last FC layer of different CNNs. In this way, diverse features, with more semantic information in the top layers and more low-level information in the bottom layers, can be combined to provide more discriminative information. ii) The concatenation of feature subsets obtained from various deep nets may cause two drawbacks for the subsequent learning models. The first is increased complexity [46]; the second is difficulty in pattern identification due to the curse of dimensionality, i.e., having more features than observations. A feature selection approach such as ReliefF [50] can be applied to the concatenated feature set to reduce the learning algorithm's complexity and prevent a possible overfitting scenario [47]. iii) Since the outbreak is recent, the number of COVID-19 X-ray images available for CAD system design studies is very limited. Even though there exist studies that use Generative Adversarial Networks (GANs) [90] and attention guided augmentations [54] to increase the number of training samples, the performance can be improved by using a Progressive Growing GAN [42] for augmentation. Moreover, the quality of artificial COVID-19 samples can be improved by integrating more labeled data into the GAN learning process. iv) Canonical Correlation Analysis (CCA) [34, 44], which measures linear relationships between two sets of variables using the within-set and between-set sample covariance matrices, can be employed as a feature fusion approach instead of the simple concatenation of deep features (see the sketch below). By utilizing multi-view features (deep features extracted from different CNNs and/or from different layers of the same CNN), more discriminating features with maximized correlation between the views can be attained, with the hope of a performance increase in COVID-19 detection. v) The hyperparameters, which are adjusted prior to the learning process and affect how the learning algorithm fits the model to the data, can be tuned with automatic tuning algorithms such as Bayesian optimization [94]. In this way, optimal hyperparameters for the COVID-19 detection problem can be found for both the CNNs and the SVMs to obtain higher performance.
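As a minimal sketch of the CCA fusion idea in item iv, assuming `feats_a` and `feats_b` are deep feature arrays from two different CNNs for the same samples, scikit-learn's CCA can replace simple concatenation; the number of components (64) is an illustrative assumption.

```python
# CCA-based fusion of two deep feature views; `feats_a` and `feats_b` are
# assumed (n_samples, n_features) arrays extracted from two CNNs.
import numpy as np
from sklearn.cross_decomposition import CCA

cca = CCA(n_components=64)
proj_a, proj_b = cca.fit_transform(feats_a, feats_b)  # maximally correlated projections
fused = np.concatenate([proj_a, proj_b], axis=1)      # correlated multi-view representation
```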