1 Introduction

COVID-19 first appeared in Wuhan, China, in December 2019. It is now considered a worldwide pandemic (Roosa et al. 2020; Yan et al. 2020). It may severely affect the human respiratory system. While COVID-19 causes mild symptoms in about 82% of the cases, the remaining cases suffer from severe symptoms that can be fatal, and some may need ventilators (National Geographic n.d.). Common signs of COVID-19 infection include respiratory symptoms, fever, shortness of breath, and breathing difficulties. In severe cases, COVID-19 may cause pneumonia, severe acute respiratory syndrome, and kidney failure leading to death (Stoecklin et al. 2020). Close contact with people who show respiratory symptoms, such as coughing and sneezing, should be avoided. In many countries, the rising demand for intensive care units has overburdened the health system, which may be on the verge of collapse. Hence, an automatic diagnosis system is required for COVID-19 detection.

COVID-19 can be diagnosed with several methods, such as reverse transcriptase-polymerase chain reaction (RT-PCR), blood testing, and medical image analysis (Huang et al. 2020). Although RT-PCR testing is very specific, it is a time-consuming, difficult, and complicated manual technique. Hence, different modalities of medical imaging can be used for the task of COVID-19 detection. Although CT scanning is the most accurate and effective tool for COVID-19 detection, X-ray imaging remains the most practical tool, because it is cheap and fast. COVID-19 infection appears in X-ray images as regions with a ground-glass appearance. Hence, there is a need for accurate inspection of X-ray images to detect COVID-19 cases.

Dependence on human operators at this high rate of infection may be infeasible due to the limited number of trained specialists and the need to enforce safety precautions. That is why artificial intelligence (AI) plays a major role in this task. Both machine learning (ML) and deep learning (DL) tools are used in the diagnosis process (Tahir et al. 2020; Chowdhury et al. 2019a, b; Kallianos et al. 2019).

Several attempts have been presented in the literature for COVID-19 detection from X-ray images. Convolutional neural networks (CNNs) have been applied successfully to this task. The reason is that multiple convolutional layers produce different feature maps through a variety of convolutional kernels, and the most effective features can then be kept through a pooling strategy (Zhang et al. 2018; Sun et al. 2019; Gheisari et al. 2017).

The concept of transfer learning (TL) has been investigated deeply in this area. The rationale behind this trend is the small size of the available dataset and the large cost of the training process. The idea of TL depends on the utilization of a pre-trained network and the application of fine tuning for the task of interest. TL has led to good classification results in different applications (Emara et al. 2021a, b). Hence, it is recommended in this paper.

This research topic is important for speeding up the COVID-19 diagnosis process. We begin this study by investigating traditional ML algorithms and a CNN model built from scratch for the classification of X-ray images acquired from suspected COVID-19 patients. Our numerical results reveal the low accuracy of traditional ML algorithms that depend on manual feature extraction. In addition, building a CNN from scratch imposes a large training and optimization burden and still does not reach the required performance level. Hence, our main contribution in this paper is to make use of the TL strategy in the classification task. In this strategy, well-trained deep convolutional neural networks (DCNNs) are tuned to the task of interest. Another contribution of this paper is to perform feature extraction with large DCNNs such as ResNet50, ResNet101, Inception-v3 and InceptionResnet-v2, while the classification task is performed with random forest (RF) and Gaussian process (GP) classifiers. These classifiers are reported in previous studies as superior classifiers, and this combination gives good classification performance. The main contributions of this work are summarized as follows:

  • Developing different ML and DL algorithms for exploring the issue of COVID-19 detection from X-ray images.

  • Summarizing the most recent related work concerned with classification challenges of medical images.

  • Achieving higher classification and detection accuracies for identifying COVID-19 cases.

  • Presenting a detailed comparative study between the proposed work and the most recent related studies.

2 Related works

In the literature, researchers have developed several algorithms for COVID-19 detection. A large amount of data is required to train deep learning models, but at the beginning of the pandemic the available data was limited. TL has been introduced to solve this problem.

Pham (2021) introduced a TL-based algorithm for COVID-19 detection. Its results revealed the importance of fine tuning. This can save time and cost by avoiding the development of more complicated models that produce the same or better results. Narin et al. (2020) proposed a TL-based algorithm for COVID-19 detection based on an inception model. This algorithm has been evaluated on 1065 CT images. An accuracy of 79.3% has been reported. Saiz and Barandiaran (2020) utilized the VGG16 pre-trained model for the detection process. Their algorithm was evaluated on 1500 X-ray images. An accuracy of 94.92% has been obtained. In Wang and Wong (2021), ResNet50 model was utilized for COVID-19 detection. Their algorithm was evaluated on 100 X-ray images. An accuracy of 98% has been achieved. Erdem and Aydın (2021) introduced a comparison between the pre-trained models, namely Inception-v3, MobileNet, SqueezeNet, Xception, and VGG16, to get the best performance. Their results revealed that Inception-v3 model has the highest accuracy that reaches 90%. Jain et al. (2021) utilized Xception, Inception-v3, and ResNet pre-trained models for COVID-19 diagnosis. These models were evaluated on 6432 X-ray images. Their results revealed that the Xception model presents the highest accuracy that reaches 97.97%.

Image pre-processing techniques have a vital role in the enhancement of the classification process. El-Shafai et al. (2021a) introduced an algorithm for segmentation and classification of COVID-19 images based on DL. Firstly, the classification process is employed to differentiate between COVID-19 and pneumonia images with the CNN model. Then, the segmentation process is applied on the COVID-19 and pneumonia images. Finally, the obtained segmented images are used to determine the infected regions in COVID-19 and pneumonia images. El-Shafai et al. (2021b) introduced an automatic algorithm for image enhancement and classification. To get high-resolution versions of X-ray and CT images, their paper presented a hybrid SIGTra model. A generative adversarial network (GAN) has been used for the image super-resolution reconstruction purpose. In addition, for image classification, TL with CNN (TCNN) has been used. An accuracy of 99% for X-ray image classification has been achieved. Canayaz (2021) utilized meta-heuristic-based feature selection for COVID-19 detection. Firstly, an image contrast enhancement algorithm is used for pre-processing. Then, the features are extracted using different pre-trained models. Feature selection is implemented depending on metaheuristic algorithms. Finally, the obtained features are classified using a support vector machine (SVM). An accuracy of 99.83% was obtained.

A combination of different CNN models was presented to get higher detection accuracy. Xu et al. (2020) utilized a location-attention network and a ResNet18 model for COVID-19 detection. Their algorithm has been evaluated on 618 CT images for COVID-19, viral pneumonia, and normal cases. Their algorithm reported an accuracy of 86.7%. Karar et al. (2021) introduced cascaded DL models to increase the efficiency of COVID-19 detection. Eleven pre-trained models were exploited and compared for classification purposes. According to their results, the VGG16, ResNet50, and DenseNet169 models achieve the best detection accuracy. Emara et al. (2021b) used CNN models with various learning procedures for COVID-19 diagnosis. Firstly, a CNN-based TL algorithm was used to automatically diagnose COVID-19 from X-ray images with various training and testing ratios. The second task was to train the CNN model from scratch. Their results indicate that training of the TL-based CNN models produces high performance. Wang et al. (2021) presented a modified inception model, followed by internal and external validation for COVID-19 detection. Their model has been evaluated on 1065 CT images for COVID-19 and viral pneumonia cases. The accuracy of the internal validation was 89.5%, and the accuracy of the external validation reaches 79.3%. Song et al. (2021) proved that identifying possible lesions from CT images may be useful for COVID-19 detection. The feature pyramid network (FPN) was combined with the ResNet50 model. That ResNet50 was used to extract local and relational features. The global features extracted from the original image are concatenated with these features. The classification process is carried out using a multi-layer perceptron (MLP). The sensitivity of this model was 96%.

Several researchers implemented AI and heuristic optimization algorithms such as genetic algorithms (GAs) in this topic. Memetic genetic algorithms (MGAs) were exploited to solve several problems such as network optimization, vehicle routing, several graph-theory problems, and electronic manufacturing units. Roy et al. (2019) introduced MGAs to solve the traveling salesman problem (TSP). Boltzmann probabilistic selection and parents crossover were combined with the ergodic mutation. The cost and distance are compared for the adjacent nodes of the involved parents. The algorithm was compared with classical GAs on standard TSP benchmarks. Biswas and Pal (2019) presented a fuzzy goal programming (FGP) method based on GA. In order to solve the congestion management (CM) problem, membership functions are converted into membership goals. The GA computational scheme achieves the required goals according to their priorities. Li et al. (2020) proposed a DCNN model for COVID-19 detection that is called COVNet. Their model was tested on CT images collected from six hospitals. It achieved an accuracy of 96%. Using the COVID-19 chest X-ray dataset, Ghoshal and Tucker (2020) used drop-weights-based Bayesian convolutional neural networks (BCNNs) to compute uncertainty in DL models to increase the diagnostic performance. An accuracy of 89.92% has been reported. Wang et al. (2021) proposed a DCNN model for COVID-19 detection called COVID-Net. Their algorithm was tested on a collection of 16,756 X-ray images from 13,645 cases obtained from two open access data sources. An accuracy of 92.4% has been reported.

The main advantage of DCNN models is automatic feature extraction. DCNN models can be used as feature extractors followed by traditional ML models for the classification process. Loey et al. (2021) presented an algorithm for COVID-19 detection. Their algorithm includes two stages. The first stage depends on ResNet50 for feature extraction. The second stage depends on decision tree (DT), SVM, and ensemble algorithms for classification. Their results revealed that the SVM classifier outperforms the other ML algorithms, and achieves an accuracy of 99.64%. Wu et al. (2020) presented an ML algorithm for COVID-19 detection from blood tests. A random forest (RF) classifier allows discrimination from 49 blood tests. An accuracy of 95.95% has been obtained. Rahman et al. (2020) built a data-driven dynamic clustering method to mitigate the negative impact of COVID-19 on the economy. Their method has three main components: data analysis, dynamic clustering, and data security. A clustering technique has been presented, and it has been simulated in four scenarios to reveal its benefits and drawbacks. In the lock-down coverage experiment, the presented clustering method improved the performance indicators by 60–80%.

3 Materials and methods

The chest X-ray image dataset used in this work includes 912 X-ray images of normal subjects and 912 X-ray images of COVID-19 infected people. It was presented by Alqudah and Qazan (2020). Figure 1 shows different samples from this dataset that were used to test the proposed models for COVID-19 detection (Fig. 2).

Fig. 1 X-ray images for COVID-19 and normal cases (Alqudah and Qazan 2020)

Fig. 2 Block diagram of the proposed approaches for COVID-19 detection

The proposed approach is presented in Fig. 2. Firstly, traditional ML algorithms with manual feature extraction have been studied for COVID-19 detection. After that, a 15-layer CNN model built from scratch has been investigated for efficient classification of COVID-19 cases. In addition, a TL strategy has been exploited. Different DCNNs such as ResNet50, ResNet101, Inception-v3 and InceptionResnet-v2 have been tuned to the task of interest. Finally, feature extraction is performed through DCNNs, while the classification task is implemented with traditional ML classifiers.

3.1 Convolutional neural network trained from scratch

The principal structure of a CNN is a combination of convolution, batch normalization (BN), and pooling layers. The BN layers normalize the local features once the convolution layers have extracted them. Pooling layers reduce the number of extracted features. Max-pooling is used to reflect the variations in local activity levels: the largest values primarily correspond to edges, so max-pooling preserves edge details, which matters because X-ray images contain many details. The output feature map is given as follows (Bhandary et al. 2020; Bosch et al. 2007; Cheng and Bao 2014):

$$\begin{aligned} {Y_{j}}^{l}=f\left( \sum _{i\in N_{j} }^{} {Y_{i}}^{l-1}*{X_{ij}}^{l}+{b_{j}}^{l}\right) \end{aligned}$$
(1)

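where \({Y_{j}}^{l}\) is the j-th output feature map of layer l, \({Y_{i}}^{l-1}\) indicates the local features collected from the preceding layer, \(N_{j}\) is the set of input maps connected to output map j, \({X_{ij}}^{l}\) refers to the adjustable kernels, and \(f(\cdot )\) is the activation function. The bias \({b_{j}}^{l}\) is a trainable offset added to each output map. The pooling process is represented in Eq. (2):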

$$\begin{aligned} {Y_{j}}^{l}=down( {Y_{j}}^{l-1}) \end{aligned}$$
(2)

The down-sampling function is represented by down(.). All activations in the previous layer are connected directly to the fully-connected (FC) layer, which combines them into discriminative features so that the input image can be classified into different classes.
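As an illustration of Eqs. (1) and (2), the following minimal sketch computes one output feature map from a single input map and then down-samples it. It is not the implementation used in this work: the function names `conv2d_valid` and `max_pool`, the choice of ReLU for f(.), the 2 × 2 non-overlapping pooling window, and the toy image and kernel are all assumptions made for illustration. As in most CNN frameworks, the kernel is applied without flipping (cross-correlation).

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix, bias: float = 0.0) -> Matrix:
    """Eq. (1) for a single input map: sliding-window sum plus bias, then f(.).

    f(.) is assumed to be ReLU; no padding is used ("valid" output size).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out: Matrix = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, acc))  # ReLU activation (assumed)
        out.append(row)
    return out


def max_pool(feature_map: Matrix, size: int = 2) -> Matrix:
    """One possible down(.) for Eq. (2): non-overlapping max-pooling."""
    out: Matrix = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out


# Toy 5x5 image: dark columns 0-1, bright columns 2-4 (a vertical edge).
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Horizontal-gradient kernel that responds to vertical edges.
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

feature_map = conv2d_valid(image, kernel)  # 3x3 map, large values at the edge
pooled = max_pool(feature_map)             # 1x1 map keeping the edge response
```

In this toy case the large responses in `feature_map` sit where the window straddles the edge, and max-pooling keeps that largest value, which is the behaviour described above for edge details.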

3.2 Local feature extraction and machine learning classifiers

First of all, X-ray images are resized and converted from gray-scale into HSV images. Then, mean, standard deviation (std), skewness, kurtosis, histogram minimum and histogram maximum are estimated for H, V and S channels. These features are used as inputs for the ML models. To the best of our knowledge, this set of features has never been considered in the literature for ML-based COVID-19 detection. Different ML models have been evaluated for classification. Specifically, we assessed logistic regression (LR), K-nearest neighbours (KNN), SVM, naive Bayes (NB), DT, RF, gradient boosting (GB), stochastic gradient descent (SGD), GP, MLP, adaptive boosting (AdaBoost) and extreme gradient boosting (XGBoost) classifiers.

To verify the effectiveness of the proposed approach, extraction of adequate features from the input images is required. The mean, std, skewness, kurtosis, histogram minimum and histogram maximum are considered. Higher-order statistics such as skewness and kurtosis (Groeneveld and Meeden 1984) are utilized for classifying X-ray images. The use of these statistics is inspired by the fact that the distribution of the samples of a dataset is often characterized by their level of dispersion and asymmetry. For an N-point data sample sequence, \(X = \{x_1,x_2,\ldots , x_N\}\), the corresponding skewness \(\beta _{1}\) and kurtosis \(\beta _{2}\) are calculated as:

$$\begin{aligned} \beta _{1}= & {} \frac{1}{N}\sum _{i=1}^{N}\left( \frac{x_{i}-\mu }{\sigma }\right) ^{3} \end{aligned}$$
(3)
$$\begin{aligned} \beta _{2}= & {} \frac{1}{N}\sum _{i=1}^{N}\left( \frac{x_{i}-\mu }{\sigma }\right) ^{4} \end{aligned}$$
(4)

where \(\mu\) is the mean of the data and \(\sigma\) is the standard deviation. The second-, third-, and fourth-order moments are used to calculate the skewness and kurtosis.
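The feature computation described above, including Eqs. (3) and (4), can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: the helper names `channel_stats` and `hsv_features` are hypothetical, the colour conversion uses Python's standard `colorsys.rgb_to_hsv`, the population standard deviation (division by N) is assumed, the histogram minimum and maximum are interpreted as the smallest and largest channel values, and a constant channel is assigned zero skewness and kurtosis by convention.

```python
import colorsys
import math
from typing import Dict, List, Sequence, Tuple

STAT_ORDER = ("mean", "std", "skewness", "kurtosis", "min", "max")


def channel_stats(values: Sequence[float]) -> Dict[str, float]:
    """Six statistics of one channel; skewness and kurtosis follow Eqs. (3)-(4)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)  # population std
    if sigma == 0.0:
        skew = kurt = 0.0  # assumed convention for a constant channel
    else:
        skew = sum(((v - mu) / sigma) ** 3 for v in values) / n
        kurt = sum(((v - mu) / sigma) ** 4 for v in values) / n
    return {"mean": mu, "std": sigma, "skewness": skew, "kurtosis": kurt,
            "min": min(values), "max": max(values)}


def hsv_features(rgb_pixels: Sequence[Tuple[int, int, int]]) -> List[float]:
    """Return an 18-element vector: six statistics for each of H, S and V.

    `rgb_pixels` is a flat sequence of 8-bit (R, G, B) tuples.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in rgb_pixels]
    features: List[float] = []
    for idx in range(3):  # 0: hue, 1: saturation, 2: value
        stats = channel_stats([pixel[idx] for pixel in hsv])
        features.extend(stats[name] for name in STAT_ORDER)
    return features


# Hypothetical four-pixel "image" used only to exercise the functions.
toy_pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]
toy_features = hsv_features(toy_pixels)

# Symmetric sample: skewness is 0 and kurtosis is 1.7 under Eq. (4).
symmetric = channel_stats([1.0, 2.0, 3.0, 4.0, 5.0])
```

The resulting vector can be passed to any of the ML classifiers. Note that Eq. (4) is the non-excess kurtosis, so a normal distribution would give a value of 3 rather than 0.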

In Fig. 3, the corresponding values of histogram minima, histogram maxima, mean, and std are also displayed for H, V, and S channels. It is important to note that the histograms of the X-ray images are different for each case. The associated histogram minima, histogram maxima, mean, std, skewness, and kurtosis values are all different, and these variables represent the dispersion, asymmetry, and peakedness of the data. As a result, it is reasonable to assume that these statistical metrics are effective for classifying X-ray images.

Fig. 3 Histograms of H, V and S channels for COVID-19 and normal cases

3.3 Deep feature extraction with machine learning (ML) classifiers

An ML classifier is used instead of the DL classifier, because DL classifiers require a large dataset for training and validation. The deep features of the pooling layer are retrieved and fed into the ML classifier. ResNet50, ResNet101, Inception-v3, and InceptionResnet-v2 are employed as pre-trained CNN models. Furthermore, the GP classifier is employed and compared with the RF classifier.

Figures 4 and 5 present the architecture for the deep feature extraction with the modified InceptionResnet-v2 and ResNet101 models, respectively.

Fig. 4 Deep feature extraction architecture with modified InceptionResnet-v2 model

Fig. 5 Deep feature extraction architecture with modified ResNet101 model

3.4 Transfer-learning-based pre-trained models

Deep learning from scratch is a time-consuming process that requires data labeling and splitting. TL is appropriate for relieving this burden. In TL, small changes are induced in deep pre-trained networks in response to the new input data. The pre-trained models are loaded, and BN, ReLU, and softmax layers are used in place of the last three FC layers. The models are trained for 6 epochs with a learning rate of 0.00001. The final aim of the proposed approach is to use a tuned pre-trained model to classify image batches into COVID-19 or normal cases. The block diagram of the TL-based model for COVID-19 detection is presented in Fig. 6.

Fig. 6 Block diagram of the proposed pre-training-based TL model

3.5 Performance metrics

Standard metrics such as accuracy (ACC), sensitivity (SEN), specificity (SPEC), precision (Preci), mis-classification rate (\(M_r\)), and false positive rate (\(F_{ pr}\)) are used to evaluate the proposed model (Sokolova and Lapalme 2009). The number of correctly identified abnormal cases is the true positive count (\(T_p\)). The number of correctly identified normal cases is the true negative count (\(T_n\)). The number of normal cases wrongly classified as abnormal is the false positive count (\(F_p\)), and the number of abnormal cases wrongly classified as normal is the false negative count (\(F_ n\)). Sensitivity is given as:

$$\begin{aligned} SEN=\frac{T_p}{T_p+F_n} \times 100 \end{aligned}$$
(5)

Specificity is given as:

$$\begin{aligned} SPEC=\frac{T_n}{T_n+F_p} \times 100 \end{aligned}$$
(6)

Accuracy is given as:

$$\begin{aligned} ACC=\frac{T_p+T_n}{T_p+T_n+F_p+F_n} \times 100 \end{aligned}$$
(7)

Precision is given as:

$$\begin{aligned} Preci=\frac{T_p}{T_p+F_p} \end{aligned}$$
(8)

The misclassification rate is given as:

$$\begin{aligned} M_r =\frac{F_p+F_n}{T_p+T_n+F_p+F_n} \end{aligned}$$
(9)

False positive rate is given as:

$$\begin{aligned} F_{pr} =\frac{F_p}{T_n+F_p} \end{aligned}$$
(10)
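Equations (5)–(10) can be evaluated directly from the four confusion-matrix counts, as in the following minimal sketch. The function name `classification_metrics` and the example counts are hypothetical and are not results from this study. As in the equations, SEN, SPEC, and ACC are expressed as percentages, while Preci, \(M_r\), and \(F_{pr}\) are fractions.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the metrics of Eqs. (5)-(10) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "SEN": 100.0 * tp / (tp + fn),     # Eq. (5), percent
        "SPEC": 100.0 * tn / (tn + fp),    # Eq. (6), percent
        "ACC": 100.0 * (tp + tn) / total,  # Eq. (7), percent
        "Preci": tp / (tp + fp),           # Eq. (8), fraction
        "M_r": (fp + fn) / total,          # Eq. (9), fraction
        "F_pr": fp / (tn + fp),            # Eq. (10), fraction
    }


# Hypothetical counts for illustration only (not taken from the experiments).
example = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

Two identities follow from the definitions and serve as a quick consistency check on reported values: \(F_{pr} = 1 - SPEC/100\) and \(M_r = 1 - ACC/100\).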

4 Experimental results

The investigated ML models, TL-based pre-trained models, deep feature models, and the model trained from scratch are tested using the dataset of Alqudah and Qazan (2020).

4.1 Results for the first approach

For the first approach, a simple CNN model was built and trained from scratch. The architecture of the proposed CNN model is given in Table 1. The conv-1, conv-2, and conv-3 layers have 16, 32, and 64 filters, respectively, each of size \(3 \times 3\) pixels. The max-pooling function is employed for dimensionality reduction. The network is trained from scratch using the Adam optimizer, with a learning rate of 0.00001. Figure 7 displays the performance of the CNN trained from scratch in terms of both accuracy and loss. The validation accuracy closely follows the training accuracy, and the validation loss closely follows the training loss. The mean square error (MSE) has been chosen as the loss function, so training minimizes the squared distance between the predictions and the labels. The confusion matrix and ROC curve for the model trained from scratch are shown in Fig. 8. The CNN model reports an accuracy of 94.78% and an \(F_{pr}\) of 0.0522.

Table 1 Architecture of the proposed CNN model for COVID-19 detection

Fig. 7 Training progress of the CNN model

Fig. 8 Confusion matrix and ROC curve for the CNN model trained from scratch

4.2 Results for the second approach

The results are presented in Table 2. It is clear that the best models for detection are RF, GP, MLP and KNN, with accuracy levels up to 97.53%, 97.53%, 96.16% and 95.34%, respectively, with an 80/20 training/testing ratio. In contrast, the NB, GB and AdaBoost classifiers achieve low accuracies of about 67.67%, 67.88% and 65.48%, respectively.

Table 2 Detection performance results obtained from different ML models

Figures 9 and 10 present confusion matrices and ROC curves for MLP, RF and GP classifiers, which give higher performance than other models. AUCs of 98%, 98%, 96% and 95% are obtained from RF, GP, MLP and KNN classifiers, respectively.

Fig. 9 Confusion matrices for the best ML models used for COVID-19 detection: (a) RF classifier, (b) GP classifier and (c) MLP classifier

Fig. 10 ROC curves for the best-performing ML models used for COVID-19 detection: (a) RF classifier, (b) GP classifier and (c) MLP classifier

4.3 Results for the third approach

Four pre-trained models, namely InceptionResnet-v2, Inception-v3, ResNet50, and ResNet101 were used in this study. Moreover, GP and RF classifiers were used for the purpose of classification. Tables 3 and 4 present the detection performance results obtained from different pre-trained models with GP and RF classifiers, respectively. It is clear that InceptionResnet-v2 and ResNet101 models with GP classifier outperform other models.

Table 3 Detection performance results obtained from different pre-trained models with GP classifier
Table 4 Detection performance results obtained from different pre-trained models with RF classifier

4.4 Results for the fourth approach

Table 5 presents detection performance results obtained for TL-based pre-trained models. It is clear that ResNet101 outperforms the other models with an accuracy that reaches 99.18%.

Table 5 Detection performance results obtained for TL-based pre-trained models

Confusion matrices and ROC curves for the proposed pre-trained models are presented in Figs. 19 and 20. InceptionResnet-v2 and ResNet101 models have the same performance with an accuracy that reaches 100%.

5 Discussion and comparison with the state-of-the-art methods

As can be seen from the obtained results, the TL-based and deep feature extraction approaches present better accuracy levels than those of the CNN model and local feature extraction approaches. The obtained results reveal that pre-trained models work well for both TL and deep feature extraction. These models were trained with a huge number of images, reaching 25 million images. As a result, their convolution layer filters are efficient for new applications such as COVID-19 detection. Furthermore, the depth of these CNN models has a considerable impact on the application accuracy. The approach that comprises deep feature extraction with GP and RF classifiers gives the highest accuracy levels. This is attributed to the ability of the deep features to represent image activities efficiently, in addition to the inherent ability of the RF classifier to reduce overfitting and increase accuracy. In addition, the GP is a powerful algorithm for classification problems (Figs. 11, 12, 13).

Fig. 11 Confusion matrix and ROC curve for InceptionResnet-v2 model with GP classifier

Fig. 12 Confusion matrix and ROC curve for InceptionResnet-v2 model with RF classifier

Fig. 13 Confusion matrix and ROC curve for Inception-v3 model with GP classifier

The computation time is the final comparison metric between the proposed approaches. It is shown in Table 6, which reveals that deep feature extraction with ResNet101 and the GP classifier gives the shortest run time of 30.9 s. Deep feature extraction with the InceptionResnet-v2 model and the GP classifier achieves the second-best run time of 38.7 s. The CNN model trained from scratch has the longest run time of 1198.1 s (Figs. 14, 15, 16).

Table 6 Computational time of the examined approaches

Fig. 14 Confusion matrix and ROC curve for Inception-v3 model with RF classifier

Fig. 15 Confusion matrix and ROC curve for ResNet50 model with GP classifier

Fig. 16 Confusion matrix and ROC curve for ResNet50 model with RF classifier

The accuracy level of the proposed approach reaches 100%, which is higher than the levels of the existing methods given in Table 7. These findings confirm that deep feature extraction combined with efficient classifiers performs the required classification task well (Figs. 17, 18, 19, 20).

Table 7 Comparison of the proposed work with state-of-the-art models

Fig. 17 Confusion matrix and ROC curve for ResNet101 model with GP classifier

Fig. 18 Confusion matrix and ROC curve for ResNet101 model with RF classifier

Fig. 19 Confusion matrix and ROC curve for ResNet101

Fig. 20 Confusion matrix and ROC curve for InceptionResnet-v2

6 Conclusions

This paper presented four approaches for the detection of COVID-19 cases from X-ray images. A CNN model has been built from scratch for this purpose. In addition, machine learning has been investigated for COVID-19 detection using histogram-based and statistical features. The task of deep feature extraction has also been investigated with the best machine learning classifiers for COVID-19 detection. Finally, transfer learning has been utilized to enhance the performance of the detection process. The obtained results show that the transfer-learning-based and the deep-feature-based approaches outperform the local feature extraction approaches and the CNN model built from scratch. Deep features with Gaussian process (GP) and random forest (RF) classifiers perform better than the other approaches. This is attributed to the ability of the deep features to represent image activities efficiently, in addition to the inherent ability of the RF classifier to reduce overfitting and increase accuracy. In addition, the GP is a powerful algorithm for classification problems. Moreover, deep feature extraction with ResNet101 and the GP classifier gives a run time of 30.9 s, which is the shortest among the proposed approaches. While chest X-ray images have been used to diagnose many lung diseases such as tuberculosis, pneumonia, and lung carcinomas, the proposed approaches are limited to the recognition of COVID-19 versus normal cases. In future work, feature fusion can be exploited for features extracted from different networks to enhance the detection accuracy. In addition, the proposed approaches can be extended to different diagnosis tasks.