1 Introduction

Coronavirus disease (COVID-19) is a global health crisis that emerged in the last 2 years and has caused many deaths in countries around the world. The symptoms of COVID-19 vary with the variant contracted, ranging from mild illness to critical and fatal disease [1]. Common symptoms include coughing, fever, and loss of smell and taste, while less common ones include headaches, nasal congestion and runny nose, sore throat, muscle pain, diarrhea, eye irritation, toes swelling or turning purple, and moderate to severe breathing difficulties. Over these 2 years, the main challenge has been the early detection of COVID-19 to curb the rise in infections and deaths worldwide [2]. Hence, since the beginning of the COVID-19 outbreak, much research based on different machine learning and deep learning models has been introduced for the detection, classification, and segmentation of this disease [3].

Nowadays, machine learning (ML) and deep learning (DL) have become the core of artificial intelligence. They have become increasingly important in various fields because of their ability to analyze large amounts of data and extract patterns and insights that inform decision-making. Examples include healthcare for the diagnosis of different diseases, the energy sector for the modeling of fluidized adsorption beds [4], chemistry for predicting NOx emissions from coal combustion in circulating fluidized-bed combustors [5], and power generation in the industrial sector [6]. In healthcare, many ML and DL models have been developed for medical applications [7]. A convolutional neural network (CNN) is a special type of deep learning architecture based on neural networks and consists of three main layer types (convolutional, pooling, and fully connected layers) [8]. CNNs are well suited to many computer vision tasks. Therefore, the high performance of CNNs on images can help researchers in the diagnosis of COVID-19 [9].

Different types of medical images can be used in the diagnosis of disease, such as X-ray, MRI, CT, PET, and ultrasound [10]. All of these can serve as diagnostic tests for COVID-19; the most common types are the chest X-ray (CXR) and the computed tomography (CT) scan. CT is regarded as the essential radiographic test for assessing COVID-19, as it is accurate, has lower misdiagnosis rates, and is quicker [11].

Imaging classification for COVID-19 has emerged as an essential tool for diagnosing and monitoring this and other lung diseases. Radiological imaging techniques such as chest X-rays and CT scans can reveal characteristic features of the disease [12]. These extracted features can help differentiate COVID-19 from other respiratory illnesses and provide valuable information on disease severity and progression. Imaging classification also plays a crucial role in identifying patients at high risk of developing severe illness, such as those with pre-existing lung disease or immunocompromised individuals, and it can aid in the early detection of complications such as pneumonia, which can be life-threatening if left untreated. Imaging classification has also been used to identify imaging biomarkers that may predict disease severity and clinical outcomes [13]; for example, studies have shown that the extent of lung involvement on CT scans is associated with the likelihood of ICU admission and mortality. Overall, imaging classification is a critical component of the diagnostic process for COVID-19, enabling early detection, appropriate treatment, and improved patient outcomes. The main problem is obtaining more accurate results in the classification of COVID-19. Therefore, our objective is to identify the most accurate ML classifier combined with the highest-performing CNN model for feature extraction. In this paper, we examine several popular convolutional neural network models combined with different machine learning classifiers [14]. Ten deep convolutional networks are used: Xception [15], Darknet53 [16], Vgg19 [17], Alexnet [18], Googlenet [19], Mobilenetv2 [20], Squeezenet [21], Darknet19 [22], Resnet50 [21], and Resnet101 [21]. These networks extract features from our dataset, which consists of 2,482 CT images. We focus on applying different machine learning classifiers for binary classification (COVID-19 and non-COVID-19) using the features extracted from these deep convolutional neural networks. For each model, we test several layers to determine which yields the highest accuracy, and we report the best network, layer, and classifier combination over the three dataset folds.

There are some key contributions of this paper that can be summarized as follows:

  1. Extract the most relevant features, which yield the highest accuracy, categorized into two levels: the network level and the layer level.

  2. Find the best layer in each of the 10 selected deep CNN networks for feature extraction.

  3. Find the classifier among the selected ML classifiers that gives the highest performance.

  4. Propose hybrid models combining machine learning and deep learning that achieve better results than either approach alone.

  5. Present the 10 highest-performing models among the proposed hybrid models for COVID-19 detection.

  6. Present the best models for COVID-19 detection among all the proposed models and the literature.

The paper is organized as follows. Section 2 presents some related work and how this could be used to diagnose COVID-19. Subsequently, details about our proposed methodology and information about some used classifiers are described in Sect. 3. Section 4 introduces details about our experimental setup and the dataset. Section 5 introduces the obtained results and a detailed discussion about the obtained results. Finally, the conclusion is presented in Sect. 6.

2 Related work

Many effective works based on machine learning and deep learning have been introduced using chest radiography images, such as X-ray images and CT scans, over the 2 years since the start of the COVID-19 crisis. In this section, we discuss the most relevant works based on different machine learning classifiers and different deep learning models for extracting features from dataset images, and we highlight their accomplishments and reported results.

In [23], a deep method to extract CXR image features using a Convolutional Neural Network (CNN) called Alexnet is presented. The study classified images to recognize COVID-19 disease with a Support Vector Machine (SVM) using nonlinear kernel functions. The researchers used 146 CXR images divided into two classes, COVID-19 and non-COVID-19 (healthy). They reported an Accuracy of 91.53%, a Recall of 91.68%, a True Negative Rate of 91.06%, and a False Positive Rate of 8.32%. Researchers in [24] introduced a new deep model called COVIDX-Net which included seven deep CNN models: VGG19, MobileNet, DenseNet121, InceptionV3, ResNetV2, Inception-ResNet-V2, and Xception. Each model could extract deep features from X-ray images to classify non-COVID-19 and COVID-19 cases. They used a dataset of 50 X-ray images divided into two classes. In their evaluation, they used 80% of the images for training and 20% for testing. Their model achieved the best f1-score of 0.91 with DenseNet121, and the same accuracy of 90% for VGG19 and DenseNet121.

In [25], the researchers’ model was based on using 11 different deep CNN models to extract features from X-ray images together with a support vector machine (SVM) classifier that categorized images into three classes: non-COVID-19, COVID-19, and pneumonia. Their dataset consisted of 381 CXR images divided into 127 confirmed COVID-19 images, 127 confirmed pneumonia images, and 127 healthy images. Resnet50 plus SVM achieved the best results among the models, with accuracy, sensitivity, FPR, and F1-score values of 95.33%, 95.33%, 2.33%, and 95.34%, respectively. Researchers in [26] implemented different algorithms for the diagnosis of lung diseases such as COVID-19 and pneumonia using GLCM and HOG features extracted from chest X-rays (CXR). They used 6300 patches of lung images and applied Support Vector Machine (SVM) and Artificial Neural Network (ANN) classifiers, achieving a maximum accuracy of 93.73% with SVM.

In [27], three deep-learning-based approaches were introduced: deep feature extraction, fine-tuning of pre-trained convolutional neural networks, and end-to-end training of a newly developed CNN model. The researchers used a dataset consisting of 180 COVID-19 and 200 healthy chest X-ray images. They used the pre-trained deep CNN models ResNet18, ResNet50, ResNet101, VGG16, and VGG19 to extract features, and a Support Vector Machine (SVM) classifier with different kernel functions to classify the extracted features. They found that ResNet50 with a linear-kernel SVM achieved the best accuracy of 94.7% among the tested models. In [28], a deep-learning CNN architecture consisting of 15 layers was proposed to extract features from collected CT images. The authors collected chest CT images of 10 cases of COVID-19 pneumonia from RadioPaedia to classify COVID-19 pneumonia infection versus normal. They used two main layers to extract deep features, the global average pooling and fully connected layers, combined using a max-layer-detail (MLD) approach. A Correntropy technique was then embedded in the design to choose the most discriminant features from the feature pool. For the final classification, they used a one-class kernel extreme learning machine classifier, which achieved an average accuracy of 95.1%, with 95.1% sensitivity, 95% specificity, and 94% precision.

In another study [29], researchers used a dataset of 15,153 X-ray images classified into three classes: viral pneumonia, normal, and COVID-19. They processed this dataset and then used the Local Binary Pattern to extract features from it. They used six classifiers, Cubic SVM, Linear Discriminant (LD), Quadratic Discriminant (QD), Ensemble, Kernel NaiveBayes (KNB), and Weighted KNN, to classify input images. Their model achieved the highest performance with Cubic SVM, at an accuracy of 98.05%. In [30], researchers introduced a detection and classification system for COVID-19 using X-ray images. They introduced six models beginning with image processing, segmentation, and feature extraction, followed by two classifiers, KNN and SVM. These six models are named HOG-KNN, LBP-KNN, Haralick-KNN, HOG-SVM, LBP-SVM, and Haralick-SVM. They used 5,000 X-ray images with five-fold cross-validation for testing. The LBP-KNN model outperformed the other five models with an average accuracy of 98.66%, 97.76% sensitivity, 100% specificity, and 100% precision.

In [31], researchers presented a classification algorithm for COVID-19 cases using lung CT scans. They used the q-transform model to extract features from 276 CT scans and then applied three ML classifiers: k-nearest neighbor, decision tree, and SVM. They achieved the highest performance with the SVM classifier, with 98.25% accuracy, 95.30% sensitivity, and 97.60% specificity. In [32], researchers used six CNN architectures: VGG16, DenseNet121, MobileNet, NASNet, Xception, and EfficientNet. They used 3873 CT images for binary classification (COVID and non-COVID) and achieved the highest performance with VGG16, at an accuracy of 97.68%. In [33], researchers described a CNN model that employed three image classes: normal, COVID-19, and viral pneumonia, comprising 219 COVID-19 images, 1341 normal images, and 1345 viral pneumonia cases. The CNN model reached an accuracy of around 94% on the three-class test data, which contained only 32 images across all classes. Its recall and precision values for COVID-19 were 94% and 95%, respectively, whereas the VGG16 and VGG19 models each achieved 97% accuracy for the three-class classification.

In [34], researchers proposed a new CNN model called ResF, which comprised five major blocks, each containing two successive elements: a ResF module and a connector ResF module. Their dataset was based on freely available LUS image and video datasets. They used 121 videos divided into 45 for COVID-19, 23 for bacterial pneumonia, and 53 for healthy people, together with 40 images: COVID-19 (18), bacterial pneumonia (7), and healthy (15). The proposed system achieved 92.5% precision and a 91.8% accuracy value. Mohammad Rahimzadeh et al. [35] proposed training techniques that help a network learn from an unbalanced dataset, as well as a new hybrid model combining the Xception and ResNet50V2 networks. They used two open-source datasets. The first was an X-ray dataset classified into 4 classes: 180 images for COVID-19 cases and 42 images for streptococcus, pneumocystis, and SARS, which were classed as pneumonia. The second was obtained from Kaggle and contained 6012 pneumonia cases and 8851 normal cases. Finally, they tested their network on 11302 images and reported an average accuracy of 99.50% for detecting COVID-19 cases only, with an average accuracy of 91.4% over all classes.

3 Methodology

Convolutional Neural Networks (CNNs) play a significant role in image classification. The technique is based on applying several small filters to detect features in each layer. A CNN consists of a sequence of layers, each of which extracts features from the input images; the main layers are the convolutional, sampling (pooling), activation, and fully connected layers. Deep feature extraction from chest images is based on extracting features with pre-trained CNN models.

Our hybrid model consists of four stages, as shown in Fig. 1. The process begins by inputting our dataset of CT images. Then, preprocessing techniques such as resizing and augmentation are applied to the images. Different CNN architectures (Xception, Darknet53, Vgg19, Alexnet, Googlenet, Mobilenetv2, Squeezenet, Darknet19, Resnet50, and Resnet101) are used to extract features from this dataset. Finally, five machine learning classifiers are used to classify the extracted features, and the classifier that gives the model the highest performance is selected.

Fig. 1

Our hybrid classification model based on machine and deep learning
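To give a concrete picture of this pipeline, the sketch below shows a minimal Python equivalent in which a pretrained CNN backbone produces deep features and an SVM classifies them. This is an illustrative sketch only, not the MATLAB implementation used in our experiments; the dataset paths, the ResNet50 backbone, and the classifier settings are placeholder assumptions.

```python
# Minimal sketch of the hybrid pipeline in Fig. 1 (illustrative; paths and settings are assumptions).
import torch
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),          # resize to the network's input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(folder, backbone, device="cpu"):
    """Run every image through the CNN backbone and return a feature matrix."""
    data = ImageFolder(folder, transform=preprocess)   # subfolders: COVID / non-COVID
    loader = DataLoader(data, batch_size=32, shuffle=False)
    feats, labels = [], []
    backbone.eval()
    with torch.no_grad():
        for x, y in loader:
            f = backbone(x.to(device))                  # deep features for the batch
            feats.append(f.flatten(1).cpu())
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Use a pretrained ResNet50 without its final classification layer as the extractor.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

X_train, y_train = extract_features("ct_dataset/train", backbone)
X_test, y_test = extract_features("ct_dataset/test", backbone)

clf = SVC(kernel="linear")                  # one of the five ML classifiers compared
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The same skeleton applies to any of the ten networks and five classifiers discussed below; only the backbone, the extraction layer, and the classifier object change.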

3.1 Preprocessing stage

Pre-processing of the input dataset is a critical step to obtain images that are better suited to the classification process. First, the images are resized to be compatible with the input size of each selected CNN network. Augmentation is also applied to the input images, which increases the size of the dataset without the need to acquire new images [36].
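As an illustration of this stage, the following sketch shows one way the resizing and augmentation could be implemented with torchvision transforms; the target size and augmentation choices are assumptions, not the exact settings used in our MATLAB experiments.

```python
# Illustrative preprocessing sketch: resize to the network input size and apply
# simple label-preserving augmentations to enlarge the training set (assumed choices).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # e.g. 224x224 for VGG/ResNet, 227x227 for AlexNet
    transforms.RandomHorizontalFlip(p=0.5),   # augmentation: random flips
    transforms.RandomRotation(degrees=10),    # augmentation: small rotations
    transforms.ToTensor(),
])
# This transform would be passed to the training ImageFolder in place of `preprocess`.
```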

3.2 Feature extraction stage

The deep feature extraction process helps the classifier identify the image with higher accuracy. Different CNN models are applied for feature extraction. To extract features from the dataset, we selected 10 CNN models, since CNN networks have a robust ability to extract complex and deep features that describe each image in detail [37]. Xception consists of 170 layers involving depthwise separable convolutions, including 36 convolutional layers [15]. Darknet53 has 184 layers and is based on residual connections [16]. VGG19 has 47 layers [17]. Figure 2 illustrates the features extracted at layer 33 (a ReLU layer).

Fig. 2

Extracted features of layer 33 of VGG19

Alexnet was proposed in 2012 and consists of 25 layers in total [18]. GoogLeNet is based on the inception architecture and includes 144 layers; it uses inception modules, in which the network can choose between various filter sizes in each block [19]. MobileNetV2 has 154 layers and is based on an inverted residual structure in which the residual connections link the bottleneck layers [20]. Squeezenet is a convolutional neural network with 68 layers [21]. Darknet19 involves 64 layers [22]. ResNet50 has 177 layers based on a CNN architecture [21]. ResNet101 involves 347 layers based on a CNN architecture with an input image size of 224 × 224 × 3 [21]. The performance of every network layer of the CNN models is evaluated; in this phase, the layers giving the fastest feature extraction and best accuracy are determined. We report the best 4 layers for feature extraction from each network to identify the layer with the highest performance. The number of features extracted at each layer of these models is shown in Table 1.

Table 1 Extracted features from different layers using some deep feature extraction CNN models
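To make the layer-level extraction concrete, the sketch below registers a forward hook on one intermediate layer of a pretrained VGG19 and captures its activations as the feature vector for an image. The specific module chosen (the ReLU after the first fully connected layer, 4096 units) is an assumption standing in for the "layer 33" numbering used in our MATLAB models, and the input tensor is a placeholder for a preprocessed CT slice.

```python
# Sketch of layer-level feature extraction with a forward hook (illustrative).
import torch
from torchvision import models

vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

captured = {}
def hook(module, inputs, output):
    captured["features"] = output.detach()

# vgg19.classifier[1] is the ReLU after the first fully connected layer (4096 units).
handle = vgg19.classifier[1].register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224)            # stand-in for a preprocessed CT slice
with torch.no_grad():
    vgg19(image)
handle.remove()

features = captured["features"].flatten(1)    # feature vector of shape (1, 4096)
print(features.shape)
```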

3.3 Classification stage

Binary classification (COVID and non-COVID) is carried out after the feature extraction stage for each selected layer of the used CNN network. For each model, we tested the selected layers with five machine learning classifiers: Support Vector Machine (SVM), K-nearest neighbors (KNN), Ensemble classifier, NaiveBayes classifier, and Decision Tree (DT) classifier.

Support Vector Machine (SVM) is a classical kernel learning classifier that seeks the optimal separating hyperplane between classes by focusing on the training cases, named support vectors, that lie at the edges of the class distributions [38]. The hyperplane is chosen so that it maximizes the margin, which is the distance between the hyperplane and the closest data points of each class (the support vectors). SVM builds this optimal separating hyperplane with the help of a kernel function (K). All images whose feature vectors lie on one side of the separating hyperplane belong to class 1, and the others belong to class -1 [39]. The kernel is a mathematical function used in SVM: even if the data points are not linearly separable in the input space, the kernel maps them onto a high-dimensional feature space in which the separating hyperplane is simple to identify [40]. There are different types of SVM. Our work is based on error-correcting output codes with the Support Vector Machine (ECOC-SVM). The ECOC approach was originally used to correct data errors when transmitted over a channel; here it extends binary classifiers to multiclass classification [41]. ECOC-SVM is also used to solve problems of online identification and feature extraction. Figure 3 shows a typical structure of the ECOC-SVM multi-class classifier. Applying ECOC-SVM to multi-classification transforms an M-class problem into N binary classification problems using the ECOC matrix. ECOC-SVM functions according to the following steps:

  • In training, by SVM binary partition, the classes are divided into two subsets (0 or 1) according to each row of the coding matrix. Rows containing a value of 0 are designated as the first class, while rows with a value of 1 are considered the second class. Each column of the matrix is assigned a codeword corresponding to its respective class [42]. All N binary classifiers are trained using the rows of the ECOC matrix, where N denotes the dimensionality of the new feature space and the length of the codeword. If the trained ECOC-SVM does not achieve satisfactory accuracy, the training parameters may be adjusted and the ECOC-SVM re-trained, as shown in Fig. 4.

  • In testing, after training, the classifier is presented with the test data. The N binary classifiers evaluate each sample and produce an output vector Y. Equation (1) gives the Hamming distance between the output vector and each row of the code matrix, and the classifier selects the class with the smallest distance.

$$ \mathop {\arg \min }\limits_{i} d\left( {Y,H_{i} } \right) = \mathop \sum \limits_{j = 1}^{N} \left| {Y_{j} - H_{i,j} } \right|, \quad i = 1,2,3, \ldots ,m $$
(1)
Fig. 3

ECOC-SVM framework

Fig. 4

ECOC-SVM Training process
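A hedged illustration of the coding/decoding idea is given below using scikit-learn's OutputCodeClassifier wrapped around an SVM. This is not the exact configuration used in our MATLAB experiments, and the synthetic three-class data are placeholders; for the binary COVID/non-COVID task a single SVM already suffices, but the ECOC wrapper shows how the same SVM generalizes to multiple classes via a coding matrix.

```python
# Illustrative ECOC-SVM sketch: each binary SVM learns one column of the code matrix,
# and prediction picks the class whose codeword is closest to the binary outputs.
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for extracted deep features (3 classes to make ECOC meaningful).
X, y = make_classification(n_samples=600, n_features=64, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ecoc_svm = OutputCodeClassifier(SVC(kernel="linear"), code_size=2.0, random_state=0)
ecoc_svm.fit(X_tr, y_tr)
print("ECOC-SVM accuracy:", accuracy_score(y_te, ecoc_svm.predict(X_te)))
```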

K-nearest neighbor (KNN) is one of the simplest algorithms and falls under the category of lazy learners. In KNN, the distance between two data points is typically measured using the Euclidean distance. Given a new data point, the KNN algorithm first identifies its k nearest neighbors in the training set based on their distance to the new point. The class label of the new data point is then assigned as the most common class label among its k nearest neighbors; in the case of a tie, the class of the closest neighbor is typically chosen [43]. Our work is based on a multiclass k-nearest neighbors (KNN) classifier, a machine learning algorithm that can be used for classification tasks involving multiple classes. The classifier first determines the k nearest neighbors of a test instance in the training data based on a chosen distance metric, and then predicts the class label of the test instance from the class labels of these neighbors using techniques such as majority voting or weighted voting [44]. A value of k, denoting the number of neighbors considered, is chosen, and the Euclidean distance (the distance between two points, given in Eq. (2)) from the input to its k nearest neighbors is computed.

$$ {\text{Euclidean}}\;{\text{distance}} = \sqrt {\sum\limits_{i = 1}^{n} {\left( {Z_{i} - y_{i} } \right)^{2} } } $$
(2)

where \(y_{i}\) and \(Z_{i}\) are the i-th components of the first and second points, respectively.

Then, the classifier can be represented by the following equation.

$$ \vec{y}_{t} \left( l \right) = \mathop {\arg \max }\limits_{b \in \left\{ {0,1} \right\}} P\left( {H_{b}^{l} \mid E_{{\vec{c}_{t} \left( l \right)}}^{l} } \right), \quad l \in y $$
(3)

where \(\vec{c}_{t} \left( l \right)\) is the membership counting vector, \(\vec{y}_{t} \left( l \right)\) is the predicted category vector, and \(P\left( {H_{b}^{l} \mid E_{{\vec{c}_{t} \left( l \right)}}^{l} } \right)\), \(l \in y\), denotes the prior probabilities computed from the neighbors.
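The following minimal sketch illustrates the KNN prediction step described above (Euclidean distance to the k nearest training points, then majority voting); the feature values and the choice k = 5 are placeholders.

```python
# Minimal KNN sketch mirroring Eqs. (2)-(3); synthetic stand-in data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 16)             # stand-in for extracted deep features
y_train = np.random.randint(0, 2, size=100)   # 0 = non-COVID, 1 = COVID (synthetic)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="uniform")
knn.fit(X_train, y_train)

x_new = np.random.rand(1, 16)
dist, idx = knn.kneighbors(x_new)             # Euclidean distances to the 5 nearest neighbors
print("Predicted class:", knn.predict(x_new)[0])
```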

The Ensemble Classifier is a supervised machine learning technique that combines the predictions of multiple base classifiers to improve the overall prediction accuracy; the idea is that the combination of several classifiers can produce a stronger and more accurate classifier than any individual one [45]. In our work, the algorithm is based on the principle of bagging, which constructs multiple versions of the training data by random sampling with replacement and trains a base learner on each version [46]. The ensemble technique uses the AdaBoost method, which improves the simple boosting algorithm through an iterative process and predicts new data using Eq. 4, as described in Algorithm 1. Two extensions of the AdaBoost algorithm (AdaBoost.M1 and AdaBoost.M2) enhance the ensemble classifier for binary and multiclass classification, with prediction performed according to Eq. 5.

$$ f\left( x \right) = {\text{sign}}\left( {\mathop \sum \limits_{t = 1}^{T} \alpha_{t} * M_{t} \left( x \right)} \right) $$
(4)
$$ f\left( x \right) = \mathop {\arg \max }\limits_{y \in dom\left( y \right)} \left( {\mathop \sum \limits_{t:Mt \left( x \right) = y} \log \frac{1}{{\beta_{t} }}} \right) $$
(5)

where \(\beta_{t} \leftarrow \frac{{\varepsilon_{t} }}{{1 - \varepsilon_{t} }}\).

Algorithm 1

The AdaBoost algorithm pseudo-code
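As an illustrative counterpart to Algorithm 1, the sketch below trains a boosted ensemble of decision stumps with scikit-learn's AdaBoostClassifier; the synthetic data and parameter values are assumptions and do not reproduce our MATLAB ensemble settings.

```python
# Illustrative AdaBoost ensemble sketch: iteratively reweights training samples and
# combines weak learners (depth-1 decision trees) as in the AdaBoost.M1 procedure.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.random.rand(200, 32)                   # synthetic stand-in features
y = np.random.randint(0, 2, size=200)         # synthetic binary labels

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)  # default base learner is a decision stump
ada.fit(X, y)                                 # each round reweights misclassified samples
print("Training accuracy:", ada.score(X, y))
```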

A Decision Tree (DT) is another ML classification algorithm with a tree-like structure consisting of parent nodes, branches, and leaf nodes, as shown in Fig. 5. The decision tree algorithm works by recursively splitting the data into subsets based on attribute values, with the goal of maximizing the information gain or minimizing the impurity at each split [47]. The split criterion is the most critical aspect of a decision tree strategy. The internal nodes denote the attribute tests, the branches denote the outcomes [48], and the leaf nodes denote the class labels. At each node of the tree, the algorithm selects the feature that best separates the data into distinct classes and then recursively splits the data based on the selected feature until it reaches a stopping criterion, which can be a maximum tree depth, a minimum number of samples per leaf, or another criterion [49]. In decision tree learning, a tree can be turned directly into a collection of rules, each taken from one branch of the tree. This collection of rules can be ranked by quality; confidence and the J-measure are two prominent indicators of rule quality [50]. Confidence is a measure of a rule's weight (predictive accuracy), as described in Eq. 6.

$$ {\text{Confidence}} = \frac{{P\left( {x,y} \right)}}{P\left( x \right)} $$
(6)
Fig. 5

DT structure

where P(x, y) is the joint probability that the antecedent and consequent of a rule both occur, and P(x) is the probability that the rule antecedent occurs.

The J-measure, another prominent way of rating rules, measures a rule's average information content and is the product of the two terms described in Eq. 7.

$$ J\left( {Y,X = x} \right) = P\left( x \right) . j\left( {Y,X = x} \right) $$
(7)

The first term, taken as a measure of simplicity, is the probability that the antecedent (left-hand side) of a rule occurs. The second term is the j-measure, a measure of a rule's goodness-of-fit, commonly known as the cross entropy; it is defined in Eq. 8.

$$ j\left( {Y,X = x} \right) = P\left( {y|x} \right) \cdot \log_{2} \frac{{P\left( {y|x} \right)}}{{P\left( y \right)}} + \left( {1 - P\left( {y|x} \right)} \right) \cdot \log_{2} \frac{{1 - P\left( {y|x} \right)}}{{1 - P\left( y \right)}} $$
(8)
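For illustration, the sketch below fits a small decision tree and prints it as a collection of rules, mirroring the rule-extraction view described above; the stopping criteria (maximum depth, minimum samples per leaf) and the synthetic data are assumptions, not the paper's settings.

```python
# Illustrative decision-tree sketch: fit a CART tree and export it as readable rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(200, 8)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # synthetic binary labels

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, min_samples_leaf=5)
tree.fit(X, y)
print(export_text(tree, feature_names=[f"f{i}" for i in range(8)]))  # tree as rules
```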

NaiveBayes is a supervised ML classification algorithm [51]. The NaiveBayes classifier is a probabilistic classifier based on Bayes' theorem and the assumption of independence between features. It calculates the posterior probability of each class label given the feature values and assigns the label with the highest probability as the predicted class. Given a sample X, the classifier predicts that X belongs to class \(Ci{ }\) if

$$ P(Ci |X) > P(Cj |X)\;{\text{for}}\;1 \le j \le m, j \ne i. $$
(9)

The equation for the NaiveBayes classifier can be written as in Eq. (10).

$$ P\left( {Ci|X} \right) = \frac{{P\left( {X|Ci} \right). P\left( {Ci} \right)}}{P\left( X \right)} $$
(10)

Here, X is the data sample with an unknown class, Ci is the hypothesis that X belongs to a specific class, \(P\left( {Ci{|}X} \right){ }\) is the posterior probability of \(Ci\) given X, \(P\left( {Ci} \right)\) is the prior probability of hypothesis Ci, \(P\left( {X{|}Ci} \right)\) is the probability of X under hypothesis \(Ci\), and \(P\left( X \right)\) is the prior probability of X.
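A minimal sketch of this classifier is shown below using Gaussian NaiveBayes: the priors P(Ci) and per-feature likelihoods P(X|Ci) are estimated from training data, and the class with the largest posterior P(Ci|X) is predicted, as in Eq. (10). The data are synthetic placeholders.

```python
# Minimal Gaussian NaiveBayes sketch mirroring Eq. (10).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(150, 10)                   # synthetic stand-in features
y = np.random.randint(0, 2, size=150)

nb = GaussianNB()
nb.fit(X, y)                                  # estimates priors and per-class likelihoods
posteriors = nb.predict_proba(X[:1])          # P(Ci | X) for the first sample
print("Posteriors:", posteriors, "-> predicted class:", nb.predict(X[:1])[0])
```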

4 Testing environment

In this section, we describe our testing environment: the machine used, the input image dataset, and the performance metrics.

4.1 Machine

All the experiments in our work were carried out in MATLAB R2022a on an OMEN Laptop by HP with the following configuration: AMD Ryzen 7 5800H with Radeon Graphics 3.20 GHz, Nvidia Geforce RTX3070 8GB, and 16 GB RAM.

4.2 Dataset description

In our work, we used a dataset of 2482 CT scans downloaded from the Kaggle website [52], containing 1252 CT scans of positive COVID-19 and 1230 CT scans of non-COVID-19 cases. These data were collected from real patients at hospitals in Sao Paulo, Brazil. Our experiments are based on three folds: the dataset is divided randomly between training and testing with percentage splits of (60%, 40%), (70%, 30%), and (80%, 20%), as shown in Table 2. The basic idea of k-fold cross-validation is to split the dataset into k equally sized folds; the model is then trained on k − 1 folds and validated on the remaining fold, and this process is repeated k times, with each fold used once as the validation set. The advantage of k-fold cross-validation is that it gives a more reliable estimate of a model's performance than a single holdout set: by averaging the performance across multiple folds, we obtain a more representative estimate of the model's expected performance on unseen data. As shown in Fig. 6, the input dataset is classified into COVID and non-COVID classes.

Table 2 The number of CT images used in each Fold
Fig. 6

Different CT images classified as COVID-19 (upper row) and non-COVID-19 (lower row)
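For clarity, the sketch below shows one way the three random train/test splits could be produced; the stratification, random seed, and feature matrix are assumptions standing in for the extracted deep features and COVID/non-COVID labels.

```python
# Sketch of producing the three train/test splits used in the experiments
# (F-1: 60/40, F-2: 70/30, F-3: 80/20); synthetic placeholders for features and labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(2482, 128)                     # stand-in for extracted features
y = np.concatenate([np.ones(1252), np.zeros(1230)]).astype(int)

folds = {"F-1": 0.40, "F-2": 0.30, "F-3": 0.20}   # test-set fractions
splits = {}
for name, test_fraction in folds.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_fraction, stratify=y, random_state=0)
    splits[name] = (X_tr, X_te, y_tr, y_te)
    print(name, "train:", len(X_tr), "test:", len(X_te))
```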

4.3 Performance metrics

Several metrics can be used to evaluate the performance of our proposed model for classifying COVID-19 from the CT dataset. These metrics are calculated from the parameters of the confusion matrix: True Positive (α), True Negative (β), False Positive (Ω), and False Negative (µ). These parameters are defined as follows:

  • α denotes the number of COVID-19 images that are classified correctly as COVID-19.

  • µ denotes the number of COVID-19 images that are incorrectly classified as non-COVID-19.

  • Ω denotes the number of non-COVID-19 images that are incorrectly classified as COVID-19.

  • β denotes the number of non-COVID-19 images that are classified correctly as non-COVID-19.

Accuracy (ACC) is the proportion of correctly predicted cases relative to the whole dataset; a higher accuracy value indicates better model performance [42].

$$ {\text{Accuracy}} \left( {{\text{ACC}}} \right) = \frac{\alpha + \beta }{{\alpha + \beta + \Omega + \mu }} $$
(11)

Sensitivity (SEN) is the rate of correctly classified positive samples among all samples of the positive class.

$$ {\text{Sensitivity }}\left( {{\text{SEN}}} \right) = \frac{\alpha }{\alpha + \mu } $$
(12)

Specificity (SPC) is the measure of true negatives that are correctly classified by the model.

$$ {\text{Specificity }}\left( {{\text{SPC}}} \right){ } = \frac{\beta }{\beta + \Omega } $$
(13)

Positive Predictive Value (PPV) is the proportion measure of positive classified cases that are truly positive.

$$ {\text{PPV}} = \frac{\alpha }{\alpha + \Omega } $$
(14)

Negative Predictive Value (NPV) is the proportion measure of negative classified cases that are truly negative.

$$ {\text{NPV}} = \frac{\beta }{\beta + \mu } $$
(15)
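The sketch below shows how Eqs. (11)–(15) can be computed from a confusion matrix; the example predictions are placeholders.

```python
# Sketch of computing Accuracy, Sensitivity, Specificity, PPV, and NPV from a
# confusion matrix, using the paper's notation (alpha=TP, beta=TN, omega=FP, mu=FN).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # 1 = COVID-19, 0 = non-COVID-19
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

beta, omega, mu, alpha = confusion_matrix(y_true, y_pred).ravel()  # TN, FP, FN, TP
accuracy    = (alpha + beta) / (alpha + beta + omega + mu)
sensitivity = alpha / (alpha + mu)
specificity = beta / (beta + omega)
ppv         = alpha / (alpha + omega)
npv         = beta / (beta + mu)
print(accuracy, sensitivity, specificity, ppv, npv)
```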

5 Results and discussion

In our study, we examined the performance of classification models for the detection of COVID-19 based on ten CNN models. For each model, we used five classifiers with selected layers to determine the best classifier and the best layer. We then measured the performance of each classifier in terms of Accuracy, Sensitivity, Specificity, PPV, and NPV. When models have the same accuracy value, we prefer the one with higher sensitivity over higher specificity, since classifying a non-patient as a patient only leads to further investigation, whereas classifying a patient as a non-patient may cost the patient's life. The following results were obtained using dataset fold F-2.

In the Xception model, we compare the performance results of the four selected layers. We excluded the results of the NaiveBayes and Decision Tree classifiers because their results were poor. From Table 3, layer 79 achieved an accuracy of 98.25% with the SVM classifier, 90.30% with the KNN classifier, and 91.73% with the ensemble classifier; therefore, layer 79 achieves its highest accuracy with the SVM classifier. Layer 108 achieved an accuracy of 97.85% with the SVM classifier, 75.65% with the KNN classifier, and 91.90% with the ensemble classifier, so SVM achieved the best accuracy for layer 108. Layer 117 achieved an accuracy of 98.25% with the SVM classifier, 78.33% with the KNN classifier, and 91.13% with the ensemble classifier, so layer 117 also achieved its best value with the SVM classifier. Finally, layer 134 achieved an accuracy of 97.85% with the SVM classifier and 70.18% with the KNN classifier; no result was obtained for the ensemble classifier on this layer because of the excessive processing required, and the best performance for layer 134 is achieved with the SVM classifier. From this survey, the best result for every selected layer is achieved with the SVM classifier. SVM achieved the highest accuracy with layers 79 and 117, which reached the same accuracy value, so we compare them using other performance metrics: layer 79 achieved a sensitivity of 97.87%, whereas layer 117 achieved a sensitivity of 98.13%. Therefore, layer 117 with the SVM classifier is selected as model 1.

Table 3 Performance metrics values of selected layers of the Xception Model with different classifiers

For Darknet53, we compare the results for the four selected layers, as shown in Table 4. We excluded NaiveBayes because of its poor results. Layer 32 achieved an accuracy of 95.57% with the SVM classifier, 83.15% with the KNN classifier, and 89.70% with the ensemble classifier, with no result for the decision tree classifier; the best value is achieved by the SVM classifier. Layer 34 achieved an accuracy of 95.43% with the SVM classifier, 72.98% with the KNN classifier, and 89.76% with the ensemble classifier, with no result for the decision tree classifier; again, the best value is achieved by the SVM classifier. Layer 42 achieved an accuracy of 96.10% with the SVM classifier, 89.40% with the KNN classifier, 88.39% with the ensemble classifier, and 71.13% with the decision tree classifier, so the best result for layer 42 is with the SVM classifier. For layer 182, an accuracy of 92.21% is achieved with SVM, 89.29% with KNN, 91.13% with the ensemble, and 73.15% with the decision tree classifier, so the best results for layer 182 are achieved by the SVM classifier. The SVM classifier achieved the best performance for all layers, with the best values for layer 42. Therefore, we name the model of layer 42 with the SVM classifier model 2.

Table 4 Performance metrics values of selected layers of Darknet53 Model with different classifiers

For the VGG19 model, Table 5 shows the performance results for four layers. We exclude the results of NaiveBayes because they were poor. Layer 30 achieved an accuracy of 97.98% with the SVM classifier, 86.79% with the KNN classifier, 95.18% with the ensemble classifier, and 74.88% with the decision tree classifier; hence, the best results are achieved by the SVM classifier. Layer 33 achieved an accuracy of 98.38% with SVM, 66.67% with KNN, 93.69% with the ensemble, and 75.12% with the decision tree, so layer 33 performs best with the SVM classifier. Layer 36 achieved an accuracy of 97.04% with SVM, 90.95% with KNN, 92.44% with the ensemble, and 80.06% with the decision tree classifier, so layer 36 also performs best with the SVM classifier. Layer 38 achieved an accuracy of 97.58% with SVM, 75.48% with KNN, 93.69% with the ensemble, and 74.46% with the decision tree, achieving its best performance with SVM. The SVM classifier achieved the best performance for all selected layers, with layer 33 achieving the highest accuracy of 98.38%. Therefore, we name the model of layer 33 with SVM model 3.

Table 5 Performance metrics values of selected layers of the VGG19 Model with different classifiers

For AlexNet, the highest performance values are shown in Table 6 for layers 8, 9, 10, and 12; the results of the NaiveBayes classifier are excluded because they were poor. Layer 8 has an accuracy of 96.37% with the SVM classifier, 77.20% with KNN, 92.02% with the ensemble, and 77.86% with the decision tree, so the SVM classifier achieved the best performance for layer 8. Layer 9 achieved an accuracy of 97.44% with SVM, 83.33% with KNN, 92.38% with the ensemble, and 74.64% with the decision tree, so the best classifier for layer 9 is SVM. Layer 10 achieved an accuracy of 96.24% with SVM, 86.55% with KNN, 91.73% with the ensemble, and 76.61% with the decision tree, so the best classifier for layer 10 is SVM. Layer 12 achieved an accuracy of 97.04% with SVM, 87.14% with KNN, 92.62% with the ensemble, and 76.37% with the decision tree, so the best classifier for layer 12 is SVM. The SVM classifier achieved higher performance than the other classifiers for all layers, and layer 9 achieved the highest performance among all layers with SVM. Therefore, we consider layer 9 with the SVM classifier the best model and name it model 4.

Table 6 Performance metrics values of selected layers of the Alexnet Model with different classifiers

For GoogleNet, the layers with the highest performance values are layers 15, 23, 40, and 70, as shown in Table 7. Layer 15 achieved 95.70% accuracy with SVM, 79.40% with KNN, 90.12% with the ensemble, and 71.19% with the decision tree, with no results for NaiveBayes; the best classifier for layer 15 is SVM. Layer 23 achieved an accuracy of 96.51% with SVM, 91.79% with KNN, 92.56% with the ensemble, 72.98% with the decision tree, and 85.36% with NaiveBayes; the best classifier for layer 23 is SVM. Layer 40 achieved an accuracy of 97.85% with SVM, 71.07% with KNN, 93.45% with the ensemble, and 76.07% with the decision tree, with no results for NaiveBayes; the best classifier for layer 40 is SVM. Layer 70 achieved an accuracy of 96.24% with SVM, 78.75% with KNN, 91.07% with the ensemble, 73.69% with the decision tree, and 50% with NaiveBayes; the best classifier for layer 70 is SVM. Among all classifiers, SVM achieved the best results for all layers, and layer 40 has the best SVM values. Therefore, we consider the model of layer 40 with the SVM classifier to be model 5 for the classification of COVID-19.

Table 7 Performance metrics values of different selected layers of the GoogleNet Model with different classifiers

In the case of the Mobilenetv2 model, we concentrate on layers 58, 84, 150, and 152, as shown in Table 8. With the five classifiers, layer 58 achieved an accuracy of 95.83% with SVM, 86.67% with KNN, 92.20% with the ensemble, 77.50% with the decision tree, and 50% with NaiveBayes, so the best classifier for layer 58 is SVM. Layer 84 achieved a 98.79% accuracy with SVM, 76.90% with KNN, 92.08% with the ensemble, 73.57% with the decision tree, and 50% with NaiveBayes; hence, SVM is the best classifier for layer 84. Layer 150 achieved an accuracy of 96.64% with SVM, 70.71% with KNN, 90.18% with the ensemble, 71.55% with the decision tree, and 50% with NaiveBayes, so the best classifier for layer 150 is SVM. Layer 152 achieved an accuracy of 88.72% with SVM, 87.20% with KNN, 89.11% with the ensemble, 69.17% with the decision tree, and 81.43% with NaiveBayes; for this layer, the ensemble classifier slightly outperformed SVM. Overall, the SVM classifier achieved the best performance for all layers except layer 152, and the worst values came from NaiveBayes. With the SVM classifier, the layer achieving the highest values is layer 84. Therefore, we consider layer 84 with SVM as model 6 for classifying COVID-19 disease.

Table 8 Performance metrics values of different selected layers of MobileNet Model with different classifiers

In the Squeezenet model, the four layers selected to be tested with the five ML classifiers are shown in Table 9. Layer 20 achieved an accuracy of 96.10% with SVM, 86.49% with KNN, 90.95% with the ensemble, 75.18% with the decision tree, and 79.35% with NaiveBayes, achieving its best performance with the SVM classifier. Layer 27 achieved 97.31% accuracy with the SVM classifier, 85.60% with KNN, 90.65% with the ensemble, 71.13% with the decision tree, and 80.48% with NaiveBayes, so layer 27 also performed best with the SVM classifier. Layer 32 achieved 96.51% accuracy with SVM, 53.45% with KNN, 89.23% with the ensemble, 68.51% with the decision tree, and 50% with NaiveBayes, again performing best with the SVM classifier. Layer 34 achieved an accuracy of 97.98% with SVM, 85.36% with KNN, 93.39% with the ensemble, 74.17% with the decision tree, and 50% with NaiveBayes, with SVM as the best classifier. All layers achieved their best results with the SVM classifier, and the layer with the best SVM performance is layer 34, with an accuracy of 97.98%. Hence, layer 34 of the Squeezenet model with SVM is selected as model 7 for achieving the highest performance.

Table 9 Performance metrics values of different selected layers of the Squeezenet Model with different classifiers

With the Darknet19 model, the layers achieving the highest performance results are given in Table 10. For the SVM classifier, layer 33 achieved an accuracy of 98.12%, layer 39 achieved 97.85%, layer 41 achieved 97.85%, and layer 45 achieved 98.25%; therefore, the best layer for SVM is layer 45. For the KNN classifier, layer 33 achieved an accuracy of 89.52%, layer 39 achieved 90.89%, layer 41 achieved 85.36%, and layer 45 achieved 81.73%; hence, layer 39 achieved the best result with the KNN classifier. For the ensemble classifier, layer 33 achieved 92.44% accuracy, layer 39 achieved 93.39%, layer 41 achieved 93.10%, and layer 45 achieved 93.33%, so the best layer for the ensemble classifier is layer 39. The decision tree classifier achieved an accuracy of 74.11% with layer 33, 72.08% with layer 39, 72.44% with layer 41, and 75.71% with layer 45, performing best with layer 45. The NaiveBayes classifier achieved an accuracy of 77.74% with layer 33, 80.06% with layer 39, 50% with layer 41, and 50% with layer 45, performing best with layer 39. For each layer, the highest result is achieved with the SVM classifier; therefore, SVM is the best classifier for all layers, and layer 45 of the Darknet19 model with SVM is selected as model 8 for achieving the highest performance.

Table 10 Performance metrics values of different selected layers of the Darknet19 Model with different classifiers

For the Resnet50 model, the best results are achieved with layers 44, 62, 154, and 175, as shown in Table 11. Layer 44 achieved 94.36% accuracy with the SVM classifier, 87.68% with the KNN classifier, no result with the ensemble or decision tree classifiers, and 83.81% with NaiveBayes; the best classifier for layer 44 is SVM. Layer 62 achieved an accuracy of 95.57% with the SVM classifier, 70.71% with KNN, 85.59% with the ensemble, 70.24% with the decision tree, and 50% with NaiveBayes, performing best with the SVM classifier. Layer 154 achieved 98.25% accuracy with the SVM classifier, 97.84% with the KNN classifier, 94.23% with the ensemble, 75.36% with the decision tree, and 83.81% with NaiveBayes; the best classifier for layer 154 is SVM. Layer 175 achieved 90.46% accuracy with SVM, 88.45% with the KNN classifier, 89.76% with the ensemble, 73.51% with the decision tree, and 78.04% with NaiveBayes, also performing best with the SVM classifier. Overall, the SVM classifier achieved the best result for all layers, and layer 154 of the Resnet50 model with SVM has been selected as model 9 for achieving the highest performance.

Table 11 Performance metric values of different selected layers of the Resnet50 Model with different classifiers

The last model we tested was Resnet101, for which the best four layers are layers 68, 91, 142, and 247; their performance results are given in Table 12. Layer 68 achieved an accuracy of 95.83% with the SVM classifier and 82.92% with KNN, with no results for the ensemble, decision tree, and NaiveBayes classifiers. Layer 91 achieved an accuracy of 97.98% with SVM, 85.54% with KNN, 92.50% with the ensemble, no result with the decision tree, and 50% with NaiveBayes. Layer 142 achieved an accuracy of 98.92% with SVM, 86.49% with KNN, 90.12% with the ensemble classifier, 70.83% with the decision tree, and 82.74% with NaiveBayes. Layer 247 achieved an accuracy of 96.91% with SVM, 77.14% with KNN, 89.94% with the ensemble classifier, 71.61% with the decision tree classifier, and 50% with NaiveBayes, performing best with SVM. Across all tested layers, the SVM classifier with layer 142 of the Resnet101 model achieved the best performance and has been selected as model 10.

Table 12 Performance metrics values of different selected layers of the Resnet101 Model with different classifiers

We now have the 10 models with the best performance results among all the tested models. These 10 models were tested with the three dataset folds F-1, F-2, and F-3, as shown in Table 13. We selected the best three models based on the highest accuracy, as this is the most natural measure of performance. The best of the three is model 10, which achieved the highest average accuracy over the three folds and reached 98.98%; this model uses layer 142 of Resnet101 to extract features. The next model is model 6, with an accuracy of 99.39%, which uses layer 84 of the Mobilenetv2 model to extract features. The last model is model 3, with an accuracy of 98.98%, which uses layer 33 of the VGG19 model to extract features. All three models use SVM as the classifier.

Table 13 The best 10 models with the highest performance using SVM

To verify the credibility of the obtained results, a comparison with results from the related work is given in Table 14, where we compare our best 3 models with previous work on the classification of COVID-19. The best performance results are achieved with model 10, model 6, and model 3. Model 3, model 6, and model 10 achieved accuracies of 98.98%, 99.39%, and 98.98% with dataset fold F-3, respectively. Model 10 has the highest average accuracy of 98.69%, while model 6 has the highest single accuracy of 99.39% among all models. Among all the previous works mentioned, the highest reported accuracy was 98.66% in [30], which is lower than that of our best three models, as shown in Table 14.

Table 14 Comparison between our models and previous models

5.1 Time consumption

Time consumption is a critical factor in many aspects of life. In healthcare, it can be decisive for patient outcomes, especially in emergencies: the time required for diagnostic tests and imaging can significantly affect the speed and accuracy of diagnosis and treatment. In the case of COVID-19, for example, delays in test results can lead to delayed treatment and increased spread of the disease, and imaging tests such as CT scans can take a significant amount of time to perform and interpret, which can delay diagnosis and treatment. Technological advancements, however, have led to the development of faster and more efficient diagnostic and imaging methods. Overall, time consumption is a critical factor in healthcare, and efforts to reduce it can lead to improved patient outcomes and overall healthcare efficiency. In our methodology, model 6, which achieved the best accuracy among all models, required 20.892 s for the training stage. Excluding the training time of the pre-trained feature-extraction network, the average testing time per image was measured at 0.84 s.
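As a simple illustration of how the per-image testing time can be measured, a hedged sketch is given below; the fitted classifier and feature matrices are synthetic placeholders for the trained SVM and the extracted test features.

```python
# Illustrative timing sketch: measure the average per-image classification time.
import time
import numpy as np
from sklearn.svm import SVC

X_train, y_train = np.random.rand(200, 64), np.random.randint(0, 2, 200)
X_test = np.random.rand(80, 64)
clf = SVC(kernel="linear").fit(X_train, y_train)

start = time.perf_counter()
clf.predict(X_test)                                # classify all test feature vectors
elapsed = time.perf_counter() - start
print("Average testing time per image: %.4f s" % (elapsed / len(X_test)))
```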

6 Conclusion

Coronavirus disease (COVID-19), caused by SARS-CoV-2, is one of the major problems of the twenty-first century and has caused many injuries and deaths throughout the world over the previous 2 years. Computer-aided diagnosis has become an essential task in the fight against the virus's spread, and early detection of COVID-19 is critical for lowering patient mortality risk. Researchers are looking for quick answers using Machine Learning (ML) and Deep Learning (DL) approaches. In this research, we provide a hybrid model for COVID-19 identification based on machine learning and deep learning models. To extract features from CT scans, we employed 10 distinct deep CNN network models. We extracted features from several layers in each network and determined the optimal layer that produces the best features for each CNN network. Then, to classify these features, we employed five different machine learning classifiers. The dataset contains 2482 CT scans classified as COVID-19 and non-COVID-19 and is a public dataset from Kaggle. Three folds with different training and testing sizes were created. Experiments were performed to determine the optimum layer for each CNN network, the best network, and the best classifier. The measured performance demonstrated the suggested system's superiority over the literature, with a best accuracy of 99.39%. Our models were evaluated with three folds, and the average accuracy across the three folds is 98.69%.