1 Introduction

The contagious severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in Wuhan, China near the end of the year 2019. It was then declared as a pandemic by World Health Organization (WHO) at the beginning of March 2020. Due to its rapid airborne transmission, the number of infected cases of COVID-19 disease exceeded 82 million and the number of fatalities exceeded 1.7 million by the end of the year 2020. These numbers tragically reached nearly 200 million infections and 4.27 million fatalities by the beginning of August 2021.

Such a rapid increase in the number of cases by the pandemic surpassed the capacity of hospitals in many countries both instrument-wise and staff-wise which put the whole healthcare systems in jeopardy. Therefore, it was crucial to be able to detect and isolate infected people in a fast, cheap, and reliable manner. The real-time polymerase chain reaction (RT-PCR) test is widely used for diagnosis. It has the drawback of giving late results and it is also sometimes inconclusive in negative cases. On the other hand, it was clear that even at the early stages of the infection, the chest imaging was showing clear indications of the infection by COVID-19 which in turn helped to make faster and accurate decisions by doctors/radiologists.

Based on these findings, many studies have been carried out to automatically diagnose tomography and X-ray images using image processing and artificial intelligence techniques. Wehbe et al. [1] developed a system called DeepCOVID-XR where they combined six different deep learning models with a binary classifier. They tested their system on a database of 17,002 X-ray images with two classes (normal, COVID-19). Al-Waisy et al. [2] had a preprocessing stage for their model that they applied on a smaller database with 800 X-ray images. They fused the weighted decisions of the two trained proposed models and managed to obtain very high results. Their result is hard to be generalized due to their small database. Researchers in [3] developed a CNN model that extracted two types of features from the X-ray images. These features were then fed to 4 different classifiers before combining their outputs. The dataset contained 4600 X-ray images collected from different public databases and achieved a very high accuracy rate. Ismael et al. [4] proposed different deep learning-based approaches (deep feature extraction, fine-tuning and end-to-end training) to classify 180 COVID-19 and 200 healthy chest X-ray images. They obtained the highest result (94.7% accuracy score) with deep features extracted from the ResNet50 model and SVM classifier. Alam et al. [5] used HOG and CNN as feature extractors from X-ray images before fusing them. They used the fused features to train a 2-class (normal, COVID-19) VGG classification network. They achieved a 98.36% accuracy score with fivefold cross-validation on a dataset consisting of a total of 5090 X-ray images which included 1979 COVID-19 positive images.

The previously mentioned results worked on databases with normal and COVID-19 classes. Other researchers conducted their studies on databases with three classes, namely; normal, COVID-19 and pneumonia. Togacar et al. [6] for instance, used preprocessing fuzzy color technique. The original images in a small database of 458 X-ray images with 295 COVID-19 positive images were used to construct and stack new images. They combined the features obtained from two different pre-trained networks using social mimic optimization on the stacked dataset using three classes with an SVM classifier. Abbas et al. [7] developed a system called DeTraC where they used AlexNet for feature extraction and transfer learning method with ResNet18 as a pre-trained network. Their database originally had 196 X-ray images that included 105 COVID-19 positive images. They increased the images with augmentation to 1764 images. On the other hand, Ozturk et al. [8] built a model named DarkCovidNet which was inspired by the DarkNet. They tested their system on a database of 1125 X-ray images which included 125 COVID-19 positive images. Moreover, Chowdhury et al. [9] created a public COVID-19 X-ray images database of 3487 X-ray images by combining several public sources. They used transfer learning on different pre-trained networks for COVID-19 diagnosis with two and three classes. While, they achieved the best performance with the CheXNet model on original images and, on augmented images, the DenseNet201 model achieved the best performance. Later on, they enlarged their database to a total of 3886 images.

All the accuracy results of the aforementioned researches were promising and showed high potential for using X-ray images for automatic and prompt diagnosis and detections of the COVID-19 on the infected patients. In this paper, a novel application of a 3-stream fusion-based deep learning model is proposed for the detection of COVID-19 and Viral Pneumonia based on chest X-ray images. We proposed a model that uses deep learning by developing a 3-stream CNN model. The CNN model extracts and concatenates the features obtained from the grayscale, LBP, and HOG images before using it for making a final decision.

The rest of the paper is organized as follows: Sect. 2 describes the database used and algorithms. Section 3 explains the methodology of the proposed system. Results and findings are listed and discussed in Sect. 4. Finally, the paper is concluded in Sect. 5.

2 Database and algorithms

2.1 Chest X-ray database

In this study, the COVID-19 radiography database v3 created by Chowdhury et al. [9] was used. The database was created using several sources [10,11,12,13,14,15]. There are a total of 3886 images in the COVID-19 Radiography dataset, including 1200 Covid-19 positive, 1341 normal, and 1345 viral pneumonia. Sample images from the database are given in Fig. 1.

Fig. 1
figure 1

Sample images from database. a Normal b COVID-19 c Pneumonia

2.2 Local binary pattern LBP

Local binary pattern (LBP) is a powerful feature extraction method for texture classification developed by Ojala et al. [16]. In the LBP method, we consider a 3 × 3 window with the pixel of interest being at the center. The intensity of the center pixel of the window is considered as the threshold. This threshold is then compared with the other eight neighboring pixels’ intensity values. Each of the neighboring pixels will obtain the value zero if its intensity is below the threshold and the value one if its intensity is higher than or equal to the threshold. The result will be a binary 8-bit code that can be converted to a decimal value. The starting point of the 8-bit code is not important but it is essential to be consistent for all pixels and all images. This decimal value (which is in the range 0–255) will replace the value of the corresponding center pixel. With the application of these steps on every single pixel of the image, the LBP image can be obtained as shown in Fig. 2.

Fig. 2
figure 2

Calculation of the LBP image

2.3 Histograms of oriented gradients HOG

The histograms of oriented gradients (HOG) is a gradient calculation-based feature descriptor that was first used by Dalal and Triggs in their study on human detection [17]. Accordingly, the gradient magnitude and direction of each central pixel in the determined neighborhood are calculated with the help of sliding windows on the image. First, the feature descriptor is obtained by taking the difference between horizontal and vertical pixels neighboring the central pixel (Fig. 3).

Fig. 3
figure 3

Calculation of feature descriptor, magnitude and direction of the gradient

Using this feature descriptor, the magnitude (Eq. 1) and direction (Eq. 2) values of each pixel of the image are calculated. For the visualization of HOG, bin histograms are created from line segments by using the magnitudes and directions in the specified neighborhood and angle range.

$$ \begin{aligned} {\text{Magnitude }}\,{\text{of}}\,{\text{Gradient}} & = \sqrt {\left( {g_{x} } \right)^{2} + \left( {g_{y} } \right)^{2} } \\ & = \sqrt {\left( {50} \right)^{2} + \left( { - 50} \right)^{2} } \\ & = 70.71 \\ \end{aligned} $$
(1)
$$ \begin{aligned} {\text{Direction}}\,{\text{of}}\,{\text{Graident}} & = {\text{arctan}}\left( {\frac{{g_{y} }}{{g_{x} }}} \right) \\ & = {\text{arctan}}\left( {\frac{ - 50}{{50}}} \right) \\ & = - 45^{ \circ } \\ \end{aligned} $$
(2)

3 Methodology

Images in the dataset are first resized to 224 × 224 pixels. The local binary pattern (LBP) and histograms of oriented gradients (HOG) features are then calculated. Since the size of the virus-infected areas is relatively small compared to chest images, 3 × 3 window size is preferred. For this reason, eight neighboring pixels in LBP calculation, 3 × 3 neighboring pixels, and nine angle ranges in HOG calculation were assigned as parameter values. Sample images from the dataset and its obtained LBP, HOG images are shown in Fig. 4.

Fig. 4
figure 4

Images samples from dataset: a X-ray images b corresponding LBP images c corresponding HOG images

In this study, a multi-stream CNN with three inputs was designed. The three inputs are the grayscale X-ray image, the corresponding LBP, and HOG images. In the proposed network, each of the three input images passes in parallel through a Feature Extraction Module (FEM). The FEM consists of five feature extraction layers (see Fig. 5).

Fig. 5
figure 5

Feature Extraction Layer

Each of the FEM layers consists of a convolution layer with 3 × 3 filter size (has filter outputs of size 16, 32, 64, 128, and 256, respectively, in each of the five layers of FEM), a batch normalization layer (only at the first layer of FEM), a ReLU activation layer, an average pooling layer with 2 × 2 pool size and a dropout layer with 0.15 ratio (0.5 in the last layer of FEM). Features obtained from the three separate outputs (features) of the final FEM layers are combined as a single tensor at the concatenation layer.

The fused feature tensor passes to the classification module, which consists of the Flattening layer, the Dense layer of 512 neurons, the ReLU layer, the Dropout layer, and the Dense layer of 3 neurons. The classification result is obtained at the dense layer with the Softmax activation function. The proposed CNN architecture is shown in Fig. 6.

Fig. 6
figure 6

Architecture of the proposed 3-stream fusion-based CNN model

4 Results and discussion

All codes in this study were written in Python and ran in a Google Colaboratory environment. Tensorflow framework and Keras libraries were used to define the CNN model. In the proposed model, stochastic gradient descent was used as an optimization algorithm and categorical cross-entropy is used as a loss function. fivefold cross-validation was used to test model performance. This method divides the database into five equal subsets and each fold uses different subsets. In this study, three of these subsets were used for training, one for validation, and one for testing. Thus, all samples in the dataset are used for training, validation, and testing of the model (Fig. 7).

Fig. 7
figure 7

Splitting the database into five subsets with fivefold cross validation

The model was trained using 25 iterations and 16 batch-size. Figure 8 shows the accuracy and loss plot on model training for 5th fold.

Fig. 8
figure 8

Accuracy and loss plots on model training for 5th fold

In each fold, the model is compiled and trained from scratch. As a result, five different confusion matrices are obtained from model testing results. Figure 9 shows the confusion matrices obtained by evaluating the model on each fold.

Fig. 9
figure 9

Confusion matrices for each fold: a 1st Fold b 2nd Fold c 3rd Fold d 4th Fold e 5th Fold

Using the results of these matrices, TP—true positives (positive samples correctly predicted to be in their corresponding class by the model), TN—true negatives (negative samples correctly predicted to be in their corresponding class), FP-false positives (negative samples incorrectly predicted to be in other classes), and FN-false negatives (positive samples incorrectly predicted to be in other classes) were calculated for each class. Afterward, the precision (Eq. 3), recall (Eq. 4), specificity (Eq. 5), F1 Score (Eq. 6), and accuracy (Eq. 7) of the proposed model were calculated.

$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(3)
$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(4)
$$\mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$$
(5)
$$\mathrm{F}1\mathrm{ Score}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(6)
$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(7)

In the fivefold cross-validation, the performance scores for each fold were calculated, and the average score was taken at the end. Model performance scores such as precision, recall, specificity, F1-score, and accuracy for each fold and the corresponding average scores are given in Table 1.

Table 1 Fivefold cross-validation model performance scores

In Table 2, a comparison of the results obtained using the proposed algorithm against other most recent studies on the diagnosis of COVID-19 is given. All results in this table are conducted on public X-ray image databases with three classes. Wang et al. [18] developed a deep convolutional neural network called COVID-Net and achieved a 93.30% accuracy on a database of 13.975 X-ray images. Chandra et al. [19] extracted 8196 features using three different feature extractors and used a meta-heuristic binary grey wolf optimization approach for selecting the most informative features. They passed the selected features to a majority voting-based classifier ensemble of five benchmark supervised classification algorithms and managed to achieve 93.41% accuracy on three classes. On the other hand, Ibrahim et al. [20] used a pre-trained AlexNet model on publicly available multiclass X-ray images and achieved a 94% accuracy rate for three classes. Results in [21] show that they achieved a success rate of 94.2% with the developed 22-layer CNN model.

Table 2 Comparison of studies with 3-class public datasets for COVID-19 detection

Another CNN model named CFW-Net based on the channel feature weighting was developed by Wang et al. [22] reached a 94.35% accuracy rate. Khan et al. [23] applied the transfer learning method with the Xception model on a small dataset containing 284 COVID-19 positive images. They achieved a 95% accuracy performance rate for three classes which is not a high performance considering the size of their database. Meanwhile, Asif et al. [24] trained the Inception V3-based deep CNN model using a public dataset containing 864 COVID-19 positive images and achieved a 96% success rate. Bayoudh et al. [25] developed a transfer learning-based hybrid 2D/3D CNN architecture and achieved a 96.91% accuracy rate. Pham [26] achieved the highest performance with the AlexNet model reaching 97.59% for three classes with 438 COVID-19 images. The study was performed using a fine-tuning method with three different pre-trained CNNs. It is clear that the proposed feature fusion-based model in this study outperformed the other models considering the size of the used database.

The performance of deep learning models is highly proportional to the size of the database. As the database size increases, the stability of the model is positively affected. In the early days of the pandemic, it was not possible to reach many publicly available images. Later on, with the spread of the pandemic in all countries and the increase in the number of infected people, more images began to be shared. Accordingly, this enabled researchers to train their models using thousands of COVID-19 positive raw images instead of hundreds. Taking this into consideration, some of the high results obtained at the early stages of the pandemic based on small-sized databases such as in [6, 7] cannot be reliable. The performance reported on such databases should be evaluated on larger databases. Otherwise, the findings of these studies cannot be generalized.

A comparison of the proposed 3-stream fusion-based CNN model against 1-stream CNN models is presented in Table 3. 1-stream CNN model uses only one of the 3 images (grayscale, LBP or HOG image) as input to the model. The training of the fusion-based model and average prediction is taking relatively longer time than the 1-stream CNN models. Time consumption for training the model should not be an issue due to the fact that the process is one-time off-line process. One the other hand, the proposed 3-stream fusion-based model recorded better loss rate and accuracy.

Table 3 Comparison between the proposed 3-stream fusion-based CNN model and the corresponding 1-stream versions

5 Conclusions

In this paper, we proposed a deep learning-based diagnosis of COVID-19 on possibly infected patients using chest X-ray images. Besides the availability of X-ray imaging in most healthcare units, the other test techniques were expensive and gave late, and sometimes inconclusive, results. The proposed approach designed a 3-stream convolutional neural network (CNN) that extracts and concatenates features obtained from grayscale X-ray, LBP, and HOG images. The fivefold cross-validation experiments carried out on a large chest X-ray database with three classes showed that our proposed approach outperformed the recently developed algorithms with accuracy reached 97.76%. High results, from other studies, obtained using small databases were listed in Table 2 but were not taking into consideration. This is because such results using small databases cannot be reliable and cannot be generalized. Our results showed that the proposed model can make a significant contribution to the rapidly increasing workload in health systems with a deep learning-based automatic diagnosis tool. Moreover, having a standard benchmark database that can be used by researchers to evaluate and compare their work against others will be of great value. It will make it easier to evaluate proposed algorithms objectively without any biasing. Also, larger databases will help to obtain reliable and trustworthy results and findings.