1 Introduction

Coronavirus disease 2019 (COVID-19) has recently become an unprecedented public health crisis worldwide [1]. At the end of December 2019, patients with a previously unknown respiratory disease were identified in Wuhan, Hubei Province, China [2]. By January 25, 2020, the diagnosis of COVID-19 had been confirmed in at least 1975 more patients since the first patient was hospitalized on December 12, 2019. COVID-19 is caused by a new coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2, 3]. The typical symptoms of COVID-19 include fever, shortness of breath, dizziness, cough, headache, sore throat, fatigue, and muscle pain [2, 3, 4]. After the first case of COVID-19 was discovered in Wuhan, the virus rapidly spread to 216 countries worldwide, largely due to human-to-human transmission early in the clinical course [1]. The COVID-19 pandemic has imposed substantial demands on the public health systems, health infrastructure, and economies of most countries worldwide [5]. Because the total number of people infected by SARS-CoV-2 has increased rapidly, the capacity of healthcare systems (i.e., beds, ventilators, care providers, masks, etc.) is insufficient to meet the demand. Due to the rapid transmission of SARS-CoV-2 from person to person, millions of people have been infected, more than four billion people have been instructed to remain at home, and many people have lost their jobs [1, 2, 5]. Severe COVID-19 has caused deaths worldwide [6]. As reported by the World Health Organization (WHO) on November 17, 2020 [6], the numbers of patients with confirmed cases of COVID-19, recovered COVID-19 patients, and non-surviving COVID-19 patients were 55.4M, 38.6M, and 1.3M, respectively. Moreover, education systems have been negatively affected by the COVID-19 pandemic, and schools and universities have switched to remote learning.

To date, the most widely used screening tool for the detection and diagnosis of COVID-19 has been real-time reverse-transcription polymerase chain reaction (RT-PCR) [7]. Radiological imaging techniques such as chest digital X-ray (CXR) and computed tomography (CT) are the standard screening tools used to detect and diagnose chest respiratory diseases, including COVID-19, early in the clinical course [1, 8]. Due to the low sensitivity of RT-PCR, radiological images are also used for diagnostic purposes in patients with symptoms of respiratory diseases. Although CT is the gold standard, primary chest digital X-ray systems are still useful because they are faster, deliver a lower dose of radiation, are less expensive, and are more widely available [4, 8]. Indeed, CT scans or X-rays should be routinely obtained in addition to RT-PCR results to improve the accuracy of the diagnosis of COVID-19 [8]. However, the large number of patients who test positive for SARS-CoV-2 makes regular daily screening challenging for physicians. Thus, on March 16, 2020, the United States administration encouraged experts and researchers to employ artificial intelligence (AI) techniques to combat the COVID-19 pandemic [1]. Currently, experts have started to use machine learning and deep learning technologies to develop computer-aided diagnosis (CAD) systems to assist physicians in increasing the accuracy of the diagnosis of COVID-19 [1, 8]. In the last few years, the use of deep learning methods as adjunct screening tools for physicians has attracted a great deal of interest. Deep learning CAD systems have been shown to be capable and reliable, and promising diagnostic performance has been achieved using the entire image without user intervention [9, 10]. The use of a deep learning CAD system could assist physicians and improve the accuracy of the diagnosis of COVID-19 [1].
Deep learning CAD systems have been successfully applied to predict different medical problems, such as breast cancer [9, 10], skin cancer [11, 12], and respiratory disease, using digital X-ray images [8]. The rapid spread of the COVID-19 pandemic and the resulting deaths worldwide make it necessary to apply deep learning technologies to develop CAD systems that can improve diagnostic performance. This need was the motivation for developing a deep learning CAD system to diagnose COVID-19 based on entire digital X-ray images.

In this paper, our contributions to the diagnosis of COVID-19 based on digital X-ray images are as follows. First, a simultaneous deep learning CAD system that uses the YOLO predictor was adopted to detect and diagnose COVID-19 directly from entire chest X-ray images. Second, COVID-19 is differentiated from eight other respiratory diseases in a multiclass recognition problem. Third, deep learning regularizations of data balancing, augmentation, and transfer learning were also applied to improve the overall diagnostic performance for COVID-19. Finally, our proposed CAD system was trained and optimized with five-fold tests using data from two different digital X-ray datasets, COVID-19 [13, 14] and ChestX-ray8 [15]. The outcomes of this study can be used to guide other researchers when developing novel deep learning CAD frameworks to accurately diagnose COVID-19.

The objective of this work was to provide a practical and feasible CAD system based on AI that can help physicians, patients, healthcare systems, and hospitals by facilitating the faster and more accurate diagnosis of COVID-19.

The rest of this paper is organized as follows. A review of the relevant literature is presented in Section 2. The technical aspects of the deep learning CAD-based YOLO system are detailed in Section 3. The results of the experiment with COVID-19 are reported and discussed in Sections 4 and 5. Finally, the most important findings of this work are summarized in Section 6.

2 Related works

Starting in 2020, after the discovery of COVID-19, some artificial intelligence (AI) systems based on deep learning have been employed to detect COVID-19 on digital X-ray and CT images. In [16], Oh et al. presented a patch-based deep learning CAD system consisting of segmentation and classification stages that could identify COVID-19 based on CXR images. With regard to segmentation, FC-DenseNet103 was used to segment and extract the full lung regions from the entire CXR images. With regard to classification, multiple random patches (i.e., regions of interest) were extracted from the segmented lung regions for use as the input for the classification DL model. They used CXR images from healthy individuals and from patients who were diagnosed with bacterial pneumonia, tuberculosis, and viral pneumonia associated with COVID-19. They achieved an F1-score of 84.40% and an overall accuracy of 88.9%. Ozturk et al. [8] proposed the deep learning model DarkCovidNet, which can automatically detect COVID-19 based on digital chest X-ray images. They developed their model using 17 convolutional layers with the aim of achieving binary classification (i.e., COVID-19 and no finding) and multinomial classification (i.e., COVID-19, no finding, and pneumonia) diagnoses. They achieved overall classification accuracies of 98.08% and 87.02% for the binary and multinomial classifications, respectively. Fan et al. [17] proposed a deep learning model called Inf-Net that can be used to identify or segment suspicious regions indicative of COVID-19 on chest CT images. They used a parallel partial decoder to generate the global representation of the final segmented maps. After that, they used implicit reverse attention and explicit edge attention to enhance the segmented boundaries. They achieved segmentation accuracies of 73.90% and 89.40% with regard to Dice and the enhanced-alignment index, respectively.
In May 2020, Wang et al. [18] proposed COVID-Net, which was based on a deep learning model and could differentiate patients with COVID-19 from healthy individuals and those with pneumonia based on digital X-ray images. The classification performance of their model was compared with those of VGG-19 and ResNet-50 using the same database of digital X-ray images [18]. The authors concluded that COVID-Net outperformed VGG-19 and ResNet-50, with positive predictive values (PPVs) of 90.50%, 91.30%, and 98.9% for healthy, pneumonia, and COVID-19, respectively. Hamdan et al. [19] presented a deep learning COVIDX-Net model that can be used to distinguish between COVID-19 patients and healthy individuals based on 50 digital chest X-ray images. They used seven well-established deep networks as feature extractors and compared their classification results. Compared with the other deep learning models, VGG-19 and DenseNet201 had the highest diagnostic performance value of 90%. Apostolopoulos et al. [20] tested the ability of five well-established deep learning networks to detect COVID-19 on digital X-ray images. They used three classes, namely, normal, pneumonia, and COVID-19, and they achieved the best overall classification accuracy of 93.48% with VGG-19. Additionally, they tested all five deep learning models with regard to the binary classification problem (i.e., COVID-19 against non-COVID-19), and they achieved the highest accuracy of 98.75% with VGG-19. Sakshy et al. [21] proposed a three-phase deep learning detection model to detect COVID-19 on CT images with a binary classification task. They used data augmentation, transfer learning, and abnormality localization with different backend deep learning networks: ResNet18, ResNet50, ResNet101, and SqueezeNet. They concluded that the pre-trained ResNet18 using the transfer learning strategy achieved the best diagnostic results of 99.82%, 97.32%, and 99.40% in the training, validation, and test sets, respectively.
Khan et al. [22] proposed a deep learning convolutional neural network (i.e., CoroNet) that could be used to diagnose COVID-19 as a multiclass problem based on whole chest X-ray images. They achieved an overall accuracy of 89.6% for the identification of COVID-19 from among bacterial pneumonia, viral pneumonia, and normal images. Narin et al. [23] compared the classification performances of three different deep learning convolutional neural networks (i.e., ResNet-50, InceptionV3, and InceptionResNetV2) using chest X-ray images. They evaluated the ability of those three models to differentiate patients with COVID-19 from individuals without COVID-19, and they achieved the best classification accuracy of 98% using ResNet-50. Ardakani et al. [24] evaluated ten different well-established DL models to diagnose COVID-19 on CT scans in routine clinical practice. They differentiated between COVID-19 and non-COVID-19 with a binary classification task, and they achieved the best diagnostic result using the ResNet-101 and Xception DL networks, with an overall accuracy of 99.40%. Pereira et al. [7] presented a classification scheme based on well-known texture descriptors and a convolutional neural network (CNN). They used a resampling algorithm to balance the training dataset for a multiclass classification problem. Their model achieved an average F1 score of 65%. Moreover, comprehensive survey studies on deep learning applications pertaining to COVID-19 are presented in [25, 26]. Such deep learning methods have been employed to diagnose COVID-19 on entire X-ray images. This is due to the lack of X-ray images with annotated regions of suspected lesions. However, it is not practical to use the entire X-ray image to achieve a reliable diagnosis of COVID-19 [27]. Thus, the detection of suspicious regions specific to individual respiratory diseases is critical for achieving a more accurate diagnosis because it could be used to derive more representative deep features of the abnormalities.
To our knowledge, this is the first regional convolutional deep learning CAD system developed to simultaneously detect COVID-19 and differentiate it from other respiratory diseases based on chest X-ray images. The automatic detection of COVID-19 is a major challenge for researchers. The promising diagnostic results of our previous breast cancer CAD system using the YOLO predictor [9, 10] encouraged us to employ a similar system to detect and classify COVID-19, with the aim of enhancing the diagnosis of COVID-19.

3 Material and methods

Deep learning computer-aided diagnosis (CAD) based on the YOLO predictor was used to simultaneously detect COVID-19 and differentiate it from eight other respiratory diseases: atelectasis, infiltration, pneumothorax, masses, effusion, pneumonia, cardiomegaly, and nodules. The CAD system presented in this paper has a unique deep learning framework structure; the system has been validated and can simultaneously detect and classify COVID-19. Figure 1 is a conceptual diagram of the proposed CAD system.

Fig. 1

Schematic diagram of the proposed deep learning CAD system based on the YOLO predictor

3.1 Digital X-ray images dataset

We used two different digital chest X-ray databases, namely, COVID-19 [13, 14] and ChestX-ray8 [15]. The data distributions for the two datasets are shown in Fig. 2.

Fig. 2

Data distribution over all nine classes of respiratory diseases. The dataset for each class was randomly split into 70%, 20%, and 10% for the training, testing, and validation sets, respectively

3.1.1 COVID-19 dataset

The COVID-19 dataset used in this study was collected from two different publicly available sources. First, we used the digital X-ray images from patients with COVID-19 collected by Cohen et al. [13] from different public sources, hospitals and radiologists. These images are publicly available to help expert researchers develop AI based on deep learning approaches to improve the diagnosis and understanding of COVID-19. Researchers from different countries try to constantly update these datasets and add more X-ray images. In this study, we used the available X-ray images acquired from 125 patients with COVID-19 (82 males and 43 females). Unfortunately, complete metadata were not yet available for all these patients. Age was provided for only 26 patients; the average age was 55 years. Second, we used digital X-ray images from patients with COVID-19 collected by a research team from Qatar University [14]. All these images are publicly available in portable network graphic (png) file format with a size of 1024 × 1024 pixels. This dataset is publicly provided for researchers to develop useful and impactful AI models with the aim of addressing the COVID-19 crisis. The metadata were not yet available for all patients with COVID-19. In this study, we used all available digital X-ray images from 201 patients with COVID-19. Thus, a total of 326 CXR images were collected and used to develop the proposed CAD system. The classification labels for these images are publicly available, but the information regarding the GT localization (i.e., bounding box) is not yet available for either COVID-19 dataset. This is because the CXR images are rapidly collected in the context of the pandemic. To locate the abnormalities, we asked two expert radiologists to annotate the abnormalities (i.e., lesions associated with COVID-19) localizations in a parallel manner. 
Since some CXR images were provided by the authors with small white/black arrows showing the localization of the COVID-19 lesions, as shown in Figure 3a-c, we compared the experts' opinions with the existing annotations and marked the suspected lesions with a rectangle. Each GT bounding box was determined by the coordinates corresponding to the width (w), height (h), and center (x, y) of the abnormality. Figure 3 shows some examples of COVID-19 lesions with the associated GT information.

Fig. 3

Example cases of COVID-19 in different patients. The ground-truth (GT) information of the bounding box (i.e., green) for each case is superimposed on the original chest X-ray (CXR) image. The GT information was determined by expert physicians

3.1.2 ChestX-ray8

The ChestX-ray8 [15] dataset is the most frequently used and widely accessible medical imaging examination dataset available for eight different respiratory diseases: atelectasis, infiltration, pneumothorax, masses, effusion, pneumonia, cardiomegaly, and nodules. In this study, we used all CXR images with ground truth (GT) information involving the disease class label and the disease localization information as a labeled bounding box. The information pertaining to the GT bounding box (i.e., the starting point of the box (x, y), width (w), and height (h)) for each image is publicly available in the XML file [15]. As shown in Fig. 2, a total of 984 frontal-view CXR images were used, which were representative of eight different respiratory diseases. These images were converted from DICOM format into '.png' file format with a size of 1024 × 1024 pixels. Figure 4 shows an example of an X-ray image for each disease class with the associated GT information.

Fig. 4

Example cases of eight common respiratory diseases in different patients from the ChestX-ray8 dataset [15]: a atelectasis, b infiltration, c pneumothorax, d mass, e effusion, f pneumonia, g cardiomegaly, and h nodule. The ground-truth (GT) information of the bounding box (i.e., green) for each case is superimposed on the original image
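The two datasets describe a GT box differently: the COVID-19 annotations use the box center (x, y) with width and height, whereas ChestX-ray8 gives the starting point of the box with width and height. A minimal sketch of bringing the second form into the first is shown below. It assumes that the starting point is the top-left corner and that coordinates are normalized by the 1024-pixel image size; both are assumptions made for illustration, and the function name `corner_to_center` is hypothetical.

```python
def corner_to_center(x, y, w, h, image_size=1024):
    """Convert a pixel box given by its starting corner (x, y), width w and
    height h into a center-format box (cx, cy, w, h) normalized to [0, 1].

    Assumes (x, y) is the top-left corner and the image is square.
    """
    cx = (x + w / 2.0) / image_size
    cy = (y + h / 2.0) / image_size
    return cx, cy, w / image_size, h / image_size


# Hypothetical box: corner at (256, 128), 512 px wide and 256 px tall.
print(corner_to_center(256, 128, 512, 256))  # (0.5, 0.25, 0.5, 0.25)
```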

3.2 Data preparation: Training, validation, and testing

To fine-tune and evaluate the proposed CAD system, the COVID-19 [13, 14] and ChestX-ray8 [15] datasets were used. As shown in Fig. 2, the chest X-ray images for each disease class were randomly divided as follows: 70% in the training dataset, 20% in the evaluation (testing) dataset, and 10% in the validation dataset [9, 10]. The hyperparameters of the proposed deep learning system were selected during the training process using the training and validation datasets. After that, the final performance of the proposed CAD system was assessed using the evaluation set. In addition, our proposed CAD system was assessed using five-fold tests on the training, validation, and evaluation datasets. These sets were generated by stratified partitioning to ensure equal testing of each X-ray image and to avoid system bias error. It is important to use k-fold cross-validation to develop a robust, reliable, and efficient CAD system, especially given the small sizes of medical datasets [9, 10, 11]. In addition, to prevent the proposed prediction model from becoming biased during the learning process due to an unbalanced training set, we used the following techniques. First, the training set for each mini-batch was automatically shuffled. Second, a weighted cross-entropy was used as a loss function to optimize the deep learning trainable parameters [28].
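The per-class 70%/20%/10% partition can be sketched as follows. This is only an illustration of a stratified split, not the authors' code: the function name `stratified_split`, the rounding rule, and the random seed are assumptions, and the second class count in the example is made up. A five-fold test would repeat such a partition with different held-out portions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return (train, test, val) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test, val = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_test = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        test += idxs[n_train:n_train + n_test]
        val += idxs[n_train + n_test:]  # remainder goes to validation
    return train, test, val


# 326 COVID-19 images as in the text; the pneumonia count is hypothetical.
labels = ["COVID-19"] * 326 + ["pneumonia"] * 100
train, test, val = stratified_split(labels)
print(len(train), len(test), len(val))
```

Splitting inside each class keeps the class proportions the same in all three subsets, which matters when one class is much larger than the others.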

3.2.1 Balancing and augmentation strategies for the training dataset

Data balancing and augmentation strategies were applied to enlarge the size of the training dataset, avoid overfitting, and accelerate the learning process [9, 10]. These practical solutions were successfully applied to address the challenge of small datasets of annotated medical images [9, 10]. During training, each mini-batch included an almost equal number of digital X-ray images for each disease class [29, 30]. This was to avoid overfitting and prevent the performance of the deep learning model from being biased towards the disease class with the largest number of images (i.e., COVID-19). To balance the training sets and avoid having a majority of images related to COVID-19, the training images from the eight disease classes in the ChestX-ray8 dataset were flipped twice (i.e., left-right and up-down), generating 1378 chest X-ray images. Thus, the total number of images in all disease classes in the training set after balancing was 2295 (i.e., 917 original images from all disease classes including COVID-19 and 1378 balanced images from eight disease classes from the ChestX-ray8 dataset).

After data balancing, an augmentation strategy was applied for all nine disease classes as follows. First, the original chest X-ray images were randomly scaled and translated ten times. Second, the X-ray images for each class were rotated around the origin center by 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°. Finally, the rotated X-ray images for each class with θ = 0° and 270° were flipped left-right and up-down. This ensured that each X-ray image for each balanced class was augmented 22 times. Thus, a total of 50,490 X-ray images were generated and used to train our proposed CAD system. For each k-fold test, the same data balancing and augmentation strategy was utilized. In addition, transfer learning was applied to initialize the trainable parameters using ImageNet [9, 10]. Then, the deep learning CAD system was fine-tuned using our training set of chest X-ray images [31].
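The image counts quoted in this subsection can be checked with a few lines of arithmetic. The sketch below assumes that the 70% training share of each dataset is rounded to the nearest integer, which reproduces the 917 original training images stated above; the variable names are illustrative only.

```python
# Dataset sizes taken from the text.
chestxray8_total = 984
covid_total = 326
train_fraction = 0.7

# Assumption: the training share is rounded to the nearest whole image.
chest_train = round(chestxray8_total * train_fraction)  # 689
covid_train = round(covid_total * train_fraction)       # 228
original_train = chest_train + covid_train              # 917

# Balancing: each ChestX-ray8 training image is flipped twice.
balanced_extra = 2 * chest_train                         # 1378
balanced_total = original_train + balanced_extra         # 2295

# Augmentation: 10 scale/translate copies, 8 rotations,
# and 2 flips for each of the 0 and 270 degree rotations.
copies_per_image = 10 + 8 + 2 * 2                        # 22
augmented_total = balanced_total * copies_per_image      # 50490

print(original_train, balanced_extra, balanced_total, augmented_total)
```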

3.3 The concept of the deep learning CAD system

To simultaneously predict (detect and classify) COVID-19 from among the other respiratory diseases, a deep learning CAD system based on the YOLO predictor was adopted and used. With regard to object detection, previous studies have employed conventional image processing algorithms, machine learning classifiers, or complex deep learning pipelines [9, 10]. In contrast, our proposed CAD system is a regressor model that can simultaneously detect the localization of potential disease lesions and predict the probabilities of those lesions belonging to specific disease classes [10]. It has a robust ability to simultaneously learn the characteristics of the entire input X-ray image and the background. Thus, it can locate regions with lesions indicative of respiratory diseases with fewer background errors than other existing methods [30]. In addition, it has a unique deep learning structure allowing it to simultaneously optimize trainable parameters end-to-end to tune the training weights for the detection and classification tasks. Unlike the Faster R-CNN [32] and sliding window [33] methods, YOLO inspects the regions suspected of containing disease lesions directly in the context of the entire chest X-ray images. The conceptual diagram of the CAD-based YOLO predictor is shown in Fig. 1.

In fact, the YOLO predictor starts by dividing the input X-ray image into N × N grid cells, as shown in Fig. 1. If the lesion (i.e., the lesion associated with COVID-19 or any other respiratory diseases) center falls into any grid cell, that cell is responsible for predicting that disease. For each grid cell, five anchors (i.e., bounding boxes) are assigned and used to predict the disease class to which the lesion belongs (i.e., COVID-19, pneumonia, etc.). For each anchor, YOLO predicts the disease class of the lesions based on five prediction parameters: center location (x,y), width (w), height (h), and confidence score probability (Prconf.). The confidence score interprets the YOLO-based confidence that the predicted box contains a lesion and how accurate it expects the representation of the final output prediction by that box to be.
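The rule that the grid cell containing the lesion center is responsible for the prediction can be sketched as follows. The example assumes a 448 × 448 input and a 7 × 7 grid (values given later in this paper for the input scaling and the grid size); the function name `responsible_cell` is hypothetical.

```python
def responsible_cell(cx, cy, image_size=448, grid=7):
    """Return the (row, col) of the grid cell that contains the box center.

    cx, cy are the center coordinates in pixels of a lesion's GT box.
    """
    cell = image_size / grid  # side length of one grid cell in pixels
    col = min(int(cx // cell), grid - 1)  # clamp centers on the right edge
    row = min(int(cy // cell), grid - 1)  # clamp centers on the bottom edge
    return row, col


# Hypothetical lesion centered at x = 224, y = 100 on a 448 x 448 image.
print(responsible_cell(224, 100))  # (1, 3)
```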

During the training process, the predicted confidence of each anchor is calculated by multiplying the probability that a respiratory disease lesion exists (i.e., the object probability) by the intersection over union (IoU) between the predicted box and the GT box as follows:

$$ \mathrm{Confidence}\ \left({\Pr}_{\mathrm{conf}.}\right)=\mathrm{Prob}\left(\mathrm{Object}\ \right)\times {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}}. $$
(1)
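As an illustration of Eq. (1), the following sketch computes the IoU of a predicted box and a GT box and multiplies it by the object probability. Boxes are assumed to be in center format (x, y, w, h); the function names and the example values are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def box_confidence(prob_object, predicted, ground_truth):
    """Eq. (1): Confidence = Prob(Object) x IoU(predicted, GT)."""
    return prob_object * iou(predicted, ground_truth)


gt = (100, 100, 100, 100)    # hypothetical GT box
pred = (125, 100, 100, 100)  # prediction shifted 25 px to the right
print(iou(pred, gt))                            # 0.6
print(round(box_confidence(0.9, pred, gt), 2))  # 0.54
```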

If the grid cell does not contain any respiratory disease lesion, the confidence of all bounding boxes of that cell should be zero. In contrast, if any suspected disease lesion falls in that grid cell, Prob(Object) should be greater than zero. Thus, the confidence of all bounding boxes of that cell should also be greater than zero. The network is optimized to achieve the highest object probability and the highest object confidence. Based on both the object probability and \( {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}} \), the coordinates of all bounding boxes are simultaneously optimized and adjusted to fit the object that falls in the specific grid cell. During the training process, each grid cell predicts the conditional class probabilities \( \mathrm{Prob}\left({\mathrm{Class}}_{\mathrm{i}}|\mathrm{Object}\right) \) for all nine disease classes (i.e., COVID-19 and other respiratory diseases). During training, the confidence score for a detected bounding box is determined based on the conditional class probabilities as follows:

$$ \mathrm{Confidence}\ \mathrm{Score}=\mathrm{Prob}\left({\mathrm{Class}}_{\mathrm{i}}|\mathrm{Object}\right)\times \mathrm{Confidence},\kern1em i=1,2,\dots, 9 $$
(2)

where

$$ \mathrm{Prob}\left({\mathrm{Class}}_{\mathrm{i}}|\mathrm{Object}\ \right)=\frac{\mathrm{Prob}\left({\mathrm{Class}}_{\mathrm{i}}\right)}{\mathrm{Prob}\left(\mathrm{Object}\right)}. $$
(3)

Then,

$$ \mathrm{Confidence}\ \mathrm{Score}=\frac{\mathrm{Prob}\left({\mathrm{Class}}_{\mathrm{i}}\right)}{\mathrm{Prob}\left(\mathrm{Object}\right)}\times \mathrm{Prob}\left(\mathrm{Object}\ \right)\times {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}}=\mathrm{Prob}\left({\mathrm{Class}}_{\mathrm{i}}\right)\times {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}} $$
(4)

During testing, to obtain the confidence score when there is no GT, the conditional class probability is multiplied by the individual box confidence value. The detected bounding boxes with the highest confidence values indicate that COVID-19 or another respiratory disease is present, which should be considered the final prediction output. However, the confidence score probability for each detected bounding box encodes the probability for each disease class and how well each box fits the classes of respiratory diseases. The confidence score for each box is computed as follows:

$$ {\mathrm{Confidence}\ \mathrm{Score}}_{{\mathrm{Box}}_i}=\mathrm{argmax}\ \left\{\left({\Pr}_{\mathrm{conf}{.}_i}\times \Pr \left({\mathrm{class}}_i|\mathrm{Object}\right)\right),\left({\Pr}_{\mathrm{conf}{.}_{i+1}}\times \Pr \left({\mathrm{class}}_{i+1}|\mathrm{Object}\right)\right),\dots \mathrm{etc}\right\}\ i=1,2,3,\dots, \mathrm{and}\ 9. $$
(5)

For each bounding box, only one disease class is predicted and assigned (i.e., COVID-19, pneumonia, mass, etc.). As long as all bounding boxes are assigned to the same grid cell, the disease class for these boxes should be the same, but they can have different confidence values and conditional probabilities. Finally, the detected box that has the maximum confidence probability is used to determine the final predicted output of the proposed CAD system. Moreover, all other detected bounding boxes that have \( {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}}<45\% \) and lower confidence scores are suppressed using the non-max suppression (NMS) algorithm.
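The test-time scoring described above, in which the box confidence is multiplied by each conditional class probability and the best-scoring class is kept, can be sketched as follows. The class order, the function names, and the probability values are assumptions chosen for illustration.

```python
CLASSES = [
    "atelectasis", "infiltration", "pneumothorax", "mass", "effusion",
    "pneumonia", "cardiomegaly", "nodule", "COVID-19",
]


def class_confidence_scores(box_confidence, class_probs):
    """Multiply the box confidence by each conditional class probability."""
    return [box_confidence * p for p in class_probs]


def best_class(box_confidence, class_probs, class_names=CLASSES):
    """Return the class with the highest class-specific score and that score."""
    scores = class_confidence_scores(box_confidence, class_probs)
    idx = max(range(len(scores)), key=scores.__getitem__)
    return class_names[idx], scores[idx]


# Hypothetical conditional class probabilities for one detected box.
probs = [0.02, 0.03, 0.01, 0.01, 0.02, 0.15, 0.01, 0.01, 0.74]
label, score = best_class(0.8, probs)
print(label, round(score, 3))
```

The box whose best score is largest over all detections would then be reported as the final output.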

3.3.1 Deep learning structure of the CAD system

The structure of the proposed CAD system involves convolutional (Conv.) layers, fully connected (FC) layers, and a tensor of prediction (ToP), as shown in Fig. 5. Deep high-level features are extracted with 23 sequential convolutional layers, while the coordinates of the detected bounding boxes and the output probabilities are predicted with two FC layers. The total number of derived deep-feature maps depends mainly on the number of convolutional kernels that are used for each convolutional layer. Moreover, convolution reduction layers with a kernel size of 1 × 1 are added, followed by 3 × 3 convolutional layers, as shown in Fig. 5. This structure is used to reduce the size of the derived feature representations and compress them [9, 10]. In addition, a batch normalization (BN) layer is used after each convolutional layer to reduce overfitting, accelerate convergence, and stabilize the training of the deep network [9, 10]. Down-sampling using max-pooling (MP) with a size of 2 × 2 is applied five times after the convolutional layers to minimize the dimensionality of the derived deep-feature maps and select the most appropriate deep features. The aggregated deep-feature maps from the last convolutional layer are concatenated and flattened using global average pooling (GAP) to feed directly into the fully connected layers. The numbers of nodes or neurons for the first and second dense layers are modified to 512 and 4096, respectively. The final output of the proposed model is called a tensor of prediction (ToP), which contains all detected predictors of the five anchors: coordinates (x, y, w, and h), confidence scores (\( {\Pr}_{\mathrm{conf}.} \)), and the conditional class probabilities of all nine disease classes (\( {\Pr}_{\text{COVID-19}} \), \( {\Pr}_{\mathrm{Pneumonia}} \), etc.). These predictors are encoded in the 3D matrix of the ToP with a size of N × N × (5 × B + C), where N × N is the number of grid cells, B is the number of anchors, and C is the number of classes [27].
As mentioned above, the input X-ray image is divided into 7 × 7 nonoverlapping grid cells, and each grid cell should detect any lesion (caused by COVID-19 or the other respiratory diseases) in that cell. The size of 7 × 7 was chosen to achieve the best performance, as shown in our previous studies. Meanwhile, five anchors or bounding boxes (i.e., B = 5) are used to detect the object in each grid cell. The proposed CAD system was built to detect and recognize nine classes of respiratory diseases (i.e., C = 9). Thus, the final output represents a 3D ToP with a size of 7 × 7 × 34. This means that the actual output layer of the fully connected layer has 7 × 7 × 34 or 1666 neurons. Each set of 34 neurons in the output FC layer is responsible for predicting all parameters of the five bounding boxes for each grid cell in the original chest X-ray image. Here, the key is that each grid cell can only make local predictions for its region of the input X-ray image. The proposed prediction model has the capability to detect and classify respiratory diseases faster than other recent detection methodologies. Moreover, the leaky rectified linear activation function is utilized in all the convolutional and fully connected layers, while the ReLU activation function, ϕ(x) =  max (0, x), is only utilized in the final dense layer [27]. The leaky rectified linear activation function ϕ(θi) is expressed as the linear transformation of the input θi with a nonzero slope for the negative part of the activation function as follows:

$$ \upphi \left({\uptheta}_{\mathrm{i}}\right)=\begin{cases}{\uptheta}_{\mathrm{i}}, & \mathrm{if}\ {\uptheta}_{\mathrm{i}}>0\\ 0.1\times {\uptheta}_{\mathrm{i}}, & \mathrm{otherwise}.\end{cases} $$
(6)
Fig. 5

Deep learning structure of the proposed CAD System
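The size of the ToP and the leaky rectified linear activation of Eq. (6) can be reproduced with a short sketch. It follows the formula N × N × (5 × B + C) exactly as stated in the text; the function names are illustrative and not part of the original system.

```python
def top_shape(grid=7, anchors=5, classes=9):
    """Shape of the tensor of prediction: N x N x (5 x B + C)."""
    return (grid, grid, 5 * anchors + classes)


def leaky_relu(theta, slope=0.1):
    """Eq. (6): identity for positive inputs, 0.1 x input otherwise."""
    return theta if theta > 0 else slope * theta


shape = top_shape()
print(shape, shape[0] * shape[1] * shape[2])  # (7, 7, 34) 1666
print(leaky_relu(2.0), leaky_relu(-2.0))      # 2.0 -0.2
```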

3.4 Experimental setting

The input digital CXR images were scaled using bilinear interpolation to a size of 448 × 448 pixels [9, 10]. In addition, the intensity of all CXR images was linearly normalized to the range [0, 1] as in [9, 10]. A multiscale training strategy was used to learn predictions across different resolutions of the input X-ray images [34]. Since the proposed network downsamples the derived deep-feature maps five times (i.e., by a factor of 2^5 = 32), the network randomly chose a new image dimension every 10 batches in multiples of 32 (i.e., 320, 352, …, 608). Thus, the smallest input resolution was 320 × 320, and the largest input resolution was 608 × 608. Moreover, a mini-batch size of 24 and 120 epochs were used to train and validate the proposed CAD system.
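The multiscale schedule can be sketched as follows: the candidate input sizes are the multiples of 32 between 320 and 608, and a new size is drawn every 10 batches. The helper names, the starting size of 448, and the seed are assumptions for illustration only.

```python
import random


def candidate_sizes(smallest=320, largest=608, stride=32):
    """All square input sizes between smallest and largest in steps of stride."""
    return list(range(smallest, largest + 1, stride))


def size_schedule(num_batches, every=10, start=448, seed=0):
    """Return the input size used for each batch, redrawn every `every` batches."""
    rng = random.Random(seed)
    sizes, current = [], start
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            current = rng.choice(candidate_sizes())
        sizes.append(current)
    return sizes


print(candidate_sizes())
# [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
schedule = size_schedule(30)
print(schedule[0], schedule[10], schedule[20])
```

Because five 2 × 2 pooling steps reduce each side by a factor of 32, every candidate size maps to a whole number of feature-map cells (from 10 × 10 at 320 to 19 × 19 at 608).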

3.5 Implementation environment

To execute the experimental study, a PC with the following specifications was used: an Intel® Core(TM) i7-6850K processor (3.36 GHz), 16.0 GB of RAM, and four NVIDIA GeForce GTX1080 GPUs.

3.6 Evaluation strategy

Our evaluation strategy used two conditions to determine whether the detected bounding boxes constituted a final true detection. First, the overlapping ratio (i.e., \( {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}} \)) between the detected bounding box and its corresponding GT boxes had to be equal to or greater than an appropriate practical threshold. Second, the confidence score (i.e., Prconf) of the final detected box had to be equal to or greater than an appropriate threshold [27, 35]. Specifically, we always use the maximum confidence score to evaluate truly detected boxes [9, 10]. A high confidence score reflects a highly accurate prediction that the lesion exists in the detected bounding box [9, 10].

For the quantitative evaluation of each fold test, we used weighted objective metrics, including sensitivity (Sens.), specificity (Spec.), overall accuracy (Acc.), the F1-score or Dice, the Matthews correlation coefficient (Mcc.), the positive predictive value (PPV), and the negative predictive value (NPV) [9, 10]. To compensate for the imbalance of the test sets across the nine disease classes, we used the weighted class strategy [27]. The weighted ratios for atelectasis, infiltration, pneumothorax, masses, effusion, pneumonia, cardiomegaly, nodules, and COVID-19 were 0.14, 0.10, 0.08, 0.06, 0.12, 0.09, 0.11, 0.06, and 0.25, respectively. All evaluation indices were computed using multiclass confusion matrices for each fold test [9, 10].
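One way to obtain such weighted indices from a multiclass confusion matrix is sketched below. This is an illustration under assumptions, not the original evaluation code: each class is scored one-vs-rest and the per-class values are combined as a weighted average. Because the rounded class weights listed above sum to 1.01, the sketch normalizes by their sum. The function names are hypothetical.

```python
import math
from typing import Dict, List

# Class weights in the order listed in the text (atelectasis ... COVID-19).
WEIGHTS = [0.14, 0.10, 0.08, 0.06, 0.12, 0.09, 0.11, 0.06, 0.25]


def one_vs_rest_metrics(cm: List[List[int]], k: int) -> Dict[str, float]:
    """Metrics for class k; cm[i][j] counts true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sens": ratio(tp, tp + fn),
        "spec": ratio(tn, tn + fp),
        "acc": ratio(tp + tn, total),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def weighted_metrics(cm: List[List[int]], weights: List[float]) -> Dict[str, float]:
    """Weighted average of the per-class metrics (weights are normalized)."""
    norm = sum(weights)
    per_class = [one_vs_rest_metrics(cm, k) for k in range(len(cm))]
    return {
        name: sum(w * m[name] for w, m in zip(weights, per_class)) / norm
        for name in per_class[0]
    }
```

For a nine-class fold, `weighted_metrics(cm, WEIGHTS)` would return one value per index, where `cm` is the 9 × 9 confusion matrix of that fold with its rows and columns in the same class order as `WEIGHTS`.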

4 Experimental results

4.1 Detection results

4.1.1 The proper threshold of the IoU and confidence score

The presented CAD system predicts five anchors (i.e., bounding boxes) for each grid cell over the entire X-ray image. To suppress undesirable detected boxes, including those with very small confidence scores, the non-maximum suppression (NMS) technique was used [9, 34]. This algorithm involved three consecutive stages during the testing phase. First, detected bounding boxes with confidence scores less than 0.005 were directly discarded. Second, among the remaining boxes, the box with the highest confidence score (i.e., Prconf.) was selected to represent the final predicted bounding box. Finally, any remaining boxes with IoUnms ≥ 50% with respect to the final predicted box selected in the second step were also discarded. Figure 6a shows the potential predicted boxes after applying NMS. During the evaluation phase, the IoU between the final predicted box and its GT had to reach an appropriate threshold to ensure high confidence that the predicted box includes the lesion. Experimentally, we found that the appropriate threshold for \( {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}} \) was 45%, as shown in Fig. 7a. The majority of the final detected bounding boxes for the X-ray images in the test set had an IoU greater than 90%. The final detected boxes with \( {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}}<45\% \) were considered to be false detections. In addition to controlling the IoU, we also adjusted the threshold for the confidence score to ignore undesirable detected boxes. Figure 6b-d shows the detected bounding boxes stratified by different probability thresholds of the confidence score. Experimentally, we found that the appropriate confidence threshold was 10%, as shown in Fig. 7b. This threshold allowed the detection of at least one suspected lesion in each test image for diagnostic purposes.
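The three NMS stages can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes boxes in `(x_min, y_min, x_max, y_max)` layout and uses the common greedy variant, in which the second and third stages are repeated until no candidate boxes remain, so that several non-overlapping boxes can survive. All names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # assumed (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]             # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(dets: List[Detection], score_thr: float = 0.005,
                        iou_thr: float = 0.50) -> List[Detection]:
    # Stage 1: discard boxes with very small confidence scores.
    remaining = sorted((d for d in dets if d[1] >= score_thr),
                       key=lambda d: d[1], reverse=True)
    kept: List[Detection] = []
    while remaining:
        # Stage 2: take the remaining box with the highest confidence score.
        best = remaining.pop(0)
        kept.append(best)
        # Stage 3: discard boxes that overlap the selected box too strongly.
        remaining = [d for d in remaining if iou(d[0], best[0]) < iou_thr]
    return kept


candidates = [
    ((0.0, 0.0, 10.0, 10.0), 0.90),
    ((1.0, 0.0, 11.0, 10.0), 0.60),    # heavily overlaps the first box
    ((50.0, 50.0, 60.0, 60.0), 0.30),
    ((0.0, 0.0, 5.0, 5.0), 0.001),     # below the 0.005 score threshold
]
print(non_max_suppression(candidates))
```

Raising `score_thr` from 0.005 toward 0.10 or 0.20 reduces the number of surviving boxes, which is the effect illustrated in Fig. 6.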

Fig. 6

Effect of the confidence score (i.e., Prconf) threshold on the number of detected bounding boxes. The potential regions including suspected lesions (i.e., detected bounding boxes) caused by COVID-19 were detected using confidence score thresholds of a 0.005, b 0.02, c 0.10, and d 0.20

Fig. 7

Prediction measurements in terms of a the intersection over union (IoU) and b the confidence score for the final predicted bounding box for all test sets for the nine disease classes

4.1.2 Detection results after 5-fold cross-validation

The presented deep learning CAD system can efficiently and automatically predict suspected COVID-19 lesions and other respiratory disease lesions from entire X-ray images. Table 1 shows the overall detection performance according to 5-fold validation using the test images from all nine disease classes. For each k-fold test, the same deep learning structure, training, and testing parameters of the presented CAD system were applied. The detected regions of interest (ROIs) that involved COVID-19 or other respiratory diseases were considered to be correctly detected if and only if \( {\mathrm{IoU}}_{\mathrm{Pred}.}^{\mathrm{GT}}\ge 45\% \) with Prconf. ≥ 10%. Otherwise, they were considered to be false detection cases even if Prconf. ≥ 10%. Indeed, most of the correctly detected final bounding boxes also had the maximum IoU and confidence scores. Based on the average of the 5-fold tests, the CAD-based YOLO was shown to be a reliable and feasible method of detecting COVID-19, with an overall detection accuracy of 96.31%. It failed to detect only 3.69% of COVID-19 cases in all the images. More generally, the presented CAD system can correctly detect respiratory diseases, with an overall detection accuracy of 90.67% for all nine disease classes. The true and false detection cases for the individual classes for the 5-fold validation are presented in Table 1.

Table 1 Detection evaluation results for COVID-19 and eight other respiratory diseases over the 5-fold tests in the test set

With regard to the qualitative evaluation, Fig. 8 shows examples of correctly detected suspicious lesions indicative of COVID-19 and all other disease classes. The overlapping ratios (i.e., IoU) of the resulting bounding boxes, along with their corresponding confidence scores, are also presented for each case. The detected boxes of these cases have acceptable IoU ratios and high confidence scores, indicating that the lesions have been accurately detected. Figure 9 shows some examples of falsely detected cases from the nine disease classes. The final detected boxes of these cases have undesirable overlapping ratios with their GTs. Therefore, they were considered incorrect detection cases even if they satisfied the confidence score condition.

Fig. 8

Examples of correctly predicted cases of COVID-19 and other respiratory diseases from chest X-ray (CXR) images: a atelectasis, b infiltration, c pneumothorax, d mass, e effusion, f pneumonia, g cardiomegaly, h nodule, and i & j COVID-19. The GT information (green), detected bounding box (red), IoU, and probability or confidence score (Pr.) for each case are superimposed on the original chest X-ray images

Fig. 9

Examples of the incorrectly predicted cases of COVID-19 and other respiratory diseases from chest X-ray images: a atelectasis, b infiltration, c pneumothorax, d mass, e effusion, f pneumonia, g nodule, and (h, i, & j) COVID-19. The GT information (green), detected bounding box (red), IoU, and probability or confidence score (Pr.) for each case are superimposed on the original chest X-ray images

4.2 Classification results

The presented CAD-based YOLO predictor can simultaneously detect and classify, end-to-end, the detected ROIs as COVID-19 or other respiratory diseases. As shown in Figs. 8 and 9, the presented CAD system detects the final regions with suspected lesions of respiratory diseases and classifies them at the same time. In fact, this is the key characteristic that makes the YOLO predictor faster and more accurate than other techniques, such as Faster R-CNN [10, 27, 34]. All final detected bounding boxes are classified even if they have been incorrectly detected. With regard to classification, it is important to know the final diagnosis status of each X-ray image (i.e., COVID-19 or another disease) since its GT label is available. The classification evaluation results are derived from the multiclass confusion matrices for all nine classes over each fold test. Figure 10 shows an example of the confusion matrices for all disease classes from the 3-fold and 5-fold tests. Indeed, most of the COVID-19 cases were correctly distinguished from other respiratory diseases. Due to the high degree of similarity between COVID-19 and other respiratory diseases, some cases of COVID-19 were misclassified as pneumonia and vice versa. The weighted recognition evaluation metrics obtained via the five-fold test for all classes are reported in Table 2. Specifically, the classification evaluation results for each individual disease class as an average of the tests are shown in Fig. 11. The proposed CAD system achieved an average overall classification accuracy ranging from 94.60% for pneumonia to 97.40% for COVID-19. The sensitivity was 91.69%, the specificity was 98.79%, and the Mcc. was 91.96% for differentiating COVID-19 from the other respiratory diseases. The classification performance of the system for COVID-19 as represented by the F1-score was 93.86%. We can conclude that the CAD system achieved satisfactory and promising classification performance with regard to the problem of the multiclass recognition of respiratory diseases.

Fig. 10

The derived multiclass confusion matrices of COVID-19 against other lung diseases from the test sets over a 3-fold and b 5-fold tests

Table 2 Weighted classification measurements (%) for COVID-19 among the other lung diseases as an average over the 5-fold tests in the test set

Fig. 11

Classification evaluation measurements (%) for each individual class of lung diseases as an average over the 5-fold tests in the test set

4.3 Effects of the regularization strategies

To improve the diagnostic performance for COVID-19 and the differentiation of COVID-19 from other respiratory diseases, data balancing and augmentation strategies were used. In this regard, the presented CAD system was trained and fine-tuned over 5-fold tests using the original, balanced, and augmented datasets in three separate scenarios. In each scenario, the same deep learning structure and learning settings were used. Figure 12 shows the weighted classification performance as an average of the 5-fold tests for each scenario. The balancing strategy improved the diagnostic performance by 3.43%, 1.47%, 2.79%, 3.35%, 3.86%, 3.28%, and 1.43% in terms of the Sens., Spec., Acc., F1-score, Mcc., PPV, and NPV, respectively. The major improvement was achieved through data augmentation after balancing. After applying the augmentation strategy, the classification performance was improved by 12.91%, 4.49%, 6.64%, 12.17%, 12.99%, 11.72%, and 3.56% in terms of the Sens., Spec., Acc., F1-score, Mcc., PPV, and NPV, respectively.

Fig. 12

Effect of enlarging the training set sizes using different deep learning regularization strategies on the overall classification performance of the proposed CAD system. The evaluation results are presented as the average of the 5-fold tests in the test sets for all disease classes

4.4 The cost of the prediction time

The training time depends on the deep learning structure, the training settings (i.e., number of epochs and mini-batch size), the size of the training set, and the specifications of the PC. For each fold test, the presented CAD system required almost 18 h for training. To make predictions for all test images, the proposed CAD system required 2.44 s. Since we had 263 test images across all disease classes, the prediction time for an individual X-ray image was 0.0093 s. Our CAD system can therefore make reliable predictions in real time at 108 FPS. The rapid global spread of COVID-19 is challenging for physicians. The accurate and fast detection of COVID-19 based on entire chest X-ray images can help physicians, patients, and health care systems.
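The per-image time and frame rate follow directly from the two reported figures (2.44 s for 263 test images), as the short check below shows. The variable names are illustrative.

```python
total_seconds = 2.44   # reported time to predict all test images
num_images = 263       # reported number of test images

seconds_per_image = total_seconds / num_images
frames_per_second = num_images / total_seconds

print(f"{seconds_per_image:.4f} s per image, {frames_per_second:.0f} FPS")
# prints "0.0093 s per image, 108 FPS"
```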

5 Discussion

Recently, researchers have been encouraged to apply artificial intelligence (AI) methodologies to help physicians in hospitals diagnose COVID-19. Indeed, deep learning based on CNNs has been shown to achieve promising classification results in different applications. To date, a few studies based on machine learning and deep learning models have been designed and presented. Such studies employed deep learning models to classify entire input X-ray images. However, it is neither efficient nor accurate to base a diagnosis on an entire X-ray image [12, 27]. Thus, the detection by the CAD system of regions containing suspected lesions related to a respiratory disease (i.e., COVID-19 or another disease) represents a crucial prerequisite for achieving a more accurate diagnosis. Table 3 compares the prediction performance of our proposed CAD system with the performance of the latest deep learning models. Ozturk et al. [8] presented the DarkCovidNet deep learning model, which can be used to differentiate COVID-19 cases from pneumonia and normal cases. They achieved an overall diagnostic performance of 87.02%. Wang et al. developed the COVID-Net deep learning model to differentiate COVID-19 cases from normal and pneumonia cases. They achieved an overall diagnostic performance of 92.40%. Meanwhile, Khan et al. [22] presented the CoroNet deep learning model, which can be used to differentiate COVID-19 cases from bacterial pneumonia, viral pneumonia, and normal cases. A diagnostic performance of 89.60% was achieved for the multiclass recognition problem.

Table 3 Prediction performance comparison against the latest deep learning models for the diagnosis of COVID-19 based on chest X-ray images

In this study, the proposed CAD system could effectively differentiate COVID-19 from eight other respiratory diseases. The detection accuracies for all nine disease classes ranged from 71.50% for pneumothorax to 97.60% for infiltration. The overall performance for the correct detection of regions with suspicious lesions was 90.67%. With regard to the detection of COVID-19, an overall detection accuracy of 96.31% was achieved. The results of the evaluation of the detection capability of the model for each individual disease class are reported in Table 1. The proposed CAD system could simultaneously predict the diagnosis (i.e., COVID-19 or not) for each detected ROI to determine the final diagnosis of the input X-ray image. As shown in Table 2, a promising classification accuracy of 97.40% was achieved over 5-fold tests. The simultaneous detection and classification of COVID-19 or other respiratory diseases in a single assessment of an entire X-ray image is helpful for physicians, especially when the number of patients is large. This will directly help support health care systems in hospitals as well. By controlling the confidence score threshold for the detected bounding boxes, we can select the desired number of boxes that should be used for the final real-time diagnosis. As shown in Fig. 6c, after adjusting the confidence threshold to 10%, two detected boxes were finally assigned to two different regions with lesions suspected of being related to COVID-19. These results are logical and acceptable because COVID-19 and other respiratory diseases can affect both lungs in the same patient. Meanwhile, it is important to consider the final detected regions with suspicious lesions for classification even if they have been incorrectly detected. As shown in Fig. 9, most falsely detected cases were correctly classified. Additionally, this may help physicians focus on regions with suspicious lesions other than those with GTs.
Figure 9h-j shows ROIs that were incorrectly detected with respect to the annotated position of the GT, although the final diagnosis was accurate. Meanwhile, deep learning regularizations for data balancing and augmentation were applied to improve the final diagnostic performance of the proposed CAD system. As shown in Fig. 12, these regularizers clearly improved the diagnostic performance as reflected in all evaluation indices. The average of the five-fold tests for the overlap class problem showed that the classification performance increased from 90.76% to 97.40% and from 72.64% to 84.81% with regard to the Acc. and F1-score, respectively. Generally, CAD systems could support physicians by providing a second opinion that could be used when making the final decision regarding the diagnosis. The fast and accurate diagnosis of COVID-19 based on entire X-ray images is key to helping physicians, patients, and health care systems.

The proposed CAD system has some advantages. First, the model has promising predictive accuracy for differentiating COVID-19 from other respiratory diseases. Second, the model can rapidly predict the presence of COVID-19 and other respiratory diseases based on entire X-ray images. Finally, user interventions are not required to detect and classify COVID-19 because the proposed CAD system has a unique end-to-end deep learning structure.

Despite the encouraging and rapid diagnostic performance for COVID-19, some drawbacks and limitations need to be addressed. Annotated digital X-ray images from COVID-19 patients are still unavailable. Considerable time and effort on the part of physicians are needed to label and localize the exact regions containing lesions associated with COVID-19.

In the future, when annotated chest X-ray images become available, we plan to validate the presented CAD system. To increase the reliability of the diagnosis, we will expand our proposed CAD system to diagnose COVID-19 based on digital CT images. Additionally, we plan to locally collect digital X-ray and CT images for further validation. To achieve more accurate pre-training of deep learning models, a generative adversarial network (GAN) could be used to synthesize images [27].

6 Conclusion

In this work, a deep learning CAD system is proposed that can simultaneously detect and diagnose COVID-19 based on chest X-ray images. The presented system was built with a unique deep learning structure and can rapidly predict the regions containing suspicious lesions likely associated with COVID-19 on entire X-ray images. The proposed CAD system was validated with regard to the multiclass recognition problem, achieving a promising diagnostic accuracy of 97.40% over 5-fold tests. Highly accurate and rapid information extraction from entire CXR images is key to developing a comprehensive and useful patient triage system in hospitals and healthcare systems. The promising diagnostic performance and the rapid prediction time make the proposed CAD system practical and reliable as a means of assisting physicians, patients, and health care systems.