1 Introduction

At the end of 2019, an unknown, potentially fatal virus capable of human-to-human transmission emerged in China, causing pneumonia, fever, dry cough, and fatigue [14]. These symptoms closely resemble those of other viral pneumonias, but the disease is significantly more lethal and has caused many deaths. The International Committee on Taxonomy of Viruses named the virus SARS-CoV-2 [22], and the disease it causes was subsequently named coronavirus disease 2019 (COVID-19) [79]. Due to its high transmissibility, COVID-19 spread rapidly around the globe in early 2020 and created a worldwide panic. Several countries have imposed a complete ban on various social activities to control the virus. COVID-19 transmission threatens human life, particularly among the elderly and those with chronic diseases [13, 43]. Accurate diagnosis of infected patients helps reduce the spread of the disease. Reverse transcription-polymerase chain reaction (RT-PCR) is the gold standard for COVID-19 testing [80]. However, this diagnostic tool has two critical issues: its turnaround time ranges from a few hours to days, and its reliability is limited by a high false-negative rate [63]. A false-negative RT-PCR result means that a COVID-19-infected person tests negative and could therefore spread the virus and infect others. Thus, a detection method that is faster and more accurate than the current testing method is necessary to overcome the COVID-19 pandemic.

Alternative methods, such as medical radiography, have been used for COVID-19 diagnosis. Radiography uses computed tomography (CT) and X-ray to detect abnormalities in chest images. RT-PCR has reportedly failed to diagnose a large number of “suspected” cases with typical clinical COVID-19 symptoms and identical specific CT images [71]. According to Huang et al. [58], 98% of patients with chest CT imaging have bilateral involvement. Several studies have shown that CT scans of COVID-19 patients exhibit bilateral ground-glass opacity (GGO) and sub-segmental areas of consolidation. Fang et al. [19] showed that the sensitivity of CT to COVID-19 is higher than that of RT-PCR (98% vs. 71%). Moreover, as stated in [5], more than 70% of patients with negative RT-PCR tests have typical CT manifestations. Hang et al. [24] established the relationship between CT features and RT-PCR test results, particularly in patients with a negative RT-PCR. This finding emphasizes that RT-PCR and CT are the main determinants of COVID-19 diagnosis.

A trained expert is required to detect COVID-19 infection correctly from a CT or X-ray image to differentiate between COVID-19-caused pneumonia and viral pneumonia, such as that caused by the influenza virus. When many trained observers annotate the same infected area, the best annotations are obtained, providing a probabilistic estimate of confidence in the lesion area. However, an expert’s hand annotation is time-consuming and costly. Thus, doctors need to use automatic image annotation techniques to speed up the labeling process.

Deep learning (DL) refers to an enhanced type of neural network with additional layers that increase the levels of abstraction [45]. DL can learn features from training data automatically. It has recently shown remarkable success in medical imaging by learning important image features from large training data. This review includes many papers published in prestigious journals. They are the most recent studies related to our main topic, namely, COVID-19 detection from medical images. The papers were selected based on four criteria. First, the database: the papers were carefully selected from well-known publisher databases such as Elsevier, SpringerLink, and IEEE. Second, the technique: the paper must use deep learning to detect COVID-19 from medical images as a classification problem, which excludes other DL problems. We were mostly interested in CapsNet and CNN for binary and multiclass classification. Third, the type of medical imaging: only chest X-ray and chest CT are included, which excludes MRI images of the chest. Fourth, the date of the paper: although many papers address the topic, we chose the most recently published ones, namely, those published between 2020 and 2021. Nineteen papers were systematically reviewed and presented based on two criteria: the type of the medical image (i.e., CT scan or X-ray) and the type of machine learning classification problem (i.e., binary or multiclass). We show that the limited amount of publicly available COVID-19 data is one of the major challenges for most of the reviewed research. Even when data are publicly available, image processing remains difficult because of repeated images, the need for lung segmentation, and images taken from various viewpoints. Moreover, COVID-19-related CT and X-ray images share many common features with several types of pneumonia. This similarity limits the performance of certain DL approaches in distinguishing COVID-19 from unrelated viral pneumonia.
The main contributions of this study are the following:

  1. Several COVID-19 research articles are summarized in a comprehensible structure based on two criteria: the type of medical image (i.e., CT scan or X-ray) and the type of machine learning classification problem (i.e., binary or multiclass).

  2. The datasets used in the reviewed studies are presented along with the characteristics that interest researchers when selecting a dataset for their studies.

  3. A detailed discussion is provided about the main challenges researchers may face when developing a DL classification model for detecting COVID-19 from CT and X-ray images, together with the proposed solutions.

The paper proceeds as follows. Section 2 provides a brief background regarding COVID-19 classification techniques from radiology images. Section 3 presents a systematic review of neural network applications for COVID-19 detection organized by classification and image types. Section 4 discusses the challenges of and proposed solutions in the reviewed papers. Finally, Section 5 concludes this paper and provides future trends.

2 Background

This section introduces the technical details used in the review to provide background information on the field. We define the concept of image classification and how it has been applied in medical images. Then, we explain the two most common DL algorithms, namely, convolutional neural networks (CNN) and Capsule Neural Networks (CapsNet), applied extensively in medical image classification.

2.1 Medical image classification

Health centers generate millions of medical images every day, and imaging is one of the diagnostic tools preferred by many physicians. Furthermore, imaging techniques in the medical industry, including X-ray, CT, magnetic resonance imaging, positron emission tomography, and electrocardiogram signal, produce gigabytes of information daily. This volume necessitates large storage devices and effective DL approaches for identifying diseases in image databases. Additionally, image databases are more challenging to manage than numerical databases. Medical image classification is one of the most critical problems faced in DL for healthcare; it is based on the grouping of similar image pixels. Pixels with similar values are assigned to the same group and labeled accordingly. As a result, a correctly classified image denotes areas that share the specific criteria defined in the classification scheme [8].

2.1.1 COVID-19 and pneumonia

Symptoms of pulmonary infection caused by different viruses are frequently similar, especially in the early stages of the disease [16]. Thus, clinical manifestations of pneumonia caused by COVID-19 and influenza viruses may be similar [18]. The most common clinical signs of COVID-19 and influenza pneumonia are fever, cough, and malaise [70]. However, healthcare worker precautions, treatment, and diagnosis approaches differ between these two viruses. Differentiating COVID-19 from influenza pneumonia is critical in the early stages of infection to ensure a timely and suitable management plan. The most common abnormalities in CT images of viral pneumonia include changes in pulmonary parenchymal density with areas of GGO or consolidation, nodular, or micro-nodular opacities [21]. In contrast, GGO and/or consolidation were frequently reported findings on the CT imaging of COVID-19 patients [81]. Studies have shown that CT, which has high sensitivity and availability, can play a crucial role in diagnosing COVID-19, particularly in the early stages when RT-PCR can be negative [20, 31, 60]. Combining RT-PCR with CT or X-ray can enhance the sensitivity of detecting COVID-19 in clinically suspected patients.
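
As a rough illustration of why combining the two tests can raise sensitivity, consider the sensitivities reported by Fang et al. [19] (98% for CT and 71% for RT-PCR). The sketch below assumes, purely for illustration, that a patient is flagged when either test is positive and that the two tests miss infected patients independently. Neither assumption comes from the reviewed studies, and the function name is hypothetical.

```python
def parallel_sensitivity(sens_a: float, sens_b: float) -> float:
    """Sensitivity when a case is flagged if either test is positive.

    Assumes the two tests miss infected patients independently
    (an illustrative simplification, not a clinical finding).
    """
    miss_both = (1.0 - sens_a) * (1.0 - sens_b)  # both tests miss the case
    return 1.0 - miss_both


# Sensitivities reported by Fang et al. [19]: CT 98%, RT-PCR 71%.
combined = parallel_sensitivity(0.98, 0.71)
print(f"combined sensitivity: {combined:.4f}")  # combined sensitivity: 0.9942
```

Under the same either-positive rule, specificity generally decreases, so the gain in sensitivity is not free; real tests are also unlikely to be fully independent.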

2.2 Deep learning algorithms

CNN, recurrent neural networks, CapsNet, and generative adversarial networks (GAN) are a few examples of the many existing DL algorithms. DL methods learn features automatically, guided by an objective function specified for the task at hand, such as detection, segmentation, or classification. Moreover, they enable the development of complete “end-to-end” systems that take an image and detect, segment, and classify visual objects using a single model and a unified training method [45].

  • Convolutional Neural Networks CNN is a DL algorithm that is widely used in image classification. The basic architecture of a CNN includes three types of layers: convolution, pooling, and fully connected layers. A CNN is split into two sections: the feature extraction section contains the convolution and pooling layers, whereas the classification section contains the fully connected layers. The image first passes through a sequence of convolution and pooling layers for feature extraction; then, it passes through the fully connected layers for classification [75]. The pooling layer is designed to reduce the redundancy of the representation and the number of parameters, on the premise that the exact location of a feature is unnecessary for object detection. However, pooling cannot provide viewpoint invariance, that is, insensitivity of the model to changes in viewpoint. To compensate, image augmentation techniques, such as cropping, zooming, rotation, and various other transformations, are applied to the training images to generate as many variations as possible, and the augmented images are fed to the CNN so that it is trained explicitly on these variations. This process requires substantial data, which is expensive: a large number of images is needed for the algorithm to learn to classify images correctly and with minimal errors [10]. This problem is referred to as invariance. CNNs can handle translation invariance (i.e., the ability of a network to recognize an object at any location in an image) but not rotational invariance [57]. If a CNN is trained on objects in one orientation, then it will have difficulty when the orientation is changed. Nevertheless, CNNs are popular and have been successful in the medical image classification of CT and X-ray images because they allow highly representative and hierarchical image features to be learned from sufficient training data.
Three main strategies exist for successfully applying CNNs to medical image classification. These strategies are regarded as the key factors in developing powerful CNN models [45]:

    1. Training the CNN from scratch This process needs a large labeled training set. This requirement can be challenging in the medical imaging field due to the high cost of expert annotation and the scarcity of disease cases in the datasets. Large datasets (thousands or more) yield strong evidence, but obtaining them is a challenge for medical imaging studies, except for well-funded research. For example, the NLST study involved nearly 53,000 patients and cost more than 250 million dollars [27].

    2. Using off-the-shelf pre-trained CNN features The typical CNN models that can be applied immediately in this field of study are known as off-the-shelf CNNs. Pre-trained models often provide millions of parameters learned when the model was trained to a stable state [61]. They can be used directly to extract image features. Then, the extracted feature vector is used as an input to train a new fully connected layer to handle other classification problems [61]. When a small dataset is used for retraining, only the parameters of the last fully connected layer are updated, while the parameters of the other layers remain the same as in the pre-trained model [75].

    3. Transfer learning from non-medical to medical image domains Fine-tuning a pre-trained network is an adequate alternative to training a CNN from scratch. The concept is to transfer the knowledge from a source domain with a massive amount of labeled images to a target domain with a restricted set of labeled data [45].

  • Capsule Neural Networks The human brain handles translation invariance significantly better than pooling in CNNs does. Hinton [59] posits that the brain has modules called “capsules” that are particularly good at handling different types of visual stimuli and at encoding properties such as pose. The brain has a mechanism for routing low-level visual information to the capsule it believes is best suited to handle it. To overcome the CNN limitation described above, Hinton [59] proposed a new neural network model based on the capsule concept, that is, CapsNets. Rather than making the network deeper by stacking more layers, CapsNets add depth through nesting, that is, through the inner structure of each layer. The model is robust to affine transformations. CapsNets introduced the concept of routing by agreement. Each capsule maintains the information extracted from the previous (parent) capsules and compares the information from the parents to construct a classification agreement for the entities in the image [53, 59]. Routing by agreement is the main difference between CapsNet and CNN, and it helps to define spatial relations. The CapsNet architecture has one convolution layer, one primary capsule layer, one digit capsule layer, and three fully connected layers. CapsNet is promising for medical image classification because medical image datasets are often small and highly imbalanced. In [34], the behavior of CapsNet is compared with that of a standard CNN under typical biomedical image database constraints, such as few labeled data and class imbalance. The findings confirm that CapsNet can be trained with less data for the same or improved performance and is highly resilient to imbalanced class distributions.
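
The invariance limitation of pooling that motivates CapsNets can be illustrated with a toy example. The following is a minimal sketch in plain Python, not code from any of the reviewed studies: it applies 2x2 max pooling (stride 2) to a 4x4 image containing one bright pixel. The helper names are hypothetical, and the image side is assumed to be even.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a square list-of-lists image (even side)."""
    n = len(image)
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, n, 2)]
        for r in range(0, n, 2)
    ]


def rotate_90_clockwise(image):
    """Rotate a square image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def blank(n):
    """Create an n x n image filled with zeros."""
    return [[0] * n for _ in range(n)]


original = blank(4)
original[1][1] = 9           # one bright "feature" pixel

shifted = blank(4)
shifted[0][0] = 9            # same feature moved by one pixel, same pooling window

rotated = rotate_90_clockwise(original)

print(max_pool_2x2(original))  # [[9, 0], [0, 0]]
print(max_pool_2x2(shifted))   # [[9, 0], [0, 0]]  small shift is absorbed
print(max_pool_2x2(rotated))   # [[0, 9], [0, 0]]  rotation changes the output
```

The small shift stays inside one pooling window and leaves the pooled map unchanged, whereas the rotation moves the feature to a different window. This is only a toy view of the behavior described above; real CNNs combine pooling with learned filters and augmentation.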

3 Deep learning-based detection of COVID-19 using chest imaging

RT-PCR is the most widely used test for COVID-19 detection in all countries and is currently the gold standard for symptomatic and asymptomatic patients. The optimal window for RT-PCR testing runs from at least two days after infection until negativization, and the turnaround time is approximately 190 min. The results of 19 studies reviewed by Kim et al. [39] show that the positive predictive value of RT-PCR is between 47.3% and 98.3%, whereas the negative predictive value is between 93.4% and 99.9%. Moreover, recent studies have reported that RT-PCR has a high false-negative rate [40, 44]. False negatives (COVID-19-positive patients who are labeled as negative and could thus spread the virus) are riskier than false positives.
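
The measures quoted throughout this review (accuracy, sensitivity, specificity, predictive values, and F1-score) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The class and function names are hypothetical, the counts are made up for illustration, and zero denominators are not handled.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # infected, predicted positive
    fp: int  # healthy, predicted positive
    tn: int  # healthy, predicted negative
    fn: int  # infected, predicted negative (the risky case)


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Standard binary diagnostic metrics from confusion-matrix counts."""
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "sensitivity": c.tp / (c.tp + c.fn),       # also called recall
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),               # also called precision
        "npv": c.tn / (c.tn + c.fn),
        "false_negative_rate": c.fn / (c.tp + c.fn),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }


# Made-up counts: 100 infected and 100 healthy patients.
counts = ConfusionCounts(tp=71, fp=5, tn=95, fn=29)
metrics = diagnostic_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and the false-negative rate always sum to one, which is why a high false-negative rate is equivalent to low sensitivity.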

Another method used for COVID-19 diagnosis is medical radiography involving chest CT and chest X-ray. Fang et al. [19] showed that the sensitivity of CT to COVID-19 is higher than that of RT-PCR (98% vs. 71%).

Moreover, as stated in [5], more than 70% of patients with negative RT-PCR tests have typical CT manifestations. In febrile cases with high clinical suspicion, the recommendations favor performing both chest CT and RT-PCR [84]. Chest CT is helpful and more sensitive for COVID-19 detection than X-ray. However, excessive exposure to radiation and cost restrict the use of CT, particularly for pregnant women and children [38]. A chest X-ray is a noninvasive low-radiation chest test that provides a full picture of the lungs; it is also less costly than CT. Therefore, X-ray can be an efficient method for the early detection of COVID-19. However, COVID-19 CT and X-ray images share many characteristics with other forms of pneumonia, which makes diagnosis difficult for radiologists. Hang et al. [24] established the relationship between CT features and RT-PCR test results, particularly in patients with a negative RT-PCR. This finding emphasizes that RT-PCR and CT are the main determinants of COVID-19 diagnosis. A brief review of DL COVID-19 diagnosis research shows that the main limitation is the availability of COVID-19 data. However, the researchers overcame this constraint by employing various strategies and solutions, achieving good performance results. Figure 1 shows more than 50 studies, a few of which are discussed in detail in the next sections. The studies are reviewed according to the type of image and the type of classification.

Fig. 1 Image types used in COVID-19 DL studies

Fig. 2 The hierarchical representation of studies based on the type of image and classification

As shown in Figure 1, most studies used X-ray images, and binary classification is the most common classification type. Figure 2 illustrates the hierarchical representation of the studies with their references and accuracy.

  1. Chest X-ray

    • Binary Classification In [1], CapsNet is used for the binary classification of X-ray images. The original CapsNet contains one convolution layer; the authors modified it by adding two convolution layers to extract additional features from the images. The dataset used contains four labels: non-COVID viral, COVID-19, normal, and bacterial. The three labels other than COVID-19 are merged into a single non-COVID class, which turns the task into a binary classification of COVID-19 vs. non-COVID. As COVID-19 is a new disease, the dataset is small and highly imbalanced. To address the class imbalance, the authors modified the original margin loss to assign additional penalties to misclassified positive cases. No augmentation techniques were used to address the small dataset because CapsNet can handle small data; the model achieved 95.7% accuracy, 90% sensitivity, and 95.8% specificity. To improve the model further despite the small dataset, the CapsNet was then pre-trained. Pre-training is conducted on an external dataset of five classes, reflected in the five final capsules. These five capsules are then collapsed into two capsules. Next, all the capsule layers are fine-tuned on the main COVID-19 dataset. As a result, the accuracy, specificity, and AUC of the model (COVID-CAPS) improved further to 98.3%, 98.6%, and 0.97, respectively, though the sensitivity decreased to 80%. CapsNet is also used in [68] for the binary classification of X-ray images as COVID-19 or no findings. The original CapsNet contains one convolution layer, and it is modified by adding four layers to extract additional features from the images. The model achieved 97.24% accuracy, 97.42% sensitivity, and 97.06% specificity. Quan et al. [54] integrated CNN and CapsNet into a DL framework called DenseCapsNet. CNN is used for feature extraction, whereas CapsNet is the main body of the framework.
Two CNNs, namely, DenseNet121 and a 50-layer residual neural network (ResNet) called ResNet50, were compared for feature extraction. Both CNNs are good at detecting COVID-19 and have similar results. However, for normal cases, ResNet50 produces twice as many false detections as DenseNet121. Thus, DenseNet121 is used in the proposed framework. It extracts 1024 features from the COVID-19 X-ray images. The extracted features are passed through the first CapsNet convolution layer, which reduces the number of features from 1024 to 512. Then, they are converted to vectors and used as input to the main capsule layer. Finally, the capsule layer outputs the instantiation parameters for the COVID-19 and normal classes, and the length of each vector represents the probability of occurrence of the corresponding category. Luca et al. [7] proposed a three-stage approach. The first stage is a binary model that distinguishes healthy X-rays from pulmonary-disease X-rays (the pulmonary data include COVID-19), and the second stage is a binary model that distinguishes COVID-19 from other pulmonary diseases. Both models are built by fine-tuning the VGG16 [64] model. The third stage uses gradient-weighted class activation mapping (Grad-CAM) [62] to detect COVID-19-infected regions in an X-ray image. Taking 2.5 s on average, the proposed method assists pathologists and radiologists in identifying infected areas in an X-ray image. Wang et al. [73] proposed a computer-aided detection (CAD) framework to distinguish COVID-19 from common pneumonia and localize it. The CAD framework is a DL classifier that includes two models, namely, discrimination-DL and localization-DL. Localization-DL is a binary classification model that uses a ResNet for image classification [74] to localize the patient’s infected pulmonary region in the right or left lung or bi-pulmonary region. The CAD scheme achieved 93.65% accuracy, 90.92% sensitivity, and 92.62% specificity with a testing time of 272 ms per image.
Karakanis et al. [36] developed a binary classification model (COVID-19 vs. normal) to detect COVID-19 from X-ray images. A conditional GAN is used to augment the limited data, and a pre-trained ResNet8 model is used for classification. The model achieved 98.7% accuracy, 100% sensitivity, and 98.3% specificity. Ibrahim et al. [33] used a pre-trained CNN model named AlexNet to perform a set of binary classifications on X-ray images: COVID-19 vs. normal, bacterial pneumonia vs. normal, non-COVID-19 viral pneumonia vs. normal, and COVID-19 vs. bacterial pneumonia. For COVID-19 and normal, the model achieved 99.16% accuracy, 97.44% sensitivity, and 100% specificity. For COVID-19 and non-COVID-19 viral pneumonia, the model achieved 99.62% accuracy, 90.63% sensitivity, and 99.89% specificity. Ayyar et al. [6] developed multiple binary classification models that were fine-tuned to discriminate between two classes at a time rather than learning to distinguish between three classes. The classifiers are normal vs. positive (pneumonia or COVID-19), normal vs. COVID-19, normal vs. pneumonia, and pneumonia vs. COVID-19. The squeeze-and-excitation (SE) module is used with two popular pre-trained CNN classifiers: ResNet50 and InceptionV3. The SE-based attention module can be used with any convolutional layer to weigh each channel in the layer and thereby remove redundancy [32]. The binary model using InceptionV3 with SE to classify X-ray images into COVID-19 and pneumonia achieved 94.3% sensitivity and 95.6% accuracy. When used with SE, ResNet50 achieved 93.65% sensitivity and 94.9% accuracy.

    • Multiclass Classification The CapsNet model of Suat et al. [68] described above is also used for three-class output: COVID-19, no findings, and pneumonia. It achieved 89.19% accuracy, 84% sensitivity, 92% specificity, and 84.61% precision. DenseCapsNet [54], the result of the previously described CNN and CapsNet integration, is trained on multiclass classification data, including normal, pneumonia, and COVID-19. It obtained 97% accuracy. The discrimination-DL model in the CAD framework in [73] is based on a Feature Pyramid Network (FPN) [42] on top of ResNet. It automatically recognizes, collects, and extracts important lung features from chest X-ray images. The output is one of three classes: COVID-19, healthy, or common pneumonia. The X-ray images classified as COVID-19 positive are then input to the localization-DL binary model. Karakanis et al. [36] also used the same architecture as their binary model to develop a multiclass classification model that classifies X-ray images into bacterial pneumonia, COVID-19, and normal. The model achieved 98.3% accuracy, 99.3% sensitivity, and 98.1% specificity. Waheed et al. [69] proposed CovidGAN, which uses an Auxiliary Classifier GAN (ACGAN) for augmentation and a pre-trained VGG16 for classification. The model classifies X-ray images as COVID-19 or normal. It achieves 95% accuracy, 90% sensitivity, 97% specificity, 96% precision, and 93% F1-score. Ibrahim et al. [33] also applied the AlexNet model to the multiclass classification of X-ray images. Two approaches were developed, namely, a three-way classification (COVID-19 vs. bacterial pneumonia vs. healthy) and a four-way classification (COVID-19 vs. non-COVID-19 viral pneumonia vs. bacterial pneumonia vs. healthy). The accuracy, sensitivity, and specificity are 94%, 91.30%, and 84.78% for the three-way model and 93.42%, 89.18%, and 98.92% for the four-way model, respectively.

  2. Chest CT

    • Binary Classification Sakshi et al. [4] proposed a three-phase model to detect COVID-19 using a ResNet. The three phases are data augmentation, transfer learning, and abnormality localization. After preprocessing, the images are divided into 90% training and 10% testing. Testing data are fed directly to the second phase. Augmentation techniques are applied to the training data of COVID-19 and normal cases. The number of COVID-19 images increased ninefold, from 178 to 1602, through rotation, translation, and shear. This technique mitigated the problem of limited COVID-19 data and helped the CNN classify objects robustly even when they appear in different orientations. The preprocessed and augmented training images are fed to a pre-trained CNN model, ResNet18, which is adapted through transfer learning for the binary classification of data into COVID-19 and non-COVID-19. The trained ResNet18 is then used for the binary classification of the testing data. For abnormality localization, the activation outputs of the first convolution layer and the deeper layers of ResNet18 are used to generate the feature maps for detecting CT image abnormalities of COVID-19 cases. Then, the extracted features are evaluated by comparing the activation regions with the input CT image of a COVID-19 case. Furthermore, the valuable features for localizing abnormalities in the COVID-19 image are obtained via the strongest activation channel. The binary classification results confirm that the ResNet18 model effectively classifies COVID-19 data. The results were 100% sensitivity, 99.4% accuracy, 99.5% F1-score, 99% precision, 0.996 AUC, and 98.6% specificity. Mei et al. [48] proposed a framework that includes three models, namely, a CNN, an MLP classifier, and their integration, for the binary classification of CT images into COVID-19 or non-COVID-19. The CNN model consists of two subsystems. The first CNN identifies the abnormal CT slices using a pre-trained Inception-ResNet-v2 [66].
The second CNN performs region-specific disease classification using ResNet-18, with segmented lung images from the first CNN as inputs and COVID-19 or non-COVID-19 as output. The MLP classifier uses demographic and clinical data to classify COVID-19 positivity. Clinical information includes patients’ sex, age, exposure history, white blood cell counts, symptoms (presence or absence of cough, fever, and/or sputum), neutrophil number, percentage neutrophils, absolute lymphocyte number, and percentage lymphocytes. The first and second models are integrated to generate the final output of the third model. The joint model has 84.3% sensitivity, 82.8% specificity, and 0.92 AUC. Wang et al. [72] developed a deep feature fusion method that incorporates a CNN and a Graph Convolutional Network (GCN) to classify CT images into COVID-19 and healthy control. The CNN is used to learn individual image-level features, whereas the GCN is used to learn relation-aware features. The model has a sensitivity of 97.71 ± 1.46%, a specificity of 96.56 ± 1.48%, a precision of 96.61 ± 1.43%, an accuracy of 97.14 ± 1.26%, and an F1-score of 97.15 ± 1.25%.

    • Multiclass Classification Zhang et al. [85] proposed an artificial intelligence (AI) diagnostic system that combines two models: a lung lesion segmentation model and a diagnostic prediction model. In lung segmentation, the network is trained on manually segmented images of COVID-19, common pneumonia, and normal cases to generate lung lesion maps. A lesion is the affected part of the lung. The lung lesion maps are input to the second model, a multiclass classification model based on 3D ResNet [25], which classifies each lung lesion map as COVID-19, common pneumonia, or normal. The multiclass classification model has 85.05% accuracy and 0.94 AUC. He et al. [29] benchmarked DL models for classifying chest CT into COVID-19, common pneumonia, and normal control. Various 2D and 3D CNN models were evaluated, and the 3D CNN models outperformed the 2D CNN models. The best-performing model is DenseNet3D, achieving 88.14% accuracy and 0.94 AUC. Then, an automated DL methodology was developed to generate a lightweight DL model called MNas3DNet41, with 87.14% accuracy and 0.96 AUC. Using ResNet50 as the backbone for distinguishing COVID-19 from other common pneumonia, Li et al. [41] developed a 3D DL framework referred to as COVNet. The dataset was collected from six hospitals and comprises chest CT scans of RT-PCR-positive COVID-19, common pneumonia, and non-pneumonia abnormalities. The Grad-CAM method was used to visualize the lung regions that are important to the DL model. The framework achieved sensitivities of 90%, 87%, and 94% for COVID-19, common pneumonia, and non-pneumonia, respectively.

  3. CT and X-ray

    • Binary Classification Varalakshmi et al. [51] proposed a novel model for COVID-19 detection from CT and X-ray images using DL techniques. The dataset includes COVID-19, normal cases, pulmonary diseases, bacterial pneumonia, and viral pneumonia. First, the VGG16 model, a CNN model, is used to find lung diseases similar to COVID-19. With 93% accuracy, the model finds that pneumonia is much closer to COVID-19 than the other diseases. To analyze the pneumonia images further, transfer learning is performed on bacterial and viral pneumonia, which shows that COVID-19 is closer to viral pneumonia. The knowledge gained from viral pneumonia is then transferred to COVID-19 detection as a binary classification into COVID-19 or non-COVID-19. Using VGG16 and transfer learning, the proposed model achieved 91% precision, 90% recall, and 93% accuracy. Kassani et al. [37] proposed a COVID-19 detection system for X-ray and CT images. The system includes two phases: feature extraction using 15 pre-trained CNN models and classification using six machine learning classifiers. The second phase classifies the features computed in the first phase to produce the final prediction of either COVID-19 or normal. The best results are obtained by combining the DenseNet121 model and a Bagging tree classifier, with 99% accuracy. Ravi et al. [56] proposed a deep learning-based meta-classifier approach to detect COVID-19 from CT and X-ray images. In the first stage, they used EfficientNet, a pre-trained CNN model, for feature extraction and dimensionality reduction. The features are combined and sent to a stacked meta-classifier. The stacked meta-classifier first obtains predictions from random forest and support-vector machine (SVM) classifiers; its second stage then employs logistic regression to classify unlabeled samples into COVID-19 and non-COVID-19 classes. Ahamed et al. [2] fine-tuned a pre-trained ResNet50V2 model to detect COVID-19 from CT and X-ray images.
ResNet50V2 is a pre-trained CNN model, which they extended by adding extra layers to its base network. It obtained 98.95% accuracy for two-class cases: COVID-19 vs. viral pneumonia.

    • Multiclass Classification Ahamed et al. [2] also achieved 97.24% accuracy by using the fine-tuned ResNet50V2 model in four-class cases (COVID-19 vs. normal vs. bacterial pneumonia vs. viral pneumonia) and 97.24% accuracy in three-class cases (COVID-19 vs. normal vs. bacterial pneumonia).
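
Two steps described for [1] in the chest X-ray binary classification above, merging the four labels into two classes and penalizing misclassified positive cases more heavily, can be sketched as follows. The margin loss uses the form and default constants (0.9, 0.1, 0.5) of the original CapsNet formulation. The way the extra penalty is applied here (a single multiplicative `pos_weight`) is an assumption for illustration and is not necessarily the modification used in [1]; all function and parameter names are hypothetical.

```python
POSITIVE_LABEL = "COVID-19"


def to_binary(label: str) -> int:
    """Collapse the four dataset labels into COVID-19 (1) vs. non-COVID (0)."""
    return 1 if label == POSITIVE_LABEL else 0


def weighted_margin_loss(lengths, target, pos_weight=2.0,
                         m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss over output-capsule lengths, up-weighted for positive samples.

    lengths: capsule vector lengths; index 0 = non-COVID, index 1 = COVID-19.
    target:  true class index (0 or 1).
    pos_weight: assumed extra penalty for COVID-19 samples (illustrative).
    """
    loss = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The capsule of the true class should be long (>= m_plus).
            loss += max(0.0, m_plus - length) ** 2
        else:
            # Capsules of the other classes should be short (<= m_minus).
            loss += lam * max(0.0, length - m_minus) ** 2
    weight = pos_weight if target == 1 else 1.0
    return weight * loss


labels = ["non-COVID viral", "COVID-19", "normal", "bacterial"]
print([to_binary(label) for label in labels])  # [0, 1, 0, 0]

# Equally confident mistakes on a positive and a negative sample.
missed_positive = weighted_margin_loss([0.8, 0.2], target=1)
false_alarm = weighted_margin_loss([0.2, 0.8], target=0)
print(missed_positive > false_alarm)  # True
```

With `pos_weight=2.0`, a missed COVID-19 case costs twice as much as an equally confident false alarm, which pushes training toward higher sensitivity on the minority class.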

Tables 1 and 2 summarize the reviewed literature by binary and multiclass classification.

4 Discussion of the current challenges

Developing a neural network model to identify COVID-19 in CT and X-ray images involves various challenges. This section discusses the difficulties encountered by the researchers included in the literature review as well as their proposed solutions. This discussion can serve as a reference for researchers interested in contributing to COVID-19 deep learning studies.

Figure 3 shows the challenges and the proposed solutions.

  1.

    Dataset Issues A few COVID-19 datasets are publicly available. Table 3 provides details on some of these datasets used by the researchers in developing DL models. The table demonstrates that most datasets have limited COVID-19 cases. Before being fed to a DL model, most available datasets require pre-processing, such as resizing, unifying image format, and segmenting the lung image from other body part images. Due to unclean data issues, such as repetitive and unsegmented lung images, selecting the best image for the model is challenging. The methods used by the authors to resolve the concerns of repeated and unsegmented images are discussed below.

    • Repeated Images The dataset must be checked for repeated images before it is fed to the model. Repeated images produce misleadingly high performance results because the same image may appear in both the training and testing sets when the dataset is split. The test data must be unseen during training to evaluate the model’s performance fairly. In [29], the authors manually deleted the repeated images from the CC-CCII dataset [86] and generated a new clean dataset [30]. The newly cleaned dataset contains 340,190 images from 2,698 patients, indicating numerous scans of the same patient from different points of view. They split the dataset into training, testing, and validation sets based on patients’ demographics to prevent the same patient from appearing in two sets. They achieved 88% accuracy, which is high compared with other multiclass classification studies.
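
The two safeguards described above, removing repeated images and keeping each patient in a single split, can be sketched as follows. This is a minimal illustration, not the procedure of [29], where the repeated images were deleted manually. The record layout `(patient_id, image_bytes)`, the function names, and the split fractions are assumptions made for the example, and hashing catches only byte-identical copies.

```python
import hashlib
import random
from collections import defaultdict


def drop_exact_duplicates(records):
    """Keep the first record for each distinct image payload.

    `records` is a list of (patient_id, image_bytes) pairs (hypothetical layout).
    Only byte-identical copies are detected; near-duplicates need other checks.
    """
    seen = set()
    unique = []
    for patient_id, image_bytes in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((patient_id, image_bytes))
    return unique


def split_by_patient(records, train_fraction=0.7, val_fraction=0.15, seed=0):
    """Assign whole patients to train/val/test so no patient spans two sets.

    The test set receives the patients left over after train and val.
    """
    by_patient = defaultdict(list)
    for patient_id, image_bytes in records:
        by_patient[patient_id].append(image_bytes)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * train_fraction)
    n_val = int(len(patients) * val_fraction)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: [(p, img) for p in ids for img in by_patient[p]]
        for name, ids in groups.items()
    }
```

Splitting on patient identifiers rather than on individual images is what prevents scans of one patient from leaking between the training and testing sets.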

    • Segmentation Method Lung segmentation is the separation of the lung regions in an image from the other body parts. This initial step is necessary for lung image analysis and plays a vital role in the classification model’s performance. Real-world CT scans and X-ray images contain noise and vary across devices and operators. Thus, segmentation also reduces data heterogeneity when different datasets are combined into one. Segmentation allows the model to focus on the medically meaningful regions of the input image. The reviewed studies performed segmentation either by developing a dedicated segmentation model or by using a publicly available dataset to train the model. Alternatively, the easiest method is to detect the boundaries between the lungs and the surrounding body parts by using open-source libraries and setting appropriate threshold parameters. Quan et al. [54] and Zhang et al. [85] built a segmentation model independently from the classification model using existing segmentation networks. In [54], the segmentation model is built on the TernausNet model and trained on publicly available datasets, whereas in [85], the model is built on the DeepLabv3 model and trained on a manually segmented dataset. The segmentation results in [85] were annotated and reviewed by five senior radiologists. Compared with human experts, their segmentation network obtained smoother and clearer lesion segmentation boundaries and high accuracy. He et al. [29], Mei et al. [48], and Wang et al. [73] performed segmentation by identifying the lung regions and separating them from other body parts in different ways. First, in [29], the dataset used [86] already contains segmented images, except for a few. For these, they used the k-means method to separate the lung regions from the CT slices and eliminate the white background.

      Table 1 Summary of Multiclass Classification Literature Review
      Table 2 Summary of Binary Classification Literature Review
      Fig. 3 COVID-19 challenges and proposed solutions

      Second, in [48], they defined the lung region as the pixels with an intensity of less than 175 falling within the segmented body part. Then, they removed small regions with fewer than 64 pixels because such regions are frequently produced by random noise. The lung region was extended by 10 pixels to include the pleural boundary completely. Third, in [73], they developed a network based on the VGG that uses a minimum bounding-box approach. This approach takes the non-zero region, which corresponds to the lung regions. Overall, all approaches perform well in lung segmentation, although the first is more accurate in detecting COVID-19-infected areas of the lungs. The second approach is the simplest and quickest, and the code can be found in publicly available sources.
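
The threshold-based procedure described for [48] can be sketched in a few steps: threshold inside the body mask, discard small connected regions, and grow the result by a margin. The following is a minimal pure-Python illustration under stated assumptions, not the authors’ code: the image is taken to be a 2D list of 8-bit intensities, `body_mask` is assumed to be given, regions are 4-connected, and the 10-pixel extension is approximated by repeated 4-neighbour growth. All function names are hypothetical; a practical pipeline would normally use an array or image-processing library.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def connected_components(mask):
    """Return the 4-connected components of a boolean mask as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, component = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def dilate(mask, steps):
    """Grow the mask by one 4-connected pixel per step."""
    rows, cols = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(steps):
        grown = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                if current[r][c]:
                    for dy, dx in NEIGHBORS:
                        ny, nx = r + dy, c + dx
                        if 0 <= ny < rows and 0 <= nx < cols:
                            grown[ny][nx] = True
        current = grown
    return current


def lung_mask(image, body_mask, max_intensity=175, min_region=64, margin=10):
    """Threshold inside the body, drop small noisy regions, then extend the mask."""
    rows, cols = len(image), len(image[0])
    # Step 1: candidate lung pixels are dark pixels inside the body.
    candidate = [
        [bool(body_mask[r][c]) and image[r][c] < max_intensity for c in range(cols)]
        for r in range(rows)
    ]
    # Step 2: keep only regions with at least `min_region` pixels.
    kept = [[False] * cols for _ in range(rows)]
    for component in connected_components(candidate):
        if len(component) >= min_region:
            for r, c in component:
                kept[r][c] = True
    # Step 3: extend the region by `margin` pixels.
    return dilate(kept, margin)
```

The default values 175, 64, and 10 are the ones reported in the text; whether they transfer to images from other scanners or intensity scales would need to be checked.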

  2.

    Limited Availability of COVID-19 Data Although positive COVID-19 images are limited, many strategies can be employed to avoid overfitting and enhance the models’ capability to classify COVID-19 cases. Data augmentation, pre-trained models, and CapsNet are used in the studies to address the limited availability of COVID-19 data. Certain studies applied all these strategies, whereas others used only one or two.

    • Data Augmentation Data augmentation applies modifications to the images in the training set. The purpose is to generate additional representative samples that simulate variations in image acquisition and patient anatomy. The augmented data should help the model avoid learning features that are highly specific to the original training data. Thus, the model’s generalizability improves, as does its performance on the test set [9]. Numerous approaches to data augmentation are available, depending on the data and the model. Table 4 lists the techniques used in the studies. It shows that rotation, flipping, and translation are the most commonly used methods. After applying augmentation approaches, the accuracy of [68] increased by 7%. Waheed et al. [69] and Karakanis et al. [36] performed augmentation by utilizing a GAN to generate additional images. Subsequently, the performance results in [36] greatly improved: the multiclass and binary classification models increased their accuracy, specificity, and sensitivity by 55.4%, 67%, and 85.2% and by 85.2%, 76.2%, and 96.1%, respectively. In [69], the accuracy, sensitivity, F1-score, and precision increased by 10%, 21%, 15%, and 7%, respectively.
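
The three most common transformations named above (rotation, flipping, and translation) can be illustrated on a small image stored as a 2D list. This is only a sketch: the function names are hypothetical, the rotation is restricted to 90 degrees to stay within the standard library, and the reviewed studies typically rely on framework utilities with small random angles and shifts rather than on code like this.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate(image, dy, dx, fill=0):
    """Shift the image by (dy, dx) pixels, padding uncovered pixels with `fill`."""
    rows, cols = len(image), len(image[0])
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dy, c + dx
            if 0 <= nr < rows and 0 <= nc < cols:
                shifted[nr][nc] = image[r][c]
    return shifted


def augment(image):
    """Return the original image followed by three augmented variants."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        translate(image, 0, 1),
    ]
```

Augmentation should be applied to the training set only, so that the test set continues to reflect unmodified data.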

      Table 3 Publicly available COVID-19 datasets
      Table 4 The Augmentation techniques used in the studies
    • Capsule Neural Network The use of CapsNet is one of the practical solutions to the limited COVID-19 data because CapsNet can deal with small datasets. The authors of [1] used CapsNet without employing any data augmentation methods or pre-training the model. The model achieved 95.7% accuracy, 95.8% specificity, 90% sensitivity, and 97% AUC. They modified the loss function to address the class imbalance caused by the few available COVID-19 images. In the loss function, positive samples are given more weight than negative samples, where the weights are determined based on the proportion of positive and negative cases. Moreover, the CapsNet binary classification model in [68] achieved an accuracy of 91.24% without data augmentation. In [54], the CapsNet model is integrated with the DenseNet model to obtain high performance without pre-training or data augmentation. The results are 99.32% accuracy, 99.04% precision, 99.52% sensitivity, and 99.15% specificity.
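
The idea of weighting positive samples more heavily according to the class proportions can be sketched with a weighted binary cross-entropy. This is an illustration under assumptions, not the loss used in [1]: the exact weighting formula is not given here, so the sketch sets each class weight to the proportion of the opposite class, and it uses cross-entropy in place of the capsule-network loss. The function names are hypothetical.

```python
import math


def class_weights(labels):
    """Return (w_pos, w_neg), each equal to the proportion of the opposite class.

    With few positive cases, w_pos is larger than w_neg (assumed scheme).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    total = len(labels)
    return n_neg / total, n_pos / total


def weighted_bce(labels, probs, eps=1e-7):
    """Mean binary cross-entropy with per-class weights from `class_weights`."""
    w_pos, w_neg = class_weights(labels)
    loss = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        if y == 1:
            loss -= w_pos * math.log(p)
        else:
            loss -= w_neg * math.log(1 - p)
    return loss / len(labels)
```

With one positive case among four samples, a missed positive contributes three times the penalty of an equally confident false alarm, which pushes the model away from simply predicting the majority class.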

    • Pre-trained Deep Learning Neural Network Models Pre-trained DL neural network models have shown promising results in solving many problems in several research areas, such as computer vision, pattern recognition, natural language processing, and medical image classification. The superior ability of pre-trained networks is due to their extensive training on large datasets for numerous classes, such as ImageNet dataset with 1000 classes. This feature allows users to utilize the benefits of pre-training and the transfer learning concept in various classification and feature extraction processes [3]. To improve the diagnostic capabilities of the classification models of COVID-19, certain studies considered pre-training and transfer learning using an external dataset of X-ray or CT images or using CNN models pre-trained on a large public dataset. Many CNN models are trained on the ImageNet dataset, and their weights are publicly available for use in transfer learning. The key reasons for using the ImageNet dataset are the ImageNet publication, large size, different classes, and the close similarity of many classes [65]. ImageNet weights were utilized to initialize the weights of the CNN models used in [2, 4, 6, 7, 33, 36, 37, 48, 54, 56, 69]. In [54], the DenseNet121 model parameters are used to initialize the COVID-19 classification network parameters. Then, the network is trained using the COVID-19 dataset without freezing any layers. In [37], the DenseNet121 model is used as a feature extractor model; it outperforms 14 pre-trained CNN models. It obtained 99% accuracy when used with the Bagging tree machine learning classifier. In [7], the weights of the pre-trained VGG16 network are used. The steps are as follows. Exclude the fully connected layer head, and replace it with a new fully connected layer head composed of the layers AveragePooling2D, Flatten, Dense, Dropout, and the final Dense with the “softmax” activation. 
The new head is placed on top of the VGG16 base. Then, the VGG16 convolutional weights are frozen, allowing only the fully connected layer head to be trained. Xyrt et al. [4] used four pre-trained CNN models, namely, ResNet18, ResNet101, ResNet50, and SqueezeNet. The models are re-trained to classify CT images into two classes. The ResNet18 pre-trained model obtained the best testing accuracy of 99.4%. In [7], the Inception-ResNet-v2 model was used as the slice selection CNN to identify abnormal CT images from all chest CT images. This pre-trained model achieved 99.4% accuracy. The NIH Chest X-ray dataset is used to pre-train the models in [1, 51]. ImageNet was not used for pre-training because the nature of the images in that dataset differs from that of the COVID-19 X-ray dataset. In [1], pre-training increased the accuracy and specificity by 2.6% and 2.8%, respectively, whereas sensitivity decreased by 10%. In [51], the VGG16 is pre-trained on the NIH dataset to determine which disease is most similar to COVID-19. Viral pneumonia showed a high similarity to COVID-19 because the VGG16 has a lower misclassification rate with viral pneumonia.

  3.

    Differentiating COVID-19 from Viral Pneumonia COVID-19 is similar to various types of viral pneumonia. Thus, distinguishing between COVID-19 and other viral pneumonia is a more critical challenge for multiclass classification studies than for binary classification studies. In [73], the multiclass classification model is used to classify COVID-19, common pneumonia, and normal cases. The dataset does not contain any viral pneumonia cases, which makes the classification task more straightforward.

  4.

    Several Images from Various Viewpoints Certain datasets contain several lung images from different points of view, and some images do not show the lungs at all. In [29], a sequence of 2D images from multiple points of view of the same patient is combined to form a single 3D image. In [48], after segmenting the lung, they deleted the images in which the size of the lung was smaller than 20% of the size of the body part.
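
The slice-filtering rule described for [48] amounts to comparing two mask areas. A minimal sketch, assuming that a lung mask and a body mask are already available for each slice as 2D lists of booleans; the function names and the handling of empty body masks are assumptions of the example.

```python
def mask_area(mask):
    """Count the pixels set in a 2D boolean mask."""
    return sum(1 for row in mask for value in row if value)


def keep_slice(lung, body, min_ratio=0.20):
    """Keep a slice only if the lung covers at least `min_ratio` of the body area."""
    body_area = mask_area(body)
    if body_area == 0:
        return False  # no body visible, so the slice carries no lung information
    return mask_area(lung) / body_area >= min_ratio
```

Slices near the top or bottom of a CT volume, where little or no lung tissue is visible, would be discarded by this rule before classification.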

5 Conclusion and future trends

This paper reviewed numerous studies that developed DL diagnostic tools to enhance the current COVID-19 testing methods. Various DL models have demonstrated promising results in the diagnosis of COVID-19. However, most of these models have not been tested in a real-world environment, raising immediate concerns about the difficulty of using them in clinical practice. As a result, further research is needed to develop a benchmark framework for validating and comparing the existing methodologies.

The COVID-19 virus mutates rapidly across geographical boundaries [28]. Xu et al. [82] discovered that COVID-19 patients living outside of Wuhan had milder illnesses and less prominent laboratory abnormalities than those living in Wuhan. Accordingly, data acquired from one region may not be suitable for drawing inferences about another region [28]. Therefore, collaboration across all medical organizations worldwide is critical to expanding the existing dataset and accelerating the DL research of COVID-19. Merging existing datasets from multiple countries should lead to a more accurate diagnosis of COVID-19.

Most of the publicly available COVID-19 datasets are limited in size. Therefore, transfer learning of pre-trained CNN models or the CapsNet model is a promising future research direction that may aid in the detection of anomalies in small datasets and yield remarkable results. Additional public datasets of chest CT and X-ray images can be gathered and published for future use. The performance of DL models cannot be enhanced in the absence of high-quality data. As our review shows, the binary classification of chest CT and X-ray images achieved high performance, whereas the multiclass classification of CT images needs to be improved. Training DL models on images of COVID-19 and other viral pneumonia would produce models that can distinguish between COVID-19 and similar viral pneumonia in seconds based on learned features. The accuracy and sensitivity of the current multiclass classification models are lower than those of the binary classification models.

COVID-19 imaging resembles that of other chest diseases. To validate COVID-19 diagnoses, we can use the existing benchmark dataset of chest radiographs to distinguish COVID-19 from other chest diseases.

SARS-CoV-2, the virus that causes COVID-19, spreads very quickly and changes constantly, and variants of this virus can be even more fatal. Thus, generalizing deep-learning models to distinguish between the various strains of COVID-19 is important.

Based on the survey findings, deep-learning models may be employed to categorize the various stages of COVID-19 infection, such as pre-symptomatic, asymptomatic, mild, and severe.

A multiclass classification model with high performance metrics is needed. The paper outlined the main challenges that the researchers faced when developing a DL classification model to detect COVID-19 from CT and X-ray images. The issues and proposed solutions will assist researchers who aim to improve COVID-19 detection from CT and X-ray images.