In this section, we present the methodology for COVID-19 detection from chest X-ray images. We detail the main datasets and briefly describe COVID-Net (Wang et al. 2020), our baseline method. We also describe the employed deep learning techniques, as well as the learning methodology and evaluation.
Datasets
RSNA pneumonia detection challenge dataset
The RSNA Pneumonia Detection Challenge (Radiological Society of North America 2020) is a competition that aims to locate lung opacities on chest radiographs. Pneumonia is associated with opacity in the lungs, and conditions such as pulmonary edema, bleeding, volume loss, and lung cancer can also lead to opacity on lung radiographs. Finding patterns associated with pneumonia is a hard task. For that reason, the Radiological Society of North America (RSNA) promoted the challenge, providing a rich dataset. Although the RSNA challenge is a segmentation challenge, here we use the dataset for a classification problem. The dataset offers images for two classes: normal and pneumonia (non-normal). We use a total of 16,680 images from this dataset, of which 8066 are from the normal class and 8614 from the pneumonia class.
COVID-19 image data collection
The “COVID-19 Image Data Collection” (Cohen et al. 2020) is a collection of anonymized COVID-19 images acquired from websites of medical and scientific associations (Giovagnoni 2020; Società Italiana di Radiologia Medica e Interventistica 2020) and from research papers. The dataset was created by researchers from the University of Montreal with the help of the international research community to ensure that it is continuously updated. Currently, the dataset includes more than 183 X-ray images of patients affected by COVID-19 and other diseases, such as MERS, SARS, and ARDS. The dataset is public and also includes CT scan images. According to the authors, the dataset can be used to assess the advancement of COVID-19 in infected individuals and to identify patterns related to COVID-19, helping to differentiate it from other types of pneumonia. Moreover, chest X-ray (CXR) images can be used as an initial screening step in the COVID-19 diagnostic process. So far, most of the images are from male individuals (approx. 60/40% males and females, respectively), and the age group that concentrates most cases is from 50 to 80 years old. The dataset has four views: posteroanterior (PA), anteroposterior (AP), AP supine, and lateral (L). There are images from the same subject with different views and from different acquisition sessions.
COVIDx dataset
In Wang et al. (2020), a new dataset is proposed by merging two other public datasets: the “RSNA Pneumonia Detection Challenge dataset” and the “COVID-19 Image Data Collection.” The new dataset, called COVIDx, is designed for a classification problem and comprises three classes: normal, pneumonia, and COVID-19. Most instances of the normal and pneumonia classes come from the “RSNA Pneumonia Detection Challenge dataset,” and all instances of the COVID-19 class come from the “COVID-19 Image Data Collection.” The dataset has a total of 13,800 images from 13,645 individuals and is split into two partitions, one for training and one for testing (model evaluation). The distribution of images between the partitions is shown in Table 1, and the source code to reproduce the dataset is publicly available (https://github.com/lindawangg/COVID-Net). The image resolution ranges from 156 × 157 to 4032 × 3024 pixels.
Table 1 COVIDx image distribution among classes and partitions. The dataset is proposed in Wang et al. (2020)
HCV-UFPR COVID-19 dataset
Brazil is one of the countries most affected by COVID-19, with over 6 million confirmed cases to date. The Hospital da Cruz Vermelha in Curitiba, located in the state of Paraná, southern Brazil, received and documented some of those cases (see Fig. 2). The collection consists of 281 X-ray images of individuals infected with COVID-19 and 232 of individuals who tested negative (not infected). All images have three eight-bit color channels (RGB), and the resolution ranges from 2974 × 2612 to 4248 × 3480 pixels. The images are labeled in two classes, COVID-19 and non-COVID, and there are no annotations regarding the view angle. The dataset is private but can be made available upon request.
EfficientNet
EfficientNet (Tan and Le 2019) is in fact a family of models built upon the baseline network described in Table 2. This base architecture (B0) was found with the aid of a network architecture search (NAS) method.
Table 2 EfficientNet baseline network: B0 architecture
Its main component (or block) is the Mobile Inverted Bottleneck Convolution (MBConv) block, introduced in Sandler et al. (2018) and depicted in Fig. 3.
The rationale behind the EfficientNet family is to start from a high-quality yet compact baseline model and uniformly scale each of its dimensions systematically with a fixed set of scaling coefficients. Formally, an EfficientNet is defined by three dimensions: (i) depth, (ii) width, and (iii) resolution, as illustrated in Fig. 4.
Starting from the baseline model in Table 2, each dimension is scaled by the compound coefficient ϕ according to Eq. 1
$$ \begin{aligned} \text{depth} &= \alpha^{\phi} \\ \text{width} &= \beta^{\phi} \\ \text{resolution} &= \gamma^{\phi} \\ \text{s.t. } & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2 \\ & \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1 \end{aligned} $$
(1)
where α, β, and γ are constants obtained by a grid search conducted in Tan and Le (2019). As stated in Tan and Le (2019), Eq. 1 provides a good balance between performance and computational cost. The coefficient ϕ controls the available resources, and Eq. 1 determines the increase or decrease in model FLOPs when depth, width, and resolution are modified.
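To make the scaling rule concrete, the short Python sketch below computes the multiplicative factors of Eq. 1 for a given compound coefficient. The constants are the grid-search values reported in Tan and Le (2019); the snippet is only illustrative of how the rule is applied.

```python
# Illustrative sketch of Eq. 1 (compound scaling).
# ALPHA, BETA, GAMMA are the constants reported in Tan and Le (2019);
# phi is the compound coefficient that controls the available resources.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-search values, ALPHA * BETA**2 * GAMMA**2 ~ 2

def compound_scaling(phi):
    """Return the factors applied to the baseline (B0) depth, width, and resolution."""
    depth = ALPHA ** phi        # number of layers
    width = BETA ** phi         # number of channels
    resolution = GAMMA ** phi   # input image resolution
    return depth, width, resolution

# phi = 1 roughly doubles the FLOPs with respect to B0:
print(compound_scaling(1))  # -> (1.2, 1.1, 1.15)
```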
Architectures B1 to B7 are derived from architecture B0. Using the same methodology (network architecture search), more blocks were added on top of the B0 model, making it deeper and wider. The new efficient models found during the search were labeled B1 to B7.
Notably, in Tan and Le (2019), a model from the EfficientNet family was able to beat the powerful GPipe network (Huang et al. 2019) on the ImageNet dataset (Russakovsky et al. 2015) with 8.4× fewer parameters while being 6.1× faster.
Hierarchical classification
In classification problems, it is common to have some sort of relationship among classes. Very often, in real problems, the classes (the categories of the instances) are organized hierarchically, like a tree structure. According to Silla and Freitas (2011), there are three types of classification: flat classification, which ignores the hierarchy of the tree; local classification, in which there is a set of classifiers for the tree (one classifier per node or per level); and, finally, global classification, in which a single classifier is built with the ability to classify any node of the tree, not only the leaves.
The most popular type of classification in the literature is the flat one. However, here we propose the use of local classification, which we call hierarchical classification. Thus, the target classes are located at the leaves of the tree, and at the intermediate nodes we have classifiers. In this work, we need two classifiers: one at the root node, dedicated to discriminating between the Normal and Pneumonia classes, and another one at the next level, dedicated to discriminating between pneumonia types. The problem addressed here can be mapped to the topology depicted in Fig. 5, in which there are two levels of classification. To infer the class of a new instance, the instance is first presented to the first classifier (at the root node). If it is predicted as “Normal,” the inference ends there. If the instance is considered “Pneumonia,” it is then presented to the second classifier, which discerns whether the pneumonia is caused by COVID-19 or not.
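For illustration, a minimal sketch of this two-level inference is given below. It assumes two trained binary Keras classifiers, here called level1_model and level2_model (hypothetical names), and assumes that class index 0 corresponds to “Normal” at the first level and to “COVID-19” at the second.

```python
import numpy as np

def hierarchical_predict(image, level1_model, level2_model):
    """Two-level inference: Normal vs. Pneumonia, then COVID-19 vs. non-COVID pneumonia."""
    # Level 1 (root node): Normal vs. Pneumonia
    level1_probs = level1_model.predict(image[np.newaxis, ...])[0]
    if np.argmax(level1_probs) == 0:          # index 0 assumed to be "Normal"
        return "Normal"
    # Level 2: pneumonia caused by COVID-19 or not
    level2_probs = level2_model.predict(image[np.newaxis, ...])[0]
    return "COVID-19" if np.argmax(level2_probs) == 0 else "Pneumonia (non-COVID)"
```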
Training
Deep learning models are complex and therefore require a large number of instances to avoid overfitting, i.e., the situation in which the learned network performs well on the training set but underperforms on the test set. Unfortunately, for most real-world problems, data is not abundant. In fact, there are few scenarios with abundant training data, such as ImageNet (Russakovsky et al. 2015), which has more than 14 million images of 21,841 classes/categories. To overcome this issue, researchers rely on two techniques: data augmentation and transfer learning. We also detail here the proposed models, based on EfficientNet.
Image pre-processing and data augmentation
Several pre-processing techniques may be used for image cleaning, noise removal, outlier removal, etc. The only pre-processing applied in this work is a simple intensity normalization of the image pixels to the range [0, 1]. In this manner, we rely on the filters of the convolutional network itself to perform any data cleaning. Also, all images are resized according to the architecture resolution parameter (see Table 2).
Data augmentation consists of expanding the training set with transformations of the images in the dataset (Goodfellow et al. 2016), provided that the semantic information is not lost. In this work, we apply three transformations: rotation (0 to 15 degrees, clockwise or anticlockwise), zoom (range of 0–20%), and horizontal flipping (50% chance), as such transformations would not hinder, for example, a physician's interpretation of the radiograph. Figure 6 presents an example of the applied data augmentation. All, some, or none of the transformations may be applied/combined according to a probability.
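This augmentation policy can be reproduced, for instance, with the Keras ImageDataGenerator; the snippet below is a sketch of that configuration under our stated settings (the data directory and target size are hypothetical and depend on the chosen architecture, see Table 2).

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sketch of the pre-processing and augmentation described above:
# intensity normalization to [0, 1], rotation up to 15 degrees,
# zoom up to 20%, and random horizontal flipping.
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,     # pixel intensities mapped to [0, 1]
    rotation_range=15,     # up to 15 degrees, clockwise or anticlockwise
    zoom_range=0.2,        # up to 20% zoom in/out
    horizontal_flip=True,  # 50% chance of flipping
)

# Hypothetical usage; target_size follows the resolution of the chosen base model.
# train_flow = augmenter.flow_from_directory("data/train", target_size=(224, 224), batch_size=32)
```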
Proposed family of models
The EfficientNet family provides models with high performance and low computational cost. Since this research aims to find efficient models capable of being embedded in conventional smartphones, the EfficientNet family is a natural choice. We explore the EfficientNets by adding more operator blocks on top of them. More specifically, we add four new blocks, as detailed in Table 3. Thus, we propose six new architectures, varying the base model. We add the suffix “X” to differentiate the proposed architectures from the original EfficientNet base architectures.
Table 3 Proposed family architectures, considering one EfficientNet model as the base model (NC = number of classes)
Since the original EfficientNets were built for a different classification problem, we add new fully connected (FC) layers responsible for the last steps of the classification process. We also use batch normalization (BN), dropout, and the swish activation function, for the following reasons.
Batch normalization constrains the output of the previous layer to a range, forcing zero mean and unit standard deviation. This acts as regularization, increasing the stability of the neural network and accelerating training (Ioffe and Szegedy 2015).
Dropout (Srivastava et al. 2014) is perhaps the most powerful regularization method. Its practical effect is to emulate a bagged ensemble of multiple neural networks by inhibiting a few neurons, at random, for each mini-batch during training. The fraction of inhibited neural units is defined by the dropout parameter, which ranges between 0 and 100%.
The most popular activation function is the rectified linear unit (ReLU), formally defined as f(x) = max(0, x). However, in the added blocks, we opted for the swish activation function (Ramachandran et al. 2017), defined as:
$$ f(x) = x \cdot \left(1 + e^{-x}\right)^{-1} $$
(2)
Differently from ReLU, the swish activation produces a smooth curve during loss minimization when a gradient descent algorithm is used. Another advantage of swish over ReLU is that it does not zero out small negative values, which may still be relevant for capturing patterns underlying the data (Ramachandran et al. 2017).
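To give an idea of how these components fit together, the snippet below sketches an EfficientNet-B0 base topped with FC/BN/dropout blocks using swish, in the spirit of Table 3. The layer sizes and dropout rate are illustrative placeholders, not the exact values of the proposed X architectures.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0

NC = 3  # number of classes (normal, pneumonia, COVID-19)

# Base model without its original classification top.
base = EfficientNetB0(include_top=False, weights="imagenet", pooling="avg")

# Illustrative head: fully connected blocks with BN, dropout, and swish activation.
head = models.Sequential([
    layers.BatchNormalization(),
    layers.Dense(512, activation=tf.nn.swish),  # swish: f(x) = x / (1 + exp(-x))
    layers.Dropout(0.5),
    layers.BatchNormalization(),
    layers.Dense(128, activation=tf.nn.swish),
    layers.Dropout(0.5),
    layers.Dense(NC, activation="softmax"),     # final classification layer
])

model = models.Sequential([base, head])
```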
Transfer learning
Instead of training a model from scratch, one can take advantage of using the weights from a pre-trained network and accelerate or enhance the learning process. As discussed in Oquab et al. (2014), the initial layers of a model can be seen as feature descriptors for image representation, and the latter ones are related to instance categories. Thus, in many applications, several layers can be reused. The task of transfer learning then defines how and what layers of a pre-trained model should be used. This technique has proved to be effective in several computer vision tasks, even when transferring weights from completely different domains (Goodfellow et al. 2016; Luz et al. 2018).
The steps for transfer learning are:
1. Copy the weights from a pre-trained model to a new model;
2. Modify the architecture of the new model to adapt it to the new problem, possibly including new layers;
3. Initialize the new layers;
4. Define which layers will go through the new learning process;
5. Train (update the weights according to the loss function) with a suitable optimization algorithm.
We apply transfer learning to EfficientNets pre-trained on the ImageNet dataset (Russakovsky et al. 2015). The ImageNet domain is clearly much broader than the chest X-ray domain addressed in this work. Thus, the imported network weights are taken only as an initial solution, and all of them (i.e., the weights from all layers) are fine-tuned by the optimizer during the new training phase. The rationale is that the imported models already encode knowledge about all sorts of objects; by permitting all weights to be fine-tuned, we allow the model to specialize to the problem at hand. In the training phase, the weights are updated with the Adam optimizer and a schedule rule that decreases the learning rate by a factor of 10 in the event of stagnation (“patience = 2”). The learning rate starts at 10−4, and the number of epochs is fixed at 20.
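A minimal sketch of this fine-tuning setup is shown below. It assumes the model and data generators from the previous snippets (hypothetical names) and reflects the reported hyperparameters: Adam with an initial learning rate of 10−4, learning-rate reduction by a factor of 10 after two stagnant epochs, and 20 epochs.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# All layers remain trainable, so the whole network is fine-tuned.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Divide the learning rate by 10 when the validation loss stagnates for 2 epochs.
lr_schedule = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2)

# Hypothetical generators from the augmentation sketch:
# model.fit(train_flow, validation_data=val_flow, epochs=20, callbacks=[lr_schedule])
```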
Model evaluation and metrics
The final evaluation is carried out with the COVIDx dataset, and since the COVIDx comprises a combination of two other public datasets, we follow the script (https://github.com/lindawangg/COVID-Net) provided in Wang et al. (2020) to load the training and test sets. The data is then distributed according to Table 1.
In this work, three metrics are used to evaluate models: accuracy (Acc), COVID-19 sensitivity (SeC), and COVID-19 positive prediction (+PC), i.e.,
$$ \begin{aligned} Acc &= \frac{TP_N + TP_P + TP_C}{\#\,\text{samples}} \\ Se_C &= \frac{TP_C}{TP_C + FN_C} \\ +P_C &= \frac{TP_C}{TP_C + FP_C} \end{aligned} $$
(3)
wherein TPN, TPP, and TPC stand for the correctly classified normal, non-COVID-19 pneumonia, and COVID-19 samples, respectively; FNC stands for the COVID-19 samples classified as normal or non-COVID-19 pneumonia; and FPC stands for the normal and non-COVID-19 pneumonia samples classified as COVID-19. The number of multiply-accumulate (MAC) operations is used to measure computational cost.
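For clarity, the sketch below shows how Eq. 3 can be computed from integer-label predictions; the class index assigned to COVID-19 is an assumption of the example.

```python
import numpy as np

def covid_metrics(y_true, y_pred, covid_label=2):
    """Compute Acc, SeC, and +PC (Eq. 3) from arrays of integer labels.
    `covid_label` is the (assumed) index of the COVID-19 class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)                                   # correctly classified / #samples
    tp_c = np.sum((y_true == covid_label) & (y_pred == covid_label))  # COVID-19 true positives
    fn_c = np.sum((y_true == covid_label) & (y_pred != covid_label))  # missed COVID-19 samples
    fp_c = np.sum((y_true != covid_label) & (y_pred == covid_label))  # false COVID-19 alarms
    se_c = tp_c / (tp_c + fn_c) if (tp_c + fn_c) else 0.0
    p_c = tp_c / (tp_c + fp_c) if (tp_c + fp_c) else 0.0
    return acc, se_c, p_c
```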
Experiments and discussion
In this section, we present the dataset setup, experimental settings, and results, which are threefold: (i) flat vs. hierarchical approaches, (ii) an ablation study, and (iii) a cross-dataset evaluation. Finally, we discuss the results. The computational experiments were conducted on an Intel(R) Core(TM) i7-5820K CPU @ 3.30 GHz with 64 GB of RAM, two Titan X GPUs with 12 GB each, and the TensorFlow/Keras framework for Python.
Dataset setup 1
Three different training set configurations were analyzed with the COVIDx dataset: (i) Raw Dataset, the raw dataset without any pre-processing; (ii) Raw Dataset + Data Augmentation, the raw dataset with 1000 new images generated by data augmentation on COVID-19 samples and a limit of 4000 images for the two remaining classes; and (iii) Balanced Dataset, the dataset with 1000 images per class, obtained by data augmentation on COVID-19 samples and by under-sampling the other two classes to 1000 samples each. Learning with an unbalanced dataset could bias the prediction model towards the classes with more samples, leading to inferior classification models.
In this work, we evaluate two scenarios: flat and hierarchical. Regardless of the scenario, the three training sets remain the same (Raw, Raw + Data Augmentation, and Balanced). For the hierarchical case, however, there is an extra process that splits each set into two parts: in the first part, the instances of the pneumonia and COVID-19 classes are joined and receive the same label (Pneumonia); in the second part, the instances of the normal class are removed, leaving only instances of the Pneumonia and COVID-19 classes. Thus, two classifiers are built for the hierarchical case, each working with a different set of data (see Section 4.3 for more details), as sketched in the snippet below. The ablation study also uses this dataset setup.
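A minimal sketch of this relabeling/splitting step is shown below, assuming the image list is held in a pandas DataFrame with “path” and “label” columns (a hypothetical structure used only for illustration).

```python
import pandas as pd

def hierarchical_splits(df: pd.DataFrame):
    """Build the two training sets used by the hierarchical classifiers."""
    # Level 1: pneumonia and COVID-19 instances receive the same label ("pneumonia").
    level1 = df.copy()
    level1["label"] = level1["label"].replace({"COVID-19": "pneumonia"})
    # Level 2: normal instances are removed; only pneumonia vs. COVID-19 remain.
    level2 = df[df["label"] != "normal"].copy()
    return level1, level2
```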
Dataset setup 2
In order to assess the impact of learning a model on one data distribution and evaluating it on another, the COVIDx dataset is used only for training/validation, and the HCV-UFPR COVID-19 dataset is entirely reserved for testing. This scenario, the cross-dataset evaluation, is closer to reality, since in a real-world situation the models must face samples acquired from different sensors, individuals, and environments.
Experimental settings and results
Flat vs hierarchical
We evaluate four families of convolutional neural networks on Dataset Setup 1: EfficientNet, MobileNet, VGG, and ResNet. Our method uses the EfficientNet architectures (B0–B5) as base building blocks, with four custom blocks inserted on top, as detailed in the Methodology section. We call these new architectures B0-X, B1-X, B2-X, B3-X, B4-X, and B5-X. Their features are summarized in Table 4. Among the presented models, we highlight the low footprint of the MobileNet- and EfficientNet-based models.
Table 4 Base models footprint details (Mb = megabytes)
Regarding the base models (B0–B5 of the EfficientNet family), the simplest one is EfficientNet-B0. Thus, we assess the impact of the different training sets and the two forms of classification (flat and hierarchical) on our model derived from B0 (B0-X). The results are shown in Table 5.
Table 5 EfficientNet B0-X results over the three proposed training sets (Acc = accuracy; SeC = COVID-19 sensitivity; +PC = COVID-19 positive prediction)
Since there are more pneumonia and normal X-ray samples than COVID-19 samples, the learning process tends to favor the classification of the majority classes, because they have more weight in the loss calculation. This may explain the results obtained by balancing the data. As described in Section 4.3, the hierarchical approach is also evaluated here. First, the COVID-19 and common pneumonia classes are combined and presented to the first classification level (normal vs. pneumonia). At the second level, another model classifies instances into pneumonia caused by COVID-19 or by other causes.
Table 5 shows that better results are achieved with the flat approach on balanced data. This scenario is therefore used to evaluate the remaining base architectures. The training loss for this scenario is presented in Fig. 7.
The results of all evaluated architectures are summarized in Table 6. We stress that we adapted all architectures by placing the same four blocks on top. All networks show comparable performance in terms of accuracy. However, the more complex the model, the worse its performance on the minority class, the COVID-19 class.
Table 6 Results with different network architectures as the base model. Best scenario for COVID-19: all experiments with a balanced training set and flat classification (Acc = accuracy; SeC = COVID-19 sensitivity; +PC = COVID-19 positive prediction)
The cost of a model is related to its number of parameters: the higher the number of parameters, the more data the model needs to adjust them. Thus, we hypothesize that the lack of a larger dataset may explain the difficulties faced by the more complex models.
Table 7 presents a comparison of the proposed approach and the one proposed by Wang et al. (2020) (COVID-Net) under the same evaluation protocol. Even though the accuracy is comparable, the proposed approach improves positive prediction without losing sensitivity. Moreover, a significant reduction in both memory (our model is 15 times smaller) and latency is observed. It is worth highlighting that Wang et al. (2020) apply data augmentation to the dataset, but their manuscript does not make clear how many new images are created.
Table 7 Comparison of the proposed approach against the state of the art (Acc = accuracy; SeC = COVID-19 sensitivity; +PC = COVID-19 positive prediction)
COVID-Net (Wang et al. 2020) is a very complex network: it demands 2.1 GB of memory (for the smaller model) and performs over 3.5 billion MAC operations, implying three main drawbacks: computational cost, time consumption, and infrastructure cost. A model with 3.59 billion MAC operations takes far more time and computation than an 11.5 million MAC model (nearly 300 times more), and the GPU necessary to run one COVID-Net model can run more than 15 instances of the proposed approach (B3-X, flat approach) while keeping comparable (or even better) figures. The efficiency gains are even greater with B0-X, with a small trade-off in the sensitivity metric. Such complexity can hinder future use of the model, for instance, on mobile phones or ordinary desktop computers (without a GPU).
Ablation study
In order to customize the network architectures to best suit the problem, we propose the addition of new blocks on top of the networks. To assess the effectiveness of the proposal, we performed an ablation study, training the B3-based architecture with and without the proposed blocks under the same conditions (same batches of data, same hyperparameters, and same random seed). The study showed that the inclusion of the four proposed blocks improves the overall accuracy of the model from 91.77 to 93.0%. The inclusion of the blocks also allows a better trade-off between the SeC and +PC metrics: with the proposed blocks, +PC increased from 74.19 to 100%, while SeC dropped from 100 to 96.8%.
Cross-dataset evaluation
Cross-dataset evaluation is of paramount importance to ascertain the generalization power of the model with respect to variations in the images (due to different equipment and sensors). Thus, Dataset Setup 2 is used for this purpose. Table 8 summarizes the experimental results. The proposed approach (B3-X) outperforms COVID-Net (CXR Large version), proving to be more robust than the other evaluated approaches.
Table 8 Comparison of the proposed approach against the state of the art on Dataset Setup 2: a cross-dataset evaluation on the HCV-UFPR COVID-19 dataset (Acc = accuracy; SeC = COVID-19 sensitivity; +PC = COVID-19 positive prediction)
Discussion
In Fig. 8, we present two X-ray images of COVID-19-infected individuals. These images are from the test set and, therefore, were not seen by the model during training. According to studies (Ng et al. 2020), COVID-19 infection can be observed as opacities (white spots) on chest radiography images. In the first row of Fig. 8, one can see a correctly classified image and its respective activation maps generated by our model. The activation map matches the opaque regions in the image, which may correspond to the presence of the disease. For the image in the second row, the model failed to find the opaque regions, and the activation map highlights non-opaque areas.
In Fig. 9, the confusion matrices of the flat and hierarchical approaches are presented. The hierarchical model classifies the normal class better, though it also shows a noticeable reduction in sensitivity and positive prediction for the COVID-19 class. One hypothesis is that the Pneumonia and COVID-19 classes are similar (both are kinds of pneumonia) and share key features; thus, the lack of normal images at the second classification level reduces the diversity of the training set, interfering with model training. Besides, the computational cost is twice that of flat classification, since two models are required. However, we believe the hierarchical approach has a key advantage: it suffers less from bias in the dataset/protocol. Maguolo and Nanni (2020) present a critical evaluation of the test protocols and databases of methods aiming at classifying COVID-19 in X-ray images. According to them, the considered datasets are mostly composed of images from different distributions and different databases, and this may lead deep learning models to learn patterns related to the image acquisition process instead of focusing only on disease patterns.
In the first stage of the hierarchical classification, images related to COVID-19 and non-COVID pneumonia are given the same label. Thus, images from different datasets are combined, which forces the method to disregard patterns related to the acquisition process or sensors at the first classification stage. An example of the hierarchical model application can be seen in Fig. 10. The confusion matrix of the first stage (Fig. 10, left) shows that the model classifies most instances correctly; for that reason, we believe it has focused on the patterns that help discriminate among the different types of pneumonia.
Results of the ablation study showed that the inclusion of the additional blocks significantly improved the trade-off between SeC and + PC, increasing the total accuracy of the model, which justifies the increase in computational cost.
Regarding the cross-dataset evaluation, the results showed that even models considered state-of-the-art suffer from variations caused by differences in sensors, equipment, and acquisition protocols. These findings reveal that, in order to have a model able to work in the field, one must train (or adjust) the model with representative local data.
Findings and future direction
We summarize our findings as follows.
- An efficient family of models with low computational cost was proposed to detect COVID-19 in chest X-ray images. Even with only a few images of the COVID-19 class, promising results were obtained, with a sensitivity of 90% and a positive prediction of 100% under the evaluation protocol proposed in Wang et al. (2020).
- Regarding the hierarchical analysis, we conclude that there are significant gains that justify its use in the present task. We believe it suffers less from the bias present in the evaluation protocols, as discussed in Maguolo and Nanni (2020).
- The proposed network blocks, placed on top of the base models, proved to be very effective for the CXR classification problem, in particular for CXR images related to COVID-19.
- The evaluation protocol proposed in Wang et al. (2020) is based on the public dataset “COVID-19 Image Data Collection” (Cohen et al. 2020), which is being expanded by the scientific community. With more images from the COVID-19 class, it will be possible to improve the training; however, the test partition also tends to become more challenging. For the sake of reproducibility and future comparison of results, our code is available at https://github.com/ufopcsilab/EfficientNet-C19.
- The cross-dataset evaluation showed that even models considered state-of-the-art for detecting COVID-19 in X-ray images have their performance severely degraded by variations in the images caused by differences in sensors or acquisition protocols. To overcome this issue and increase generalization power, models should be re-trained or fine-tuned on more representative data.
- The Internet of Medical Things (IoMT) (Joyia et al. 2017) is now a hot topic in industry. However, Internet access can be a major limitation for medical equipment, especially in poor countries. Our proposal is to move towards a model that can be fully embedded in conventional smartphones (edge computing), eliminating the need for the Internet or cloud services. In that sense, the model achieved in this work requires only 55 Mb of memory and has a viable inference time for a conventional cell phone processor.