1 Introduction

1.1 Brief introduction about COVID-19 framework

In December 2019, a new disease, COVID-19, was identified in Wuhan, China. This disease, caused by the SARS-CoV-2 virus, was declared a pandemic by the WHO in March 2020; so far, more than 413 million positive cases and more than 5.8 million deaths have been reported worldwide. SARS-CoV-2 causes complications such as acute respiratory disorders, cardiac problems and secondary infections in a fairly high proportion of patients. Early identification of infected individuals is of vital importance, not only to provide medical care but also to cut off the chain of disease transmission [1].

So far, the gold standard for diagnosis is the identification of viral RNA by reverse transcription-polymerase chain reaction (RT-PCR). The capacity to perform this type of testing in the early months of the pandemic was limited in most countries [2]. One disadvantage of PCR is the time it takes to provide results, in addition to its low effectiveness when the viral load is still low during the first three or four days after infection [3]. Moreover, sampling introduces very high variability, depending on the sampling site, the expertise of the personnel and the viral load of the person at the time [4]. Therefore, investigating other approaches to diagnosing and monitoring damage to the respiratory system is an important task [5].

In the case of COVID-19, lung imaging helps monitor the evolution of the disease and assists human specialists in decision making, since the lungs are among the most commonly damaged organs. In fact, several studies confirm visible lung abnormalities in patients with this disease [6]. However, the use of these techniques as a diagnostic method has shown low sensitivity and specificity in current radiological practice [7]. For example, in the work of [8], a sensitivity of 57% was achieved when using chest X-ray (CXR) imaging to detect SARS-CoV-2 pneumonia, while in the work of [9] the reported sensitivity was 65%. These values demonstrate how difficult it is for radiologists to diagnose COVID-19 from CXR directly.

As with PCR, the study of lung radiographs can be very beneficial for saving time in the proper management of positive patients. Specifically, CXR imaging can be used to obtain relatively immediate diagnostic information. Moreover, such equipment is available in almost all medical settings and is portable, allowing these images to be acquired quickly and at relatively low cost.

1.2 CXR imaging as a diagnostic method

Multiple studies have been published claiming the possibility of diagnosing COVID-19 from chest radiographs using machine learning techniques with very high accuracy [10] or of quantifying severity [11, 12]. Sensitivity values of 100% in automatic classification have even been reported [13, 14]. However, other studies demonstrate the models' lack of generalization: performance drops notably when classifying images that do not come from the same distribution (out-of-distribution, ood) as the training data [15,16,17,18,19,20,21]. In other words, these models suffer from the inherent drawback of low generalization capability, derived from the sparse labeled COVID-19 data [22]. As a result, most of the proposed models are affected by shortcut learning: instead of focusing on radiographic COVID-19 findings, they rely on features unrelated to the pathologies they are trying to classify. A review of the effect of shortcut learning on COVID-19 identification appears in [23]. That study evidenced that, in works applying explainable artificial intelligence (XAI) methods, the regions most influential to the deep learning (DL) models' decisions were outside the lung region.

One strategy to address shortcut learning and the lack of generalization is to use a larger number of images in training. However, although a multitude of COVID-19 CXRs are performed daily worldwide, the available datasets remain limited due to the lack of expert labels and the difficulty of sharing patient data outside the hospital for privacy reasons. The situation is further exacerbated by the pandemic itself, which makes collaboration between hospitals in different countries difficult. As an alternative, several works agree on applying lung segmentation to eliminate biases that may be contained in the images, such as those related to textual labels [24]. In fact, the study of [19] demonstrated that when the whole image is used, the textual labels turn out to be among the regions that most influence decision making.

On the other hand, lung segmentation does not guarantee that the models actually focus on the right regions. Even after applying histogram equalization to homogenize gray level intensities in CXR images, there is a high risk that the algorithms will rely on spurious features that are not related to the disease [15].

1.3 Validation of models from external assessment sets

Studies agree that one strategy to eliminate possible biases in CXR image sets is to use images containing only the segmented lung region [25]. In this way, the textual labels, which have proven so decisive for classification, are eliminated, since in reality they carry no information about class membership. Likewise, applying image processing techniques to homogenize the images prior to classification can alleviate acquisition-related bias.

The validity of the models obtained so far has been challenged by recent work in which validation is performed with ood images [15,16,17,18, 20, 21]. In these cases, performance indices on the ood set decrease considerably compared to results obtained on sets from the same distribution (independent and identically distributed, iid). These studies, although few in number, have begun to make researchers aware of the need to use external validation sets to determine the validity of their models. In fact, an indispensable requirement for deploying any COVID-19 identification system in a clinical setting is that it possess stable generalization capability over unseen images.

However, special attention must be paid to how the set of ood images is composed, since, due to the abundant naming ambiguity and overlap between current sets, a set that is not really ood could end up being used to evaluate the models. This occurred in the work of [26], where there was an overlap between the images used in training and the external validation set. This can be verified by carefully analyzing the composition of the sets, taking as a reference the work of [27], which describes the provenance of many of the sets currently used in the scientific literature to identify COVID-19 through CXR. In addition, the pneumonia/normal class dataset was based on a pediatric dataset (patient age 1–5 years), as in other works such as [28, 29]. In contrast, the mean age of the COVID-19 class was over 40 years. Looking at the pneumonia images, it is evident that the sizes of the rib cage and thoracic structures in the pneumonia dataset differ from the COVID-19 cases due to the age difference. It is likely that these studies used age-related features, rather than pathological findings, to differentiate pneumonia/normal cases from COVID-19 cases.

The aim of this work is to evaluate the effectiveness of a new model for identifying COVID-19 from CXR, following the recommendations collected in the scientific literature. For this purpose, possible biases in the image sets are eliminated by segmenting the lung region. In addition, a new patch-based training methodology, not previously reported, is proposed. The problem of using the same source for the training and test sets during model evaluation is also analyzed. Finally, a new balanced set of CXR images across classes, of common origin and belonging to local patients, is released for research purposes.

1.4 Contributions of the paper

- A new methodology is proposed to address the problem of automatic COVID-19 classification from CXR images, taking into account many of the recommendations of other research. For example, segmentation of the lung region is applied to eliminate any bias in the image set related to the textual labels on the radiographs. Also, a lung field partitioning structure is proposed to give the network the ability to work with higher resolution images.

- A rigorous evaluation of the proposed models from external image sets is performed to determine the generalization power of the proposed model.

- A new set of images containing radiographs from COVID-19 positive patients is released and made available to the international scientific community for research purposes. This contributes to a collection of images of diverse origin that makes proposed algorithms more robust and improves the validation of existing ones.

2 Materials and methods

2.1 Image set description

Three sets of images were used in this work. The first, which will be called Data 1, contains a total of 1613 images. The class distribution of the three databases is shown in Table 1. Of the COVID-19 positive images, 276 were obtained from COVIDx [28]; the rest belong to autochthonous hospitals (national repository) and are private for the moment. The pneumonia and normal images belong to COVIDx.

Table 1 Description of the image sets used

The second set of images (Data 2, part of which is released at https://github.com/jportal93/De-COVID-Rx) consists of 2040 images distributed across the three classes. These all come from the same analog X-ray machine, belonging to Hospital A (autochthonous). A detailed description of Data 2 appears below. The third set (Data 3, also released) contains images from portable digital X-ray equipment, divided into 113 COVID-19 (Hospital A), 113 pneumonia (Hospital B, also autochthonous) and 113 normal (Hospitals A and B).

Data 2 was obtained in both anteroposterior (AP) and posteroanterior (PA) configurations, using conventional (analog) CXR equipment producing images on acetate plates. Digitization was performed using a Nikon D80 camera with an 18–135 mm f/3.5–5.6G zoom lens at focal lengths between 20 and 38 mm, in AF-S focus mode. The shutter speed varied according to the characteristics of the images, from 1/40 s to 1/1.3 s. In all cases, ISO 100 sensitivity and a black-and-white color space were used. The images were saved in raw format (NEF) with 12 bpp at a size of 3872 × 2592 pixels. No flash illumination was used, only the light provided by the negatoscope.

The first pre-processing step was to convert the images from NEF to TIF format with 8 bpp and crop them to sizes ranging from 1524 × 1450 to 1095 × 1320 pixels, using Nikon ViewNX-i, ver. 1.4.2 64-bit. The diversity of sizes is due to the acetate plates having various dimensions and to discarding any part of the image that is not the thorax. No adjustment (restoration or enhancement) was performed at this stage.

The composition of Data 2 and Data 3 is shown in Table 2.

Table 2 Characteristics of subjects included in Data 2 and Data 3

2.2 Annotation and patching of the image set

One of the shortcomings of DL-based methods for identifying COVID-19 from CXR is the low resolution of the images used to train the networks. Most works using pre-trained networks employ 229 × 229 pixel images. This resizing results in the loss of important image features. An alternative to overcome this drawback is the patch partitioning proposed in this work, which allows the network to be trained with higher resolution images so that it can learn patterns better related to the disease it is trying to diagnose.

A total of 1613 images of diverse origin, belonging to the Data 1 set, were annotated: 733 COVID-19 positive images, 633 images with pneumonia due to other causes and 247 normal images. This task was performed independently by three expert radiologists, each with more than 10 years in the specialty. For doubtful images, consensus was sought among all of them. The criteria for annotation of the image database (DB), based on common international criteria, are shown in Table 3. To annotate the DB, the specialists examine the image using a division into lower, middle and upper lung fields for the left and right lung. In this way, the CXR image is divided into six regions, as shown in Fig. 1. For each region, the presence or absence of the criteria reported in Table 3 is determined. These annotations constituted the starting point for the division of this image set into patches, as explained below.

Table 3 Criteria for database annotation by radiologists
Fig. 1
figure 1

Division into six regions of a CXR based on lung fields, which is the starting point for manual annotation by radiologists

The first step in dividing the CXR image into regions was to segment the lung region. A pre-trained convolutional network based on the U-Net architecture [30] was used. This network was trained on CXR images with masks segmented by hand by human specialists, belonging to the Montgomery [31] and JSRT [32] datasets; it reports Dice similarity coefficients of 0.985 and 0.972 on those datasets, respectively. To improve the segmentation results obtained by the network, a post-processing step was applied that keeps only the two largest morphological structures detected (the lungs). Then, the convex hull operation is performed on each detected lung to eliminate irregularities in the edges of the segmentation mask (Fig. 2). In this way, both physiological structures other than the lungs and textual labels are removed from the image. In addition, holes that may remain in the lung region due to failures of the segmentation algorithm are filled using morphological operations. This is intended to decrease the probability of learning from shortcuts or features outside the lung region [25].
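The mask post-processing just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the hole-area threshold are our assumptions.

```python
# Sketch of the described post-processing on a binary lung mask:
# keep the two largest components, take each one's convex hull,
# and fill residual holes with a morphological operation.
import numpy as np
from skimage import measure, morphology

def postprocess_lung_mask(mask: np.ndarray) -> np.ndarray:
    labeled = measure.label(mask > 0)
    # Keep only the two largest morphological structures (the lungs)
    regions = sorted(measure.regionprops(labeled),
                     key=lambda r: r.area, reverse=True)[:2]
    clean = np.zeros_like(mask, dtype=bool)
    for r in regions:
        lung = labeled == r.label
        # Convex hull smooths the irregular segmentation borders
        clean |= morphology.convex_hull_image(lung)
    # Fill holes left by segmentation failures (threshold is an assumption)
    return morphology.remove_small_holes(clean, area_threshold=mask.size // 100)
```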

Fig. 2
figure 2

Segmentation process and division into patches

Images with difficult segmentation, where only a single region was obtained, were eliminated, as they were considered a source of bias for training. Next, a dilation of the segmentation mask is performed, using a disk-shaped structuring element with dimensions proportional to the image; in our case, 2% of the image size was used as the radius of the structuring element. This operation enlarges the segmentation mask to provide the network with more information about the lungs, since COVID-19 lesions are known to lodge in the border regions of both lungs.
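The dilation step might look like this sketch; the paper states only that the radius is proportional to the image, so tying the 2% to the image width is our assumption.

```python
# Sketch of the mask dilation with a disk-shaped structuring element
# whose radius is 2% of the image width (assumed dimension).
import numpy as np
from skimage import morphology

def dilate_lung_mask(mask: np.ndarray) -> np.ndarray:
    radius = max(1, int(0.02 * mask.shape[1]))  # 2% of image width
    return morphology.binary_dilation(mask, morphology.disk(radius))
```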

Then, the image is divided into a set of patches, where 1 patch represents both lungs, 2 patches correspond to each lung separately, 4 patches represent each lung divided in two, and 6 patches represent each lung divided in three. Starting from the 1-patch binary image that includes both lung silhouettes, the bounding box of each lung is determined independently; this yields the 2-patch representation. The 4-patch images are obtained by dividing each connected component found in the previous step in half. The 6-patch images are obtained in the same way, this time dividing each lung into three regions instead of two. Figure 2 illustrates this process.
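A minimal sketch of this splitting scheme, under the assumption that each lung's bounding box is cut into equal horizontal bands (matching the lower/middle/upper fields of Fig. 1); the function and parameter names are illustrative.

```python
# Sketch: cut each connected lung region's bounding box into
# `per_lung` horizontal bands (1 -> 2 patches, 2 -> 4, 3 -> 6).
import numpy as np
from skimage import measure

def split_lung_patches(mask: np.ndarray, image: np.ndarray, per_lung: int):
    patches = []
    for region in measure.regionprops(measure.label(mask > 0)):
        r0, c0, r1, c1 = region.bbox
        # Evenly spaced row boundaries across the lung's bounding box
        rows = np.linspace(r0, r1, per_lung + 1).astype(int)
        for top, bottom in zip(rows[:-1], rows[1:]):
            patches.append(image[top:bottom, c0:c1])
    return patches
```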

2.3 Description of phase a of the proposed system

In the work of [33], a patch-based classification is proposed. In contrast to the present research, that study trains a model on a random selection of patches belonging to the lung region. Such a methodology assumes that every selected patch presents manifestations of COVID-19. This assumption does not hold for all images, since the lesions appear localized in specific regions and are not always distributed throughout the image. It is therefore possible to select patches that do not contain the radiographic manifestations of the class being classified. To alleviate this difficulty, in this study manual annotation was performed by expert radiologists on a fixed division into patch regions, as explained in the previous section.

The first task performed was organizing the patches into folders according to the classification given by the radiologists. For this purpose, each image is assigned to a class (normal, pneumonia or COVID-19). For the 1-patch scheme, a COVID-19 image is simply placed in the COVID-19 folder; for the 2-, 4- and 6-patch divisions, the patches can additionally be organized by the lung region they occupy, according to the number of patches into which the image was divided. In other words, taking Fig. 1 as a reference, the 6-patch representation generates six divisions, corresponding to each of the numbers shown in the figure.

In Phase A of the system, the models are trained exclusively from the radiologists' annotations. These will be called the Initial Models (IM). Consequently, as many models are obtained as patches are generated per division. Figure 3 shows this process.

Fig. 3
figure 3

Phase A of the system, division into patches of the images annotated by the radiologists and obtaining Initial Models from the new partitions generated

2.4 Description of Phase B of the proposed system

The manual annotation performed by radiologists is a tedious and time-consuming process. To increase the number of images for model training, Phase B is proposed. In this stage, the Initial Models are used to automatically classify a new set of images (Data 2), producing a division into patches with the same hierarchical structure as that obtained in Phase A. Since the manifestations of COVID-19 and pneumonia are not visible over the whole image region, the following heuristic was used to perform the division: for images divided into 2, 4 and 6 patches, whenever the model's prediction did not match the class of the whole image, the label prevailed. That is, if the model classified one of the patches as pneumonia when the image was actually COVID-19, that patch was assigned to the COVID-19 folder (the image's label) instead of following the model prediction. This is illustrated in Fig. 4.

Fig. 4
figure 4

Scheme followed to increase the set of annotated images

2.5 Description of Phase C of the proposed system

In Phase C, the models are re-trained using the training set obtained in Phase B (Data 1 + Data 2) with transfer learning from Phase A. In this stage, what was learned previously is used with the objective of improving the performance of the models for the four classification strategies. 90% of the data was used to train the system and 10% to validate it.

2.6 Description of the architecture and parameters

PyTorch was used as the framework for training, testing and evaluation, starting from the pre-trained ResNet34 network and applying transfer learning. In the first stage, training is run for 400 epochs in order to adjust the weights of the pre-trained network's layers to the new problem. This value was set because training tests showed that the parameters remained stable and convergence was maintained from 300 epochs onward.

To mitigate the low input resolution of pre-trained networks, an image size of 512 × 512 was used, together with image patching. The network output has three classes: COVID-19, Pneumonia and Normal. The Adam optimizer [34], an extension of gradient descent for adjusting the weights during training and minimizing the loss, was used, with cross entropy as the loss function. As the training loss decreases, the learning models obtained are saved; when a subsequent epoch increases the loss, the last checkpoint is kept and training continues until the stopping requirements are met. In Phase A of the study, training is performed from scratch and stops after 300 epochs or when the loss falls below 10⁻³. In Phase C, the transfer learning technique was applied to obtain the retrained models using 35 epochs. The learning rate was 0.01 and the batch size 32. It should be noted that the hyperparameters were initially established following the recommendations of [35] and [36], and later tuned through multiple tests with various data subsets taken from Data 1 and Data 2.
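The checkpointing and stopping rule just described can be sketched in a framework-agnostic way. Here `train_one_epoch` is a stand-in for one PyTorch training epoch returning the epoch loss; the thresholds mirror the values quoted above, and all names are our assumptions.

```python
# Sketch of the stopping rule: checkpoint whenever the loss improves,
# stop after `max_epochs` epochs or once the loss drops below `loss_target`.
def train_with_early_stop(train_one_epoch, max_epochs=300, loss_target=1e-3):
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        loss = train_one_epoch(epoch)
        if loss < best_loss:
            # In a real run, the model weights would be saved here
            best_loss, best_epoch = loss, epoch
        if loss < loss_target:  # convergence criterion
            break
    return best_loss, best_epoch
```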

During all trainings, data augmentation was applied, generating additional patches from each original patch by left and right rotations at angles of up to 10 degrees, zoom of 1.05% and image brightness variations of up to 0.8%. For the 2-, 4- and 6-patch variants, the product of the probabilities of each of the independent models was used as the fusion strategy to output the classification of an image.
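The product-of-probabilities fusion can be sketched as follows, assuming each patch model outputs a softmax vector over the three classes; the array shape and function name are our assumptions.

```python
# Sketch of the fusion strategy: multiply per-patch class probabilities
# across the independent patch models and take the argmax.
import numpy as np

CLASSES = ("COVID-19", "Pneumonia", "Normal")

def fuse_patch_predictions(patch_probs: np.ndarray) -> str:
    """patch_probs: shape (n_patches, 3), softmax output of each patch model."""
    fused = np.prod(patch_probs, axis=0)  # product over the patch models
    return CLASSES[int(np.argmax(fused))]
```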

2.7 Evaluation of the proposed models

The first evaluation of the models is performed in Phase A, using 10% of Data 1 (internal validation) and Data 3 (external test). Phases B and C are then introduced to improve on these results. As part of the Phase C evaluation, two new sets were used: the first contains 10% of the union of Data 1 and Data 2 (internal validation), and the external evaluation is performed using Data 3.

The evaluation strategy followed in each phase makes it possible to determine the generalization power of the system. Likewise, the use of an external evaluation set addresses one of the demands made by the scientific community regarding the validity of proposed models.

To determine the performance of the models, the Precision, Recall (sensitivity) and F1score metrics per class and their averages, called macro metrics (MacroPrecision, MacroRecall and MacroF1score), were used. All the metrics are defined in Eqs. 1–6. These performance measures are obtained from the confusion matrix, and their use allows the models to be evaluated both per class and globally.

$$Precision= \frac{TP}{TP+FP}$$
(1)
$$Recall= \frac{TP}{TP+FN}$$
(2)
$${F1}_{score}=2* \frac{Precision*Recall}{Precision+Recall}$$
(3)
$$MacroPrecision= \frac{1}{3}{\sum }_{c=1}^{3}{Precision}_{c}$$
(4)
$$MacroRecall= \frac{1}{3}{\sum }_{c=1}^{3}{Recall}_{c}$$
(5)
$$MacroF{1}_{score}= \frac{1}{3}{\sum }_{c=1}^{3}{F{1}_{score}}_{c}$$
(6)

where TP are the true positives, FP the false positives and FN the false negatives, and the subscript c indexes the three classes.
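For reference, Eqs. 1–6 can be computed directly from a 3 × 3 confusion matrix; this sketch assumes rows index the true class and columns the predicted class.

```python
# Per-class Precision, Recall and F1 (Eqs. 1-3) and their macro
# averages (Eqs. 4-6), from a confusion matrix cm[true, predicted].
import numpy as np

def macro_metrics(cm: np.ndarray):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # column total minus diagonal
    fn = cm.sum(axis=1) - tp          # row total minus diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()
```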

3 Results

3.1 Evaluation of Phase A

The first stage of the system uses the images labeled by the radiologists. The models belonging to each of the division variants are trained, yielding four different classification systems (1, 2, 4 and 6 patches). The 2-, 4- and 6-patch variants contain as many models as patch splits performed. To issue the classification of the complete image, the product of the membership probabilities produced by each model in the scheme is used as the aggregation function.

Figure 5 shows the confusion matrices obtained at this stage for the different patch division strategies, and Table 4 shows the performance measures derived from them. The best performance indices are obtained for the 6-patch variant, followed by the 2-patch variant, as presented in Fig. 6.

Fig. 5
figure 5

Confusion matrix for the different variants of patch splitting in Phase A

Table 4 Performance indices obtained for the different patch splitting evaluation strategies in the Phase A patching evaluation
Fig. 6
figure 6

Macrometrics for the different variants of the Phase A patch division

3.2 Evaluation of Phase A using the external set

This section presents the evaluation of the IM on the external data set. Figure 7 presents the confusion matrices obtained for each of the patch partitioning strategies. As can be seen in Table 5, the models do not exhibit generalization capacity: they are unable to maintain the results of the previous stage when faced with a set that does not come from any of the sources used in training. In the 2-, 4- and 6-patch variants, the systems fail to detect the "Pneumonia" class (note that its performance measures are 0). The 1-patch variant shows the best results for the "Pneumonia" class, to the detriment of the performance indices for the "Normal" class. The behavior of these models is close to random classification and far from the values obtained in the previous stage, showing a clear overfitting effect. These results show that the models fail to correctly classify this new set of images.

Fig. 7
figure 7

Confusion matrix for the different patch splitting variants in Phase A for the external evaluation set

Table 5 Performance indices obtained for the different patch splitting evaluation strategies using the external evaluation set in Phase A

3.3 Evaluation of Phase C

This section shows the results of the Phase C evaluation, using 10% of the images belonging to this stage as the evaluation set. Figure 8 presents the confusion matrices obtained when evaluating the models with the four splitting variants, Table 6 the metrics calculated from these matrices, and Fig. 9 the performance measures for the different classification variants. Once again, the 6-patch variant presents slightly higher values than the rest of the strategies.

Fig. 8
figure 8

Confusion matrix for the different Phase C patching variants for 10% of the internal validation data

Table 6 Performance indices obtained for the patch splitting strategies using 10% of the data for the evaluation
Fig. 9
figure 9

Macrometrics for the different Phase C patch splitting variants using 10% of the images as evaluation

3.4 Evaluation of Phase C using the external set

This section presents the results of the retrained models on the external evaluation set (Data 3). Figure 10 shows the confusion matrix obtained for each of the patch splitting strategies. The best results again appear for the 6-patch variant. On this occasion, the results are maintained relative to the internal validation stage, showing that the proposed models have generalization capacity. Models of this kind can thus help human specialists identify COVID-19 cases with visible manifestations in the lung region so that timely management can be applied to these patients.

Fig. 10
figure 10

Confusion matrix for the different Phase C patching variants for the external evaluation set

4 Discussion

This work has highlighted the importance of not repeating the mistakes widely diffused in 2020 and 2021 publications on CNN-based COVID-19/pneumonia classification. Namely, if AI systems are trained and validated only on complete images taken from the various international or autochthonous databases, they are capable of classifying based on spurious characteristics, since they learn by shortcuts [19,20,21,22,23, 25]. This can be evidenced in two ways: comparing the performance of the system on external data (of a different origin than that used during training/validation), and verifying whether the CNN classifies based only on features contained in the lung region. Systems without generalization power that have based their learning on biases will show much lower performance metrics in an external evaluation than those achieved during validation with data of the same origin as the training data.

The system proposed in this research has shown that if only Phase A is run (which shares many points in common, particularly training based on one patch, with various other systems published in the scientific literature, see [10,11,12,13,14]), evaluation on an external dataset reveals very poor performance, which proves the presence of shortcut learning.

To overcome these obstacles, Phases B and C were implemented, not only to take advantage of using only the useful information contained within the lung region and a higher number of training images, but also to carry out a correct training/validation/evaluation strategy, in which images of the same origin do not overlap across the stages of system deployment. Once this is done, the metrics achieved during evaluation with external data are similar to those achieved during validation, which proves the generalization power attained.

Another interesting aspect is that CXR has not proven efficient enough to achieve correct classification rates close to 100% in COVID-19/pneumonia classification, contrary to what is suggested by the systems with shortcut learning reported in the literature [10,11,12,13,14].

During 2022, an enormous diversity of models tested on different databases has continued to be presented, reporting precisions between 84 and 95% for various databases, but without reporting behavior on an external test [37]. However, more researchers have joined in recognizing the existence of biases and verifying that most models are not robust when tested on source data different from the training/validation data. For example, [38] showed that when trained and tested on a single dataset, the performance of all CNNs is relatively poor (precision: 0.65–0.72, recall: 0.59–0.71), but remains relatively constant during external testing (precision: 0.58–0.82, recall: 0.57–0.72). In contrast, when internally trained and evaluated on combined datasets, all CNNs perform well (precision: 0.94–1.00, recall: 0.77–1.00); however, when tested on the crossover dataset, the results drop substantially (precision: 0.10–0.61, recall: 0.04–0.80). In this sense, the present work reports results superior to those of this review, using models that already possess generalization power and a lower level of bias.

On the other hand, it remains to be tested whether CT images are more efficient as input to CNNs for this task. CT normally offers higher sensitivity as a diagnostic and follow-up method than chest X-ray; cases of COVID-19 with lesions visible on CT but not on CXR have been described in [39]. In fact, one of the main CT findings in COVID-19 patients is ground-glass opacities in the peripheral regions of the lower lobes, which are sometimes not visible on CXR images. The scientific literature shows that CT studies also differ in the metrics used, the architectures, the models and the databases used to train/validate the networks, so comparison between them is not easy. For example, the work of [40], which focuses on the importance of CT image segmentation for achieving better results, reports a precision of 87% in three-class classification (COVID/Normal/Pneumonia), equivalent to that achieved in the present work on a different database; in the same study, the precision values achieved by previous CT models are between 73 and 85%. In [41], modifications are introduced to the usual ResNet architecture, and data from COVIDx-CT (the CT counterpart of COVIDx) are used for training and validation, obtaining an accuracy of 99.7%; however, no external test results are reported. In [42], where results are also reported for several CNNs and an external test is claimed, the precision values range between 59 and 97% on the COVIDx-CT data, depending on the model used.

Finally, it should be stated that the presented system has some limitations. In new external tests carried out to find the causes of the remaining classification errors, at least two biases were identified. One relates to incorrect patient positioning when taking the radiograph, which presumably comes from the COVIDx training data [28], where very sick patients were studied with portable X-ray equipment in incorrect positions. The second relates to scapular opacities that are misinterpreted by the system.

The possible improvement for the problems described above is to take the misclassified images, place them in the folder with the correct label, and perform a new training. As the system is retrained with more images, its performance metrics improve.

5 Conclusions

In this study, a model was developed to automatically classify COVID-19 from CXR images, following recommendations from the literature to increase the robustness of the models. A ResNet34 CNN with input images of size 512 × 512 pixels was used, and the training of the networks was restricted to the inner regions of the lungs by using segmented images and a patch division. A patch partitioning step was proposed, and 6-patch partitioning was shown to yield the best results: this model achieved 0.887 accuracy, 0.85 recall and 0.848 F1score on the external set. The models obtained after extending the training set in Phase B showed similar results in the internal and external evaluations, endowing them with generalization power and making them useful in conjunction with human specialists for the identification of COVID-19 in clinical settings.