1 Introduction

The COVID-19 pandemic has caused a massive global tragedy and has had a significant influence on many lives throughout the world. The first case of the virus was reported in Wuhan, China, in December 2019 [1]. The virus then swiftly spread throughout the world, affecting a wide range of countries.

One of the most frequently used techniques for diagnosing COVID-19 is reverse transcription-polymerase chain reaction (RT-PCR). However, since RT-PCR has a diagnostic sensitivity of only 60-70%, radiological imaging techniques such as computed tomography (CT) and X-ray have played a central role in early identification of the disease [2], and X-ray scans have been used to screen COVID-19 patients. Several recent studies observe that the X-ray and CT scans of people with COVID-19 symptoms show characteristic alterations [3]. For instance, Zhao et al. [4] found dilatation, consolidation, and ground-glass opacities in COVID-19 patients.

The rapid rise in positive COVID-19 cases has increased the need to employ Artificial Intelligence (AI), in conjunction with expert opinion, to assist clinicians in their jobs. In this regard, deep learning models have begun to gain momentum. Because radiologists are in short supply in hospitals, AI-based diagnostic models can provide prompt assistance to patients. Hemdan et al. [5] evaluated seven Convolutional Neural Network (CNN) models for diagnosing COVID-19 from X-ray images, including a modified VGG19 and Google's MobileNet. Wang et al. [6] distinguished COVID-19 images from normal and viral pneumonia cases with an accuracy of 92.4%. Similarly, Ioannis et al. [7] used 224 COVID-19 images and achieved a class accuracy of 93.4%. Opconet, an optimized CNN, was proposed in [8] using a total of 2,800 images and achieved a 92.8% accuracy score. Apostolopoulos et al. [9] built a MobileNet CNN model utilizing extracted features. Various other architectures, such as InceptionV3, ResNet50, GCN, and Inception-ResNetV2, have been used for classification [10,11,12,13]. In [14], a transfer learning-based method was employed to classify the presence or absence of COVID-19 in chest X-ray images using four models: ResNet18, ResNet50, SqueezeNet, and DenseNet121. In [15], the key hyperparameters of a CNN are tuned using (i) an MLP with the Grey Wolf Optimizer (GWO), and (ii) an MLP with a combined Whale Optimization and BAT method.

Although all of the above-mentioned state-of-the-art approaches use CNNs, none of them considers more than 6,432 images. Such small datasets leave many real-life cases uncovered. Data augmentation can be performed to mitigate this issue; however, augmentation techniques such as rotating and resizing the pictures are not enough to cover the wide range of possible presentations of COVID-19, viral pneumonia, and normal chest X-ray scans. As a result, the resulting CNN models fail to properly distinguish these diseases. Although some degree of inaccuracy in recognizing viral pneumonia cases is acceptable, misclassifying COVID-19 patients as normal or viral pneumonia might mislead physicians and management. The proposed study addresses the constraints described above by creating an automated diagnostic method for screening COVID-19 patients from chest X-ray images, trained on 103,468 images over five classes: COPD signs, COVID, normal, others, and pneumonia.

The remainder of this paper is structured as follows: Section 2 explains the dataset in detail. Section 3 introduces the proposed method. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2 Dataset description

Our data collection is built from four distinct datasets: the PADCHEST dataset [16], the BIMCV-COVID19+ dataset [17], the COVID-19 Radiography Database ([18] and [19]), and Chest X-ray Images (Pneumonia) [20]. By merging these datasets, we obtain 297,541 frontal chest X-ray images from 86,876 individuals. We did not apply any processing technique to the images while collecting the dataset. Due to follow-up scans, there are on average 3-4 images per subject; as a result, patient-wise splits are used in all experiments to divide the patients into training, validation, and test groups. In the PLCO dataset, some anomaly classes provide geographical information. Table 1 reports the number of images in which each anomaly was found; a single image can exhibit multiple anomalies. Furthermore, the collections comprise 178,319 images that do not exhibit any of the previously described anomalies; these images are not listed in Table 1.

Table 1 Number of images for each anomaly in the dataset

We used a total of 103,468 images in our experiments. As can be seen in Table 1, there is a severe imbalance between the classes: for example, "normal" and "COPD signs" have 62,115 and 23,280 samples, respectively, while many other classes have fewer than 1,000 samples each. To counter this dataset bias, we grouped similar classes under the same label and proceeded with the experiments using the newly assigned labels. We also excluded 136 images whose broken file formats caused errors while reading them.

Given \(D_1 = 14\) abnormalities from the ChestX-ray14 dataset and \(D_2 = 12\) abnormalities from the PLCO dataset, we define \(D = D_1 + D_2 = 26\) classes for our network. In our first set of experiments, we split the dataset into 5 classes (see Table 2). Since "normal", "COPD signs", "pneumonia", and "covid" have plenty of samples, we kept them as they are; all the smaller classes were combined under one label called "others".
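
To make the relabeling concrete, the following is a minimal sketch of the 5-class grouping, assuming the anomaly labels are stored as strings; the kept label spellings mirror Table 2, while the function name is illustrative.

```python
# Minimal sketch of the 5-class relabeling: the four well-populated
# classes are kept as-is; every other anomaly label maps to "others".
KEPT_CLASSES = {"normal", "COPD signs", "pneumonia", "covid"}

def to_five_class(anomaly_label: str) -> str:
    """Map one of the 26 anomaly labels to the 5-class scheme."""
    return anomaly_label if anomaly_label in KEPT_CLASSES else "others"
```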

Table 2 Dataset split into 5 labels

In our second set of experiments, we split the dataset into 2 classes (see Table 3). We kept the "normal" class as it is and combined all other classes under the "abnormal" label, where "normal" indicates the image of a healthy lung and "abnormal" indicates an unhealthy lung.

Table 3 Dataset split into 2 labels
Table 4 Samples from the dataset

3 Proposed method

3.1 Network

In our method, we first extract the most important features of the images, i.e. the features that distinguish between the classes. Then, we use a classifier to obtain the results. For X-ray scans of lungs, distinguishing features can be obtained from the texture: for example, the X-ray scan of a lung with pneumonia shows textural abnormalities compared to a normal, i.e. healthy, lung (see Table 4). Our aim is to capture such features to find patterns for each class.

State-of-the-art research shows that one of the most successful ways of obtaining texture features from images is to use CNN-based deep learning methods [15, 21,22,23]. Thanks to their convolutional structure, these methods process each pixel together with its relation to neighbouring pixels, and thereby successfully extract the features in the images. Therefore, we use a CNN-based network as our feature extractor.

In our method, we pass the processed images to an InceptionV3 network, which we use as our backbone for feature extraction. To pick the best backbone, we experimented with several networks and selected InceptionV3 since it performs best in our case; these experiments are discussed in the results section. InceptionV3 is a convolutional neural network architecture made of symmetric and asymmetric blocks. As can be seen in Fig. 1, the network has a deep architecture in which the convolution layers form the base and fully connected layers provide the connection to the output. Average pooling, max pooling, batch normalization, and dropout are also used in the network to improve performance.

Fig. 1 InceptionV3 architecture [24]

Lastly, we use a classifier to obtain the final label from the extracted features. In our classifier, we first pass the features through a normalization layer to regularize the data; we use batch normalization, which processes the data batch by batch, subtracting the mean and dividing by the standard deviation [25]. After that, two fully connected layers with sizes of 64 and 32 model the relationship between the extracted features and the final classes. We use ReLU activation in these layers and add dropout layers with a rate of 0.5 after each fully connected layer to prevent overfitting. Finally, an output layer produces the prediction for each class.
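
A minimal sketch of this network in Keras (TensorFlow 2.x) is given below. The InceptionV3 backbone, batch normalization, the 64- and 32-unit fully connected layers with ReLU, the 0.5 dropout rates, and the softmax output follow the description above; the input size, global average pooling, and ImageNet initialization are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

def build_model(num_classes: int, input_shape=(299, 299, 3)) -> tf.keras.Model:
    # InceptionV3 as feature extractor: classification top removed,
    # features pooled into one vector per image (pooling mode assumed).
    backbone = InceptionV3(include_top=False, weights="imagenet",
                           input_shape=input_shape, pooling="avg")
    x = layers.BatchNormalization()(backbone.output)  # normalize features
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)                        # reduce overfitting
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs=backbone.input, outputs=outputs)
```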

3.2 Training

Training was carried out with a batch size of 128. We used 20 epochs, as this is enough for convergence in our case; higher epoch counts resulted in overfitting. Five-fold cross-validation was applied to avoid a biased data split; larger fold numbers were not preferred, as they increase the training time substantially.
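
The following hedged sketch shows how this training setup could be wired together; build_model is the sketch from Sect. 3.1, GroupKFold stands in for the patient-wise folds described in Sect. 2, and x_train, y_train, and patient_ids are illustrative arrays, not artifacts of the paper.

```python
from sklearn.model_selection import GroupKFold

# 5-fold cross-validation; GroupKFold keeps all images of one patient
# inside a single fold, matching the patient-wise splits of Sect. 2.
kfold = GroupKFold(n_splits=5)
for train_idx, val_idx in kfold.split(x_train, y_train, groups=patient_ids):
    model = build_model(num_classes=5)  # fresh model per fold
    model.compile(optimizer="adam",     # optimizer and loss are assumptions
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train[train_idx], y_train[train_idx],
              validation_data=(x_train[val_idx], y_train[val_idx]),
              batch_size=128, epochs=20)
```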

For all of our experiments, we split the dataset into training and testing sets with ratios of 0.75 and 0.25, respectively. Only the training set is used for training and validation, while the testing set is used solely for testing after the training process is completed.
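
A minimal sketch of this split, assuming scikit-learn's GroupShuffleSplit is used so that all images of one patient land on the same side; the variable names (image_paths, labels, patient_ids) are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

# Patient-wise 75/25 train/test split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(image_paths, labels,
                                          groups=patient_ids))
```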

All experiments were run on a PC with Ubuntu 20.04 installed. The main hardware consists of an Intel® Core\(^{\mathrm{TM}}\) i7-5820K central processing unit, an NVIDIA TITAN Xp graphics processing unit, and 96 gigabytes of memory.

4 Results

In this task, X-ray images are very similar to each other, resulting in a low inter-class variance in the dataset. Extracting distinctive features is therefore important for a well-performing method, and our experiments focus on this idea. We conducted experiments with different feature extraction backbones to evaluate their contribution to the overall performance. As explained in detail in Sect. 3, we selected the most commonly used state-of-the-art CNN-based networks for this comparison, i.e. VGG16 [26], InceptionV3 [24], ResNet50 [27], and NASNetMobile [28]. We ran the same experiment for both the 5-class and the 2-class label grouping.
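
As a hedged illustration, all four backbones are available in tf.keras.applications and can be dropped into the feature-extractor role of Sect. 3.1; the input sizes below are assumptions based on each network's common defaults.

```python
from tensorflow.keras.applications import (VGG16, InceptionV3, ResNet50,
                                           NASNetMobile)

# Candidate backbones and assumed input sizes for the comparison.
BACKBONES = {
    "VGG16":        (VGG16,        (224, 224, 3)),
    "InceptionV3":  (InceptionV3,  (299, 299, 3)),
    "ResNet50":     (ResNet50,     (224, 224, 3)),
    "NASNetMobile": (NASNetMobile, (224, 224, 3)),
}
```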

Table 5 Results for each backbone in the 2-class and 5-class settings

Table 5 shows how our method performs when different backbones are used; performance is measured on the test split for both the 2-class and 5-class settings. All backbones perform better on the 2-class merge than on the 5-class merge. This is expected: the 2-class split separates normal from abnormal (diseased) X-ray images, and the variance between diseased and normal images is higher than the variance within the 5-class dataset. In the 5-class merge, the network must characterize and separate four pathological classes from each other, a hard task since these classes have low variance among themselves. In addition, the "others" class in the 5-class merge has low accuracy, which drags down the overall accuracy. Looking at the accuracies in the table, VGG16, InceptionV3, and ResNet50 perform very close to each other on the 2-class task; however, their performances differ on the 5-class task, where InceptionV3 leads with 81.03% test accuracy.

Table 6 Class accuracies when InceptionV3 is used as backbone
Table 7 Comparison with the state-of-the-art methods

To elaborate further on the performance, Table 6 presents the class accuracies of our method with the InceptionV3 backbone for the 2-class and 5-class splits. We argue that the main differences between class accuracies are caused by the low inter-class variance and the differing sample counts. Notably, the "others" class in Table 6 has a very low accuracy: as explained in Sect. 2, it was created by merging many classes with low sample counts, so it has both a low sample count and a hard-to-characterize sample set. We argue that these two factors are the main reasons behind its low accuracy. Furthermore, since the dataset is imbalanced, we computed the Matthews Correlation Coefficient (MCC) [29]; the model achieves an MCC of 0.6745 for the 2-class split and 0.6541 for the 5-class split.
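
The reported MCC values can be reproduced with scikit-learn as sketched below; y_test and y_pred are illustrative arrays of true and predicted class labels.

```python
from sklearn.metrics import matthews_corrcoef

# MCC lies in [-1, 1] and remains informative under class imbalance.
mcc = matthews_corrcoef(y_test, y_pred)
```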

Finally, our best results are obtained by using InceptionV3 as our backbone with no filtering applied. We obtained 84.57% test accuracy for the 2-class label merge and 81.03% for the 5-class label merge. Notably, we obtained 97.03% test accuracy on predicting COVID-diagnosed X-ray images in our 5-class dataset. We also obtained a "normal" (i.e. healthy) class prediction accuracy of 93.20% with 5 classes and 78.33% with 2 classes.

We compare our results with the state-of-the-art methods in Table 7. With 97.03% accuracy on COVID images, our method is among the best-performing methods for COVID detection. Note that the datasets used by the other methods contain at most 6,432 images, a very low sample count compared to our 103,468 images. Our dataset, although merged into 5 classes, includes cases from 20 different diseases, far more than the state-of-the-art studies cover. This variety and size adds value to our results, as our method generalizes better. We note that this generalization problem is one of the main reasons such methods have not been trusted for COVID detection in practice. COVID detection is of utmost importance; therefore, COVID accuracy is the main metric for our study, and our method produces very good results on it while improving generalization.

Table 8 Training and validation accuracy for the 2 classes: normal (0) and abnormal (1)
Table 9 Training and validation accuracy for the 5 classes

Additionally, we report the training accuracies of our method in Tables 8 and 9 for each class merge separately. In both cases, there is only a small difference between the training and test accuracies, which indicates that neither overfitting nor underfitting is a concern in our training. However, some classes have low accuracies, e.g. "others" in the 5-class merge. This can be explained by the mixed data samples in the "others" class, which make it harder to characterize than the other classes.

5 Conclusion

Assessing pathological chest X-rays is one of the most used techniques for diagnosing certain diseases. However, this task has yet to be fully automated, even though such automation would save medical professionals many hours. Previous research proposed CNNs and similar structures to automatically differentiate between normal and diseased chest X-ray images. However, the weak points of previous research include relatively small datasets, i.e. low sample sizes, bias between classes, and low generalization.

To address these issues, we evaluated how well deep CNNs perform on pathological chest X-ray classification using the largest number of images to date, i.e. 103,468 images. We experimented with different class splits and different methods, finally obtaining 84.57% accuracy for classifying between normal and diseased images. Moreover, we obtained 81.03% accuracy for classifying between 5 classes, i.e. COPD signs, COVID, pneumonia, normal, and others. Our method achieved 97.03% accuracy for the COVID class, which is significant if the method is used for COVID detection.

One shortcoming of our method is that the "others" class in the 5-class split includes images from many diseases, which makes the class hard to characterize and therefore decreases performance. A further study could improve the data in this class. Alternatively, the network could be improved so that it learns the low-performing classes better.