1 Introduction

Convolutional Neural Networks, also known as CNNs or ConvNets, have become a widely used method in the field of computer vision. Following the success of AlexNet [1] in ILSVRC 2012, CNNs have been extensively applied to a wide range of computer vision (CV) tasks. Over time, CNN architectures have become increasingly deep and complex, leading to the development of deep ConvNets. With improvements in network architecture and computer hardware, it has become feasible to train these deep ConvNets, which have shown significant performance improvements in object detection, recognition, and image segmentation tasks. A general representation of a convolutional neural network is provided in Fig. 1. The availability of large datasets and the progress in deep learning have made it possible for models to achieve human-level performance in many fields. Medical image analysis, such as the detection and segmentation of structures in radiology images, has also yielded promising results using deep learning.

Chest radiography is the most commonly used clinical imaging tool and can reveal many diseases. Every year, more than two billion chest radiographs are acquired [2]. This imaging tool is crucial for identifying various thoracic diseases, which are among the leading causes of mortality worldwide. It would be immensely beneficial if computer systems could interpret chest X-rays with the same proficiency as a practicing radiologist [3, 4]. Over the last few years [5,6,7,8,9], the diagnosis of chest radiographs has received increased attention, and several algorithms have been developed for pulmonary tuberculosis classification [10,11,12,13] and pneumonia detection [3, 14, 15]. During the pandemic, deep learning has also found use in COVID-19 detection [16,17,18,19,20,21].

Fig. 1 General representation of a convolutional neural network

Our proposed study aims to identify multiple thoracic pathologies by re-implementing the CheXNet model and constructing several additional models with nearly identical hyperparameters. These models are compared side by side. Additionally, we train these models on a Tensor Processing Unit (TPU) to reduce training time. Deep learning has made significant advancements in the field of medicine due to the availability of vast datasets, enabling the development of models that surpass the performance of medical professionals. For instance, pneumonia detection [3, 14, 15], skin cancer classification [22,23,24], and lung cancer screening [25,26,27] have all benefited from deep learning. CheXNet [3], an algorithm that detects pneumonia from chest X-rays, performs better than practicing radiologists. CNN models such as CheXNeXt [4] can identify various pathologies from frontal-view chest X-rays with performance similar to that of practicing board-certified radiologists.

Fig. 2 Sample chest X-ray image for each class in the Chest X-ray14 dataset

In recent years, various life-threatening diseases have been detected and diagnosed using deep learning techniques by a number of researchers [28,29,30,31,32]. Baltruschat et al. [28] compared multiple deep-learning approaches for multi-label classification of chest X-ray images. They analyzed various methods for using CNNs to classify X-ray images from the Chest X-ray14 dataset and found that networks fine-tuned from ImageNet weights produced satisfactory results. However, the most effective model was trained specifically on X-ray images only and additionally incorporated non-image data. A systematic survey of deep learning techniques for the analysis of COVID-19, and of their usability for detecting Omicron, is provided by [32]. The COVID-19 pandemic caused a shift toward deep learning methods for analyzing and identifying infected areas in radiology images. These techniques can be divided into classification, segmentation, and multi-stage approaches for COVID-19 diagnosis at both the image and region levels. Khan et al. [33] introduced deep hybrid learning (DHL) and deep boosted hybrid learning (DBHL) for accurately detecting COVID-19 in chest X-ray images. The DBHL technique combines data augmentation, transfer-learning-based fine-tuning, deep feature boosting, and hybrid learning to improve the performance of the COVID-RENets models (COVID-RENets-1 and COVID-RENets-2). In their experiments, the DBHL framework outperformed other well-established CNN models.

Stirenko et al. [11] studied the use of deep learning-based computer-aided diagnosis (CADx) to predict the presence of tuberculosis from 2D chest X-ray images. They demonstrated the effectiveness of deep CNNs for CADx of tuberculosis, particularly through techniques like lung segmentation and both lossless and lossy data augmentation, on a small and unbalanced dataset. Rahman et al. [14] presented a study aimed at developing an automated method for identifying bacterial and viral pneumonia from digital X-ray images; they gave a comprehensive overview of current approaches to detecting pneumonia and described the specific techniques used in their research. Alakus and Turkoglu [17] created clinical predictive models using deep learning and laboratory data to forecast which patients were likely to contract COVID-19. They evaluated the models’ effectiveness using performance metrics such as precision, F1-score, recall, AUC, and accuracy on data from 600 patients with 18 laboratory findings, validating them through ten-fold cross-validation and train-test splits. The results showed that the predictive models accurately identified patients with COVID-19. Dey et al. [15] designed a Deep-Learning System (DLS) for diagnosing lung abnormalities from chest X-ray images. They tested the system with conventional and filtered chest radiographs and conducted an initial evaluation using a softmax classifier. The outcomes indicated that the VGG19 method provided higher classification accuracy than the other methods.

Khan et al. [34] proposed a diagnostic system that employs deep CNNs to detect and analyze COVID-19 infections by identifying minor irregularities. This system comprises two phases. In the first phase, a new CNN named SB-STM-BRNet identifies COVID-19 infection in lung CT images using a Squeezed and Boosted (SB) channel and a Split-Transform-Merge (STM) block with dilated convolutions. In the second phase, the COVID-CB-RESeg CNN detects and analyzes COVID-19-infected regions in the images; it incorporates region-homogeneity and heterogeneity operations in each encoder-decoder block and auxiliary channels in the boosted decoder to learn low-illumination regions and the boundaries of infected areas. The proposed diagnostic system has shown promising results in identifying COVID-19 infection. Additionally, Khan et al. [35] introduced a CNN architecture called STM-RENet, which uses a split-transform-merge approach to analyze X-ray images and identify radiographic patterns associated with COVID-19 infection. This block-based CNN includes a new convolutional block named STM, which performs region- and edge-based operations separately and jointly. By combining these operations with convolutional techniques, STM-RENet can analyze region homogeneity, intensity inhomogeneity, and boundary-defining features. The authors also presented an improved version, CB-STM-RENet, which utilizes channel boosting and learns textural variations to enhance performance. Evaluated on three datasets, CB-STM-RENet demonstrated significantly superior results to conventional CNNs.

A major limitation of previous research on deep CNNs for chest X-ray diagnosis is that many studies have only examined binary classification tasks, i.e., detecting the presence or absence of a single disease. More research is needed on the ability of these models to simultaneously detect and classify multiple diseases or conditions in a single image. Moreover, there are three further issues in previous studies. First, many studies have used small and potentially biased datasets, which harms the generalizability and accuracy of the resulting models; more extensive and diverse datasets are therefore essential. Second, there has been little research on the ability of deep learning models to accurately diagnose rare or less common diseases in chest X-rays. Third, previous studies have often relied on simple accuracy metrics, which are inadequate for evaluating these models. More robust evaluation measures, such as sensitivity, specificity, and area under the curve, are required to better understand model performance.

2 Materials and Methods

2.1 Dataset

A significant amount of research has been done using the Chest X-ray14 dataset [3, 7,8,9, 36,37,38]. The dataset was collected and made openly available by the National Institutes of Health. It consists of 112,120 frontal-view chest X-ray images of 30,805 unique patients. When loaded, these images are single-channel gray-scale images and must be converted to 3-channel images so that our pre-trained models can process them. Each image in the dataset is annotated with up to 14 different thoracic pathology labels. Figure 2 shows a sample of each of these diseases from the dataset. Table 1 lists all the diseases in Chest X-ray14.

Table 1 List of all diseases in Chest X-ray14

Wang et al. [39] used natural language processing (NLP) to text-mine disease classifications from the associated radiological reports to label the images. The labels are expected to have an accuracy greater than \(90\%\) [40]. The dataset also contains images labeled No Finding, which simply indicates that the NLP system was unable to find any disease for that particular image; it does not necessarily imply a healthy chest X-ray. To enable training on a TPU, the entire dataset was converted to the TFRecord format and made publicly available as NIH Chest X-ray TFRecords.
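For reference, a minimal sketch of how such a TFRecord conversion might look in TensorFlow is given below; the file names, shard layout, and label encoding are illustrative assumptions, not the published pipeline.

```python
import tensorflow as tf

def make_example(image_path, labels):
    # Serialize one chest X-ray and its 14-dim multi-hot label vector.
    image_bytes = tf.io.read_file(image_path).numpy()
    feature = {
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "labels": tf.train.Feature(
            float_list=tf.train.FloatList(value=labels)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write one shard; in practice the 112,120 images would be spread over many shards.
with tf.io.TFRecordWriter("chest_xray-000.tfrec") as writer:
    example = make_example("00000001_000.png", [1.0] + [0.0] * 13)
    writer.write(example.SerializeToString())
```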

2.2 Dataset Distribution

Table 1 provides a breakdown of the total positive labels for each of the 14 pathologies. To train and evaluate the models, the entire dataset was divided into three sets: a training set, a validation set, and a test set, in accordance with the recommendations of [41,42,43,44]. It is important to note that the dataset was not simply split into three parts at the image level. The dataset contains follow-up images for each patient, so a direct split could leak images of the same patient across sets, leading to misleading results. Instead, the dataset was first grouped by patient ID, treating each patient as a separate entity, and then split into the following three sets (a minimal sketch of this grouped split is given after the distribution lists below).

  • Training dataset: \(85\%\) of the total groups.

  • Validation dataset: \(10\%\) of the total groups.

  • Test dataset: \(5\%\) of the total groups.

Hence, the distribution of the dataset used is given as follows.

  • Training dataset: 95,466 examples.

  • Validation dataset: 11,265 examples.

  • Test dataset: 5,389 examples.
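A minimal sketch of this patient-level grouped split follows, assuming the NIH metadata CSV (here called Data_Entry_2017.csv) with a "Patient ID" column; the exact file and column names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("Data_Entry_2017.csv")   # NIH metadata file (assumed name)
rng = np.random.default_rng(seed=42)

patients = df["Patient ID"].unique()      # group by patient, not by image
rng.shuffle(patients)

n = len(patients)
train_ids = set(patients[: int(0.85 * n)])
val_ids = set(patients[int(0.85 * n): int(0.95 * n)])
test_ids = set(patients[int(0.95 * n):])

# All follow-up images of a patient land in exactly one split, so no leakage.
train_df = df[df["Patient ID"].isin(train_ids)]
val_df = df[df["Patient ID"].isin(val_ids)]
test_df = df[df["Patient ID"].isin(test_ids)]
```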

2.3 Dataset Preprocessing

The dataset needs to be preprocessed before building and training the model. For the training set, each image was standardized by subtracting that image's mean from each pixel and dividing by that image's standard deviation:

$$\begin{aligned} \hat{X}_j^{[i]}=\frac{{X}_j^{[i]}-\bar{X}^{[i]}}{\sigma ^{[i]}} \end{aligned}$$

Here, i refers to the \(i^{\text {th}}\) image in the training set, j refers to the \(j^{\text {th}}\) pixel in the \(i^{\text {th}}\) image, and \(\hat{X}\) represents the standardized image. Similarly, \(\bar{X}\) represents the mean of an image, and \(\sigma\) represents the standard deviation of an image.

To standardize the validation and test sets, the mean and standard deviation for each channel were calculated using a single batch of the training set, which was assumed to approximate the statistics of the entire training set. The individual images in the validation and test sets were then standardized (feature-wise) using the formula above. After standardization, the images were re-scaled to \(224 \times 224\) to remain consistent with the pre-trained models, which were trained on the ImageNet dataset with an input size of \(224 \times 224\) per image. After re-scaling, the images were batched, with each batch containing \(16 \times 8 = 128\) examples (for the DenseNet and ResNet50 models) or \(8 \times 8 = 64\) examples (for the EfficientNetB1 model), i.e., a per-replica batch replicated across the 8 TPU cores. The batch size was kept large to utilize the TPU efficiently. Figure 3 shows the generalized workflow followed while developing the models for this paper.
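A sketch of the per-image standardization, resizing, and batching described above is given below; it assumes a tf.data pipeline `train_ds` yielding (gray-scale image, 14-dim label) pairs, and the exact pipeline details in the paper may differ.

```python
import tensorflow as tf

IMG_SIZE = 224
BATCH_SIZE = 16 * 8  # 128 for DenseNet/ResNet50; 8 * 8 = 64 for EfficientNetB1

def preprocess(image, labels):
    image = tf.image.grayscale_to_rgb(image)          # 1 channel -> 3 channels
    image = tf.cast(image, tf.float32)
    mean, var = tf.nn.moments(image, axes=[0, 1, 2])  # per-image statistics
    image = (image - mean) / tf.sqrt(var + 1e-7)      # standardize
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    return image, labels

train_ds = train_ds.map(preprocess).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```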

Fig. 3 Generalized workflow for the process

3 Model Development

3.1 Transfer Learning

Transfer learning refers to the idea of taking a model trained on one task and reusing it for a downstream task. Figure 4 briefly illustrates the idea. This paper takes a handful of models pre-trained on the ImageNet [45] dataset and trains or fine-tunes them for chest X-ray diagnosis using transfer learning. All model variants are trained on a TPU, and their AUROC scores are recorded on the held-out test set.

Fig. 4 Transfer learning process

3.2 DenseNet121

The first model used as the backbone for this task was DenseNet121 [46]. Densely Connected Convolutional Networks, or DenseNets, offer another way to keep increasing the depth of a convolutional network. Very deep ConvNets suffer from vanishing gradients during back-propagation; Huang et al. [46] designed an architecture that ensures maximum gradient flow during back-propagation to resolve this problem, and DenseNet further exploits the network's capacity through feature reuse. The DenseNet architecture is shown in Fig. 6(a). This paper uses DenseNet121, a 121-layer convolutional neural network; this model is inspired by, and is a re-implementation of, the CheXNet model. To apply transfer learning to DenseNet121, the final dense layer of the pre-trained model was replaced by a Dense layer with 14 units and a sigmoid activation. The resultant model architecture is shown in Fig. 5, and the training parameters for the DenseNet121 backbone are given in Table 2. A minimal sketch of this setup is given below.
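The sketch below shows the described head replacement in Keras: an ImageNet-pre-trained DenseNet121 with its classifier replaced by a 14-unit sigmoid layer. Layer names and the use of the functional API are implementation choices, not the paper's exact code.

```python
import tensorflow as tf

base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
# One sigmoid unit per pathology: multi-label, not mutually exclusive classes.
outputs = tf.keras.layers.Dense(14, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```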

Fig. 5 Model architecture using DenseNet121 as a backbone

3.3 ResNet50

The second model used as the backbone for the diagnosis task was ResNet50 [47]. ResNets have served as state-of-the-art models for various tasks. Beyond a certain depth, a plain deep neural network tends to give a larger error than a comparatively shallow one. He et al. [47] overcame this degradation problem by introducing a deep residual learning framework with skip (residual) connections. Figure 6(b) shows the architectures of different ResNets.

In this work, ResNet50V2, a 50-layer deep convolutional neural network with multiple residual connections, is used. The pre-trained model was modified to apply transfer learning to this diagnosis task: the final fully connected layers of the pre-trained ResNet50V2 model were replaced by average pooling, followed by a series of dense, ReLU, and dropout layers. A final dense layer with 14 units followed by a sigmoid was used for the output. The resultant model architecture is shown in Fig. 7, and the training parameters for the ResNet50V2 backbone are given in Table 3. A sketch of this head follows.
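A minimal sketch of the modified ResNet50V2 head described above; the dense-layer width and dropout rate here are assumptions, not the paper's reported values.

```python
import tensorflow as tf

base = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(base(inputs))
x = tf.keras.layers.Dense(512, activation="relu")(x)  # width is an assumption
x = tf.keras.layers.Dropout(0.3)(x)                   # rate is an assumption
outputs = tf.keras.layers.Dense(14, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```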

Fig. 6 Model architecture on ImageNet for a DenseNet [46]; b ResNet [47]; and c EfficientNet-B0 baseline network [48]

Fig. 7 Model architecture using ResNet50V2 as the backbone

Table 2 Training parameters for DenseNet121 backbone
Table 3 Training parameters for ResNet50V2 backbone

3.4 EfficientNetB1

The third and last model used as the backbone for this task was EfficientNetB1 [48]. EfficientNets are a class of models designed to optimize performance while keeping the number of trainable parameters considerably low. Tan and Le [48] proposed a better way of scaling a network, which they call compound scaling, in which width, depth, and image resolution are all scaled together in a balanced way. The baseline network architecture of EfficientNetB0 is shown in Fig. 6(c).

To apply transfer learning to EfficientNetB1, the final Dense layer of the pre-trained model was replaced by a Dense layer with 14 units and a sigmoid activation, exactly as for DenseNet121. The resultant model architecture is shown in Fig. 8, and the training parameters for the EfficientNetB1 backbone are presented in Table 4.

Fig. 8 Model architecture using EfficientNetB1 as a backbone

Table 4 Training parameters for EfficientNetB1 backbone

4 Model Training

The models were individually trained on a TPU using various batch sizes while maintaining the same hyper-parameters. Before training, the models were initialized with parameters from a network pre-trained on ImageNet [45]. The last layers were replaced with a dense layer of 14 units followed by a sigmoid to obtain the predicted probabilities of all 14 pathologies, as discussed in the previous sections. The images were resized to \(224 \times 224\) pixels before being input. Before being fed to the network, each image in the training set was subjected to random horizontal flips and random rotations of up to 10 degrees, as sketched below.
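A sketch of this training-time augmentation using Keras preprocessing layers, assuming the `train_ds` pipeline from Sect. 2.3:

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    # RandomRotation takes a fraction of a full turn: 10/360 = up to +/- 10 degrees.
    tf.keras.layers.RandomRotation(10.0 / 360.0),
])

# Applied to the training pipeline only; these layers are inactive at inference.
train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```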

4.1 Loss Function

The dataset is imbalanced. To account for this imbalance, instead of simply using a binary cross-entropy loss, a weighted binary cross-entropy loss is minimized, as suggested in [3].

$$\begin{aligned} J&=\sum _{c=1}^{14} L(y_c, \hat{y}_c)\\ L(y_c, \hat{y}_c)&=\frac{1}{N}\sum _{i=1}^N \left[ -w_{p_c}\, y_{c_i} \log (\hat{y}_{c_i}) - w_{n_c}\,(1-y_{c_i}) \log (1-\hat{y}_{c_i})\right] \end{aligned}$$

Here, c indexes the label classes, i refers to the \(i^{\text {th}}\) example in the training set, y is the true label, and \(\hat{y}\) is the predicted probability. The weights \(w_{p_c}\) and \(w_{n_c}\) are defined as follows.

$$\begin{aligned} w_{p_c}&=\frac{\text {Total negative examples in class}\, c}{\text {Total examples}}\\ w_{n_c}&=\frac{\text {Total positive examples in class}\, c}{\text {Total examples}} \end{aligned}$$
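A minimal sketch of this weighted loss as a custom Keras loss function; the clipping constant is a numerical-stability assumption.

```python
import tensorflow as tf

def weighted_bce(w_p, w_n, epsilon=1e-7):
    # w_p, w_n: shape-(14,) tensors holding the per-class weights defined above.
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, epsilon, 1.0 - epsilon)
        per_class = -(w_p * y_true * tf.math.log(y_pred)
                      + w_n * (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        # Mean over the batch, then summed over the 14 classes, matching J above.
        return tf.reduce_sum(tf.reduce_mean(per_class, axis=0))
    return loss
```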

Each model was trained on a TPU v3-8 on Kaggle for 100 epochs. The custom loss above, binary accuracy, and the AUROC score were monitored during training, and the optimizer of [49] was used. The learning rate was reduced by a factor of 10 if no improvement in validation loss was seen for two consecutive epochs. Early stopping with a patience of 10 epochs was used to prevent over-fitting and avoid wasting compute time. The end-to-end open-source deep learning framework TensorFlow was used to train and evaluate the models.
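The sketch below ties these pieces together: TPU distribution, learning-rate reduction on plateau (factor 10, patience 2), and early stopping (patience 10). Here `build_model`, `w_p`, `w_n`, `train_ds`, and `val_ds` refer to the earlier sketches, and the `"adam"` string stands in for the optimizer of [49], which is an assumption.

```python
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model()  # e.g., the DenseNet121 model sketched in Sect. 3.2
    model.compile(optimizer="adam",  # stands in for the optimizer of [49]
                  loss=weighted_bce(w_p, w_n),
                  metrics=[tf.keras.metrics.AUC(multi_label=True),
                           tf.keras.metrics.BinaryAccuracy()])

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```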

5 Results and Performance

The overall workflow for disease prediction on sample X-rays is shown in Fig. 9.

Fig. 9 Proposed CXD server: a front view of the CXD, with input boxes and a blue submission button; b uploading a sample chest X-ray; c predicted results for the given sample using the deep learning approaches; and d previous history along with confidence scores generated by the embedded deep learning approaches

In this study, the metric used for comparing the models is the AUROC score and curve. In this section, the results of the models are discussed and compared, and all the models are compared to the CheXNet model [3]. AUROC (Area Under the Receiver Operating Characteristic curve) is a performance metric that evaluates classification models across threshold values. The ROC is a probability curve, and the AUC represents the degree of separability: the higher the AUC, the better a model differentiates between the positive and negative classes. A ROC curve plots the true positive rate against the false positive rate.

The true positive rate, also referred to as sensitivity, measures the proportion of positive examples in the dataset that the model correctly identified as positive, i.e.,

$$\begin{aligned} \text {TPR}=\frac{\text {True positives}}{\text {True positives} + \text {False negatives}} \end{aligned}$$

The false positive rate, which equals \(1-\text {specificity}\), is the fraction of the total negative examples in the dataset that the model incorrectly predicted as positive, i.e.,

$$\begin{aligned} \text {FPR}=\frac{\text {False positives}}{\text {False positives} + \text {True negatives}} \end{aligned}$$
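For completeness, a short sketch of how the per-class AUROC scores reported below could be computed with scikit-learn; `y_true` and `y_score` are assumed (N, 14) arrays of labels and predicted probabilities.

```python
from sklearn.metrics import roc_auc_score

def per_class_auroc(y_true, y_score, class_names):
    # One AUROC per pathology, treating each column as a binary problem.
    return {name: roc_auc_score(y_true[:, k], y_score[:, k])
            for k, name in enumerate(class_names)}
```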

The AUROC scores for the different classes for all three model variants are listed in Table 5, and the ROC curves are illustrated in Fig. 10. The experimental results demonstrate that the model built on DenseNet121 outperformed the other two models, in line with several studies [50,51,52] previously published in the literature. Some possible reasons for this superior performance are:

  • The network architecture of DenseNet121 is more complex, enabling the model to extract more features from the data and potentially improve its performance.

  • The DenseNet121 model uses a densely connected pattern between its layers, which can help decrease the number of parameters in the model and avoid overfitting.

  • DenseNet121 incorporates batch normalization and skip connections to enhance convergence and performance.

Fig. 10 a AUROC curve for DenseNet121 backbone; b AUROC curve for ResNet50 backbone; and c AUROC curve for EfficientNetB1 backbone

Table 5 AUROC scores for different model variants

Further, we employed a pre-trained DenseNet121 model and modified its fully connected layers as previously described. We then assessed the model's performance without any further training and found ROC values of approximately 0.5, indicating that its predictions were akin to random guessing; Fig. 11(a) illustrates this graphically. Similarly, we froze the DenseNet121 backbone, leaving its parameters unmodified, and trained only the global average pooling (GAP) and final softmax layers added on top of it. The outcomes of this method are shown in Fig. 11(b).

Fig. 11 a ROC curve for the DenseNet121 model without training; and b ROC curve for the DenseNet121 model after training only the GAP and final softmax layers

Furthermore, DenseNet121 was used as a frozen feature extractor, and a more intricate fully connected head was trained on top of it. Only the fully connected layers were updated during training; the rest of the DenseNet121 architecture remained unchanged. Figure 12 illustrates the new layers appended to the fully connected network, and the resulting ROC curve is displayed in Fig. 13.

Fig. 12 The architecture of the extra layers added to the fully connected network of DenseNet121

Fig. 13 ROC curve for the extra layers added to the fully connected network of DenseNet121

We computed lower and upper confidence intervals for ResNet50, EfficientNetB1, and DenseNet121 to further analyze these models. A confidence interval is a range of values that is likely to contain the true value of a population parameter. The lower and upper confidence bounds for these models indicate the potential range of performance when applied to a specific task or dataset: for instance, if the lower bound of a \(95\%\) confidence interval for a model's accuracy is 0.80, we can be \(95\%\) confident that the true accuracy is at least 0.80. These intervals help quantify the uncertainty surrounding a model's performance and support comparisons between models. Based on the assumption of a normal distribution, Tables 6, 7, 8, 9, 10, and 11 show the minimum and maximum estimated prevalence of Atelectasis at a \(95\%\) confidence level for the ResNet50, EfficientNetB1, and DenseNet121 models, respectively. In these tables, TPR denotes the true positive rate and FPR the false positive rate. A sketch of the interval computation is given below.
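The sketch below shows a 95% confidence interval for a proportion (e.g., a TPR) under the normal approximation; this is the standard Wald interval, which may or may not match the exact procedure used for Tables 6, 7, 8, 9, 10, and 11.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    # 95% normal-approximation interval for a proportion from n samples.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

lower, upper = wald_ci(0.85, 5389)  # e.g., a rate measured on the test set
```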

Table 6 The minimum estimated prevalence of Atelectasis disease with a \(95\%\) confidence level based on the assumption of a normal distribution for ResNet50
Table 7 The maximum estimated prevalence of Atelectasis disease with a \(95\%\) confidence level based on the assumption of a normal distribution for ResNet50
Table 8 The minimum estimated prevalence of Atelectasis disease with a \(95\%\) confidence level based on the assumption of a normal distribution for EfficientNetB1
Table 9 The maximum estimated prevalence of Atelectasis disease with a \(95\%\) confidence level based on the assumption of a normal distribution for EfficientNetB1
Table 10 The minimum estimated prevalence of Atelectasis disease with a \(95\%\) confidence level based on the assumption of a normal distribution for DenseNet121
Table 11 The maximum estimated prevalence of Atelectasis disease with a \(95\%\) confidence level based on the assumption of a normal distribution for DenseNet121

Next, we thoroughly analyzed the lower and upper bounds of the precision-recall (PR) curves for the three models under consideration. The PR curve is particularly informative for datasets with imbalanced classes, since it focuses on the positive class through precision and recall; the ROC curve remains a useful complement because it also accounts for the false positive rate on the negative class. We therefore include both the PR and ROC curves to provide a more comprehensive evaluation of these models, as depicted in Figs. 14, 15, and 16.

Fig. 14 PR curve for a the lower bound and b the upper bound at the \(95\%\) confidence interval for the ResNet50 model

Fig. 15 PR curve for a the lower bound and b the upper bound at the \(95\%\) confidence interval for the EfficientNetB1 model

Fig. 16 PR curve for a the lower bound and b the upper bound at the \(95\%\) confidence interval for the DenseNet121 model

The aforementioned Tables 6, 7, 8, 9, 10, and 11 also display the F1 score, also known as the F-measure or F-score. It combines precision and recall into a single score and is commonly used in classification tasks; it is computed as the harmonic mean of precision and recall. Precision is the number of true positive predictions divided by the total number of positive predictions, and recall is the number of true positive predictions divided by the total number of actual positive samples. The F1 score is valuable for evaluating classification models because it balances precision and recall and allows the comparison of models with different precision and recall values.
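For reference, the harmonic-mean definition used here can be written as:

$$\begin{aligned} F_1 = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$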

6 Conclusions and Future Scopes

Pneumonia is a major cause of human fatalities worldwide. According to the Centers for Disease Control and Prevention, over one million adults in the US are hospitalized due to pneumonia, and around 50,000 die from the disease each year; India records over 10 million cases of pneumonia each year. Although chest X-rays are the most effective means of diagnosing pneumonia [53], medical imaging is constrained by limited access to expertise in some areas [54]. Additionally, chest radiographs may be used to diagnose other illnesses.

In addition, even expert radiologists are limited by various human factors [38, 55,56,57,58]. The creation of automated detection systems could therefore greatly benefit humanity. Because training the three models considered here on a CPU is impractical, this study uses a TPU. The CXD server has an improved, more efficient interface and has been developed using a large chest X-ray dataset collected up to January 2021; this extensive data is continually used to enhance the proposed CXD server and ensure the quality of our work.

The objective of this study was to automate an essential stage of the radiology workflow by utilizing three convolutional neural networks (CNNs), namely DenseNet121, ResNet50, and EfficientNetB1, to detect 14 types of thoracic pathologies from chest radiographs. A dataset of 112,120 chest X-rays covering various thoracic pathologies was used to evaluate the models on their ability to predict the likelihood of individual diseases and alert clinicians to potentially abnormal findings. The results indicate that the DenseNet121 model outperformed the other two models in terms of the per-class scores achieved on the dataset. Furthermore, the performance of these models was compared to that of the CheXNet model.

Our future plans involve expanding our CNN training by incorporating additional data and assessing different architectures for diagnosing other thoracic pathologies. We believe that a computer-aided diagnostic tool of this kind could significantly enhance the effectiveness and precision of diagnosing thoracic pathologies, including during pandemics such as COVID-19 and swine flu. Such a tool could prove especially useful during a pandemic, when the demand for prevention and treatment often surpasses the available resources.