Introduction

Pneumothorax (collapsed or dropped lung) is an emergency condition in which air enters the pleural space, i.e., the space between the lung and the chest wall1,2. It is one of the Category 1 findings that, as recommended by the American College of Radiology (ACR), should be communicated to clinicians within minutes so that immediate action can be taken to avoid fatalities3,4. A graphical illustration of pneumothorax is provided in Fig. 1.

Figure 1

A graphical illustration of pneumothorax. [Medical gallery of Blausen Medical 2014, WikiJournal of Medicine5].

It is generally a serious condition that can be fatal1. To prevent patient death, early detection of pneumothorax through the application of deep learning may be a viable option4. Pneumothorax is typically detected on chest X-ray images by qualified radiologists2. However, nowadays radiologists may have to process many X-ray studies daily6. This increasing workload understandably delays diagnosis and treatment. Experience is absolutely necessary in this process, but even the most experienced expert may miss the subtleties of an image7,8. Since a wrong or delayed diagnosis can cause harm to patients9, it is vital to develop computer-aided approaches to assist radiologists in their daily workflow.

Chest X-ray is the most common medical imaging modality, with over 35 million images taken every year in the U.S. alone10. X-ray images allow for inexpensive screening of several chest conditions, including pneumothorax6. Since daily hospital workloads may result in long queues of radiology images waiting to be read, including images acquired overnight or images without any clinical pre-screening, an automated method for inspecting chest X-rays and prioritizing studies with potentially positive findings for rapid review may reduce the delay in diagnosing and treating pneumothorax11.

Due to its recent success, an increasing number of studies have adopted “deep learning” for processing digital images, referring to the use of Deep Neural Networks (DNNs), defined as artificial neural networks with more than 3 hidden layers12 (in practice, DNNs consist of many more hidden layers), to detect pneumothorax or other thoracic diseases in chest X-ray images11,13,14,15,16,17,18,19,20,21,22,23. Deep-learning pneumothorax detection systems fall into two categories: (1) detection methods, which pinpoint the exact location of certain thoracic diseases in an input chest X-ray image, and (2) classification methods, which identify the presence of certain thoracic diseases in an input chest X-ray image without highlighting the exact location of the disease.

From a practical point of view, detection systems remain hard to develop, as large high-quality datasets with pixel-level labels are needed to train them. These datasets are expensive to obtain, as creating representative and accurate labels constitutes tedious manual work for radiologists (for instance, many works focus on large and medium-sized pneumothorax11). Classification systems, on the other hand, are relatively easier to develop as they only need image-level annotations.

So far, more than half a million chest X-ray images with image-level (or global) labels have been released through the ChestX-ray1413, CheXpert19 and MIMIC-CXR24 datasets. Leveraging such a large number of labelled images, classification-based systems should not be difficult to train and deploy. Nevertheless, the main drawback of a classification system is that it only outputs a single probability, a number that quantifies the likelihood that the chest X-ray contains a certain abnormality. This may not be enough to justify the diagnosis.

Image search, as an approach different from classification, not only can provide a probabilistic output like a classifier by taking a weighted majority vote among matched images, but also can provide access to the metadata of similar cases from the past, a functionality that a trained deep network for classification cannot offer. Hence, image search allows a comparison of patient histories and treatment profiles because it provides a list of matched cases and not just a classification probability25. Therefore, image search may enable a virtual “second opinion” for diagnostic purposes and provide computerized explanations for decision support. While image search may be more viable for clinical deployment in terms of explainability, its classification performance still needs to be investigated, specifically whether image search can achieve an identification performance as high as that obtained by classification-based deep learning systems.

In this study, we explored the use of image search based on deep features obtained from DNNs to detect pneumothorax in chest X-ray images. To use image search as a classifier, all chest X-ray images were first tagged with a feature vector. Given a query chest X-ray image, a majority vote among the top k retrieved chest X-ray images was then used as the classification decision. The corresponding reports and other clinical metadata of the top search results can also be used if available, an inherent benefit of image search.

Our contributions in this study are twofold. First, we developed AutoThorax-Net, which integrates multiple images into one feature vector. Although the benefit of using deep features for processing X-ray images is well established, in this study we demonstrated that breaking the image down into multiple sub-images, here by exploiting chest symmetry, does in fact provide better results; representing the left and right lung separately alongside the entire image apparently increases recognition accuracy. Flipping one lung, as a coarse type of image registration, may also contribute to better feature matching. Second, experimental results demonstrate that image search on AutoThorax-Net features achieves higher detection performance than using feature vectors extracted solely from the input image. The results also show that image search using AutoThorax-Net features, with the dimensionality reduced by a factor of 12, achieves detection performance comparable to or better than that of existing systems such as CheXNet14.

Related works

Deep learning for analyzing chest x-ray images

Since the release of the ChestX-ray14 dataset13 by the National Institutes of Health, providing 112,120 frontal-view X-ray images of 30,805 unique patients labelled for 14 diseases (where each image may have multiple labels), an increasing number of studies have adopted DNNs to develop automated systems for detecting diverse diseases on chest X-ray images11,14,15,16,17,18,19,20,21,22. CheXNet14, a DNN with the DenseNet121 architecture26, has been trained on the ChestX-ray14 dataset and achieved radiologist-level detection of pneumonia. Since then, many DNN architectures have been proposed for a variety of tasks ranging from localization21, lateral and frontal dual chest X-ray reading15, integration of non-image data in classification20, attention-guided approaches27, location-aware schemes6, weakly-supervised methods28,29 as well as generative models8.

For detecting pneumothorax, a recent study collected 13,292 frontal chest X-rays (3,107 with pneumothorax) to train a DNN to verify the presence of a large or moderate-sized pneumothorax. Another recent study4 collected 1,003 images (437 with pneumothorax and 566 with no abnormality) to detect pneumothorax with DNNs. So far, no study has investigated pneumothorax detection on a large dataset, for instance by combining the three large public datasets.

Deep learning for content-based image retrieval

Retrieving similar images given a query image is known as Content-Based Image Retrieval (CBIR)30 or, for medical applications, Content-Based Medical Image Retrieval (CBMIR)31. While classification-based methods provide promising results32, CBMIR systems can assist clinicians by enabling them to compare the case they are examining with previous (already diagnosed) cases and by exploiting the information in the corresponding medical reports31. They may also help radiologists prepare reports for a particular diagnosis faster and more reliably31.

While deep learning methods have been applied to CBIR tasks in recent studies33,34, there has been less attention on exploring deep learning methods for CBMIR tasks35,36.

One study investigated the performance of DNNs for MR and CT images with human anatomy labels35. Another study investigated the retrieval performance of DNNs among multimodal medical images of different body organs36. There is also a study exploring the hashing of deep features into binary codes, tested on lung, pancreas, neuro and urothelial bladder images37. Deep Siamese Convolutional Neural Networks38 have also been tested for CBMIR, to minimize the use of expert labels, on multiclass diabetic retinopathy fundus images. Another study explored deep learning for CBMIR among multiple modalities for a large number of classes39. So far, there has not been any report validating CBMIR techniques on a challenging case like pneumothorax detection in large datasets. We attempt to close this gap by reporting our results on a large dataset of chest X-ray images obtained by fusing three public datasets.

Methods

Given a query chest X-ray image, the problem is to determine whether it contains pneumothorax by searching the archived images and taking a majority vote among the retrieved similar images.

The proposed method of using image search as a classifier comprises three phases (Fig. 2): (1) tagging images with deep features (all images in the database are fed into a pre-trained network to extract features), (2) image search (tagging the query image with features and calculating its distance to all other features in the archive to find the most similar images), and (3) classification (a majority vote among the labels of the retrieved similar images).

Figure 2

An overview of using image search as a classifier to recognize pneumothorax in chest X-ray images. Phase 1: Tagging images with features. Phase 2: Image search (distance calculation between the query features and all other features in the database). Phase 3: Classification (a majority vote among the retrieved images serves as the classifier).

Phase 1: tagging images with deep features

In this phase, all chest X-ray images in the archive are tagged with deep features. To represent a chest X-ray image as a feature vector with a fixed dimension, the output of the last pooling layer may be used as the image representation. In other words, the pre-trained deep convolutional neural network is used as a feature extractor to convert a chest X-ray image into an n-dimensional feature vector, with n = 1,024 being a typical value for many networks. In our study, DenseNet12126 is used for converting a chest X-ray image into a feature vector with 1,024 dimensions. The DenseNet topology has a strong gradient flow contributing to diverse features and is, compared to many other architectures such as ResNet and EfficientNet, quite compact with only 7 million weights. We adopted DenseNet121 also for a fair comparison in experiments with CheXNet, which uses DenseNet121 as its backbone architecture. Three configurations are explored to extract deep features from a chest X-ray image (Fig. 3):

Figure 3

An overview of the three configurations using DenseNet12126 to extract features from a chest X-ray image. Configuration 1: a feature vector is extracted from the entire chest X-ray image. Configuration 2: two feature vectors are extracted from the left chest side and the flipped right chest side; the final feature vector is a concatenation of these two feature vectors. Configuration 3: three feature vectors are extracted from the left chest side, the flipped right chest side, and the entire chest X-ray image, respectively.

  • Configuration 1—a feature vector is extracted from the entire chest X-ray image. Representing the entire image with one feature vector is quite common and assumes that the object or abnormality is adequately quantified in that single vector.

  • Configuration 2—two feature vectors are extracted, one from the left chest side and one from the flipped version of the right chest side. The final feature vector is a concatenation of these two feature vectors. If DenseNet12126 is adopted as the feature extractor, the feature vector has 2,048 values. The rationale behind this idea is to allow expressive features for each side of the chest to be quantified separately, making feature matching easier for the unsupervised search. In addition, flipping the right lung is a registration-like operation that facilitates alignment in matching.

  • Configuration 3—three feature vectors are extracted as described in the previous two configurations. The final feature vector is a concatenation of these three feature vectors. If DenseNet12126 is adopted as the feature extractor, the final feature vector has 3,072 real-valued dimensions. The rationale behind this configuration is that matching a combined feature vector representing the whole image, the left chest side and the flipped right chest side not only provides a global image view but also gives more focused and aligned attention to each chest side, emphasizing their features in the search and matching process (a minimal feature-extraction sketch follows this list).
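As a concrete illustration, below is a minimal sketch of Configuration 3 feature extraction with a pre-trained Keras DenseNet121 (global average pooling yields the 1,024-dimensional vector per input). The use of OpenCV for resizing, the resizing of each half back to 224 × 224, and the exact split and flip orientation are assumptions made for illustration and are not prescribed by the text.

```python
import numpy as np
import cv2  # hypothetical choice for resizing; any resampling routine would do
from keras.applications import DenseNet121
from keras.applications.densenet import preprocess_input

# Pre-trained DenseNet121 as a fixed feature extractor:
# global average pooling of the last convolutional block yields a 1,024-dim vector.
extractor = DenseNet121(weights="imagenet", include_top=False, pooling="avg")

def densenet_features(img_batch):
    """img_batch: float array of shape (N, 224, 224, 3)."""
    return extractor.predict(preprocess_input(img_batch))

def configuration3_features(image):
    """image: one frontal chest X-ray of shape (224, 224, 3), already resized.

    Returns a 3 x 1,024 = 3,072-dim concatenated feature vector built from the
    whole image, one chest side, and the horizontally flipped other chest side.
    """
    h, w, _ = image.shape
    left_half = image[:, : w // 2]                    # one chest side
    right_half_flipped = image[:, w // 2:][:, ::-1]   # other side, flipped as coarse registration

    batch = np.stack([
        image,
        cv2.resize(left_half, (224, 224)),
        cv2.resize(right_half_flipped, (224, 224)),
    ]).astype("float32")

    fv_whole, fv_left, fv_right = densenet_features(batch)
    return np.concatenate([fv_whole, fv_left, fv_right])  # 3,072 values
```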

Phase 2: image search

In this phase, the distance between the deep features of the query chest X-ray image and those of all chest X-ray images in the database is computed. The chest X-ray images with the shortest distances to the query chest X-ray image are subsequently retrieved. The Euclidean distance, as the most commonly used norm for deep feature matching, was used for computing the distance between the deep features of two given chest X-ray images. It is the geometric distance in the multidimensional space recommended when all variables have the same metric40. The calculated distances can be sorted to retrieve as many matched images as desired. The impact of other distance norms on retrieval may be investigated in future works.
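The retrieval step can be sketched as follows, assuming all archived feature vectors are stacked into a NumPy matrix; the variable names are placeholders.

```python
import numpy as np

def retrieve_top_k(query_fv, archive_fvs, k):
    """Return the indices of the k archived images closest to the query.

    query_fv:    (d,) feature vector of the query chest X-ray
    archive_fvs: (N, d) matrix of feature vectors of all archived chest X-rays
    """
    # Euclidean (L2) distance between the query and every archived feature vector
    distances = np.linalg.norm(archive_fvs - query_fv, axis=1)
    # Sort ascending and keep the k nearest neighbours
    return np.argsort(distances)[:k]
```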

Phase 3: classification

In this phase, the majority vote among the labels of the retrieved chest X-ray images is used as the classification decision. For example, given a query chest X-ray image, the top k most similar chest X-ray images are retrieved. If m chest X-ray images are labelled with pneumothorax (with m ≤ k), the query image is classified as pneumothorax with a class likelihood of m/k. The larger k is, the more reliable the classification vote becomes. This, in turn, requires a large archive of tagged images to increase the probability of finding similar images.
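A minimal sketch of the voting step, continuing the placeholder names from the previous sketch; the default threshold of 0.5 (simple majority) is an assumption, since in the experiments the decision threshold is chosen from the ROC curve via Youden's index.

```python
import numpy as np

def classify_by_vote(retrieved_labels, threshold=0.5):
    """Majority-vote classification from the labels of the k retrieved images.

    retrieved_labels: array of 0/1 labels (1 = pneumothorax) of the top-k matches
    threshold:        decision threshold on the vote ratio m/k
    """
    k = len(retrieved_labels)
    m = int(np.sum(retrieved_labels))   # retrieved images labelled pneumothorax
    likelihood = m / k                  # class likelihood m/k
    return likelihood, likelihood >= threshold

# Example for one query image (placeholder names from the previous sketch):
# idx = retrieve_top_k(query_fv, archive_fvs, k=1001)
# likelihood, is_pneumothorax = classify_by_vote(archive_labels[idx])
```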

Compressing feature dimensionality using autoencoders

The dimensionality of the feature vectors, especially the concatenated ones, may become a computational obstacle, but it can be reduced by employing autoencoders. One may use an autoencoder for all configurations, but our main motivation was a size reduction for the longest feature vector, i.e., that of Configuration 3. Two steps are required to construct an encoder to reduce the feature vector dimensionality:

  • Step 1: Unsupervised end-to-end training with a decoder. An autoencoder with the architecture summarized in Fig. 4a is first constructed. A dropout layer41 with a probability of 0.2 is introduced between consecutive layers to reduce the probability of overfitting. The model is then trained for 10 epochs by backpropagation with the outputs set equal to the inputs. The batch size, loss function and optimizer were set to 128, mean squared error and Adam, respectively. The training details are visualized in Fig. 5.

    Figure 4

    An overview of the autoencoder topologies: (a) an autoencoder, with an encoder (highlighted in blue) and a decoder (highlighted in yellow), is constructed during Step 1: unsupervised end-to-end training with a decoder, (b) an encoder (highlighted in blue) with an extra layer added (highlighted in green) is constructed during Step 2: supervised fine-tuning with labels, (c) the final encoder (highlighted in blue) is obtained by removing the one-dimensional layer.

    Figure 5

    Visualization of training and validation loss in both phases (for the first fold). The weights of the 10th epoch, for both phases, are used in the experiments.

  • Step 2: Supervised fine-tuning with labels. After the training, the decoder is removed as we only need the encoding part for dimensionality reduction. Instead, a fully connected layer with a single output neuron and a sigmoid activation function was appended for the training phase. The network was trained for 10 epochs with a batch size of 128 using the binary cross-entropy loss function and the Adam optimizer. The training details are visualized in Fig. 5. During training, to deal with class imbalance, an individual class weight was set for each class using the following formula:

    $$ W_{c_j} = \frac{S}{C \times S_{c_j}} $$
    (1)

    where $W_{c_j}$ is the class weight of class $c_j$; $C$ is the total number of classes; $S$ is the total number of training samples; and $S_{c_j}$ is the total number of training samples belonging to class $c_j$.

The model architecture is summarized in Fig. 4b. Similarly, a dropout layer41 was introduced between the 256-dimensional layer and the one-dimensional layer to reduce the probability of overfitting. The model is then trained through backpropagation. After the training, the one-dimensional fully connected layer is removed. The architecture of the final encoder is summarized in Fig. 4c.
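For illustration, a minimal Keras sketch of the two-step encoder construction is given below. The 3,072-dimensional input, the 256-dimensional code, the dropout rate, epochs, batch size, loss functions, optimizer and class weighting follow the text; the intermediate layer width, the ReLU activations and the placeholder arrays (X_train, X_val, y_train) are assumptions made for this sketch.

```python
from keras.models import Model
from keras.layers import Input, Dense, Dropout

FV_DIM, CODE_DIM = 3072, 256   # concatenated feature size and compressed code size
# X_train, X_val: (N, 3072) concatenated deep features; y_train: (N,) 0/1 labels (placeholders)

# Step 1: unsupervised end-to-end training with a decoder (outputs set equal to inputs).
inp = Input(shape=(FV_DIM,))
x = Dropout(0.2)(inp)
x = Dense(1024, activation="relu")(x)          # intermediate width is an assumption
x = Dropout(0.2)(x)
code = Dense(CODE_DIM, activation="relu", name="code")(x)
x = Dropout(0.2)(code)
x = Dense(1024, activation="relu")(x)
out = Dense(FV_DIM, activation="linear")(x)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=10, batch_size=128,
                validation_data=(X_val, X_val))

# Step 2: supervised fine-tuning with labels.
# Drop the decoder, append a single sigmoid neuron, and train with class weights (Eq. 1).
encoder = Model(inp, autoencoder.get_layer("code").output)
clf_out = Dense(1, activation="sigmoid")(Dropout(0.2)(encoder.output))
classifier = Model(inp, clf_out)
classifier.compile(optimizer="adam", loss="binary_crossentropy")

S = len(y_train)                                                     # total training samples
class_weight = {c: S / (2 * (y_train == c).sum()) for c in (0, 1)}   # W_c = S / (C * S_c), C = 2
classifier.fit(X_train, y_train, epochs=10, batch_size=128, class_weight=class_weight)

# The fine-tuned encoder (sigmoid layer removed) compresses 3,072-dim features to 256.
compressed_val = encoder.predict(X_val)
```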

Model architecture

The architecture of AutoThorax-Net to obtain features from a chest X-ray image is illustrated in Fig. 6.

Figure 6

A graphical illustration of AutoThorax-Net. Three extracted feature vectors (FVs) (left) are concatenated and fed into an encoder that compresses them into 256 values for image search. During training, the same compressed FV is used as input to a fully connected layer (FC) to classify images as pneumothorax with a likelihood of p.

Results

In this section, we first describe the datasets collected and the preprocessing procedures. We then describe the experiments that were conducted, followed by the analysis. The main goal of the experiments is to validate the performance of image search via matching deep features. In order to quantify performance, we treat search as a classifier by taking a consensus vote guided by the ROC statistics. We also compare the results with CheXNet (without any modification or fine-tuning), an end-to-end deep network specifically trained for classifying chest X-ray images.

Dataset collection

Three large public datasets of chest X-ray images were collected. The first is MIMIC-CXR24,42, a dataset of 371,920 chest X-rays associated with 227,943 imaging studies. A total of 248,236 frontal chest X-ray images in its training set were used in this study. The second dataset is CheXpert19, consisting of 224,316 chest radiographs belonging to 65,240 patients. A total of 191,027 frontal chest X-ray images in its training set were used in this study. The third dataset is ChestX-ray1413, consisting of 112,120 frontal-view X-ray images of 30,805 patients. All chest X-ray images in this dataset were used in this study.

In total, 551,383 frontal chest X-ray images were used in our investigations. 34,605 images (6% of all images) were labelled as pneumothorax. The labels refer to the entire image; the collapsed lungs were not highlighted in any way.

Implementation and parameter setting

We used the Keras library (http://keras.io/) v2.2.4 with the Tensorflow backend43 to implement the approach. As the pre-trained network for feature extraction, DenseNet121 was selected26, and the weight file was obtained through the default setting of Keras. For CheXNet14, the weight file was downloaded from GitHub (https://github.com/brucechou1983/CheXNet-Keras). All images were resized to 224 × 224 before being fed into the networks. All other parameters were default values unless otherwise specified. All experiments were run on a computer with 64.0 GB DDR4 RAM, an Intel Core i9-7900X @3.30 GHz CPU (10 cores) and one GTX 1080 graphics card.

Performance evaluation

Following the relevant literature14,19, classification performance was evaluated by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve to enable comparison over a range of prediction thresholds. As tenfold cross-validation was conducted in the experiments, the average ROC was computed with a 95% confidence interval.
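For reference, a minimal sketch of the per-fold AUC computation with scikit-learn is given below; averaging the per-fold AUCs with a normal-approximation interval is our assumption of how the 95% confidence interval could be obtained.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fold_auc(y_true, y_score):
    """AUC of one fold; y_score is the vote ratio m/k (or a network probability)."""
    return roc_auc_score(y_true, y_score)

def cross_validation_auc(fold_aucs):
    """Mean AUC over the folds with a normal-approximation 95% confidence interval."""
    fold_aucs = np.asarray(fold_aucs, dtype=float)
    mean = fold_aucs.mean()
    half_width = 1.96 * fold_aucs.std(ddof=1) / np.sqrt(len(fold_aucs))
    return mean, (mean - half_width, mean + half_width)
```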

Dataset preparation & preprocessing

There is a concern regarding the ChestX-ray14 dataset13 that its chest X-ray images showing chest tubes were frequently labelled with pneumothorax44,45. As we combined ChestX-ray14 with the CheXpert19 and MIMIC-CXR24,42 datasets in our experiments, this bias is mitigated.

Dataset 1 (semi-automated detection)

If there is a suspicion of pneumothorax by the user, the search is limited to the archived images that are either normal (i.e., no finding) or pneumothorax. This dataset comprises 34,605 pneumothorax chest X-ray images and 160,003 normal chest X-ray images. Searching in this dataset means there is already a suspicion by the expert that the image may contain pneumothorax, hence the search is restricted to archived images diagnosed as either pneumothorax or normal (no finding). The pneumothorax images were obtained from the collected frontal chest X-ray images with the label "Pneumothorax" and were considered the positive (+ve) class. The normal images were obtained from the collected frontal chest X-ray images with the label "No Finding" and were considered the negative (−ve) class. A summary of Dataset 1 is provided in Table 1.

Table 1 A summary of chest X-ray images in the Dataset 1 through combination of three public datasets.

Dataset 2 (fully-automated detection)

If there is no concrete suspicion from the user, we match the input image against all other images regardless of their tagged disease label. This dataset comprises 34,605 pneumothorax chest X-ray images and 516,778 non-pneumothorax chest X-ray images. Searching in this dataset means the computer automatically searches all images to verify the likelihood of pneumothorax without any guidance from the expert. The pneumothorax images were obtained from the collected frontal chest X-ray images with the label "Pneumothorax" and were considered the positive (+ve) class. The non-pneumothorax images were obtained from the collected frontal chest X-ray images without the label "Pneumothorax", meaning that they may contain cases such as normal, pneumonia, edema, cardiomegaly and more. They were considered the negative (−ve) class. A summary of Dataset 2 is provided in Table 2.

Table 2 A summary of chest X-ray images in the Dataset 2 through combination of three public datasets.

First experiment series: semi-automated solution

The first experiment series focuses on a “semi-automated” solution for pneumothorax. We confine the search and classification to cases that are either normal or diagnosed with pneumothorax (Dataset 1). We test all three configurations (Fig. 3), CheXNet, and the proposed AutoThorax-Net.

Experimental workflow

All images of Dataset 1 for all configurations were first tagged with deep features. We constructed the receiver operating characteristic (ROC) curve for the dataset to find the trade-off between sensitivity and specificity (Fig. 7). We used Youden’s index46 to find the trade-off position on the ROC curve, providing the threshold for match selection. Youden’s index can be calculated as “sensitivity + specificity − 1”. A standard tenfold cross-validation was adopted for the tests and showed a very low standard deviation for all experiments, apparently due to the large size of the datasets. All tagged chest X-ray images were divided into 10 groups. In each fold, one group of chest X-ray images was used as the validation set, while the remaining chest X-ray images were used as “archived” images to be searched. The above process was repeated 10 times, such that in each fold a different group of chest X-ray images was used as the validation set. In each fold, an encoder was trained using the archived set of that fold. The encoder was then used for compressing the deep features of each chest X-ray image in the validation set.
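A minimal sketch of the threshold selection via Youden’s index is shown below, assuming scikit-learn and using the vote ratios of the validation fold as scores.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Select the decision threshold maximizing Youden's index J = sensitivity + specificity - 1.

    y_true:  0/1 labels of the validation images
    y_score: vote ratios m/k produced by image search
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                 # J = TPR - FPR, since specificity = 1 - FPR
    best = np.argmax(j)
    return thresholds[best], tpr[best], 1 - fpr[best]  # threshold, sensitivity, specificity
```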

Figure 7

Analysis for Dataset 1. Left: Sample ROC curves for the proposed AutoThorax-Net for different k values in one fold and their area under the curve (AUC), Right: corresponding ROC thresholds for k = 1001 to select the sensitivity–specificity trade-off using Youden’s index.

The parameters of the encoder construction process are described as follows:

Step 1 Unsupervised end-to-end training with decoder: The number of training epochs and the batch size were set to 10 and 128, respectively. The loss function was chosen as mean squared error. The Adam optimizer47 was used. The dropout rate was set to 0.2, i.e., a 20% probability of setting a neuron’s output to zero to counteract possible overfitting.

Step 2 Supervised fine-tuning with labels: The loss function was chosen as binary cross-entropy. Other parameters remained the same as in Step 1.

Given a query chest X-ray image from the validation set, image search was conducted on the archived set to retrieve k similar images for each query image. The consensus vote among the top k retrieved chest X-ray images subsequently determines whether the query image shows pneumothorax. Results were generated with k = 11, 51, 101, 251, 501 and 1001. As we were using a large number of archived images, one was expecting to see better results for higher k values.

Results

Experimental results on Dataset 1 are summarized in Table 3 for AutoThorax-Net, CheXNet and all three feature configurations from Fig. 3. We calculated the area under the curve (AUC), sensitivity and specificity for all 10 folds. Standard deviations were quite low (< 1%), hence not reported. Figure 8 shows the confusion matrices for both AutoThorax-Net and CheXNet.

Table 3 A summary of classification performance using image search as a classifier on Dataset 1. The numbers (in percentage) are the result of averaging 10 folds with very low standard deviation (< 1%).
Figure 8

Dataset 1: Confusion matrices for the best results of AutoThorax-Net (left) and CheXNet (right).

The average sensitivity and specificity obtained by Configuration 3 for k = 1001 are higher than those obtained by Configuration 1, although they have almost the same AUC. Configuration 1 shows higher sensitivity for k = 11 (86% versus 83%) but its specificity is lower than that of Configuration 3 (76% versus 80%). Configuration 2 delivers a similar AUC in the 88% range but, in individual comparisons, is always worse than the other configurations, with lower sensitivity and specificity.

AutoThorax-Net clearly has the highest AUC (92%). CheXNet delivers an AUC of 88%, similar to the three search configurations. The highest sensitivity of 86% is achieved by all tested methods. However, AutoThorax-Net also provides a specificity of 84%, whereas the specificities of all other methods, including CheXNet, are in the 70% range.

To verify that the improvements of the proposed methods are significant, we performed the two-sided Wilcoxon signed-rank test48 between the performance of CheXNet and our best performing configuration, which is AutoThorax-Net with k = 1001. Our results, shown in Table 3, indicate that AUC and specificity improved from 88 to 92% and from 76 to 84%, respectively. The calculated p-values for these two metrics, both 0.005, are smaller than 0.05 and reject the null hypothesis, which means significant differences exist between the performance of AutoThorax-Net with k = 1001 and CheXNet with respect to these two metrics.
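For illustration, this significance test can be run with SciPy as sketched below; the per-fold AUC values shown are hypothetical placeholders, not the actual results reported in Table 3.

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold AUC values (one per cross-validation fold), for illustration only
auc_chexnet    = [87.9, 88.1, 88.0, 87.8, 88.2, 88.1, 87.9, 88.0, 88.3, 87.7]
auc_autothorax = [92.1, 91.9, 92.0, 92.2, 91.8, 92.0, 92.1, 91.9, 92.3, 92.0]

# Paired Wilcoxon signed-rank test across the ten folds (two-sided by default)
stat, p_value = wilcoxon(auc_chexnet, auc_autothorax)
print(f"statistic = {stat}, p = {p_value:.3f}")  # p < 0.05 rejects the null hypothesis
```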

Second experiment series: automated solution

In these experiments, we investigated the possibility of constructing a “fully automated” solution by searching the entire archive, i.e., Dataset 2. We summarize the experimental workflow, and report the results with some analysis.

Experimental workflow

We constructed the receiver operating characteristic (ROC) curve for Dataset 2 to find the trade-off between sensitivity and specificity (Fig. 9). We used Youden’s index to find the trade-off position on the ROC curve, providing the threshold for match selection. A standard tenfold cross-validation was adopted for testing and showed a very low standard deviation (< 1%) for all experiments, apparently due to the large size of the datasets. All chest X-ray images were divided into 10 groups. In each fold, one group of chest X-ray images was used as the validation set, while the remaining chest X-ray images were used as the archived set. The above process was repeated 10 times, such that in each fold a different group of chest X-ray images was used as the validation set. In each fold, an encoder was trained using the archived set of that fold. The encoder was then used for compressing the deep features of each chest X-ray image in the validation set.

Figure 9

Analysis for Dataset 2. Left: Sample ROC curves for the proposed AutoThorax-Net for different k values in one fold and their area under the curve (AUC), Right: corresponding ROC thresholds for k = 1001 to select the sensitivity–specificity trade-off using Youden’s index.

The parameters of the encoder construction process were set as described above for Dataset 1.

For image search, given a chest X-ray image from the validation set, the compressed deep feature was used for searching in the archived set. The consensus vote among the top k retrieved chest X-ray images was used to classify the query image from the validation set. Experiments were conducted with k = 11, 51, 101, 251, 501 and 1001 to observe the effect of more retrievals on consensus voting. For comparison, CheXNet14 was adopted as a baseline applied to the validation set in each fold.

Results

Experimental results on Dataset 2 are summarized in Table 4. Figure 10 shows the confusion matrices for AutoThorax-Net and CheXNet.

Table 4 A summary of classification performance using image search as a classifier on Dataset 2. The numbers (in percentage) are the result of averaging 10 folds with very low standard deviation (< 1%).
Figure 10

Dataset 2: Confusion matrices for the best results of AutoThorax-Net (left) and CheXNet (right).

The highest AUC of 82% is achieved by AutoThorax-Net for k = 251, 501 and 1001. The highest sensitivity of 74% is achieved by Configuration 2 (for k = 251) and Configuration 3 (for k = 101). However, they both deliver low specificity values of 61% and 65%, respectively. The second highest sensitivity of 73% is achieved by Configuration 1, Configuration 2 and AutoThorax-Net, with specificities of 61%, 63% and 75%, respectively. AutoThorax-Net clearly provides a higher and more reliable trade-off between sensitivity and specificity in a fully automated setting when applied to a large archive of X-ray images.

To verify that the improvements of the proposed methods are significant, we performed the two-sided Wilcoxon signed-rank test48 between the performance of CheXNet and our best performing configuration, which is AutoThorax-Net with k = 1001. Our results, shown in Table 4, indicate that AUC and specificity improved from 77 to 82% and from 67 to 75%, respectively. The calculated p-values for these two metrics, both 0.005, are smaller than 0.05 and reject the null hypothesis, which means significant differences exist between the performance of AutoThorax-Net with k = 1001 and CheXNet with respect to these two metrics.

Comparing Autoencoder against PCA

As one of the main contributions of AutoThorax-Net is encoding the concatenated feature vector (i.e., reducing its dimensionality), the question arises whether the same level of performance can be achieved with traditional algorithms such as principal component analysis (PCA). We ran the tenfold cross-validation on both dataset configurations for k = 11 and k = 51. In all settings, the performance of the autoencoder was better than that of PCA. For instance, in the second experiment series, PCA achieved 72% and 76% AUC for k = 11 and k = 51, respectively, while the autoencoder achieved 74% and 80% AUC. As the performance of the dimensionality reduction is independent of k, one expects a more capable compression to manifest itself for any k. However, as we are using the compressed/encoded features for image search, good performance is expected to be particularly visible for a small number of matched cases.
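A minimal sketch of the PCA baseline is given below, assuming scikit-learn and the same per-fold protocol; the array names are placeholders for the per-fold feature matrices.

```python
from sklearn.decomposition import PCA

def pca_compress(archive_fvs, validation_fvs, n_components=256):
    """Compress 3,072-dim concatenated features to n_components dimensions with PCA.

    PCA is fitted on the archived (training) fold only and then applied to the
    validation fold, mirroring how the encoder is trained per fold.
    """
    pca = PCA(n_components=n_components)
    archive_compressed = pca.fit_transform(archive_fvs)
    validation_compressed = pca.transform(validation_fvs)
    return archive_compressed, validation_compressed

# The compressed features are then used exactly like the autoencoder codes:
# Euclidean image search followed by a top-k majority vote (see the earlier sketches).
```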

Discussions

In our investigations, we experimented with image search as a classifier to detect pneumothorax based on autoencoded concatenated features applied to more than half a million chest X-ray images obtained by merging three large public datasets.

In our experiments, we verified that the use of image search as a classifier with AutoThorax-Net as a feature extractor can improve classification performance. This was demonstrated by analysing the ROC curves to find the trade-off for each individual approach. We further confirmed that compressing the concatenated deep features via autoencoders further improves the results of image search. This indicates that image search as a classifier is a viable and more conveniently explainable solution for the practice of diagnostic radiology when the reports and histories of similar, already diagnosed cases are readily available.

Please note that some of the folds we used may contain images that CheXNet has already seen during its training. This may slightly inflate the performance numbers for CheXNet. We accepted this unfair advantage for CheXNet over our AutoThorax-Net since we had to exploit the mixture of three public datasets and apply k-fold cross-validation for maximum data usage and decreased data bias.