Introduction

A pneumothorax is an abnormal collection of air in the pleural space between the lung and the chest wall. The annual incidence of pneumothorax is approximately 7.3 cases per 100,000 individuals [1]; among hospitalized patients, the incidence has been estimated at 22.7 cases per 100,000 admissions per year [2].

Without prompt recognition and management, pneumothorax may evolve into life-threatening tension pneumothorax. Rapid and correct identification of pneumothorax can minimize the risk associated with tension pneumothorax [3] and thus improve patient outcomes.

Because of its advantage in mobility, the portable supine chest radiograph (SCXR) is one of the most common imaging studies performed in the emergency department (ED) and intensive care unit (ICU) [4, 5]. However, the reported sensitivity of SCXR for detecting pneumothorax varies widely, ranging from 9 to 75% [6], indicating a high rate of missed diagnoses at initial encounters.

Several factors may explain the heterogeneous sensitivity of SCXR in detecting pneumothorax [7,8,9]. First, imaging quality may be reduced because of limitations in the patient’s positioning or body habitus. Second, the distribution of free air in the pleural space is variable and highly dependent on intrathoracic anatomy and relevant pathology in the lung parenchyma and pleural space [10]. Third, the subtle imaging findings of pneumothorax on SCXRs require expertise and careful inspection to detect.

In the current study, we hypothesized that artificial intelligence-based approaches for interpreting portable SCXRs may facilitate physicians in detecting pneumothorax with greater efficiency and accuracy. We aimed to develop and validate deep learning (DL)-based computer-aided diagnosis (CAD) systems that enable more efficient and accurate pneumothorax detection and localization by portable SCXR.

Materials and Methods

Study Design and Setting

We conducted a retrospective study to develop and test our CAD systems in chronologically differing image datasets. Local portable SCXRs were retrieved from the Picture Archiving and Communication System (PACS) database of the National Taiwan University Hospital (NTUH). This study was approved by the Research Ethics Committee of NTUH (reference number: 202003106RINC) and granted a consent waiver. Our results are reported according to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [11].

Image Acquisition and Dataset Designation

As shown in Fig. 1, the Radiology Information System was used to identify candidate images for building the training (NTUH-1519) and testing (NTUH-20) datasets.

Fig. 1

Flow chart of image inclusion process and dataset designation. SCXR, supine chest X-ray; ED, emergency department; ICU, intensive care unit; NTUH, National Taiwan University Hospital; PACS, Picture Archiving and Communication System

Inclusion criteria for the candidate positive group of NTUH-1519 were as follows: (1) text report with a clinical finding of pneumothorax; (2) patient age ≥ 20 years; (3) portable SCXR; (4) imaging obtained in the ED or ICU; and (5) exam performed between January 1, 2015 and December 31, 2019. We selected only the first study as the representative image for each patient in the analysis. Inclusion criteria for the candidate negative group were the same as those above, except that text reports were devoid of pneumothorax. We randomly selected qualifying images from negative candidates, populating the positive and negative groups at an approximate ratio of 1:2. The tentative candidate list was then further scrutinized to avoid patient overlap.

Image acquisition for NTUH-20 differed in the time frame (January 1, 2020 to December 31, 2020) but was otherwise the same. In addition, the candidate negative group included only images obtained in EDs, and the image ratio for positive and negative groups was approximately 1:10.
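
To make the sampling scheme concrete, the negative-group downsampling could be sketched as follows; the function name, study identifiers, and seed are hypothetical and not part of the original pipeline.

```python
import random

def sample_case_control(positive_ids, negative_ids, ratio=2, seed=42):
    """Randomly downsample negatives to an approximate 1:ratio positive-to-negative balance."""
    rng = random.Random(seed)
    n_negatives = min(len(negative_ids), ratio * len(positive_ids))
    return list(positive_ids), rng.sample(list(negative_ids), n_negatives)

# NTUH-1519 used an approximate 1:2 ratio; NTUH-20 used approximately 1:10.
positives, negatives = sample_case_control(
    ["pt001", "pt002", "pt003"],                 # hypothetical positive study IDs
    [f"pt{i:03d}" for i in range(100, 160)],     # hypothetical negative study IDs
    ratio=2,
)
```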

We exported all eligible de-identified images in Digital Imaging and Communications in Medicine (DICOM) format, including corresponding text reports for analysis. The radiological reports were generated by various board-certified radiologists for clinical purposes.

Image Annotation, Ground Truth, and CXR Report Extraction

Each image was first divided into a 10 × 10 grid of equal-sized cells. Bounding boxes were then used to cover the pneumothorax visible in each cell, using the least possible area. Each image was randomly assigned to two emergency physicians, blinded to each other’s annotations. A total of six board-certified and four board-eligible emergency physicians were involved, each with at least 4 years of clinical experience. All images were ultimately reviewed by an experienced (10 years) board-certified pulmonologist and intensivist, who adjusted annotations as necessary. The reviewed annotations served as ground truth for model training and testing. Any images containing thoracic drainage tubes were identified and excluded from further analysis. CXR findings and diagnoses [12, 13] were extracted manually from the radiology reports by research assistants blinded to the annotations.
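
As an illustration of this grid-constrained annotation scheme (a minimal sketch; the box representation, grid helper, and example values are assumptions, not the annotation software actually used), a lesion box can be clipped to each of the 10 × 10 cells so that an elongated lesion is covered by several small boxes rather than one box spanning most of a hemithorax:

```python
import numpy as np

def grid_cells(height, width, n=10):
    """Yield (y0, y1, x0, x1) bounds for an n x n grid of roughly equal-sized cells."""
    ys = np.linspace(0, height, n + 1, dtype=int)
    xs = np.linspace(0, width, n + 1, dtype=int)
    for r in range(n):
        for c in range(n):
            yield ys[r], ys[r + 1], xs[c], xs[c + 1]

def clip_box_to_cell(box, cell):
    """Clip a lesion bounding box (y0, y1, x0, x1) to one grid cell; return None if no overlap."""
    by0, by1, bx0, bx1 = box
    cy0, cy1, cx0, cx1 = cell
    y0, y1 = max(by0, cy0), min(by1, cy1)
    x0, x1 = max(bx0, cx0), min(bx1, cx1)
    return (y0, y1, x0, x1) if (y0 < y1 and x0 < x1) else None

# Hypothetical example: one annotator-drawn lesion box on a 2000 x 2000 px radiograph
image_h, image_w = 2000, 2000
lesion_box = (100, 1400, 1500, 1900)
per_cell_boxes = []
for cell in grid_cells(image_h, image_w):
    clipped = clip_box_to_cell(lesion_box, cell)
    if clipped is not None:
        per_cell_boxes.append(clipped)
```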

Development of Algorithm

We designed two separate CAD systems (Fig. 2), each including a classification model and a localization model and jointly yielding the following variables: (1) diagnosis output, indicating the presence or absence of pneumothorax, and (2) localization output, indicating the pneumothorax lesion site. Our CAD systems were designated as detection- or segmentation-based systems according to the localization method applied (i.e., object detection or image segmentation).

Fig. 2

Structures of (A) detection-based and (B) segmentation-based CAD systems: Each system incorporates a classification model and a localization model, the major difference being the localization method (object detection vs. image segmentation). A CXR image (1) is first passed to the classification model (2) to derive the probability regarding presence of pneumothorax. If the output probability exceeds the classification threshold (3), the image is passed to the localization model (4) to assess pneumothorax position. Once the largest predicted confidence score of the bounding boxes (detection-based CAD system) or predicted areas (segmentation-based CAD system) of the pneumothorax is above the localization threshold (5), the CAD system yields diagnosis (6) and localization (7) outputs. Red rectangles and purple areas denote pneumothorax locations. CAD, computer-aided diagnosis; CXR, chest X-ray
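
A minimal sketch of this two-stage decision logic is given below; the model objects, threshold values, output format, and the handling of sub-threshold localizations are placeholders rather than the authors’ implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Region = Tuple[int, int, int, int]   # box corners or mask extents (y0, y1, x0, x1)

@dataclass
class CadOutput:
    has_pneumothorax: bool           # diagnosis output (6)
    regions: List[Region]            # localization output (7)

def run_cad(image,
            classifier: Callable,          # returns probability of pneumothorax
            localizer: Callable,           # returns [(confidence, region), ...]
            cls_threshold: float = 0.5,    # classification threshold (3), placeholder value
            loc_threshold: float = 0.5) -> CadOutput:  # localization threshold (5), placeholder
    """Classify first; run the localizer only when the probability clears the threshold."""
    prob = classifier(image)
    if prob < cls_threshold:
        return CadOutput(False, [])
    candidates = localizer(image)
    best = max((score for score, _ in candidates), default=0.0)
    if best < loc_threshold:
        # Assumed behaviour: treat as negative when no localization clears the threshold.
        return CadOutput(False, [])
    return CadOutput(True, [region for score, region in candidates if score >= loc_threshold])
```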

Supplemental Figs. 1 and 2 show the training pipelines in detail. In brief, NTUH-1519 was first split into different subsets to train the CAD systems. This partitioning ensured similar proportions of images (with vs. without pneumothoraces) and eliminated patient overlap across all subsets. All images underwent preprocessing before analysis. The annotated bounding boxes were transformed into segmentation masks for the segmentation-based system. The segmentation masks were, in turn, converted into one or more larger bounding boxes covering all adjoining masks with minimal area, which served as input for the detection-based system.
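
The conversion between the two annotation formats might look roughly like the following sketch, which rasterizes per-grid boxes into a binary mask and then merges adjoining mask regions back into minimal enclosing boxes via connected-component labelling (scipy.ndimage is an assumed tooling choice):

```python
import numpy as np
from scipy import ndimage

def boxes_to_mask(boxes, height, width):
    """Rasterize per-grid bounding boxes (y0, y1, x0, x1) into one binary segmentation mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for y0, y1, x0, x1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask

def mask_to_merged_boxes(mask):
    """Replace a mask with the minimal enclosing box of each connected (adjoining) region."""
    labelled, _ = ndimage.label(mask)
    merged = []
    for ys, xs in ndimage.find_objects(labelled):
        merged.append((ys.start, ys.stop, xs.start, xs.stop))
    return merged
```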

For the detection-based system, EfficientNet-B2 [14], DenseNet-121 [15], and Inception-v3 [16] were selected as the architectures for the classification model; Deformable DETR [17], TOOD [18], and VFNet [19] were selected as the architectures for the localization model. Both the classification and localization models of the segmentation-based system shared the UNet [20] architecture, using RegNetY [21] as the encoder.
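
As a rough illustration only, these backbones could be instantiated with common open-source libraries such as timm and segmentation_models_pytorch; the library choice, encoder variant, and arguments below are assumptions, not the authors’ exact training setup.

```python
import timm
import segmentation_models_pytorch as smp

# Classification backbone for the detection-based system (one of the three candidates);
# alternatives would be "densenet121" or "inception_v3".
classifier = timm.create_model("efficientnet_b2", pretrained=True, num_classes=1)

# Segmentation-based system: UNet with a RegNetY encoder. The same network yields a
# per-pixel mask, from which an image-level decision can also be derived.
unet = smp.Unet(
    encoder_name="timm-regnety_016",   # assumed RegNetY variant
    encoder_weights="imagenet",
    in_channels=1,                     # grayscale radiograph
    classes=1,                         # pneumothorax vs. background
)

# The object-detection localizers (Deformable DETR, TOOD, VFNet) are typically built
# from detection toolboxes such as MMDetection and are omitted from this sketch.
```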

Evaluation Metrics of Algorithm

Diagnosis output performance was determined by the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve, sensitivity, specificity, positive predictive value, and negative predictive value. Youden’s index [22], derived from the training (NTUH-1519) dataset, determined the optimal threshold for testing.
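
For reference, Youden’s index corresponds to the threshold maximizing sensitivity + specificity − 1 along the ROC curve; a minimal sketch with scikit-learn (an assumed tooling choice, with toy labels for illustration) follows.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, average_precision_score

def threshold_by_youden(y_true, y_prob):
    """Return the probability cut-off maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return thresholds[np.argmax(tpr - fpr)]

y_true = np.array([0, 0, 1, 1, 0, 1])               # toy ground-truth labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # toy model probabilities
auc = roc_auc_score(y_true, y_prob)                 # area under the ROC curve
auprc = average_precision_score(y_true, y_prob)     # area under the precision-recall curve
cutoff = threshold_by_youden(y_true, y_prob)        # threshold carried forward to testing
y_pred = (y_prob >= cutoff).astype(int)
```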

Localization output performance was measured by Dice coefficient, calculated as twice the area of overlap divided by total pixel count in predicted and ground-truth masks. Dice coefficients were only computed in images positive for both predicted and ground-truth pneumothoraces (i.e., images classified as true positives), referred to as prediction-ground truth TP-Dice. TP-Dice coefficients were also calculated to evaluate the consistency shown by two annotators (inter-annotator TP-Dice).
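
A minimal sketch of the Dice calculation on binary masks (the numpy mask representation is an assumption) is shown below; TP-Dice simply averages this value over images in which both prediction and ground truth contain a pneumothorax.

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask, eps=1e-7):
    """Dice = 2 * |overlap| / (|pred| + |gt|), computed on binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    overlap = np.logical_and(pred, gt).sum()
    return float(2.0 * overlap / (pred.sum() + gt.sum() + eps))

def tp_dice(mask_pairs):
    """Mean Dice over (pred, gt) pairs from true-positive images only."""
    scores = [dice_coefficient(p, g) for p, g in mask_pairs if np.any(p) and np.any(g)]
    return float(np.mean(scores)) if scores else float("nan")
```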

Statistical Analysis

Continuous variables were expressed as means with standard deviations, and categorical variables as counts and proportions. Continuous variables were compared with Student’s t-test, and categorical variables were compared with the Chi-squared test. All statistics were reported as point estimates with 95% confidence intervals (CIs), obtained through a bootstrap technique with 1,000 repetitions. Prediction-ground truth and inter-annotator TP-Dice coefficients were compared by paired t-test. Subgroup analysis was performed to explore the influence of pneumothorax size on model performance. Pneumothoraces were categorized as large, medium, or small based on the 33rd and 66th percentiles of segmentation mask areas. A two-tailed p-value < 0.05 was considered statistically significant. All computations were performed with open-source software (SciPy v1.8.1) [23].
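
A minimal sketch of the bootstrap confidence interval, the size categorization, and the paired comparison is shown below; the resampling details, function names, and toy numbers are assumptions rather than the exact analysis code.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_prob, metric=roc_auc_score, n_boot=1000, seed=0):
    """Point estimate with a percentile 95% CI via bootstrap resampling of cases."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    estimate = metric(y_true, y_prob)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if np.unique(y_true[idx]).size < 2:      # resample must contain both classes
            continue
        samples.append(metric(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(samples, [2.5, 97.5])
    return estimate, (lower, upper)

def size_category(mask_area, q33, q66):
    """Assign small/medium/large by the 33rd and 66th percentiles of mask areas."""
    return "small" if mask_area <= q33 else ("medium" if mask_area <= q66 else "large")

# Paired t-test comparing per-image prediction-ground truth vs. inter-annotator TP-Dice
t_stat, p_value = stats.ttest_rel([0.82, 0.75, 0.68], [0.80, 0.77, 0.70])  # toy values
```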

Results

As shown in Fig. 1, 2642 images were acquired from the PACS database (training, 1571; testing, 1071). Significant differences between the NTUH-1519 and NTUH-20 datasets are shown in Table 1, with 490 (31.2%) and 126 (11.8%) patients, respectively, annotated as having pneumothorax. Aside from pneumothorax, other patient characteristics and image findings were numerically similar between images annotated as positive or negative for pneumothorax in the NTUH-1519 and NTUH-20 datasets (Supplemental Tables 3 and 4).

Table 1 Comparison of training (NTUH-1519) and testing (NTUH-20) datasets

Table 2 indicates that pneumothorax was accurately diagnosed by detection-based (AUC: 0.940, 95% CI: 0.907–0.967) and segmentation-based (AUC: 0.979, 95% CI: 0.963–0.991) systems, both achieving levels similar to those of radiology reports. Figure 3 demonstrates four representative imaging sets. The overlain predicted bounding boxes or segmentation masks served to assist clinicians in verifying the diagnosis and position of pneumothorax. As shown in Table 3, prediction-ground truth TP-Dice coefficients for detection- and segmentation-based systems were 0.758 (95% CI: 0.707–0.806) and 0.681 (95% CI: 0.642–0.721), respectively, both values significantly surpassing inter-annotator TP-Dice values. Supplemental Table 5 lists the required computational resources for both systems.

Table 2 Diagnostic performances of computer-aided diagnosis (CAD) systems and radiology reports
Fig. 3

Sample images stratified by predicted results of diagnosis outputs, including (A) true-positive, (B) false-positive, (C) true-negative, and (D) false-negative results. The first column at left displays original images. In the second column from left, preprocessed bounding boxes (green rectangles) and segmentation masks (red areas) of detection- and segmentation-based CAD systems are shown. The third column from left demonstrates bounding boxes (red rectangles) predicted by detection-based CAD system, with segmentation masks (purple areas) predicted by segmentation-based CAD system appearing in the fourth column. CAD, computer-aided diagnosis

Table 3 Localization performances of computer-aided diagnosis (CAD) systems and annotators

Subgroup analysis showed that in diagnosing pneumothorax, performances of both systems declined according to the size (large to small) of pneumothorax, consistent with the trend for radiology reports (Table 2). In terms of pneumothorax localization, diminishing pneumothorax size (large to small) corresponded with similar declines in prediction-ground truth TP-Dice coefficients for both systems, again aligned with the observed trend for inter-annotator TP-Dice values (Table 3).

Discussion

Main Findings

Both the detection- and segmentation-based systems achieved excellent performance, comparable to that of radiology reports or human annotators. As with human readers, the diagnosis and localization performance of the CAD systems may be influenced by the size of the pneumothorax.

Annotation of Pneumothorax on SCXR

Most public datasets [24, 25] rely on chest X-rays with image-level labels of common thoracic diseases that are text-mined from radiology reports and are inherently inaccurate [26, 27]. For example, for ChestX-ray14, a study suggested the agreement regarding pneumothorax diagnosis between the image-level label and radiologist review was only about 60% [28], which may lead to poor model generalizability [29].

On the other hand, pixel-based annotation may effectively facilitate the development of pneumothorax-detecting algorithms [30]. For standing CXR, the pneumothorax lesion could usually be delineated [31, 32] by the visceral pleural line in the apicolateral space [33]. Nonetheless, when patients are in the supine position, the spaces where the air is trapped differ from those in the standing position [34]. Adopting segmentation masks to delineate the pneumothorax lesion on SCXR might raise a concern that only those images with clear pleural lines were annotated, leading to selection bias.

Consequently, we used bounding boxes for annotation, allowing for localization of pneumothoraces without distinct pleural lines. Nevertheless, in some lesions, such as those spanning lung apices and basal aspects, the use of bounding boxes might encompass nearly an entire unilateral lung region. This problem was overcome by dividing images into 10 × 10 grids, permitting bounding boxes to accommodate lesions of varying shapes.

Dataset Selection for Training and Testing Models

Considering the low incidence (0.5–3%) of pneumothorax cited in epidemiologic data [35, 36], consecutive random SCXR sampling for model development may result in class imbalance. Such imbalance may bias CAD systems towards learning features of the more common class (i.e., pneumothorax-negative images) and distort various evaluation metrics [37]. Thus, we employed a case-controlled design [38, 39] to achieve greater balance in the training and testing datasets. As shown in Table 1, the higher proportion (31.2%) of images annotated as pneumothorax in the NTUH-1519 dataset may have enabled our CAD systems to better learn pneumothorax-related features, whereas the lower proportion (11.8%) in NTUH-20 allowed performance testing at a prevalence approaching real-world levels [35, 36].

In a previous study [29], the accuracy of DL-based pneumothorax detection was shown to decline significantly when the algorithm was tested on an external dataset. Concerns over accuracy overestimation and limited generalizability of such algorithms may be mitigated by model evaluation in an independent dataset. However, no datasets dedicated to portable SCXRs were available for our purposes. According to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [40], external validation may use data collected by the same researchers, using the same predictors and outcome definitions and assessments, but typically sampled from a later period (temporal or narrow validation). In our study, the NTUH-20 dataset consisted of SCXRs taken during 2020 at NTUH. Compared with NTUH-1519, NTUH-20 was a chronologically distinct dataset (2015–2019 vs. 2020) with significant differences (Table 1). Under this definition, the chronologically distinct testing dataset can be used to verify the external generalizability of the CAD systems.

Diagnosis Output Performance

Niehues et al. [41] used portable SCXR to develop a CAD algorithm with excellent performance in identifying pneumothorax (AUC: 0.92, 95% CI: 0.89–0.95). Nonetheless, thoracic drains were concomitantly present in approximately half of the images with pneumothorax [41]. It is thus conceivable that these drains were misconstrued by the algorithm as a feature of pneumothorax [42]. Rueckel et al. [30] also collected 3062 SCXRs, including 760 images with pixel-level annotations of pneumothorax and thoracic drain. Their model also performed well overall (AUC: 0.877) for unilateral pneumothorax detection.

For the present study, however, we excluded images with thoracic drains and used bounding boxes for pixel-level annotation. Both the detection- and segmentation-based systems delivered excellent performance (AUC > 0.94) in pneumothorax detection. In our study, the classification model architectures differed between the two CAD systems because the UNet-based model [20] could itself output both classification results and localization information.

Routine portable SCXR exams are common practice in critical care [43, 44]. Such regular use of portable SCXR exams may partly account for the prolonged turnaround time from image acquisition to interpretation by a radiologist [45]. Our systems may help prioritize portable SCXRs within queues, flagging those that should be checked first by a radiologist or notifying the treating clinicians. As shown in Table 1, a high percentage of patients were receiving tracheal intubation. Early detection of pneumothorax may facilitate prompt life-saving procedures for these patients to prevent serious complications, such as tension pneumothorax.

Localization Output Performance

Using standing chest X-rays, a model devised by Lee et al. [46] achieved a Dice coefficient of 0.798 in pneumothorax localization. Feng et al. [47] also derived a model able to localize pneumothorax lesions (Dice coefficient: 0.69). Nevertheless, even though Feng et al. [47] included portable SCXRs in the analysis, they excluded films with only supine signs of pneumothorax, e.g., the deep sulcus sign. Another model by Zhou et al. [48], based on frontal chest X-rays alone (no portable SCXRs), could detect pneumothorax with a Dice coefficient of 0.827.

Portable SCXR images are generally deemed suboptimal for diagnosis. Patients are often unable to cooperate during image acquisition, leading to poor positioning or inadequate inspiratory effort. Compared with standing chest X-rays, portable SCXRs are also inferior in image quality, hindering the diagnosis of pneumothorax owing to lower resolution and luminance [49]. Furthermore, classic findings of pneumothorax on standing chest X-rays are often lacking on portable SCXRs. Given the more challenging interpretation of SCXRs, past models [46–48, 50] may not be suitable for pneumothorax localization on these images.

Both CAD systems we developed (based on object detection or image segmentation) performed excellently in pneumothorax localization, comparable to the level of the annotators (Table 3). To the best of our knowledge, our CAD systems may be the first capable of localizing pneumothoraces on portable SCXRs. Although the detection- and segmentation-based systems performed similarly in testing, their required computational resources differed substantially (Supplemental Table 5). The detection-based system outputs only approximate positional information as bounding-box coordinates. Logically, its computational demands should be lower than those of the segmentation-based system, which provides accurate pixel-wise lesion information. However, the detection-based system must integrate several models into an ensemble and is thus more resource-demanding by comparison. Users must take the specific computational requirements into account when choosing between the two systems.

Influence of Pneumothorax Size

Previous studies [42, 51, 52] have demonstrated that model performance (as with human readings) may be influenced by the extent of pneumothorax. A model devised by Taylor et al. [52] correctly identified 100% of large pneumothoraces but only 39% of small ones. Similarly, the performance of our CAD systems declined as pneumothorax size diminished. This is not surprising, because inter-annotator TP-Dice coefficients also fell as pneumothorax size decreased, underscoring the difficulty of learning small-volume lesions. This phenomenon was more obvious for the detection-based CAD system: its prediction-ground truth TP-Dice was lower than the inter-annotator TP-Dice (Table 3), which may explain its lower diagnostic performance for small pneumothoraces relative to the radiology reports (Table 2).

Unlike large pneumothoraces, small pneumothoraces are apt to be overlooked by clinicians, especially on portable SCXRs, making assistance from CAD systems valuable. Because most patients undergoing portable SCXR are susceptible to complications of pneumothorax, especially those receiving mechanical ventilation, timely detection is critical to prevent a small pneumothorax from progressing into tension pneumothorax [53].

Future Applications

The CAD systems can serve two primary functions: (1) prioritizing SCXRs and selecting questionable studies to be checked first by the radiologist, or (2) issuing notifications to attending clinicians. When clinicians examine the diagnosis results, the localization outputs of pneumothorax can be displayed to facilitate verification. We present the computational resource requirements for both CAD systems (Supplemental Table 5), which can assist healthcare institutions in selecting the most suitable model for deployment. Moreover, future studies should examine the feasibility of adapting these CAD systems for edge computing and integrating them into portable chest X-ray machines, which holds the potential to broaden their applicability.

Study Limitations

First, because only de-identified images were available for analysis, we could not determine whether patients’ clinical comorbidities influenced the performance of the CAD systems. Nonetheless, Table 1 shows diverse findings and diagnoses on the SCXRs, which may somewhat mitigate this concern. Second, given the low prevalence of pneumothorax [35, 36], we used a case-controlled study design for image collection to ensure sufficient numbers of pneumothorax-positive patients. This design may result in an artificially elevated pneumothorax prevalence in our datasets compared with real-world settings. We therefore relied on radiology reports and annotators as reader reference points by which to judge CAD system performance. Further prospective studies that enrol consecutive patients from EDs or ICUs on a manageable scale [54] are warranted to better test performance at real-life pneumothorax prevalence.

Conclusions

We developed two DL-based CAD systems to diagnose and localize pneumothoraces on portable SCXRs, using detection and segmentation methods, respectively. Both systems performed excellently, comparably to radiologists or human annotators, when tested in a dataset from a different time frame with differing patient and image characteristics. Hence, the potential for external generalizability seems favourable. Although the two systems performed similarly in testing, the detection-based system may demand more computational resources.