Background

Chest x-ray is the most common medical imaging examination for screening patients suspected of having pulmonary abnormalities and diseases. Owing to this utility, many deep learning-based artificial intelligence (AI) methods have been proposed for tasks such as detecting pneumonia, tuberculosis, and COVID-19 [1,2,3,4,5]. In addition, several studies have explored the potential applications of these methods in clinical environments, including shortening turnaround time, increasing reading efficiency, and reducing misinterpretation [6,7,8,9,10].

Chest x-ray images have unique characteristics that depend on the specifications of the x-ray machine, which broadly comprises a detector (e.g., computed radiography (CR) or digital radiography (DR)) and a generator (e.g., mobile or stationary). For example, chest x-ray images from DR detectors typically show better image quality than those from CR detectors at the same dose level [11]. In addition, chest x-ray images from mobile x-ray machines usually contain more noise than those from stationary machines because of the limited maximum power of their generators [12, 13].

However, many AI methods for chest x-ray abnormality or disease classification have not been rigorously evaluated for stability across chest x-rays from different x-ray machine specifications, even though supervised AI algorithms presumably carry a strong bias toward their training dataset [14]. This bias degrades the diagnostic performance of such methods when they are applied to chest x-rays whose characteristics differ from the training data (i.e., x-rays from machines unseen during training), limiting their clinical utility in the real world [15,16,17].

In this study, we propose an x-ray manipulation pipeline (XM-pipeline) that combines a set of image pre-processing and data augmentation techniques to overcome an AI model's bias toward a training dataset from a single x-ray machine. We carefully designed the XM-pipeline to incorporate hardware-related changes in x-ray images during AI training. To validate its effectiveness, we trained AI models using the XM-pipeline and conventional pipelines, and compared their diagnostic performance on multiple test datasets of different machine specifications.

Methods

Chest x-ray image collection and annotation

In our retrospective study, we collected chest x-ray images (Digital Imaging and Communications in Medicine [DICOM] format) from Vietnamese and Indonesian hospitals (Fig. 1). A total of 11,652 chest x-ray images of symptomatic patients who visited the National Lung Hospital in Vietnam for tertiary care were acquired between May 2020 and February 2021. In addition, 3,358 chest x-ray images of asymptomatic individuals who underwent medical checkups were collected between September 2020 and October 2020 from Awal Bros hospitals located in four different areas of Indonesia. Each hospital used a different x-ray machine specification (i.e., CR or DR detectors from different vendors with stationary or mobile generators; see Table 1 and Supplementary Table S1 for details). The institutional review board of each participating institution approved this study.

Fig. 1 Data flow diagram of our retrospective study

Table 1 Summary of the test datasets used in our retrospective study

A radiologist with 30 years of experience (MD1) reviewed all the chest x-rays from the Vietnamese and Indonesian hospitals. The presence of common pulmonary abnormalities (i.e., target abnormalities: atelectasis, consolidation/ground glass opacity, fibrotic sequelae, nodule/mass, and pneumothorax) in each chest x-ray image was confirmed by the radiologist using a web-based annotation tool (Label Studio version 1.6) [18]. Based on the radiologist's annotations, we excluded 3,834 chest x-ray images with non-target abnormalities (Fig. 1; see Supplementary Table S2 for the distribution of target abnormalities). The chest x-ray images from the Vietnamese hospital were then randomly split into training (5,763 chest x-rays), validation (1,439 chest x-rays), and internal testing datasets (1,278 chest x-rays; VHDR1 in Table 1). The chest x-ray images from the Indonesian hospitals were separated into four external testing datasets according to x-ray machine and hospital (IHDR2, IHCR1, IHCR2, and IHCR3,Mobile in Table 1). To check the reproducibility of the annotations, we also invited three more radiologists (12 years of experience on average; MD2, MD3, and MD4) to annotate a subset of VHDR1 (100 normal and 100 abnormal x-rays) and calculated Cohen's kappa scores between MD1 and each of the others. The resulting scores were high (0.84 for MD2, 0.87 for MD3, and 0.81 for MD4), indicating that MD1's annotations were highly consistent and reproducible by the other radiologists.
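For illustration, inter-reader agreement of this kind can be computed with scikit-learn (the package used for our statistical analyses); a minimal sketch follows, with placeholder labels rather than the actual annotations:

```python
# Minimal sketch of the inter-reader agreement check; the label arrays are
# illustrative placeholders, not the actual annotations of the 200 x-rays.
from sklearn.metrics import cohen_kappa_score

# 0 = normal, 1 = abnormal, one entry per chest x-ray image
labels_md1 = [1, 0, 1, 1, 0, 1, 0, 0]  # MD1 (reference reader)
labels_md2 = [1, 0, 1, 0, 0, 1, 0, 1]  # MD2 (comparison reader)

kappa = cohen_kappa_score(labels_md1, labels_md2)
print(f"Cohen's kappa (MD1 vs. MD2): {kappa:.2f}")
```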

In addition to the collected datasets, we utilized five publicly available chest x-ray datasets for AI evaluation: two tuberculosis-detection datasets, the Shenzhen and Montgomery datasets [19] (SZDR3 and MGCR4 in Table 1; Portable Network Graphics format), and three large public datasets, CheXpert [20], ChestX-Det10 [21], and RSNA-Pneumonia [22] (Portable Network Graphics or Joint Photographic Experts Group format; see Supplementary Note 1).

Figure 2 shows some example chest x-ray images from the test datasets without adjusting the window level and width (i.e., raw DICOM images), highlighting diverse image characteristics.

Fig. 2 Example chest x-ray images of different x-ray machine specifications: (a) VHDR1, (b) IHDR2, (c) IHCR1, (d) IHCR2, (e) IHCR3,Mobile, (f) SZDR3, and (g) MGCR4. Each image reveals unique characteristics (e.g., contrast and noise levels). The images were plotted without adjusting the window level and width (i.e., raw image data extracted from DICOM files), except for SZDR3 and MGCR4

Training AI models

We utilized EfficientNet-B6 [23] as the neural network architecture and trained five AI models by applying the conventional image-processing pipelines, the XM-pipeline, or no pipeline (i.e., baseline) (Fig. 3) to classify chest x-ray images as normal (i.e., no target abnormalities) or abnormal (i.e., at least one target abnormality). Each pipeline is composed of two sub-functions: pre-processing and data augmentation. Before applying the pipelines, we cropped the lung regions of all chest x-ray images using an additional network developed with in-house data.

Fig. 3 Overview of the training and testing phases of the AI models used in this study. a Training phase: after cropping the lung regions of chest x-ray images, those images were processed by each pipeline for AI training. The conventional pipelines utilized HE, CLAHE, and UM as pre-processing methods, and random rotation and horizontal flipping as data augmentations. The XM-pipeline includes the histogram modification and three more data augmentation techniques (i.e., contrast, sharpness, and noise augmentations). All networks were trained to classify chest x-rays as normal or abnormal. b Testing phase: external test datasets of different x-ray machine specifications were used for AI evaluation. All input x-rays were pre-processed before being fed to the networks. CLAHE Contrast-limited adaptive histogram equalization, HE Histogram equalization, UM Unsharp masking

We used PyTorch (version 1.12.1) and an NVIDIA GeForce RTX 3090 for AI training, with the Adam optimizer [24] (learning rate 0.003; batch size 4). In the training phase, we applied a resampling method that under-sampled the majority class to mitigate the data imbalance problem [25]. All chest x-ray images were resized to 512 × 512 after pre-processing.
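A minimal sketch of this training setup is shown below, assuming a torchvision EfficientNet-B6 backbone with a single-logit head for the normal/abnormal decision; the head design and loss function are our assumptions for illustration, not details taken from the original implementation:

```python
# Hedged sketch of the training setup (PyTorch 1.12, torchvision >= 0.13);
# the single-logit head and BCE loss are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import models

model = models.efficientnet_b6(weights=None)
# Replace the 1,000-class ImageNet head with a single abnormality logit.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
criterion = nn.BCEWithLogitsLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    # images: (4, 3, 512, 512) batch, already pre-processed and resized
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```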

Conventional image-processing pipelines

For conventional image pre-processing, we adopted histogram equalization (HE) [26], contrast-limited adaptive histogram equalization (CLAHE) [27], and unsharp masking (UM) [28], which are widely used in chest x-ray AI development [29,30,31,32]. As conventional data augmentation techniques, we applied random rotation (degrees within [-15, 15]) and horizontal flipping (probability 0.5), which are commonly used in chest x-ray AI studies [33,34,35,36,37].
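The following sketch illustrates these conventional methods with OpenCV and torchvision; the specific parameter values (CLAHE clip limit, UM amount and sigma) are illustrative assumptions, not the settings used in our experiments:

```python
# Conventional pre-processing (HE, CLAHE, UM) and augmentations; parameter
# values here are illustrative assumptions.
import cv2
from torchvision import transforms

def he(img_u8):
    # HE: global histogram equalization of an 8-bit grayscale image
    return cv2.equalizeHist(img_u8)

def clahe(img_u8, clip_limit=2.0, tile_grid=(8, 8)):
    # CLAHE: contrast-limited adaptive histogram equalization
    return cv2.createCLAHE(clipLimit=clip_limit,
                           tileGridSize=tile_grid).apply(img_u8)

def um(img, amount=1.0, sigma=3.0):
    # UM: add back a scaled high-frequency residual (original minus blur)
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)
    return cv2.addWeighted(img, 1.0 + amount, blurred, -amount, 0)

# Conventional augmentations: random rotation within [-15, 15] degrees and
# horizontal flipping with probability 0.5
conventional_aug = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomHorizontalFlip(p=0.5),
])
```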

X-ray manipulation pipeline (XM-pipeline)

In the proposed XM-pipeline, as a pre-processing step, we modified the histogram of each chest x-ray image to normalize its brightness and maximize the information inside the lung region (see Supplementary Note 2 for details). First, we stretched the histogram through an iterative optimization process [38]; then, to improve the contrast inside the lung region, we set the minimum intensity of each x-ray image to the minimum intensity inside the lung region [32]. Example chest x-ray images after pre-processing are shown in Fig. 4 (more examples in Supplementary Fig. S1).
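A simplified sketch of this pre-processing step is given below; the iterative stretching of [38] is replaced here by a plain min-max stretch, so this is an approximation of the actual histogram modification rather than a faithful re-implementation (see Supplementary Note 2 for the exact procedure):

```python
# Approximate sketch of the histogram modification; the iterative
# optimization of [38] is simplified to a min-max stretch here.
import numpy as np

def histogram_modification(img: np.ndarray, lung_mask: np.ndarray) -> np.ndarray:
    img = img.astype(np.float32)
    # Stretch the full dynamic range to [0, 1] to normalize brightness.
    stretched = (img - img.min()) / (img.max() - img.min() + 1e-8)
    # Re-anchor the minimum to the darkest pixel inside the lung region so
    # that the available contrast is spent inside the lungs.
    lung_min = stretched[lung_mask > 0].min()
    return np.clip((stretched - lung_min) / (1.0 - lung_min + 1e-8), 0.0, 1.0)
```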

Fig. 4 Example chest x-ray images from VHDR1, IHCR2, and IHCR3,Mobile after applying the conventional pre-processing methods (HE, CLAHE, UM) and the histogram modification in the XM-pipeline. The right upper zone (yellow-dotted box) of each image is magnified for inspection. CLAHE Contrast-limited adaptive histogram equalization, HE Histogram equalization, UM Unsharp masking

In the training phase, the histogram modification was followed by contrast, sharpness, and noise augmentation techniques (see Supplementary Note 3 for details) that mimic hardware-related changes in chest x-rays. We simulated the contrast change of chest x-ray images with the voltage level of the x-ray generator using gamma correction [39]. We mimicked changes in the sharpness of x-rays, possibly due to scattering, by applying a Gaussian filter [40]. Finally, to account for thermal and electronic noise, we added synthetic noise to each chest x-ray image [41].
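The three augmentations can be sketched as follows, with intensities assumed to lie in [0, 1]; the sampling ranges are borrowed from the stability analysis below (γ, s, σ), and the exact formulations, given in Supplementary Note 3, may differ:

```python
# Hedged sketch of the XM-pipeline augmentations; parameter ranges follow
# the stability analysis (gamma, s, sigma) and details may differ from
# Supplementary Note 3.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng()

def contrast_aug(img, gamma_range=(0.2, 5.0)):
    # Gamma correction mimics contrast changes with generator voltage.
    return np.clip(img, 0.0, 1.0) ** rng.uniform(*gamma_range)

def sharpness_aug(img, s_range=(-12.0, 12.0), sigma=1.0):
    # Blend with a Gaussian-blurred copy: s < 0 blurs (scattering-like),
    # s > 0 sharpens.
    s = rng.uniform(*s_range)
    blurred = ndimage.gaussian_filter(img, sigma)
    return np.clip(img + s * (img - blurred), 0.0, 1.0)

def noise_aug(img, sigma_range=(0.0, 0.1)):
    # Additive Gaussian noise approximates thermal and electronic noise.
    sigma = rng.uniform(*sigma_range)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
```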

Evaluation of AI models

For all test datasets, the diagnostic performance of each AI model was evaluated by calculating the average area under the receiver operating characteristic curve (AUC) with 95% confidence intervals (CIs).
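As an illustration, the sketch below computes the mean AUC with scikit-learn and a normal-approximation 95% CI across the ten training repetitions described next; treating the repetitions as the source of the CI is our assumption about how such intervals can be derived:

```python
# Hedged sketch: mean AUC and a normal-approximation 95% CI across repeated
# training runs; whether this matches the exact CI procedure used for the
# reported values is an assumption.
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc_with_ci(labels, per_run_scores):
    # per_run_scores: one array of abnormality scores per training run
    aucs = np.array([roc_auc_score(labels, s) for s in per_run_scores])
    mean = aucs.mean()
    half = 1.96 * aucs.std(ddof=1) / np.sqrt(len(aucs))
    return mean, (mean - half, mean + half)
```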

The DeLong test [42] was used to check the statistical significance of the difference in diagnostic performance between the AI model with the XM-pipeline and each other model. We repeated AI training ten times by iterating the random division of training and validation data (5,763 chest x-rays (80%) for training; 1,439 chest x-rays (20%) for validation; the 1,278 chest x-rays for internal testing were kept entirely separate; Fig. 1). Each random split was consistent across the different model settings in Fig. 3. We used Fisher's method to combine the p values from the ten iterations and assessed statistical significance (p < 0.05) for each pair (i.e., XM-pipeline versus another), rather than performing comparisons between all combinations (e.g., HE versus CLAHE).
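Combining per-iteration p values with Fisher's method is directly supported by SciPy; a minimal sketch with illustrative values follows:

```python
# Fisher's method over the ten per-iteration DeLong p values (illustrative
# numbers, not actual results).
from scipy import stats

p_values = [0.030, 0.010, 0.200, 0.040, 0.002,
            0.070, 0.010, 0.050, 0.030, 0.008]
statistic, combined_p = stats.combine_pvalues(p_values, method="fisher")
significant = combined_p < 0.05  # XM-pipeline versus one comparator
```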

We also checked the stability of each AI model against changes in the characteristics of input x-ray images (e.g., noise injection). We used the internal testing dataset (VHDR1 in Table 1) to generate x-ray images with different contrast (γ = 0.2 to 5.0), sharpness (s = -12 to 12), and noise (σ = 0 to 0.1) levels (details in Supplementary Note 3). We then fed those images to each AI model and calculated the average AUC values as a function of the contrast, sharpness, and noise levels. For this AUC calculation, we again repeated AI training ten times.
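A compact sketch of such a sweep for the contrast dimension, reusing the augmentation functions above with fixed rather than randomly sampled levels, is shown below; `model_predict` is a hypothetical placeholder for a trained model's scoring function:

```python
# Sketch of the contrast stability sweep; `model_predict` is a hypothetical
# callable returning abnormality scores for a list of images.
import numpy as np
from sklearn.metrics import roc_auc_score

def sweep_contrast(images, labels, model_predict,
                   gammas=np.linspace(0.2, 5.0, 10)):
    aucs = []
    for gamma in gammas:
        perturbed = [np.clip(img, 0.0, 1.0) ** gamma for img in images]
        aucs.append(roc_auc_score(labels, model_predict(perturbed)))
    return aucs  # in the actual analysis, averaged over ten trained models
```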

All statistical analyses were performed using Python packages (scikit-learn 1.2.0 and SciPy 1.7.3).

Results

Diagnostic performance of AI models

The diagnostic performance of the AI models (AUCs and p values) is summarized in Table 2. Other evaluation metrics, such as sensitivity and specificity for each AI model, are summarized in Supplementary Table S3. For VHDR1, the internal test dataset from the same source as the training data, the AI model with the XM-pipeline showed marginal but statistically significant differences from the others, except for CLAHE: AUC 0.970 (95% CI 0.967–0.972) for the XM-pipeline versus 0.966 (95% CI 0.961–0.971) for baseline (p < 0.001), 0.962 (95% CI 0.958–0.965) for HE (p < 0.001), 0.965 (95% CI 0.963–0.968) for CLAHE (p = 0.097), and 0.965 (95% CI 0.963–0.966) for UM (p = 0.002). For the external test datasets, the AI model that utilized the XM-pipeline consistently outperformed the others.

Table 2 Diagnostic performance of the AI models with different pipelines

In more detail, the AI model with the XM-pipeline achieved better performance on all datasets acquired from CR systems (i.e., IHCR1, IHCR2, IHCR3,Mobile, and MGCR4) than the other methods, even though the model was trained using data from a single DR system (i.e., the same source as VHDR1): e.g., AUC in IHCR2 0.944 (95% CI 0.939–0.948) for the XM-pipeline versus 0.658 (95% CI 0.622–0.692) for baseline (p < 0.001), 0.917 (95% CI 0.908–0.926) for HE (p = 0.001), 0.705 (95% CI 0.662–0.749) for CLAHE (p < 0.001), and 0.544 (95% CI 0.520–0.567) for UM (p < 0.001). In particular, on the IHCR3,Mobile dataset acquired from a mobile x-ray machine, the AI model with the XM-pipeline outperformed the other models with statistically significant differences: AUC 0.949 (95% CI 0.940–0.957) for the XM-pipeline versus 0.937 (95% CI 0.927–0.947) for baseline (p = 0.043), 0.933 (95% CI 0.922–0.944) for HE (p = 0.042), 0.932 (95% CI 0.923–0.942) for CLAHE (p = 0.009), and 0.925 (95% CI 0.912–0.938) for UM (p = 0.001).

When we tested the AI models on the three large public datasets (CheXpert, ChestX-Det10, and RSNA-Pneumonia), the AI model with the XM-pipeline again outperformed the others. On the CheXpert dataset, it showed the best diagnostic performance: AUC 0.832 (95% CI 0.824–0.839) for the XM-pipeline versus 0.822 (95% CI 0.809–0.835) for baseline (p < 0.001), 0.819 (95% CI 0.804–0.834) for HE (p < 0.001), 0.817 (95% CI 0.806–0.828) for CLAHE (p < 0.001), and 0.814 (95% CI 0.803–0.826) for UM (p = 0.001). On the ChestX-Det10 dataset, it reported the highest AUC: 0.920 (95% CI 0.916–0.924) for the XM-pipeline versus 0.898 (95% CI 0.891–0.906) for baseline (p < 0.001), 0.913 (95% CI 0.907–0.920) for HE (p = 0.003), 0.909 (95% CI 0.903–0.915) for CLAHE (p = 0.001), and 0.899 (95% CI 0.891–0.907) for UM (p = 0.001). On the RSNA-Pneumonia dataset, it also showed the best result: AUC 0.861 (95% CI 0.854–0.867) for the XM-pipeline versus 0.854 (95% CI 0.842–0.866) for baseline (p < 0.001), 0.853 (95% CI 0.844–0.861) for HE (p = 0.001), 0.850 (95% CI 0.842–0.857) for CLAHE (p = 0.001), and 0.853 (95% CI 0.844–0.861) for UM (p = 0.046).

To further understand the behavior of the AI models, we generated heatmaps [43] for the x-rays of selected patients in the IHCR2 and IHCR3,Mobile datasets (Fig. 5). In the heatmaps, the AI model with the XM-pipeline highlighted abnormal regions more clearly than the other models.

Fig. 5 Analysis results of each AI model for the x-rays of selected patients. a–d and i–l Chest x-ray images after pre-processing. e–h and m–p Heatmaps and abnormality scores. In the abnormal chest x-ray image of a 31-year-old man (first row), the heatmap of the AI model with the XM-pipeline highlighted the exact region of fibrosis confirmed by a radiologist (yellow-dotted box in e) with a high abnormality score (0.964). In contrast, the other models showed much smaller activations in these regions (e versus f, g, h). In the normal chest x-ray image of a 30-year-old man (second row), only the heatmap of the AI model with the XM-pipeline revealed no significant activations over the image (m versus n, o, p), with a minor abnormality score (0.057). CLAHE Contrast-limited adaptive histogram equalization, HE Histogram equalization, UM Unsharp masking

Stability of AI predictions

Figure 6 shows the diagnostic performance of each AI model as a function of changes in the image characteristics of input chest x-rays. When we changed the contrast of chest x-ray images (Fig. 6a), the AI model trained using the XM-pipeline generally reported higher diagnostic performance than the other methods (e.g., AUCs when γ = 5.0: 0.821 (95% CI 0.814–0.829) for the XM-pipeline, 0.636 (95% CI 0.610–0.667) for HE, 0.662 (95% CI 0.612–0.720) for CLAHE, and 0.565 (95% CI 0.534–0.599) for UM). When the contrast was low, UM in particular showed degraded performance (e.g., AUCs when γ = 0.2: 0.946 (95% CI 0.943–0.950) for the XM-pipeline, 0.943 (95% CI 0.934–0.953) for HE, 0.927 (95% CI 0.909–0.947) for CLAHE, and 0.691 (95% CI 0.580–0.818) for UM). Similarly, when we changed the sharpness (Fig. 6b) and noise levels (Fig. 6c) of chest x-ray images, the AI model with the XM-pipeline showed less degradation of diagnostic performance, such as when increasing the sharpness (e.g., AUCs when s = 12: 0.960 (95% CI 0.951–0.971) for the XM-pipeline, 0.729 (95% CI 0.572–0.908) for HE, 0.870 (95% CI 0.837–0.907) for CLAHE, and 0.553 (95% CI 0.526–0.584) for UM) and when adding noise (e.g., AUCs when σ = 0.1: 0.801 (95% CI 0.771–0.835) for the XM-pipeline, 0.555 (95% CI 0.510–0.606) for HE, 0.630 (95% CI 0.557–0.713) for CLAHE, and 0.500 (95% CI 0.481–0.521) for UM).

Fig. 6 Diagnostic performance of each AI model depending on changes in x-ray image characteristics. In general, the AI model with the XM-pipeline showed consistently higher diagnostic performance than the other models, such as after increasing the contrast (e.g., AUCs when γ = 5.0: 0.821 (95% CI 0.814–0.829) for the XM-pipeline, 0.636 (95% CI 0.610–0.667) for HE, 0.662 (95% CI 0.612–0.720) for CLAHE, and 0.565 (95% CI 0.534–0.599) for UM) and increasing the sharpness (AUCs when s = 12: 0.960 (95% CI 0.951–0.971) for the XM-pipeline, 0.729 (95% CI 0.572–0.908) for HE, 0.870 (95% CI 0.837–0.907) for CLAHE, and 0.553 (95% CI 0.526–0.584) for UM). AUC Area under the receiver operating characteristic curve, CLAHE Contrast-limited adaptive histogram equalization, HE Histogram equalization, UM Unsharp masking

Discussion

In this study, we proposed the XM-pipeline, which combines a series of image pre-processing and data augmentation methods to minimize the degradation of an AI model's diagnostic performance on chest x-ray images from machines of various specifications. We confirmed that the AI model with the XM-pipeline showed higher diagnostic performance than the AI models with conventional pipelines on test datasets from different x-ray machines, including CR or DR detectors and mobile or stationary generators.

In the XM-pipeline, we carefully designed the data augmentation techniques (see Supplementary Note 3 for details) to cover the potential variations in x-ray images caused by scan settings (e.g., changes in scan parameters, presence of grids) [44,45,46]. For example, when the voltage level of an x-ray generator changes, the photons it emits carry different amounts of energy, which changes the contrast of the resulting chest x-ray images [47]. The sharpness of chest x-ray images can change depending on the presence of grids [48] and on vendor-specific image processing applied when producing DICOM files [49]. Also, a chest x-ray image is known to contain a mixture of noise sources arising from several factors (e.g., radiation dose) [13], and we approximated the thermal and electronic noise by adding Gaussian noise to chest x-rays [50].

Previously, some studies reported the diagnostic performance of AI models for chest x-ray images from multiple institutions [51, 52] and from a few x-ray machines [53]. However, none of them proposed a training pipeline to improve diagnostic performance or investigated AI models with respect to different machine specifications, including the types of detectors and generators. Furthermore, we trained and evaluated the AI models using raw image data from DICOM files without adjusting the window level and width, whereas many open datasets provide x-ray data in Portable Network Graphics or Joint Photographic Experts Group format [20, 54]. We believe deploying an AI model optimized for DICOM files is more practical in clinics.

Throughout this study, we chose the two most common data augmentation techniques (i.e., random rotation and horizontal flipping) as the conventional ones. To assess the effect of other augmentations, we measured the diagnostic performance of AI models trained with different sets of augmentations, including random shearing (degrees within [-15, 15]) and scaling (scaling factor within [0.8, 1.2]). However, no combination of augmentations was superior on the test datasets, even after adding more augmentations (see Supplementary Table S4).

When we investigated the diagnostic performance of the AI for each abnormality, we found that some abnormalities were more challenging than others in terms of generalization (see Supplementary Table S5). For example, between the VHDR1 and IHDR2 datasets, the diagnostic performance for consolidation/ground glass opacity, pleural effusion, and pneumothorax was almost the same, while that for nodule/mass and fibrotic sequelae was degraded on IHDR2. However, after applying the XM-pipeline, the diagnostic performance improved for all abnormalities compared with the baseline model.

This study has limitations. First, we could not fully explore the optimal parameters of the XM-pipeline (e.g., the range of the sharpness coefficient), so the diagnostic performance could still be improved. Second, we adopted the most common image processing techniques for comparison, namely HE, CLAHE, and UM; however, other techniques for normalizing chest x-ray images exist [55]. Third, even though we carefully designed the data augmentation techniques in the XM-pipeline, they can only partially reflect the changes in image characteristics of chest x-rays. In the future, more realistic algorithms, such as Monte Carlo methods for x-ray scattering [56], could be explored as advanced data augmentation techniques. Fourth, we validated the diagnostic performance of the AI models for classifying common pulmonary abnormalities; other applications, such as COVID-19 detection [1], might be addressed in future work. Fifth, quality control of chest x-rays was only performed on the private datasets. Large open datasets, such as CheXpert [20], might need to be reviewed by radiologists to prevent potential labeling issues [57, 58].

In summary, the diagnostic performance of the AI model with the XM-pipeline was consistently higher than that of the other models with conventional pipelines when evaluated on test datasets of different x-ray machine specifications. This result implies that applying the XM-pipeline can minimize the performance degradation of an AI model caused by changes in x-ray machine specifications.