
1 Introduction

Chest X-rays (CXRs) are the most commonly performed radiology examination worldwide, with over 150 million obtained annually in the United States alone. CXRs are a cornerstone of acute triage as well as longitudinal surveillance. Despite the ubiquity of the exam and its apparent technical simplicity, the chest X-ray is widely regarded by radiologists as among the most difficult examinations to master [1].

Due to a shortage of radiologists, radiographic technicians are increasingly called upon to provide preliminary interpretations, particularly in Europe and Africa. In the US, non-radiology physicians often provide preliminary or definitive readings of CXRs, decreasing the waiting interval at the nontrivial expense of diagnostic accuracy.

Even among expert radiologists, clinically substantial errors are made in 3–6% of studies [1, 2], with minor errors seen in 30% [3]. Accurate diagnosis of some entities is particularly challenging: early lung cancer, for example, is missed in 19–54% of cases, with similar sensitivity figures described for pneumothorax and rib fracture detection. The likelihood of major diagnostic errors is directly correlated with both shift length and the volume of examinations read [4], a reminder that diagnostic accuracy varies substantially even across the course of a day for a given radiologist.

Hence there exists an immense unmet need and opportunity to provide immediate, consistent, and expert-level insight into every CXR. In the present work we describe a novel methodology developed toward this goal and present the results achieved, validated using a robust method of clinical evaluation.

Fig. 1. TextRay model illustration. Frontal (PA) and lateral view images each pass through a separate CNN. A fully-connected layer is applied to their concatenated feature vectors and emits a confidence for each finding. Training labels were extracted by analyzing the report sentences: negative (green) and positive (red) sentences are identified; findings in positive sentences receive a positive training label, while negative or unmentioned findings receive a negative label.

2 Materials and Methods

Data. All Protected Health Information (PHI) was removed from the data prior to acquisition, in compliance with HIPAA standards. We utilized a dataset of 2.1 million CXRs with their respective diagnostic reports. All postero-anterior (PA) CXR films of individuals aged 18 and above were procured. Corresponding lateral views were present in 85% of the CXR examinations and were included in the study data.

Table 1. Number of studies with each finding in our data. 596k (62%) of the total 959k studies had no reported findings.

Textual Analysis. A standardization process was employed whereby all CXR reports were reduced to a set of distinct canonical labels. First, a sentence boundary detection algorithm was applied to the 2.1M reports, yielding a pool of 827k unique sentences. Three expert radiologists and two medical students categorized the most frequently occurring sentences with respect to their pertinence to CXR images.

Three categories emerged. The first consisted of sentences that report the presence or absence of a finding, for example “the heart is enlarged” or “normal cardiac shadow”, and could be used as labels. The second consisted of neutral sentences, which referenced information not derived from or inherently related to the image itself, for example “84 year old man with cough”, “lung nodule follow up”, or “comparison made to CT chest”. The third consisted of sentences that could render the study unreliable for training due to ambiguity regarding the relationship of the text to the image, for example “no change in the appearance of the chest since yesterday”.

After filtering out neutral and negative sentences using a few hand-crafted regular expressions, it was possible to fully cover 826k reports using just the 20k most prevalent positive sentences. The same expert radiologists reviewed each of these sentences and mapped them to an initial ontology of 60 findings which covered 99.99% of all positive sentence volume.
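For illustration, the following is a minimal sketch of this filtering step. The regular expressions shown are hypothetical stand-ins; the actual hand-crafted patterns used are not reproduced here.

```python
import re

# Hypothetical examples of hand-crafted patterns for negative and neutral
# sentences; the real expressions used in the study are not published.
NEGATIVE_PATTERNS = [
    re.compile(r"\bno (acute|evidence of|significant)\b", re.IGNORECASE),
    re.compile(r"\b(normal|unremarkable|clear)\b", re.IGNORECASE),
]
NEUTRAL_PATTERNS = [
    re.compile(r"\b\d+\s*year[- ]old\b", re.IGNORECASE),
    re.compile(r"\bcomparison (is )?made to\b", re.IGNORECASE),
]

def is_candidate_positive(sentence: str) -> bool:
    """Return True if a sentence survives the negative/neutral filters
    and should therefore be sent for manual mapping to a finding."""
    for pattern in NEGATIVE_PATTERNS + NEUTRAL_PATTERNS:
        if pattern.search(sentence):
            return False
    return True

# The ~20k most prevalent surviving sentences were then manually mapped to
# findings, e.g. {"the heart is enlarged": "cardiomegaly"}.
```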

In making the final ontology, we focused on visual findings rather than clinical interpretations or diagnoses. We chose to merge some categories: osteoporosis was merged into osteopenia, twisted and uncoiled aorta into abnormal aorta, and bronchial markings into interstitial markings, since it is often impossible to differentiate these based on the image alone. Although visually distinct, all tubes and venous lines were consolidated into two respective categories. The resulting 40 categories are presented in Table 1.

Training Set Generation. Upon completion of sentence labeling, we set out to design an appropriate training set. A conservative approach would include only studies whose report sentences were fully covered, i.e., every potentially positive sentence in the report was manually reviewed and mapped to a finding. A more permissive any-hit approach would include any study with a recognized positive sentence in its report, ignoring other unrecognized sentences, at the risk that some of those sentences also mention abnormalities, which would then be mislabeled as negative.

The fully-covered approach yielded 596k normal studies (no positive findings), and 230k abnormal studies. The any-hit approach, while noisier, added 58% more abnormal studies, for a total of 363k. Hence our final training set had 826k studies in the fully-covered approach, and 959k studies in the any-hit approach.
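To make the two labeling policies concrete, the sketch below shows one way a study-level label vector could be derived from its report sentences. The argument names (the reviewed sentence-to-finding map, the 40-finding ontology, and the filter predicate from the previous sketch) are assumptions for illustration, not the actual implementation.

```python
def label_study(sentences, sentence_to_finding, findings,
                is_candidate_positive, policy="any-hit"):
    """Derive binary training labels for one study from its report sentences.

    sentence_to_finding    maps each manually reviewed positive sentence to a finding
    findings               the 40-finding ontology
    is_candidate_positive  predicate marking sentences that survived the
                           neutral/negative filters (see the previous sketch)
    policy                 "fully-covered" or "any-hit", as described above

    Returns {finding: 0 or 1}, or None if the study is excluded from training.
    """
    recognized = {sentence_to_finding[s] for s in sentences if s in sentence_to_finding}
    unrecognized = [s for s in sentences
                    if s not in sentence_to_finding and is_candidate_positive(s)]

    if policy == "fully-covered" and unrecognized:
        return None  # a potentially positive sentence was never reviewed: exclude
    if policy == "any-hit" and not recognized and unrecognized:
        return None  # no recognized positive sentence at all: exclude

    # Findings in positive sentences get 1; negative or unmentioned findings get 0.
    return {f: int(f in recognized) for f in findings}
```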

Additionally, many radiologists will omit mention of normal structures in favor of brevity, thereby implying a negative label. This bias extends to many studies in which even mildly abnormal or senescent changes are omitted. For example, the same CXR may yield a single-line report of “No acute disease” from one radiologist and descriptions of cardiomegaly and degenerative changes from another. This omission bias inherently introduces noise into the labeling process, particularly for findings not deemed critical, even in the more conservative fully-covered training set.

We decided to compare both approaches, and took the larger any-hit training set as our baseline. To the best of our knowledge, this is the largest training set ever assembled for chest X-rays, both in terms of the number of studies and the number of labels (see Table 1 for its composition). We partitioned the data into training, validation, and testing sets (80%/10%/10%, respectively), based on the (anonymized) patient identity. From the 10% of studies designated for validation we compiled a validation set of 994 studies with at least 25 positives for each finding. We picked the model with the lowest validation loss.
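One simple way to realize such a patient-level split is a deterministic hash of the anonymized patient identifier, so that all studies of a patient fall into the same partition; the exact mechanism used in the study is not described here, so this is an assumed implementation.

```python
import hashlib

def split_by_patient(patient_id: str) -> str:
    """Assign a patient to train/validation/test (80%/10%/10%) deterministically,
    keeping all studies of the same patient in a single partition."""
    bucket = int(hashlib.md5(patient_id.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    return "validation" if bucket < 90 else "test"
```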

2.1 Model

Our model, called TextRay, is illustrated in Fig. 1. We start by applying a CNN (DenseNet121 [5]) to the PA and lateral views separately. We removed the last fully-connected layer from each CNN and concatenated their outputs (taken just after the average pooling layer). We then applied our own fully-connected layer yielding \(K=40\) outputs, one for each finding, followed by a sigmoid activation. Hence, our model treats each study as a “bag of findings”, reporting a confidence for each one. We used the mean of the binary cross-entropy losses as our loss function:

$$ loss=-\frac{1}{K}\sum _{k=1}^K \left[\, y_k\log (p_k)+(1-y_k)\log (1-p_k) \right] $$

where \(p_k\) is the value of the k-th output unit and \(y_k\) is the binary label for the k-th finding.
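For concreteness, the following is a minimal sketch of this two-branch architecture and loss in modern tf.keras (the model described above was built with Keras 2.1.3 over TensorFlow 1.4). The branch renaming, random weight initialization, and the assumption that grayscale images are replicated to three channels are implementation choices of the sketch, not details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

K_FINDINGS = 40  # one sigmoid output per finding

def feature_branch(view_name):
    """One DenseNet121 branch truncated after global average pooling
    (the final fully-connected layer is dropped via include_top=False)."""
    backbone = DenseNet121(include_top=False, weights=None, pooling="avg")
    backbone._name = f"densenet121_{view_name}"  # avoid duplicate layer names
    inp = layers.Input(shape=(299, 299, 3), name=f"{view_name}_input")
    return inp, backbone(inp)

pa_input, pa_features = feature_branch("pa")
lat_input, lat_features = feature_branch("lateral")

# Concatenate the two pooled feature vectors and map them to the 40 findings.
merged = layers.Concatenate()([pa_features, lat_features])
outputs = layers.Dense(K_FINDINGS, activation="sigmoid", name="findings")(merged)

model = Model(inputs=[pa_input, lat_input], outputs=outputs)
# Mean binary cross-entropy over the K findings, matching the loss above.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
```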

Our model receives two inputs of size 299\(\,\times \,\)299. When the lateral view was unavailable, we fed the network random noise instead. Each X-ray image (up to 3000\(\,\times \,\)3000 pixels in raw format) was zero-mean-normalized, rescaled to \(330(1+a)\,\times \,330(1+b)\), and rotated by c degrees. A random patch of 299\(\,\times \,\)299 was taken as input. For training augmentation, we sampled a and b uniformly from \(\pm 0.09\) and c from \(\pm 9\), and randomly flipped each image horizontally. For balance, we replaced the PA view with random noise in 5% of the samples. At test time we used \(a=b=c=0\) and took the central patch as input, without flipping.
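A minimal sketch of this preprocessing pipeline is shown below, assuming a single-channel image array; the interpolation and padding behavior of the original implementation is not specified, so scipy defaults are used here.

```python
import numpy as np
from scipy import ndimage

def preprocess(image: np.ndarray, train: bool = True) -> np.ndarray:
    """Zero-mean normalization, rescaling to 330(1+a) x 330(1+b), rotation by
    c degrees, and a 299x299 crop (random at training time, central at test)."""
    if train:
        a, b = np.random.uniform(-0.09, 0.09, size=2)
        c = np.random.uniform(-9, 9)
    else:
        a = b = c = 0.0

    image = image - image.mean()                         # zero-mean normalize
    target_h, target_w = int(330 * (1 + a)), int(330 * (1 + b))
    zoom = (target_h / image.shape[0], target_w / image.shape[1])
    image = ndimage.zoom(image, zoom)                    # rescale
    image = ndimage.rotate(image, c, reshape=False)      # rotate by c degrees

    if train:
        if np.random.rand() < 0.5:
            image = image[:, ::-1]                       # random horizontal flip
        y = np.random.randint(0, image.shape[0] - 299 + 1)
        x = np.random.randint(0, image.shape[1] - 299 + 1)
    else:
        y = (image.shape[0] - 299) // 2                  # central patch
        x = (image.shape[1] - 299) // 2
    return image[y:y + 299, x:x + 299]
```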

We trained on two 1080Ti GPUs, placing each CNN on a different GPU. We used the built-in Keras 2.1.3 implementation of DenseNet121 over TensorFlow 1.4, with the Adam optimizer at Keras’ default parameters and a batch size of 32. We sorted the studies into two queues, normal and abnormal, and filled each batch with 95% abnormal studies on average. An epoch was defined as 150 batches. We started with a learning rate of 0.001 and multiplied it by 0.75 whenever the validation loss had not improved for 30 epochs. We trained for 2000 epochs.
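The two-queue batch filling could look roughly like the sketch below; how studies are represented and how images and labels are loaded is left abstract.

```python
import random

def batch_generator(normals, abnormals, batch_size=32, p_abnormal=0.95):
    """Fill each batch so that ~95% of its studies come from the abnormal
    queue and ~5% from the normal queue, on average."""
    while True:
        batch = []
        for _ in range(batch_size):
            queue = abnormals if random.random() < p_abnormal else normals
            batch.append(random.choice(queue))
        yield batch  # an epoch was defined as 150 such batches
```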

2.2 Evaluation Sets

We chose 12 of the 40 findings and prepared evaluation sets for them, using studies from the test partition. Most sets focused on a single finding, except cardiomegaly, hilar prominence, and pulmonary edema, which were lumped together as they are commonly seen in the setting of congestive heart failure. In each set, the studies were drawn from two pools: the pos-pool comprised studies whose reports indicated the finding as positive; these studies were obtained by a manual textual search for terms indicative of each finding, independently of our sentence-tagging operation. The neg-pool comprised randomly sampled studies, which are mostly negative for all findings (see Table 2 for the composition of the sets).

Each set was evaluated by three expert radiologists. For each set, the radiologists reviewed the shuffled studies and indicated the presence or absence of the relevant finding, using web-based software on a desktop workstation. The radiologists were shown both the PA and lateral views at their original resolution.

We considered the report a fourth expert opinion. To measure the accuracy of the label-extraction process, we cross-referenced the report opinion with the training-set labels. Positive labels in the training set were accurately mentioned in the report; however, positive findings mentioned in the reports were frequently mislabeled as negatives, as would be expected in the any-hit training set, though this was also observed to a lesser degree in the fully-covered set.

3 Results

We performed a pairwise analysis of radiologist agreement following the procedure in [6], except that we used the agreement rate between two taggers (i.e., accuracy) instead of the F1 score, because (a) it also measures agreement on negatives and (b) it is easier to interpret. The average agreement rate (AAR) of a radiologist (or a model) is the average of the agreement rates achieved against the other two (three, for a model) radiologists. The avg. radiologist rate is the mean of the three radiologists’ AARs. We used the bootstrap method (\(n=10000\)) to obtain 95% confidence intervals over the difference between TextRay’s and the average radiologist’s agreement rates. As TextRay’s threshold for each finding, we used the one that maximized the AAR on the validation set.
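For clarity, a minimal sketch of the agreement metrics and the bootstrap over the difference is given below, assuming binary per-study annotations for a single finding and model predictions already thresholded to 0/1.

```python
import numpy as np

def agreement_rate(a, b):
    """Fraction of studies on which two taggers agree (binary labels)."""
    return float(np.mean(np.asarray(a) == np.asarray(b)))

def average_agreement_rate(tagger, others):
    """AAR: mean agreement of one tagger against each of the other taggers."""
    return float(np.mean([agreement_rate(tagger, o) for o in others]))

def bootstrap_delta_ci(model, rads, n_boot=10000, alpha=0.05, seed=0):
    """95% bootstrap CI over delta = AAR(model vs rads) - avg. radiologist AAR.
    `model` is a binary prediction vector; `rads` is a list of three binary vectors."""
    rng = np.random.default_rng(seed)
    model, rads = np.asarray(model), [np.asarray(r) for r in rads]
    n, deltas = len(model), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample studies with replacement
        m, rs = model[idx], [r[idx] for r in rads]
        model_aar = average_agreement_rate(m, rs)
        rad_aar = np.mean([average_agreement_rate(r, [x for x in rs if x is not r])
                           for r in rs])
        deltas.append(model_aar - rad_aar)
    return np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```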

Table 2. Evaluation Sets. The number of studies taken from the pos-pool (finding is positive in report) and neg-pool (random sample) are indicated, along with the average agreement rate (AAR) of the 3 radiologists (rads) assigned to each set vs. the report. The AAR between our model and the rads (column textray) is compared against the AAR between any radiologist and the other rads (avg. rad.). Confidence intervals are computed over the difference (\(\varDelta = \text{textray} - \text{avg. rad.}\)).

Table 2 shows that TextRay is on par with human radiologists (within the 95% CI) on 10 of the 12 findings, the exceptions being rib fracture and hilar prominence. On some findings (elevated diaphragm, abnormal aorta, and pulmonary edema), radiologists agree significantly more with our algorithm than with each other (i.e., the CI does not include 0). Table 2 also shows the average agreement of the radiologists with the report. Here as well, this agreement is often higher than the average agreement among the radiologists themselves. This provides evidence that the noise added by using the reports as labels is no larger than the noise added by having a radiologist perform the tagging.

Using our text-based labels as ground truth, TextRay’s performance was then evaluated over all 40 findings. To create the test set, a random sample of 5,000 studies was drawn from the test partition; additional studies were then added from the partition until each finding had at least 100 positive cases, for a total of 7,030 studies. The ROC curves (in the Supp. material) have AUCs ranging between 0.7 and 1.0 (average 0.892). At the top of the chart, artificial objects (i.e., pacers, lines, tubes, wires, and implants) are detected with AUCs approaching 1.0, considerably better than the disease findings.
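A sketch of this test-set construction follows; the exact top-up procedure is not specified above, so this is one plausible reading, and `labels[s]` is an assumed mapping from a study to its binary text-based label vector.

```python
import random

def build_eval_set(test_partition, labels, findings, base=5000, min_pos=100):
    """Start from a random base sample, then add studies until every finding
    has at least `min_pos` positive cases."""
    pool = list(test_partition)
    random.shuffle(pool)
    selected = pool[:base]
    counts = {f: sum(labels[s][f] for s in selected) for f in findings}
    for s in pool[base:]:
        if all(c >= min_pos for c in counts.values()):
            break  # every finding has enough positives
        if any(labels[s][f] and counts[f] < min_pos for f in findings):
            selected.append(s)
            for f in findings:
                counts[f] += labels[s][f]
    return selected
```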

Fig. 2. Area under the ROC curve (AUC) of our base model vs. the PA-only variant over the 40 chest X-ray findings. The numbers refer to the indices in Table 1; clustered labels map to points from left to right.

Figure 2 shows the area under the ROC curve (AUC) achieved by our model compared with a variant trained only on the PA view of each study (the approach used in [6, 7]). For most findings the performance is similar, but vertebral height loss, consolidation, rib fracture, and kyphosis stand out as findings for which the lateral view improved detection. This is expected from a clinical radiographic perspective.

For comparison, we also trained a variant of TextRay on the fully-covered training set, but it achieved significantly lower results on almost all findings (see Supp. material), suggesting that the additional abnormal studies in the any-hit set more than compensated for the higher label noise. Finally, we generated heat maps following the procedure presented in [7] (shown in the Supp. material).

4 Discussion

The extraction of labels from full CXR reports has been recognized as essential for efficient and robust CNN training on large datasets. Shin et al. [8] extracted labels from the 3,955 CXR reports in the OpenI dataset using the MeSH system [9]. The ChestX-ray14 dataset released by Wang et al. [7] contains 112k PA images loosely labeled using a combination of NLP and hand-crafted rules. The team of four radiologists in Rajpurkar et al. [6] reported a high degree of disagreement with the provided ChestX-ray14 labels in general, although they demonstrated expert-level prediction of the presence of pneumonia after training a DenseNet121 CNN.

Utilizing several public datasets that provide image labels and reports, Jing et al. [10] built a system that generates a natural-appearing radiology report using a hierarchical RNN: the high-level RNN generates sentence embeddings that seed low-level RNNs producing the words of each sentence. As part of report generation, they also produce tags representing the clinical findings present in the image. Interestingly, the model trained on these tags together with the report text did not predict the tags better than the model trained on the tags alone. The ultimate accuracy of the system, however, remains poorly defined due to the lack of clinical radiologic validation.

To the best of our knowledge, the present study is the first to utilize extensive radiology expertise for both multi-label generation and visual validation of algorithmic results. Study labels were generated bottom-up via an ontology-based methodology rooted in the text itself rather than in pre-existing categories or tags (e.g., MeSH). We trained on the largest dataset of CXRs described to date, achieving results on twelve distinct visual findings that are on par with inter-radiologist agreement and, in some cases, better.

5 Conclusion

In this work we attempt to broadly cover all the findings radiologists usually report when reviewing a PA and lateral chest X-ray. Since a relatively small set of sentences is heavily reused in CXR reports, we were able to generate organic labels for millions of reports by examining and indexing twenty thousand individual sentences. This massive amount of data allowed us to obtain radiologist-level detection performance on a variety of findings using a single model, in essence distilling the insight of millions of radiographic interpretations into software. Application of a similar technique to AP chest X-rays, as well as musculoskeletal and abdominal radiographs, is currently ongoing.