
1 Introduction

Chest X-rays (CXRs) are the most commonly performed radiology examination worldwide, with over 150 million obtained annually in the United States alone. CXRs are a cornerstone of acute triage as well as longitudinal surveillance. Despite the ubiquity of the exam and its apparent technical simplicity, the chest X-ray is widely regarded by radiologists as among the most difficult examinations to master [1].

Due to a shortage of radiologists, radiographic technicians are increasingly called upon to provide preliminary interpretations, particularly in Europe and Africa. In the US, non-radiology physicians often provide preliminary or definitive readings of CXRs, decreasing the waiting interval at the nontrivial expense of diagnostic accuracy.

Even among expert radiologists, clinically substantial errors are made in 3–6% of studies [1, 2], with minor errors seen in 30% [3]. Accurate diagnosis of some entities is particularly challenging: early lung cancer, for example, is missed in 19–54% of cases, with similar sensitivity figures described for pneumothorax and rib fracture detection. The likelihood of major diagnostic errors is directly correlated with both shift length and the volume of examinations read [4], a reminder that diagnostic accuracy varies substantially even across the course of a day for a given radiologist.

Hence there exists an immense unmet need and opportunity to provide immediate, consistent, and expert-level insight into every CXR. In the present work we describe a novel methodology developed toward this goal and present the results achieved, validated using a robust method of clinical evaluation.

Fig. 1. TextRay model illustration. Frontal (PA) and lateral view images each pass through a separate CNN. A fully-connected layer is applied to their concatenated feature vectors and emits a confidence for each finding. Training labels were extracted by analyzing the report sentences: negative (green) and positive (red) sentences are identified; findings in positive sentences receive a positive training label, while negative or unmentioned findings receive a negative label.

2 Materials and Methods

Data. All Protected Health Information (PHI) was removed from the data prior to acquisition, in compliance with HIPAA standards. We utilized a dataset of 2.1 million CXRs with their respective diagnostic reports. All postero-anterior (PA) CXR films of individuals aged 18 and above were procured. Corresponding lateral views were present in 85% of the CXR examinations and were included in the study data.

Table 1. Number of studies with each finding in our data. 596k (62%) of the total 959k studies had no reported findings.

Textual Analysis. A standardization process was employed whereby all CXR reports were reduced to a set of distinct canonical labels. First, a sentence boundary detection algorithm was applied to the 2.1M reports, yielding a pool of 827k unique sentences. Three expert radiologists and two medical students categorized the most frequently occurring sentences with respect to their pertinence to CXR images.

Three categories emerged. The first consisted of sentences that report the presence or absence of a finding, for example “the heart is enlarged” or “normal cardiac shadow”, and could be used as labels. The second consisted of neutral sentences, which referenced information not derived from or inherently related to the image itself, for example “84 year old man with cough”, “lung nodule follow up”, or “comparison made to CT chest”. The third consisted of sentences that could render the study unreliable for training due to ambiguity regarding the relationship of the text to the image, for example “no change in the appearance of the chest since yesterday”.

After filtering out neutral and negative sentences using a few hand-crafted regular expressions, it was possible to fully cover 826k reports using just the 20k most prevalent positive sentences. The same expert radiologists reviewed each of these sentences and mapped them to an initial ontology of 60 findings which covered 99.99% of all positive sentence volume.
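For illustration, the following is a minimal sketch of this filtering step. The regular expressions shown are hypothetical stand-ins; the actual hand-crafted patterns used are not reproduced here.

```python
import re

# Hypothetical examples of hand-crafted patterns for negative and neutral
# sentences; the real expressions used in the study are not published.
NEGATIVE_PATTERNS = [
    re.compile(r"\bno (acute|evidence of|significant)\b", re.IGNORECASE),
    re.compile(r"\b(normal|unremarkable|clear)\b", re.IGNORECASE),
]
NEUTRAL_PATTERNS = [
    re.compile(r"\b\d+\s*year[- ]old\b", re.IGNORECASE),
    re.compile(r"\bcomparison (is )?made to\b", re.IGNORECASE),
]

def is_candidate_positive(sentence: str) -> bool:
    """Return True if a sentence survives the negative/neutral filters
    and should therefore be sent for manual mapping to a finding."""
    for pattern in NEGATIVE_PATTERNS + NEUTRAL_PATTERNS:
        if pattern.search(sentence):
            return False
    return True

# The ~20k most prevalent surviving sentences were then manually mapped to
# findings, e.g. {"the heart is enlarged": "cardiomegaly"}.
```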

In making the final ontology, we focused on visual findings rather than clinical interpretations or diagnoses. We chose to merge some categories: osteoporosis was merged into osteopenia, twisted and uncoiled aorta into abnormal aorta, and bronchial markings into interstitial markings, since it is often impossible to differentiate these based on the image alone. Although visually distinct, all tubes and venous lines were consolidated into two respective categories. The resulting 40 categories are presented in Table 1.

Training Set Generation. Upon completion of sentence labeling, we set out to design an appropriate training set. A conservative approach would include only studies whose report sentences were fully covered, i.e., every potentially positive sentence in the report was manually reviewed and mapped to a finding. A more permissive any-hit approach would include any study with a recognized positive sentence in its report, ignoring other unrecognized sentences, at the risk that some of those sentences also mention abnormalities, which would then be mislabeled as negative.

The fully-covered approach yielded 596k normal studies (no positive findings), and 230k abnormal studies. The any-hit approach, while noisier, added 58% more abnormal studies, for a total of 363k. Hence our final training set had 826k studies in the fully-covered approach, and 959k studies in the any-hit approach.
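To make the two labeling policies concrete, the sketch below shows one way a study-level label vector could be derived from its report sentences. The argument names (the reviewed sentence-to-finding map, the 40-finding ontology, and the filter predicate from the previous sketch) are assumptions for illustration, not the actual implementation.

```python
def label_study(sentences, sentence_to_finding, findings,
                is_candidate_positive, policy="any-hit"):
    """Derive binary training labels for one study from its report sentences.

    sentence_to_finding    maps each manually reviewed positive sentence to a finding
    findings               the 40-finding ontology
    is_candidate_positive  predicate marking sentences that survived the
                           neutral/negative filters (see the previous sketch)
    policy                 "fully-covered" or "any-hit", as described above

    Returns {finding: 0 or 1}, or None if the study is excluded from training.
    """
    recognized = {sentence_to_finding[s] for s in sentences if s in sentence_to_finding}
    unrecognized = [s for s in sentences
                    if s not in sentence_to_finding and is_candidate_positive(s)]

    if policy == "fully-covered" and unrecognized:
        return None  # a potentially positive sentence was never reviewed: exclude
    if policy == "any-hit" and not recognized and unrecognized:
        return None  # no recognized positive sentence at all: exclude

    # Findings in positive sentences get 1; negative or unmentioned findings get 0.
    return {f: int(f in recognized) for f in findings}
```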

Additionally, many radiologists will omit mention of normal structures in favor of brevity, thereby implying a negative label. This bias extends to many studies in which even mildly abnormal or senescent changes are omitted. For example, the same CXR may yield a single-line report of “No acute disease” from one radiologist and descriptions of cardiomegaly and degenerative changes from another. This omission bias inherently introduces noise into the labeling process, particularly for findings not deemed critical, even in the more conservative fully-covered training set.

We decided to compare both approaches, and took the larger any-hit training set as our baseline. To the best of our knowledge, this is the largest training set ever assembled for chest X-rays, both in terms of the number of studies and the number of labels (see Table 1 for its composition). We partitioned the data into training, validation, and testing sets (80%/10%/10%, respectively), based on the (anonymized) patient identity. From the 10% of studies designated for validation we compiled a validation set of 994 studies with at least 25 positives for each finding. We picked the model with the lowest validation loss.
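One simple way to realize such a patient-level split is a deterministic hash of the anonymized patient identifier, so that all studies of a patient fall into the same partition; the exact mechanism used in the study is not described here, so this is an assumed implementation.

```python
import hashlib

def split_by_patient(patient_id: str) -> str:
    """Assign a patient to train/validation/test (80%/10%/10%) deterministically,
    keeping all studies of the same patient in a single partition."""
    bucket = int(hashlib.md5(patient_id.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    return "validation" if bucket < 90 else "test"
```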

2.1 Model

Our model, called TextRay, is illustrated in Fig. 1. We start by applying a CNN (DenseNet121 [5]) to the PA and lateral views separately. We removed the last fully-connected layer from each CNN and concatenated their outputs (taken just after the average pooling layer). We then applied our own fully-connected layer yielding \(K=40\) outputs, one for each finding, followed by a sigmoid activation. Hence, our model treats each study as a “bag of findings”, reporting a confidence for each one. We used the mean of the binary cross-entropy losses as our loss function:

$$ loss=-\frac{1}{K}\sum _{k=1}^K \left[\, y_k\log (p_k)+(1-y_k)\log (1-p_k) \right] $$

where \(p_k\) is the value of the k-th output unit and \(y_k\) is the binary label for the k-th finding.
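For concreteness, the following is a minimal sketch of this two-branch architecture and loss in modern tf.keras (the model described above was built with Keras 2.1.3 over TensorFlow 1.4). The branch renaming, random weight initialization, and the assumption that grayscale images are replicated to three channels are implementation choices of the sketch, not details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

K_FINDINGS = 40  # one sigmoid output per finding

def feature_branch(view_name):
    """One DenseNet121 branch truncated after global average pooling
    (the final fully-connected layer is dropped via include_top=False)."""
    backbone = DenseNet121(include_top=False, weights=None, pooling="avg")
    backbone._name = f"densenet121_{view_name}"  # avoid duplicate layer names
    inp = layers.Input(shape=(299, 299, 3), name=f"{view_name}_input")
    return inp, backbone(inp)

pa_input, pa_features = feature_branch("pa")
lat_input, lat_features = feature_branch("lateral")

# Concatenate the two pooled feature vectors and map them to the 40 findings.
merged = layers.Concatenate()([pa_features, lat_features])
outputs = layers.Dense(K_FINDINGS, activation="sigmoid", name="findings")(merged)

model = Model(inputs=[pa_input, lat_input], outputs=outputs)
# Mean binary cross-entropy over the K findings, matching the loss above.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
```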

Our model receives two inputs of size 299\(\,\times \,\)299. When the lateral view was unavailable, we fed the network random noise instead. Each X-ray image (up to 3000\(\,\times \,\)3000 pixels in raw format) was zero-mean-normalized, rescaled to \(330(1+a)\,\times \,330(1+b)\), and rotated by c degrees. A random patch of 299\(\,\times \,\)299 was taken as input. For training augmentation, we sampled a and b uniformly from \(\pm 0.09\) and c from \(\pm 9\), and randomly flipped each image horizontally. For balance, we replaced the PA view with random noise in 5% of the samples. At test time we used \(a=b=c=0\) and took the central patch as input, without flipping.
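A minimal sketch of this preprocessing pipeline is shown below, assuming a single-channel image array; the interpolation and padding behavior of the original implementation is not specified, so scipy defaults are used here.

```python
import numpy as np
from scipy import ndimage

def preprocess(image: np.ndarray, train: bool = True) -> np.ndarray:
    """Zero-mean normalization, rescaling to 330(1+a) x 330(1+b), rotation by
    c degrees, and a 299x299 crop (random at training time, central at test)."""
    if train:
        a, b = np.random.uniform(-0.09, 0.09, size=2)
        c = np.random.uniform(-9, 9)
    else:
        a = b = c = 0.0

    image = image - image.mean()                         # zero-mean normalize
    target_h, target_w = int(330 * (1 + a)), int(330 * (1 + b))
    zoom = (target_h / image.shape[0], target_w / image.shape[1])
    image = ndimage.zoom(image, zoom)                    # rescale
    image = ndimage.rotate(image, c, reshape=False)      # rotate by c degrees

    if train:
        if np.random.rand() < 0.5:
            image = image[:, ::-1]                       # random horizontal flip
        y = np.random.randint(0, image.shape[0] - 299 + 1)
        x = np.random.randint(0, image.shape[1] - 299 + 1)
    else:
        y = (image.shape[0] - 299) // 2                  # central patch
        x = (image.shape[1] - 299) // 2
    return image[y:y + 299, x:x + 299]
```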

We trained on two 1080Ti GPUs, placing each CNN on a different GPU. We used the built-in Keras 2.1.3 implementation of DenseNet121 over TensorFlow 1.4, with the Adam optimizer at Keras’ default parameters and a batch size of 32. We sorted the studies into two queues, normal and abnormal, and filled each batch with 95% abnormal studies on average. An epoch was defined as 150 batches. We started with a learning rate of 0.001 and multiplied it by 0.75 whenever the validation loss had not improved for 30 epochs. We trained for 2000 epochs.
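The two-queue batch filling could look roughly like the sketch below; how studies are represented and how images and labels are loaded is left abstract.

```python
import random

def batch_generator(normals, abnormals, batch_size=32, p_abnormal=0.95):
    """Fill each batch so that ~95% of its studies come from the abnormal
    queue and ~5% from the normal queue, on average."""
    while True:
        batch = []
        for _ in range(batch_size):
            queue = abnormals if random.random() < p_abnormal else normals
            batch.append(random.choice(queue))
        yield batch  # an epoch was defined as 150 such batches
```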

2.2 Evaluation Sets

We chose 12 of the 40 findings and prepared evaluation sets for them, using studies from the test partition. Most sets focused on a single finding, except cardiomegaly, hilar prominence, and pulmonary edema, which were lumped together as they are commonly seen in the setting of congestive heart failure. In each set, the studies were drawn from two pools: the pos-pool comprised studies whose reports indicated the finding as positive; these studies were obtained by a manual textual search for terms indicative of each finding, independently of our sentence-tagging operation. The neg-pool comprised randomly sampled studies, which are mostly negative for all findings (see Table 2 for the composition of the sets).

Each set was evaluated by three expert radiologists. For each set, the radiologists reviewed the shuffled studies and indicated the presence or absence of the relevant finding, using web-based software on a desktop workstation. The radiologists were shown both the PA and lateral views at their original resolution.

We considered the report a fourth expert opinion. To measure the accuracy of the label-extraction process, we cross-referenced the report opinion with the training-set labels. Positive labels in the training set were accurately mentioned in the report; however, positive findings mentioned in the reports were frequently mislabeled as negatives, as would be expected in the any-hit training set, though this was also observed to a lesser degree in the fully-covered set.

3 Results

We performed a pairwise analysis of radiologist agreement following the procedure in [6], except that we used the agreement rate between two taggers (i.e., accuracy) instead of the F1 score, because (a) it also measures agreement on negatives and (b) it is easier to interpret. The average agreement rate (AAR) of a radiologist (or a model) is the average of the agreement rates achieved against the other two (three, for a model) radiologists. The avg. radiologist rate is the mean of the three radiologists’ AARs. We used the bootstrap method (\(n=10000\)) to obtain 95% confidence intervals over the difference between TextRay’s and the average radiologist’s agreement rates. As TextRay’s threshold for each finding, we used the one that maximized the AAR on the validation set.
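For clarity, a minimal sketch of the agreement metrics and the bootstrap over the difference is given below, assuming binary per-study annotations for a single finding and model predictions already thresholded to 0/1.

```python
import numpy as np

def agreement_rate(a, b):
    """Fraction of studies on which two taggers agree (binary labels)."""
    return float(np.mean(np.asarray(a) == np.asarray(b)))

def average_agreement_rate(tagger, others):
    """AAR: mean agreement of one tagger against each of the other taggers."""
    return float(np.mean([agreement_rate(tagger, o) for o in others]))

def bootstrap_delta_ci(model, rads, n_boot=10000, alpha=0.05, seed=0):
    """95% bootstrap CI over delta = AAR(model vs rads) - avg. radiologist AAR.
    `model` is a binary prediction vector; `rads` is a list of three binary vectors."""
    rng = np.random.default_rng(seed)
    model, rads = np.asarray(model), [np.asarray(r) for r in rads]
    n, deltas = len(model), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample studies with replacement
        m, rs = model[idx], [r[idx] for r in rads]
        model_aar = average_agreement_rate(m, rs)
        rad_aar = np.mean([average_agreement_rate(r, [x for x in rs if x is not r])
                           for r in rs])
        deltas.append(model_aar - rad_aar)
    return np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```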

Table 2. Evaluation Sets. The number of studies taken from the pos-pool (finding is positive in report) and neg-pool (random sample) are indicated, along with the average agreement rate (AAR) of the 3 radiologists (rads) assigned to each set vs. the report. The AAR between our model and the rads (column textray) is compared against the AAR between any radiologist and the other rads (avg. rad.). Confidence intervals are computed over the difference (\(\varDelta = \text{textray} - \text{avg. rad.}\)).

Table 2 shows that TextRay is on par with human radiologists (within the 95% CI) on 10 of the 12 findings, the exceptions being rib fracture and hilar prominence. On some findings (elevated diaphragm, abnormal aorta, and pulmonary edema), radiologists agree significantly more with our algorithm than with each other (i.e., the CI does not include 0). Table 2 also shows the average agreement of the radiologists with the report. Here as well, this agreement is often higher than the average agreement among the radiologists themselves. This provides evidence that the noise added by using the reports as labels is no larger than the noise added by having a radiologist perform the tagging.

Using our text-based labels as ground truth, TextRay’s performance was then evaluated over all 40 findings. To create the test set, a random sample of 5,000 studies was drawn from the test partition; additional studies were then added from the partition until each finding had at least 100 positive cases, for a total of 7,030 studies. The ROC curves (in the Supp. material) have AUCs ranging between 0.7 and 1.0 (average 0.892). At the top of the chart, artificial objects (i.e., pacers, lines, tubes, wires, and implants) are detected with AUCs approaching 1.0, considerably better than the disease findings.
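A sketch of this test-set construction follows; the exact top-up procedure is not specified above, so this is one plausible reading, and `labels[s]` is an assumed mapping from a study to its binary text-based label vector.

```python
import random

def build_eval_set(test_partition, labels, findings, base=5000, min_pos=100):
    """Start from a random base sample, then add studies until every finding
    has at least `min_pos` positive cases."""
    pool = list(test_partition)
    random.shuffle(pool)
    selected = pool[:base]
    counts = {f: sum(labels[s][f] for s in selected) for f in findings}
    for s in pool[base:]:
        if all(c >= min_pos for c in counts.values()):
            break  # every finding has enough positives
        if any(labels[s][f] and counts[f] < min_pos for f in findings):
            selected.append(s)
            for f in findings:
                counts[f] += labels[s][f]
    return selected
```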

Fig. 2. Area under the ROC curve (AUC) of our base model vs. the PA-only variant over the 40 chest X-ray findings. The numbers refer to the indices in Table 1; clustered labels map to points from left to right.

Figure 2 shows the area under the ROC curve (AUC) achieved by our model compared with a variant trained only on the PA view of each study (the approach used in [6, 7]). For most findings the performance is similar, but vertebral height loss, consolidation, rib fracture, and kyphosis stand out as findings for which the lateral view improved detection. This is expected from a clinical radiographic perspective.

For comparison, we also trained a variant of TextRay on the fully-covered training set, but it achieved significantly lower results on almost all findings (see Supp. material), suggesting that the additional abnormal studies in the any-hit set more than compensated for the higher label noise. Finally, we generated heat maps following the procedure presented in [7] (shown in the Supp. material).

4 Discussion

The extraction of labels from full CXR reports has been recognized as essential for efficient and robust CNN training on large datasets. Shin et al. [8] extracted labels from the 3,955 CXR reports in the OpenI dataset using the MeSH system [9]. The ChestX-ray14 dataset released by Wang et al. [7] contains 112k PA images loosely labeled using a combination of NLP and hand-crafted rules. The team of four radiologists in Rajpurkar et al. [6] reported a high degree of disagreement with the provided ChestX-ray14 labels in general, although they demonstrated expert-level prediction of the presence of pneumonia after training a DenseNet121 CNN.

Utilizing several public datasets that provide image labels and reports, Jing et al. [10] built a system that generates a natural-appearing radiology report using a hierarchical RNN: the high-level RNN generates sentence embeddings that seed low-level RNNs producing the words of each sentence. As part of report generation, they also produce tags representing the clinical findings present in the image. Interestingly, the model trained on these tags together with the report text did not predict the tags better than the model trained on the tags alone. The ultimate accuracy of the system, however, remains poorly defined due to the lack of clinical radiologic validation.

To the best of our knowledge, the present study is the first to utilize extensive radiology expertise for both multi-label generation and visual validation of algorithmic results. Study labels were generated bottom-up via an ontology-based methodology rooted in the text itself rather than in pre-existing categories or tags (e.g., MeSH). We trained on the largest dataset of CXRs described to date, achieving results on twelve distinct visual findings that are on par with inter-radiologist agreement and, in some cases, better.

5 Conclusion

In this work we attempt to broadly cover all the findings radiologists usually report when reviewing a PA and lateral chest X-ray. Since a relatively small set of sentences is heavily reused in CXR reports, we were able to generate organic labels for millions of reports by examining and indexing twenty thousand individual sentences. This massive amount of data allowed us to obtain radiologist-level detection performance on a variety of findings using a single model, in essence distilling the insight of millions of radiographic interpretations into software. Application of a similar technique to AP chest X-rays, as well as musculoskeletal and abdominal radiographs, is currently ongoing.