Weakly Supervised One-Stage Vision and Language Disease Detection Using Large Scale Pneumonia and Pneumothorax Studies

Tam, Leo K.; Wang, Xiaosong; Turkbey, Evrim; Lu, Kevin; Wen, Yuhong; Xu, Daguang

doi:10.1007/978-3-030-59719-1_5

Leo K. Tam¹⁶,
Xiaosong Wang¹⁶,
Evrim Turkbey¹⁷,
Kevin Lu¹⁶,
Yuhong Wen¹⁶ &
…
Daguang Xu¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12264))

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

8817 Accesses
8 Citations

Abstract

Detecting clinically relevant objects in medical images is a challenge despite large datasets due to the lack of detailed labels. To address the label issue, we utilize the scene-level labels with a detection architecture that incorporates natural language information. We present a challenging new set of radiologist paired bounding box and natural language annotations on the publicly available MIMIC-CXR dataset especially focussed on pneumonia and pneumothorax. Along with the dataset, we present a joint vision language weakly supervised transformer layer-selected one-stage dual head detection architecture (LITERATI) alongside strong baseline comparisons with class activation mapping (CAM), gradient CAM, and relevant implementations on the NIH ChestXray-14 and MIMIC-CXR dataset. Borrowing from advances in vision language architectures, the LITERATI method demonstrates joint image and referring expression (objects localized in the image using natural language) input for detection that scales in a purely weakly supervised fashion. The architectural modifications address three obstacles – implementing a supervised vision and language detection method in a weakly supervised fashion, incorporating clinical referring expression natural language information, and generating high fidelity detections with map probabilities. Nevertheless, the challenging clinical nature of the radiologist annotations including subtle references, multi-instance specifications, and relatively verbose underlying medical reports, ensures the vision language detection task at scale remains stimulating for future investigation.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 119.00; Price excludes VAT (USA)

Softcover Book: USD 159.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12–14 December 2011, Granada, Spain, pp. 2546–2554 (2011). http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization
Brooks, J.: Coco annotator (2019). https://github.com/jsbroks/coco-annotator/
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). http://arxiv.org/abs/1810.04805
Girshick, R.B.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015, pp. 1440–1448. IEEE Computer Society (2015). https://doi.org/10.1109/ICCV.2015.169
Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. CoRR abs/1611.10012 (2016). http://arxiv.org/abs/1611.10012
Irvin, J., et al.: Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031 (2019). http://arxiv.org/abs/1901.07031
Johnson, A.E.W., et al.: MIMIC-CXR: a large publicly available database of labeled chest radiographs. CoRR abs/1901.07042 (2019). http://arxiv.org/abs/1901.07042
Johnson, J., Karpathy, A., Fei-Fei, L.: Densecap: fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
Google Scholar
Kao, H.: Gradcam on chexnet, March 2020. https://github.com/thtang/CheXNet-with-localization
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.L.: Referitgame: referring to objects in photographs of natural scenes. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 787–798. ACL (2014). https://doi.org/10.3115/v1/d14-1086
Li, Z., et al.: Thoracic disease identification and localization with limited supervision. CoRR abs/1711.06373 (2017). http://arxiv.org/abs/1711.06373
Lin, M., Chen, Q., Yan, S.: Network in network. In: Bengio, Y., LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014, Conference Track Proceedings (2014). http://arxiv.org/abs/1312.4400
Loper, E., Bird, S.: NLTK: the natural language toolkit. CoRR cs.CL/0205028 (2002). https://arxiv.org/abs/cs/0205028
Lyubinets, V., Boiko, T., Nicholas, D.: Automated labeling of bugs and tickets using attention-based mechanisms in recurrent neural networks. CoRR abs/1807.02892 (2018). http://arxiv.org/abs/1807.02892
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The stanford corenlp natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, Baltimore, MD, USA, 22–27 June 2014, System Demonstrations, pp. 55–60. The Association for Computer Linguistics (2014). https://doi.org/10.3115/v1/p14-5010
Moradi, M., Madani, A., Gur, Y., Guo, Y., Syeda-Mahmood, T.: Bimodal network architectures for automatic generation of image annotation from text. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 449–456. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_51
Chapter Google Scholar
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. CoRR abs/1912.01703 (2019). http://arxiv.org/abs/1912.01703
Rajpurkar, P., et al.: Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. CoRR abs/1711.05225 (2017). http://arxiv.org/abs/1711.05225
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for squad. CoRR abs/1806.03822 (2018). http://arxiv.org/abs/1806.03822
Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. CoRR abs/1506.02640 (2015). http://arxiv.org/abs/1506.02640
Redmon, J., Farhadi, A.: Yolov3: an incremental improvement. CoRR abs/1804.02767 (2018). http://arxiv.org/abs/1804.02767
Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-cam: why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016). http://arxiv.org/abs/1610.02391
Tenney, I., Das, D., Pavlick, E.: BERT rediscovers the classical NLP pipeline. CoRR abs/1905.05950 (2019). http://arxiv.org/abs/1905.05950
Vaswani, A., et al.: Attention is all you need. CoRR abs/1706.03762 (2017). http://arxiv.org/abs/1706.03762
Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. CoRR abs/1705.02315 (2017). http://arxiv.org/abs/1705.02315
Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: Tienet: text-image embedding network for common thorax disease classification and reporting in chest x-rays. CoRR abs/1801.04334 (2018). http://arxiv.org/abs/1801.04334
Yan, K., Wang, X., Lu, L., Summers, R.M.: Deeplesion: automated deep mining, categorization and detection of significant radiology image findings using large-scale clinical lesion annotations. CoRR abs/1710.01766 (2017). http://arxiv.org/abs/1710.01766
Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), 27 October–2 November 2019, pp. 4682–4692. IEEE (2019). https://doi.org/10.1109/ICCV.2019.00478
Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. CoRR abs/1908.06354 (2019). http://arxiv.org/abs/1908.06354
Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. CoRR abs/1512.04150 (2015). http://arxiv.org/abs/1512.04150
Zhu, W., Vang, Y.S., Huang, Y., Xie, X.: DeepEM: deep 3D ConvNets with EM for weakly supervised pulmonary nodule detection. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 812–820. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_90
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

NVIDIA, 2788 San Tomas Expy, Santa Clara, CA, 95051, USA
Leo K. Tam, Xiaosong Wang, Kevin Lu, Yuhong Wen & Daguang Xu
National Institute of Health Clinical Center, 10 Center Dr., Bethesda, MD, 20814, USA
Evrim Turkbey

Authors

Leo K. Tam
View author publications
You can also search for this author in PubMed Google Scholar
Xiaosong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Evrim Turkbey
View author publications
You can also search for this author in PubMed Google Scholar
Kevin Lu
View author publications
You can also search for this author in PubMed Google Scholar
Yuhong Wen
View author publications
You can also search for this author in PubMed Google Scholar
Daguang Xu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Leo K. Tam .

Editor information

Editors and Affiliations

University of Toronto, Toronto, ON, Canada
Anne L. Martel
The University of British Columbia, Vancouver, BC, Canada
Purang Abolmaesumi
University College London, London, UK
Danail Stoyanov
École Centrale de Nantes, Nantes, France
Diana Mateus
EURECOM, Biot, France
Maria A. Zuluaga
Chinese Academy of Sciences, Beijing, China
S. Kevin Zhou
Sorbonne University, Paris, France
Daniel Racoceanu
The Hebrew University of Jerusalem, Jerusalem, Israel
Leo Joskowicz

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Tam, L.K., Wang, X., Turkbey, E., Lu, K., Wen, Y., Xu, D. (2020). Weakly Supervised One-Stage Vision and Language Disease Detection Using Large Scale Pneumonia and Pneumothorax Studies. In: Martel, A.L., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. MICCAI 2020. Lecture Notes in Computer Science(), vol 12264. Springer, Cham. https://doi.org/10.1007/978-3-030-59719-1_5

Download citation

DOI: https://doi.org/10.1007/978-3-030-59719-1_5
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59718-4
Online ISBN: 978-3-030-59719-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The Medical Image Computing and Computer Assisted Intervention Society (opens in a new tab)