Multiple Prompt Fusion for Zero-Shot Lesion Detection Using Vision-Language Models

Guo, Miaotian; Yi, Huahui; Qin, Ziyuan; Wang, Haiying; Men, Aidong; Lao, Qicheng

doi:10.1007/978-3-031-43904-9_28

Miaotian Guo¹⁴,
Huahui Yi¹⁵,
Ziyuan Qin¹⁵,
Haiying Wang¹⁴,
Aidong Men¹⁴ &
…
Qicheng Lao^14,16

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14224))

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

4033 Accesses

Abstract

The success of large-scale pre-trained vision-language models (VLM) has provided a promising direction of transferring natural image representations to the medical domain by providing a well-designed prompt with medical expert-level knowledge. However, one prompt has difficulty in describing the medical lesions thoroughly enough and containing all the attributes. Besides, the models pre-trained with natural images fail to detect lesions precisely. To solve this problem, fusing multiple prompts is vital to assist the VLM in learning a more comprehensive alignment between textual and visual modalities. In this paper, we propose an ensemble guided fusion approach to leverage multiple statements when tackling the phrase grounding task for zero-shot lesion detection. Extensive experiments are conducted on three public medical image datasets across different modalities and the detection accuracy improvement demonstrates the superiority of our method.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms-improving object detection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569 (2017)
Google Scholar
Codella, N.C., et al.: Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp. 168–172. IEEE (2018)
Google Scholar
Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q.: A survey on ensemble learning. Front. Comput. Sci. 14, 241–258 (2020)
Article Google Scholar
Gao, P., et al.: Clip-adapter: better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021)
Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
Google Scholar
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Google Scholar
Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)
Jensen, J.D., Elewski, B.E.: The ABCDEF rule: combining the “ABCDE Rule” and the “Ugly Duckling Sign” in an effort to improve patient self-screening examinations. J. Clin. Aesthetic Dermatol. 8(2), 15 (2015)
Google Scholar
Jiang, C., Wang, S., Liang, X., Xu, H., Xiao, N.: ElixirNet: relation-aware network architecture adaptation for medical lesion detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11093–11100 (2020)
Google Scholar
Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Google Scholar
Li, L.H., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975 (2022)
Google Scholar
Li, N., Jiang, Y., Zhou, Z.-H.: Multi-label selective ensemble. In: Schwenker, F., Roli, F., Kittler, J. (eds.) MCS 2015. LNCS, vol. 9132, pp. 76–88. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20248-8_7
Chapter Google Scholar
Li, N., Zhou, Z.-H.: Selective ensemble of classifier chains. In: Zhou, Z.-H., Roli, F., Kittler, J. (eds.) MCS 2013. LNCS, vol. 7872, pp. 146–156. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38067-9_13
Chapter Google Scholar
Neubeck, A., Van Gool, L.: Efficient non-maximum suppression. In: 18th international conference on pattern recognition (ICPR 2006), vol. 3, pp. 850–855. IEEE (2006)
Google Scholar
Qin, Z., Yi, H., Lao, Q., Li, K.: Medical image understanding with pretrained vision language models: a comprehensive study. arXiv preprint arXiv:2209.15517 (2022)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Google Scholar
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Google Scholar
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Google Scholar
Sarcar, M., Rao, K., Narayan, K.: Computer aided design and manufacturing. PHI Learning (2008). https://books.google.co.jp/books?id=zXdivq93WIUC
Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vis. Comput. 107, 104117 (2021)
Article Google Scholar
Vázquez, D., et al.: A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthcare Eng. 2017 (2017)
Google Scholar
Wei, S., Li, Z., Zhang, C.: Combined constraint-based with metric-based in semi-supervised clustering ensemble. Int. J. Mach. Learn. Cybern. 9, 1085–1100 (2018)
Article Google Scholar
Zhang, M.L., Zhou, Z.H.: Exploiting unlabeled data to enhance ensemble diversity. Data Min. Knowl. Disc. 26, 98–129 (2013)
Article MathSciNet Google Scholar
Zhang, P., et al.: VinVL: revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588 (2021)
Google Scholar
Zhou, Y., et al.: Large language models are human-level prompt engineers (2023)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
Miaotian Guo, Haiying Wang, Aidong Men & Qicheng Lao
West China Biomedical Big Data Center, West China Hospital, Sichuan University, Sichuan, China
Huahui Yi & Ziyuan Qin
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Qicheng Lao

Authors

Miaotian Guo
View author publications
You can also search for this author in PubMed Google Scholar
Huahui Yi
View author publications
You can also search for this author in PubMed Google Scholar
Ziyuan Qin
View author publications
You can also search for this author in PubMed Google Scholar
Haiying Wang
View author publications
You can also search for this author in PubMed Google Scholar
Aidong Men
View author publications
You can also search for this author in PubMed Google Scholar
Qicheng Lao
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qicheng Lao .

Editor information

Editors and Affiliations

Icahn School of Medicine, Mount Sinai, NYC, NY, USA, Tel Aviv University, Tel Aviv, Israel
Hayit Greenspan
Emory University, Atlanta, GA, USA
Anant Madabhushi
Queen’s University, Kingston, ON, Canada
Parvin Mousavi
The University of British Columbia, Vancouver, BC, Canada
Septimiu Salcudean
Yale University, New Haven, CT, USA
James Duncan
IBM Research, San Jose, CA, USA
Tanveer Syeda-Mahmood
Johns Hopkins University, Baltimore, MD, USA
Russell Taylor

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 162 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Guo, M., Yi, H., Qin, Z., Wang, H., Men, A., Lao, Q. (2023). Multiple Prompt Fusion for Zero-Shot Lesion Detection Using Vision-Language Models. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14224. Springer, Cham. https://doi.org/10.1007/978-3-031-43904-9_28

Download citation

DOI: https://doi.org/10.1007/978-3-031-43904-9_28
Published: 01 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43903-2
Online ISBN: 978-3-031-43904-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The Medical Image Computing and Computer Assisted Intervention Society (opens in a new tab)

Multiple Prompt Fusion for Zero-Shot Lesion Detection Using Vision-Language Models