Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval

  • Conference paper
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Large-scale pre-trained vision-language models like CLIP exhibit impressive zero-shot capabilities in classification and retrieval tasks. However, their application to open-vocabulary semantic segmentation remains challenging because of the gap between the global features that CLIP extracts for whole-image recognition and the semantically detailed pixel-level features that segmentation requires. Recent two-stage methods attempt to bridge this gap by first generating class-agnostic mask proposals that identify regions within an image and then classifying those regions with CLIP. However, this strategy introduces a significant domain shift between the masked and cropped proposals and the images on which CLIP was trained. Fine-tuning CLIP on a limited annotated dataset can alleviate this bias but may compromise its generalization to unseen classes. In this paper, we present a method that addresses the domain shift without relying on fine-tuning. Our approach uses weakly supervised region prototypes acquired from image-caption pairs. We construct a visual vocabulary by associating the words in the captions with region proposals using CLIP embeddings. We then cluster these embeddings to obtain prototypes that embed the same domain shift observed in conventional two-stage methods. During inference, these prototypes can be retrieved alongside textual prompts. Our region classification combines textual similarity with the class noun and similarity with the prototypes from our vocabulary. Our experiments show the effectiveness of using retrieval to enhance vision-language architectures for open-vocabulary semantic segmentation.
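The pipeline described in the abstract (cluster region embeddings per word into prototypes, then score a region by both its text similarity and its prototype similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm or how the two similarities are combined, so the spherical k-means in `build_prototypes`, the max over prototypes, and the mixing weight `alpha` are all assumptions, and every function and parameter name here is hypothetical. The embeddings are assumed to be precomputed CLIP vectors passed in as plain arrays.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_prototypes(region_embs, k, iters=10, seed=0):
    """Cluster the region embeddings linked to one word into k prototypes.

    Assumption: spherical k-means stands in for the unspecified clustering step.
    """
    rng = np.random.default_rng(seed)
    x = l2_normalize(region_embs)
    k = min(k, len(x))
    # Initialise the centers from randomly chosen embeddings.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its most similar center.
        assign = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centers[j] = l2_normalize(members.mean(axis=0))
    return centers


def classify_region(region_emb, text_embs, prototypes, alpha=0.5):
    """Pick the class whose blended text/prototype similarity is highest.

    Assumption: a convex combination weighted by alpha, with the best-matching
    prototype representing each class. Classes without prototypes fall back
    to the text similarity alone.
    """
    r = l2_normalize(region_emb)
    scores = {}
    for name, text_emb in text_embs.items():
        text_sim = float(r @ l2_normalize(text_emb))
        protos = prototypes.get(name)
        if protos is not None and len(protos) > 0:
            proto_sim = float(np.max(protos @ r))
        else:
            proto_sim = text_sim
        scores[name] = alpha * text_sim + (1.0 - alpha) * proto_sim
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch, setting `alpha=1.0` reduces the classifier to plain text-prompt matching, which makes it easy to compare against the retrieval-augmented score on the same region embedding.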

Acknowledgments

Research partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 8 “Pervasive AI”, funded by the European Commission under the NextGeneration EU programme.

Author information

Corresponding author

Correspondence to Luca Barsellotti.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Barsellotti, L., Amoroso, R., Baraldi, L., Cucchiara, R. (2023). Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_17

  • DOI: https://doi.org/10.1007/978-3-031-43153-1_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43152-4

  • Online ISBN: 978-3-031-43153-1

  • eBook Packages: Computer Science, Computer Science (R0)
