Abstract
In the current multimodal retrieval field, CoOp is a preferred approach among many models owing to its simplicity and strong adaptive capability. However, CoOp focuses primarily on optimizing prompts for contrastive learning, without considering image-text interaction or the effect of incorporating visual information into the prompts. In this work, we propose a CoOp-based prompt tuning method that models image-text interaction: Decoding Context Optimization (DeCoOp). In extensive experiments on 11 image classification datasets, our method surpasses CoOp on seven datasets under the few-shot setting and on all 11 datasets under the zero-shot setting. Experiments on four ImageNet target datasets show a performance improvement of more than 10%, demonstrating that our approach substantially outperforms the baseline CoOp in domain generalization and robustness. In addition, ablation experiments on three representative datasets confirm the effectiveness of DeCoOp and its further accuracy gains. Finally, experiments on the 11 datasets with different visual backbones show that, across all architectures, our approach maintains a large margin over handcrafted prompts and performs better than CoOp.
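The core idea described above, CoOp-style learnable context vectors enriched by a transformer decoder so that prompt tokens can attend to visual features, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimension, number of context tokens, single decoder layer, and the `DecodedPromptLearner` name are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecodedPromptLearner(nn.Module):
    """Sketch: learnable prompt context (as in CoOp) passed through a
    transformer decoder that cross-attends to image features, so the
    prompts carry visual information before text encoding."""

    def __init__(self, n_ctx=16, dim=512, n_heads=8):
        super().__init__()
        # CoOp: n_ctx learnable context vectors shared across classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Decoder layer: prompts (queries) attend to image features (memory)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)

    def forward(self, image_feats):
        # image_feats: (batch, n_patches, dim) from a frozen image encoder
        b = image_feats.size(0)
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)  # (b, n_ctx, dim)
        # Cross-attention: each prompt token queries the visual features
        return self.decoder(tgt=ctx, memory=image_feats)

learner = DecodedPromptLearner()
prompts = learner(torch.randn(2, 49, 512))  # 49 patch features per image
print(prompts.shape)
```

The visually conditioned prompt tokens would then be concatenated with class-name embeddings and fed to the frozen text encoder, with the contrastive objective trained end to end over only the prompt and decoder parameters.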
Data availability
The data and code that support the results of this study are openly available in DeCoOp at https://github.com/JoyeMing/DeCoOp.git.
Acknowledgements
This work was partially supported by the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), and the Chongqing Normal University Fund (Grant No. 22XLB003).
Author information
Contributions
ML proposed the model idea, conducted the experiments, and wrote the main manuscript text; LM and HZ created the charts; ML proposed the topic of the paper, supervised the research, provided laboratory equipment and funding, and revised and polished the manuscript. All authors read the work before it was submitted.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Ethical approval
No ethical approval was required for the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, M., Zhao, H., Ma, L. et al. Modal interaction-enhanced prompt learning by transformer decoder for vision-language models. Int J Multimed Info Retr 12, 19 (2023). https://doi.org/10.1007/s13735-023-00287-4