
Modal interaction-enhanced prompt learning by transformer decoder for vision-language models

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

In the current multimodal retrieval field, CoOp is a preferred approach among many models because of its simplicity and strong adaptive capability. However, CoOp focuses primarily on optimizing prompts for contrastive learning, without considering image-text interaction or the effect of incorporating visual information into the prompts. In this work, we propose a prompt tuning method built on CoOp that models image-text interaction: Decoding Context Optimization (DeCoOp). In extensive experiments on 11 image classification datasets, our method surpasses CoOp on seven datasets under the few-shot setting and on all 11 datasets under the zero-shot setting. Experiments on four ImageNet target datasets show a performance improvement of more than 10%, demonstrating that our approach substantially outperforms the baseline CoOp in domain generalization and robustness. In addition, ablation experiments on three representative datasets confirm the effectiveness of DeCoOp and its further gains in accuracy. Finally, experiments on the 11 datasets with different visual backbones show that our approach outperforms handcrafted prompts by a large margin across all architectures and performs better than CoOp.
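To make the idea described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation, of how CoOp-style learnable context tokens could be refined by a transformer decoder that cross-attends to image features before CLIP-style contrastive matching. All module and parameter names (PromptDecoder, n_ctx, ctx_dim, clip_style_logits) are illustrative assumptions; the official code is linked under "Data availability" below.

```python
# Minimal sketch (assumed, not the authors' code): CoOp-style context vectors are
# conditioned on image features through a transformer decoder, then used for
# CLIP-style cosine-similarity matching.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptDecoder(nn.Module):
    """Learnable prompt tokens refined by cross-attention over image features."""

    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512, n_heads: int = 8, n_layers: int = 1):
        super().__init__()
        # CoOp-style learnable context vectors, shared across classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=ctx_dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, L, ctx_dim) patch/grid features from the visual backbone.
        B = image_tokens.size(0)
        ctx = self.ctx.unsqueeze(0).expand(B, -1, -1)       # (B, n_ctx, ctx_dim)
        # Prompts attend to image features, yielding image-conditioned prompts.
        return self.decoder(tgt=ctx, memory=image_tokens)


def clip_style_logits(image_feat: torch.Tensor, text_feat: torch.Tensor, logit_scale: float) -> torch.Tensor:
    # Cosine-similarity logits used for the contrastive objective:
    # image_feat (B, D), text_feat (C, D) -> logits (B, C).
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    return logit_scale * image_feat @ text_feat.t()
```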


Data availability

The data and code that support the results of this study are openly available in DeCoOp at https://github.com/JoyeMing/DeCoOp.git.


Acknowledgements

This work was partially supported by the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), and the Chongqing Normal University Fund (Grant No. 22XLB003).

Author information


Contributions

ML proposed the model idea, conducted the experiments, and wrote the main manuscript text; LM and HZ prepared the charts; ML provided the topic for the paper, guided the research, provided laboratory equipment and funding, and revised and polished the manuscript. All authors read the work before it was submitted.

Corresponding author

Correspondence to Mingyong Li.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Ethical approval

No ethical approval was required for the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, M., Zhao, H., Ma, L. et al. Modal interaction-enhanced prompt learning by transformer decoder for vision-language models. Int J Multimed Info Retr 12, 19 (2023). https://doi.org/10.1007/s13735-023-00287-4

