Abstract
In the current multimodal retrieval field, CoOp is a preferred approach among many models owing to its simplicity and strong adaptive capability. However, CoOp focuses primarily on optimizing prompts for contrastive learning, without considering image-text interaction or the effect of incorporating visual information into the prompts. In this work, we propose a CoOp-based prompt tuning method that models image-text interaction: Decoding Context Optimization (DeCoOp). In extensive experiments on 11 image classification datasets, our method surpasses CoOp on seven datasets under the few-shot setting and on all 11 datasets under the zero-shot setting. Experiments on four ImageNet target datasets show a performance improvement of more than 10%, demonstrating that our approach substantially outperforms the baseline CoOp in domain generalization and robustness. In addition, ablation experiments on three representative datasets confirm the effectiveness of DeCoOp and its further accuracy gains. Finally, experiments on the 11 datasets with different visual backbones show that, across all architectures, our approach maintains a large margin over handcrafted prompts and performs better than CoOp.
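The core idea described above, CoOp-style learnable context vectors enriched by a transformer decoder so that prompt tokens can attend to visual features, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimension, number of context tokens, single decoder layer, and the `DecodedPromptLearner` name are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecodedPromptLearner(nn.Module):
    """Sketch: learnable prompt context (as in CoOp) passed through a
    transformer decoder that cross-attends to image features, so the
    prompts carry visual information before text encoding."""

    def __init__(self, n_ctx=16, dim=512, n_heads=8):
        super().__init__()
        # CoOp: n_ctx learnable context vectors shared across classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Decoder layer: prompts (queries) attend to image features (memory)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)

    def forward(self, image_feats):
        # image_feats: (batch, n_patches, dim) from a frozen image encoder
        b = image_feats.size(0)
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)  # (b, n_ctx, dim)
        # Cross-attention: each prompt token queries the visual features
        return self.decoder(tgt=ctx, memory=image_feats)

learner = DecodedPromptLearner()
prompts = learner(torch.randn(2, 49, 512))  # 49 patch features per image
print(prompts.shape)
```

The visually conditioned prompt tokens would then be concatenated with class-name embeddings and fed to the frozen text encoder, with the contrastive objective trained end to end over only the prompt and decoder parameters.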
Data availability
The data and code that support the results of this study are openly available in DeCoOp at https://github.com/JoyeMing/DeCoOp.git.
Acknowledgements
This work was partially supported by the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), and the Chongqing Normal University Fund (Grant No. 22XLB003).
Author information
Contributions
ML proposed the model idea, conducted the experiments, and wrote the main manuscript text; LM and HZ created the charts; ML proposed the topic of the paper, supervised the research, provided laboratory equipment and funding, and revised and polished the manuscript. All authors read the work before it was submitted.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Ethical approval
No ethical approval was required for the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, M., Zhao, H., Ma, L. et al. Modal interaction-enhanced prompt learning by transformer decoder for vision-language models. Int J Multimed Info Retr 12, 19 (2023). https://doi.org/10.1007/s13735-023-00287-4