
Learning to Prompt for Vision-Language Models


Abstract

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while keeping the entire set of pre-trained parameters fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
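To make the learnable-context idea concrete, the sketch below shows one way to prepend trainable context vectors to frozen class-name token embeddings, assuming PyTorch; the module name, tensor shapes, and initialization are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of CoOp-style context optimization (illustrative assumptions: PyTorch,
# context length 16, embedding dim 512, and pre-computed frozen token embeddings
# for the class names; not the authors' exact code).
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    def __init__(self, class_token_embeds, n_ctx=16, ctx_dim=512):
        super().__init__()
        # Learnable context vectors shared by all classes ("unified context").
        # A class-specific variant would allocate (n_classes, n_ctx, ctx_dim).
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))
        # Frozen embeddings of each class name's tokens,
        # shape (n_classes, n_name_tokens, ctx_dim); never updated during training.
        self.register_buffer("class_token_embeds", class_token_embeds)

    def forward(self):
        n_classes = self.class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Each prompt is [context vectors][class-name tokens]; passing these
        # sequences through the frozen text encoder yields one classification
        # weight vector per class, so only self.ctx receives gradients.
        return torch.cat([ctx, self.class_token_embeds], dim=1)
```

In this sketch, the prompts returned by the module would be fed to a frozen CLIP-like text encoder and compared with image features, so the only parameters optimized on the few-shot data are the context vectors.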


Notes

  1. CoOp is pronounced as /ku:p/.

  2. https://github.com/openai/CLIP.

  3. We find that the negative results on Food101 for learning-based models, including CoOp and the linear probe, are caused by noisy training data with “intense colors and sometimes wrong labels” (Bossard et al., 2014).

References

  • Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., & et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258

  • Bossard, L., Guillaumin, M., & Van Gool, L. (2014). Food-101 – mining discriminative components with random forests. In ECCV

  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML

  • Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In CVPR

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR

  • Desai, K., & Johnson, J. (2021). Virtex: Learning visual representations from textual annotations. In CVPR

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR

  • Elhoseiny, M., Saleh, B., & Elgammal, A. (2013). Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR-W

  • Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., & Mikolov, T. (2013). Devise: A deep visual-semantic embedding model. In NeurIPS

  • Fürst, A., Rumetshofer, E., Tran, V., Ramsauer, H., Tang, F., Lehner, J., Kreil, D., Kopp, M., Klambauer, G., Bitto-Nemling, A., & et al. (2021). Cloob: Modern hopfield networks with infoloob outperform clip. arXiv preprint arXiv:2110.11316

  • Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., & Qiao, Y. (2021). Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544

  • Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723

  • Gomez, L., Patel, Y., Rusiñol, M., Karatzas, D., & Jawahar, C. (2017). Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR

  • Helber, P., Bischke, B., Dengel, A., & Borth, D. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

  • Hénaff, O. J., Srinivas, A., Fauw, J. D., Razavi, A., Doersch, C., Eslami, S. M. A., & van den Oord, A. (2020). Data-efficient image recognition with contrastive predictive coding. In ICML

  • Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., & Gilmer, J. (2021a). The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV

  • Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021b). Natural adversarial examples. In CVPR

  • Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In ICML

  • Jia, M., Tang, L., Chen, B. C., Cardie, C., Belongie, S., Hariharan, B., & Lim, S. N. (2022). Visual prompt tuning. arXiv preprint arXiv:2203.12119

  • Jiang, Z., Xu, F. F., Araki, J., & Neubig, G. (2020). How can we know what language models know? ACL

  • Joulin, A., Van Der Maaten, L., Jabri, A., & Vasilache, N. (2016). Learning visual features from large weakly supervised data. In ECCV

  • Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In ICCV-W

  • Lei Ba, J., Swersky, K., Fidler, S., & et al. (2015). Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV

  • Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691

  • Li, A., Jabri, A., Joulin, A., & van der Maaten, L. (2017). Learning visual n-grams from web data. In ICCV

  • Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190

  • Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., & Yan, J. (2021). Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208

  • Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021a). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586

  • Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021b). Gpt understands, too. arXiv preprint arXiv:2103.10385

  • Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151

  • Nilsback, M. E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In ICVGIP

  • Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. (2012). Cats and dogs. In CVPR

  • Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language models as knowledge bases? In EMNLP

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & et al. (2021). Learning transferable visual models from natural language supervision. In ICML

  • Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do imagenet classifiers generalize to imagenet? In ICML

  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In ACL

  • Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., & Singh, S. (2020). Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP

  • Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., & Kiela, D. (2021). Flava: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482

  • Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Manning, C. D., & Ng, A. Y. (2013). Zero-shot learning through cross-modal transfer. In NeurIPS

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402

  • Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., & Schmidt, L. (2020). Measuring robustness to natural distribution shifts in image classification. In NeurIPS

  • Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., & Isola, P. (2020). Rethinking few-shot image classification: a good embedding is all you need? In ECCV

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS

  • Wang, D., Shelhamer, E., Liu, S., Olshausen, B., & Darrell, T. (2020). Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726

  • Wang, H., Ge, S., Lipton, Z., & Xing, E. P. (2019). Learning robust global representations by penalizing local predictive power. In NeurIPS

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In CVPR

  • Yuan, L., Chen, D., Chen, Y. L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., & et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432

  • Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., & Langlotz, C. P. (2020). Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747

  • Zhong, Z., Friedman, D., & Chen, D. (2021). Factual probing is [mask]: Learning vs. learning to recall. In NAACL

  • Zhou, K., Liu, Z., Qiao, Y., Xiang, T., & Loy, C. C. (2021). Domain generalization in vision: A survey. arXiv preprint arXiv:2103.02503

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. arXiv preprint arXiv:2203.05557


Acknowledgements

This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Author information

Corresponding author

Correspondence to Kaiyang Zhou.

Additional information

Communicated by Diane Larlus.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A Dataset Details

The detailed statistics of the 11 datasets, as well as the four variants of ImageNet, are shown in Table 6. The hand-crafted prompts used for zero-shot CLIP are also detailed in the table. For Caltech101, the “BACKGROUND_Google” and “Faces_easy” classes are discarded. For the video dataset, UCF101, the middle frame of each video is used as input to the image encoder.

Table 6 Dataset statistics
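As an illustration of the UCF101 preprocessing described above, the sketch below extracts the middle frame of a video for the image encoder; it assumes OpenCV (cv2) is available, and the example path is hypothetical.

```python
# Sketch of the UCF101 preprocessing described above: use the middle frame of
# each video as the input image. Assumes OpenCV (cv2); the path is illustrative.
import cv2

def middle_frame(video_path: str):
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert BGR to RGB for the image encoder

# Example (hypothetical path): img = middle_frame("UCF101/ApplyEyeMakeup/v_001.avi")
```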

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhou, K., Yang, J., Loy, C.C. et al. Learning to Prompt for Vision-Language Models. Int J Comput Vis 130, 2337–2348 (2022). https://doi.org/10.1007/s11263-022-01653-1

