CLIP-Adapter: Better Vision-Language Models with Feature Adapters

  • Published in: International Journal of Computer Vision

Abstract

Large-scale contrastive vision-language pretraining has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in Radford et al. (International conference on machine learning, PMLR, 2021) that directly learns to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al. in Int J Comput Vis 130(9):2337–2348, 2022) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models other than prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pretrained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
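
As a concrete illustration of the bottleneck-plus-residual design described above, the sketch below shows a minimal adapter module in PyTorch. The hidden width, residual ratio, and the name FeatureAdapter are illustrative assumptions rather than the paper's exact configuration.

# Minimal sketch of a residual feature adapter in the spirit of CLIP-Adapter.
# The hidden width, residual ratio, and how the adapted features are consumed
# downstream are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    """Bottleneck MLP whose output is blended with the frozen pretrained feature."""

    def __init__(self, dim: int = 1024, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio  # weight given to the newly learned features
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Residual-style blending: keep most of the pretrained feature and
        # mix in a small fraction of the adapted feature.
        adapted = self.bottleneck(feat)
        return self.ratio * adapted + (1.0 - self.ratio) * feat


# Usage sketch: adapt frozen image features before the usual similarity scoring.
if __name__ == "__main__":
    adapter = FeatureAdapter(dim=1024)           # only these weights are trained
    image_feat = torch.randn(8, 1024)            # stand-in for frozen CLIP features
    blended = adapter(image_feat)
    blended = blended / blended.norm(dim=-1, keepdim=True)  # re-normalize
    print(blended.shape)                         # torch.Size([8, 1024])

Because the pretrained features are blended back in, setting the ratio to zero recovers the zero-shot behavior, which makes the residual design a conservative starting point for few-shot fine-tuning.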

Data Availability

No new data were created during the study. All experiments in this manuscript (training and evaluation) were conducted on 11 publicly available image classification datasets (Deng et al., 2009; Krause et al., 2013; Soomro et al., 2012; Fei-Fei et al., 2004; Nilsback and Zisserman, 2008; Xiao et al., 2010; Cimpoi et al., 2014; Helber et al., 2019; Maji et al., 2013; Parkhi et al., 2012; Bossard et al., 2014).

Change history

  • 29 October 2023

    The original version has been revised to update Fig. 1.

References

  • Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: a visual language model for few-shot learning. In A. H. Oh, A. Agarwal, D. Belgrave, et al. (Eds.), Advances in neural information processing systems. MIT Press.

  • Anderson, P., He, X., & Buehler, C., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.

  • Bossard, L., Guillaumin, M., & Van Gool, L. (2014). Food-101 – Mining discriminative components with random forests. In European conference on computer vision, Springer, pp. 446–461.

  • Brown, T., Mann, B., & Ryder, N., et al. (2020). Language models are few-shot learners. In NeurIPS.

  • Carion, N., Massa, F., & Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In ECCV.

  • Chen, Y. C., Li, L., & Yu, L., et al. (2020). Uniter: Learning universal image-text representations. In ECCV.

  • Cimpoi, M., Maji, S., & Kokkinos, I., et al. (2014). Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3606–3613.

  • Conneau, A., Khandelwal, K., & Goyal, N., et al. (2020). Unsupervised cross-lingual representation learning at scale. In ACL.

  • Deng, J., Dong, W., & Socher, R., et al. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.

  • Devlin, J., Chang, M. W., & Lee, K., et al. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

  • Dong, L., Yang, N., & Wang, W., et al. (2019). Unified language model pre-training for natural language understanding and generation. In NeurIPS.

  • Dosovitskiy, A., Beyer, L., & Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, IEEE, p. 178.

  • Gao, P., Jiang, Z., & You, H., et al. (2019). Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR.

  • Gao, P., Lu, J., & Li, H., et al. (2021a). Container: Context aggregation network. In NeurIPS.

  • Gao, P., Zheng, M., & Wang, X., et al. (2021b). Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3621–3630.

  • Gao, T., Fisch, A., & Chen, D. (2021c). Making pre-trained language models better few-shot learners. In ACL-IJCNLP.

  • Gu, Y., Han, X., & Liu, Z., et al. (2022). Ppt: Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp. 8410–8423.

  • He, J., Zhou, C., & Ma, X., et al. (2022). Towards a unified view of parameter-efficient transfer learning. In International conference on learning representations.

  • He, K., Zhang, X., & Ren, S., et al. (2016). Deep residual learning for image recognition. In CVPR.

  • Helber, P., Bischke, B., Dengel, A., et al. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7), 2217–2226.

  • Hendrycks, D., Basart, S., Mu, N., et al. (2021a). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349.

  • Hendrycks, D., Zhao, K., Basart, S., et al. (2021b). Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15262–15271.

  • Houlsby, N., Giurgiu, A., Jastrzebski, S., et al. (2019). Parameter-efficient transfer learning for nlp. In International conference on machine learning, PMLR, pp. 2790–2799.

  • Howard, A. G., Zhu, M., Chen, B., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

  • Hu, S., Zhang, Z., Ding, N., et al. (2022). Sparse structure search for parameter-efficient tuning. arXiv preprint arXiv:2206.07382

  • Jia, C., Yang, Y., Xia, Y., et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.

  • Jia, M., Tang, L., Chen, B. C., et al. (2022). Visual prompt tuning. In ECCV, pp. 709–727.

  • Jiang, Z., Xu, F. F., Araki, J., et al. (2020). How can we know what language models know? Transactions of the Association for Computational Linguistics, 8, 423–438.

  • Kim, J. H., Jun, J., & Zhang, B. T. (2018). Bilinear attention networks. In NIPS.

  • Krause, J., Stark, M., Deng, J., et al. (2013). 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pp. 554–561.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

  • Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. In EMNLP.

  • Li, C., Liu, H., Li, L. H., et al. (2022). ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. In Thirty-sixth conference on neural information processing systems datasets and benchmarks track.

  • Li, J., Selvaraju, R., Gotmare, A., et al. (2021). Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 9694–9705.

  • Li, X., Yin, X., Li, C., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.

  • Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. In ACL.

  • Lian, D., Zhou, D., Feng, J., et al. (2022). Scaling and shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems, 35, 109–123.

  • Liu, P., Yuan, W., Fu, J., et al. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1–35.

  • Liu, X., Zheng, Y., Du, Z., et al. (2021). Gpt understands, too. arXiv preprint arXiv:2103.10385

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Lu, J., Batra, D., Parikh, D., et al. (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.

  • Maji, S., Rahtu, E., Kannala, J., et al. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

  • Mao, M., Zhang, R., Zheng, H., et al. (2021). Dual-stream network for visual recognition. Advances in Neural Information Processing Systems, 34, 25346–25358.

  • Nilsback, M. E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In 2008 sixth Indian conference on computer vision, graphics & image processing, IEEE, pp. 722–729.

  • Parkhi, O. M., Vedaldi, A., Zisserman, A., et al. (2012). Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, IEEE, pp 3498–3505.

  • Radford, A., Wu, J., Child, R., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog.

  • Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, PMLR, pp. 8748–8763.

  • Recht, B., Roelofs, R., Schmidt, L., et al. (2019). Do imagenet classifiers generalize to imagenet? In International conference on machine learning, PMLR, pp. 5389–5400.

  • Ren, S., He, K., Girshick, R., et al. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.

  • Shin, T., Razeghi, Y., Logan IV, R. L., et al. (2020). Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  • Sun, T., Shao, Y., Qian, H., et al. (2022). Black-box tuning for language-model-as-a-service. In International conference on machine learning, PMLR, pp. 20841–20855.

  • Sung, Y. L., Cho, J., & Bansal, M. (2022a). Lst: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35, 12991–13005.

  • Sung, Y. L., Cho, J., & Bansal, M. (2022b). Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5227–5237.

  • Tan, H., & Bansal, M. (2019). Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP.

  • Touvron, H., Cord, M., Douze, M., et al. (2021). Training data-efficient image transformers and distillation through attention. In ICML.

  • Tsimpoukelli, M., Menick, J. L., Cabi, S., et al. (2021). Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34, 200–212.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(11).

  • Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In NIPS.

  • Wang, H., Ge, S., Lipton, Z., et al. (2019). Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32.

  • Wang, W., Bao, H., Dong, L., et al. (2022a). Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.

  • Wang, Z., Yu, J., Yu, A. W., et al. (2022b). SimVLM: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations.

  • Wortsman, M., Ilharco, G., Kim, J. W., et al. (2022). Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7959–7971.

  • Xiao, J., Hays, J., Ehinger, K. A., et al. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, IEEE, pp. 3485–3492.

  • Yao, Y., Zhang, A., Zhang, Z., et al. (2021). Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797

  • Yao, Y., Chen, Q., Zhang, A., et al. (2022). PEVL: Position-enhanced pre-training and prompt tuning for vision-language models. In Proceedings of the 2022 conference on empirical methods in natural language processing, pp. 11104–11117.

  • Yu, Z., Yu, J., Cui, Y., et al. (2019). Deep modular co-attention networks for visual question answering. In CVPR.

  • Zhou, K., Yang, J., Loy, C. C., et al. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337–2348.

Author information

Corresponding author

Correspondence to Peng Gao.

Additional information

Communicated by Liu Ziwei.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Result Comparison Under CLIP-Style Preprocessing

In Fig. 7, we present the few-shot learning results on 11 datasets under CLIP-style preprocessing. Compared with CoOp-style preprocessing, all methods improve under CLIP-style preprocessing. As in Fig. 3 of the main body, CLIP-Adapter still outperforms the other baselines across different shot settings.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, P., Geng, S., Zhang, R. et al. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Int J Comput Vis 132, 581–595 (2024). https://doi.org/10.1007/s11263-023-01891-x

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11263-023-01891-x
