CLIP-Adapter: Better Vision-Language Models with Feature Adapters

  • Published in: International Journal of Computer Vision

Abstract

Large-scale contrastive vision-language pretraining has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in Radford et al. (International conference on machine learning, PMLR, 2021) that directly learns to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al. in Int J Comput Vis 130(9):2337–2348, 2022) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models other than prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pretrained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
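
As a concrete illustration of the bottleneck-plus-residual design described above, the sketch below shows a minimal adapter module in PyTorch. The hidden width, residual ratio, and the name FeatureAdapter are illustrative assumptions rather than the paper's exact configuration.

# Minimal sketch of a residual feature adapter in the spirit of CLIP-Adapter.
# The hidden width, residual ratio, and how the adapted features are consumed
# downstream are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    """Bottleneck MLP whose output is blended with the frozen pretrained feature."""

    def __init__(self, dim: int = 1024, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio  # weight given to the newly learned features
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Residual-style blending: keep most of the pretrained feature and
        # mix in a small fraction of the adapted feature.
        adapted = self.bottleneck(feat)
        return self.ratio * adapted + (1.0 - self.ratio) * feat


# Usage sketch: adapt frozen image features before the usual similarity scoring.
if __name__ == "__main__":
    adapter = FeatureAdapter(dim=1024)           # only these weights are trained
    image_feat = torch.randn(8, 1024)            # stand-in for frozen CLIP features
    blended = adapter(image_feat)
    blended = blended / blended.norm(dim=-1, keepdim=True)  # re-normalize
    print(blended.shape)                         # torch.Size([8, 1024])

Because the pretrained features are blended back in, setting the ratio to zero recovers the zero-shot behavior, which makes the residual design a conservative starting point for few-shot fine-tuning.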

Data Availability

No new data were created during the study. All experiments in this manuscript (training and evaluation) were conducted on 11 publicly available image classification datasets (Deng et al., 2009; Krause et al., 2013; Soomro et al., 2012; Fei-Fei et al., 2004; Nilsback and Zisserman, 2008; Xiao et al., 2010; Cimpoi et al., 2014; Helber et al., 2019; Maji et al., 2013; Parkhi et al., 2012; Bossard et al., 2014).

Change history

  • 29 October 2023

    The original version has been revised to update Fig. 1.

References

  • Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: a visual language model for few-shot learning. In A. H. Oh, A. Agarwal, D. Belgrave, et al. (Eds.), Advances in neural information processing systems. MIT Press.

  • Anderson, P., He, X., & Buehler, C., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.

  • Bossard, L., Guillaumin, M., & Van Gool, L. (2014). Food-101 – Mining discriminative components with random forests. In European conference on computer vision, Springer, pp. 446–461.

  • Brown, T., Mann, B., & Ryder, N., et al. (2020). Language models are few-shot learners. In NeurIPS.

  • Carion, N., Massa, F., & Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In ECCV.

  • Chen, Y. C., Li, L., & Yu, L., et al. (2020). Uniter: Learning universal image-text representations. In ECCV.

  • Cimpoi, M., Maji, S., & Kokkinos, I., et al. (2014). Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3606–3613.

  • Conneau, A., Khandelwal, K., & Goyal, N., et al. (2020). Unsupervised cross-lingual representation learning at scale. In ACL.

  • Deng, J., Dong, W., & Socher, R., et al. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.

  • Devlin, J., Chang, M. W., & Lee, K., et al. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

  • Dong, L., Yang, N., & Wang, W., et al. (2019). Unified language model pre-training for natural language understanding and generation. In NeurIPS.

  • Dosovitskiy, A., Beyer, L., & Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, IEEE, p. 178.

  • Gao, P., Jiang, Z., & You, H., et al. (2019). Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR.

  • Gao, P., Lu, J., & Li, H., et al. (2021a). Container: Context aggregation network. In NeurIPS.

  • Gao, P., Zheng, M., & Wang, X., et al. (2021b). Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3621–3630.

  • Gao, T., Fisch, A., & Chen, D. (2021c). Making pre-trained language models better few-shot learners. In ACL-IJCNLP.

  • Gu, Y., Han, X., & Liu, Z., et al. (2022). Ppt: Pre-trained prompt tuning for few-shot learning. In Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp. 8410–8423.

  • He, J., Zhou, C., & Ma, X., et al. (2022). Towards a unified view of parameter-efficient transfer learning. In International conference on learning representations.

  • He, K., Zhang, X., & Ren, S., et al. (2016). Deep residual learning for image recognition. In CVPR.

  • Helber, P., Bischke, B., Dengel, A., et al. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7), 2217–2226.

  • Hendrycks, D., Basart, S., Mu, N., et al. (2021a). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349.

  • Hendrycks, D., Zhao, K., Basart, S., et al. (2021b). Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15262–15271.

  • Houlsby, N., Giurgiu, A., Jastrzebski, S., et al. (2019). Parameter-efficient transfer learning for nlp. In International conference on machine learning, PMLR, pp. 2790–2799.

  • Howard, A. G., Zhu, M., Chen, B., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

  • Hu, S., Zhang, Z., Ding, N., et al. (2022). Sparse structure search for parameter-efficient tuning. arXiv preprint arXiv:2206.07382

  • Jia, C., Yang, Y., Xia, Y., et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.

  • Jia, M., Tang, L., Chen, B. C., et al. (2022). Visual prompt tuning. In ECCV, pp. 709–727.

  • Jiang, Z., Xu, F. F., Araki, J., et al. (2020). How can we know what language models know? Transactions of the Association for Computational Linguistics, 8, 423–438.

  • Kim, J. H., Jun, J., & Zhang, B. T. (2018). Bilinear attention networks. In NIPS.

  • Krause, J., Stark, M., Deng, J., et al. (2013). 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pp. 554–561.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

  • Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. In EMNLP.

  • Li, C., Liu, H., Li, L. H., et al. (2022). ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. In Thirty-sixth conference on neural information processing systems datasets and benchmarks track.

  • Li, J., Selvaraju, R., Gotmare, A., et al. (2021). Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 9694–9705.

  • Li, X., Yin, X., Li, C., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.

  • Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. In ACL.

  • Lian, D., Zhou, D., Feng, J., et al. (2022). Scaling and shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems, 35, 109–123.

  • Liu, P., Yuan, W., Fu, J., et al. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1–35.

  • Liu, X., Zheng, Y., Du, Z., et al. (2021). Gpt understands, too. arXiv preprint arXiv:2103.10385

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Lu, J., Batra, D., Parikh, D., et al. (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.

  • Maji, S., Rahtu, E., Kannala, J., et al. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

  • Mao, M., Zhang, R., Zheng, H., et al. (2021). Dual-stream network for visual recognition. Advances in Neural Information Processing Systems, 34, 25346–25358.

  • Nilsback, M. E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In 2008 sixth Indian conference on computer vision, graphics & image processing, IEEE, pp. 722–729.

  • Parkhi, O. M., Vedaldi, A., Zisserman, A., et al. (2012). Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, IEEE, pp 3498–3505.

  • Radford, A., Wu, J., Child, R., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog.

  • Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, PMLR, pp. 8748–8763.

  • Recht, B., Roelofs, R., Schmidt, L., et al. (2019). Do imagenet classifiers generalize to imagenet? In International conference on machine learning, PMLR, pp. 5389–5400.

  • Ren, S., He, K., Girshick, R., et al. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.

  • Shin, T., Razeghi, Y., Logan IV, R. L., et al. (2020). Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  • Sun, T., Shao, Y., Qian, H., et al. (2022). Black-box tuning for language-model-as-a-service. In International conference on machine learning, PMLR, pp. 20841–20855.

  • Sung, Y. L., Cho, J., & Bansal, M. (2022a). Lst: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35, 12991–13005.

  • Sung, Y. L., Cho, J., & Bansal, M. (2022b). Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5227–5237.

  • Tan, H., & Bansal, M. (2019). Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP.

  • Touvron, H., Cord, M., Douze, M., et al. (2021). Training data-efficient image transformers and distillation through attention. In ICML.

  • Tsimpoukelli, M., Menick, J. L., Cabi, S., et al. (2021). Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34, 200–212.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(11).

  • Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. In NIPS.

  • Wang, H., Ge, S., Lipton, Z., et al. (2019). Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32.

  • Wang, W., Bao, H., Dong, L., et al. (2022a). Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.

  • Wang, Z., Yu, J., Yu, A. W., et al. (2022b). SimVLM: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations.

  • Wortsman, M., Ilharco, G., Kim, J. W., et al. (2022). Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7959–7971.

  • Xiao, J., Hays, J., Ehinger, K. A., et al. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, IEEE, pp. 3485–3492.

  • Yao, Y., Zhang, A., Zhang, Z., et al. (2021). Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797

  • Yao, Y., Chen, Q., Zhang, A., et al. (2022). PEVL: Position-enhanced pre-training and prompt tuning for vision-language models. In Proceedings of the 2022 conference on empirical methods in natural language processing, pp. 11104–11117.

  • Yu, Z., Yu, J., Cui, Y., et al. (2019). Deep modular co-attention networks for visual question answering. In CVPR.

  • Zhou, K., Yang, J., Loy, C. C., et al. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337–2348.

Author information

Corresponding author

Correspondence to Peng Gao.

Additional information

Communicated by Liu Ziwei.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Result Comparison Under CLIP-Style Preprocessing

In Fig. 7, we present the few-shot learning results on 11 datasets under CLIP-style preprocessing. Compared with CoOp-style preprocessing, all methods improve under CLIP-style preprocessing. As in Fig. 3 of the main body, CLIP-Adapter still outperforms the other baselines across different shot settings.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, P., Geng, S., Zhang, R. et al. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Int J Comput Vis 132, 581–595 (2024). https://doi.org/10.1007/s11263-023-01891-x

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11263-023-01891-x
