
Context-Aware Robust Fine-Tuning

International Journal of Computer Vision

Abstract

Contrastive language-image pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to “\(\mathtt {[CLASS]}\)” by measuring the similarity between the image and the prompt sentence “a \(\mathtt {[CONTEXT]}\) of \(\mathtt {[CLASS]}\)”. Thanks to exhaustive text cues in “\(\mathtt {[CONTEXT]}\)”, the CLIP model is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture the context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in the image. By minimizing the Kullback–Leibler divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT enables the context-aware ability of CLIP to be inherited by downstream tasks, and achieves both higher in-distribution (ID) and out-of-distribution (OOD) accuracy. Experimental results show that CAR-FT achieves superior robustness on five OOD test sets of ImageNet, while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous domain generalization (DG) methods and reaches 78.5% average accuracy on the DomainBed benchmark, setting a new state of the art.
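The regularizer described in the abstract can be made concrete in a few lines of PyTorch. The snippet below is a minimal sketch rather than the authors' released implementation: it assumes the context prompt weights are frozen zero-shot text embeddings of each prompt template averaged over class names, and the temperature, loss weight `lam`, and KL direction are illustrative choices of ours.

```python
import torch
import torch.nn.functional as F

def context_distribution(image_features, context_weights, temperature=0.01):
    """Log-softmax over contexts, induced by cosine similarity to context prompt weights."""
    image_features = F.normalize(image_features, dim=-1)
    context_weights = F.normalize(context_weights, dim=-1)
    logits = image_features @ context_weights.t() / temperature
    return F.log_softmax(logits, dim=-1)

def car_ft_loss(images, labels, ft_encoder, zs_encoder, context_weights,
                classifier_weights, lam=1.0, temperature=0.01):
    # Task loss on the fine-tuned image encoder (classifier_weights may be
    # zero-shot class prompt embeddings or a learned head).
    feats_ft = ft_encoder(images)
    logits = F.normalize(feats_ft, dim=-1) @ F.normalize(classifier_weights, dim=-1).t()
    ce = F.cross_entropy(logits / temperature, labels)

    # Context distributions from the frozen zero-shot encoder (target, no gradient)
    # and from the fine-tuned encoder.
    with torch.no_grad():
        p_zs = context_distribution(zs_encoder(images), context_weights, temperature).exp()
    log_p_ft = context_distribution(feats_ft, context_weights, temperature)

    # KL(p_zs || p_ft): keep the fine-tuned model's context distribution close to
    # the pre-trained one (the KL direction here is an illustrative choice).
    kld = F.kl_div(log_p_ft, p_zs, reduction="batchmean")
    return ce + lam * kld
```

Because the zero-shot encoder is frozen and wrapped in `torch.no_grad()`, only the fine-tuned image encoder receives gradients from the KLD term, so the regularizer steers fine-tuning toward preserving the pre-trained context awareness rather than altering it.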


Data Availability

ImageNet: https://www.image-net.org/
DomainBed: https://github.com/facebookresearch/DomainBed
Flowers: https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
Aircraft: https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
Pets: https://www.robots.ox.ac.uk/~vgg/data/pets/
Cars: https://ai.stanford.edu/~jkrause/cars/car_dataset.html
SUN397: https://vision.princeton.edu/projects/2010/SUN/
DTD: https://www.robots.ox.ac.uk/~vgg/data/dtd/
CLIP: https://github.com/openai/CLIP
OpenCLIP: https://github.com/mlfoundations/open_clip

Code Availability

We plan to open-source the code for the community in the future.

Notes

  1. https://github.com/openai/CLIP/blob/main/data/prompts.md.

  2. https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb.
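The prompt templates and prompt-engineering notebook referenced in these notes come from the OpenAI CLIP repository. As an illustrative sketch, not code taken from the paper, the snippet below shows how per-context prompt weights of the kind used by such a regularizer could be assembled with the `clip` package; the two templates and class names are toy examples standing in for the much longer lists linked in footnotes 1 and 2.

```python
import clip
import torch

templates = ["a photo of a {}.", "a sketch of a {}."]  # each template = one context
classnames = ["dog", "cat"]                            # toy class list

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    context_weights = []
    for template in templates:
        tokens = clip.tokenize([template.format(c) for c in classnames]).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        context_weights.append(emb.mean(dim=0))        # average over classes
    context_weights = torch.stack(context_weights)     # [num_contexts, embed_dim]
```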


Acknowledgements

This research is supported in part by the National Key Research and Development Program of China under Grant No. 2020AAA0140000.

Funding

This research is supported in part by the National Key Research and Development Program of China under Grant No. 2020AAA0140000.

Author information


Contributions

All authors contributed to the research conception and design. Methodology: [Xiaofeng Mao]; Material preparation: [Xiaofeng Mao], [Yuefeng Chen]; Formal analysis and investigation: [Rong Zhang], [Hui Xue], [Zhao Li]; Writing - original draft preparation: [Xiaofeng Mao]; Writing - review and editing: [Xiaofeng Mao], [Yuefeng Chen], [Xiaojun Jia].

Corresponding author

Correspondence to Xiaofeng Mao.

Ethics declarations

Financial interests

Xiaojun Jia received a Ph.D. stipend from the Institute of Information Engineering, Chinese Academy of Sciences. Xiaofeng Mao, Yuefeng Chen, Rong Zhang, and Hui Xue received salaries from the Alibaba Group. Zhao Li received a salary from Zhejiang University.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Communicated by Oliver Zendel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 157 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mao, X., Chen, Y., Jia, X. et al. Context-Aware Robust Fine-Tuning. Int J Comput Vis 132, 1685–1700 (2024). https://doi.org/10.1007/s11263-023-01951-2

