
Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization

Published in: International Journal of Computer Vision

Abstract

Recent advances in fine-tuning large-scale vision-language pre-trained models (VL-PTMs) have shown promising results in quick adaption to downstream tasks. However, prior research often lacks comprehensive investigation into out-of-distribution (OOD) generalization. Fine-tuning has a potential risk of overfitting, especially on few-shot OOD datasets when significant distribution shifts occur between the few-shot training examples and test sets. Previous research on fine-tuning’s robustness to distribution shifts does not consider different characteristics of distribution shifts and may not effectively handle noisy data with spurious correlations. To address these challenges, we propose the Vision-Language Alignment Learning under Affinity and Divergence Principles (VLAD) to adapt VL-PTMs to robust few-shot OOD generalization with theoretical guarantees. Built upon the large-scale pre-trained vision-language foundation model CLIP, we leverage frozen language embeddings as invariant anchors to protect against distribution shifts, while using adapter layers to fine-tune pre-trained visual features for improved vision-language alignment. Besides, we introduce affinity and divergence principles to further mitigate overfitting during the vision-language aligning process by increasing class discrimination and suppressing non-causal features. More importantly, we offer theoretical evidence highlighting the superiority of general language knowledge in achieving more robust OOD generalization performance. The tighter upper bound of the OOD generalization errors by the proposed regularization loss is also shown in theoretical analysis. Our approach is substantiated by extensive experiments and ablation studies on diverse datasets, validating our theoretical findings. The code is available at https://github.com/LinLLLL/VLAD.


Data Availability Statement

The authors confirm that the data supporting the findings of this study are openly available at https://paperswithcode.com/dataset/colored-mnist for ColoredMNIST, at https://paperswithcode.com/dataset/celeba for CelebA, at https://paperswithcode.com/dataset/pacs for PACS, at https://paperswithcode.com/dataset/vlcs for VLCS, and at https://pan.baidu.com/s/1_m99H_Lmlyi155VyUsEyRQ?pwd=cxpw for ColoredCatsDogs. The other datasets used in the ablation studies (Sect. 4.3) can be downloaded by referring to the “DATASETS.md” file in the GitHub project https://github.com/KaiyangZhou/CoOp. Our code is available upon reasonable request.

Notes

  1. Utilizing the multiple linear regression models of Eq. 2, each aligned with the language feature of a class name, we can obtain normalized predicted probabilities over the classes via a softmax operation.
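For concreteness, the sketch below (illustrative only; the variable names and dimensions are hypothetical) mirrors this rule: each class receives one alignment score from a linear-regression-style model against its class-name language feature, and the scores are normalized with a softmax.

```python
import numpy as np

def predict_probabilities(visual_feature, class_language_features):
    """Hypothetical sketch: score each class by aligning the (adapted) visual
    feature with that class's language feature, then normalize with a softmax."""
    scores = np.array([visual_feature @ lang_feat      # one alignment score per class
                       for lang_feat in class_language_features])
    exp_scores = np.exp(scores - scores.max())         # subtract max for stability
    return exp_scores / exp_scores.sum()               # normalized probabilities

# Toy usage: 3 classes, 512-dimensional features.
rng = np.random.default_rng(0)
probs = predict_probabilities(rng.normal(size=512),
                              [rng.normal(size=512) for _ in range(3)])
print(probs, probs.sum())    # probabilities over the 3 classes, summing to 1
```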

References

  • Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., & Kim, B. (2018). Sanity checks for saliency maps. arXiv:1810.03292v3

  • Ahuja, K., Shanmugam, K., Varshney, K., & Dhurandhar, A. (2020). Invariant risk minimization games. In International Conference on Machine Learning, pp. 145–155. PMLR.

  • Akuzawa, K., Iwasawa, Y., & Matsuo, Y. (2019). Adversarial invariant feature learning with accuracy constraint for domain generalization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 315–331. Springer.

  • Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv:1907.02893.

  • Arpit, D., Wang, H., Zhou, Y., & Xiong, C. (2022). Ensemble of averages: Improving model selection and boosting performance in domain generalization. Advances in Neural Information Processing Systems, 35, 8265–8277.

  • Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., & Müller, K.-R. (2010). How to explain individual classification decisions. Journal of Machine Learning Research, 11(61), 1803–1831.

  • Bahng, H., Chun, S., Yun, S., Choo, J., & Oh, S. J. (2020). Learning de-biased representations with biased representations. In: International Conference on Machine Learning, pp. 528–539. PMLR.

  • Bai, H., Sun, R., Hong, L., Zhou, F., Ye, N., Ye, H.-J., Chan, S.-H.G., & Li, Z. (2021). Decaug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 35, 6705–6713.

  • Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., & Katz, B. (2019). Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS.

  • Beery, S., Agarwal, A., Cole, E., & Birodkar, V. (2021). The iwildcam 2021 competition dataset. arXiv:2105.03494.

  • Blanchard, G., Deshmukh, A. A., Dogan, Ü., Lee, G., & Scott, C. (2021). Domain generalization by marginal transfer learning. The Journal of Machine Learning Research, 22(1), 46–100.

  • Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., & Qiao, Y. (2022). Vision transformer adapter for dense predictions. arXiv:2205.08534.

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR.

  • Chen, H., Tao, R., Zhang, H., Wang, Y., Ye, W., Wang, J., Hu, G., & Savvides, M. (2022). Conv-adapter: Exploring parameter efficient transfer learning for convnets. arXiv:2208.07463.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014) Decaf: A deep convolutional activation feature for generic visual recognition. In: Xing, E. P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 647–655. PMLR

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., & Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929

  • Dou, Q., Coelho de Castro, D., Kamnitsas, K., & Glocker, B. (2019). Domain generalization via model-agnostic learning of semantic features. arXiv:1910.13580v1

  • Du, Y., Xu, J., Xiong, H., Qiu, Q., Zhen, X., Snoek, C. G., & Shao, L. (2020). Learning to learn with variational information bottleneck for domain generalization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pp. 200–216. Springer.

  • Du, Y., Zhen, X., Shao, L., & Snoek, C. G. (2021). Metanorm: Learning to normalize few-shot batches across domains. In International Conference on Learning Representations.

  • Fan, Z., Ma, Y., Li, Z., & Sun, J. (2021). Generalized few-shot object detection without forgetting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4527–4536.

  • Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. PMLR.

  • Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv:2012.15723.

  • Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., & Qiao, Y. (2021). Clip-adapter: Better vision-language models with feature adapters. arXiv:2110.04544.

  • Goyal, S., Kumar, A., Garg, S., Kolter, Z., & Raghunathan, A. (2023). Finetune like you pretrain: Improved finetuning of zero-shot vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19338–19347.

  • Gulrajani, I., & Lopez-Paz, D. (2020). In search of lost domain generalization. arXiv:2007.01434.

  • Hao, T., Chen, H., Guo, Y., & Ding, G. (2023). Consolidator: Mergable adapter with group connections for vision transformer. In International Conference on Learning Representations.

  • Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., & Guo, M., et al. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349.

  • Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021). Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271.

  • He, Y., Shen, Z., & Cui, P. (2021). Towards non-iid image classification: A dataset and baselines. Pattern Recognition, 110, 107383.

  • Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. PMLR.

  • Huang, L., Niu, G., Liu, J., Xiao, X., & Wu, H. (2022). Du-vlg: Unifying vision-and-language generation via dual sequence-to-sequence pre-training. arXiv:2203.09052.

  • Huang, Z., Wang, H., Huang, D., Lee, Y. J., & Xing, E. P. (2022). The two dimensions of worst-case training and their integrated effect for out-of-domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9631–9641.

  • Huang, Z., Wang, H., Xing, E.P., & Huang, D. (2020). Self-challenging improves cross-domain generalization. arXiv:2007.02454.

  • Immer, A., Hennigen, L.T., Fortuin, V., & Cotterell, R. (2021). Probing as quantifying inductive bias. arXiv:2110.08388.

  • Jain, A., Guo, M., Srinivasan, K., Chen, T., Kudugunta, S., Jia, C., Yang, Y., & Baldridge, J. (2021). Mural: Multimodal, multitask retrieval across languages. arXiv:2109.05125.

  • Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., & Lim, S.-N. (2022). Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pp. 709–727. Springer.

  • Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916. PMLR.

  • Jiang, J., Liu, Z., & Zheng, N. (2023). Correlation information bottleneck: Towards adapting pretrained multimodal models for robust visual question answering. International Journal of Computer Vision, 132(1), 185–207.

  • Khandelwal, P., & Yushkevich, P. (2020). Domain generalizer: A few-shot meta learning framework for domain generalization in medical imaging. In Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning: Second MICCAI Workshop, DART 2020, and First MICCAI Workshop, DCL 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4–8, 2020, Proceedings 2, pp. 73–84. Springer.

  • Kirichenko, P., Izmailov, P., & Wilson, A. G. (2022). Last layer re-training is sufficient for robustness to spurious correlations. arXiv:2204.02937.

  • Koh, P.W., Sagawa, S., Marklund, H., Xie, S.M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R.L., & Gao, I. (2021). Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pp. 5637–5664

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

  • Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., & Courville, A. (2021). Out-of-distribution generalization via risk extrapolation (rex). In International Conference on Machine Learning, pp. 5815–5826. PMLR.

  • Kumar, A., Raghunathan, A., Jones, R., Ma, T., & Liang, P. (2022). Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv:2202.10054.

  • Lee, Y., Chen, A. S., Tajwar, F., Kumar, A., Yao, H., Liang, P., & Finn, C. (2022). Surgical fine-tuning improves adaptation to distribution shifts. arXiv:2210.11466.

  • Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., & Yan, J. (2021). Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In International Conference on Learning Representations.

  • Li, D., Yang, Y., Song, Y.-Z., & Hospedales, T. M. (2017). Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5550.

  • Li, D., Yang, Y., Song, Y.-Z., & Hospedales, T. (2018). Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32.

  • Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., & Wei, F., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp. 121–137. Springer.

  • Lin, Y., Dong, H., Wang, H., & Zhang, T. (2022). Bayesian invariant risk minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16021–16030.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.

  • Lu, C., Wu, Y., Hernández-Lobato, J. M., & Schölkopf, B. (2021). Invariant causal representation learning for out-of-distribution generalization. In International Conference on Learning Representations.

  • Mancini, M., Akata, Z., Ricci, E., & Caputo, B. (2020). Towards recognizing unseen categories in unseen domains. In European Conference on Computer Vision, pp. 466–483. Springer.

  • Ming, Y., Yin, H., & Li, Y. (2022). On the impact of spurious correlation for out-of-distribution detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 36, 10051–10059.

  • Muandet, K., Balduzzi, D., & Schölkopf, B. (2013). Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. PMLR.

  • Nakamura, A., & Harada, T. (2019). Revisiting fine-tuning for few-shot learning. arXiv:1910.00216.

  • Nichol, A., Achiam, J., & Schulman, J. (2018). On first-order meta-learning algorithms. arXiv:1803.02999.

  • Pantazis, O., Brostow, G., Jones, K., & Mac Aodha, O. (2022) Svl-adapter: Self-supervised adapter for vision-language pretrained models. arXiv:2210.03794.

  • Peng, D., & Pan, S. J. (2022). Learning gradient-based mixup towards flatter minima for domain generalization. arXiv:2209.14742.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR.

  • Rame, A., Dancette, C., & Cord, M. (2022). Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pp. 18347–18377. PMLR.

  • Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp. 5389–5400. PMLR.

  • Rojas-Carulla, M., Schölkopf, B., Turner, R., & Peters, J. (2018). Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1), 1309–1342.

  • Sato, T., Shen, J., Wang, N., Jia, Y.J., Lin, X., & Chen, Q. A. (2020). Security of deep learning based lane keeping system under physical-world adversarial attack. arXiv:2003.01782.

  • Shen, Z., Cui, P., Zhang, T., & Kunag, K. (2020). Stable learning via sample reweighting. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 5692–5699.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

  • Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034.

  • Stojanov, P., Li, Z., Gong, M., Cai, R., Carbonell, J., & Zhang, K. (2021). Domain adaptation with invariant representation learning: What transformations to learn? Advances in Neural Information Processing Systems, 34, 24791–24803.

  • Sung, Y.-L., Cho, J., & Bansal, M. (2022). Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5227–5237.

  • Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1521–1528.

  • Tseng, H.-Y., Lee, H.-Y., Huang, J.-B., & Yang, M.-H. (2020). Cross-domain few-shot classification via learned feature-wise transformation. arXiv:2001.08735.

  • Wang, H., Ge, S., Lipton, Z., & Xing, E. P. (2019). Learning robust global representations by penalizing local predictive power. arXiv:1905.13549v2

  • Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., & Yu, P. (2022). Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 35(8), 8052–8072.

  • Weber, M.G., Li, L., Wang, B., Zhao, Z., Li, B., & Zhang, C. (2022). Certifying out-of-domain generalization for blackbox functions. In International Conference on Machine Learning, pp. 23527–23548. PMLR.

  • Wortsman, M., Gururangan, S., Li, S., Farhadi, A., Schmidt, L., Rabbat, M., & Morcos, A. S. (2022). Lo-fi: distributed fine-tuning without communication. arXiv:2210.11948

  • Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., & Kornblith, S., et al. (2022). Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR.

  • Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., & Namkoong, H., et al. (2022). Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971.

  • Wu, C.-E., Tian, Y., Yu, H., Wang, H., Morgado, P., Hu, Y. H., & Yang, L. (2023). Why is prompt tuning for vision-language models robust to noisy labels? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15488–15497.

  • Wu, J., Zou, D., Braverman, V., Gu, Q., & Kakade, S. (2022). The power and limitation of pretraining-finetuning for linear regression under covariate shift. Advances in Neural Information Processing Systems, 35, 33041–33053.

  • Xing, Y., Wu, Q., Cheng, D., Zhang, S., Liang, G., & Zhang, Y. (2022). Class-aware visual prompt tuning for vision-language pre-trained model. arXiv:2208.08340.

  • Ye, N., Li, K., Hong, L., Bai, H., Chen, Y., Zhou, F., & Li, Z. (2021). Ood-bench: Benchmarking and understanding out-of-distribution generalization datasets and algorithms. arXiv:2106.03721.

  • You, K., Liu, Y., Wang, J., & Long, M. (2021). Logme: Practical assessment of pre-trained models for transfer learning. In International Conference on Machine Learning, pp. 12133–12143. PMLR.

  • Zhang, R., Fang, R., Gao, P., Zhang, W., Li, K., Dai, J., Qiao, Y., & Li, H. (2021). Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv:2111.03930.

  • Zhang, X., Iwasawa, Y., Matsuo, Y., & Gu, S. S. (2021). Amortized prompt: Guide clip to domain transfer learning. arXiv:2111.12853.

  • Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., & Gao, J. (2021). Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579–5588.

  • Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1), 56–85.

  • Zhang, G., Zhao, H., Yu, Y., & Poupart, P. (2021). Quantifying and improving transferability in domain generalization. Advances in Neural Information Processing Systems, 34, 10957–10970.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2021). Learning to prompt for vision-language models. arXiv:2109.01134.

  • Zhou, K., Yang, J., Loy, C.C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Download references

Acknowledgements

Nanyang Ye is the corresponding author. This work was supported by the Natural Science Foundation of China under Grants Nos. 42050105 and 62106139, and was partially supported by the National Key R&D Program of China (No. 2022ZD0160101).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Nanyang Ye.

Additional information

Communicated by Massimiliano Mancini.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A More Information on Experiments

1.1 Experiment Results on the Two Types of Distribution Shifts Under Various Few-Shot Training Numbers

We conduct experiments on OOD datasets including VLCS, PACS, CCD, and CelebA under low-shot conditions, following the same evaluation protocol as CLIP-based methods. As shown in Table 10, our proposed VLAD achieves the best generalization performance across different few-shot training numbers compared to its competitors. Compared to the single-modality LP-V, vision-language models prove to be highly competitive in few-shot OOD generalization, maintaining stable performance in low-shot settings. The adapter-tuning methods, including CLIP-Adapter, Tip-Adapter-F, and our VLAD, benefit from preserved general language knowledge and outperform prompt-tuning methods in terms of generalization. Furthermore, under our proposed two principles, VLAD significantly enhances the generalization ability of previous adapter-tuning methods, particularly as the number of shots increases.

1.2 Hyperparameter Search Spaces of the Experiments on OOD Generalization Benchmark Datasets

The hyperparameter search spaces for the many-shot experiments based on OODBench (Ye et al., 2021) in Sect. 4 are presented in Table 11.

Table 11 Hyperparameters and distributions for random search based on OODBench’s code

The hyperparameter search spaces for the few-shot experiments based on CoOp’s code (Zhou et al., 2021) in Sect. 4 are presented in Table 12.

Table 12 Hyperparameters and distributions for random search based on CoOp’s code

Appendix B Notations and Preliminaries for The Proposed Theorems

Before proving our theorems, we first introduce important notations and preliminaries for clarity.

1.1 Notations

Rowspace  Suppose \(r_1, r_2,\ldots , r_m\) are orthogonal row vectors of a matrix F with rank m; the set of all possible linear combinations of \(r_1,\ldots , r_m\) is called the rowspace of F. We denote the rowspace of a matrix as \(rowspace(\cdot )\).

Singular Values  The singular values of a matrix F are the square roots of the non-negative eigenvalues of \(F^{\top }F\). We denote the maximum singular value and the minimum singular value of a matrix as \(\sigma _{\max }(\cdot )\) and \(\sigma _{\min }(\cdot )\), respectively.

Norm of a Matrix  The L2-Norm of a vector is the square root of the sum of the squares of each element, and the L2-Norm of a Matrix F is the maximum singular value of F, i.e. \(\Vert F \Vert _2 = \sigma _{max}(F)\).

We denote the L2-Norm of a matrix as \(\Vert \cdot \Vert _2\), the maximum singular value of a matrix as \(\sigma _{\max }(\cdot )\), and the maximum principal angle between two subspaces as \(\theta _{\max }(\cdot , \cdot )\).
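As a quick numerical illustration of these definitions (a sketch, not part of the released code), one can check with NumPy that the singular values of F are the square roots of the non-negative eigenvalues of \(F^{\top }F\) and that the matrix L2-norm equals \(\sigma _{\max }(F)\):

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3))

# Singular values of F ...
sv = np.linalg.svd(F, compute_uv=False)

# ... equal the square roots of the non-negative eigenvalues of F^T F.
eig = np.sort(np.linalg.eigvalsh(F.T @ F))[::-1]
print(np.allclose(sv, np.sqrt(eig)))                    # True

# The matrix L2-norm equals the maximum singular value sigma_max(F).
print(np.isclose(np.linalg.norm(F, ord=2), sv.max()))   # True
```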

Following the main paper, we denote the classical linear probing model (LP-V) as:

$$\begin{aligned} Y=X A_{lp}^{\top }=X V_0^{\top } {\varvec{h}}_{lp} \end{aligned}$$
(B1)

where \( A_{lp} \in {\mathbb {R}}^{d \times 1}\) and \({\varvec{h}}_{lp} \in {\mathbb {R}}^{k \times 1}\), and \(V_0\) is the fixed pre-trained model.

The generic form of the proposed method is:

$$\begin{aligned} Y=XV_0^{\top } G_{al} {\varvec{l}}_0 \end{aligned}$$
(B2)

where \(V_0 \in {\mathbb {R}}^{k \times d}\) denotes the image feature extractor; \( {\varvec{l}}_0 \in {\mathbb {R}}^{d_g \times 1}\) (\(k \ge d_g\)) is the fixed language embedding; and \(G_{al} \in {\mathbb {R}}^{k \times d_g}\) is the visual feature adapter to achieve better alignment between \( {\varvec{l}}_0\) and \(XV_0^{\top }\).
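To make the two generic forms concrete, the following NumPy sketch instantiates Eqs. (B1) and (B2) with small, hypothetical dimensions. Here X, \(V_0\), and the normalized language embedding \({\varvec{l}}_0\) are frozen, while \({\varvec{h}}_{lp}\) (LP-V) and \(G_{al}\) (VLAD) are the trainable parts; the values are random placeholders rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, d_g = 8, 64, 32, 16          # hypothetical sizes with k >= d_g

X = rng.normal(size=(n, d))           # input data
V0 = rng.normal(size=(k, d))          # frozen pre-trained visual feature extractor
Z = X @ V0.T                          # frozen visual features X V_0^T

# LP-V (Eq. B1): a trainable linear probe h_lp on top of the frozen features.
h_lp = rng.normal(size=(k, 1))
Y_lpv = Z @ h_lp                      # Y = X V_0^T h_lp

# VLAD (Eq. B2): a trainable adapter G_al aligning Z with the frozen,
# L2-normalized language embedding l_0 of a class name.
l0 = rng.normal(size=(d_g, 1))
l0 /= np.linalg.norm(l0)
G_al = rng.normal(size=(k, d_g))
Y_vlad = Z @ G_al @ l0                # Y = X V_0^T G_al l_0

print(Y_lpv.shape, Y_vlad.shape)      # both (n, 1)
```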

The optimal solutions of \(G\), \(V\), and \({\varvec{l}}\) in the aligned form of Eq. (B5) are \(G_\star \), \({\varvec{V}}_\star \), and \({\varvec{l}}_\star \), respectively. Let the feature extractor \(XV_0^{\top } \in {\mathbb {R}}^{n \times k}\) and \({\mathcal {R}}_0 = rowspace(XV_0^{\top })\); we assume that \(XV_0^{\top }\) contains both causal information and non-causal information. Based on this assumption, let \(Z=XV_0^\top \), \(Z_{\text {id}}=X_{\text {id}}V_0^\top \) (\(X_{\text {id}}\) is the input data from \(P_{\text {id}}\)), and \(Z_{\text {ood}}=X_{\text {ood}}V_0^\top \) (\(X_{\text {ood}}\) is the input data from \(P_{\text {ood}}\)); let \(Z_c\) denote the causal feature, \(Z_{n_1}\) the non-causal feature from \(Z_{\text {id}}\), and \(Z_{n_2}\) the non-causal feature from \(Z_{\text {ood}}\). We define \({\mathcal {Q}} = rowspace(Z_c)\), \({\mathcal {S}}_{n_1} = rowspace(Z_{n_1})\), and \({\mathcal {S}}_{n_2} = rowspace(Z_{n_2})\). So we have \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {Q}}) > 0\), \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {S}}_{n_1}) > 0\), and \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {S}}_{n_2}) > 0\).

1.2 Analysis of LP-V’s Adapter Layer

Lemma 1

In the overparameterized setting, for all times t and all \(A^t_{lp} \in {\mathbb {R}}^{d \times 1}\), \({\exists }~ {\varvec{l}}_{lp}^t \in {\mathbb {R}}^{k \times 1}\), \(\mathrm { s.t. } ~A_{lp}^t = V_0 {\varvec{l}}_{lp}^t\), where \(V_0\) has orthogonal rows forming a basis of \({\mathcal {R}}_0\) (\({\mathcal {R}}_0 = rowspace(XV_0^\top )\)).

Proof

Let \(Dim(\cdot )\) denote the dimension of a rowspace. According to the data assumption that \(Dim(rowspace(X)) < d - k\), for all \( A_{lp}^t \in {\mathbb {R}}^{d \times 1}\), given the feature extractor \(V_0\) with k orthogonal rows, there exists \({\varvec{l}}_{lp}^t \in {\mathbb {R}}^{k \times 1}\), \(\mathrm { s.t. }~ XV_0^\top {\varvec{l}}_{lp}^t = XA_{lp}^{t\top }\), i.e. \(A_{lp}^t = V_0 {\varvec{l}}_{lp}^t\). \(\square \)

Lemma 2

(Equivalent Model of LP-V’s Adapter Layer) In the high-dimensional space, the L2-Normalized language embedding \( {\varvec{l}}_0 \in {\mathcal {V}} = {\mathbb {R}} ^{d_g \times 1}\) and \(rowspace( {\varvec{l}}_0) = k^\prime \), then \(k^\prime \approx d_g\) holds for a large enough \(d_g\). With the data assumption \(d_g \le k\), for \({\forall }~ {\varvec{h}}_{lp}^t \in {\mathbb {R}}^{k \times 1}\), we can find a \(G_{lp}^t \in {\mathbb {R}}^{k \times d_g}\) and a \({\varvec{l}}_{lp}^t \in {\mathbb {R}}^{d_g \times 1}\), with \({\mathbb {P}}({\varvec{l}}_{lp}^t \ne {\varvec{l}}_0) >1-\delta _0\), \(\mathrm { s.t. } ~{\varvec{h}}_{lp}^t = G_{lp}^t {\varvec{l}}_{lp}^t\), where \(\delta _0 \in [0, 1]\).

Proof

For all \({\varvec{h}}_{lp}^t \in {\mathcal {H}} ={\mathbb {R}}^{k \times 1}\), we have \(rowspace({\varvec{h}}_{lp}^t) = k\), \({Dim}(rowspace(G_{lp} {\varvec{l}}_0))=d_g \le k\). Define a measure M on the measurable spaces \({\mathcal {H}}\) and \({\mathcal {V}}\), such that \(\delta _0 = M({\mathcal {V}} \cap {\mathcal {H}})/ M({\mathcal {V}} \cup {\mathcal {H}}) \in [0, 1]\). Then with \({\mathbb {P}}({\varvec{l}}_{lp}^t \ne {\varvec{l}}_0) >1-\delta _0\), we have \({\varvec{h}}_{lp}^t = G_{lp}^t{\varvec{l}}_{lp}^t\). That is, the equation \({\varvec{h}}_{lp}^t \equiv G_{lp}^t {\varvec{l}}_0\) does not always hold. \(\square \)

To facilitate the comparison of OOD Error between VLAD and LP-V, we express the two models in an aligned form. Based on Lemma 2, we have LP-V’s equivalent model in VLAD’s vision-language alignment form as follows:

Lemma 3

(LP-V’s Equivalent Model in VLAD’s Vision-Language Alignment Form) In the high-dimensional space, we denote LP-V’s adapter layer at training step t as \({\varvec{h}}_{lp}^t \in {\mathbb {R}}^{k \times 1}\). For \({\forall }~ {\varvec{h}}_{lp}^t \), with the data assumption \(d_g \le k\) (\(d_g\) is the dimension of the language embedding \( {\varvec{l}}_0\)), there exist a \(G_{lp}^t \in {\mathbb {R}}^{k \times d_g}\) and an \({\varvec{l}}_{lp}^t \in {\mathbb {R}}^{d_g \times 1}\), which have the same dimensions as VLAD’s adapter \(G_{al}\) and VLAD’s language embedding \( {\varvec{l}}_0\), respectively, such that \(XV_0^\top {\varvec{h}}_{lp}^t = XV_0^\top G_{lp}^t {\varvec{l}}_{lp}^t\) with \({\mathbb {P}}({\varvec{l}}_{lp}^t \ne {\varvec{l}}_0) > 1-\delta _0 \) \((\delta _0 \in [0, 1])\).

According to this Lemma, the generic form of LP-V can be rewritten as:

$$\begin{aligned} Y^t = X V_0^{\top } {\varvec{h}}_{lp}^t = X V_0^{\top } G_{lp}^t {\varvec{l}}_{lp}^t \end{aligned}$$
(B3)

where \(Y^t = X V_0^{\top } {\varvec{h}}_{lp}^t\) is equivalent to \(Y^t=X V_0^{\top } G_{lp}^t {\varvec{l}}_{lp}^t\) with \(G_{lp}^t\) and \({\varvec{l}}_{lp}^t\) dependent on \( {\varvec{h}}_{lp}^t\) at each training step t. Without any constraint on LP-V’s adapter layer \( {\varvec{h}}_{lp}^t\), LP-V tends to update \({\varvec{h}}_{lp}^t\) relying on spurious features. Consequently, in the equivalent model (B3) derived from \({\varvec{h}}_{lp}^t\), the terms \(G_{lp}^t\) and \({\varvec{l}}_{lp}^t\) would diverge substantially from \(G_{al}^t\) and \({\varvec{l}}_0\). Therefore, we can compare VLAD and LP-V by:

$$\begin{aligned} \left\{ \begin{array}{l} Y_{\text {VLAD}}^t=X V_0^{\top } G_{al}^t {\varvec{l}}_0\\ Y_{\text {LP-V}}^t = X V_0^{\top } G_{lp}^t {\varvec{l}}_{lp}^t \end{array}\right. \end{aligned}$$
(B4)

The two equations share the aligned form:

$$\begin{aligned} Y=XV_0^{\top }G{\varvec{l}} \end{aligned}$$
(B5)

For this aligned form, we assume that the optimal solutions of the adapter layer G and the language embedding \({\varvec{l}}\) are \(G_\star \) and \({\varvec{l}}_\star \), respectively.

We emphasize that Lemma 3 expresses an equivalent form of LP-V in the vision-language alignment form, where the equation \({\varvec{h}}_{lp}^t = G_{lp}^t {\varvec{l}}_{lp}^t\) holds with \(G_{lp}^t\) and \({\varvec{l}}_{lp}^t\) dependent on \({\varvec{h}}_{lp}^t\) at each training step t. This facilitates analyzing the lower bound of LP-V’s OOD Error (\(Y=XV_0^\top {\varvec{h}}_{lp}\)).

1.3 Statistical Behaviour of MSE and Cross-Entropy Loss

According to Zhang (2004), we first introduce the relationship between the least square loss and the Cross-Entropy loss in logistic regression. Denote the DNN predictor as f and the loss function as \(\phi \). Given a set of example pairs \((x, y)\) in the binary classification problem where \(y\in \{-1, +1\}\), and following the prediction rule: predict \(y=1\) if \(f(x)\ge 0\) and \(y=-1\) otherwise, the risk of the DNN predictor f is defined as

$$\begin{aligned} Q(f(\cdot ))={\textbf{E}}_X[\eta (X)\varvec{\phi }(f(X))+(1-\eta (X))\varvec{\phi }(-f(X))] \end{aligned}$$

where \(\eta (x)\) is the true conditional probability \(P(y=1\vert x)\) and \({\textbf{E}}_X\) denotes the expectation over the input data X. Here \(\phi (v) = (1-v)^2\) for the least square loss and \(\phi (v) = \log (1+\exp (-v))\) for logistic regression. For convenience, we simplify the above definition as:

$$\begin{aligned} Q(\eta ,f)=\eta \phi (f)+(1-\eta )\phi (-f) \end{aligned}$$

Define the function \(f^\star _\phi (\eta ):[0,1]\rightarrow {\mathbb {R}}^\star \) as \(f^\star _\phi (\eta )=\arg \min _{f\in {\mathbb {R}}^\star }Q(\eta , f)\), which is the optimal predictor under risk \(Q(\eta , f)\). Correspondingly, \(Q^*(\eta )=\inf _{f\in {\mathbb {R}}}Q(\eta ,f)=Q(\eta ,f_\phi ^*(\eta ))\) is the optimal risk. Then the following quantities are always non-negative:

$$\begin{aligned} \Delta Q(\eta ,f)&=Q(\eta ,f)-Q(\eta ,f_\phi ^*(\eta )) \nonumber \\&=Q(\eta ,f)-Q^*(\eta ), \end{aligned}$$
(B6)
$$\begin{aligned} \Delta Q(f(\cdot ))&=Q(f(\cdot ))-Q(f_\phi ^*(\eta (\cdot )))\nonumber \\&={\textbf{E}}_X\Delta Q(\eta (X),f(X)) \end{aligned}$$
(B7)

Thus, we can deduce that \(f_\phi ^*(\eta )=2\eta -1;Q^*(\eta )=4\eta (1-\eta )\) for the least square loss, and \(f_\phi ^*(\eta )=\ln \frac{\eta }{1-\eta };Q^*(\eta )=-\eta \ln \eta -(1-\eta )\ln (1-\eta )\) for logistic regression.
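These closed forms can be verified numerically. The sketch below (illustrative only; it uses SciPy's scalar minimizer, which is not part of the paper's code) minimizes \(Q(\eta , f)\) over f on a grid of \(\eta \) and compares the minimizer and minimum value with the stated expressions for both losses.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def Q(eta, f, phi):
    """Risk Q(eta, f) = eta * phi(f) + (1 - eta) * phi(-f)."""
    return eta * phi(f) + (1 - eta) * phi(-f)

square = lambda v: (1 - v) ** 2                # least square loss
logistic = lambda v: np.log1p(np.exp(-v))      # logistic (Cross-Entropy) loss

for eta in [0.1, 0.3, 0.7]:
    f_sq = minimize_scalar(lambda f: Q(eta, f, square),
                           bounds=(-10, 10), method="bounded").x
    f_lg = minimize_scalar(lambda f: Q(eta, f, logistic),
                           bounds=(-10, 10), method="bounded").x
    print(np.isclose(f_sq, 2 * eta - 1, atol=1e-4),                    # f* = 2*eta - 1
          np.isclose(Q(eta, f_sq, square), 4 * eta * (1 - eta), atol=1e-4),
          np.isclose(f_lg, np.log(eta / (1 - eta)), atol=1e-3),        # f* = ln(eta/(1-eta))
          np.isclose(Q(eta, f_lg, logistic),
                     -eta * np.log(eta) - (1 - eta) * np.log(1 - eta), atol=1e-4))
```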

Theorem 3

If \(\phi \) is differentiable, then the Bregman divergence \(d_\phi \) is uniquely defined. We have:

$$\begin{aligned} \Delta Q(\eta ,p)=\eta d_\phi (f_\phi ^*(\eta ),p)+(1-\eta )d_\phi (-f_\phi ^*(\eta ),-p) \end{aligned}$$

If, furthermore, \(f^\star _\phi \) is differentiable, then \(Q^\star \) is also differentiable. Assume that \(p=f^\star _\phi ({\bar{\eta }})\). Then

$$\begin{aligned} \Delta Q(\eta ,p)=d_{Q^*}({\bar{\eta }},\eta ) \end{aligned}$$

More details about this theorem and the corresponding proof can be found in Zhang (2004).

Now we can introduce the relationship between the least square loss and logistic regression as follows:

Lemma 4

In logistic regression, if the estimate of f(x) leads to a small Cross-Entropy loss, then the expected squared difference between the estimated conditional probability \({\hat{P}}(y=1\vert x)=1/(1+\exp (-f(x)))\) and the true conditional probability \(P(y=1\vert x)=\eta (x)\) is also small.

Proof

In logistic regression, \(\phi (v)=\ln (1+\exp (-v))\), \(f_\phi ^*(\eta )=\ln \frac{\eta }{1-\eta }\) and \(Q^*(\eta )=-\eta \ln \eta -(1-\eta )\ln (1-\eta )\). According to Theorem 3, the closeness of \(1/(1+\exp (-f(x)))\) to \(\eta (x)\) is measured by the expected Bregman divergence induced by \(Q^*(\eta )=-\eta \ln \eta -(1-\eta )\ln (1-\eta )\), which is essentially the relative entropy (also called KL-divergence) between \(\eta (x)\) and \(1/(1+\exp (-f(x)))\):

$$\begin{aligned} \Delta Q(\eta ,p)= & {} \eta \ln [\eta (1+e^{-p})]+(1-\eta )\ln [(1-\eta )(1+e^p)]\\= & {} \textrm{KL}\Big (\eta \Big \Vert \frac{1}{1+e^{-p}}\Big ) \end{aligned}$$

Let \({\bar{\eta }}=f_\phi ^{*-1}(p)=1/(1+\exp (-p))\). One may obtain a lower bound for \(KL(\eta \Vert {\bar{\eta }}) \) using Taylor expansion: \(\exists \eta ^{\prime }\) between \({\bar{\eta }}\) and \(\eta \) such that:

$$\begin{aligned} \Delta {\varvec{Q}}(\eta ,p)&=\textrm{KL}(\eta \vert \vert {\bar{\eta }})=-\frac{1}{2}Q^{*\prime \prime }(\eta ^{\prime })(\eta -{\bar{\eta }})^2 \end{aligned}$$
(B8)
$$\begin{aligned}&=\frac{1}{2\eta ^{\prime }(1-\eta ^{\prime })}(\eta -{\bar{\eta }})^2\ge 2(\eta -{\bar{\eta }})^2 \end{aligned}$$
(B9)

This Lemma implies that if we can find \({\hat{p}}(x)\) such that \(Q({\hat{p}}(\cdot ))\) is small, then the expected squared difference between the estimated conditional probability \(1/(1+\exp (-f(x)))\) and the true conditional probability \(\eta (x)\) is also small. Accordingly, the expected squared difference between the estimated value f(x) from the regression model and the true conditional probability \(\eta (x)\) is also small. Therefore, the mean square error in Definition 2 provides a lower bound on the Cross-Entropy loss when using predictions from the linear regression model followed by a sigmoid activation for classification. \(\square \)
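The key inequality \(\textrm{KL}(\eta \Vert {\bar{\eta }}) \ge 2(\eta -{\bar{\eta }})^2\) is easy to sanity-check numerically; the short sketch below (not from the paper's code) evaluates both sides on random pairs \((\eta , {\bar{\eta }})\).

```python
import numpy as np

def kl_binary(eta, eta_bar):
    """KL divergence between Bernoulli(eta) and Bernoulli(eta_bar)."""
    return (eta * np.log(eta / eta_bar)
            + (1 - eta) * np.log((1 - eta) / (1 - eta_bar)))

rng = np.random.default_rng(0)
eta, eta_bar = rng.uniform(0.01, 0.99, size=(2, 10000))
# The KL divergence dominates twice the squared difference (a Pinsker-type bound).
print(np.all(kl_binary(eta, eta_bar) >= 2 * (eta - eta_bar) ** 2))   # True
```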

Appendix C Analysis of the Two Types of Distribution Shifts

We now provide detailed explanations of the correlation shift and diversity shift concepts as defined in Definition 1.

Lemma 5

(Restatement of Definition 1)  Let the feature extractor \(XV_0^{\top } \in {\mathbb {R}}^{n \times k}\) and \({\mathcal {R}}_0 = rowspace(XV_0^{\top })\); we assume that \(XV_0^{\top }\) contains both causal information and non-causal information. Let \(Z=XV_0^\top \), \(Z_{\text {id}}=X_{\text {id}}V_0^\top \), \(Z_{\text {ood}}=X_{\text {ood}}V_0^\top \), \(Z_c\) denote the causal feature, \(Z_{n_1}\) denote the non-causal feature from \(Z_{\text {id}}\), and \(Z_{n_2}\) denote the non-causal feature from \(Z_{\text {ood}}\). We assume that \(Z_{\text {id}} = Z_{c} + Z_{n_1}\) and \(Z_{\text {ood}} = Z_{c} + Z_{n_2}\), where the causal feature \(Z_{c}\) is orthogonal to the non-causal features \(Z_{n_1}\) and \(Z_{n_2}\). We denote the rowspaces of these latent features as \({\mathcal {Q}} = rowspace(Z_c)\), \({\mathcal {S}}_{n_1} = rowspace(Z_{n_1})\) and \({\mathcal {S}}_{n_2} = rowspace(Z_{n_2})\).

According to the assumptions above, we have \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {Q}}) > 0\), \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {S}}_{n_1}) > 0\) and \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {S}}_{n_2}) > 0\). Now we consider the two cases \(0 \le \cos \theta _{\max }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) < 1\) and \(\cos \theta _{\min }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) = 1\).

[Case 1]  If \(0 \le \cos \theta _{\max }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) < 1\), then there exists a rotation matrix \(U_1 \in {\mathbb {R}}^{k \times k}\) such that \(Z_{\text {id}}U_1 = Z_{c}^\prime + Z_{n_1}^\prime + a_{\text {id}} \varepsilon ^\prime \) and \(Z_{\text {ood}}U_1 = Z_{c}^\prime + Z_{n_2}^\prime + a_{\text {ood}} \varepsilon ^\prime \), where \(Z_{c}^\prime = Z_{c}U_1\), the subspaces \(rowspace(Z_{n_1}^\prime )\) and \(rowspace(Z_{n_2}^\prime )\) are orthogonal subspaces of \({\mathcal {R}}_0\), and \(\varepsilon ^\prime \) can be treated as a noise feature that lives in the shared subspace between \(rowspace(Z_{n_1}U_1)\) and \(rowspace(Z_{n_2}U_1)\). We denote \({\mathcal {S}}_{c}^\prime = rowspace(Z_{c} U_1)\), \({\mathcal {S}}_{n_1}^\prime = rowspace(Z_{n_1}^\prime )\), and \({\mathcal {S}}_{n_2}^\prime = rowspace(Z_{n_2}^\prime )\). Then \({\mathcal {S}}_{c}^\prime \), \({\mathcal {S}}_{n_1}^\prime \), and \({\mathcal {S}}_{n_2}^\prime \) are mutually orthogonal.

[Case 2]  If \(\cos \theta _{\min }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) = 1\), then there exists a rotation matrix \( U_2 \in {\mathbb {R}}^{k \times k}\) such that \(Z_{\text {id}}U_2 = Z_{\text {id}}^\prime + b_{\text {id}} \varepsilon ^{\prime \prime }\) and \(Z_{\text {ood}}U_2 = Z_{\text {ood}}^\prime +b_{\text {ood}} \varepsilon ^{\prime \prime }\); letting \({\mathcal {R}}_{id}^{\prime } = rowspace(Z_{\text {id}}^\prime )\) and \({\mathcal {R}}_{ood}^{\prime } = rowspace(Z_{\text {ood}} ^\prime )\), the subspaces \({\mathcal {R}}_{id}^{\prime } \) and \({\mathcal {R}}_{ood}^{\prime }\) are orthogonal subspaces of \({\mathcal {R}}_0\).

Proof

If \(0 \le \cos \theta _{\max }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) < 1\), we rotate the subspaces \({\mathcal {S}}_{n_1}\) and \({\mathcal {S}}_{n_2}\) by minimizing the components on their shared subspace spanned by \(\varepsilon ^\prime \). That is:

$$\begin{aligned} \begin{aligned} U_1&= \mathop {\arg \min }\limits _{U} \Vert a_{\text {id}} \Vert _2 + \Vert a_{\text {ood}} \Vert _2 \\ \mathrm { s.t. }~ Z_{n_1}U^\top&=Z_{n_1}^\prime + a_{\text {id}}\varepsilon ^\prime \\ Z_{n_2}U^\top&=Z_{n_2}^\prime + a_{\text {ood}}\varepsilon ^\prime \end{aligned} \end{aligned}$$
(C10)

Therefore, the original features \(Z_{\text {id}}\) and \(Z_{\text {ood}}\) can be disentangled as \(Z_{\text {id}}U_1 = Z_{c}^\prime + Z_{n_1}^\prime + a_{\text {id}} \varepsilon ^\prime \) and \(Z_{\text {ood}}U_1 = Z_{c}^\prime + Z_{n_2}^\prime + a_{\text {ood}} \varepsilon ^\prime \). Since orthogonal subspaces remain orthogonal after rotation, \({\mathcal {S}}_{c}^\prime = rowspace(Z_c^\prime )\), \({\mathcal {S}}_{n_1}^\prime = rowspace(Z_{n_1}^\prime )\) and \({\mathcal {S}}_{n_2}^\prime = rowspace(Z_{n_2}^\prime )\) are mutually orthogonal.

If \(\cos \theta _{\min }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) = 1\), then there exists a rotation matrix \(U_2 \in {\mathbb {R}}^{k \times k}\), such that the original feature \(Z_{\text {id}}\) and \(Z_{\text {ood}}\) can be rewritten as \(Z_{\text {id}}U_2 = Z_{\text {id}}^\prime + b_{\text {id}} \varepsilon ^{\prime \prime }\) and \(Z_{\text {ood}}U_2 = Z_{\text {ood}}^\prime +b_{\text {ood}} \varepsilon ^{\prime \prime }\). And \({\mathcal {R}}_{id}^{\prime } = rowspace(Z_{\text {id}}^\prime )\) and \({\mathcal {R}}_{ood}^{\prime } = rowspace(Z_{\text {ood}}^\prime )\) are orthogonal subspaces of \({\mathcal {R}}_0\). Similarly, the subspaces \({\mathcal {R}}_{id} = rowspace(Z_{\text {id}})\) and \({\mathcal {R}}_{ood} = rowspace(Z_{\text {ood}})\) are rotated by minimizing the components \(b_{\text {id}} \) and \(b_{\text {ood}} \) on the shared subspace \(\varepsilon ^{\prime \prime }\).\(\square \)
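For intuition about the principal angles used in this appendix, the sketch below (illustrative only; scipy.linalg.subspace_angles is an assumption of convenience, not the paper's tooling) builds two non-causal feature matrices that share one direction, a stand-in for the shared noise feature \(\varepsilon ^\prime \), and checks that \(\cos \theta _{\min } = 1\) while \(0 \le \cos \theta _{\max } < 1\).

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
k = 12

# Two non-causal feature matrices whose rowspaces share exactly one direction
# (a stand-in for the shared noise feature epsilon') and otherwise differ.
shared = rng.normal(size=(1, k))
Z_n1 = np.vstack([rng.normal(size=(3, k)), shared])
Z_n2 = np.vstack([rng.normal(size=(3, k)), shared])

# subspace_angles compares column spaces, so pass transposes to use rowspaces;
# sort so that angles[0] = theta_max and angles[-1] = theta_min.
angles = np.sort(subspace_angles(Z_n1.T, Z_n2.T))[::-1]
cos_theta_max, cos_theta_min = np.cos(angles[0]), np.cos(angles[-1])

print(0.0 <= cos_theta_max < 1.0)        # partially overlapping rowspaces (Case 1)
print(np.isclose(cos_theta_min, 1.0))    # the shared direction gives theta_min = 0
```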

Appendix D Proof of Theorem 1

Inspired by Kumar et al. (2022), we employ proof techniques involving contrapositive for Theorem 1. However, unlike Kumar et al. (2022), we relax the assumption that the pre-trained image feature is of good quality and focus on analyzing the classifier layers of VLAD and LP-V while keeping the pre-trained visual backbone frozen. This allows us to demonstrate the superiority of the general language knowledge in VLAD’s classifier layer compared to the vanilla projection layer in LP-V.

1.1 Preparations

Lemma 6

Since \({\varvec{l}}_\star \) and \( {\varvec{l}}_0\) for each class name have been normalized, \({\exists }~ 0 < \varepsilon _v \le 2\), \( \mathrm { s.t. }~\Vert {\varvec{l}}_\star - {\varvec{l}}_0 \Vert _2 \le \varepsilon _v\). For the visual feature adapter \(G_0\) with orthogonal columns, \({\exists }~ \varepsilon _g > 0\), \( \mathrm { s.t. }~\Vert G_\star - G_0 \Vert _2 \le \varepsilon _g\). We call \(\varepsilon _g\) the visual feature adapter initialization error.

Proof

$$\begin{aligned} \Vert {\varvec{l}}_\star - {\varvec{l}}_0 \Vert _2\le & {} \Vert {\varvec{l}}_\star \Vert _2 + \Vert {\varvec{l}}_0 \Vert _2 = 1+1=2 \end{aligned}$$
(D11)
$$\begin{aligned} \Vert G_\star - G_0 \Vert _2\le & {} \Vert G_\star \Vert _2 + \Vert G_0 \Vert _2 \nonumber \\= & {} \sigma _{\max }(G_\star ) + \sqrt{d_g} \end{aligned}$$
(D12)

We use the fact that \(\sigma _{\max }(G_\star )\) is upper bounded (see Eq. (D28)).\(\square \)

Lemma 7

Let \(\Sigma = {\mathbb {E}}(V_0X^\top XV_0^\top )\), we have:

$$\begin{aligned} L_{\text {ood}}(G, {\varvec{l}}, V_0)&=(G_\star {\varvec{l}}_\star -G {\varvec{l}})^\top \Sigma (G_\star {\varvec{l}}_\star -G {\varvec{l}}) \end{aligned}$$
(D13)
$$\begin{aligned}&\le \sigma _{\max }(\Sigma )\Vert G_\star {\varvec{l}}_\star -G {\varvec{l}}\Vert _2^2 \end{aligned}$$
(D14)

Proof

Let x be the input from \(P_{\text {ood}}\), we have:

$$\begin{aligned} L_{\text {ood}}(G,{\varvec{l}}, V_0)&={\mathbb {E}}[({\varvec{l}}_\star ^\top G_\star ^\top V_0 x^\top - {\varvec{l}}^\top G^\top V_0 x^\top )^2]\nonumber \\&={\mathbb {E}}[(G_\star {\varvec{l}}_\star - G {\varvec{l}})^\top V_0x^{\top }xV_0^\top (G_\star {\varvec{l}}_\star - G {\varvec{l}})]\nonumber \\&=(G_\star {\varvec{l}}_\star -G{\varvec{l}})^\top {\mathbb {E}}[V_0X^\top XV_0^\top ](G_\star {\varvec{l}}_\star - G {\varvec{l}})\nonumber \\&=(G_\star {\varvec{l}}_\star -G{\varvec{l}})^{\top }\Sigma (G_\star {\varvec{l}}_\star - G {\varvec{l}})\nonumber \\&\le \sigma _{\max }(\Sigma )\Vert G_\star {\varvec{l}}_\star -G {\varvec{l}}\Vert _2^2 \end{aligned}$$
(D15)

\(\square \)
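Lemma 7's quadratic-form bound can also be checked numerically. The sketch below (with random placeholder matrices; \(\Sigma \) is estimated by an empirical second moment) verifies that the quadratic form never exceeds \(\sigma _{\max }(\Sigma )\Vert G_\star {\varvec{l}}_\star -G {\varvec{l}}\Vert _2^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, d_g = 200, 64, 32, 16           # hypothetical sizes

X_ood = rng.normal(size=(n, d))          # OOD inputs
V0 = rng.normal(size=(k, d))             # frozen visual feature extractor
Sigma = (V0 @ X_ood.T @ X_ood @ V0.T) / n   # empirical E[V_0 X^T X V_0^T]

G_star = rng.normal(size=(k, d_g))       # placeholder optimal adapter
G = rng.normal(size=(k, d_g))            # placeholder current adapter
l_star = rng.normal(size=(d_g, 1))       # placeholder optimal language embedding
l = rng.normal(size=(d_g, 1))            # placeholder current language embedding

v = G_star @ l_star - G @ l              # G_* l_* - G l
lhs = (v.T @ Sigma @ v).item()           # quadratic-form OOD error (Eq. D13)
rhs = np.linalg.norm(Sigma, 2) * np.linalg.norm(v) ** 2   # sigma_max(Sigma) ||v||^2
print(lhs <= rhs + 1e-8)                 # True: the bound of Eq. (D14) holds
```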

Lemma 8

For the aligned form \(Y = XV_0^\top G^t {\varvec{l}}^t\), we have:

$$\begin{aligned} {\varvec{l}}_0 {\varvec{l}}_0^\top -G_0^\top G_0 = {\varvec{l}}^t{\varvec{l}}^{t\top } - G^{t\top } G^t \end{aligned}$$
(D16)

See proof of Lemma A.4 in Kumar et al. (2022).

Lemma 9

For some universal constant c, for all \(G_0 \in {\mathbb {R}}^{k \times d_g}\) with orthogonal columns, we have:

$$\begin{aligned} \Vert G_\star - G_0 \Vert _2 \ge c d( {\varvec{l}}_0, {\varvec{l}}_\star ) \end{aligned}$$
(D17)

where \(d( {\varvec{l}}_0, {\varvec{l}}_\star ) = \Vert {\varvec{l}}_\star - {\varvec{l}}_0 \Vert _2\).

Proof

$$\begin{aligned} \Vert G_\star - G_0 \Vert _2&= 1/2 (\Vert G_\star - G_0 \Vert _2 \Vert {\varvec{l}}_\star \Vert _2 + \Vert G_\star - G_0 \Vert _2 \Vert {\varvec{l}}_0 \Vert _2) \nonumber \\&\ge 1/2 (\Vert G_\star {\varvec{l}}_\star - G_0{\varvec{l}}_\star \Vert _2 + \Vert G_\star {\varvec{l}}_0 - G_0 {\varvec{l}}_0 \Vert _2) \nonumber \\&\ge 1/2 (\Vert G_\star {\varvec{l}}_\star - G_0{\varvec{l}}_\star + G_0 {\varvec{l}}_0 - G_\star {\varvec{l}}_0 \Vert _2) \nonumber \\&= 1/2 \Vert G_\star ({\varvec{l}}_\star - {\varvec{l}}_0)-G_0({\varvec{l}}_\star - {\varvec{l}}_0) \Vert _2 \nonumber \\&\ge 1/2 \big \vert \Vert G_\star ({\varvec{l}}_\star - {\varvec{l}}_0)\Vert _2-\Vert G_0({\varvec{l}}_\star - {\varvec{l}}_0) \Vert _2 \big \vert \nonumber \\&\ge c d( {\varvec{l}}_0, {\varvec{l}}_\star ) \end{aligned}$$
(D18)

where

$$\begin{aligned} \begin{aligned} c&= \min {\{ \vert \sigma _{\min }(G_\star )- 1\vert , \vert \sigma _{\min }(G_0)- (1+\sqrt{d_g})^2\vert \}} \end{aligned}\nonumber \\ \end{aligned}$$
(D19)

\(\square \)

Lemma 10

Conducting rotation on \(XV_0^\top \) by a rotation matrix U does not change the OOD Error.

Proof

Let \(G_{al}^\infty (V_0)\) and \(G_{al}^\infty (V_0 U^\top )\) denote the solution of VLAD’s adapter based on the visual feature extractor \(V_0\) and \(V_0 U^\top \), respectively. Since \(XV_0^\top G_\star (V_0) = XV_0^\top U U^T G_\star (V_0 U^\top )\), with \({\varvec{l}}_\star \) fixed, we have \(G_\star (V_0) = U^\top G_\star (V_0 U^\top )\). Based on Eq. (D22), we have:

$$\begin{aligned}&XV_0^{\top } G_{al}^\infty (V_0) {\varvec{l}}_0 \nonumber \\&\quad = XV_0^\top (V_0X^\top X V_0^\top ) ^{-1} (V_0X^\top X V_0^\top ) G_\star (V_0) {\varvec{l}}_\star \nonumber \\&\quad = XV_0^\top U (V_0X^\top X V_0^\top ) ^{-1} (V_0X^\top X V_0^\top ) U^\top G_\star (V_0) {\varvec{l}}_\star \nonumber \\&\quad = XV_0^\top U (U^\top V_0X^\top X V_0^\top U) ^{-1}\nonumber \\&\quad (U^\top V_0X^\top X V_0^\top U) G_\star (V_0 U^\top ) {\varvec{l}}_\star \nonumber \\&\quad =XV_0^\top U G_{al}^\infty (V_0 U^\top ) {\varvec{l}}_0 \end{aligned}$$
(D20)

So the final predictors for \(G_{al}^\infty (V_0)\) and \(G_{al}^\infty (V_0 U^\top )\) are identical. That is, conducting rotation on \(XV_0^\top \) by a rotation matrix U does not change the OOD Error of VLAD. To simplify the subsequent proofs, we work with the rotated orthogonal subspaces of Lemma 5. \(\square \)

1.2 Main Proof

Theorem 4

Based on the data assumption and the model setting, given the initialization of \(V_0\), \( {\varvec{l}}_0\) and \(G_0\), let the language embedding initialization error satisfy \(d( {\varvec{l}}_0, {\varvec{l}}_\star ) \le \varepsilon _v\); then the OOD Error of the VLAD method is upper bounded by:

$$\begin{aligned} L_{\text {ood}}(f_{\text {VLAD}}(G^\infty _{al})) \le c_k\varepsilon _v \end{aligned}$$
(D21)

where \(c_k\) is a constant given the value of dimension k.

Proof

Since VLAD’s optimization objective \(\mathop {\min }_{G}\Vert X V_0^\top G {\varvec{l}}_0-X V_0^\top G_\star {\varvec{l}}_{\star } \Vert _2^2\) is convex in G, there exists a global minimum \(G_{al}^\infty \) satisfying:

$$\begin{aligned} G_{al}^{\infty } {\varvec{l}}_0 {\varvec{l}}_0^{\top } = (V_0X^\top X V_0^\top ) ^{-1} (V_0X^\top X V_0^\top ) G_\star {\varvec{l}}_\star {\varvec{l}}_0^{\top } \end{aligned}$$
(D22)

So we have:

$$\begin{aligned} \left\| G_{\star }{\varvec{l}}_{\star }-G_{al}^{\infty } {\varvec{l}}_0\right\| _{2}&\le \left\| \left( G_{\star }{\varvec{l}}_{\star }-G_\star {\varvec{l}}_0\right) +\left( G_{\star } {\varvec{l}}_0-G_{al}^\infty {\varvec{l}}_0\right) \right\| _{2} \nonumber \\&\le \underbrace{\left\| G_{\star }{\varvec{l}}_{\star }-G_\star {\varvec{l}}_0\right\| _{2}}_{(1)}\nonumber \\&\quad +\underbrace{\left\| G_{\star } {\varvec{l}}_0-G_{al}^{\infty } {\varvec{l}}_0\right\| _{2}}_{(2)} \end{aligned}$$
(D23)

For term (1), we get the inequality:

$$\begin{aligned} \left\| G_{\star }{\varvec{l}}_{\star }-G_\star {\varvec{l}}_0\right\| _{2} \le \sigma _{\max }\left( G_{\star }\right) \left\| {\varvec{l}}_{\star } - {\varvec{l}}_0\right\| _{2} \end{aligned}$$
(D24)

Now we show \(\sigma _{\max }(G_{\star })\) is upper bounded. Based on Lemma 8, \(G_{\star }\) satisfies:

$$\begin{aligned} G_{\star }^\top G_{\star } ={\varvec{l}}_{\star }{\varvec{l}}_{\star }^\top - {\varvec{l}}_0 {\varvec{l}}_0^\top +G_0^\top G_0 \end{aligned}$$
(D25)

Taking the trace everywhere, we get:

$$\begin{aligned} \text {Tr}(G_{\star }^\top G_{\star }) = \text {Tr}({\varvec{l}}_{\star }{\varvec{l}}_{\star }^\top ) - \text {Tr}( {\varvec{l}}_0 {\varvec{l}}_0^\top ) + \text {Tr}(G_0^\top G_0) \end{aligned}$$
(D26)

Based on the definition of the F-Norm of a matrix, Eq. (D26) can be denoted as:

$$\begin{aligned} \Vert {G_{\star }}\Vert _F^2 = \Vert {{\varvec{l}}_{\star }}\Vert _F^2 - \Vert { {\varvec{l}}_0}\Vert _F^2 + \Vert {G_{0}}\Vert _F^2 \end{aligned}$$
(D27)

Since the singular values of a matrix are non-negative and \(G_0\) has \(d_g\) orthonormal columns (by assumption), we have \(\sigma _{\max }(G_{\star }) \le \Vert {G_{\star }}\Vert _F^2\) and \(\Vert {G_{0}}\Vert _F = \sqrt{d_g}\), so \(\sigma _{\max }(G_{\star })\) is upper bounded by:

$$\begin{aligned} \sigma _{\max }(G_{\star })\le & {} \Vert {G_{\star }}\Vert _F^2 \le (\Vert {{\varvec{l}}_{\star }}\Vert _F + \Vert {G_{0}}\Vert _F)^2 \nonumber \\= & {} (1 + \sqrt{d_g})^2 \end{aligned}$$
(D28)

Therefore, Eq. (D24) can be further simplified as:

$$\begin{aligned} \left\| G_{\star }{\varvec{l}}_{\star }-G_\star {\varvec{l}}_0\right\| _{2} \le (1 + \sqrt{d_g})^2 \varepsilon _v \end{aligned}$$
(D29)

For term (2), we have:

$$\begin{aligned} \begin{aligned} \left\| G_{\star } {\varvec{l}}_0-G_{al}^{\infty } {\varvec{l}}_0\right\| _{2}&=\left\| (G_{\star }-G_{al}^{\infty }) {\varvec{l}}_0 {\varvec{l}}_0^\top {\varvec{l}}_0\right\| _{2} \\&\le \left\| (G_{\star }-G_{al}^{\infty }) {\varvec{l}}_0 {\varvec{l}}_0^\top \right\| _{2}\left\| {\varvec{l}}_0\right\| _{2} \\&= \left\| (G_{\star }-G_{al}^{\infty }) {\varvec{l}}_0 {\varvec{l}}_0^\top \right\| _{2} \\&= \left\| G_{\star } {\varvec{l}}_0 {\varvec{l}}_0^\top - (V_0X^\top X V_0^\top ) ^{-1}\right. \\&\left. \quad (V_0X^\top X V_0^\top ) G_\star {\varvec{l}}_\star {\varvec{l}}_0^{\top }\right\| _{2} \\&= \left\| G_{\star } ( {\varvec{l}}_0 - {\varvec{l}}_\star ) {\varvec{l}}_0^\top \right\| _{2} \\&\le \sigma _{\max }(G_\star ) \left\| ( {\varvec{l}}_0 - {\varvec{l}}_\star ) \right\| _{2} \\&\le (1 + \sqrt{d_g})^2 \varepsilon _v \end{aligned} \end{aligned}$$
(D30)

Incorporating Eqs. (D29) and (D30) into Eq. (D23), we get:

$$\begin{aligned} \left\| G_{\star }{\varvec{l}}_{\star }-G_{al}^{\infty } {\varvec{l}}_0\right\| _{2} \le 2(1 + \sqrt{d_g})^2 \varepsilon _v \end{aligned}$$
(D31)

This completes the proof. \(\square \)

Theorem 5

Based on the data assumptions and the model setting, given the initialization of \(V_0\), \( {\varvec{l}}_0\) and \(G_0\), if the language embedding initialization error satisfies \(d( {\varvec{l}}_0, {\varvec{l}}_\star ) \le \varepsilon _v\) and the adapter initialization error satisfies \(d(G_0, G_\star ) \le \varepsilon _g\), the OOD Error of the LP-V \(L_{\text {ood}}(f_{\text {LP-V}}({\varvec{h}}_{lp}^\infty ))\) is lower bounded by:

$$\begin{aligned} \sigma _{\min }(\Sigma ) \min \{O(\phi ), O((\phi ^2 - \varepsilon _v^2 - \varepsilon _g^2 - \varepsilon _v \varepsilon _g)/\varepsilon _v)\} \end{aligned}$$
(D32)

where \(\Sigma = {\mathbb {E}}(V_0X^\top XV_0^\top )\) and \(\phi ^2 = \vert ({\varvec{l}}_\star ^\top {\varvec{l}}_\star )^2 - ( {\varvec{l}}_0^\top {\varvec{l}}_\star )^2 \vert \).

Note that the Big-O notation \(O(\cdot )\) denotes an asymptotic upper bound on the order of magnitude of a function in terms of another (usually simpler) function.

Proof

We have rewritten the model expression of the LP-V in Eq. (B4). Based on Eqs. (B4) and (D15), the OOD error of the LP-V can be computed as:

$$\begin{aligned} \begin{aligned}&L_{\text {ood}}(f_{\text {LP-V}}(G_{lp}^\infty , {\varvec{l}}_{lp}^\infty ))\\&\quad = (G_\star {\varvec{l}}_\star -G_{lp}^\infty {\varvec{l}}_{lp}^\infty )^{\top }\Sigma (G_\star {\varvec{l}}_\star - G_{lp}^\infty {\varvec{l}}_{lp}^\infty ) \\&\quad \ge \sigma _{\min }(\Sigma ) \Vert G_\star {\varvec{l}}_\star - G_{lp}^\infty {\varvec{l}}_{lp}^\infty \Vert _2^2 \end{aligned} \end{aligned}$$
(D33)

So it suffices to lower bound \(\Vert G_\star {\varvec{l}}_\star - G_{lp}^\infty {\varvec{l}}_{lp}^\infty \Vert _2\). Following the method in Kumar et al. (2022), we first assume that \(\Vert G_\star {\varvec{l}}_\star - G_{lp}^t {\varvec{l}}_{lp}^t \Vert _2 \le \Delta \) (for all \(t>0\)), and we will show that:

$$\begin{aligned} \Vert ({\varvec{l}}_\star ^\top {\varvec{l}}_\star )^2 - ( {\varvec{l}}_0^\top {\varvec{l}}_\star )^2 \Vert _2 \le f(\Delta ) \end{aligned}$$
(D34)

where \(f(\cdot )\) denotes the polynomial function corresponding to \(\Delta \).

First, let \(c = \cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {S}}_{n_2}^\prime )\) for Case 1 and \(c = \cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {R}}_{ood}^\prime )\) for Case 2; we show that \(\Vert {\varvec{l}}_{lp}^t-{\varvec{l}}_\star \Vert _2 \le d_g\varepsilon _v/c\).

Let \(z = \frac{c}{\Vert {\varvec{l}}_{lp}^t - {\varvec{l}}_\star \Vert _2}({\varvec{l}}_{lp}^t - {\varvec{l}}_\star )\), \({\exists }~ y \in {\mathcal {R}}_0 = rowspace(XV_0^\top )\), \(\mathrm { s.t. }~yG_0 = z\).

For Case 1 with \(\cos \theta _{\max }({\mathcal {R}}_0, {\mathcal {S}}_{n_2}^\prime )\ge 0\), we can find an \(x \in {\mathcal {S}}_{n_2}^\prime \) with \(\Vert x \Vert _2 \le 1\) and \(\Pi _{{\mathcal {R}}_0}(x) = y\) (\(\Pi _{{\mathcal {R}}_0}(x)\) denotes the projection of x onto \({\mathcal {R}}_0\)). For Case 2, we can also find an \(x \in {\mathcal {R}}_{ood}^\prime \) with \(\Vert x \Vert _2 \le 1\) and \(\Pi _{{\mathcal {R}}_0}(x) = y\). In both cases we have \(xG_0 = z\).

Since \(x \in {\mathcal {S}}_{n_2}^\prime \) in Case 1 and \(x \in {\mathcal {R}}_{ood}^\prime \) in Case 2, \(G_{lp}^t\) and \( {\varvec{l}}_{lp}^t\) do not change in the direction of x when training on ID data, so we have \(x G_0 = x G_{lp}^t\) and \(x G_0 {\varvec{l}}_0 = x G_{lp}^t {\varvec{l}}_{lp}^t\). Then the distance between \({\varvec{l}}_{lp}^t\) and \({\varvec{l}}_\star \) can be computed as:

$$\begin{aligned} \begin{aligned} \Vert {\varvec{l}}_{lp}^t-{\varvec{l}}_\star \Vert _2&=\frac{1}{c}({\varvec{l}}_{lp}^t - {\varvec{l}}_\star )^\top \frac{c({\varvec{l}}_{lp}^t - {\varvec{l}}_\star )}{\Vert {\varvec{l}}_{lp}^t - {\varvec{l}}_\star \Vert _2} \\&= \frac{1}{c}({\varvec{l}}_{lp}^t - {\varvec{l}}_\star )^\top z \\&= \frac{1}{c}({\varvec{l}}_{lp}^t - {\varvec{l}}_\star )^\top xG_0 \\&= \frac{1}{c} (x G_{lp}^t {\varvec{l}}_{lp}^t - x G_{0} {\varvec{l}}_\star ) \\&= \frac{1}{c} x (G_{0} {\varvec{l}}_0 - G_{0} {\varvec{l}}_\star ) \\&\le \frac{1}{c} \Vert x \Vert _2 \Vert G_{0} {\varvec{l}}_0 - G_{0} {\varvec{l}}_\star \Vert _2 \\&\le \frac{1}{c} \sigma _{\max }(G_0) \Vert {\varvec{l}}_0 - {\varvec{l}}_\star \Vert _2 \\&\le d_g\varepsilon _v /c \end{aligned} \end{aligned}$$
(D35)
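As a side note, the constant c above is the cosine of the largest principal angle between two subspaces, and \(\Pi _{{\mathcal {R}}_0}(x)\) is an orthogonal projection. The following NumPy sketch (all dimensions and subspaces are arbitrary assumptions made only for illustration, not taken from the paper) shows how both quantities can be computed:

```python
import numpy as np

rng = np.random.default_rng(1)

def orthonormal_basis(M):
    # Columns of Q span the row space of M (M is given with rows as vectors).
    Q, _ = np.linalg.qr(M.T)
    return Q

# Toy stand-ins for rowspace(X V_0^T) and the noise subspace S'_{n2}.
R0 = orthonormal_basis(rng.normal(size=(3, 10)))    # dim-3 subspace of R^10
Sn2 = orthonormal_basis(rng.normal(size=(4, 10)))   # dim-4 subspace of R^10

# Cosines of the principal angles are the singular values of Q1^T Q2;
# cos(theta_max) is the smallest of them.
cosines = np.linalg.svd(R0.T @ Sn2, compute_uv=False)
cos_theta_max = cosines.min()

# Orthogonal projection of a unit vector x onto R0, i.e., Pi_{R0}(x).
x = rng.normal(size=10)
x /= np.linalg.norm(x)
y = R0 @ (R0.T @ x)
print(cos_theta_max, np.linalg.norm(y))
```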

Then, we bound \(\Vert G_0 {\varvec{l}}_\star - G_{lp}^t {\varvec{l}}_\star \Vert _2\).

$$\begin{aligned} \Vert G_0 {\varvec{l}}_\star - G_{lp}^t {\varvec{l}}_\star \Vert _2&\le \Vert G_0 {\varvec{l}}_\star - G_\star {\varvec{l}}_\star \Vert _2 + \Vert G_\star {\varvec{l}}_\star - G_{lp}^t {\varvec{l}}_\star \Vert _2 \nonumber \\&\le \sigma _{\max } (G_0 - G_\star ) \Vert {\varvec{l}}_\star \Vert _2\nonumber \\&\quad + \Vert G_\star {\varvec{l}}_\star + G_{lp}^t {\varvec{l}}_{lp}^t - G_{lp}^t {\varvec{l}}_{lp}^t - G_{lp}^t {\varvec{l}}_\star \Vert _2 \nonumber \\&\le \varepsilon _g + \Delta + \Vert G_{lp}^t {\varvec{l}}_{lp}^t - G_{lp}^t {\varvec{l}}_\star \Vert _2 \nonumber \\&\le \varepsilon _g + \Delta + \sigma _{\max }(G_{lp}^t) \frac{d_g\varepsilon _v}{c} \end{aligned}$$
(D36)

Similarly to Eq. (D28), we have:

$$\begin{aligned} \sigma _{\max }(G_{lp}^t)\le & {} \Vert {G_{lp}^t}\Vert _F^2 \le (\Vert {{\varvec{l}}_{lp}^t}\Vert _F + \Vert {G_{0}}\Vert _F)^2 \nonumber \\= & {} (1 + \sqrt{d_g})^2 \end{aligned}$$
(D37)

Incorporating Eq. (D37) into Eq. (D36), we get:

$$\begin{aligned} \Vert G_0 {\varvec{l}}_\star - G_{lp}^t {\varvec{l}}_\star \Vert _2 \le \Delta + \varepsilon _g + \frac{(1 + \sqrt{d_g})^2}{c}d_g\varepsilon _v = \Delta _1\nonumber \\ \end{aligned}$$
(D38)

Next, we bound \(\vert ({\varvec{l}}_\star ^\top {\varvec{l}}_\star )^2 - ( {\varvec{l}}_0^\top {\varvec{l}}_\star )^2 \vert \).

By the triangle inequality, we have:

$$\begin{aligned} \begin{aligned} \vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-( {\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2\vert&\le \vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star })^2\vert \\&\quad + \vert ({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star })^2-( {\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2\vert \end{aligned} \end{aligned}$$
(D39)

The first term on the RHS of Eq. (D39), \(\vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star })^2\vert \), can be bounded by:

$$\begin{aligned} \begin{aligned}&\vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star })^2 \vert \\&\quad = \vert ({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star }-{{\varvec{l}}_{\star }}^{\top }{\varvec{l}}_{\star }) ({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star } + {{\varvec{l}}_{\star }}^{\top }{\varvec{l}}_{\star })\vert \\&\quad = \vert ({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star }-{{\varvec{l}}_{\star }}^{\top }{\varvec{l}}_{\star })(2{{\varvec{l}}_{\star }}^{\top }{\varvec{l}}_{\star } + {{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star } - {{\varvec{l}}_{\star }}^{\top }{\varvec{l}}_{\star })\vert \\&\quad \le \Vert {{\varvec{l}}_{lp}^{t}}^{\top } -{\varvec{l}}_{\star }^{\top } \Vert _2 \Vert {\varvec{l}}_\star \Vert _2^2 (2\Vert {\varvec{l}}_\star \Vert _2^2 + \Vert {\varvec{l}}_{lp}^t - {\varvec{l}}_\star \Vert _2) \\&\quad = (d_g\varepsilon _v/c) (2 + (d_g\varepsilon _v/c)):= \Delta _2 \end{aligned} \end{aligned}$$
(D40)
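The bound \(\Delta _2\) can be checked numerically. The sketch below (the embedding dimension and perturbation size are assumptions made only for illustration) verifies that for a unit \({\varvec{l}}_\star \) and \(\Vert {\varvec{l}}_{lp}^t - {\varvec{l}}_\star \Vert _2 = \delta \), the difference of squared inner products is at most \(\delta (2+\delta )\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Unit "ground-truth" language embedding and a perturbed copy at distance delta.
l_star = rng.normal(size=16)
l_star /= np.linalg.norm(l_star)
delta = 0.1
perturb = rng.normal(size=16)
perturb *= delta / np.linalg.norm(perturb)
l_lp = l_star + perturb                     # ||l_lp - l_star||_2 = delta

lhs = abs((l_star @ l_star) ** 2 - (l_lp @ l_star) ** 2)
rhs = delta * (2 + delta)                   # Delta_2 with d_g * eps_v / c -> delta
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```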

According to Lemma 8, we have:

$$\begin{aligned} {\varvec{l}}^t_{lp}{{\varvec{l}}^t_{lp}}^\top - {\varvec{l}}_0 {\varvec{l}}_0^\top = {G^t_{lp}}^\top G^t_{lp}-G_0^\top G_0 \end{aligned}$$
(D41)

Left-multiplying both sides by \({\varvec{l}}_\star ^\top \) and right-multiplying both sides by \({\varvec{l}}_\star \), we get:

$$\begin{aligned} ({\varvec{l}}_{lp}^t{}^\top {\varvec{l}}_\star )^2-({\varvec{l}}_0^\top {\varvec{l}}_\star )^2=\Vert G_{lp}^t{\varvec{l}}_\star \Vert _2^2-\Vert G_0{\varvec{l}}_\star \Vert _2^2 \end{aligned}$$
(D42)

Then the second term on the RHS of Eq. (D39) is bounded by:

$$\begin{aligned} \vert&({{\varvec{l}}_{lp}^{t}}^{\top }{\varvec{l}}_{\star })^2-( {\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2\vert \nonumber \\&= \vert \Vert {G_{lp}^{t}}{\varvec{l}}_{\star }\Vert _2^2 - \Vert G_{0}{\varvec{l}}_{\star }\Vert _2^2\vert \nonumber \\&= \vert ( {G_{lp}^{t}}{\varvec{l}}_{\star }- G_{0}{\varvec{l}}_{\star })^\top ({G_{lp}^{t}}{\varvec{l}}_{\star } + G_{0}{\varvec{l}}_{\star })\vert \nonumber \\&\le \Vert {G_{lp}^{t}}{\varvec{l}}_{\star }- G_{0}{\varvec{l}}_{\star } \Vert _2 \Vert {G_{lp}^{t}}{\varvec{l}}_{\star } + G_{0}{\varvec{l}}_{\star } \Vert _2 \nonumber \\&\le \Delta _1 \Vert 2G_{0}{\varvec{l}}_{\star } + {G_{lp}^{t}}{\varvec{l}}_{\star } - G_{0}{\varvec{l}}_{\star } \Vert _2 \nonumber \\&\le \Delta _1 (2\Vert G_{0}{\varvec{l}}_{\star }\Vert _2 + \Vert {G_{lp}^{t}}{\varvec{l}}_{\star } - G_{0}{\varvec{l}}_{\star } \Vert _2) \nonumber \\&\le \Delta _1 (2\Vert G_{0}{\varvec{l}}_{\star } - G_{\star }{\varvec{l}}_{\star }\Vert _2 + 2\Vert G_{\star }{\varvec{l}}_{\star }\Vert _2 + \Delta _1) \nonumber \\&\le \Delta _1 (2\varepsilon _g + 2\sigma _{\max }(G_\star ) + \Delta _1)\nonumber \\&\le \Delta _1 (2\varepsilon _g + 2(1+\sqrt{d_g})^2 + \Delta _1):= \Delta _3 \end{aligned}$$
(D43)

Now, we can bound \(\vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-( {\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2 \vert \) by:

$$\begin{aligned} \vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-( {\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2\vert \le \Delta _{2}+\Delta _{3} \end{aligned}$$
(D44)

Finally, we rewrite the bound above in terms of \(\Delta \), \(\varepsilon _v\) and \(\varepsilon _g\).

We recall that:

$$\begin{aligned} \Delta _1= & {} \Delta + \varepsilon _g + (1 + \sqrt{d_g})^2d_g\varepsilon _v/c \end{aligned}$$
(D45)
$$\begin{aligned} \Delta _2= & {} (d_g\varepsilon _v/c) (2 + (d_g\varepsilon _v/c)) \end{aligned}$$
(D46)
$$\begin{aligned} \Delta _3= & {} \Delta _1 (2\varepsilon _g + 2(1+\sqrt{d_g})^2 + \Delta _1) \end{aligned}$$
(D47)

Plugging \(\Delta _1\) into \(\Delta _3\) and applying Big-Oh notation, we have:

$$\begin{aligned} \begin{aligned} \Delta _2 + \Delta _3&= O(\Delta ^2) + O(\Delta \varepsilon _v/c) + O(\Delta \varepsilon _g) \\&\quad + O((\varepsilon _v/c)^2) + O(\varepsilon _g^2) + O(\varepsilon _v \varepsilon _g/c) \end{aligned} \end{aligned}$$
(D48)

Therefore, we obtain the upper bound of \(\vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-( {\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2\vert \):

$$\begin{aligned} \begin{aligned}&\vert ({\varvec{l}}_{\star }^{\top }{\varvec{l}}_{\star })^2-({\varvec{l}}_0^{\top }{\varvec{l}}_{\star })^2\vert \le O(\Delta ^2) + O(\Delta ) \\&\quad + O(\varepsilon _g) + O(\varepsilon _v) + O(\varepsilon _g \Delta ) \\&\quad + O(\varepsilon _v \Delta ) +O(\varepsilon _g^2) + O(\varepsilon _v^2) + O(\varepsilon _g \varepsilon _v) \end{aligned} \end{aligned}$$
(D49)

The final step is taking the contrapositive.

We have shown that if \(\Vert G_\star ^\top {\varvec{l}}_\star ^\top - G_{lp}^t {\varvec{l}}_{lp}^t \Vert _2 \le \Delta \), then Eq. (D49) holds. Taking the contrapositive with the notation:

$$\begin{aligned} \Delta= & {} \min \{O(\phi ), O(\phi ^2 - O(\varepsilon _g) - O(\varepsilon _v) - O(\varepsilon _g^2) \nonumber \\{} & {} - O(\varepsilon _v^2) - O(\varepsilon _g \varepsilon _v)) / (O(\varepsilon _g) + O(\varepsilon _v))\} \end{aligned}$$
(D50)

if the inequality below holds:

$$\begin{aligned} \begin{aligned}&\vert ({\varvec{l}}_\star ^\top {\varvec{l}}_\star )^2 - ( {\varvec{l}}_0^\top {\varvec{l}}_\star )^2 \vert \ge O(\Delta ^2) + O(\Delta ) \\&\quad + O(\varepsilon _g) + O(\varepsilon _v) + O(\varepsilon _g \Delta ) \\&\quad + O(\varepsilon _v \Delta ) +O(\varepsilon _g^2) + O(\varepsilon _v^2) + O(\varepsilon _g \varepsilon _v) \end{aligned} \end{aligned}$$
(D51)

then we have \(\Vert G_\star ^\top {\varvec{l}}_\star ^\top - G_{lp}^t {\varvec{l}}_{lp}^t \Vert _2 \ge \Delta \).

That is, letting \(\vert ({\varvec{l}}_\star ^\top {\varvec{l}}_\star )^2 - ({\varvec{l}}_0^\top {\varvec{l}}_\star )^2 \vert = 2\phi ^2 +\phi \) for some \(\phi \) and setting \(\Delta =\min \{O(\phi ), O(\phi ^2 - O(\varepsilon _g) - O(\varepsilon _v) - O(\varepsilon _g^2) - O(\varepsilon _v^2) - O(\varepsilon _g \varepsilon _v)) / (O(\varepsilon _g) + O(\varepsilon _v))\}\), condition (D51) is satisfied, and hence \(\Vert G_\star ^\top {\varvec{l}}_\star ^\top - G_{lp}^t {\varvec{l}}_{lp}^t \Vert _2\) is lower bounded by \(\Delta \).

Based on Eq. (D33), since \(\sigma _{\min }(\Sigma )\) has a lower bound almost surely, the OOD error of the LP-V is lower bounded by:

$$\begin{aligned}&L_{\text {ood}}(f_{\text {LP-V}}(G_{lp}^t, {\varvec{l}}_{lp}^t))\nonumber \\&\quad \ge \sigma _{\min }(\Sigma ) (\min \{O(\phi ), O(\phi ^2{ -} O(\varepsilon _g) {-} O(\varepsilon _v) - O(\varepsilon _g^2) \nonumber \\&\qquad - O(\varepsilon _v^2) - O(\varepsilon _g \varepsilon _v)) / (O(\varepsilon _g) + O(\varepsilon _v))\})^2 \nonumber \\&\quad = O(1/(\varepsilon _v + \varepsilon _g)^2) \end{aligned}$$
(D52)

Now we can prove Theorem 1 based on Theorems 4 and 5. Since \(V_0X^\top X V_0^\top \) is invertible, we have \(\sigma _{\min }(\Sigma ) > 0\), and \(\sigma _{\max }(\Sigma )\) is upper bounded almost surely. Thus, for a sufficiently large t and some \(c_k^\prime >0\):

$$\begin{aligned} L_{\text {ood}}(f_{\text {VLAD}}(G^\infty _{al})) \le c_k^\prime \varepsilon _v \end{aligned}$$
(D53)

From Lemma 9, the initialization error satisfies \(\varepsilon _g \ge c_k^{\prime \prime }\) for some \(c_k^{\prime \prime }\). From the model assumption that \( {\varvec{l}}_0\) is scaled to be a unit vector, i.e., \(\Vert {\varvec{l}}_\star \Vert _2 = \Vert {\varvec{l}}_0 \Vert _2 = 1\), we have \(\phi <1\). Hence, for some \(c^{\prime } >0\), we get that:

$$\begin{aligned} L_{\text {ood}}(f_{\text {LP-V}}({\varvec{h}}_{lp}^t))= L_{\text {ood}}(f_{\text {LP-V}}(G_{lp}^t, {\varvec{l}}_{lp}^t)) \ge c^\prime /\varepsilon _v^2 \end{aligned}$$
(D54)

when the visual feature extractor initialization error \(\varepsilon _g\) is upper bounded.

In summary, as the initialization error of language knowledge approaches zero, i.e., \(\varepsilon _v \rightarrow 0\):

$$\begin{aligned} \begin{aligned} \frac{L_{\text {ood}}(f_{\text {VLAD}}(G_{al}^\infty ))}{\text {min}_{t \geqslant 0}L_{\text {ood}}(f_{\text {LP-V}}({\varvec{h}}_{lp}^t))} \le \frac{c_k^\prime \varepsilon _v}{c^\prime /\varepsilon _v^2} \rightarrow 0 \end{aligned} \end{aligned}$$
(D55)

That is, for any \(\varepsilon > 0\) and \(t > 0\), we have:

$$\begin{aligned} {\mathbb {P}}\left( \frac{L_{\text {ood}}(f_{\text {VLAD}}(G_{al}^\infty ))}{\text {min}_{t \geqslant 0}L_{\text {ood}}(f_{\text {LP-V}}({\varvec{h}}_{lp}^t))} \ge \varepsilon \right) \ = 0, ~~\text {as}~~ {\varvec{l}}_0 \rightarrow {\varvec{l}}_\star \nonumber \\ \end{aligned}$$
(D56)

\(\square \)

Appendix E OOD Generalization Guarantee

Theorem 1 shows that the proposed VLAD can achieve a lower OOD error by benefiting from the additional pre-trained language knowledge. In this section, we evaluate its generalization ability by measuring the generalization performance of \(Y=X V_0^\top G_{al} {\varvec{l}}_0\) from ID data to OOD data.

Given n \((n \ge 1)\) labeled source domain(s) \(S = \{S_1, \dots , S_n\}\), we expect algorithms to perform well on an unseen target domain T that exhibits a distribution shift from the source domain(s). Based on the transferability definition and quantifiable transfer measures proposed in Zhang et al. (2021d), we give the OOD generalization guarantee of the proposed VLAD.

Let \(\epsilon _{{\mathcal {D}}}(\theta )\) be the classification error of a model with parameters \(\theta \in \Theta \) on a domain \({\mathcal {D}}\) (\({\mathcal {S}}\) for source domains, \({\mathcal {T}}\) for target domains):

$$\begin{aligned} \epsilon _{{\mathcal {D}}}(\theta ) =\underset{(x, y) \in {\mathcal {D}}}{{\mathbb {E}}}[\ell (f(x), y)] \end{aligned}$$
(E57)

where \(\ell \) is the loss function and f(x) is the prediction result of the OOD algorithm.
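As an illustration, the empirical version of \(\epsilon _{{\mathcal {D}}}(\theta )\) is simply the average loss over samples drawn from the domain. The following sketch (the linear predictor, square loss, and random data are assumptions made only for illustration) computes it:

```python
import numpy as np

def domain_error(f, X, y, loss):
    """Empirical epsilon_D(theta): average loss of predictor f over domain samples."""
    return np.mean([loss(f(x_i), y_i) for x_i, y_i in zip(X, y)])

# Illustrative square loss and a linear predictor with parameters theta.
square_loss = lambda pred, target: (pred - target) ** 2
theta = np.array([0.5, -0.2])
f = lambda x: x @ theta

rng = np.random.default_rng(3)
X_src, y_src = rng.normal(size=(100, 2)), rng.normal(size=100)
print(domain_error(f, X_src, y_src, square_loss))
```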

Definition 3

(Transferability) The source domain \({\mathcal {S}}\) is \({(\delta _{\mathcal {S}}, \delta _{\mathcal {T}})}_\Theta -\) transferable to the target domain \({\mathcal {T}}\) if for \(\delta _{\mathcal {S}} > 0\), there exists \(\delta _{\mathcal {T}}>0\) such that \(\text {argmin}(\epsilon _{{\mathcal {S}}},\delta _{{\mathcal {S}}})_{\Theta }\subseteq \text {argmin}(\epsilon _{{\mathcal {T}}},\delta _{{\mathcal {T}}})_{\Theta }\), where:

$$\begin{aligned} \begin{aligned} \text {argmin}(\epsilon _{{\mathcal {D}}},\delta _{{\mathcal {D}}})_{\Theta }&:=\{\theta \in \Theta :\epsilon _{{\mathcal {D}}}(\theta )\\&\le \inf \limits _{\theta \in \Theta }\epsilon _{{\mathcal {D}}}(\theta )+\delta _{{\mathcal {D}}}\} \end{aligned} \end{aligned}$$
(E58)

The set \(\text {argmin}(\epsilon _{{\mathcal {D}}},\delta _{{\mathcal {D}}})_{\Theta }\) is known as the \(\delta _{\mathcal {D}}-\) minimal set of \(\epsilon _{\mathcal {D}}\), which represents the near-optimal set of model parameters \(\theta \). The condition \(\text {argmin}(\epsilon _{{\mathcal {S}}},\delta _{{\mathcal {S}}})_{\Theta }\subseteq \text {argmin}(\epsilon _{{\mathcal {T}}},\delta _{{\mathcal {T}}})_{\Theta }\) means that a near-optimal solution in the source domains is also near-optimal in the target domain.
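A minimal sketch of this containment check over a finite candidate set \(\Theta \) is given below (the error values are fabricated toy numbers used only to illustrate the set inclusion, not results from the paper):

```python
import numpy as np

def delta_minimal_set(errors, delta):
    """Indices theta with eps_D(theta) <= inf_theta eps_D(theta) + delta."""
    best = errors.min()
    return set(np.flatnonzero(errors <= best + delta))

# Toy error profiles over a finite parameter grid Theta (illustrative values).
eps_S = np.array([0.30, 0.11, 0.10, 0.45, 0.12])   # source-domain errors
eps_T = np.array([0.50, 0.20, 0.18, 0.60, 0.21])   # target-domain errors

near_opt_S = delta_minimal_set(eps_S, delta=0.02)
near_opt_T = delta_minimal_set(eps_T, delta=0.05)

# (delta_S, delta_T)-transferability on this grid: source near-optima
# remain near-optimal on the target domain.
print(near_opt_S, near_opt_T, near_opt_S <= near_opt_T)
```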

Definition 4

(Quantifiable Transfer Measures) Given some \(\Gamma \subseteq \Theta \), \(\epsilon ^*_{{\mathcal {S}}}:=\text {inf}_{\theta \in \Gamma }\epsilon _{{\mathcal {S}}}(\theta )\) and \(\epsilon ^*_{{\mathcal {T}}}:=\text {inf}_{\theta \in \Gamma }\epsilon _{{\mathcal {T}}}(\theta )\), the symmetric transfer measure and the realizable transfer measure are defined, respectively, as:

$$\begin{aligned} {\mathbb {T}}_\Gamma ({\mathcal {S}},{\mathcal {T}})= & {} \sup _{\theta \in \Gamma }\vert \epsilon _{\mathcal {S}}(\theta )-\epsilon _{\mathcal {S}}^*-(\epsilon _{\mathcal {T}}(\theta )-\epsilon _{\mathcal {T}}^*)\vert \end{aligned}$$
(E59)
$$\begin{aligned} {\mathbb {T}}_\Gamma ^r({\mathcal {S}},{\mathcal {T}})= & {} \sup _{\theta \in \Gamma }\vert \epsilon _{\mathcal {S}}(\theta )-\epsilon _{\mathcal {T}}(\theta )\vert \end{aligned}$$
(E60)

The symmetric transfer measure (Eq. E59) reduces to the realizable transfer measure (Eq. E60) in the realizable cases when \(\epsilon _{\mathcal {S}}^* = \epsilon _{\mathcal {T}}^* = 0\). In the literature, \(\epsilon _{\mathcal {S}}(\theta )-\epsilon _{\mathcal {S}}^*\) is known as the excess risk, which is the relative error compared to the optimal result \(\epsilon _{\mathcal {S}}^*\). So the transfer measure \({\mathbb {T}}_\Gamma ({\mathcal {S}},{\mathcal {T}})\) quantifies the distinction between the source domains and the target domain from the perspective of excess risk.
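Both measures can be computed directly on a finite \(\Gamma \). The sketch below (reusing toy error profiles that are purely illustrative assumptions) evaluates Eqs. (E59) and (E60):

```python
import numpy as np

def transfer_measures(eps_S, eps_T):
    """Symmetric and realizable transfer measures over a finite Gamma (Eqs. E59-E60)."""
    excess_S = eps_S - eps_S.min()        # eps_S(theta) - eps_S^*
    excess_T = eps_T - eps_T.min()        # eps_T(theta) - eps_T^*
    symmetric = np.max(np.abs(excess_S - excess_T))   # Eq. (E59)
    realizable = np.max(np.abs(eps_S - eps_T))        # Eq. (E60)
    return symmetric, realizable

eps_S = np.array([0.30, 0.11, 0.10, 0.45, 0.12])
eps_T = np.array([0.50, 0.20, 0.18, 0.60, 0.21])
print(transfer_measures(eps_S, eps_T))
```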

In this section, we calculate the realizable transfer measure (Eq. E60) of the proposed VLAD for both diversity shifts and correlation shifts.

On the one hand, as analyzed in Definition 1, for diversity shifts between the source and target domains, the visual feature \(z=XV_0^\top \) can be disentangled as \(Z_{\text {id}} = Z_{c}^\prime + Z_{n_1}^\prime + a_{\text {id}}\varepsilon ^\prime \) and \(Z_{\text {ood}} = Z_{c}^\prime + Z_{n_2}^\prime + a_{\text {ood}}\varepsilon ^\prime \), where \({\mathcal {S}}_{c}^\prime = rowspace(Z_c^\prime )\), \({\mathcal {S}}_{n_1}^\prime = rowspace(Z_{n_1}^\prime )\) and \({\mathcal {S}}_{n_2}^\prime = rowspace(Z_{n_2}^\prime )\) are mutually orthogonal. Here, we omit the superscript “\(\prime \)” since no confusion arises; that is, \(Z_{\text {id}} = Z_{c}+ Z_{n_1} + a_{\text {id}}\varepsilon \) and \(Z_{\text {ood}} = Z_{c} + Z_{n_2} + a_{\text {ood}}\varepsilon \).

On the other hand, for correlation shifts, the visual feature can be disentangled as \(Z_{\text {id}} = Z_{c}+ Z_{n_1}\) and \(Z_{\text {ood}} = Z_{c}+ Z_{n_2}\), where \({\mathcal {S}}_{c} = rowspace(Z_c)\) and \({\mathcal {S}}_{n_i} = rowspace(Z_{n_i})\) \((i=1, 2)\) are orthogonal subspaces, while \(Z_{n_1}\) and \(Z_{n_2}\) share the same data space.

Before the OOD generalization guarantee theorem, we first clarify our symbolic denotations and data assumptions. The final output of VLAD is denoted by \(f_\theta (Z)\), where \(\theta \) denotes the model parameters learned in source domains and \(Z=XV_0^\top \) is the visual feature extracted by the fixed pre-trained model \(V_0\). The empirical distributions of the input (XY) from the source domains and the target domain are \(p_{{\mathcal {S}}}(Z, Y)\) and \(p_{{\mathcal {T}}}(Z, Y)\), respectively.

Theorem 6

(Restatement of OOD Generalization Guarantee of the Proposed VLAD) Given the pre-trained visual feature extractor \(V_0\) and the language embedding initialization error \(d( {\varvec{l}}_0, {\varvec{l}}_\star ) =\Vert {\varvec{l}}_\star - {\varvec{l}}_0 \Vert _2\le \varepsilon _v\), suppose we have learned the VLAD model \(f_\theta (Z)\) on the ID sample distribution(s) \({\mathcal {S}}\) with a square loss function \(\ell \) and the proposed regularization terms \(\ell _a\) and \(\ell _d\) defined in Eqs. (3) and (4), respectively, such that:

$$\begin{aligned} \begin{aligned}&\epsilon _{{\mathcal {S}}}(\theta ){=}\underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}}[\ell (f_\theta (Z),Y) + \ell _a (Z, {\varvec{l}}_0) {+} \ell _d(Z, {\varvec{l}}_0)]\le \eta \\&\vert \epsilon _{{\mathcal {S}}}^d(\theta ) \vert =\big \vert \underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}} \ell _d (Z, {\varvec{l}}_0) \big \vert \ge \eta _d ~~~~(0<\eta _d\le 2). \end{aligned}\nonumber \\ \end{aligned}$$
(E61)

Then for the OOD Error estimation \(L_{\text {ood}} = \underset{(X, Y) \in {\mathcal {T}}}{{\mathbb {E}}}[\ell (f_\theta (Z),Y)]\), where \({\mathcal {T}}\) denotes the OOD sample distribution, we have:

$$\begin{aligned} L_{\text {ood}} \le \left\{ \begin{array}{c c}{{2\eta + (3-\eta _d + D(P_{{\mathcal {S}}}, P_{{\mathcal {T}}}))^2/2}}&{}{{0 \le \cos \theta _{\max }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) < 1}}\\ {{ 2\eta + ( 1+ (2-\eta _d) D(P_{{\mathcal {S}}}, P_{{\mathcal {T}}}))^2/2}}&{}{{\cos \theta _{\min }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2})= 1}} \end{array}\right. \nonumber \\ \end{aligned}$$
(E62)

where \(D(P_{\mathcal {S}}, P_{\mathcal {T}}) = {\mathbb {E}}\vert \vert (Z_{n_2}-Z_{n_1}) \otimes Z_{n_1}^{-1}\vert \vert _2^2 \) for correlation shift and \(D(P_{\mathcal {S}}, P_{\mathcal {T}}) = {\mathbb {E}} \vert \vert Z_{n_2}\vert \vert _2^2 \) for diversity shift, with \(\otimes \) denoting the element-wise product defined in the proof below.

Note that the optimization objective in Eq. (8) is reasonable, because a better alignment between the image feature and the language embedding from the same class is expected, and hence a good model for ID data has \(\underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}} \ell _d (Z, {\varvec{l}}_0) < 0\). In this case, \(\vert \epsilon _{{\mathcal {S}}}^d(\theta ) \vert =\vert \underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}} \ell _d (Z, {\varvec{l}}_0) \vert \ge \eta _d\) gives the upper bound of the regularization loss \(\ell _d\). From Eq. (4), \(\ell _d\) is the difference between the inner products of the positive pairs and the negative pairs, so it is easy to deduce that \(\vert \epsilon _{{\mathcal {S}}}^d(\theta ) \vert \le 2 \) when the vision features and language features are L2-normalized before alignment.

Proof

According to Eq. (4), we denote the causal feature in \(\varvec{v_i}~(i\in [1,n])\) as \(Z_{c}^{+}\) and the causal feature in \({\varvec{v}}^{-}\) as \(Z_{c}^{-}\). During training, the pre-trained image features \(\varvec{v_i}~(i\in [1,n])\) and \({\varvec{v}}^{-}\) are from different classes and are contrasted in \(\ell _d\) due to their similar non-causal features. We denote this shared non-causal feature as \(Z_{n_1}\). Then we have:

$$\begin{aligned} \begin{aligned} \underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}}\ell _d (Z, {\varvec{l}}_0)&= \underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}}\\&\quad -\log \frac{\exp ((Z_{c}^{+} + Z_{n_1}) G_{al} {\varvec{l}}_0^{+})}{\exp ((Z_{c}^{-} + Z_{n_1}) G_{al} {\varvec{l}}_0^{+})} \\&= \underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}}\\&\quad -\log \frac{\exp ((Z_{c}^{+} G_c+ Z_{n_1} G_n) {\varvec{l}}_0^{+})}{\exp ((Z_{c}^{-} G_c + Z_{n_1} G_n) {\varvec{l}}_0^{+})} \\&\le -\eta _d \end{aligned} \end{aligned}$$
(E63)

The second line follows the disentanglement \(Z=Z_c + Z_{n_1}\), where the rowspaces of \(Z_c\) and \(Z_{n_1}\) are orthogonal. Similarly, the visual feature adapter can be decomposed as \(G_{al} = G_c + G_n\), where \(G_c\) and \(G_n\) correspond to \(Z_c\) and \(Z_{n_1}\), respectively.

Furthermore, Eq. (E63) can be rewritten as \(\underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}} \vert (Z_{c}^{+}-Z_{c}^{-})G_c {\varvec{l}}_0^{+}\vert \ge {\eta _d}\). Recall that \(\vert \vert {\varvec{l}}_0 \vert \vert _2 =1\), so we have \(\underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}} \vert \vert Z_{c}^{+}G_c\vert \vert _2^2 + \vert \vert Z_{c}^{-}G_c\vert \vert _2^2 \ge {2\eta _d -2}\). Since \(\vert \vert Z(G_n + G_c) \vert \vert _2^2 = \vert \vert Z_{n_1}G_n \vert \vert _2^2 + \vert \vert Z_cG_c \vert \vert _2^2= 1\), we have \(\underset{(Z, Y) \in {\mathcal {S}}}{{\mathbb {E}}}\vert \vert Z_{n_1}G_n \vert \vert _2^2 \le 2-\eta _d\).
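For intuition, the ratio form in Eq. (E63) collapses to a gap between alignment scores, since \(-\log (e^{s^{+}}/e^{s^{-}}) = s^{-} - s^{+}\). The sketch below (random features, adapter, and embedding are illustrative assumptions, not the model's actual values) makes this explicit:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy causal/non-causal feature split and adapter, purely illustrative.
Zc_pos, Zc_neg, Zn1 = rng.normal(size=(3, 8))
G_al = rng.normal(size=(8, 8))
l0_pos = rng.normal(size=8)
l0_pos /= np.linalg.norm(l0_pos)

s_pos = (Zc_pos + Zn1) @ G_al @ l0_pos   # alignment score of the positive pair
s_neg = (Zc_neg + Zn1) @ G_al @ l0_pos   # alignment score of the contrasted pair

# -log(exp(s_pos) / exp(s_neg)) equals the score gap s_neg - s_pos.
ell_d = -np.log(np.exp(s_pos) / np.exp(s_neg))
print(ell_d, s_neg - s_pos)
```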

For the square loss function, the OOD Error can be calculated as:

$$\begin{aligned} \begin{aligned} \epsilon _{{\mathcal {T}}}(\theta )&= \underset{(Z, Y) \in {\mathcal {T}}}{{\mathbb {E}}}[\ell (f_\theta (Z),Y)] \\&= \underset{(Z, Y) \in {\mathcal {T}}}{{\mathbb {E}}}[\ell ((Z_c + Z_{n_2}) G_{al} {\varvec{l}}_0,Y)] \\&= {{\mathbb {E}}}[\ell ((Z_c + Z_{n_1})G_{al} {\varvec{l}}_0 + (Z_{n_2}-Z_{n_1})G_{al} {\varvec{l}}_0, Y)] \\&\le 2 {{\mathbb {E}}}[\ell ((Z_c + Z_{n_1})G_{al} {\varvec{l}}_0, Y)]\\&\quad + 2{{\mathbb {E}}}((Z_{n_2}-Z_{n_1})G_n {\varvec{l}}_0)^2\\&\le 2 {{\mathbb {E}}}[\ell ((Z_c + Z_{n_1})G_{al} {\varvec{l}}_0, Y)]\\&\quad + \frac{1}{2}{{\mathbb {E}}}(\vert \vert (Z_{n_2}-Z_{n_1})G_n\vert \vert _2^2 + \vert \vert {\varvec{l}}_0 \vert \vert _2^2)^2 \\&\le 2\eta + \frac{1}{2}{\mathbb {E}}(\vert \vert (Z_{n_2}-Z_{n_1})G_n\vert \vert _2^2 +1)^2 \end{aligned} \end{aligned}$$
(E64)

The fourth line follows the inequality \((a+b-c)^2\le 2(a-c)^2+2b^2\).
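This elementary inequality, a direct consequence of \((x+y)^2 \le 2x^2 + 2y^2\), can be spot-checked numerically (the values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, c = rng.normal(size=3)

# Inequality used in the fourth line of Eq. (E64):
# (a + b - c)^2 <= 2(a - c)^2 + 2 b^2
lhs = (a + b - c) ** 2
rhs = 2 * (a - c) ** 2 + 2 * b ** 2
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```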

For the diversity shift case, Eq. (E64) can be further calculated as:

$$\begin{aligned} \epsilon _{{\mathcal {T}}}(\theta )&\le 2\eta + {\mathbb {E}}(\vert \vert (Z_{n_2}-Z_{n_1})G_n\vert \vert _2^2 +1)^2/2\nonumber \\&= 2\eta + {\mathbb {E}}(\vert \vert Z_{n_1}G_n\vert \vert _2^2 + \vert \vert Z_{n_2}G_n\vert \vert _2^2 +1)^2/2\nonumber \\&\le 2\eta + {\mathbb {E}}(2-\eta _d + \vert \vert Z_{n_2}G_n\vert \vert _2^2 +1)^2/2\nonumber \\&\le 2\eta + (3-\eta _d + {\mathbb {E}} \vert \vert Z_{n_2}\vert \vert _2^2 )^2/2\nonumber \\&= 2\eta + (3-\eta _d + \int _{Z_{n_2}} \vert \vert Z_{n_2}\vert \vert _2^2 p(Z_{n_2})dZ_{n_2})^2/2 \end{aligned}$$
(E65)

For the correlation shift case, we have:

$$\begin{aligned} \epsilon _{{\mathcal {T}}}(\theta )&\le 2\eta + {\mathbb {E}}(\vert \vert (Z_{n_2}-Z_{n_1}) \otimes Z_{n_1}^{-1} \otimes Z_{n_1} G_n\vert \vert _2^2 +1)^2/2\nonumber \\&\le 2\eta + {\mathbb {E}}(\vert \vert (Z_{n_2}-Z_{n_1}) \otimes Z_{n_1}^{-1}\vert \vert _2^2 \vert \vert Z_{n_1} G_n\vert \vert _2^2 +1)^2/2\nonumber \\&\le 2\eta + {\mathbb {E}}(\vert \vert (Z_{n_2}-Z_{n_1}) \otimes Z_{n_1}^{-1}\vert \vert _2^2(2-\eta _d) + 1)^2/2\nonumber \\&= \left( \int _{Z_{n_1}} \int _{Z_{n_2}} \vert \vert (Z_{n_2}-Z_{n_1}) \right. \nonumber \\&\left. \quad \otimes Z_{n_1}^{-1}\vert \vert _2^2 p(Z_{n_1})p(Z_{n_2})dZ_{n_1}dZ_{n_2}(2-\eta _d) + 1\right) ^2/2\nonumber \\&\quad +2\eta \end{aligned}$$
(E66)

where the notation \(\otimes \) denotes the element-wise product of two vectors or matrices, and \(Z_{n_1}^{-1}\) denotes the generalized (element-wise) inverse of \(Z_{n_1}\), i.e., \(Z_{n_1}^{-1} \otimes Z_{n_1} = e\), where e is the all-ones vector.
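A small sketch of these element-wise operations (with assumed non-zero toy vectors) is given below; it also evaluates the correlation-shift term \(D(P_{\mathcal {S}}, P_{\mathcal {T}})\) for those vectors:

```python
import numpy as np

# Element-wise (Hadamard) product and the generalized inverse used above:
# Z^{-1} is defined element-wise so that Z^{-1} * Z = e (the all-ones vector).
Z_n1 = np.array([0.5, -2.0, 4.0])      # assumed non-zero entries
Z_n2 = np.array([0.7, -1.0, 3.0])

Z_n1_inv = 1.0 / Z_n1                  # generalized (element-wise) inverse
e = Z_n1_inv * Z_n1                    # recovers the all-ones vector

# Correlation-shift term D(P_S, P_T) = E || (Z_n2 - Z_n1) * Z_n1^{-1} ||_2^2
D_corr = np.linalg.norm((Z_n2 - Z_n1) * Z_n1_inv) ** 2
print(e, D_corr)
```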

Since \({\mathbb {T}}_\Gamma ^r({\mathcal {S}},{\mathcal {T}})= \sup _{\theta \in \Gamma }\vert \epsilon _{\mathcal {S}}(\theta )-\epsilon _{\mathcal {T}}(\theta )\vert \), we have:

$$\begin{aligned} {\mathbb {T}}_\Gamma ^r({\mathcal {S}},{\mathcal {T}}) \le \left\{ \begin{array}{c c}{{\eta + (3-\eta _d +D(P_{{\mathcal {S}}}, P_{{\mathcal {T}}}))^2/2}}&{}{{0 \le \cos \theta _{\max }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2}) < 1}}\\ {{ \eta + ( 1+ (2-\eta _d) D(P_{{\mathcal {S}}}, P_{{\mathcal {T}}}))^2/2}}&{}{{\cos \theta _{\min }({\mathcal {S}}_{n_1}, {\mathcal {S}}_{n_2})= 1}} \end{array}\right. \nonumber \\ \end{aligned}$$
(E67)

where \(D(P_{\mathcal {S}}, P_{\mathcal {T}}) = {\mathbb {E}}\vert \vert (Z_{n_2}-Z_{n_1}) \otimes Z_{n_1}^{-1}\vert \vert _2^2 \) for correlation shift and \(D(P_{\mathcal {S}}, P_{\mathcal {T}}) = {\mathbb {E}} \vert \vert Z_{n_2}\vert \vert _2^2 \) for diversity shift. The upper bound on \(L_{\text {ood}}\) then follows from \(L_{\text {ood}} \le {\mathbb {T}}_\Gamma ^r({\mathcal {S}},{\mathcal {T}}) + \eta \). \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhu, L., Yin, W., Yang, Y. et al. Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02036-4
