
Exploring Vision-Language Models for Imbalanced Learning

International Journal of Computer Vision

Abstract

Vision-language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, their performance on imbalanced datasets, where the class distribution of the training data is skewed, is relatively poor, leading to weak predictions for minority classes. For instance, CLIP achieves only 5% accuracy on the iNaturalist18 dataset. We propose to add a lightweight decoder to VLMs to avoid the out-of-memory problem caused by a large number of classes and to capture nuanced features for tail classes. We then explore improvements to VLMs via prompt tuning, fine-tuning, and the incorporation of imbalanced learning algorithms such as Focal Loss, Balanced SoftMax, and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when combined with the decoder and imbalanced methods. Specifically, our improved VLMs significantly outperform zero-shot classification by an average accuracy of 6.58%, 69.82%, and 6.17% on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbones, and training cost. Our study highlights the significance of imbalanced learning algorithms in the face of VLMs pre-trained on massive data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.


Notes

  1. We mainly deal with a supervised imbalanced setting, i.e., the training data is fully labeled. Other newly emerging areas, such as semi-supervised imbalanced learning (Chen et al., 2022), are out of the scope of this paper.

  2. In fact, even with powerful hardware such as an NVIDIA A100 (80 GB), it may not be feasible to run CoOp on large datasets with as many as 8142 classes.

  3. We also provide the results of fully fine-tuning CLIP ViT-L/14. Linear probing needs a single RTX 4090 with a batch size of 256, while full fine-tuning requires two A100 80 GB GPUs with a batch size of 128.

References

  • Byrd, J., & Lipton, Z. (2019). What is the effect of importance weighting in deep learning? In ICML, PMLR (pp. 872–881).

  • Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019a). Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS.

  • Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019b). Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413

  • Chen, H., Fan, Y., Wang, Y., Wang, J., Schiele, B., Xie, X., Savvides, M., & Raj, B. (2022). An embarrassingly simple baseline for imbalanced semi-supervised learning. arXiv preprint arXiv:2211.11086

  • Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron, M., Geirhos, R., Alabdulmohsin, I., & Jenatton, R. (2023). Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000–16009).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Hong, Y., Han, S., Choi, K., Seo, S., Kim, B., & Chang, B. (2021). Disentangling label distribution for long-tailed visual recognition. In CVPR (pp. 6626–6636).

  • Jamal, M. A., Brown, M., Yang, M. H., Wang, L., & Gong, B. (2020). Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In CVPR (pp. 7610–7619).

  • Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., & Kalantidis, Y. (2019). Decoupling representation and classifier for long-tailed recognition. In ICML.

  • Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., & Togneri, R. (2017). Cost-sensitive learning of deep feature representations from imbalanced data. IEEE TNNLS, 29(8), 3573–3587.


  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, PMLR (pp. 12888–12900).

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In ICCV (pp. 2980–2988).

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., & Wei, F. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12009–12019).

  • Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., & Yu, S. X. (2019). Large-scale long-tailed recognition in an open world. In CVPR (pp. 2537–2546).

  • Lüddecke, T., & Ecker, A. (2022). Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7086–7096).

  • Ma, T., Geng, S., Wang, M., Shao, J., Lu, J., Li, H., Gao, P., & Qiao, Y. (2021). A simple long-tailed recognition baseline via vision-language model. arXiv preprint arXiv:2111.14745

  • Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., & Kumar, S. (2020). Long-tail learning via logit adjustment. In ICLR.

  • Platt, J., Cristianini, N., & Shawe-Taylor, J. (1999). Large margin DAGs for multiclass classification. In NIPS (Vol. 12).

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, PMLR (pp. 8748–8763).

  • Ren, J., Yu, C., Ma, X., Zhao, H., & Yi, S. (2020). Balanced meta-softmax for long-tailed visual recognition. arXiv preprint arXiv:2007.10740

  • Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., & Schramowski, P. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth conference on neural information processing systems datasets and benchmarks track.

  • Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., & Yan, J. (2020). Equalization loss for long-tailed object recognition. In CVPR (pp. 11662–11671).

  • Tang, K., Huang, J., & Zhang, H. (2020). Long-tailed classification by keeping the good and removing the bad momentum causal effect. NeurIPS, 33, 66.


  • Tian, C., Wang, W., Zhu, X., Dai, J., & Qiao, Y. (2022). Vl-ltr: Learning class-wise visual-linguistic representation for long-tailed visual recognition. In Computer Vision-ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV (pp. 73–91). Springer.

  • Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., & Belongie, S. (2018). The inaturalist species classification and detection dataset. In CVPR (pp. 8769–8778).

  • Vapnik, V. (1991). Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems, 4, 66.


  • Wang, J., Lukasiewicz, T., Hu, X., Cai, J., & Xu, Z. (2021a). Rsg: A simple but effective module for learning imbalanced datasets. In CVPR (pp. 3784–3793).

  • Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C. C., & Lin, D. (2021b). Seesaw loss for long-tailed instance segmentation. In CVPR (pp. 9695–9704).

  • Wang, P., Han, K., Wei, X. S., Zhang, L., & Wang, L. (2021c). Contrastive learning based hybrid networks for long-tailed image classification. In CVPR (pp. 943–952).

  • Wang, Y., Zhang, B., Hou, W., Wu, Z., Wang, J., & Shinozaki, T. (2022). Margin calibration for long-tailed visual recognition. In Asian Conference on Machine Learning (ACML).

  • Wang, Y. X., Ramanan, D. & Hebert, M. (2017). Learning to model the tail. In NeurIPS (pp. 7032–7042).

  • Wei, H., Tao, L., Xie, R., Feng, L., & An, B. (2022). Open-sampling: Exploring out-of-distribution data for re-balancing long-tailed datasets. In International conference on machine learning, PMLR (pp. 23615–23630).

  • Xu, Z., Yang, S., Wang, X., & Yuan, C. (2023). Rethink long-tailed recognition with vision transformers. In ICASSP 2023—2023 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1–5). IEEE.

  • Yang, C. Y., Yang, J. S., & Wang, J. J. (2009). Margin calibration in svm class-imbalanced learning. Neurocomputing, 73(1–3), 397–411.


  • Yang, L., Jiang, H., Song, Q., & Guo, J. (2022). A survey on long-tailed visual recognition. IJCV, 1–36.

  • Yang, Y., & Xu, Z. (2020). Rethinking the value of labels for improving class-imbalanced learning. In NeurIPS.

  • Yin, X., Yu, X., Sohn, K., Liu, X., & Chandraker, M. (2019). Feature transfer learning for face recognition with under-represented data. In CVPR.

  • Yu, J., Wang, Z., Vasudevan, V., & Yeung, L. (2022). Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917

  • Zhang, S., Li, Z., Yan, S., He, X., & Sun, J. (2021). Distribution alignment: A unified framework for long-tail visual recognition. In CVPR.

  • Zhou, B., Cui, Q., Wei, X. S., & Chen, Z. M. (2020). Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In CVPR.

  • Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE TPAMI, 40(6), 1452–1464.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022a). Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16816–16825).

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022b). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337–2348.


Author information

Correspondence to Jindong Wang, Wei Ye or Shikun Zhang.

Additional information

Communicated by Kaiyang Zhou.


Appendices

Appendix A: Algorithm Flow

The algorithm for incorporating imbalanced learning algorithms is shown in Algorithm 2.

Appendix B: Dataset

Table 9 The detailed statistics of datasets

The detailed statistics of datasets are shown in Table 9.


Appendix C: Imbalanced Algorithms

In this section, we give a brief introduction to the imbalanced learning methods used in this paper.

Class Balanced Re-weighting Class Balanced Re-Weighting (CBW) assigns loss weights to each instance in the dataset based on the class distribution, such that each class contributes equally to the overall loss function during training, which allows the model to give more importance to the minority classes.
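
A minimal PyTorch sketch of CBW, assuming hypothetical per-class counts; the inverse class frequencies are passed as weights to the standard cross-entropy loss:

    import torch
    import torch.nn as nn

    # Hypothetical per-class sample counts for a 3-class imbalanced dataset.
    class_counts = torch.tensor([5000.0, 500.0, 50.0])
    weights = 1.0 / class_counts                       # inverse-frequency weights
    weights = weights / weights.sum() * len(weights)   # normalize so the average weight is 1

    criterion = nn.CrossEntropyLoss(weight=weights)    # each class now contributes equally
    logits = torch.randn(8, 3)                         # model outputs for a batch of 8 samples
    labels = torch.randint(0, 3, (8,))
    loss = criterion(logits, labels)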

LDAM Loss Label-Distribution-Aware Margin (LDAM) Loss (Cao et al., 2019b) aims to improve the performance of the model on imbalanced datasets by taking the label distribution of the data into account. This loss function adds a margin term to the traditional cross-entropy loss, which prevents the model from being biased towards the majority classes. The margin term is calculated from the class distribution in the dataset.
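
A minimal sketch of the LDAM loss, assuming class_counts, max_m, and s are supplied by the user; following Cao et al. (2019b), the per-class margin is proportional to the inverse fourth root of the class size and is subtracted from the ground-truth logit before a scaled cross-entropy:

    import torch
    import torch.nn.functional as F

    class LDAMLoss(torch.nn.Module):
        def __init__(self, class_counts, max_m=0.5, s=30.0):
            super().__init__()
            counts = torch.as_tensor(class_counts, dtype=torch.float)
            m = 1.0 / counts.pow(0.25)                 # margin grows as the class gets rarer
            self.register_buffer("m", m * (max_m / m.max()))
            self.s = s                                 # logit scaling factor

        def forward(self, logits, target):
            # Subtract the class-dependent margin from the logit of the true class only.
            one_hot = F.one_hot(target, logits.size(1)).float()
            adjusted = logits - one_hot * self.m[target].unsqueeze(1)
            return F.cross_entropy(self.s * adjusted, target)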

Focal Loss Focal Loss (Lin et al., 2017) assigns higher weights to hard-to-classify samples, which have low prediction confidence, making them more important in the training process, while reducing the contribution of easy-to-classify samples with high confidence.
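
A minimal sketch of the multi-class focal loss; gamma is the focusing parameter that down-weights high-confidence, easy samples:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, target, gamma=2.0):
        ce = F.cross_entropy(logits, target, reduction="none")  # per-sample cross-entropy
        pt = torch.exp(-ce)                                     # probability of the true class
        return ((1.0 - pt) ** gamma * ce).mean()                # easy samples (pt near 1) are down-weighted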

Balanced Softmax Loss Balanced Softmax Loss (Ren et al., 2020) is an unbiased extension of Softmax, called Balanced Softmax, which accommodates the label distribution shift between training and testing and minimizes the generalization bound in imbalanced settings.
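
A minimal sketch of Balanced Softmax, assuming the training class counts are known; shifting each logit by the log of the training label prior before the softmax compensates for the label distribution shift between training and testing:

    import torch
    import torch.nn.functional as F

    def balanced_softmax_loss(logits, target, class_counts):
        # Add the log training prior of each class to its logit; the resulting
        # softmax is unbiased with respect to the (balanced) test distribution.
        log_prior = torch.log(class_counts.float() / class_counts.sum())
        return F.cross_entropy(logits + log_prior.to(logits.device), target)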

LADE Loss LAbel distribution DisEntangling (LADE) Loss (Hong et al., 2021) formulates imbalanced classification as a label shift problem in which the target and source label distributions differ, and identifies the entanglement between the source label distribution and the model prediction as a significant hurdle. The LADE loss is based on the optimal bound of the Donsker-Varadhan representation and directly disentangles the source label distribution from the model prediction during training.

CRT and LWS Kang et al. (2019) focus on the impact of representation and classifier strategies and find that data imbalance may not be a major issue in learning high-quality representations. They demonstrate that strong imbalanced classification performance can be achieved by adjusting only the classifier, even when the representations are learned with the simplest instance-balanced (natural) sampling. They propose a straightforward approach called Classifier Re-Training (CRT), which re-trains a re-initialized classifier with class-balanced sampling on top of fixed representations, as sketched below. Besides, Learnable Weight Scaling (LWS) can also improve imbalanced classification by re-scaling the magnitude of the weight vector of each class in the classifier.
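
A sketch of the CRT second stage, assuming a pre-trained `backbone`, its output dimension `feature_dim`, a `dataset`, and its label tensor `labels` already exist (all hypothetical names); the representation is frozen, the classifier is re-initialized, and training uses class-balanced sampling:

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # Freeze the representation learned in stage one.
    for p in backbone.parameters():
        p.requires_grad = False

    num_classes = int(labels.max()) + 1
    class_counts = torch.bincount(labels, minlength=num_classes).float()
    sample_weights = 1.0 / class_counts[labels]    # rare-class samples are drawn more often
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    loader = DataLoader(dataset, batch_size=128, sampler=sampler)

    # Re-initialize and train only the classifier on top of the frozen features.
    classifier = torch.nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)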

Disalign DisAlign (Zhang et al., 2021) is also a two-stage algorithm like CRT and LWS. It keeps both the representation and the classifier fixed and develops an adaptive calibration function that adjusts the classification scores by adding a class-specific extra classifier and an instance-specific confidence layer.
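
A sketch of the adaptive calibration idea, under the assumption that frozen features and logits are available; class-specific scale and offset parameters re-score the logits, and an instance-specific confidence blends the calibrated and original scores:

    import torch
    import torch.nn as nn

    class DisAlignCalibration(nn.Module):
        def __init__(self, feature_dim, num_classes):
            super().__init__()
            self.scale = nn.Parameter(torch.ones(num_classes))    # class-specific scale
            self.offset = nn.Parameter(torch.zeros(num_classes))  # class-specific offset
            self.confidence = nn.Linear(feature_dim, 1)           # instance-specific confidence layer

        def forward(self, features, logits):
            sigma = torch.sigmoid(self.confidence(features))      # confidence in (0, 1) per instance
            calibrated = self.scale * logits + self.offset        # class-specific re-scoring
            return sigma * calibrated + (1.0 - sigma) * logits    # blend with the original scores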

MARC In Wang et al. (2022), the relationship between margins and logits is examined, and a positive correlation between biased margins and biased logits is observed. To address this issue, a MARgin Calibration function (MARC) with only 2K trainable parameters (K is the number of classes) is proposed to dynamically calibrate the biased margins and obtain unbiased logits, with both the representation and the classifier kept fixed.
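
A minimal sketch of MARC-style calibration with 2K parameters, assuming the frozen classifier weight matrix is accessible; the biased logits are re-scaled per class and shifted in proportion to the classifier weight norms:

    import torch
    import torch.nn as nn

    class MARCCalibration(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.omega = nn.Parameter(torch.ones(num_classes))   # per-class scale (K parameters)
            self.beta = nn.Parameter(torch.zeros(num_classes))   # per-class shift (K parameters)

        def forward(self, logits, classifier_weight):
            # classifier_weight: (num_classes, feature_dim), kept frozen during calibration.
            w_norm = classifier_weight.norm(dim=1).detach()
            return self.omega * logits + self.beta * w_norm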


About this article


Cite this article

Wang, Y., Yu, Z., Wang, J. et al. Exploring Vision-Language Models for Imbalanced Learning. Int J Comput Vis 132, 224–237 (2024). https://doi.org/10.1007/s11263-023-01868-w
