A Survey on Long-Tailed Visual Recognition

International Journal of Computer Vision

Abstract

Heavy reliance on data is one of the major factors currently limiting the development of deep learning. Data quality directly determines the performance of deep learning models, and the long-tailed distribution is one of the factors affecting data quality. The long-tailed phenomenon is widespread because power laws are prevalent in nature. Under such a distribution, model performance is often dominated by the head classes, while the tail classes are severely under-learned. To learn adequately for all classes, many researchers have studied and preliminarily addressed the long-tailed problem. In this survey, we focus on the problems caused by long-tailed data distributions, sort out the representative long-tailed visual recognition datasets, and summarize some mainstream long-tailed studies. Specifically, we group these studies into ten categories from the perspective of representation learning and outline the highlights and limitations of each category. In addition, we study four quantitative metrics for evaluating imbalance and suggest using the Gini coefficient to evaluate the long-tailedness of a dataset. Based on the Gini coefficient, we quantitatively study 20 widely used, large-scale visual datasets proposed in the last decade and find that the long-tailed phenomenon is widespread and has not been fully studied. Finally, we provide several future directions for the development of long-tailed learning.
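To make the Gini-coefficient suggestion concrete, the following is a minimal sketch of how the long-tailedness of a dataset could be scored from its per-class sample counts. It uses the standard textbook Gini coefficient; the exact formulation used in the survey is not shown in this excerpt and may differ. The function name `gini_coefficient` and the example counts are hypothetical.

```python
def gini_coefficient(class_counts):
    """Standard Gini coefficient of non-negative per-class sample counts.

    Assumption: this is the textbook definition, not necessarily the
    exact variant used in the survey.
    """
    counts = sorted(class_counts)  # ascending order
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one class with a non-zero count")
    # Rank-weighted sum over ascending counts (ranks start at 1).
    weighted = sum(rank * c for rank, c in enumerate(counts, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical class-size profiles for illustration only.
balanced = [100] * 10
long_tailed = [1000 // (k + 1) for k in range(10)]  # Zipf-like decay

print("balanced:   ", gini_coefficient(balanced))
print("long-tailed:", gini_coefficient(long_tailed))
```

Under this definition the score is 0 for a perfectly balanced dataset and grows toward 1 as samples concentrate in fewer classes, so a larger value indicates a more pronounced long tail.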

References

  • Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675

  • An, X., Zhu, X., Xiao, Y., Wu, L., Zhang, M., Gao, Y., Qin, B., Zhang, D., & Fu, Y. (2020). Partial fc: Training 10 million identities on a single machine. arXiv:2010.05222

  • Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hachette Books.

  • Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. In Proceedings of the European conference on computer vision (pp. 382–398).

  • Andrej, K., George, T., Sanketh, S., Thomas, L., Rahul, S., & Li, F.F. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE international conference on computer vision (pp. 1725–1732).

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425–2433).

  • Brock, A., Jeff, D., & Karen, S. (2018). Large scale Gan training for high fidelity natural image synthesis. In International conference on learning representations.

  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. In Advances in neural information processing systems (pp. 1877–1901).

  • Buda, M., Maki, A., & Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, 249–259.

    Article  Google Scholar 

  • Byrd, J., & Lipton, Z. (2019). What is the effect of importance weighting in deep learning? In International conference on machine learning (pp. 872–881). PMLR.

  • Caesar, H., Uijlings, J., & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1209–1218).

  • Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in neural information processing systems (pp. 1567–1578)

  • Castrup, H. (2001). Distributions for uncertainty analysis. In Proceedings of international dimensional workshop (pp. 1–12).

  • Chang, N., Koushik, J., Tarr, M. J., Hebert, M., & Wang, Y. X. (2020). Alpha net: Adaptation with composition in classifier space. arXiv:2008.07073

  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

    Article  Google Scholar 

  • Chen, X., Fan, H., Girshick, R., & He, K. (2020). Improved baselines with momentum contrastive learning. arXiv:2003.04297

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597–1607). PMLR.

  • Cheng, B., Schwing, A.G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. arXiv:2107.06278

  • Chou, H. P., Chang, S. C., Pan, J. Y., Wei, W., & Juan, D. C. (2020). Remix: Rebalanced mixup. In Proceedings of the European conference on computer vision (pp. 95–110)

  • Chu, P., Bian, X., Liu, S., & Ling, H. (2020). Feature space augmentation for long-tailed data. In Proceedings of the European conference on computer vision (pp. 694–710).

  • Contributors, M. (2020). Mmsegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation

  • Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 702–703).

  • Cui, Y., Jia, M., Lin, T.Y., Song, Y., & Belongie, S. (2019). Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9268–9277).

  • Cui, J., Liu, S., Tian, Z., & Jia, J. (2021). Reslt: Residual learning for long-tailed recognition. arXiv:2101.10633

  • Cui, J., Zhong, Z., Liu, S., Yu, B., & Jia, J. (2021). Parametric contrastive learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 715–724).

  • Dave, A., Dollár, P., Ramanan, D., Kirillov, A., & Girshick, R. (2021). Evaluating large-vocabulary object detectors: The devil is in the details. arXiv:2102.01066

  • David, A., Hartley, O., & Pearson, S. (1954). The distribution of the ratio, in a single normal sample, of range to standard deviation. Biometrika, 41, 482–493.

    Article  MathSciNet  Google Scholar 

  • Davidson, L. (1999). Uncertainty in economics. In Uncertainty, international money, employment and theory (pp. 30–37).

  • Delmas, R., & Yan, L. (2005). Exploring students’ conceptions of the standard deviation. Statistics Education Research Journal, 4, 55–82.

    Article  Google Scholar 

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 248–255)

  • Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4690–4699).

  • Devi, D., & Purkayastha, B. (2017). Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognition Letters, 93, 3–12.

    Article  Google Scholar 

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Annual conference of the North American chapter of the association for computational linguistics: Human Language Technologies (pp. 4171–4186)

  • Dina, G., Michael, J., David, H., Julio, D., & Robert, S. (2020). Decreasing median age of covid-19 cases in the united states–changing epidemiology or changing surveillance? PLOS ONE, 15, e0240783.

    Article  Google Scholar 

  • Dong, Q., Gong, S., & Zhu, X. (2017). Class rectification hard mining for imbalanced deep learning. In Proceedings of the IEEE international conference on computer vision (pp. 1851–1860).

  • Dong, Q., Gong, S., & Zhu, X. (2018). Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 1367–1381.

    Article  Google Scholar 

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth \(16{\times }16\) words: Transformers for image recognition at scale. In International conference on learning representations.

  • Dvir, S., & Gal, C. (2021). Distributional robustness loss for long-tail learning. arXiv:2104.03066

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88, 303–338.

    Article  Google Scholar 

  • Fan, Q., Zhuo, W., Tang, C. K., & Tai, Y. W. (2020). Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4013–4022).

  • Fogarty, A., Richard, H., & John, B. (2000). International comparison of median age at death from cystic fibrosis. Chest, 117, 1656–1660.

    Article  Google Scholar 

  • Ghosh, M., Nangia, N., & Kim, D. H. (1996). Estimation of median income of four-person families: A Bayesian time series approach. Journal of the American Statistical Association, 91, 1423–1431.

    Article  MathSciNet  Google Scholar 

  • Gidaris, S., & Komodakis, N. (2018). Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4367–4375).

  • Gini, C. (1912). Variabilità e mutabilità. Memorie di metodologica statistica.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1448).

  • Goodfellow, I., Mehdi Mirza, J. P. A., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems.

  • Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Heuna, K., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., & Memisevic, R. (2017). The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision (pp. 5842–5850).

  • Gu, X., Lin, T. Y., Kuo, W., & Cui, Y. (2021). Zero-shot detection via vision and language knowledge distillation. arXiv:2104.13921

  • Gui, S., Wang, H., Yang, H., Wang, C. Y. Z., & Liu., J. (2019). Model compression with adversarial robustness: A unified optimization framework. In Advances in neural information processing systems (pp. 1283–1294).

  • Guo, Y., Zhang, L., Hu, Y., He, X., & Gao, J. (2016). Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings of the European conference on computer vision (pp. 87–102).

  • Gupta, A., Dollar, P., & Girshick, R. (2019). Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356–5364).

  • Hadsell, R., Chopra, S., & LeCun, Y. (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239.

    Article  Google Scholar 

  • Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-smote: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878–887). Springer.

  • He, H., Bai, Y., Garcia, E. A., & Li, S. (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (pp. 1322–1328).

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729–9738).

  • He, Y. Y., Wu, J., & Wei, X. S. (2021). Distilling virtual examples for long-tailed recognition. arXiv:2103.15042

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21, 1263–1284.

    Article  Google Scholar 

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531

  • Hong, Y., Han, S., Choi, K., Seo, S., Kim, B., & Chang, B. (2021). Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6626–6636).

  • Hsieh, T. I., Robb, E., Chen, H. T., & Huang, J. B. (2021). Droploss for long-tail instance segmentation. In Proceedings of the AAAI conference on artificial intelligence (pp. 1549–1557).

  • Hu, X., Jiang, Y., Tang, K., Chen, J., Miao, C., & Zhang, H. (2020). Learning to segment the tail. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14045–14054).

  • Huang, C., Li, Y., Loy, C. C., & Tang, X. (2016). Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5375–5384).

  • Huang, C., Li, Y., Loy, C. C., & Tang, X. (2019). Deep imbalanced learning for face recognition and attribute prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2781–2794.

    Article  Google Scholar 

  • Inaturalist (2018). Competition dataset. https://github.com/visipedia/inat_comp/tree/master/2018

  • Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

    Article  Google Scholar 

  • Jamal, M. A., Brown, M., Yang, M. H., Wang, L., & Gong, B. (2020). Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7610–7619).

  • Janowczyk, A., & Madabhushi, A. (2016). Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7, 29.

    Article  Google Scholar 

  • Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6, 429–449.

    Article  Google Scholar 

  • Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., & Chen, X. (2020). In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

  • Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Comput, 6, 181–214.

    Article  Google Scholar 

  • Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.

    Article  Google Scholar 

  • Kahn, H., & Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1, 263–278.

    Article  Google Scholar 

  • Kakwani, N. C. (1977). Applications of Lorenz curves in economic analysis. Econometrica, 45, 719–727.

    Article  MathSciNet  Google Scholar 

  • Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., & Kalantidis, Y. (2020). Decoupling representation and classifier for long-tailed recognition. In International conference on learning representations.

  • Karras, T., Samuli, L., & Timo, A. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4401–4410).

  • Kim, J., Jeong, J., & Shin, J. (2020). M2m: Imbalanced classification via major-to-minor translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13896–13905).

  • Kim, D. J., Sun, X., Choi, J., Lin, S., & Kweon, I. S. (2020). Detecting human-object interactions with action co-occurrence priors. In Proceedings of the European conference on computer vision (pp. 718–736)

  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv:1312.6114

  • Kirillov, A., Girshick, R., He, K., & Dollar, P. (2019). Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6399–6408).

  • Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123, 32–73.

    Article  MathSciNet  Google Scholar 

  • Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech Report.

  • Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., et al. (2020). The open images dataset v4. International Journal of Computer Vision, 128, 1956–1981.

    Article  Google Scholar 

  • Lample, G., Ott, M., Conneau, A., Denoyer, L., & Ranzato, M. (2018). Phrase-based and neural unsupervised machine translation. arXiv:1804.07755

  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv:1909.11942

  • Levi, G., & Hassner, T. (2015). Age and gender classification using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 34–42).

  • Li, T., Cao, P., Yuan, Y., Fan, L., Yang, Y., Feris, R., Indyk, P., & Katabi, D. (2021). Targeted supervised contrastive learning for long-tailed recognition. arXiv:2111.13998

  • Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., & Freeman, W. T. (2019). Learning the depths of moving people by watching frozen people. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4521–4530).

  • Li, S., Gong, K., Liu, C. H., Wang, Y., Qiao, F., & Cheng, X. (2021). Metasaug: Meta semantic augmentation for long-tailed visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5212–5221).

  • Li, B., Liu, Y., & Wang, X. (2019). Gradient harmonized single-stage detector. In Proceedings of the AAAI conference on artificial intelligence (pp. 8577–8584).

  • Li, J., Tang, S., Li, J., Xiao, J., Wu, F., Pu, S., & Zhuang, Y. (2020). Topic adaptation and prototype encoding for few-shot visual storytelling. In Proceedings of the ACM international conference on multimedia (pp. 4208–4216).

  • Li, T., Wang, L., & Wu, G. (2021). Self supervision to distillation for long-tailed visual recognition. arXiv:2109.04075

  • Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., & Feng, J. (2020). Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10991–11000).

  • Li, X., Wei, T., Chen, Y. P., Tai, Y. W., & Tang, C. K. (2020). Fss-1000: A 1000-class dataset for few-shot segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

  • Li, B., Yao, Y., Tan, J., Zhang, G., Yu, F., Lu, J., & Luo, Y. (2022). Equalized focal loss for dense long-tailed object detection. arXiv:2201.02593

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Proceedings of the European conference on computer vision (pp. 740–755).

  • Liu, T. Y. (2011). Learning to rank for information retrieval.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). Ssd: Single shot multibox detector. In Proceedings of the European conference on computer vision (pp. 21–37).

  • Liu, B., Li, H., Kang, H., & Hua, G. (2021). Gistet: A geometric structure transfer network for long-tailed recognition. arXiv:2105.00131

  • Liu, B., Li, H., Kang, H., Hua, G., & Vasconcelos, N. (2021). Breadcrumbs: Adversarial class-balanced sampling for long-tailed recognition. arXiv:2105.00127

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).

  • Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., & Yu, S. X. (2019). Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2537–2546).

  • Liu, J., Sun, Y., Han, C., Dou, Z., & Li, W. (2020). Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2970–2979).

  • Liu, J., Zhang, J., Li, W., Zhang, C., & Sun, Y. (2020). Memory-based jitter: Improving visual recognition on long-tailed data with diversity in memory. arXiv:2008.09809

  • Liu, X. Y., Wu, J., & Zhou, Z. H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39, 539–550.

    Google Scholar 

  • Lvis Challenge (2019). https://www.lvisdataset.org/

  • Madry, A., Makelov, A., Schmidt, L., Tsipras, & D., Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In International conference on learning representations.

  • Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., & Van Der Maaten, L. (2018). Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (pp. 181–196).

  • Mani, I., & Zhang, I. (2003). KNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets vol. 126. ICML United States.

  • Masoudnia, S., & Ebrahimpour, R. (2014). Mixture of experts: A literature survey. Artificial Intelligence Review, 42, 275–293.

    Article  Google Scholar 

  • Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., & Kumar, S. (2021). Long-tail learning via logit adjustment. In International conference on learning representations.

  • Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., & Yang, Y. (2021). Vspw: A large-scale dataset for video scene parsing in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4133–4143).

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arXiv:1310.4546

  • Narayanan, A., Chen, Y. T., & Malla, S. (2018). Semi-supervised learning: Fusion of self-supervised, supervised learning, and multimodal cues for tactical driver behavior detection. arXiv:1807.00864

  • Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4004–4012).

  • Oksuz, K., Cam, B. C., Kalkan, S., & Akbas, E. (2020). Imbalance problems in object detection: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 3388–3415.

    Article  Google Scholar 

  • Ouyang, W., Wang, X., Zhang, C., & Yang, X. (2016). Factors in finetuning deep model for object detection with long-tail distribution. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 864–873).

  • Peng, J., Bu, X., Sun, M., Zhang, Z., Tan, T., & Yan, J. (2020). Large-scale object detection in the wild from imbalanced multi-labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9709–9718).

  • Peng, Z., Huang, W., Guo, Z., Zhang, X., Jiao, J., & Ye, Q. (2021). Long-tailed distribution adaptation. In Proceedings of the ACM international conference on multimedia (pp. 3275–3282).

  • Prabhu, V., Kannan, A., Ravuri, M., Chablani, M., Sontag, D., & Amatriain, X. (2018). Prototypical clustering networks for dermatological disease diagnosis. arXiv:1811.03066

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv:2103.00020

  • Ramanathan, V., Wang, R., & Mahajan, D. (2020). Dlwl: Improving detection for lowshot classes with weakly labelled data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9342–9352).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems vol. 28 (pp. 91–99).

  • Ren, J., Yu, C., Sheng, S., Ma, X., Zhao, H., Yi, S., & Li, H. (2020). Balanced meta-softmax for long-tailed visual recognition. In Advances in neural information processing systems.

  • Ren, M., Zeng, W., Yang, B., & Urtasun, R. (2018). Learning to reweight examples for robust deep learning. In International conference on machine learning (pp. 4334–4343). PMLR.

  • Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Pinto, A. S., Keysers, D., & Houlsby, N. (2021). Scaling vision with sparse mixture of experts. arXiv:2106.05974

  • Ristani, E., Solera, F., Zou, R. S., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European conference on computer vision (pp. 17–35).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.

    Article  MathSciNet  Google Scholar 

  • Shaham, T.R., Dekel, T., & Michaeli, T. (2019). Singan: Learning a generative model from a single natural image. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4570–4580).

  • Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8430–8439).

  • Shen, L., Lin, Z., & Huang, Q. (2016). Relay backpropagation for effective learning of deep convolutional neural networks. In Proceedings of the European conference on computer vision (pp. 467–482).

  • Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 761–769).

  • Shu, X., Wang, X., Zang, X., Zhang, S., Chen, Y., Li, G., & Tian, Q. (2021). Large-scale spatio-temporal person re-identification: Algorithm and benchmark. arXiv:2105.15076

  • Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., & Meng, D. (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. In Advances in neural information processing systems vol. 32 (pp. 1919–1930).

  • Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: Tricks of the trade (pp. 239–274). Springer.

  • Sinha, S., Ohashi, H., & Nakamura, K. (2020). Class-wise difficulty-balanced loss for solving class-imbalance. In Proceedings of the Asian conference on computer vision.

  • Sohn, K. (2016). Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems (pp. 1857–1865).

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

  • Tan, J., Lu, X., Zhang, G., Yin, C., & Li, Q. (2021). Equalization loss v2: A new gradient balance approach for long-tailed object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1685–1694).

  • Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., & Yan, J. (2020). Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11662–11671).

  • Tang, K., Huang, J., & Zhang, H. (2020). Long-tailed classification by keeping the good and removing the bad momentum causal effect. In Advances in neural information processing systems.

  • Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., et al. (2016). Yfcc100m: The new data in multimedia research. Communications of the ACM, 59, 64–73.

    Article  Google Scholar 

  • Tian, Z., Shen, C., Chen, H., & He, T. (2019). Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9627–9636).

  • van Steenkiste, S., Greff, K., & Schmidhuber, J. (2019). A perspective on objects and systematic generalization in model-based RL. arXiv:1906.01035

  • van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. In Advances in neural information processing systems.

  • Van Horn, G., & Perona, P. (2017). The devil is in the tails: Fine-grained classification in the wild. arXiv:1709.01450

  • Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., & Belongie, S. (2018). The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8769–8778).

  • Wang, Y., Gan, W., Yang, J., Wu, W., & Yan, J. (2019). Dynamic curriculum learning for imbalanced data classification. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 5017–5026).

  • Wang, C., Gao, S., Wang, P., Gao, G., Pei, W., Pan, L., & Xu, Z. (2021). Label-aware distribution calibration for long-tailed classification. arXiv:2111.04901

  • Wang, P., Han, K., Wei, X. S., Zhang, L., & Wang, L. (2021). Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 943–952).

  • Wang, R., Hu, K., Zhu, Y., Shu, J., Zhao, Q., & Meng, D. (2020). Meta feature modulator for long-tailed recognition. arXiv:2008.03428

  • Wang, T., Li, Y., Kang, B., Li, J., Liew, J., Tang, S., Hoi, S., & Feng, J. (2020). The devil is in classification: A simple framework for long-tail instance segmentation. In Proceedings of the European conference on computer vision (pp. 728–744).

  • Wang, X., Lian, L., Miao, Z., Liu, Z., & Yu, S. X. (2021). Long-tailed recognition by routing diverse distribution-aware experts. In International conference on learning representations.

  • Wang, T. C., Liu, M. Y., Zhu, J. Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. In Advances in neural information processing systems (pp. 1152–1164).

  • Wang, Y. X., Ramanan, D., & Hebert, M. (2017). Learning to model the tail. In Advances in neural information processing systems (pp. 7029–7039).

  • Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., & Liu, W. (2018). Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5265–5274).

  • Wang, H., Xiao, C., Kossaifi, J., Yu, Z., Anandkumar, A., & Wang, Z. (2021). Augmax: Adversarial composition of random augmentations for robust training. In Advances in neural information processing systems.

  • Wang, Y., Yao, Q., Kwok, J., & Ni, L. (2019). Few-shot learning: A survey. arXiv:1904.05046

  • Wang, Y., Zhang, B., Hou, W., Wu, Z., Wang, J., & Shinozaki, T. (2021). Margin calibration for long-tailed visual recognition. arXiv:2112.07225

  • Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C. C., & Lin, D. (2021). Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9695–9704).

  • Wang, T., Zhu, Y., Zhao, C., Zeng, W., Wang, J., & Tang, M. (2021). Adaptive class suppression loss for long-tail object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3103–3112).

  • Wang, K. J., Makond, B., Chen, K. H., & Wang, K. M. (2014). A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Applied Soft Computing, 20, 15–24.

  • Wei, C., Sohn, K., Mellina, C., Yuille, A., & Yang, F. (2021). Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10857–10866).

  • Weyand, T., Araujo, A., Cao, B., & Sim, J. (2020). Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2575–2584).

  • Wightman, R., Touvron, H., & Jegou, H. (2021). Resnet strikes back: An improved training procedure in timm. arXiv:2110.00476

  • Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 3, 408–421.

  • Wu, T., Huang, Q., Liu, Z., Wang, Y., & Lin, D. (2020). Distribution-balanced loss for multi-label classification in long-tailed datasets. In Proceedings of the European conference on computer vision (pp. 162–178).

  • Wu, Y., Kirillov, A., Massa, F., Lo, W. Y., & Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2

  • Wu, T., Liu, Z., Huang, Q., Wang, Y., & Lin, D. (2021). Adversarial robustness under long-tailed distribution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8659–8668).

  • Wu, J., Song, L., Wang, T., Zhang, Q., & Yuan, J. (2020). Forest R-CNN: Large-vocabulary long-tailed object detection and instance segmentation. In Proceedings of the ACM international conference on multimedia (pp. 1570–1578).

  • Xiang, L., Ding, G., & Han, J. (2020). Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification. In Proceedings of the European conference on computer vision (pp. 247–263).

  • Yang, Y., & Xu, Z. (2020). Rethinking the value of labels for improving class-imbalanced learning. In Advances in neural information processing systems.

  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems (pp. 5753–5763).

  • Yang, L., Song, Q., & Wu, Y. (2021). Attacks on state-of-the-art face recognition using attentional adversarial attack generative network. Multimedia Tools and Applications, 80, 855–875.

  • Yaoyao, Z., & Weihong, D. (2019). Adversarial learning with margin-based triplet embedding regularization. In Proceedings of the IEEE/CVF international conference on computer vision.

  • Yitzhaki, S., & Schechtman, E. (2013). More than a dozen alternative ways of spelling Gini. In The Gini Methodology (pp. 11–31).

  • Yu, W., Yang, T., & Chen, C. (2021). Towards resolving the challenge of long-tail distribution in UAV images for object detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 3258–3267).

  • Zang, Y., Huang, C., & Loy, C. C. (2021). Fasa: Feature augmentation and sampling adaptation for long-tailed instance segmentation. arXiv:2102.12867

  • Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., & Lin, S. (2020). Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach. In Proceedings of the European conference on computer vision (pp. 507–523).

  • Zhang, S., Chen, C., Hu, X., & Peng, S. (2021). Balanced knowledge distillation for long-tailed learning. arXiv:2104.10510

  • Zhang, Y., Cheng, D. Z., Yao, T., Yi, X., Hong, L., & Chi, E. H. (2021). A model of two tales: Dual transfer learning framework for improved long-tail item recommendation. In Proceedings of the web conference 2021 (pp. 2220–2231).

  • Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. In International conference on learning representations.

  • Zhang, X., Fang, Z., Wen, Y., Li, Z., & Qiao, Y. (2017). Range loss for deep face recognition with long-tailed training data. In Proceedings of the IEEE international conference on computer vision (pp. 5409–5418).

  • Zhang, Y., Kang, B., Hooi, B., Yan, S., & Feng, J. (2021). Deep long-tailed learning: A survey. arXiv:2110.04596

  • Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., & Gao, J. (2021). Vinvl: Revisiting visual representations in vision-language models. arXiv:2101.00529

  • Zhang, S., Li, Z., Yan, S., He, X., & Sun, J. (2021). Distribution alignment: A unified framework for long-tail visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2361–2370).

  • Zhang, G., Lu, X., Tan, J., Li, J., Zhang, Z., Li, Q., & Hu, X. (2021). Refinemask: Towards high-quality instance segmentation with fine-grained features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6861–6869).

  • Zhang, C., Pan, T. Y., Li, Y., Hu, H., Xuan, D., Changpinyo, S., Gong, B., & Chao, W. L. (2021). A simple and effective use of object-centric images for long-tailed object detection. arXiv:2102.08884

  • Zhang, Z., Shi, Y., Yuan, C., Li, B., Wang, P., Hu, W., & Zha, Z. J. (2020). Object relational graph with teacher-recommended learning for video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13278–13288).

  • Zhang, Y., Wei, X. S., Zhou, B., & Wu, J. (2021). Bag of tricks for long-tailed visual recognition with deep convolutional neural networks. In Proceedings of the AAAI conference on artificial intelligence (pp. 3447–3455).

  • Zhao, Y., Chen, W., Tan, X., Huang, K., Xu, J., Wang, C., & Zhu, J. (2021). Improving long-tailed classification from instance level. arXiv:2104.06094

  • Zhao, J., Li, J., Cheng, Y., Zhou, L., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In Proceedings of the ACM international conference on multimedia (pp. 792–800).

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881–2890).

  • Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., & Tian, Q. (2015). Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision (pp. 1116–1124).

  • Zhong, Z., Cui, J., Liu, S., & Jia, J. (2021). Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16489–16498).

  • Zhou, B., Cui, Q., Wei, X. S., & Chen, Z. M. (2020). Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9719–9728).

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2921–2929).

  • Zhou, X., Koltun, V., & Krähenbühl, P. (2021). Probabilistic two-stage detection. arXiv:2103.07461

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 633–641).

  • Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 1452–1464.

  • Zou, Y., Yu, Z., Kumar, B., & Wang, J. (2018). Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (pp. 289–305).

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant No. 2021YFF0500900).

Author information

Corresponding author

Correspondence to Qing Song.

Additional information

Communicated by Liwei Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, L., Jiang, H., Song, Q. et al. A Survey on Long-Tailed Visual Recognition. Int J Comput Vis 130, 1837–1872 (2022). https://doi.org/10.1007/s11263-022-01622-8

  • DOI: https://doi.org/10.1007/s11263-022-01622-8
