Towards Balanced Learning for Instance Recognition

Abstract

Instance recognition has advanced rapidly along with the development of deep convolutional neural networks. Compared to model architectures, the training process, which is also crucial to the success of detectors, has received relatively little attention. In this work, we carefully revisit the standard training practice of detectors and find that detection performance is often limited by imbalance during the training process, which generally arises at three levels: sample level, feature level, and objective level. To mitigate the resulting adverse effects, we propose Libra R-CNN, a simple yet effective framework towards balanced learning for instance recognition. It integrates IoU-balanced sampling, balanced feature pyramid, and objective re-weighting, which reduce the imbalance at the sample, feature, and objective levels, respectively. Extensive experiments conducted on the MS COCO, LVIS, and Pascal VOC datasets demonstrate the effectiveness of the overall balanced design.
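
As a rough illustration of the sample-level idea named in the abstract, the sketch below draws negative candidates so that each IoU interval contributes about equally, instead of sampling uniformly over all negatives. It is a minimal sketch and not the authors' implementation: the function name `iou_balanced_sample`, the parameters `num_bins` and `max_iou`, the default of three bins, the 0.5 negative threshold, and the top-up step for underfilled bins are all assumptions made for this example.

```python
import random


def iou_balanced_sample(neg_ious, num_samples, num_bins=3, max_iou=0.5, seed=0):
    """Pick negative indices so that each IoU bin contributes about equally.

    neg_ious: IoU of each negative candidate with its best-matching ground truth
              (assumed to lie in [0, max_iou)).
    Returns a list of indices into neg_ious.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)

    # Group candidate indices by the equal-width IoU interval they fall in.
    bins = [[] for _ in range(num_bins)]
    width = max_iou / num_bins
    for idx, iou in enumerate(neg_ious):
        b = min(int(iou / width), num_bins - 1)
        bins[b].append(idx)

    # Draw an equal share from every bin, limited by what the bin holds.
    per_bin = num_samples // num_bins
    chosen = []
    for members in bins:
        k = min(per_bin, len(members))
        chosen.extend(rng.sample(members, k))

    # Simplification: top up uniformly from unused candidates if bins ran short.
    shortfall = num_samples - len(chosen)
    if shortfall > 0:
        chosen_set = set(chosen)
        rest = [i for i in range(len(neg_ious)) if i not in chosen_set]
        chosen.extend(rng.sample(rest, min(shortfall, len(rest))))
    return chosen


# Usage: 90 easy negatives (low IoU) and 10 harder ones (higher IoU).
neg_ious = [0.01] * 90 + [0.2] * 5 + [0.4] * 5
picked = iou_balanced_sample(neg_ious, num_samples=9)
```

With uniform random sampling, the ten higher-IoU candidates in this toy input would rarely be selected; binning by IoU gives them the same share as the abundant low-IoU candidates.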

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (No. 61975175), the Civilian Fundamental Research (No. D040301), the Collaborative Research grant from SenseTime Group (CUHK Agreement No. TS1610626 & No. TS1712093), and the General Research Fund (GRF) of Hong Kong (No. 14236516 & No. 14203518).

Author information

Corresponding author

Correspondence to Qi Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Code is available at https://github.com/open-mmlab/mmdetection.

Communicated by S.-C. Zhu.

About this article

Cite this article

Pang, J., Chen, K., Li, Q. et al. Towards Balanced Learning for Instance Recognition. Int J Comput Vis 129, 1376–1393 (2021). https://doi.org/10.1007/s11263-021-01434-2

Keywords

  • Instance recognition
  • Object detection
  • Balanced learning
  • Deep learning
  • Convolutional neural networks