Detecting Twenty-Thousand Classes Using Image-Level Supervision

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13669))

Abstract

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not need complex assignment schemes to assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP whether evaluated on all classes or only on rare classes, closing the performance gap for object categories with few samples. For the first time, we train a detector with all twenty-one thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning. Code is available at https://github.com/facebookresearch/Detic.

X. Zhou—Work done during an internship at Meta.
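
The abstract says Detic trains the detector's classifier on image-labelled data without a prediction-based scheme for assigning labels to boxes. The sketch below illustrates one assignment-free rule. The rule is an assumption of this example and is not stated in the text above: the image-level labels supervise only the largest proposal, through a per-class binary cross-entropy. The function names, the `(x1, y1, x2, y2)` box format, and the summed (unnormalized) loss are hypothetical choices made for illustration, not the released implementation.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def image_label_loss(proposals, class_logits, image_labels):
    """Supervise only the largest proposal with the image-level labels.

    proposals:    list of (x1, y1, x2, y2) boxes for one image
    class_logits: one list of per-class logits per proposal
    image_labels: set of class indices known to be present in the image

    The choice of box depends only on geometry, never on the model's
    predictions, so no label-to-box assignment step is required.
    """
    if not proposals:
        return 0.0
    # Pick the proposal with the largest area (assumed rule, see lead-in).
    largest = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    logits = class_logits[largest]
    # Sum the binary cross-entropy over all classes (normalization is assumed).
    return sum(
        bce_with_logits(z, 1.0 if c in image_labels else 0.0)
        for c, z in enumerate(logits)
    )
```

For example, calling `image_label_loss([(0, 0, 10, 10), (0, 0, 100, 80)], [[5.0, -5.0, 0.0], [0.0, 0.0, 0.0]], {1})` uses only the second, larger box. The logits of the smaller box do not affect the result.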

Notes

  1. We omit the two linear layers and the bias in the second stage for notational simplicity.

  2. This is more pronounced in detection than in classification, as the “batch-size” for the classification layer is \(512 \times \text{image-batch-size}\), where 512 is the number of RoIs per image.

Acknowledgement

We thank Bowen Cheng and Ross Girshick for helpful discussions and feedback. This material is in part based upon work supported by the National Science Foundation under Grant Nos. IIS-1845485 and IIS-2006820. Xingyi is supported by a Facebook PhD Fellowship.

Author information

Corresponding author

Correspondence to Xingyi Zhou.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I. (2022). Detecting Twenty-Thousand Classes Using Image-Level Supervision. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13669. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_21

  • DOI: https://doi.org/10.1007/978-3-031-20077-9_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20076-2

  • Online ISBN: 978-3-031-20077-9

  • eBook Packages: Computer Science, Computer Science (R0)
