
Towards Data-Efficient Detection Transformers

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13669)


Abstract

Detection transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer from significant performance drops on small-size datasets such as Cityscapes; in other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how the key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. In addition, we introduce a simple yet effective label augmentation method that provides richer supervision and improves data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both small-size and sample-rich datasets. Code will be made publicly available at https://github.com/encounter1997/DE-DETRs.
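To make the key idea concrete, below is a minimal, illustrative PyTorch sketch of sparse local key/value construction for decoder cross-attention: each object query attends only to features sampled from a local image area around a per-query reference box (here via torchvision's roi_align), rather than to the full global feature map. This is a hedged sketch under our own assumptions, not the authors' released implementation; the module and parameter names (LocalCrossAttention, roi_size, ref_boxes) are hypothetical.

```python
# Illustrative sketch (not the paper's code): cross-attention whose key/value
# sequence is sparsely sampled from a local region per query, as suggested by
# the abstract's finding that local sparse feature sampling aids data efficiency.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class LocalCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, roi_size=7):
        super().__init__()
        self.roi_size = roi_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, queries, feat_map, ref_boxes, spatial_scale=1.0):
        """
        queries:   (B, Q, C) decoder object queries
        feat_map:  (B, C, H, W) backbone/encoder feature map
        ref_boxes: (B, Q, 4) per-query reference boxes, (x1, y1, x2, y2)
                   in image coordinates
        """
        B, Q, C = queries.shape
        # roi_align expects a (K, 5) tensor of [batch_idx, x1, y1, x2, y2]
        batch_idx = torch.arange(B, device=queries.device).repeat_interleave(Q)
        rois = torch.cat([batch_idx[:, None].float(),
                          ref_boxes.reshape(B * Q, 4)], dim=1)
        # Sparse local sampling: (B*Q, C, roi_size, roi_size)
        local = roi_align(feat_map, rois, output_size=self.roi_size,
                          spatial_scale=spatial_scale, aligned=True)
        # Each query attends only to its own short local key/value sequence.
        kv = local.flatten(2).transpose(1, 2)   # (B*Q, roi_size^2, C)
        q = queries.reshape(B * Q, 1, C)        # (B*Q, 1, C)
        out, _ = self.attn(q, kv, kv)           # (B*Q, 1, C)
        return out.reshape(B, Q, C)
```

In a full detector the reference boxes would come from learned query embeddings or region proposals; the point of the sketch is that each query's key/value sequence has length roi_size², far shorter than the H×W global sequence, which is consistent with the abstract's observation that sparse sampling from local areas is what makes the cross-attention easier to learn from limited data.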

W. Wang—This work was done during Wen Wang’s internship at JD Explore Academy. J. Zhang—Co-first author.



Acknowledgement

This work is supported by the National Key R&D Program of China under Grant 2020AAA0105701, the National Natural Science Foundation of China (NSFC) under Grant 61872327, the Major Special Science and Technology Project of Anhui (No. 012223665049), and the ARC project FL-170100117.

Author information


Corresponding author

Correspondence to Yang Cao.


Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3416 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, W., Zhang, J., Cao, Y., Shen, Y., Tao, D. (2022). Towards Data-Efficient Detection Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13669. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_6


  • DOI: https://doi.org/10.1007/978-3-031-20077-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20076-2

  • Online ISBN: 978-3-031-20077-9

  • eBook Packages: Computer Science, Computer Science (R0)
