
PMG-DETR: fast convergence of DETR with position-sensitive multi-scale attention and grouped queries

Published in Pattern Analysis and Applications (Theoretical Advances)

Abstract

The recently proposed DETR successfully applied the Transformer to object detection and achieved impressive results. However, its learned object queries often explore the entire image to match the corresponding regions, resulting in slow convergence. Additionally, DETR uses only single-scale features from the final stage of the backbone network, leading to poor performance on small objects. To address these issues, we propose an effective training strategy for improving the DETR framework, named PMG-DETR, built on Position-sensitive Multi-scale attention and Grouped queries. First, to better fuse multi-scale features, we propose a position-sensitive multi-scale attention: incorporating a spatial sampling strategy into deformable attention further improves small object detection. Second, we extend the attention mechanism by introducing a novel positional encoding scheme. Finally, we propose a grouping strategy for object queries, where queries are grouped at the decoder side to cover regions of interest more precisely and to accelerate DETR convergence. Extensive experiments on the COCO dataset show that PMG-DETR achieves better performance than DETR, e.g., 47.8% AP with a ResNet-50 backbone trained for 50 epochs. We perform ablation studies on the COCO dataset to validate the effectiveness of each component of the proposed PMG-DETR.
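To make the query-grouping idea concrete, below is a minimal PyTorch-style sketch of grouped self-attention over decoder object queries, in the spirit of Group-DETR-style grouping: queries are split into groups that attend only within themselves, narrowing the region each query explores. The class name, group count, and tensor shapes are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class GroupedQuerySelfAttention(nn.Module):
        # Hypothetical sketch: self-attention applied independently per query group.
        def __init__(self, embed_dim=256, num_heads=8, num_groups=4):
            super().__init__()
            self.num_groups = num_groups
            self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

        def forward(self, queries):
            # queries: (batch, num_queries, embed_dim); num_queries must be divisible by num_groups
            b, n, d = queries.shape
            g = self.num_groups
            # fold each group into the batch dimension so groups attend independently
            grouped = queries.reshape(b, g, n // g, d).reshape(b * g, n // g, d)
            out, _ = self.attn(grouped, grouped, grouped)
            # restore the original (batch, num_queries, embed_dim) layout
            return out.reshape(b, g, n // g, d).reshape(b, n, d)

    # usage: 300 object queries split into 4 groups of 75
    attn = GroupedQuerySelfAttention()
    queries = torch.randn(2, 300, 256)
    print(attn(queries).shape)  # torch.Size([2, 300, 256])

In Group DETR, for example, such grouping is a training-time device and only one group is retained at inference; the sketch is meant only to illustrate the kind of decoder-side grouping the abstract describes.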


Data Availability

The datasets generated and analysed during the current study are available in COCO (https://cocodataset.org) and Cityscapes (https://www.cityscapes-dataset.com) repositories.


Funding

This work was supported by the Hunan Provincial Natural Science Foundation of China (2023JJ50096, 2022JJ50016), the Science and Technology Plan Project of Hunan Province (2016TP1020), the “14th Five-Year Plan” Key Disciplines and Application-oriented Special Disciplines of Hunan Province (Xiangjiaotong [2022] 351).

Author information

Authors and Affiliations

Authors

Contributions

Shuming Cui and Hongwei Deng were responsible for material preparation, data collection, and analysis. Shuming Cui wrote the main manuscript text, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hongwei Deng.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cui, S., Deng, H. PMG-DETR: fast convergence of DETR with position-sensitive multi-scale attention and grouped queries. Pattern Anal Applic 27, 58 (2024). https://doi.org/10.1007/s10044-024-01281-0

