
CAM R-CNN: End-to-End Object Detection with Class Activation Maps


Abstract

Class activation maps (CAMs) have been widely used in weakly-supervised object localization, where they generate attention maps for specific categories in an image. CAMs can be obtained from category annotations alone, and such annotations are already included in the labels used for fully-supervised object detection. How to exploit the attention information in CAMs to improve fully-supervised object detection is therefore an interesting problem. In this paper, we propose CAM R-CNN, an object detector in which the category-aware attention maps provided by CAMs are integrated into the detection process. CAM R-CNN follows the common pipeline of recent query-based object detectors in an end-to-end fashion, with two key CAM modules embedded into the process. Specifically, the E-CAM module provides embedding-level attention by fusing proposal features with the attention information in CAMs through a transformer encoder, while the S-CAM module supplies spatial-level attention by multiplying feature maps with the top-activated attention map produced by CAMs. In our experiments, CAM R-CNN demonstrates its superiority over several strong baselines on the challenging COCO dataset. Furthermore, we show that the S-CAM module can be applied to two-stage detectors such as Faster R-CNN and Cascade R-CNN with consistent gains.
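The full method is not available in this preview, but the abstract's description of the two modules can be illustrated with a minimal PyTorch sketch. Everything below (the module names CAMHead, s_cam, and ECAM, the tensor shapes, and the use of nn.TransformerEncoder) is an assumption for illustration, not the authors' implementation: the CAM is computed as the standard class-weighted sum of the last convolutional feature maps, the S-CAM path rescales a feature map with the top-activated class map, and the E-CAM path fuses pooled proposal features with CAM-derived tokens through a transformer encoder.

```python
# Hedged sketch of the two CAM-based attention paths described in the abstract.
# Names, shapes, and layer choices are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CAMHead(nn.Module):
    """Standard CAM: class-specific weighted sum of the last conv feature maps."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Classifier whose weights define the per-class activation maps.
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W)
        logits = self.fc(F.adaptive_avg_pool2d(feats, 1).flatten(1))   # (B, K)
        cams = torch.einsum("kc,bchw->bkhw", self.fc.weight, feats)    # (B, K, H, W)
        return logits, F.relu(cams)


def s_cam(feats: torch.Tensor, logits: torch.Tensor, cams: torch.Tensor) -> torch.Tensor:
    """Spatial-level attention: rescale the feature map with the top-activated CAM."""
    top_cls = logits.argmax(dim=1)                                     # (B,)
    top_map = cams[torch.arange(cams.size(0)), top_cls]                # (B, H, W)
    top_map = top_map / (top_map.amax(dim=(1, 2), keepdim=True) + 1e-6)
    return feats * top_map.unsqueeze(1)                                # broadcast over channels


class ECAM(nn.Module):
    """Embedding-level attention: fuse proposal features with CAM tokens via a transformer encoder."""

    def __init__(self, dim: int = 256, num_classes: int = 80, heads: int = 8, layers: int = 1):
        super().__init__()
        # Project per-location class activations into the proposal embedding space.
        self.cam_proj = nn.Linear(num_classes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, proposal_feats: torch.Tensor, cams: torch.Tensor) -> torch.Tensor:
        # proposal_feats: (B, N, dim); cams: (B, K, H, W)
        cam_tokens = self.cam_proj(cams.flatten(2).transpose(1, 2))    # (B, H*W, dim)
        fused = self.encoder(torch.cat([proposal_feats, cam_tokens], dim=1))
        return fused[:, : proposal_feats.size(1)]                      # updated proposal embeddings
```

In a full detector, S-CAM would presumably act on the backbone or FPN feature maps before RoI feature extraction, and E-CAM inside each query-based refinement stage; these placement details are not specified in this preview and are guesses here.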


Data Availability

The datasets analysed during the current study are available at https://cocodataset.org.


Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (Nos. U21B2037, U22B2051, 62176222, 62176223, 62176226, 62072386, 62072387, 62072389, 62002305 and 62272401), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J01002, 2022J06001).

Author information

Corresponding author

Correspondence to Liujuan Cao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, S., Yu, S., Ding, H. et al. CAM R-CNN: End-to-End Object Detection with Class Activation Maps. Neural Process Lett 55, 10483–10499 (2023). https://doi.org/10.1007/s11063-023-11335-9
