
Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

  • Conference paper
  • MultiMedia Modeling (MMM 2022)

Abstract

RGB-D object detection is a fundamental yet challenging task due to the inherent differences between RGB and depth information. In this paper, we propose a Multi-scale Cross-modal Transformer Network (MCTNet) consisting of two well-designed components: the Multi-modal Feature Pyramid module (MFP) and the Cross-Modal Transformer (CMTrans). Specifically, we introduce the MFP to enrich high-level semantic features with geometric information and to enhance low-level geometric clues with semantic features, which is shown to facilitate the subsequent cross-modal feature fusion. Furthermore, we develop the CMTrans to effectively exploit the long-range attention between the enhanced RGB and depth features, enabling the network to focus on regions of interest. Extensive experiments show that our MCTNet surpasses state-of-the-art detectors by 1.6% mAP on SUN RGB-D and 1.0% mAP on NYU Depth v2, demonstrating the effectiveness of the proposed method.
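To make the cross-modal attention idea behind CMTrans concrete, the following is a minimal PyTorch sketch in which flattened RGB features attend to depth features via standard multi-head attention. This is a hypothetical reconstruction, not the authors' implementation: the module name CrossModalAttention and the choices of embed_dim, num_heads, and the feed-forward design are illustrative assumptions.

```python
# A minimal, hypothetical sketch of cross-modal attention in the spirit of
# CMTrans, written in PyTorch. Names and hyper-parameters (CrossModalAttention,
# embed_dim=256, num_heads=8) are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """RGB feature locations attend to all depth feature locations."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, C, H, W) feature maps from the two modality branches.
        b, c, h, w = rgb.shape
        q = rgb.flatten(2).transpose(1, 2)      # (B, H*W, C): queries from RGB
        kv = depth.flatten(2).transpose(1, 2)   # (B, H*W, C): keys/values from depth
        # Long-range attention: every RGB location can look at every depth location.
        fused, _ = self.attn(q, kv, kv)
        x = self.norm1(q + fused)               # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # position-wise feed-forward
        return x.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    rgb = torch.randn(2, 256, 32, 32)
    depth = torch.randn(2, 256, 32, 32)
    print(CrossModalAttention()(rgb, depth).shape)  # torch.Size([2, 256, 32, 32])
```

In a full detector such a block would presumably be applied at each pyramid level, and possibly in both directions (depth attending to RGB as well); the paper's exact configuration is available only in the full text.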



Acknowledgement

This work is partially supported by the National Key R&D Program of China (2019YFB1600400), the Sichuan Science and Technology Program (2020GZYZF0006), the Beijing Advanced Innovation Center for Future Chip (ICFC), and Huachuang Aima Information Technology (Chengdu) Co., Ltd.

Author information


Corresponding author

Correspondence to Guijin Wang.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Xiao, Z., Xie, P., Wang, G. (2022). Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13141. Springer, Cham. https://doi.org/10.1007/978-3-030-98358-1_28


  • DOI: https://doi.org/10.1007/978-3-030-98358-1_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98357-4

  • Online ISBN: 978-3-030-98358-1

  • eBook Packages: Computer Science, Computer Science (R0)
