
Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

  • Conference paper
  • MultiMedia Modeling (MMM 2022)

Abstract

RGB-D object detection is a fundamental yet challenging task due to the inherent differences between RGB and depth information. In this paper, we propose a Multi-scale Cross-modal Transformer Network (MCTNet) consisting of two well-designed components: the Multi-modal Feature Pyramid module (MFP) and the Cross-Modal Transformer (CMTrans). Specifically, we introduce the MFP to enrich high-level semantic features with geometric information and to enhance low-level geometric clues with semantic features, which is shown to facilitate the subsequent cross-modal feature fusion. Furthermore, we develop the CMTrans to effectively exploit the long-range attention between the enhanced RGB and depth features, enabling the network to focus on regions of interest. Extensive experiments show that our MCTNet surpasses state-of-the-art detectors by 1.6% mAP on SUN RGB-D and 1.0% mAP on NYU Depth v2, demonstrating the effectiveness of the proposed method.
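To make the cross-modal attention idea behind CMTrans concrete, the following is a minimal PyTorch sketch in which flattened RGB features attend to depth features via standard multi-head attention. This is a hypothetical reconstruction, not the authors' implementation: the module name CrossModalAttention and the choices of embed_dim, num_heads, and the feed-forward design are illustrative assumptions.

```python
# A minimal, hypothetical sketch of cross-modal attention in the spirit of
# CMTrans, written in PyTorch. Names and hyper-parameters (CrossModalAttention,
# embed_dim=256, num_heads=8) are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """RGB feature locations attend to all depth feature locations."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, C, H, W) feature maps from the two modality branches.
        b, c, h, w = rgb.shape
        q = rgb.flatten(2).transpose(1, 2)      # (B, H*W, C): queries from RGB
        kv = depth.flatten(2).transpose(1, 2)   # (B, H*W, C): keys/values from depth
        # Long-range attention: every RGB location can look at every depth location.
        fused, _ = self.attn(q, kv, kv)
        x = self.norm1(q + fused)               # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # position-wise feed-forward
        return x.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    rgb = torch.randn(2, 256, 32, 32)
    depth = torch.randn(2, 256, 32, 32)
    print(CrossModalAttention()(rgb, depth).shape)  # torch.Size([2, 256, 32, 32])
```

In a full detector such a block would presumably be applied at each pyramid level, and possibly in both directions (depth attending to RGB as well); the paper's exact configuration is available only in the full text.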



Acknowledgement

This work is partially supported by the National Key R&D Program of China (2019YFB1600400), the Sichuan Science and Technology Program (2020GZYZF0006), the Beijing Advanced Innovation Center for Future Chip (ICFC), and Huachuang Aima Information Technology (Chengdu) Co., Ltd.

Author information


Corresponding author

Correspondence to Guijin Wang.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Xiao, Z., Xie, P., Wang, G. (2022). Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13141. Springer, Cham. https://doi.org/10.1007/978-3-030-98358-1_28


  • DOI: https://doi.org/10.1007/978-3-030-98358-1_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98357-4

  • Online ISBN: 978-3-030-98358-1

  • eBook Packages: Computer Science, Computer Science (R0)
