Abstract
Lidar and camera are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion is often inferior to single-modal methods based only on Lidar, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on cross-modal transformer blocks (AFMCT) for 3D object detection, which adopts a bidirectional enhancing strategy. Specifically, we first enhance the image features by extracting attention-based point features with a cross-modal transformer block and linking them in a concatenation fashion; a second cross-modal transformer block then acts on the enhanced image features to strengthen the point features with image semantic information. Extensive experiments on the 3D detection benchmark of the KITTI dataset reveal that our proposed structure can significantly improve the detection accuracy of Lidar-only methods and outperforms existing advanced multi-sensor fusion modules by at least 0.45%, which indicates that our method might be a feasible solution for improving 3D object detection based on multi-sensor fusion.
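The bidirectional enhancing strategy described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the feature dimensions, the random projection standing in for a learned linear layer, and the function names (`cross_attention`, `softmax`) are all illustrative assumptions; a real system would use learned weights, multi-head attention, and calibrated point-to-pixel correspondences.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d):
    # Single-head cross-modal attention: rows of `query` attend to `key_value`.
    # query: (Nq, d), key_value: (Nk, d) -> output: (Nq, d)
    scores = query @ key_value.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ key_value

rng = np.random.default_rng(0)
d = 16
img_feat = rng.standard_normal((64, d))   # flattened image feature map (64 pixels)
pt_feat = rng.standard_normal((128, d))   # per-point Lidar features (128 points)

# Step 1: image features query the point features; the attended point
# features are concatenated onto the image features (enhancement).
img_from_pts = cross_attention(img_feat, pt_feat, d)
img_enhanced = np.concatenate([img_feat, img_from_pts], axis=-1)  # (64, 2d)

# Random projection back to d dims (stand-in for a learned linear layer).
W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
img_enhanced = img_enhanced @ W                                    # (64, d)

# Step 2: point features query the *enhanced* image features, pulling
# image semantics back into the point branch before detection heads.
pt_from_img = cross_attention(pt_feat, img_enhanced, d)
pt_fused = np.concatenate([pt_feat, pt_from_img], axis=-1)         # (128, 2d)
```

The sketch only shows the data flow of the two cascaded cross-modal attention stages and the concatenation-style fusion; projection into a common embedding space and the detection head are omitted.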
Data availability
The datasets generated or analyzed during the current study are available from the corresponding author on reasonable request.
References
Huang, K., Shi, B., Li, X., Li, X., Huang, S., Li, Y.: Multi-modal sensor fusion for auto driving perception: a survey. arXiv preprint arXiv:2202.02703 (2022)
Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4490–4499 (2018)
Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3164–3173 (2021)
Kuang, H., Wang, B., An, J., Zhang, M., Zhang, Z.: Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LIDAR point clouds. Sensors 20, 704 (2020)
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel r-cnn: Towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. pp. 1201–1209 (2021)
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12697–12705 (2019)
Wang, Y., Fathi, A., Kundu, A., Ross, D.A., Pantofaru, C., Funkhouser, T., Solomon, J.: Pillar-based object detection for autonomous driving. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII. pp. 18–34. Springer (2020)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 30 (2017)
Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 770–779 (2019)
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10529–10538 (2020)
Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., Li, H.: PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 131, 531–551 (2023)
Yan, Y., Mao, Y., Li, B.: SECOND: Sparsely embedded convolutional detection. Sensors 18, 3337 (2018). https://doi.org/10.3390/s18103337
Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3D object detection and tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11784–11793 (2021)
Pang, S., Morris, D., Radha, H.: CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In: 2020 IEEE/RSJ International conference on intelligent robots and systems (IROS). pp. 10386–10393 (2020)
Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 1907–1915 (2017)
Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS). pp. 1–8 (2018)
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: Sequential fusion for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4604–4612 (2020)
Wang, C., Ma, C., Zhu, M., Yang, X.: Pointaugmenting: Cross-modal augmentation for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11794–11803 (2021)
Huang, T., Liu, Z., Chen, X., Bai, X.: Epnet: Enhancing point features with image semantics for 3d object detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV. pp. 35–52. Springer (2020)
Liu, Z., Huang, T., Li, B., Chen, X., Wang, X., Bai, X.: EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans Pattern Anal Mach Intell. (2022)
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII. pp. 720–736. Springer (2020)
Zhang, Z., Shen, Y., Li, H., Zhao, X., Yang, M., Tan, W., Pu, S., Mao, H.: Maff-net: Filter false positive for 3d vehicle detection with multi-modal adaptive feature fusion. In: 2022 IEEE 25th International conference on intelligent transportation systems (ITSC). pp. 369–376 (2022)
Wang, G., Tian, B., Zhang, Y., Chen, L., Cao, D., Wu, J.: Multi-view adaptive fusion network for 3D object detection. arXiv preprint arXiv:2011.00652 (2020)
Zhang, Y., Chen, J., Huang, D.: Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 908–917 (2022)
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F., Zhou, B., Zhao, H.: Autoalign: Pixel-instance feature aggregation for multi-modal 3d object detection. arXiv preprint arXiv:2201.06493 (2022)
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F.: Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection. arXiv preprint arXiv:2207.10316 (2022)
Li, Y., Yu, A.W., Meng, T., Caine, B., Ngiam, J., Peng, D., Shen, J., Lu, Y., Zhou, D., Le, Q. V: Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 17182–17191 (2022)
Yang, H., Shi, C., Chen, Y., Wang, L.: Boosting 3D object detection via object-focused image fusion. arXiv preprint arXiv:2207.10589. (2022)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. arXiv preprint arXiv:1803.02155 (2018)
OpenPCDet Development Team: OpenPCDet: An open-source toolbox for 3D object detection from point clouds. https://github.com/open-mmlab/OpenPCDet (2020)
Kim, Y., Park, K., Kim, M., Kum, D., Choi, J.W.: 3D dual-fusion: Dual-domain dual-query camera-LiDAR fusion for 3D object detection. arXiv preprint arXiv:2211.13529 (2022)
Tsai, Y.-H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.-P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 6558–6569 (2019)
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by the Project of Major Scientific and Technological Achievements Engineering in Hefei (2021CG003), the Anhui Provincial Development and Reform Commission 2021 New Energy Vehicle Industry Innovation Development Project (wfgcyh2021439), the 2022 Major Science and Technology Projects of Anhui Province (202203a05020008), the Joint research project of the Yangtze River Delta community of sci-tech innovation (2022CSJGG1501), the Jining City Industrial Innovation Major Technology "Global Unveiling" Project (2022JBZP002) and the China Speech Valley innovation and development Project (2108-340161-04-01-727575).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang and Zehao Pan. Jin Cheng and Yangyang Zhang wrote the main manuscript text, Chenglei Yang prepared Figs. 1–7, and Tables 1–4 were prepared by Xinyu Wang and Yanhui Wang. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "AFMCT: Adaptive Fusion Module based on Cross-modal Transformer block for 3D Object Detection".
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, B., Wang, Y., Zhang, C. et al. AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection. Machine Vision and Applications 35, 40 (2024). https://doi.org/10.1007/s00138-024-01509-3