
AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

  • RESEARCH
  • Published in Machine Vision and Applications

Abstract

Lidar and camera are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion is often inferior to single-modal methods based only on Lidar, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on a cross-modal transformer block (AFMCT) for 3D object detection that employs a bidirectional enhancing strategy. Specifically, we first enhance the image features by extracting attention-based point features with a cross-modal transformer block and linking the two in a concatenation fashion; a second cross-modal transformer block then acts on the enhanced image features to strengthen the point features with image semantic information. Extensive experiments on the 3D detection benchmark of the KITTI dataset show that our proposed structure significantly improves the detection accuracy of Lidar-only methods and outperforms existing advanced multi-sensor fusion modules by at least 0.45%, indicating that our method may be a feasible solution for improving 3D object detection based on multi-sensor fusion.
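To make the two-step fusion concrete, the sketch below gives one plausible reading of the pipeline in PyTorch: a first cross-modal attention block lets image features query the point features, the result is linked back onto the image features by concatenation, and a second block then lets the point features query the enhanced image features. All class names, dimensions, and layer choices here (including the use of standard multi-head attention as the "cross-modal transformer block") are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the bidirectional enhancing strategy described in the
# abstract. Every design detail below is an assumption for illustration.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One modality (query) attends to another modality (key/value)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feat: torch.Tensor, context_feat: torch.Tensor) -> torch.Tensor:
        # Cross-attention: query modality gathers features from the context modality.
        fused, _ = self.attn(query_feat, context_feat, context_feat)
        return self.norm(fused)


class AFMCTSketch(nn.Module):
    """Image features enhanced by points, then point features by image."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.pts_to_img = CrossModalBlock(dim)  # step 1: image queries points
        self.img_to_pts = CrossModalBlock(dim)  # step 2: points query image
        # Projects the concatenated image + attention-based point feature
        # back to the working dimension (a hypothetical design choice).
        self.reduce = nn.Linear(2 * dim, dim)

    def forward(self, img_feat: torch.Tensor, pt_feat: torch.Tensor) -> torch.Tensor:
        # Step 1: extract an attention-based point feature for each image
        # location and link it to the image feature by concatenation.
        attended_pts = self.pts_to_img(img_feat, pt_feat)
        enhanced_img = self.reduce(torch.cat([img_feat, attended_pts], dim=-1))
        # Step 2: the second block acts on the enhanced image feature to
        # strengthen the point feature with image semantic information.
        return self.img_to_pts(pt_feat, enhanced_img)


# Toy usage: batch of 2, 100 image tokens and 60 Lidar point tokens, 128 channels.
img = torch.randn(2, 100, 128)
pts = torch.randn(2, 60, 128)
print(AFMCTSketch()(img, pts).shape)  # torch.Size([2, 60, 128])
```

The enhanced point features would then feed a standard Lidar detection head; how the real module tokenizes image pixels and Lidar points, and how it achieves the "adaptive" weighting, is specified in the full paper rather than in this sketch.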




Data availability

The datasets generated or analyzed during the current study are available from the corresponding author on reasonable request.


Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by the Project of Major Scientific and Technological Achievements Engineering in Hefei (2021CG003), the Anhui Provincial Development and Reform Commission 2021 New Energy Vehicle Industry Innovation Development Project (wfgcyh2021439), the 2022 Major Science and Technology Projects of Anhui Province (202203a05020008), the Joint Research Project of the Yangtze River Delta Community of Sci-Tech Innovation (2022CSJGG1501), the Jining City Industrial Innovation Major Technology "Global Unveiling" Project (2022JBZP002), and the Projects for Transformation and Industrialization of Scientific and Technological Achievements and the China Speech Valley Innovation and Development Project (2108-340161-04-01-727575).

Author information


Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang, and Zehao Pan. Jin Cheng and Yangyang Zhang wrote the main manuscript text, and Chenglei Yang prepared Figs. 1–7. Tables 1–4 were prepared by Xinyu Wang and Yanhui Wang. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Junzhao Jiang.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that we have no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "AFMCT: Adaptive Fusion Module based on Cross-modal Transformer block for 3D Object Detection".

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, B., Wang, Y., Zhang, C. et al. AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection. Machine Vision and Applications 35, 40 (2024). https://doi.org/10.1007/s00138-024-01509-3
