Abstract
Lidar and camera are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion is often inferior to single-modal methods based only on Lidar, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on cross-modal transformer blocks (AFMCT) for 3D object detection, which adopts a bidirectional enhancing strategy. Specifically, we first enhance the image features by extracting attention-based point features with a cross-modal transformer block and linking them in a concatenation fashion; a second cross-modal transformer block then acts on the enhanced image features to strengthen the point features with image semantic information. Extensive experiments on the 3D detection benchmark of the KITTI dataset reveal that our proposed structure can significantly improve the detection accuracy of Lidar-only methods and outperforms existing advanced multi-sensor fusion modules by at least 0.45%, which indicates that our method might be a feasible solution for improving 3D object detection based on multi-sensor fusion.
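The bidirectional enhancing strategy described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the feature dimensions, the random projection standing in for a learned linear layer, and the function names (`cross_attention`, `softmax`) are all illustrative assumptions; a real system would use learned weights, multi-head attention, and calibrated point-to-pixel correspondences.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d):
    # Single-head cross-modal attention: rows of `query` attend to `key_value`.
    # query: (Nq, d), key_value: (Nk, d) -> output: (Nq, d)
    scores = query @ key_value.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ key_value

rng = np.random.default_rng(0)
d = 16
img_feat = rng.standard_normal((64, d))   # flattened image feature map (64 pixels)
pt_feat = rng.standard_normal((128, d))   # per-point Lidar features (128 points)

# Step 1: image features query the point features; the attended point
# features are concatenated onto the image features (enhancement).
img_from_pts = cross_attention(img_feat, pt_feat, d)
img_enhanced = np.concatenate([img_feat, img_from_pts], axis=-1)  # (64, 2d)

# Random projection back to d dims (stand-in for a learned linear layer).
W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
img_enhanced = img_enhanced @ W                                    # (64, d)

# Step 2: point features query the *enhanced* image features, pulling
# image semantics back into the point branch before detection heads.
pt_from_img = cross_attention(pt_feat, img_enhanced, d)
pt_fused = np.concatenate([pt_feat, pt_from_img], axis=-1)         # (128, 2d)
```

The sketch only shows the data flow of the two cascaded cross-modal attention stages and the concatenation-style fusion; projection into a common embedding space and the detection head are omitted.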
Data availability
The datasets generated or analyzed during the current study are available from the corresponding author on reasonable request.
References
Huang, K., Shi, B., Li, X., Li, X., Huang, S., Li, Y.: Multi-modal sensor fusion for auto driving perception: a survey. arXiv preprint arXiv:2202.02703 (2022)
Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4490–4499 (2018)
Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., Xu, C.: Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3164–3173 (2021)
Kuang, H., Wang, B., An, J., Zhang, M., Zhang, Z.: Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LIDAR point clouds. Sensors 20, 704 (2020)
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel r-cnn: Towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. pp. 1201–1209 (2021)
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12697–12705 (2019)
Wang, Y., Fathi, A., Kundu, A., Ross, D.A., Pantofaru, C., Funkhouser, T., Solomon, J.: Pillar-based object detection for autonomous driving. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII. pp. 18–34. Springer (2020)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 652–660 (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 30 (2017)
Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 770–779 (2019)
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10529–10538 (2020)
Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., Li, H.: PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 131, 531–551 (2023)
Yan, Y., Mao, Y., Li, B.: SECOND: Sparsely embedded convolutional detection. Sensors 18, 3337 (2018). https://doi.org/10.3390/s18103337
Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3D object detection and tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11784–11793 (2021)
Pang, S., Morris, D., Radha, H.: CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In: 2020 IEEE/RSJ International conference on intelligent robots and systems (IROS). pp. 10386–10393 (2020)
Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 1907–1915 (2017)
Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS). pp. 1–8 (2018)
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: Sequential fusion for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4604–4612 (2020)
Wang, C., Ma, C., Zhu, M., Yang, X.: Pointaugmenting: Cross-modal augmentation for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11794–11803 (2021)
Huang, T., Liu, Z., Chen, X., Bai, X.: Epnet: Enhancing point features with image semantics for 3d object detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV. pp. 35–52. Springer (2020)
Liu, Z., Huang, T., Li, B., Chen, X., Wang, X., Bai, X.: EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans Pattern Anal Mach Intell. (2022)
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W.: 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII. pp. 720–736. Springer (2020)
Zhang, Z., Shen, Y., Li, H., Zhao, X., Yang, M., Tan, W., Pu, S., Mao, H.: Maff-net: Filter false positive for 3d vehicle detection with multi-modal adaptive feature fusion. In: 2022 IEEE 25th International conference on intelligent transportation systems (ITSC). pp. 369–376 (2022)
Wang, G., Tian, B., Zhang, Y., Chen, L., Cao, D., Wu, J.: Multi-view adaptive fusion network for 3D object detection. arXiv preprint arXiv:2011.00652 (2020)
Zhang, Y., Chen, J., Huang, D.: Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 908–917 (2022)
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F., Zhou, B., Zhao, H.: Autoalign: Pixel-instance feature aggregation for multi-modal 3d object detection. arXiv preprint arXiv:2201.06493 (2022)
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F.: Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection. arXiv preprint arXiv:2207.10316 (2022)
Li, Y., Yu, A.W., Meng, T., Caine, B., Ngiam, J., Peng, D., Shen, J., Lu, Y., Zhou, D., Le, Q. V: Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 17182–17191 (2022)
Yang, H., Shi, C., Chen, Y., Wang, L.: Boosting 3D object detection via object-focused image fusion. arXiv preprint arXiv:2207.10589. (2022)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. arXiv preprint arXiv:1803.02155 (2018)
OpenPCDet Development Team: OpenPCDet: An open-source toolbox for 3D object detection from point clouds. https://github.com/open-mmlab/OpenPCDet (2020)
Kim, Y., Park, K., Kim, M., Kum, D., Choi, J.W.: 3D dual-fusion: Dual-domain dual-query camera-LiDAR fusion for 3D object detection. arXiv preprint arXiv:2211.13529 (2022)
Tsai, Y.-H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.-P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 6558–6569 (2019)
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by the Project of Major Scientific and Technological Achievements Engineering in Hefei (2021CG003), the Anhui Provincial Development and Reform Commission 2021 New Energy Vehicle Industry Innovation Development Project (wfgcyh2021439), the 2022 Major Science and Technology Projects of Anhui Province (202203a05020008), the Joint research project of the Yangtze River Delta community of sci-tech innovation (2022CSJGG1501), the Jining City Industrial Innovation Major Technology "Global Unveiling" Project (2022JBZP002) and the China Speech Valley innovation and development Project (2108-340161-04-01-727575).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang and Zehao Pan. Jin Cheng and Yangyang Zhang wrote the main manuscript text, Chenglei Yang prepared Figs. 1–7, and Tables 1–4 were prepared by Xinyu Wang and Yanhui Wang. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "AFMCT: Adaptive Fusion Module based on Cross-modal Transformer block for 3D Object Detection".
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, B., Wang, Y., Zhang, C. et al. AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection. Machine Vision and Applications 35, 40 (2024). https://doi.org/10.1007/s00138-024-01509-3