Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection

Li, Xin; Shi, Botian; Hou, Yuenan; Wu, Xingjiao; Ma, Tianlong; Li, Yikang; He, Liang

doi:10.1007/978-3-031-19839-7_40

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13698)

Included in the following conference series:

European Conference on Computer Vision


Abstract

Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with point cloud features that are projected onto the 2D image plane, or combine the sparse point cloud with dense image pixels. These fusion approaches often suffer from severe information loss, which leads to sub-optimal performance. To address these problems, we construct a homogeneous structure between the point cloud and images, transforming the camera features into the LiDAR 3D space to avoid projective information loss. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing a self-attention-based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations, which provides object-level alignment guidance for cross-modal feature fusion and strengthens the discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Datasets, and the proposed HMFI achieves better performance than state-of-the-art multi-modal methods. In particular, for the 3D detection of cyclists on the KITTI benchmark, HMFI surpasses all published algorithms by a large margin.
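The fusion and consistency ideas in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no layer details, so the code assumes that point-cloud voxel features act as attention queries over image voxel features (standing in for QFM) and that object-level consistency is measured with cosine similarity (standing in for VFIM). The function names `query_fusion` and `consistency_loss`, the feature shapes, and the concatenation step are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_fusion(point_voxels, image_voxels):
    """Attention-style fusion (assumed form, not the paper's exact QFM).

    point_voxels: (N, C) features of non-empty LiDAR voxels, used as queries.
    image_voxels: (M, C) features of lifted image voxels, used as keys/values.
    Returns (N, 2C): each point feature concatenated with the
    attention-weighted sum of image voxel features.
    """
    c = point_voxels.shape[1]
    scores = point_voxels @ image_voxels.T / np.sqrt(c)  # (N, M)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    attended = weights @ image_voxels                    # (N, C)
    return np.concatenate([point_voxels, attended], axis=1)


def consistency_loss(point_obj, image_obj, eps=1e-8):
    """1 - cosine similarity between two object-level feature vectors
    (an assumed stand-in for the consistency term in VFIM)."""
    denom = np.linalg.norm(point_obj) * np.linalg.norm(image_obj) + eps
    return 1.0 - float(point_obj @ image_obj) / denom


# Toy data: 5 point voxels and 12 image voxels, 8 channels each.
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 8))
images = rng.normal(size=(12, 8))
fused = query_fusion(points, images)
print(fused.shape)
```

Because both modalities live in the same voxel space in this sketch, the attention step needs no projection onto the image plane, which is the property the abstract attributes to the homogeneous representation.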

Acknowledgments

This research is funded by the Science and Technology Commission of Shanghai Municipality (19511120200). The computation was performed on the ECNU Multifunctional Platform for Innovation (001).

Author information

Authors and Affiliations

East China Normal University, Shanghai, China
Xin Li, Xingjiao Wu, Tianlong Ma & Liang He
Shanghai AI Lab, Shanghai, China
Botian Shi, Yuenan Hou & Yikang Li
Fudan University, Shanghai, China
Xingjiao Wu & Tianlong Ma

Corresponding authors

Correspondence to Yikang Li or Liang He.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

Electronic supplementary material

Supplementary material 1 (PDF, 339 KB)

About this paper

Cite this paper

Li, X. et al. (2022). Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_40

DOI: https://doi.org/10.1007/978-3-031-19839-7_40
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19838-0
Online ISBN: 978-3-031-19839-7
eBook Packages: Computer Science, Computer Science (R0)
