SRIF-RCNN: Sparsely represented inputs fusion of different sensors for 3D object detection

Published in Applied Intelligence.

Abstract

3D object detection is a vital task in many practical applications, such as autonomous driving, augmented reality, and robot navigation. Recent LiDAR-only 3D detection methods have advanced significantly, whereas sensor fusion methods have received less attention and made less progress. This paper aims to lift the 3D detection performance of sensor fusion methods. To this end, we present a novel sensor fusion strategy that effectively extracts and fuses features from different sensors. First, the outputs of the different sensors are transformed into sparsely represented inputs. Second, features are extracted from these inputs through an efficient backbone. Finally, the extracted features of the different sensors are fused in a point-wise manner with the help of a gate mechanism. In addition, color supervision is introduced for the first time to learn color distributions, which provide discriminative features for proposal refinement. Based on this sensor fusion strategy and color distribution estimation, a multi-sensor 3D object detection network, named Sparsely Represented Inputs Fusion RCNN (SRIF-RCNN), is proposed. It achieves state-of-the-art performance on the highly competitive KITTI official 3D detection leaderboard, ranking 1st among sensor fusion methods and 2nd among LiDAR-only methods with published work. Extensive experiments validate the effectiveness of the proposed network.
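The gated point-wise fusion described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the linear gate parameterization (a single weight matrix over the concatenated features), and the feature shapes are all assumptions made for illustration; the idea shown is only that a sigmoid gate decides, per point and per channel, how much each sensor's feature contributes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_pointwise_fusion(f_lidar, f_image, W, b):
    """Fuse per-point features from two sensor branches with a learned gate.

    f_lidar, f_image: (N, C) point-wise features from each branch.
    W: (2C, C) gate weights; b: (C,) gate bias (learned in practice;
    random here, purely for illustration).
    Returns an (N, C) array of fused features.
    """
    concat = np.concatenate([f_lidar, f_image], axis=1)  # (N, 2C)
    gate = sigmoid(concat @ W + b)                       # (N, C), values in (0, 1)
    # Per point and per channel, the gate interpolates between the two sensors,
    # so each fused value is a convex combination of the two inputs.
    return gate * f_lidar + (1.0 - gate) * f_image

# Toy usage: 4 points with 8-channel features from each sensor.
rng = np.random.default_rng(0)
N, C = 4, 8
f_lidar = rng.normal(size=(N, C))
f_image = rng.normal(size=(N, C))
fused = gated_pointwise_fusion(
    f_lidar, f_image, rng.normal(size=(2 * C, C)) * 0.1, np.zeros(C)
)
print(fused.shape)  # (4, 8)
```

Because the gate output lies strictly in (0, 1), every fused value stays between the corresponding LiDAR and image feature values, so neither sensor can be entirely discarded; a hard selection would instead require pushing the gate toward 0 or 1 during training.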



Acknowledgements

The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China (Nos. 61501394 and 62173289) and the Natural Science Foundation of Hebei Province of China (No. F2016203155).

Author information

Corresponding author: Deming Kong.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, X., Kong, D. SRIF-RCNN: Sparsely represented inputs fusion of different sensors for 3D object detection. Appl Intell 53, 5532–5553 (2023). https://doi.org/10.1007/s10489-022-03594-1
