Multi-Modal 3D Object Detection in Autonomous Driving: A Survey

International Journal of Computer Vision

Abstract

The past decade has witnessed the rapid development of autonomous driving systems. However, it remains a daunting task to achieve full autonomy, especially when it comes to understanding the ever-changing, complex driving scenes. To alleviate the difficulty of perception, self-driving vehicles are usually equipped with a suite of sensors (e.g., cameras, LiDARs), hoping to capture the scenes with overlapping perspectives to minimize blind spots. Fusing these data streams and exploiting their complementary properties is thus rapidly becoming the current trend. Nonetheless, combining data that are captured by different sensors with drastically different ranging/imaging mechanisms is not a trivial task; instead, many factors need to be considered and optimized. Without care, data from one sensor may act as noise to data from another, so that fusing them yields even poorer results. Thus far, there have been no in-depth guidelines for designing multi-modal fusion based 3D perception algorithms. To fill the void and motivate further investigation, this survey conducts a thorough study of tens of recent deep learning based multi-modal 3D detection networks (with a special emphasis on LiDAR-camera fusion), focusing on their fusion stage (i.e., when to fuse), fusion inputs (i.e., what to fuse), and fusion granularity (i.e., how to fuse). These important design choices play a critical role in determining the performance of the fusion algorithm. In this survey, we first introduce the background of popular sensors used for self-driving, their data properties, and the corresponding object detection algorithms. Next, we discuss existing datasets that can be used for evaluating multi-modal 3D object detection algorithms. Then we present a review of multi-modal fusion based 3D detection networks, taking a close look at their fusion stage, fusion inputs, and fusion granularity, and how these design choices evolve with time and technology. After the review, we discuss open challenges as well as possible solutions. We hope that this survey can help researchers become familiar with the field and embark on investigations in the area of multi-modal 3D object detection.

Notes

  1. http://www.cvlibs.net/datasets/kitti/index.php.

  2. https://www.nuscenes.org/object-detection?externalData=all&mapData=all&modalities=Any

  3. https://waymo.com/intl/en_us/open

References

  • Ahmad, W. A., Wessel, J., Ng, H. J., & Kissinger, D. (2020). IoT-ready millimeter-wave radar sensors. In IEEE global conference on artificial intelligence and Internet of Things (GCAIoT) (pp. 1–5).

  • Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3d pose estimation and tracking by detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 623–630).

  • Arnold, E., Al-Jarrah, O. Y., Dianati, M., Fallah, S., Oxtoby, D., & Mouzakitis, A. (2019). A survey on 3d object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems (TITS), 20(10), 3782–3795.

    Article  Google Scholar 

  • Asvadi, A., Garrote, L., Premebida, C., Peixoto, P., & Nunes, U. (2017). Multimodal vehicle detection: Fusing 3d-lidar and color camera data. Pattern Recognition Letters,115, 20–29.

  • Asvadi, A., Garrote, L., Premebida, C., Peixoto, P., & Nunes, U. J. (2018). Multimodal vehicle detection: Fusing 3d-lidar and color camera data. Pattern Recognition Letters, 115, 20–29.

    Article  Google Scholar 

  • Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., & Tai, C. L. (2022). Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1090–1099).

  • Beltrán, J., Guindel, C., Moreno, F. M., Cruzado, D., García, F., & De La Escalera, A. (2018). Birdnet: A 3d object detection framework from lidar information. In 2018 21st international conference on intelligent transportation systems (ITSC) (pp. 3517–3523).

  • Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11618–11628).

  • Caine, B., Roelofs, R., Vasudevan, V., Ngiam, J., Chai, Y., Chen, Z., & Shlens, J. (2021). Pseudo-labeling for scalable 3d object detection. CoRR abs arXiv:2103.02093

  • Caltagirone, L., Bellone, M., Svensson, L., & Wahde, M. (2019). Lidar-camera fusion for road detection using fully convolutional neural networks. Robotics and Autonomous Systems, 111, 125–131.

    Article  Google Scholar 

  • Carr, P., Sheikh, Y., & Matthews, I. (2012). Monocular object detection using 3d geometric primitives. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, & C. Schmid (Eds.), European conference on computer vision (ECCV) (pp. 864–878).

  • Chadwick, S., Maddern, W., & Newman, P. (2019). Distant vehicle detection using radar and vision. In IEEE international conference on robotics and automation (ICRA) (pp. 8311–8317).

  • Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., & Hays, J. (2019). Argoverse: 3d tracking and forecasting with rich maps. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8740–8749).

  • Charles, R. Q., Su, H., Kaichun, M., & Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 77–85).

  • Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., & Urtasun, R. (2016). Monocular 3d object detection for autonomous driving. In 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, June 27–30, 2016 (pp. 2147–2156). IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.236

  • Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., & Zhao, F. (2022b). Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection. CoRR. arXiv:2207.10316

  • Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., Zhao, F., Zhou, B., & Zhao, H. (2022c). AutoAlign: Pixel-instance feature aggregation for multi-modal 3d object detection. In IJCAI.

  • Chen, Y., Liu, J., Qi, X., Zhang, X., Sun, J., & Jia, J. (2022a). Scaling up kernels in 3d CNNs. arXiv preprint arXiv:2206.10555

  • Chen, X., Ma, H., Wan, J., Li, B., & Xia, T. (2017). Multi-view 3d object detection network for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1907–1915).

  • Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., & Urtasun, R. (2018). 3d object proposals using stereo imagery for accurate object class detection. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 40(5), 1259–1272.

    Article  Google Scholar 

  • Chen, L., Lin, S., Lu, X., Cao, D., Wu, H., Guo, C., Liu, C., & Wang, F. Y. (2021). Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems (TITS), 22(6), 3234–3246.

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 40(4), 834–848.

    Article  Google Scholar 

  • Chen, L., Zou, Q., Pan, Z., Lai, D., & Cao, D. (2019). Surrounding vehicle detection using an FPGA panoramic camera and deep CNNs. IEEE Transactions on Intelligent Transportation Systems, 21(12), 5110–5122.

    Article  Google Scholar 

  • Chu, X., Deng, J., Li, Y., Yuan, Z., Zhang, Y., Ji, J., & Zhang, Y. (2021). Neighbor-vote: Improving monocular 3d object detection through neighbor distance voting. In ACM international conference on multimedia (ACM MM), ACM (pp. 5239–5247).

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3213–3223).

  • Cui, Y., Chen, R., Chu, W., Chen, L., Tian, D., Li, Y., & Cao, D. (2021). Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Transactions on Intelligent Transportation Systems (TITS), 23, 1–18.

  • de Paula Veronese, L., Auat-Cheein, F., Mutz, F., Oliveira-Santos, T., Guivant, J. E., de Aguiar, E., Badue, C. & De Souza, A. F. (2020). Evaluating the limits of a lidar for an autonomous driving localization. IEEE Transactions on Intelligent Transportation Systems (TITS), 22(3), 1449–1458.

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).

  • Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., & Li, H. (2020). Voxel R-CNN: Towards high performance voxel-based 3d object detection. arXiv:2012.15712

  • Deng, J., Zhou, W., Zhang, Y., & Li, H. (2021). From multi-view to hollow-3d: Hallucinated hollow-3d R-CNN for 3d object detection. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 31(12), 4722–4734.

    Article  Google Scholar 

  • Denninger, M., Sundermeyer, M., Winkelbauer, D., Zidan, Y., Olefir, D., Elbadrawy, M., Lodhi, A., & Katam, H. (2019). BlenderProc. CoRR. arXiv:1911.01911.

  • Deschaud, J. E. (2021). KITTI-CARLA: A KITTI-like dataset generated by CARLA simulator. arXiv preprint arXiv:2109.00892

  • Ding, Z., Hu, Y., Ge, R., Huang, L., Chen, S., Wang, Y., & Liao, J. (2020). 1st place solution for Waymo open dataset challenge: 3d detection and domain adaptation. CoRR abs arXiv:2006.15505

  • Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An open urban driving simulator. In Proceedings of the annual conference on robot learning (pp. 1–16)

  • Engelberg, T., & Niem, W. (2009). Method for classifying an object using a stereo camera. U.S. Patent App. 10/589,641.

  • Enzweiler, M., & Gavrila, D. M. (2009). Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 31, 2179–2195.

    Article  Google Scholar 

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

    Article  Google Scholar 

  • Fan, L., Pang, Z., Zhang, T., Wang, Y. X., Zhao, H., Wang, F., Wang, N., & Zhang, Z. (2022). Embracing single stride 3d object detector with sparse transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8458–8468).

  • Fan, L., Xiong, X., Wang, F., Wang, N., & Zhang, Z. (2021). RangeDet: In defense of range view for lidar-based 3d object detection. CoRR abs arXiv:2103.10039

  • Fayyad, J., Jaradat, M., Gruyer, D., & Najjaran, H. (2020). Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors, 20, 4220.

    Article  Google Scholar 

  • Feng, D., Haase-Schütz, C., Rosenbaum, L., Hertlein, H., Gläser, C., Timm, F., Wiesbeck, W., & Dietmayer, K. (2021). Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems (TITS), 22(3), 1341–1360.

  • Gählert, N., Jourdan, N., Cordts, M., Franke, U., & Denzler, J. (2020). Cityscapes 3d: Dataset and benchmark for 9 DoF vehicle detection. CoRR. arXiv:2006.07864.

  • Gaidon, A., Wang, Q., Cabon, Y., & Vig, E. (2016). Virtual worlds as proxy for multi-object tracking analysis. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4340–4349).

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3354–3361).

  • Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research (IJRR), 32(11), 1231–1237.

    Article  Google Scholar 

  • Geiger, D., & Yuille, A. L. (1991). A common framework for image segmentation. International Journal on Computer Vision (IJCV), 6(3), 227–243.

    Article  Google Scholar 

  • Girshick, R. (2015). Fast R-CNN. In IEEE international conference on computer vision (ICCV) (pp. 1440–1448).

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.

  • Guan, T., Wang, J., Lan, S., Chandra, R., Wu, Z., Davis, L., & Manocha, D. (2022). M3DETR: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 772–782).

  • Guizilini, V., Li, J., Ambruş, R., & Gaidon, A. (2021). Geometric unsupervised domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8537–8547).

  • Guo, X., Shi, S., Wang, X., & Li, H. (2021). LIGA-Stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3153–3163).

  • Guo, J., Kurup, U., & Shah, M. (2019). Is it safe to drive? An overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(8), 3135–3151.

    Article  Google Scholar 

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. B. (2017). Mask R-CNN. In IEEE international conference on computer vision (ICCV) (pp. 2980–2988).

  • He, C., Zeng, H., Huang, J., Hua, X. S., & Zhang, L. (2020). Structure aware single-stage 3d object detection from point cloud. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, June 13–19,2020 (pp. 11870–11879). Computer Vision Foundation/IEEE.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).

  • He, T., & Soatto, S. (2019). Mono3d++: Monocular 3d vehicle detection with two-scale 3d hypotheses and task priors. Association for the Advancement of Artificial Intelligence (AAAI), 33, 8409–8416.

    Google Scholar 

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  • Hodaň, T., Vineet, V., Gal, R., Shalev, E., Hanzelka, J., Connell, T., Urbina, P., Sinha, S. N., & Guenter, B. (2019). Photorealistic image synthesis for object instance detection. In 2019 IEEE international conference on image processing (ICIP), IEEE (pp. 66–70).

  • Hu, Y., Ding, Z., Ge, R., Shao, W., Huang, L., Li, K., & Liu, Q. (2021). AFDetV2: Rethinking the necessity of the second stage for object detection from point clouds. arXiv preprint arXiv:2112.09205

  • Hu, P., Ziglar, J., Held, D., & Ramanan, D. (2020). What you see is what you get: Exploiting visibility for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR), computer vision foundation/IEEE (pp. 10998–11006).

  • Huang, J., & Huang, G. (2022). BEVDet4D: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054

  • Huang, J., Huang, G., Zhu, Z., & Du, D. (2021). BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790

  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017a). Densely connected convolutional networks. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2261–2269).

  • Huang, P., Cheng, M., Chen, Y., Luo, H., Wang, C., & Li, J. (2017). Traffic sign occlusion detection using mobile laser scanning point clouds. IEEE Transactions on Intelligent Transportation Systems, 18(9), 2364–2376.

    Article  Google Scholar 

  • Huang, T., Liu, Z., Chen, X., & Bai, X. (2020). EPNet: Enhancing point features with image semantics for 3d object detection. European Conference on Computer Vision (ECCV), 12360, 35–52.

    Google Scholar 

  • Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., & Yang, R. (2019). The apolloscape open dataset for autonomous driving and its application. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 42(10), 2702–2719.

    Article  Google Scholar 

  • Ioannidou, A., Chatzilari, E., Nikolopoulos, S., & Kompatsiaris, I. (2017). Deep learning advances in computer vision with 3d data: A survey. ACM Computing Survey, 50(2), 20:1-20:38.

    Google Scholar 

  • Jiang, M., Wu, Y., & Lu, C. (2018). PointSIFT: A sift-like network module for 3d point cloud semantic segmentation. CoRR abs arXiv:1807.00652

  • Jiao, Y., Jie, Z., Chen, S., Chen, J., Wei, X., Ma, L., & Jiang, Y. G. (2022). MSMDfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. arXiv preprint arXiv:2209.03102

  • Kar, A., Prakash, A., Liu, M. Y., Cameracci, E., Yuan, J., Rusiniak, M., Acuna, D., Torralba, A., & Fidler, S. (2019). Meta-Sim: Learning to generate synthetic datasets. In IEEE international conference on computer vision (ICCV) (pp. 4550–4559).

  • Kellner, D., Klappstein, J., & Dietmayer, K. (2012). Grid-based DBSCAN for clustering extended objects in radar data. In IEEE intelligent vehicles symposium (IV) (pp. 365–370).

  • Kesten, R., Usman, M., Houston, J., Pandya, T., Nadhamuni, K., Ferreira, A., Yuan, M., Low, B., Jain, A., Ondruska, P., Omari, S., Shah, S., Kulkarni, A., Kazakova, A., Tao, C., Platinsky, L., Jiang, W., & Shet, V. (2019). Level 5 perception dataset 2020. https://level-5.global/level5/data/

  • Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, October 25–29, 2014 (pp. 1746–1751). ACL. https://doi.org/10.3115/v1/d14-1181

  • Kim, K., & Woo, W. (2005a). A multi-view camera tracking for modeling of indoor environment. Berlin.

  • Kim, K., & Woo, W. (2005b). A multi-view camera tracking for modeling of indoor environment. In K. Aizawa, Y. Nakamura & S. Satoh (Eds.), Advances in multimedia information processing—PCM 2004 (pp. 288–297).

  • Kim, Y., Choi, J.W., & Kum, D. (2020). GRIF Net: Gated region of interest fusion network for robust 3d object detection from radar point cloud and monocular image. In IROS (pp. 10857–10864).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS) (vol. 25).

  • Ku, J., Mozifian, M., Lee, J., Harakeh, A., & Waslander, S. L. (2018). Joint 3d proposal generation and object detection from view aggregation. In IEEE international conference on intelligent robots and systems (IROS) (pp. 1–8).

  • Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). PointPillars: Fast encoders for object detection from point clouds. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 12697–12705).

  • Lee, S. (2020). Deep learning on radar centric 3d object detection. CoRR abs arXiv:2003.00851

  • Lee, C. H., Lim, Y. C., Kwon, S., & Lee, J. H. (2011). Stereo vision-based vehicle detection using a road feature and disparity histogram. Optical Engineering, 50(2), 027004–027004.

    Article  Google Scholar 

  • Levinson, J., & Thrun, S. (2013). Automatic online calibration of cameras and lasers. In Robotics: Science and systems (vol. 2, p. 7).

  • Li, P., Chen, X., & Shen, S. (2019). Stereo R-CNN based 3d object detection for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7644–7652).

  • Li, Y., Yu, A. W., Meng, T., Caine, B., Ngiam, J., Peng, D., Shen, J., Lu, Y., Zhou, D., Le, Q. V., & Yuille, A. (2022). DeepFusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17182–17191).

  • Liang, M., Yang, B., Chen, Y., Hu, R., & Urtasun, R. (2019). Multi-task multi-sensor fusion for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7337–7345).

  • Liang, M., Yang, B., Wang, S., & Urtasun, R. (2018). Deep continuous fusion for multi-sensor 3d object detection. In European conference on computer vision (ECCV) (pp. 663–678).

  • Liang, Z., Zhang, M., Zhang, Z., Zhao, X., & Pu, S. (2020). RangeRCNN: Towards fast and accurate 3d object detection with range image representation. CoRR abs arXiv:2009.00206

  • Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017a). Feature pyramid networks for object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 936–944).

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017b). Focal loss for dense object detection. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), P.P.(99), 2999–3007

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV) (pp. 740–755).

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European conference on computer vision (ECCV) (pp. 21–37).

  • Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable architecture search. CoRR. arXiv:1806.09055

  • Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., & Han, S. (2022c). BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542

  • Liu, Y., Wang, T., Zhang, X., & Sun, J. (2022a). PETR: Position embedding transformation for multi-view 3d object detection. arXiv preprint arXiv:2203.05625

  • Liu, Z., Wu, Z., & Tóth, R. (2020). SMOKE: Single-stage monocular 3d object detection via keypoint estimation. In IEEE conference on computer vision and pattern recognition workshops (CVPRW) (pp. 4289–4298).

  • Liu, Y., Yan, J., Jia, F., Li, S., Gao, Q., Wang, T., Zhang, X., & Sun, J. (2022b). PETRv2: A unified framework for 3d perception from multi-camera images. arXiv preprint arXiv:2206.01256

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 39(4), 640–651.

    Google Scholar 

  • Lu, H., Chen, X., Zhang, G., Zhou, Q., Ma, Y., & Zhao, Y. (2019). SCANet: Spatial-channel attention network for 3d object detection. In IEEE international conference on acoustics, speech and, S.P. (ICASSP) (pp. 1992–1996).

  • Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., & Fan, X. (2019). Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In IEEE international conference on computer vision (ICCV) (pp. 6851–6860).

  • Mahmoud, A., Hu, J. S., & Waslander, S. L. (2022). Dense voxel fusion for 3d object detection. arXiv preprint arXiv:2203.00871

  • Major, B., Fontijne, D., Ansari, A., Sukhavasi, R. T., Gowaiker, R., Hamilton, M., Lee, S., & Grzechnik, S. K., Subramanian, S. (2019). Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In IEEE international conference on computer vision workshop (ICCVW) (pp. 924–932).

  • Manivasagam, S., Wang, S., Wong, K., Zeng, W., Sazanovich, M., Tan, S., Yang, B., Ma, W. C., & Urtasun, R. (2020). LiDARsim: Realistic lidar simulation by leveraging the real world. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11167–11176).

  • Mao, J., Xue, Y., Niu, M., Bai, H., Feng, J., Liang, X., Xu, H., & Xu, C. (2021). Voxel transformer for 3d object detection. In 2021 IEEE/CVF international conference on computer vision (ICCV), Montreal, QC, Canada, October 10–17, 2021 (pp. 3144–3153). IEEE. https://doi.org/10.1109/ICCV48922.2021.00315.

  • Marchand, R., & Chaumette, F. (1999). An autonomous active vision system for complete and accurate 3d scene reconstruction. International Journal on Computer Vision (IJCV), 32(3), 171–194.

    Article  Google Scholar 

  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–33.

    Article  Google Scholar 

  • Mousavian, A., Anguelov, D., Flynn, J., & Košecká, J. (2017). 3d bounding box estimation using deep learning and geometry. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5632–5640).

  • Nabati, R., & Qi, H. (2019). RRPN: Radar region proposal network for object detection in autonomous vehicles. In IEEE international conference on image processing (ICIP) (pp. 3093–3097).

  • Nabati, R., & Qi, H. (2021). CenterFusion: Center-based radar and camera fusion for 3d object detection. In IEEE winter conference on applications of computer vision (WACV) (pp. 1527–1536).

  • Nießner, M., Zollhöfer, M., Izadi, S., & Stamminger, M. (2013). Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6), 1–11.

    Article  Google Scholar 

  • Pan, X., Xia, Z., Song, S., Li, L.E., & Huang, G. (2021). 3d object detection with pointformer. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7463–7472).

  • Pandey, G., McBride, J. R., Savarese, S., & Eustice, R. M. (2012). Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information. In Association for the advancement of artificial intelligence (AAAI) (pp. 2053–2059).

  • Pang, S., Morris, D., & Radha, H. (2020). CLOCs: Camera-lidar object candidates fusion for 3d object detection. In IEEE international conference on intelligent robots and systems (IROS) (pp. 10386–10393).

  • Park, D., Ambrus, R., Guizilini, V., Li, J., & Gaidon, A. (2021). Is pseudo-lidar needed for monocular 3d object detection? In IEEE international conference on computer vision (ICCV) (pp. 3142–3152).

  • Park, J. Y., Chu, C. W., Kim, H. W., Lim, S. J., Park, J. C., & Koo, B. K. (2009). Multi-view camera color calibration method using color checker chart. US Patent 12/334,095

  • Patil, A., Malla, S., Gang, H., & Chen, Y. T. (2019). The H3D dataset for full-surround 3D multi-object detection and tracking in crowded urban scenes. In IEEE international conference on robotics and automation (ICRA) (pp. 9552–9557).

  • Patole, S. M., Torlak, M., Wang, D., & Ali, M. (2017). Automotive radars: A review of signal processing techniques. IEEE Signal Processing Magazine, 34(2), 22–35.

    Article  Google Scholar 

  • Pham, Q. H., Sevestre, P., Pahwa, R. S., Zhan, H., Pang, C. H., Chen, Y., Mustafa, A., Chandrasekhar, V., & Lin, J. (2020). A* 3d dataset: Towards autonomous driving in challenging environments. In IEEE international conference on robotics and automation (ICRA) (pp. 2267–2273).

  • Philion, J., & Fidler, S. (2020). Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European conference on computer vision (pp. 194–210). Springer.

  • Pon, A. D., Ku, J., Li, C., & Waslander, S. L. (2020). Object-centric stereo matching for 3d object detection. In IEEE international conference on robotics and automation (ICRA) (pp. 8383–8389).

  • Prakash, A., Boochoon, S., Brophy, M., Acuna, D., Cameracci, E., State, G., Shapira, O., & Birchfield, S. (2019). Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In IEEE international conference on robotics and automation (ICRA) (pp. 7249–7255).

  • Qi, C. R., Litany, O., He, K., & Guibas, L. (2019). Deep Hough voting for 3d object detection in point clouds. In International conference on computer vision (ICCV) (pp. 9276–9285).

  • Qi, C. R., Liu, W., Wu, C., Su, H., & Guibas, L. J. (2018). Frustum PointNets for 3d object detection from RGB-D data. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 918–927).

  • Qi, C.R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems (NeurIPS) (vol. 30).

  • Qian, K., Zhu, S., Zhang, X., & Li, L. E. (2021). Robust multimodal vehicle detection in foggy weather using complementary lidar and radar signals. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 444–453).

  • Qin, Z., Wang, J., & Lu, Y. (2019b). Triangulation learning network: From monocular to stereo 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7615–7623).

  • Qin, Z., Wang, J., & Lu, Y. (2019a). Monogrnet: A geometric reasoning network for monocular 3d object localization. Association for the Advancement of Artificial Intelligence (AAAI), 33, 8851–8858.

    Google Scholar 

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 779–788).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI), 39(6), 1137–1149.

    Article  Google Scholar 

  • Repairer Driven News (2018). Velodyne: Leading LIDAR price halved, new high-res product to improve self-driving cars. https://www.repairerdrivennews.com/2018/01/02/velodyne-leading-lidar-price-halved-new-high-res-product-to-improve-self-driving-cars/

  • Richter, S. R., Vineet, V., Roth, S., & Koltun, V. (2016). Playing for data: Ground truth from computer games. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European conference on computer vision (ECCV) (pp. 102–118).

  • Richter, S. R., Al Haija, H. A., & Koltun, V. (2022). Enhancing photorealism enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1700–1715.

    Article  Google Scholar 

  • Riegler, G., Ulusoy, A. O., & Geiger, A. (2017). OctNet: Learning deep 3d representations at high resolutions. In IEEE conference on computer vision and pattern recognition (CVPR) IEEE Computer Society (pp. 6620–6629).

  • Roddick, T., & Cipolla, R. (2020). Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11138–11147).

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention (MICCAI) (vol. 9351, pp. 234–241).

  • Ros, G., Sellart, L., Materzynska, J., Vazquez, D., & Lopez, A. M. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3234–3243).

  • Schlosser, J., Chow, C. K., & Kira, Z. (2016). Fusing lidar and images for pedestrian detection using convolutional neural networks. In IEEE international conference on robotics and automation (ICRA) (pp. 2198–2205).

  • Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.

  • Schneider, N., Piewak, F., Stiller, C., & Franke, U. (2017). RegNet: Multimodal sensor registration using deep neural networks. In IEEE intelligent vehicles symposium (IV) (pp. 1803–1810).

  • Sheeny, M., Pellegrin, E. D., Mukherjee, S., Ahrabian, A., Wang, S., & Wallace, A. M. (2021). RADIATE: A radar dataset for automotive perception. In IEEE international conference on robotics and automation (ICRA), Xi’an, China, May 30–June 5, 2021 (pp. 1–7). IEEE. https://doi.org/10.1109/ICRA48506.2021.9562089

  • Shi, W., & Rajkumar, R. (2020). Point-GNN: Graph neural network for 3d object detection in a point cloud. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1711–1719).

  • Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., & Li, H. (2020a). PV-RCNN: Point-voxel feature set abstraction for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 10526–10535).

  • Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3d object proposal generation and detection from point cloud. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–779).

  • Shi, S., Wang, Z., Shi, J., Wang, X., & Li, H. (2020b). From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43, 1–1.

  • Shin, K., Kwon, Y. P., & Tomizuka, M. (2019). RoarNet: A robust 3d object detection based on region approximation refinement. In IEEE intelligent vehicles symposium (IV) (pp. 2510–2515).

  • Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., & Dieleman, S. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529, 484–489.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. arXiv:1409.1556

  • Sindagi, V. A., Zhou, Y., & Tuzel, O. (2019). MVX-Net: Multimodal voxelnet for 3d object detection. In IEEE international conference on robotics and automation (ICRA) (pp. 7276–7282).

  • Strecha, C., von Hansen, W., Van Gool, L., Fua, P., & Thoennessen, U. (2008). On benchmarking camera calibration and multi-view stereo for high resolution imagery. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).

  • Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., & Anguelov, D. (2020a). Scalability in perception for autonomous driving: Waymo open dataset. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 13–19, 2020 (pp. 2443–2451). Computer Vision Foundation/IEEE. https://doi.org/10.1109/CVPR42600.2020.00252

  • Sun, Y., Zuo, W., Yun, P., Wang, H., & Liu, M. (2020b). FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Transactions on Automation Science and Engineering, PP(99), 1–12.

  • Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., & Han, S. (2020). Searching efficient 3d architectures with sparse point-voxel convolution. In European conference on computer vision (ECCV) (pp. 685–702).

  • Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R., Clark, M., Dolan, J., Duggins, D., Galatali, T., Geyer, C., & Gittleman, M. (2008). Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics, 25(8), 425–466.

  • Urmson, C., Baker, C., Dolan, J., Rybski, P., Salesky, B., Whittaker, W. R., Ferguson, D., & Darms, M. (2009). Autonomous driving in traffic: Boss and the urban challenge. AI Magazine, 30(2), 17–28.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 6000–6010.

  • Vora, S., Lang, A. H., Helou, B., & Beijbom, O. (2020). PointPainting: Sequential fusion for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4603–4611).

  • Wallace, A. M., Halimi, A., & Buller, G. S. (2020). Full waveform lidar for adverse weather conditions. IEEE Transactions on Vehicular Technology (TVT), 69(7), 7064–7077.

  • Wandinger, U. (2005). Introduction to lidar. Brooks/Cole Pub. Co.

  • Wang, Z., & Jia, K. (2019a). Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IEEE international conference on intelligent robots and systems (IROS) (pp. 1742–1749).

  • Wang, Z., & Jia, K. (2019b). Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In IEEE international conference on intelligent robots and systems (IROS) (pp. 1742–1749).

  • Wang, Y., Chao, W. L., Garg, D., Hariharan, B., Campbell, M., & Weinberger, K. Q. (2019). Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8437–8445).

  • Wang, X., Girshick, R. B., Gupta, A., & He, K. (2018). Non-local neural networks. In IEEE conference on computer vision and pattern recognition (CVPR), Computer Vision Foundation/IEEE Computer Society (pp. 7794–7803).

  • Wang, C., Ma, C., Zhu, M., & Yang, X. (2021). PointAugmenting: Cross-modal augmentation for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11794–11803).

  • Wang, S., Suo, S., Ma, W., Pokrovsky, A., & Urtasun, R. (2018). Deep parametric continuous convolutional neural networks. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2589–2597).

  • Wang, G., Tian, B., Zhang, Y., Chen, L., Cao, D., & Wu, J. (2020). Multi-view adaptive fusion network for 3D object detection. arXiv preprint arXiv:2011.00652

  • Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312, 135–153.

  • Wang, J., & Zhou, L. (2019). Traffic light recognition with high dynamic range imaging and deep learning. IEEE Transactions on Intelligent Transportation Systems, 20(4), 1341–1352.

  • Weng, X., Man, Y., Cheng, D., Park, J., O’Toole, M., & Kitani, K. (2020). All-in-one drive: A large-scale comprehensive perception dataset with high-density long-range point clouds. arXiv

  • Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J. K., Ramanan, D., Carr, P., & Hays, J. (2021). Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the neural information processing systems track on datasets and benchmarks (NeurIPS Datasets and Benchmarks 2021).

  • Wu, X., Peng, L., Yang, H., Xie, L., Huang, C., Deng, C., Liu, H., & Cai, D. (2022). Sparse fuse dense: Towards high quality 3d detection with depth completion. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5418–5427).

  • Xie, J., Kiefel, M., Sun, M. T., & Geiger, A. (2016). Semantic instance annotation of street scenes by 3d to 2d label transfer. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3688–3697).

  • Xie, L., Xiang, C., Yu, Z., Xu, G., Yang, Z., Cai, D., & He, X. (2020). PI-RCNN: An efficient multi-sensor 3d object detector with point-based attentive cont-conv fusion module. Association for the Advancement of Artificial Intelligence (AAAI), 34, 12460–12467.

  • Xu, D., Anguelov, D., & Jain, A. (2018). PointFusion: Deep sensor fusion for 3d bounding box estimation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 244–253).

  • Xu, Q., Zhong, Y., & Neumann, U. (2021). Behind the curtain: Learning occluded shapes for 3d object detection. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence (IAAI), The Twelfth Symposium on Educational Advances in Artificial Intelligence (EAAI), Virtual Event, February 22–March 1, 2022 (pp. 2893–2901). AAAI Press.

  • Yang, Z., Chen, J., Miao, Z., Li, W., Zhu, X., & Zhang, L. (2022b). DeepInteraction: 3d object detection via modality interaction. arXiv preprint arXiv:2208.11112

  • Yang, W., Li, Q., Liu, W., Yu, Y., Ma, Y., He, S., & Pan, J. (2021). Projecting your view attentively: Monocular road scene layout estimation via cross-view transformation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15536–15545).

  • Yang, H., Liu, Z., Wu, X., Wang, W., Qian, W., He, X., & Cai, D. (2022a). Graph R-CNN: Towards accurate 3d object detection with semantic-decorated local graph. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII. Lecture Notes in Computer Science (vol. 13668, pp. 662–679). Springer. https://doi.org/10.1007/978-3-031-20074-8_38

  • Yang, B., Luo, W., & Urtasun, R. (2018a). PIXOR: Real-time 3d object detection from point clouds. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7652–7660).

  • Yang, B., Luo, W., & Urtasun, R. (2018b). PIXOR: Real-time 3d object detection from point clouds. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7652–7660).

  • Yang, Z., Sun, Y., Liu, S., & Jia, J. (2020). 3DSSD: Point-based 3d single stage object detector. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11037–11045).

  • Yang, Z., Sun, Y., Liu, S., Shen, X., & Jia, J. (2018). IPOD: Intensive point-based object detector for point cloud. arXiv preprint arXiv:1812.05276

  • Yang, Z., Sun, Y., Liu, S., Shen, X., & Jia, J. (2019). STD: Sparse-to-dense 3d object detector for point cloud. In IEEE international conference on computer vision (ICCV) (pp. 1951–1960).

  • Yang, B., Guo, R., Liang, M., Casas, S., & Urtasun, R. (2020). RadarNet: Exploiting radar for robust perception of dynamic objects. In European conference on computer vision (ECCV) (vol. 12363, pp. 496–512).

  • Yan, Y., Mao, Y., & Li, B. (2018). SECOND: Sparsely embedded convolutional detection. Sensors, 18(10), 3337.

  • Yin, T., Zhou, X., & Krähenbühl, P. (2021). Center-based 3d object detection and tracking. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 11784–11793).

  • Yoo, J., Ahn, N., & Sohn, K. (2020a). Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8372–8381).

  • Yoo, J. H., Kim, Y., Kim, J., & Choi, J. W. (2020b). 3D-CVF: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In European conference on computer vision (ECCV) (pp. 720–736).

  • Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (NeurIPS) (vol. 27).

  • You, Y., Wang, Y., Chao, W., Garg, D., Pleiss, G., Hariharan, B., Campbell, M. E., & Weinberger, K. Q. (2020). Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. In 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net

  • Zewge, N. S., Kim, Y., Kim, J., & Kim, J. H. (2019). Millimeter-wave radar and RGB-D camera sensor fusion for real-time people detection and tracking. In 2019 7th international conference on robot intelligence technology and applications (RiTA) (pp. 93–98).

  • Zhang, Y., Carballo, A., Yang, H., & Takeda, K. (2021b). Autonomous driving in adverse weather conditions: A survey. arXiv preprint arXiv:2112.08936

  • Zhang, Y., Carballo, A., Yang, H., & Takeda, K. (2021c). Autonomous driving in adverse weather conditions: A survey. arXiv preprint arXiv:2112.08936

  • Zhang, W., Wang, Z., & Loy, C. C. (2020a). Multi-modality cut and paste for 3d object detection. arXiv preprint arXiv:2012.12741

  • Zhang, H., Yang, D., Yurtsever, E., Redmill, K. A., & Özgüner, Ü. (2021a). Faraway-Frustum: Dealing with lidar sparsity for 3d object detection using fusion. In 24th IEEE international intelligent transportation systems conference (ITSC), Indianapolis, IN, USA, September 19–22, 2021 (pp. 2646–2652). IEEE. https://doi.org/10.1109/ITSC48978.2021.9564990

  • Zhang, Y., Zhang, S., Zhang, Y., Ji, J., Duan, Y., Huang, Y., Peng, J., & Zhang, Y. (2020). Multi-modality fusion perception and computing in autonomous driving. Journal of Computer Research and Development, 57(9), 1781.

  • Zhao, X., Liu, Z., Hu, R., & Huang, K. (2019). 3d object detection using scale invariant and feature reweighting networks. In Association for the advancement of artificial intelligence (AAAI) (pp. 9267–9274).

  • Zhou, B., & Krähenbühl, P. (2022). Cross-view transformers for real-time map-view semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13760–13769).

  • Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4490–4499).

  • Zhou, Y., Wan, G., Hou, S., Yu, L., Wang, G., Rui, X., & Song, S. (2020). DA4AD: End-to-end deep attention-based visual localization for autonomous driving. In European conference on computer vision (ECCV) (pp. 271–289).

  • Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., & Zhang, Y. (2022). VPFNet: Improving 3d object detection with virtual point based lidar and stereo data fusion. IEEE Transactions on Multimedia (TMM). https://doi.org/10.1109/TMM.2022.3189778

Acknowledgements

This work was supported by the Anhui Province Development and Reform Commission 2020 New Energy Vehicle Industry Innovation Development Project.

Funding

Funding was provided by the National Key Research and Development Program of China (Grant No. 2018AAA0100500).

Author information

Corresponding author

Correspondence to Yanyong Zhang.

Additional information

Communicated by Slobodan Ilic.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Mao, Q., Zhu, H. et al. Multi-Modal 3D Object Detection in Autonomous Driving: A Survey. Int J Comput Vis 131, 2122–2152 (2023). https://doi.org/10.1007/s11263-023-01784-z

  • DOI: https://doi.org/10.1007/s11263-023-01784-z
