Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13670)

Abstract

Leveraging LiDAR-based detectors or real LiDAR point data to guide monocular 3D detection has brought significant improvements, e.g., Pseudo-LiDAR methods. However, existing methods usually apply non-end-to-end training strategies and under-exploit the rich information in the LiDAR data. In this paper, we propose the Cross-Modality Knowledge Distillation (CMKD) network for monocular 3D detection to efficiently and directly transfer knowledge from the LiDAR modality to the image modality on both features and responses. Moreover, we extend CMKD into a semi-supervised training framework by distilling knowledge from large-scale unlabeled data, which significantly boosts performance. As of submission, CMKD ranks \(1^{st}\) among published monocular 3D detectors on both the KITTI test set and the Waymo val set, with significant performance gains over previous state-of-the-art methods. Our code will be released at https://github.com/Cc-Hy/CMKD.
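To make "distillation on both features and responses" concrete, the sketch below illustrates what the two loss terms could look like in PyTorch-style code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, and the specific choices of an MSE feature loss and a temperature-scaled KL response loss are hypothetical stand-ins for the losses defined in the paper.

    # Minimal sketch of feature- and response-level cross-modality distillation
    # (LiDAR teacher -> monocular image student). Hypothetical names and shapes;
    # NOT the authors' implementation.
    import torch
    import torch.nn.functional as F

    def feature_distillation_loss(student_bev_feat, teacher_bev_feat):
        # Align the student's image-derived BEV features with the frozen LiDAR
        # teacher's BEV features. Assumes matching resolution and channel count,
        # e.g. via an adaptation layer on the student side (omitted here).
        return F.mse_loss(student_bev_feat, teacher_bev_feat)

    def response_distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soft-target distillation on the detection head's classification responses.
        t = temperature
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

    if __name__ == "__main__":
        # Random tensors stand in for real network outputs.
        s_feat = torch.randn(2, 64, 200, 176, requires_grad=True)  # student BEV features
        t_feat = torch.randn(2, 64, 200, 176)                       # teacher BEV features (frozen)
        s_cls = torch.randn(2, 100, 3, requires_grad=True)          # student class logits
        t_cls = torch.randn(2, 100, 3)                              # teacher class logits

        loss = feature_distillation_loss(s_feat, t_feat) + response_distillation_loss(s_cls, t_cls)
        loss.backward()  # gradients reach only the student; the teacher is not updated
        print(float(loss))

Because neither term above needs ground-truth 3D boxes, only the teacher's features and responses, the same objectives can be evaluated on unlabeled images, which is the basis of the semi-supervised extension mentioned in the abstract.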

Acknowledgement

This work was supported by the National Key Research and Development Program of China (Grant No. 2018YFE0183900) and YUNJI Technology Co., Ltd.

Author information

Corresponding authors

Correspondence to Hang Dai or Yong Ding.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 720 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hong, Y., Dai, H., Ding, Y. (2022). Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13670. Springer, Cham. https://doi.org/10.1007/978-3-031-20080-9_6

  • DOI: https://doi.org/10.1007/978-3-031-20080-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20079-3

  • Online ISBN: 978-3-031-20080-9

  • eBook Packages: Computer Science, Computer Science (R0)
