HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation

International Journal of Computer Vision

Abstract

We present a hybrid-view knowledge distillation framework, termed HVDistill, that guides the feature learning of a point-cloud neural network with a pre-trained image network in an unsupervised manner. By exploiting the geometric relationship between RGB cameras and LiDAR sensors, we establish correspondences between the two modalities from both the image-plane view and the bird's-eye view, which facilitates representation learning. Specifically, image-plane correspondences are obtained by projecting the point clouds onto the image, while bird's-eye-view correspondences are obtained by lifting pixels into 3D space with depths that are predicted under the supervision of the projected point clouds. The image teacher network thus provides rich semantics from the image-plane view while contributing geometric information from the bird's-eye view. Features from the two views naturally complement each other and together improve the representation learned by the point-cloud student network. Moreover, since the 2D network is pre-trained in a self-supervised fashion, HVDistill requires neither 2D nor 3D annotations. We pre-train our model on the nuScenes dataset and transfer it to several downstream tasks on the nuScenes, SemanticKITTI, and KITTI datasets for evaluation. Extensive experimental results show that our method achieves consistent improvements over a baseline trained from scratch and significantly outperforms existing schemes. The source code is available at https://github.com/zhangsha1024/HVDistill.
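
As a concrete illustration of the two correspondence types, the following minimal Python sketch shows how image-plane pairs can be formed by projecting LiDAR points with the camera calibration, and how pixels can be lifted back to 3D with predicted depths before bird's-eye-view pooling. It assumes a 4x4 LiDAR-to-camera extrinsic matrix and a 3x3 intrinsic matrix; the function names are illustrative and are not taken from the HVDistill codebase.

import numpy as np

def image_plane_correspondences(points_lidar, T_lidar_to_cam, K, img_h, img_w):
    # Append a homogeneous coordinate and map LiDAR points into the camera frame.
    ones = np.ones((points_lidar.shape[0], 1))
    pts_cam = (T_lidar_to_cam @ np.hstack([points_lidar, ones]).T).T[:, :3]

    # Discard points behind the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]

    # Perspective projection with the 3x3 intrinsic matrix K.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Keep only projections that land inside the image bounds.
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < img_w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < img_h))
    return uv[inside], pts_cam[inside]  # paired (pixel, 3D point) sets

def lift_pixels_to_camera_frame(uv, pred_depth, K):
    # Back-project pixels to unit-depth rays, then scale by predicted depth.
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T
    return rays * pred_depth.reshape(-1, 1)  # (N, 3) points in the camera frame

Under these assumptions, the first helper yields the pixel-to-point pairs used for image-plane distillation, while the second performs the depth-based lifting that precedes pooling into a bird's-eye-view grid.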

Data Availability

The nuScenes (Caesar et al., 2020) dataset and nuScenes-lidarseg (Caesar et al., 2020) dataset can be obtained from https://www.nuscenes.org/. The KITTI (Geiger et al., 2012) dataset and SemanticKITTI (Behley et al., 2019) dataset can be obtained from https://www.cvlibs.net/datasets/kitti/. The code that supports the findings of this study is available from the corresponding author, Yanyong Zhang, upon reasonable request.

References

  • Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(11), 2274–2282.

  • Alexiou, E., Yang, N., & Ebrahimi, T. (2020). PointXR: A toolbox for visualization and subjective evaluation of point clouds in virtual reality. In: 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), IEEE, pp. 1–6.

  • Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems (NeurIPS), 33, 9758–9770.

  • Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., & Gall, J. (2019). SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9297–9307.

  • Berman, M., Triki, A. R., & Blaschko, M. B. (2018). The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4413–4421.

  • Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 535–541.

  • Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11621–11631.

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9650–9660.

  • Chen, X., Fan, H., Girshick, R., & He, K. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

  • Chen, H., Luo, S., Gao, X., & Hu, W. (2021). Unsupervised learning of geometric sampling invariant representations for 3d point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 893–903.

  • Chen, S., Duan, C., Yang, Y., Li, D., Feng, C., & Tian, D. (2019). Deep unsupervised learning of 3d point clouds via graph topology inference and filtering. IEEE Transactions on Image Processing (TIP), 29, 3183–3198.

  • Cho, J. H., & Hariharan, B. (2019). On the efficacy of knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4794–4802.

  • Choy, C., Gwak, J., & Savarese, S. (2019). 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3075–3084.

  • Duan, Y., Peng, J., Zhang, Y., Ji, J., & Zhang, Y. (2022). PFilter: Building persistent maps through feature filtering for fast and accurate lidar-based SLAM. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 11087–11093.

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 3354–3361.

  • Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 33, 21271–21284.

  • Guo, X., Shi, S., Wang, X., & Li, H. (2021). LIGA-Stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3153–3163.

  • Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2827–2836.

  • Han, Z., Wang, X., Liu, Y. S., & Zwicker, M. (2019b). Multi-angle point cloud-VAE: Unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, pp. 10441–10450.

  • Han, Z., Wang, X., Liu, Y. S., & Zwicker, M. (2021b). Hierarchical view predictor: Unsupervised 3d global feature learning through hierarchical prediction among unordered views. In: Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), pp. 3862–3871.

  • Han, B., Ma, J. W., & Leite, F. (2021). A framework for semi-automatically identifying fully occluded objects in 3d models: Towards comprehensive construction design review in virtual reality. Advanced Engineering Informatics, 50, 101398.

  • Han, Z., Shang, M., Liu, Y. S., & Zwicker, M. (2019). View inter-prediction GAN: Unsupervised representation learning for 3d shapes by learning global shape memories to support local view predictions. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 33, 8376–8384.

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009.

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.

  • Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  • Huang, J., Huang, G., Zhu, Z., & Du, D. (2021). BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790.

  • Jiang, J., Lu, X., Ouyang, W., & Wang, M. (2021). Unsupervised representation learning for 3d point cloud data. arXiv preprint arXiv:2110.06632.

  • Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., & Dai, J. (2022b). BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, Springer, pp. 1–18.

  • Li, C. L., Zaheer, M., Zhang, Y., Poczos, B., & Salakhutdinov, R. (2018). Point cloud GAN. arXiv preprint arXiv:1810.05795.

  • Li, Y., Deng, J., Zhang, Y., Ji, J., Li, H., & Zhang, Y. (2022). EZFusion: A close look at the integration of lidar, millimeter-wave radar, and camera for accurate 3d object detection and tracking. IEEE Robotics and Automation Letters (RAL), 7(4), 11182–11189.

  • Liu, Y. C., Huang, Y. K., Chiang, H. Y., Su, H. T., Liu, Z. Y., Chen, C. T., Tseng, C. Y., & Hsu, W. H. (2021a). Learning from 2d: Contrastive pixel-to-point knowledge transfer for 3d pretraining. arXiv preprint arXiv:2104.04687.

  • Liu, Z., Qi, X., & Fu, C. W. (2021b). 3d-to-2d distillation for indoor scene parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4464–4474.

  • Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., & Han, S. (2022). BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542.

  • Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020). Improved knowledge distillation via teacher assistant. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 34, 5191–5198.

  • van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

  • Qi, X., Wang, W., Yuan, M., Wang, Y., Li, M., Xue, L., & Sun, Y. (2020). Building semantic grid maps for domestic robot navigation. International Journal of Advanced Robotic Systems, 17(1), 1729881419900066.

  • Rao, Y., Lu, J., & Zhou, J. (2020). Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5376–5385.

  • Sanghi, A. (2020). Info3D: Representation learning on 3d objects using mutual information maximization and contrastive learning. In: European Conference on Computer Vision (ECCV) (pp. 626–642), Springer.

  • Sautier, C., Puy, G., Gidaris, S., Boulch, A., Bursuc, A., & Marlet, R. (2022). Image-to-lidar self-supervised distillation for autonomous driving data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9891–9901.

  • Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–779.

  • Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., & Li, H. (2023). PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision (IJCV), 131(2), 531–551.

  • Shi, S., Wang, Z., Shi, J., Wang, X., & Li, H. (2021). From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(8), 2647–2664.

  • Team, O. D. (2020). OpenPCDet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet.

  • Tian, Y., Krishnan, D., & Isola, P. (2019). Contrastive representation distillation. arXiv preprint arXiv:1910.10699.

  • Wang, Y., Chao, W. L., Garg, D., Hariharan, B., Campbell, M., & Weinberger, K. Q. (2019). Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8445–8453.

  • Wang, Y., Mao, Q., Zhu, H., Deng, J., Zhang, Y., Ji, J., Li, H., & Zhang, Y. (2023). Multi-modal 3d object detection in autonomous driving: A survey. International Journal of Computer Vision (IJCV), 131, 1–31.

  • Wang, P. S., Yang, Y. Q., Zou, Q. F., Wu, Z., Liu, Y., & Tong, X. (2021). Unsupervised 3d learning for shape analysis via multiresolution instance discrimination. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 35, 2773–2781.

  • Xiao, T., Liu, S., De Mello, S., Yu, Z., Kautz, J., & Yang, M. H. (2022). Learning contrastive representation for semantic correspondence. International Journal of Computer Vision (IJCV), 130(5), 1293–1309.

  • Xie, S., Gu, J., Guo, D., Qi, C. R., Guibas, L., & Litany, O. (2020). PointContrast: Unsupervised pre-training for 3d point cloud understanding. In: European Conference on Computer Vision (ECCV) (pp. 574–591), Springer.

  • Xie, J., Zhan, X., Liu, Z., Ong, Y. S., & Loy, C. C. (2022). Delving into inter-image invariance for unsupervised visual representations. International Journal of Computer Vision (IJCV), 130(12), 2994–3013.

  • Yang, Y., Feng, C., Shen, Y., & Tian, D. (2017). FoldingNet: Interpretable unsupervised learning on 3d point clouds. arXiv preprint arXiv:1712.07262.

  • Zhang, L., & Ma, K. (2020). Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In: International Conference on Learning Representations (ICLR).

  • Zhang, Z., Girdhar, R., Joulin, A., & Misra, I. (2021). Self-supervised pretraining of 3d features on any point-cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10252–10263.

  • Zhao, B., Cui, Q., Song, R., Qiu, Y., & Liang, J. (2022a). Decoupled knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11953–11962.

  • Zhao, Y., Fang, G., Guo, Y., Guibas, L., Tombari, F., & Birdal, T. (2022). 3DPointCaps++: Learning 3d representations with capsule networks. International Journal of Computer Vision (IJCV), 130(9), 2321–2336.

  • Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., & Zhang, Y. (2022). VPFNet: Improving 3d object detection with virtual point based lidar and stereo data fusion. IEEE Transactions on Multimedia (TMM). https://doi.org/10.1109/TMM.2022.3189778

Author information

Corresponding authors

Correspondence to Jiajun Deng or Yanyong Zhang.

Ethics declarations

Conflict of interest

There are no conflicts to declare.

Additional information

Communicated by Ming-Hsuan Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, S., Deng, J., Bai, L. et al. HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-023-01981-w