HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation

International Journal of Computer Vision

Abstract

We present a hybrid-view knowledge distillation framework, termed HVDistill, that guides the feature learning of a point-cloud neural network with a pre-trained image network in an unsupervised manner. By exploiting the geometric relationship between RGB cameras and LiDAR sensors, we establish correspondences between the two modalities from both the image-plane view and the bird's-eye view, which facilitates representation learning. Specifically, image-plane correspondences are obtained by projecting the point clouds onto the image, while bird's-eye-view correspondences are obtained by lifting pixels into 3D space with depths that are predicted under the supervision of the projected point clouds. The image teacher network thus provides rich semantics from the image-plane view while contributing geometric information from the bird's-eye view. Features from the two views naturally complement each other and together improve the representation learned by the point-cloud student network. Moreover, since the 2D network is pre-trained in a self-supervised fashion, HVDistill requires neither 2D nor 3D annotations. We pre-train our model on the nuScenes dataset and transfer it to several downstream tasks on the nuScenes, SemanticKITTI, and KITTI datasets for evaluation. Extensive experimental results show that our method achieves consistent improvements over a baseline trained from scratch and significantly outperforms existing schemes. The source code is available at https://github.com/zhangsha1024/HVDistill.
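
As a concrete illustration of the two correspondence types, the following minimal Python sketch shows how image-plane pairs can be formed by projecting LiDAR points with the camera calibration, and how pixels can be lifted back to 3D with predicted depths before bird's-eye-view pooling. It assumes a 4x4 LiDAR-to-camera extrinsic matrix and a 3x3 intrinsic matrix; the function names are illustrative and are not taken from the HVDistill codebase.

import numpy as np

def image_plane_correspondences(points_lidar, T_lidar_to_cam, K, img_h, img_w):
    # Append a homogeneous coordinate and map LiDAR points into the camera frame.
    ones = np.ones((points_lidar.shape[0], 1))
    pts_cam = (T_lidar_to_cam @ np.hstack([points_lidar, ones]).T).T[:, :3]

    # Discard points behind the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]

    # Perspective projection with the 3x3 intrinsic matrix K.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Keep only projections that land inside the image bounds.
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < img_w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < img_h))
    return uv[inside], pts_cam[inside]  # paired (pixel, 3D point) sets

def lift_pixels_to_camera_frame(uv, pred_depth, K):
    # Back-project pixels to unit-depth rays, then scale by predicted depth.
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T
    return rays * pred_depth.reshape(-1, 1)  # (N, 3) points in the camera frame

Under these assumptions, the first helper yields the pixel-to-point pairs used for image-plane distillation, while the second performs the depth-based lifting that precedes pooling into a bird's-eye-view grid.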

Data Availability

The nuScenes (Caesar et al., 2020) dataset and nuScenes-lidarseg (Caesar et al., 2020) dataset can be obtained from https://www.nuscenes.org/. The KITTI (Geiger et al., 2012) dataset and SemanticKITTI (Behley et al., 2019) dataset can be obtained from https://www.cvlibs.net/datasets/kitti/. The code that supports the findings of this study is available from the corresponding author, Yanyong Zhang, upon reasonable request.

References

  • Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(11), 2274–2282.

  • Alexiou, E., Yang, N., & Ebrahimi, T. (2020). PointXR: A toolbox for visualization and subjective evaluation of point clouds in virtual reality. In: 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), IEEE, pp. 1–6.

  • Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems (NeurIPS), 33, 9758–9770.

  • Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., & Gall, J. (2019). SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9297–9307.

  • Berman, M., Triki, A. R., & Blaschko, M. B. (2018). The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4413–4421.

  • Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 535–541.

  • Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2020). nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11621–11631.

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9650–9660.

  • Chen, X., Fan, H., Girshick, R., & He, K. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

  • Chen, H., Luo, S., Gao, X., & Hu, W. (2021). Unsupervised learning of geometric sampling invariant representations for 3d point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 893–903.

  • Chen, S., Duan, C., Yang, Y., Li, D., Feng, C., & Tian, D. (2019). Deep unsupervised learning of 3d point clouds via graph topology inference and filtering. IEEE Transactions on Image Processing (TIP), 29, 3183–3198.

  • Cho, J. H., & Hariharan, B. (2019). On the efficacy of knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4794–4802.

  • Choy, C., Gwak, J., & Savarese, S. (2019). 4d spatio-temporal convnets: Minkowski convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3075–3084.

  • Duan, Y., Peng, J., Zhang, Y., Ji, J., & Zhang, Y. (2022). PFilter: Building persistent maps through feature filtering for fast and accurate lidar-based SLAM. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 11087–11093.

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 3354–3361.

  • Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 33, 21271–21284.

  • Guo, X., Shi, S., Wang, X., & Li, H. (2021). LIGA-Stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3153–3163.

  • Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2827–2836.

  • Han, Z., Wang, X., Liu, Y. S., & Zwicker, M. (2019b). Multi-angle point cloud-VAE: Unsupervised feature learning for 3d point clouds from multiple angles by joint self-reconstruction and half-to-half prediction. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, pp. 10441–10450.

  • Han, Z., Wang, X., Liu, Y. S., & Zwicker, M. (2021b). Hierarchical view predictor: Unsupervised 3d global feature learning through hierarchical prediction among unordered views. In: Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), pp. 3862–3871.

  • Han, B., Ma, J. W., & Leite, F. (2021). A framework for semi-automatically identifying fully occluded objects in 3d models: Towards comprehensive construction design review in virtual reality. Advanced Engineering Informatics, 50, 101398.

  • Han, Z., Shang, M., Liu, Y. S., & Zwicker, M. (2019). View inter-prediction GAN: Unsupervised representation learning for 3d shapes by learning global shape memories to support local view predictions. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 33, 8376–8384.

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009.

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.

  • Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  • Huang, J., Huang, G., Zhu, Z., & Du, D. (2021). BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790.

  • Jiang, J., Lu, X., Ouyang, W., & Wang, M. (2021). Unsupervised representation learning for 3d point cloud data. arXiv preprint arXiv:2110.06632.

  • Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., & Dai, J. (2022b). BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, Springer, pp. 1–18.

  • Li, C. L., Zaheer, M., Zhang, Y., Poczos, B., & Salakhutdinov, R. (2018). Point cloud GAN. arXiv preprint arXiv:1810.05795.

  • Li, Y., Deng, J., Zhang, Y., Ji, J., Li, H., & Zhang, Y. (2022). EZFusion: A close look at the integration of lidar, millimeter-wave radar, and camera for accurate 3d object detection and tracking. IEEE Robotics and Automation Letters (RAL), 7(4), 11182–11189.

  • Liu, Y. C., Huang, Y. K., Chiang, H. Y., Su, H. T., Liu, Z. Y., Chen, C. T., Tseng, C. Y., & Hsu, W. H. (2021a). Learning from 2d: Contrastive pixel-to-point knowledge transfer for 3d pretraining. arXiv preprint arXiv:2104.04687.

  • Liu, Z., Qi, X., & Fu, C. W. (2021b). 3d-to-2d distillation for indoor scene parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4464–4474.

  • Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D., & Han, S. (2022). BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542.

  • Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020). Improved knowledge distillation via teacher assistant. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 34, 5191–5198.

  • van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

  • Qi, X., Wang, W., Yuan, M., Wang, Y., Li, M., Xue, L., & Sun, Y. (2020). Building semantic grid maps for domestic robot navigation. International Journal of Advanced Robotic Systems, 17(1), 1729881419900066.

  • Rao, Y., Lu, J., & Zhou, J. (2020). Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5376–5385.

  • Sanghi, A. (2020). Info3D: Representation learning on 3d objects using mutual information maximization and contrastive learning. In: European Conference on Computer Vision (ECCV) (pp. 626–642), Springer.

  • Sautier, C., Puy, G., Gidaris, S., Boulch, A., Bursuc, A., & Marlet, R. (2022). Image-to-lidar self-supervised distillation for autonomous driving data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9891–9901.

  • Shi, S., Wang, X., & Li, H. (2019). PointRCNN: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–779.

  • Shi, S., Jiang, L., Deng, J., Wang, Z., Guo, C., Shi, J., Wang, X., & Li, H. (2023). PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision (IJCV), 131(2), 531–551.

  • Shi, S., Wang, Z., Shi, J., Wang, X., & Li, H. (2021). From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(8), 2647–2664.

  • Team, O. D. (2020). OpenPCDet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet.

  • Tian, Y., Krishnan, D., & Isola, P. (2019). Contrastive representation distillation. arXiv preprint arXiv:1910.10699.

  • Wang, Y., Chao, W. L., Garg, D., Hariharan, B., Campbell, M., & Weinberger, K. Q. (2019). Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8445–8453.

  • Wang, Y., Mao, Q., Zhu, H., Deng, J., Zhang, Y., Ji, J., Li, H., & Zhang, Y. (2023). Multi-modal 3d object detection in autonomous driving: A survey. International Journal of Computer Vision (IJCV), 131, 1–31.

  • Wang, P. S., Yang, Y. Q., Zou, Q. F., Wu, Z., Liu, Y., & Tong, X. (2021). Unsupervised 3d learning for shape analysis via multiresolution instance discrimination. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 35, 2773–2781.

  • Xiao, T., Liu, S., De Mello, S., Yu, Z., Kautz, J., & Yang, M. H. (2022). Learning contrastive representation for semantic correspondence. International Journal of Computer Vision (IJCV), 130(5), 1293–1309.

  • Xie, S., Gu, J., Guo, D., Qi, C. R., Guibas, L., & Litany, O. (2020). PointContrast: Unsupervised pre-training for 3d point cloud understanding. In: European Conference on Computer Vision (ECCV) (pp. 574–591), Springer.

  • Xie, J., Zhan, X., Liu, Z., Ong, Y. S., & Loy, C. C. (2022). Delving into inter-image invariance for unsupervised visual representations. International Journal of Computer Vision (IJCV), 130(12), 2994–3013.

  • Yang, Y., Feng, C., Shen, Y., & Tian, D. (2017). FoldingNet: Interpretable unsupervised learning on 3d point clouds. arXiv preprint arXiv:1712.07262.

  • Zhang, L., & Ma, K. (2020). Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In: International Conference on Learning Representations (ICLR).

  • Zhang, Z., Girdhar, R., Joulin, A., & Misra, I. (2021). Self-supervised pretraining of 3d features on any point-cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10252–10263.

  • Zhao, B., Cui, Q., Song, R., Qiu, Y., & Liang, J. (2022a). Decoupled knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11953–11962.

  • Zhao, Y., Fang, G., Guo, Y., Guibas, L., Tombari, F., & Birdal, T. (2022). 3DPointCaps++: Learning 3d representations with capsule networks. International Journal of Computer Vision (IJCV), 130(9), 2321–2336.

  • Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., & Zhang, Y. (2022). VPFNet: Improving 3d object detection with virtual point based lidar and stereo data fusion. IEEE Transactions on Multimedia (TMM). https://doi.org/10.1109/TMM.2022.3189778

Author information

Corresponding authors

Correspondence to Jiajun Deng or Yanyong Zhang.

Ethics declarations

Conflict of interest

There are no conflicts to declare.

Additional information

Communicated by Ming-Hsuan Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, S., Deng, J., Bai, L. et al. HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-023-01981-w