
Vote-Based 3D Object Detection with Context Modeling and SOB-3DNMS

  • Original Paper
International Journal of Computer Vision

Abstract

Most existing 3D object detection methods recognize objects individually, without considering the contextual information between them. However, objects in indoor scenes are usually related to each other and to the scene, forming contextual information. Based on this observation, we propose a novel 3D object detection network, built on the state-of-the-art VoteNet, that takes contextual information at multiple levels into consideration for the detection and recognition of 3D objects. To encode relationships between elements at different levels, we introduce three contextual sub-modules, capturing contextual information at the patch, object, and scene levels respectively, and build them into the voting and classification stages of VoteNet. In addition, at the post-processing stage, we also consider the spatial diversity of detected objects and propose an improved 3D NMS (non-maximum suppression) method, namely Survival-Of-the-Best 3D NMS (SOB-3DNMS), to reduce false detections. Experiments demonstrate that our method is an effective way to improve detection accuracy, achieving new state-of-the-art performance on the challenging SUN RGB-D and ScanNet 3D object detection datasets when taking only point cloud data as input.
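For context, the post-processing stage described above builds on conventional greedy 3D NMS. The following is a minimal sketch of that baseline on axis-aligned boxes, not of the proposed SOB-3DNMS itself, whose details are given in the paper body; the box encoding (min/max corners), the IoU threshold, and the function names are illustrative assumptions.

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU. Boxes are (x1, y1, z1, x2, y2, z2)."""
    # Overlap extent along each axis, clipped at zero when boxes are disjoint
    inter_dims = np.minimum(box_a[3:], box_b[3:]) - np.maximum(box_a[:3], box_b[:3])
    inter = np.clip(inter_dims, 0.0, None).prod()
    vol_a = (box_a[3:] - box_a[:3]).prod()
    vol_b = (box_b[3:] - box_b[:3]).prod()
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def nms_3d(boxes, scores, iou_thresh=0.25):
    """Greedy 3D NMS: keep the highest-scoring box, suppress overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        ious = np.array([iou_3d(boxes[best], boxes[i]) for i in rest])
        order = rest[ious <= iou_thresh]  # survivors overlap the winner only weakly
    return keep
```

Greedy NMS of this form keeps only the locally highest-scoring detection in each overlapping cluster; the SOB-3DNMS variant proposed here additionally accounts for the spatial diversity of the detections when deciding which boxes survive.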



References

  • Atzmon, M., Maron, H., & Lipman, Y. (2018). Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091.

  • Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision (pp. 5561–5569).

  • Cao, Y., Xu, J., Lin, S., Wei, F., & Hu, H. (2019). Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE international conference on computer vision workshops.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872.

  • Chen, Z., Huang, S., & Tao, D. (2018). Context refinement for object detection. In Proceedings of the European conference on computer vision (ECCV) (pp. 71–86).

  • Chen, J., Lei, B., Song, Q., Ying, H., Chen, D. Z., & Wu, J. (2020). A hierarchical graph network for 3d object detection on point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 392–401).

  • Choy, C., Gwak, J., & Savarese, S. (2019). 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3075–3084).

  • Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5828–5839).

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05) (Vol. 1, pp. 886–893). IEEE.

  • Deng, H., Birdal, T., & Ilic, S. (2018). Ppfnet: Global context aware local features for robust 3D point matching. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 195–205).

  • Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., & Nießner, M. (2020). 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9031–9040).

  • Engelmann, F., Kontogianni, T., Hermans, A., & Leibe, B. (2017). Exploring spatial context for 3D semantic segmentation of point clouds. In Proceedings of the IEEE international conference on computer vision (pp. 716–724).

  • Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., & Lu, H. (2019). Dual attention network for scene segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3146–3154).

  • He, C., Zeng, H., Huang, J., Hua, X. S., & Zhang, L. (2020). Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11,873–11,882).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  • He, Y., Zhang, X., Savvides, M., & Kitani, K. (2018). Softer-nms: Rethinking bounding box regression for accurate object detection. arXiv preprint arXiv:1809.08545.

  • Hou, J., Dai, A., & Nießner, M. (2019). 3d-sis: 3D semantic instance segmentation of RGB-D scans. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4421–4430).

  • Hu, H., Gu, J., Zhang, Z., Dai, J., & Wei, Y. (2018a). Relation networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3588–3597).

  • Hu, J., Shen, L., & Sun, G. (2018b). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).

  • Hu, S. M., Cai, J. X., & Lai, Y. K. (2018c). Semantic labeling and instance segmentation of 3D point clouds using patch context analysis and multiscale processing. IEEE Transactions on Visualization and Computer Graphics, 26, 2485–2498.

  • Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C. W., & Jia, J. (2020). Pointgroup: Dual-set point grouping for 3d instance segmentation. arXiv preprint arXiv:2004.01658.

  • Lahoud, J., & Ghanem, B. (2017). 2D-driven 3D object detection in RGB-D images. In Proceedings of the IEEE international conference on computer vision (pp. 4622–4630).

  • Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 12,697–12,705).

  • Li, J., Luo, S., Zhu, Z., Dai, H., Krylov, A. S., Ding, Y., & Shao, L. (2020a). 3d iou-net: Iou guided 3d object detector for point clouds. arXiv preprint arXiv:2004.04962.

  • Li, Y., Bu, R., Sun, M., Wu, W., Di, X., & Chen, B. (2018). PointCNN: Convolution on x-transformed points. In Advances in neural information processing systems (pp. 820–830).

  • Li, Y., Ma, L., Tan, W., Sun, C., Cao, D., & Li, J. (2020b). Grnet: Geometric relation network for 3d object detection from point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 165, 43–53.

  • Liu, S., Huang, D., & Wang, Y. (2019a). Adaptive nms: Refining pedestrian detection in a crowd. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6459–6468).

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21–37). Springer.

  • Liu, Y., Fan, B., Xiang, S., & Pan, C. (2019b). Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8895–8904).

  • Liu, Y., Wang, R., Shan, S., & Chen, X. (2018). Structure inference net: Object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6985–6994).

  • McCormac, J., Clark, R., Bloesch, M., Davison, A., & Leutenegger, S. (2018). Fusion++: Volumetric object-level SLAM. In 2018 international conference on 3D vision (3DV) (pp. 32–41). IEEE.

  • Mottaghi, R., Chen, X., Liu, X., Cho, N. G., Lee, S. W., Fidler, S., Urtasun, R., & Yuille, A. (2014). The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 891–898).

  • Najibi, M., Lai, G., Kundu, A., Lu, Z., Rathod, V., Funkhouser, T., Pantofaru, C., Ross, D., Davis, L. S., & Fathi, A. (2020). Dops: Learning to detect 3d objects and predict their 3d shapes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11,913–11,922).

  • Paigwar, A., Erkent, O., Wolf, C., & Laugier, C. (2019). Attentional PointNet for 3D-object detection in point clouds. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops.

  • Qi, C. R., Chen, X., Litany, O., & Guibas, L. J. (2020). Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4404–4413).

  • Qi, C. R., Litany, O., He, K., & Guibas, L. J. (2019). Deep Hough voting for 3D object detection in point clouds. arXiv preprint arXiv:1904.09664.

  • Qi, C. R., Liu, W., Wu, C., Su, H., & Guibas, L. J. (2018). Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 918–927).

  • Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017a). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660).

  • Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017b). PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems (pp. 5099–5108).

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).

  • Ren, Z., & Sudderth, E. B. (2016). Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1525–1533).

  • Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 658–666).

  • Salscheider, N. O. (2020). Featurenms: Non-maximum suppression by learning feature embeddings. arXiv preprint arXiv:2002.07662.

  • Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., & Li, H. (2020). Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Shi, S., Wang, X., & Li, H. (2019a). PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–779).

  • Shi, W., & Rajkumar, R. (2020). Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1711–1719).

  • Shi, Y., Chang, A. X., Wu, Z., Savva, M., & Xu, K. (2019b). Hierarchy denoising recursive autoencoders for 3D scene layout prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1771–1780).

  • Song, S., Lichtenberg, S. P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 567–576).

  • Song, S., & Xiao, J. (2016). Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 808–816).

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

  • Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-Fei, L., & Savarese, S. (2019). DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3343–3352).

  • Wang, G., Tian, B., Ai, Y., Xu, T., Chen, L., & Cao, D. (2020). Centernet3d: An anchor free object detector for autonomous driving. arXiv preprint arXiv:2007.07214.

  • Wang, P. S., Liu, Y., Guo, Y. X., Sun, C. Y., & Tong, X. (2017). O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG), 36(4), 72.

  • Wang, T., He, X., & Barnes, N. (2013). Learning structured Hough voting for joint object detection and occlusion reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1790–1797).

  • Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7794–7803).

  • Xie, Q., Lai, Y. K., Wu, J., Wang, Z., Zhang, Y., Xu, K., & Wang, J. (2020). Mlcvnet: Multi-level context votenet for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10,447–10,456).

  • Xie, S., Liu, S., Chen, Z., & Tu, Z. (2018). Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp 4606–4615).

  • Xu, D., Anguelov, D., & Jain, A. (2018). Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 244–253).

  • Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph r-cnn for scene graph generation. In Proceedings of the European conference on computer vision (ECCV) (pp. 670–685).

  • Yang, Z., Sun, Y., Liu, S., & Jia, J. (2020). 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11,040–11,048).

  • Ye, X., Li, J., Huang, H., Du, L., & Zhang, X. (2018). 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 403–417).

  • Yi, L., Zhao, W., Wang, H., Sung, M., & Guibas, L. J. (2019). GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3947–3956).

  • Yin, T., Zhou, X., & Krähenbühl, P. (2020). Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275.

  • Yu, R., Chen, X., Morariu, V. I., & Davis, L. S. (2016). The role of context selection in object detection. arXiv preprint arXiv:1609.02948.

  • Yue, K., Sun, M., Yuan, Y., Zhou, F., Ding, E., & Xu, F. (2018). Compact generalized non-local network. In Advances in neural information processing systems (pp. 6510–6519).

  • Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., et al. (2018). Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830.

  • Zhang, W., & Xiao, C. (2019). PCAN: 3D attention map learning using contextual information for point cloud based retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 12,436–12,445).

  • Zhang, Y., Bai, M., Kohli, P., Izadi, S., & Xiao, J. (2017). Deepcontext: Context-encoding neural pathways for 3D holistic scene understanding. In Proceedings of the IEEE International conference on computer vision (pp. 1192–1201).

  • Zhang, Y., Song, S., Tan, P., & Xiao, J. (2014). Panocontext: A whole-room 3D context model for panoramic scene understanding. In European conference on computer vision (pp. 668–686). Springer.

  • Zhang, H., Zhang, H., Wang, C., & Xie, J. (2019). Co-occurrent features in semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 548–557).

  • Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). Distance-iou loss: Faster and better learning for bounding box regression. In AAAI (pp. 12,993–13,000).

  • Zhou, Y., & Tuzel, O. (2018). Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4490–4499).

Acknowledgements

This work is funded by the National Key Research and Development Program of China (2020YFB2010702, 2018A-AA0102200), National Natural Science Foundation of China under Grant 61772267, Aeronautical Science Foundation of China (No. 2019ZE052008), and the Natural Science Foundation of Jiangsu Province under Grant BK20190016.

Author information

Corresponding author

Correspondence to Jun Wang.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xie, Q., Lai, Y. K., Wu, J. et al. Vote-Based 3D Object Detection with Context Modeling and SOB-3DNMS. Int J Comput Vis 129, 1857–1874 (2021). https://doi.org/10.1007/s11263-021-01456-w
