Abstract
Relation contexts have proven useful for many challenging vision tasks. In 3D object detection, previous methods have exploited context encoding, graph embedding, or explicit relation reasoning to extract relation contexts. However, noisy or low-quality proposals inevitably produce redundant relation contexts. Invalid relation contexts usually reflect misunderstanding and ambiguity about the underlying scene, and can therefore reduce rather than improve performance in complex scenes. Inspired by recent attention mechanisms such as the Transformer, we propose a novel 3D attention-based relation module (ARM3D). It comprises object-aware relation reasoning, which extracts pair-wise relation contexts among qualified proposals, and an attention module, which assigns attention weights to the different relation contexts. In this way, ARM3D can take full advantage of useful relation contexts and filter out those that are less relevant or even confusing, which mitigates ambiguity in detection. We evaluate the effectiveness of ARM3D by plugging it into several state-of-the-art 3D object detectors, obtaining more accurate and robust detection results. Extensive experiments demonstrate the capability and generalization of ARM3D for 3D object detection. Our source code is available at https://github.com/lanlan96/ARM3D.
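To make the idea concrete, the following is a minimal sketch of attention-weighted pair-wise relation contexts, not the authors' implementation (see the repository above for that). It assumes each proposal has a feature vector and an objectness score. The function names, the objectness threshold of 0.5, the element-wise difference used as a relation feature, and the scaled dot-product scoring are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_relation(f_i, f_j):
    """Hypothetical relation feature for a proposal pair: element-wise difference."""
    return [a - b for a, b in zip(f_i, f_j)]


def attend_relations(features, objectness, min_objectness=0.5):
    """Return, for each qualified proposal, an attention-weighted sum of its
    pair-wise relation features with the other qualified proposals."""
    # Assumed filtering step: keep proposals whose objectness passes the threshold.
    keep = [i for i, s in enumerate(objectness) if s >= min_objectness]
    contexts = {}
    for i in keep:
        dim = len(features[i])
        others = [j for j in keep if j != i]
        if not others:
            # A lone proposal has no relation context.
            contexts[i] = [0.0] * dim
            continue
        relations = [pairwise_relation(features[i], features[j]) for j in others]
        # Scaled dot-product score between the proposal and each relation feature.
        scale = math.sqrt(dim)
        scores = [sum(q * r for q, r in zip(features[i], rel)) / scale
                  for rel in relations]
        weights = softmax(scores)
        # Weighted sum: relations with low weights contribute little.
        contexts[i] = [sum(w * rel[d] for w, rel in zip(weights, relations))
                       for d in range(dim)]
    return contexts


# Toy example: the fourth proposal has low objectness and is filtered out.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
objectness = [0.9, 0.8, 0.7, 0.1]
contexts = attend_relations(features, objectness)
```

In this toy example, the low-quality fourth proposal never contributes a relation context, and the softmax weights decide how strongly each remaining pair influences a proposal. In a trained detector the scoring function would be learned rather than fixed.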
Acknowledgements
We thank Jiazhao Zhang for server management. This work was supported in part by the National Natural Science Foundation of China (62132021, 62102435, 62002375, 62002376), the National Key R&D Program of China (2018AAA0102200), and NUDT Research Grants (ZK19-30).
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Yuqing Lan received his B.S. degree in network engineering from National University of Defense Technology, China, in 2019. He is now a postgraduate at the School of Computer, National University of Defense Technology, China. His research interests cover 3D object detection and 3D reconstruction.
Yao Duan received her master's degree in computer science from National University of Defense Technology. She is now a Ph.D. student at the School of Computer, National University of Defense Technology, China. Her research interests include 3D object detection.
Chenyi Liu received her B.S. degree in software engineering from Tianjin Normal University, China, in 2020. She is now a master's student at the National University of Defense Technology, China. Her research interests cover 3D point cloud registration.
Chenyang Zhu is an assistant professor at the School of Computer, National University of Defense Technology. Current research interests include data-driven shape analysis and modeling, 3D vision, and robot perception and navigation.
Yueshan Xiong is a professor at the School of Computer, National University of Defense Technology. Current research interests include virtual surgery systems, image and graphics processing, and intelligent computing.
Hui Huang is a Distinguished TFA Professor at Shenzhen University, where she directs the Visual Computing Research Center. Her research interests span computer graphics, 3D vision, and visualization. She is currently a senior member of IEEE/ACM/CSIG and a distinguished member of CCF.
Kai Xu is a professor at the School of Computer, National University of Defense Technology, where he received his Ph.D. degree in 2011. He serves on the editorial boards of ACM Transactions on Graphics, Computer Graphics Forum, Computers & Graphics, and The Visual Computer. His research work can be found on his personal website: https://www.kevinkaixu.net.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Lan, Y., Duan, Y., Liu, C. et al. ARM3D: Attention-based relation module for indoor 3D object detection. Comp. Visual Media 8, 395–414 (2022). https://doi.org/10.1007/s41095-021-0252-6
DOI: https://doi.org/10.1007/s41095-021-0252-6