GCVNet: Geometry Constrained Voting Network to Estimate 3D Pose for Fine-Grained Object Categories

Han, Yaohang; Di, Huijun; Zheng, Hanfeng; Qi, Jianyong; Gong, Jianwei

doi:10.1007/978-3-030-60633-6_15

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12305)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

As a fundamental AI problem, monocular 3D pose estimation has received much attention. This paper addresses the challenge of estimating the full perspective model parameters, including object pose and camera intrinsics, from a single 2D image of fine-grained object categories. To tackle this highly ill-posed problem, we propose a Geometry Constrained Voting Network (GCVNet). It is a unified end-to-end network consisting of four synergistic, task-specific subnetworks: 1) a fine-grained classification subnetwork, which offers fine-grained 3D shape priors; 2) a voting subnetwork, which generates 2D measurements; 3) a segmentation subnetwork, which provides a foreground mask for voting; and 4) a PnP subnetwork, which estimates the perspective parameters via explicit geometric reasoning. The PnP subnetwork also constrains the classification subnetwork to provide proper 3D priors and the voting subnetwork to generate a geometrically consistent group of 2D measurements, rather than voting independently for each 2D measurement as in prior work. Experiments on challenging datasets demonstrate the superior performance of GCVNet.
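The perspective model referred to in the abstract can be illustrated with a minimal pinhole-projection sketch. This is not the authors' PnP subnetwork: the function names, the single focal length `f`, the fixed principal point `c`, and the toy data are all assumptions chosen for illustration. It only shows the quantity that a PnP-style step typically minimizes, namely the reprojection error between projected 3D shape points and 2D measurements.

```python
import numpy as np


def project(points_3d, R, t, f, c):
    """Project (N, 3) object points to (N, 2) pixels with a pinhole model.

    R (3x3 rotation) and t (3,) are the object pose; f is a single focal
    length and c (2,) the principal point (assumed intrinsics layout).
    """
    cam = points_3d @ R.T + t                 # points in the camera frame
    return f * cam[:, :2] / cam[:, 2:3] + c   # perspective division


def reprojection_error(points_3d, points_2d, R, t, f, c):
    """Mean Euclidean distance between projected points and 2D measurements."""
    residual = project(points_3d, R, t, f, c) - points_2d
    return float(np.mean(np.linalg.norm(residual, axis=1)))


# Toy example (hypothetical values): three 3D shape points and a known pose.
model = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 5.0],
                  [-1.0, -1.0, 0.0]])
R_true = np.eye(3)
t_true = np.array([0.0, 0.0, 5.0])
f_true = 100.0
c = np.array([50.0, 50.0])

# Stand-in for the 2D measurements that a voting stage would produce.
observed = project(model, R_true, t_true, f_true, c)

err_true = reprojection_error(model, observed, R_true, t_true, f_true, c)
err_wrong_f = reprojection_error(model, observed, R_true, t_true, 120.0, c)
```

With the true pose and focal length the error is zero, while a wrong focal length leaves a nonzero residual. This is why pose and intrinsics can, in principle, be recovered jointly from 2D measurements when a 3D shape prior is available.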

This is a student paper. Special thanks to Megvii Inc. for providing training resources for the paper.

Author information

Authors and Affiliations

Beijing Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Yaohang Han, Huijun Di & Hanfeng Zheng
Intelligent Vehicle Research Center, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China
Jianyong Qi & Jianwei Gong

Corresponding author

Correspondence to Huijun Di.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Yuxin Peng
Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Dalian University of Technology, Dalian, China
Huchuan Lu
Chinese Academy of Sciences, Beijing, China
Zhenan Sun
Chinese Academy of Sciences, Beijing, China
Chenglin Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xilin Chen
Peking University, Beijing, China
Hongbin Zha
Nanjing University of Science and Technology, Nanjing, China
Jian Yang

About this paper

Cite this paper

Han, Y., Di, H., Zheng, H., Qi, J., Gong, J. (2020). GCVNet: Geometry Constrained Voting Network to Estimate 3D Pose for Fine-Grained Object Categories. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol 12305. Springer, Cham. https://doi.org/10.1007/978-3-030-60633-6_15

DOI: https://doi.org/10.1007/978-3-030-60633-6_15
Published: 11 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60632-9
Online ISBN: 978-3-030-60633-6
eBook Packages: Computer Science, Computer Science (R0)