Abstract
Monocular 3D pose estimation is a fundamental problem in AI and has received much attention. This paper addresses the challenge of estimating the full perspective model parameters, including object pose and camera intrinsics, from a single 2D image of a fine-grained object category. To tackle this highly ill-posed problem, we propose the Geometry Constrained Voting Network (GCVNet), a unified end-to-end network consisting of four synergistic task-specific subnetworks: 1) a fine-grained classification subnetwork, which offers fine-grained 3D shape priors; 2) a voting subnetwork, which generates 2D measurements; 3) a segmentation subnetwork, which provides a foreground mask for voting; and 4) a PnP subnetwork, which estimates the perspective parameters via explicit geometric reasoning. The PnP subnetwork also constrains the classification subnetwork to provide proper 3D priors and the voting subnetwork to generate a group of geometrically consistent 2D measurements, in contrast to prior work, which votes for each 2D measurement independently. Experiments on challenging datasets demonstrate the superior performance of GCVNet.
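To make the geometric reasoning step concrete, the following is a minimal sketch of how a full perspective projection (pose and intrinsics combined in one 3x4 matrix) can be recovered from 2D measurements paired with 3D shape points. It uses the classical Direct Linear Transform, which is a swapped-in textbook technique and not the PnP subnetwork of GCVNet. The function names `estimate_projection_dlt` and `project`, the synthetic camera, and the random 3D points are all assumptions made for illustration.

```python
import numpy as np


def estimate_projection_dlt(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) with the classical DLT.

    Illustrative stand-in only; not the learned PnP subnetwork from the paper.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    n = points_3d.shape[0]
    if n < 6:
        raise ValueError("DLT needs at least 6 non-coplanar correspondences")
    # Homogeneous 3D points, shape (n, 4).
    X_h = np.hstack([points_3d, np.ones((n, 1))])
    rows = []
    for X, (u, v) in zip(X_h, points_2d):
        # Each correspondence gives two linear equations in the 12 entries of P.
        rows.append(np.concatenate([X, np.zeros(4), -u * X]))
        rows.append(np.concatenate([np.zeros(4), X, -v * X]))
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)


def project(P, points_3d):
    """Project 3D points to pixel coordinates with a 3x4 projection matrix."""
    points_3d = np.asarray(points_3d, dtype=float)
    X_h = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    x = X_h @ P.T
    return x[:, :2] / x[:, 2:3]


# Synthetic check with a hypothetical camera (all values are assumptions).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.3
c, s = np.cos(angle), np.sin(angle)
R = np.array([[c, 0.0, s],
              [0.0, 1.0, 0.0],
              [-s, 0.0, c]])
t = np.array([[0.1], [-0.2], [5.0]])
P_true = K @ np.hstack([R, t])

rng = np.random.default_rng(0)
shape_points = rng.uniform(-1.0, 1.0, size=(12, 3))  # stand-in for a 3D shape prior
measurements = project(P_true, shape_points)         # stand-in for voted 2D measurements

P_est = estimate_projection_dlt(shape_points, measurements)
err = float(np.abs(project(P_est, shape_points) - measurements).max())
print("max reprojection error (pixels):", err)
```

The sketch solves for all twelve entries of the projection matrix at once, so the 2D measurements must be consistent with a single camera. That is the kind of joint constraint the abstract contrasts with voting for each measurement independently. Separating intrinsics from pose would need a further factorization of the left 3x3 block, which is omitted here.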
This is a student paper. Special thanks to Megvii Inc. for providing training resources for the paper.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Han, Y., Di, H., Zheng, H., Qi, J., Gong, J. (2020). GCVNet: Geometry Constrained Voting Network to Estimate 3D Pose for Fine-Grained Object Categories. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol 12305. Springer, Cham. https://doi.org/10.1007/978-3-030-60633-6_15
DOI: https://doi.org/10.1007/978-3-030-60633-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60632-9
Online ISBN: 978-3-030-60633-6
eBook Packages: Computer Science, Computer Science (R0)