GCVNet: Geometry Constrained Voting Network to Estimate 3D Pose for Fine-Grained Object Categories

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12305)

Abstract

As a fundamental AI problem, monocular 3D pose estimation has received much attention. This paper addresses the challenge of estimating the full perspective model parameters, including object pose and camera intrinsics, from a single 2D image of fine-grained object categories. To tackle this highly ill-posed problem, we propose a Geometry Constrained Voting Network (GCVNet), a unified end-to-end network consisting of four synergistic task-specific subnetworks: 1) a fine-grained classification subnetwork, which offers fine-grained 3D shape priors; 2) a voting subnetwork, which generates 2D measurements; 3) a segmentation subnetwork, which provides a foreground mask for voting; and 4) a PnP subnetwork, which estimates the perspective parameters via explicit geometric reasoning. The PnP subnetwork also constrains the classification subnetwork to provide proper 3D priors and the voting subnetwork to generate a group of geometrically consistent 2D measurements, rather than voting independently for each 2D measurement as in the literature. Experiments on challenging datasets demonstrate the superior performance of GCVNet.
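The "full perspective model" in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a standard pinhole camera with a single focal length `f` and principal point `(cx, cy)`, and a rigid pose `(R, t)`. The function names `rotation_y`, `project`, and `mean_reprojection_error` are hypothetical. The sketch projects the 3D points of a shape prior into the image and measures how well a set of 2D measurements agrees with that projection, which is the kind of geometric consistency a PnP-style stage reasons about.

```python
import math


def rotation_y(angle_rad):
    """Rotation about the camera y-axis (a hypothetical, simplified pose parameterisation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project(points_3d, R, t, f, cx, cy):
    """Project 3D model points to pixel coordinates with a pinhole perspective model."""
    pixels = []
    for X in points_3d:
        # Rigid transform into the camera frame: Xc = R X + t
        xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        if xc[2] <= 0:
            raise ValueError("point lies behind the camera")
        # Perspective division, then intrinsics (focal length and principal point)
        pixels.append((f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy))
    return pixels


def mean_reprojection_error(points_3d, measured_2d, R, t, f, cx, cy):
    """Mean pixel distance between projected 3D points and their 2D measurements."""
    predicted = project(points_3d, R, t, f, cx, cy)
    total = sum(math.hypot(p[0] - m[0], p[1] - m[1])
                for p, m in zip(predicted, measured_2d))
    return total / len(predicted)


if __name__ == "__main__":
    # Assumed toy shape prior and camera, purely for illustration.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    R, t = rotation_y(0.0), (0.0, 0.0, 5.0)
    measurements = project(model, R, t, 1000.0, 320.0, 240.0)
    # A wrong focal length makes the same measurements geometrically inconsistent.
    print(mean_reprojection_error(model, measurements, R, t, 900.0, 320.0, 240.0))
```

In this toy setting the reprojection error is zero only when the pose and the intrinsics both match the measurements, which is why the focal length can be estimated jointly with the pose instead of being assumed known.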

This is a student paper. Special thanks to Megvii Inc. for providing training resources for the paper.

Author information

Corresponding author

Correspondence to Huijun Di.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Han, Y., Di, H., Zheng, H., Qi, J., Gong, J. (2020). GCVNet: Geometry Constrained Voting Network to Estimate 3D Pose for Fine-Grained Object Categories. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol 12305. Springer, Cham. https://doi.org/10.1007/978-3-030-60633-6_15

  • DOI: https://doi.org/10.1007/978-3-030-60633-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60632-9

  • Online ISBN: 978-3-030-60633-6

  • eBook Packages: Computer Science, Computer Science (R0)
