A Visual Navigation Perspective for Category-Level Object Pose Estimation

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13666)

Abstract

This paper studies category-level object pose estimation based on a single monocular image. Recent advances in pose-aware generative models have paved the way for addressing this challenging task using analysis-by-synthesis. The idea is to sequentially update a set of latent variables of the generative model, e.g., pose, shape, and appearance, until the generated image best agrees with the observation. However, convergence and efficiency are two challenges of this inference procedure. In this paper, we take a deeper look at the inference of analysis-by-synthesis from the perspective of visual navigation, and investigate what constitutes a good navigation policy for this specific task. We evaluate three different strategies, namely gradient descent, reinforcement learning, and imitation learning, via thorough comparisons in terms of convergence, robustness, and efficiency. Moreover, we show that a simple hybrid approach leads to an effective and efficient solution. We further compare these strategies to state-of-the-art methods, and demonstrate superior performance on synthetic and real-world datasets leveraging off-the-shelf pose-aware generative models.
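The analysis-by-synthesis loop described in the abstract can be illustrated with a minimal sketch of the gradient-descent strategy. Everything in it is an assumption made for illustration and is not taken from the paper: the "generator" is a toy `render` function that draws a 1D Gaussian blob at a scalar pose, shape and appearance are held fixed, the image loss is a plain mean squared error, and the gradient is approximated by finite differences. The names `render`, `image_loss`, and `estimate_pose` are hypothetical.

```python
import math

def render(pose, n_pixels=64, width=0.1):
    """Toy stand-in for a pose-aware generator (assumption, not the paper's model):
    a 1D 'image' containing a Gaussian blob centred at `pose` in [0, 1]."""
    return [math.exp(-((i / (n_pixels - 1) - pose) ** 2) / (2 * width ** 2))
            for i in range(n_pixels)]

def image_loss(pose, observation):
    """Mean squared difference between the rendered image and the observation."""
    rendered = render(pose, n_pixels=len(observation))
    return sum((r - o) ** 2 for r, o in zip(rendered, observation)) / len(observation)

def estimate_pose(observation, init_pose, lr=0.05, steps=100, eps=1e-4):
    """Sequentially update the pose by gradient descent on the image loss.
    The gradient is a central finite difference, chosen here for simplicity."""
    pose = init_pose
    for _ in range(steps):
        grad = (image_loss(pose + eps, observation)
                - image_loss(pose - eps, observation)) / (2 * eps)
        pose -= lr * grad
    return pose

# Synthesize an observation at a known pose, then recover it from a nearby guess.
true_pose = 0.6
observation = render(true_pose)
estimated_pose = estimate_pose(observation, init_pose=0.45)
print(f"true pose: {true_pose:.4f}, estimated pose: {estimated_pose:.4f}")
```

In this toy setting the loss is flat far away from the true pose, so a distant initialization may make very slow progress. This is one simple way to see why convergence and efficiency are concerns for this kind of inference.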

Acknowledgement

This work is supported by NSFC under grant U21B2004, and partially supported by the Shenzhen Portion of the Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone under HZQB-KCZYB-20200089, the HK RGC under T42-409/18-R and 14202918, the Multi-Scale Medical Robotics Centre, InnoHK, and the VC Fund 4930745 of the CUHK T Stone Robotics Institute.

Corresponding authors

Correspondence to Yue Wang or Yiyi Liao.

Electronic supplementary material

Supplementary material 1 (pdf 6858 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Guo, J., Zhong, F., Xiong, R., Liu, Y., Wang, Y., Liao, Y. (2022). A Visual Navigation Perspective for Category-Level Object Pose Estimation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_8

  • DOI: https://doi.org/10.1007/978-3-031-20068-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20067-0

  • Online ISBN: 978-3-031-20068-7

  • eBook Packages: Computer Science, Computer Science (R0)
