Abstract
This paper studies category-level object pose estimation based on a single monocular image. Recent advances in pose-aware generative models have paved the way for addressing this challenging task using analysis-by-synthesis. The idea is to sequentially update a set of latent variables, e.g., pose, shape, and appearance, of the generative model until the generated image best agrees with the observation. However, convergence and efficiency are two challenges of this inference procedure. In this paper, we take a deeper look at the inference of analysis-by-synthesis from the perspective of visual navigation, and investigate what is a good navigation policy for this specific task. We evaluate three different strategies, including gradient descent, reinforcement learning and imitation learning, via thorough comparisons in terms of convergence, robustness and efficiency. Moreover, we show that a simple hybrid approach leads to an effective and efficient solution. We further compare these strategies to state-of-the-art methods, and demonstrate superior performance on synthetic and real-world datasets leveraging off-the-shelf pose-aware generative models.
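The inference loop described above can be illustrated with a minimal sketch of the gradient-descent strategy only. Everything in it is an assumption made for illustration and is not the paper's implementation: the `render` function is a hypothetical toy generator that draws a 1D "image" from two latent variables (a scalar pose and an amplitude standing in for appearance), and central finite differences stand in for the automatic differentiation that a real pose-aware generative model would provide.

```python
import math

def render(pose, amplitude, width=32):
    """Hypothetical toy generator: a 1D image with a Gaussian bump centred at `pose`."""
    return [amplitude * math.exp(-0.5 * ((x - pose) / 4.0) ** 2) for x in range(width)]

def loss(latents, observed):
    """Squared error between the generated image and the observation."""
    pose, amplitude = latents
    generated = render(pose, amplitude)
    return sum((g - o) ** 2 for g, o in zip(generated, observed))

def numerical_grad(latents, observed, eps=1e-4):
    """Central finite differences (an illustrative stand-in for autodiff)."""
    grads = []
    for i in range(len(latents)):
        plus = list(latents)
        minus = list(latents)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, observed) - loss(minus, observed)) / (2 * eps))
    return grads

def analysis_by_synthesis(observed, init, lr=0.05, steps=1000):
    """Repeatedly update the latents so the rendered image approaches the observation."""
    latents = list(init)
    for _ in range(steps):
        grads = numerical_grad(latents, observed)
        latents = [z - lr * g for z, g in zip(latents, grads)]
    return latents

# Usage: recover the latents of an image rendered with pose 20.0 and amplitude 1.0.
observed = render(20.0, 1.0)
pose, amplitude = analysis_by_synthesis(observed, init=(16.0, 0.5))
```

In this toy setting the loss is well behaved around the initial guess, so plain gradient descent recovers the latents. With a real generator and a distant initialization, the loss surface is far less benign, which is the convergence and efficiency concern the abstract raises.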
Acknowledgement
This work is supported by NSFC under grant U21B2004, and partially supported by the Shenzhen Portion of the Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone under HZQB-KCZYB-20200089, the HK RGC under T42-409/18-R and 14202918, the Multi-Scale Medical Robotics Centre, InnoHK, and the VC Fund 4930745 of the CUHK T Stone Robotics Institute.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, J., Zhong, F., Xiong, R., Liu, Y., Wang, Y., Liao, Y. (2022). A Visual Navigation Perspective for Category-Level Object Pose Estimation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_8
DOI: https://doi.org/10.1007/978-3-031-20068-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7
eBook Packages: Computer Science, Computer Science (R0)