Skip to main content

RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13691))

Included in the following conference series:

Abstract

We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object. This task is a core component of classic geometric pipelines such as SfM and SLAM, and also serves as a vital pre-processing requirement for contemporary neural approaches (e.g. NeRF) to object reconstruction and view synthesis. In contrast to existing correspondence-driven methods that do not perform well given sparse views, we propose a top-down prediction based approach for estimating camera viewpoints. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera rotations, thus allowing us to explicitly represent multiple camera modes arising from object symmetries or views. Leveraging these relative predictions, we jointly estimate a consistent set of camera rotations from multiple images. We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories. Further, our probabilistic approach significantly outperforms directly regressing relative poses, suggesting that modeling multimodality is important for coherent joint reconstruction. We demonstrate that our system can be a stepping stone toward in-the-wild reconstruction from multi-view datasets. The project page with code and videos can be found at jasonyzhang.com/relpose.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 89.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Balntas, V., Li, S., Prisacariu, V.: RelocNet: continuous metric learning relocalisation using neural nets. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 782–799. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_46

    Chapter  Google Scholar 

  2. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32

    Chapter  Google Scholar 

  3. Brachmann, E., Michel, F., Krull, A., Yang, M.Y., Gumhold, S., et al.: Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In: CVPR (2016)

    Google Scholar 

  4. Bukschat, Y., Vetter, M.: EfficientPose: an efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv:2011.04307 (2020)

  5. Campos, C., Elvira, R., Gómez, J.J., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM3: an accurate open-source library for visual visual-inertial and multi-map SLAM. T-RO 37(6), 1874–1890 (2021)

    Google Scholar 

  6. Carlone, L., Tron, R., Daniilidis, K., Dellaert, F.: Initialization techniques for 3D SLAM: a survey on rotation estimation and its use in pose graph optimization. ICRA (2015)

    Google Scholar 

  7. Chen, B., Chin, T.J., Klimavicius, M.: Occlusion-robust object pose estimation with holistic representation. In: WACV (2022)

    Google Scholar 

  8. Chen, K., Snavely, N., Makadia, A.: Wide-baseline relative camera pose estimation with directional learning. In: CVPR (2021)

    Google Scholar 

  9. Choy, C.B., Gwak, J., Savarese, S., Chandraker, M.: Universal correspondence network. In: NeurIPS (2016)

    Google Scholar 

  10. Corona, E., Kundu, K., Fidler, S.: Pose estimation for objects with rotational symmetry. In: IROS (2018)

    Google Scholar 

  11. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: real-time single camera SLAM. TPAMI 29(6), 1052–1067 (2007)

    Article  Google Scholar 

  12. Deng, X., Mousavian, A., Xiang, Y., Xia, F., Bretl, T., Fox, D.: PoseRBPF: a rao-blackwellized particle filter for 6D object pose tracking. In: RSS (2019)

    Google Scholar 

  13. Deng, X., Xiang, Y., Mousavian, A., Eppner, C., Bretl, T., Fox, D.: Self-supervised 6D object pose estimation for robot manipulation. In: ICRA (2020)

    Google Scholar 

  14. DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest point detection and description. In: CVPR-W (2018)

    Google Scholar 

  15. Dusmanu, M., et al.: D2-Net: a trainable CNN for joint detection and description of local features. In: CVPR (2019)

    Google Scholar 

  16. Dusmanu, Mihai, Schönberger, Johannes L.., Pollefeys, Marc: Multi-view optimization of local feature geometry. In: Vedaldi, Andrea, Bischof, Horst, Brox, Thomas, Frahm, Jan-Michael. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 670–686. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_39

    Chapter  Google Scholar 

  17. Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. TPAMI (2018)

    Google Scholar 

  18. Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Towards internet-scale multi-view stereo. In: CVPR (2010)

    Google Scholar 

  19. Gilitschenski, I., Sahoo, R., Schwarting, W., Amini, A., Karaman, S., Rus, D.: Deep orientation uncertainty learning based on a Bingham loss. In: ICLR (2019)

    Google Scholar 

  20. Goel, S., Gkioxari, G., Malik, J.: Differentiable stereopsis: meshes from multiple views using differentiable rendering. In: CVPR (2022)

    Google Scholar 

  21. Harris, C., Stephens, M.: A Combined corner and edge detector. In: Alvey Vision Conference (1988)

    Google Scholar 

  22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

    Google Scholar 

  23. Iwase, S., Liu, X., Khirodkar, R., Yokota, R., Kitani, K.M.: RePOSE: fast 6D object pose refinement via deep texture rendering. In: ICCV (2021)

    Google Scholar 

  24. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In: ICCV (2017)

    Google Scholar 

  25. Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: ICRA (2016)

    Google Scholar 

  26. Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: ICCV (2015)

    Google Scholar 

  27. Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural radiance fields. In: ICCV (2021)

    Google Scholar 

  28. Lindenberger, P., Sarlin, P.E., Larsson, V., Pollefeys, M.: Pixel-perfect structure-from-motion with featuremetric refinement. In: ICCV (2021)

    Google Scholar 

  29. Liu, C., Yuen, J., Torralba, A.: SIFT flow: dense correspondence across scenes and its applications. TPAMI 33(5), 978–994 (2010)

    Article  Google Scholar 

  30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)

    Article  Google Scholar 

  31. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI (1981)

    Google Scholar 

  32. Mahjourian, R., Wicke, M., Angelova, A.: Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: CVPR (2018)

    Google Scholar 

  33. Manhardt, F., et al.: Explaining the ambiguity of object detection and 6D pose from visual data. In: ICCV (2019)

    Google Scholar 

  34. Melekhov, I., Ylioinas, J., Kannala, J., Rahtu, E.: Relative camera pose estimation using convolutional neural networks. In: ACIVS (2017)

    Google Scholar 

  35. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24

    Chapter  Google Scholar 

  36. Mohlin, D., Sullivan, J., Bianchi, G.: Probabilistic orientation estimation with matrix fisher distributions. In: NeurIPS (2020)

    Google Scholar 

  37. Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. T-RO 31(5), 1147–1163 (2015)

    Google Scholar 

  38. Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular stereo and RGB-D cameras. T-RO 33(5), 1255–1262 (2017)

    Google Scholar 

  39. Murphy, K.A., Esteves, C., Jampani, V., Ramalingam, S., Makadia, A.: Implicit-PDF: non-parametric representation of probability distributions on the rotation manifold. In: ICML (2021)

    Google Scholar 

  40. Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: dense tracking and mapping in real-time. In: ICCV (2011)

    Google Scholar 

  41. Novotny, D., Larlus, D., Vedaldi, A.: Learning 3D object categories by looking around them. In: ICCV (2017)

    Google Scholar 

  42. Novotny, D., Ravi, N., Graham, B., Neverova, N., Vedaldi, A.: C3DPO: canonical 3D pose networks for non-rigid structure from motion. In: ICCV (2019)

    Google Scholar 

  43. Oberweger, M., Rad, M., Lepetit, V.: Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 125–141. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_8

    Chapter  Google Scholar 

  44. Okorn, B., Gu, Q., Hebert, M., Held, D.: ZePHyR: zero-shot pose hypothesis scoring. In: ICRA (2021)

    Google Scholar 

  45. Okorn, B., Xu, M., Hebert, M., Held, D.: Learning orientation distributions for object pose estimation. In: IROS (2020)

    Google Scholar 

  46. Pautrat, R., Larsson, V., Oswald, M.R., Pollefeys, M.: Online invariance selection for local feature descriptors. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 707–724. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_42

    Chapter  Google Scholar 

  47. Prokudin, S., Gehler, P., Nowozin, S.: Deep directional statistics: pose estimation with uncertainty quantification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 542–559. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_33

    Chapter  Google Scholar 

  48. Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3D: large-scale learning and evaluation of real-life 3D category reconstruction. In: ICCV (2021)

    Google Scholar 

  49. Revaud, J., De Souza, C., Humenberger, M., Weinzaepfel, P.: R2D2: reliable and repeatable detector and descriptor. In: NeurIPS (2019)

    Google Scholar 

  50. Rodrigues, O.: Des lois géométriques qui régissent les déplacements d’un système solide dans l’espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de Mathématiques Pures et Appliquées 5 (1840)

    Google Scholar 

  51. Rosinol, A., Abate, M., Chang, Y., Carlone, L.: Kimera: an open-source library for real-time metric-semantic localization and mapping. In: ICRA (2020)

    Google Scholar 

  52. Sarlin, P.E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: robust hierarchical localization at large scale. In: CVPR (2019)

    Google Scholar 

  53. Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: learning feature matching with graph neural networks. In: CVPR (2020)

    Google Scholar 

  54. Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

    Google Scholar 

  55. Schönberger, J.L., Zheng, E., Frahm, J.-M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_31

    Chapter  Google Scholar 

  56. Schops, T., Sattler, T., Pollefeys, M.: BAD SLAM: bundle adjusted direct RGB-D SLAM. In: CVPR (2019)

    Google Scholar 

  57. Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. TPAMI 36(8), 1573–1585 (2014)

    Article  Google Scholar 

  58. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. In: SIGGRAPH. ACM (2006)

    Google Scholar 

  59. Song, C., Song, J., Huang, Q.: HybridPose: 6D object pose estimation under hybrid representations. In: CVPR (2020)

    Google Scholar 

  60. Sun, X., et al.: Pix3D: dataset and methods for single-image 3D shape modeling. In: CVPR (2018)

    Google Scholar 

  61. Sundermeyer, M., Marton, Z.-C., Durner, M., Brucker, M., Triebel, R.: Implicit 3D orientation learning for 6D object detection from RGB images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 712–729. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_43

    Chapter  Google Scholar 

  62. Tancik, M., et al.: Fourier features let networks learn high frequency functions in low dimensional domains. In: NeurIPS (2020)

    Google Scholar 

  63. Tang, C., Tan, P.: BA-Net: dense bundle adjustment network. In: ICLR (2019)

    Google Scholar 

  64. Teed, Z., Deng, J.: DROID-SLAM: deep visual SLAM for monocular, stereo, and RGB-D cameras. In: NeurIPS (2021)

    Google Scholar 

  65. Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6D object pose prediction. In: CVPR (2018)

    Google Scholar 

  66. Tola, E., Lepetit, V., Fua, P.: Daisy: an efficient dense descriptor applied to wide-baseline stereo. TPAMI 32(5), 815–830 (2009)

    Article  Google Scholar 

  67. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment–a modern synthesis. In: International Workshop on Vision Algorithms (1999)

    Google Scholar 

  68. Truong, P., Danelljan, M., Timofte, R.: GLU-Net: global-local universal network for dense flow and correspondences. In: CVPR (2020)

    Google Scholar 

  69. Ummenhofer, B., et al.: DeMoN: depth and motion network for learning monocular stereo. In: CVPR (2017)

    Google Scholar 

  70. Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: SfM-Net: learning of structure and motion from video. arXiv:1704.07804 (2017)

  71. Wang, C., et al.: DenseFusion: 6D object pose estimation by iterative dense fusion. In: CVPR (2019)

    Google Scholar 

  72. Wang, Q., Zhou, X., Hariharan, B., Snavely, N.: Learning feature descriptors using camera pose supervision. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 757–774. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_44

    Chapter  Google Scholar 

  73. Wang, S., Clark, R., Wen, H., Trigoni, N.: DeepVO: towards end-to-end visual odometry with deep recurrent convolutional neural networks. In: ICRA (2017)

    Google Scholar 

  74. Wang, W., Hu, Y., Scherer, S.: TartanVO: a generalizable learning-based VO. In: CoRL (2020)

    Google Scholar 

  75. Wei, X., Zhang, Y., Li, Z., Fu, Y., Xue, X.: DeepSFM: structure from motion via deep bundle adjustment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 230–247. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_14

    Chapter  Google Scholar 

  76. Wong, J.M., et al.: SegICP: integrated deep semantic segmentation and pose estimation. IROS (2017)

    Google Scholar 

  77. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. In: RSS (2018)

    Google Scholar 

  78. Xiao, Y., Qiu, X., Langlois, P., Aubry, M., Marlet, R.: Pose from shape: deep pose estimation for arbitrary 3D objects. In: BMVC (2019)

    Google Scholar 

  79. Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: learned invariant feature transform. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 467–483. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_28

    Chapter  Google Scholar 

  80. Yin, Z., Shi, J.: GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In: CVPR (2018)

    Google Scholar 

  81. Zhang, J.Y., Pepose, S., Joo, H., Ramanan, D., Malik, J., Kanazawa, A.: Perceiving 3D human-object spatial arrangements from a single image in the wild. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 34–51. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_3

    Chapter  Google Scholar 

  82. Zhang, J.Y., Yang, G., Tulsiani, S., Ramanan, D.: NeRS: neural reflectance surfaces for sparse-view 3D reconstruction in the wild. In: NeurIPS (2021)

    Google Scholar 

  83. Zhang, R.: Making convolutional networks shift-invariant again. In: ICML (2019)

    Google Scholar 

  84. Zhou, H., Ummenhofer, B., Brox, T.: DeepTAM: deep tracking and mapping. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 851–868. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_50

    Chapter  Google Scholar 

  85. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)

    Google Scholar 

  86. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: CVPR (2019)

    Google Scholar 

  87. Zubizarreta, J., Aguinaga, I., Montiel, J.M.M.: Direct sparse mapping. T-RO (2020)

    Google Scholar 

Download references

Acknowledgements

We would like to thank Gengshan Yang, Jonathon Luiten, Brian Okorn, and Elliot Wu for helpful feedback and discussion. This work was supported in part by the NSF GFRP (Grant No. DGE1745016), Singapore DSTA, and CMU Argo AI Center for Autonomous Vehicle Research.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jason Y. Zhang .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6551 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhang, J.Y., Ramanan, D., Tulsiani, S. (2022). RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13691. Springer, Cham. https://doi.org/10.1007/978-3-031-19821-2_34

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19821-2_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19820-5

  • Online ISBN: 978-3-031-19821-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics