
Object-Based Visual Camera Pose Estimation From Ellipsoidal Model and 3D-Aware Ellipse Prediction

Published in: International Journal of Computer Vision

Abstract

In this paper, we propose a method for initial camera pose estimation from a single image that is robust to viewing conditions and does not require a detailed model of the scene. This method meets the growing need for easy deployment of robotics and augmented reality applications in any environment, especially those for which neither an accurate 3D model nor large amounts of ground-truth data are available. It exploits the ability of deep learning techniques to reliably detect objects regardless of viewing conditions. Previous works have also shown that abstracting the geometry of a scene of objects by an ellipsoid cloud allows the camera pose to be computed accurately enough for various application needs. Though promising, these approaches use the ellipses fitted to the detection bounding boxes as an approximation of the imaged objects. In this paper, we go one step further and propose a learning-based method that detects improved elliptic approximations of objects, coherent with the 3D ellipsoids in terms of perspective projection. Experiments show that the accuracy of the computed pose increases significantly thanks to our method. This is achieved with very little effort in terms of training data acquisition: a few hundred calibrated images, of which only three need manual object annotation. Code and models are released at https://gitlab.inria.fr/tangram/3d-aware-ellipses-for-visual-localization.
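To make the ellipsoid–ellipse relationship concrete, the sketch below illustrates the standard dual-quadric projection from multiple-view geometry: a 3D ellipsoid represented by its dual quadric Q* projects through a camera matrix P = K[R|t] into the dual conic C* ∝ P Q* Pᵀ of an image ellipse. This is a minimal illustration of the geometric constraint that "3D-aware" ellipse predictions are meant to satisfy, not the released implementation; the function names and the NumPy representation are choices made for this example only.

```python
import numpy as np

def ellipsoid_dual_quadric(center, axes, R):
    """Dual quadric Q* (4x4) of an ellipsoid with center (3,),
    semi-axis lengths (3,) and orientation R (3x3 rotation)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = center
    # Canonical dual quadric diag(a^2, b^2, c^2, -1), mapped to the world frame.
    return T @ np.diag([axes[0]**2, axes[1]**2, axes[2]**2, -1.0]) @ T.T

def project_to_dual_conic(Q_star, K, R_cw, t_cw):
    """Project the dual quadric with P = K [R|t]; returns the 3x3 dual conic
    of the image ellipse (defined up to scale)."""
    P = K @ np.hstack([R_cw, t_cw.reshape(3, 1)])
    C_star = P @ Q_star @ P.T
    return C_star / C_star[2, 2]

# Example: a small ellipsoid 5 m in front of an identity-pose camera.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
Q = ellipsoid_dual_quadric(center=np.array([0.0, 0.0, 5.0]),
                           axes=np.array([0.5, 0.3, 0.2]),
                           R=np.eye(3))
C = project_to_dual_conic(Q, K, np.eye(3), np.zeros(3))
ellipse_center = C[:2, 2] / C[2, 2]   # image-space center of the projected ellipse
print(ellipse_center)                  # equals the principal point here, by symmetry
```

Pose estimation then works in the opposite direction: given predicted image ellipses and the ellipsoid cloud, one searches for the camera pose whose projections best match the predictions. The forward projection above is the consistency constraint that the abstract refers to as coherence "in terms of perspective projection".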




Author information

Correspondence to Matthieu Zins.

Additional information

Communicated by A. Hilton.



About this article


Cite this article

Zins, M., Simon, G. & Berger, MO. Object-Based Visual Camera Pose Estimation From Ellipsoidal Model and 3D-Aware Ellipse Prediction. Int J Comput Vis 130, 1107–1126 (2022). https://doi.org/10.1007/s11263-022-01585-w

