Relative Camera Pose Estimation Using Convolutional Neural Networks

  • Iaroslav Melekhov
  • Juha Ylioinas
  • Juho Kannala
  • Esa Rahtu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10617)


This paper presents a convolutional neural network based approach for estimating the relative pose between two cameras. The proposed network takes RGB images from both cameras as input and directly produces the relative rotation and translation as output. The system is trained in an end-to-end manner utilising transfer learning from a large scale classification dataset. The introduced approach is compared with widely used local feature based methods (SURF, ORB) and the results indicate a clear improvement over the baseline. In addition, a variant of the proposed architecture containing a spatial pyramid pooling (SPP) layer is evaluated and shown to further improve the performance.


Relative camera pose estimation Deep neural networks Spatial pyramid pooling 


  1. 1.
    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)CrossRefGoogle Scholar
  2. 2.
    Tola, E., Lepetit, V., Fua, P.: DAISY: an efficient dense descriptor applied to wide baseline stereo. IEEE Trans. PAMI 32, 815–830 (2010)CrossRefGoogle Scholar
  3. 3.
    Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). CrossRefGoogle Scholar
  4. 4.
    Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). Google Scholar
  5. 5.
    Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view stereopsis evaluation. In: CVPR, pp. 406–413 (2014)Google Scholar
  6. 6.
    He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Cham (2014). Google Scholar
  7. 7.
    Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to sift or surf. In: ICCV (2011)Google Scholar
  8. 8.
    Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: binary robust independent elementary features. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 778–792. Springer, Heidelberg (2010). CrossRefGoogle Scholar
  9. 9.
    DeTone, D., Malisiewicz, T., Rabinovich, A.: Deep image homography estimation. CoRR abs/1606.03798 (2016)Google Scholar
  10. 10.
    Konda, K., Memisevic, R.: Learning visual odometry with a convolutional network. In: VISIGRAPP (2015)Google Scholar
  11. 11.
    Mohanty, V., Agrawal, S., Datta, S., Ghosh, A., Sharma, V.D., Chakravarty, D.: DeepVO: a deep learning approach for monocular visual odometry. CoRR abs/1611.06069 (2016)Google Scholar
  12. 12.
    Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: DeMoN: depth and motion network for learning monocular stereo. CoRR abs/1612.02401 (2016)Google Scholar
  13. 13.
    Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: ICCV (2015)Google Scholar
  14. 14.
    Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR (2005)Google Scholar
  15. 15.
    Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: NIPS (2014)Google Scholar
  16. 16.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)Google Scholar
  17. 17.
    Babenko, A., Lempitsky, V.S.: Aggregating deep convolutional features for image retrieval. In: ICCV (2015)Google Scholar
  18. 18.
    Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: Visual instance retrieval with deep convolutional networks. CoRR abs/1412.6574 (2014)Google Scholar
  19. 19.
    Azizpour, H., Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: From generic to specific deep representations for visual recognition. In: CVPRW (2015)Google Scholar
  20. 20.
    Wilson, K., Snavely, N.: Robust global translations with 1DSfM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 61–75. Springer, Cham (2014). Google Scholar
  21. 21.
    Moulon, P., Monasse, P., Marlet, R., Others: OpenMVG: an open multiple view geometry library (2012).
  22. 22.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)Google Scholar
  23. 23.
    Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a matlab-like environment for machine learning. In: BigLearn, NIPS Workshop (2011)Google Scholar

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Iaroslav Melekhov
    • 1
  • Juha Ylioinas
    • 1
  • Juho Kannala
    • 1
  • Esa Rahtu
    • 2
  1. 1.Aalto UniversityHelsinkiFinland
  2. 2.Tampere University of TechnologyTampereFinland

Personalised recommendations