Learning Feature Descriptors Using Camera Pose Supervision

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


Recent research on learned visual descriptors has shown promising improvements in correspondence estimation, a key component of many 3D vision tasks. However, existing descriptor learning frameworks typically require ground-truth correspondences between feature points for training, which are challenging to acquire at scale. In this paper we propose a novel weakly-supervised framework that can learn feature descriptors solely from relative camera poses between images. To do so, we devise both a new loss function that exploits the epipolar constraint given by camera poses, and a new model architecture that makes the whole pipeline differentiable and efficient. Because we no longer need pixel-level ground-truth correspondences, our framework opens up the possibility of training on much larger and more diverse datasets for better and unbiased descriptors. We call the resulting descriptors CAmera Pose Supervised, or CAPS, descriptors. Though trained with weak supervision, CAPS descriptors outperform even prior fully-supervised descriptors and achieve state-of-the-art performance on a variety of geometric tasks. (Project page:
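The epipolar constraint mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the paper's loss function; it only shows how a known relative pose constrains where a match may lie. The function names, the pose convention (X2 = R X1 + t), and the intrinsics K1 and K2 are assumptions made for this illustration.

```python
import numpy as np

def skew(t):
    """Return the 3x3 matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix F satisfying x2^T F x1 = 0.

    Assumed convention: a 3D point X1 in camera-1 coordinates maps to
    camera-2 coordinates as X2 = R @ X1 + t.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_distance(F, x1, x2):
    """Pixel distance from each point in x2 to the epipolar line of x1.

    x1 and x2 are (N, 2) arrays of pixel coordinates in images 1 and 2.
    """
    x1_h = np.hstack([x1, np.ones((len(x1), 1))])
    x2_h = np.hstack([x2, np.ones((len(x2), 1))])
    lines = x1_h @ F.T  # row i is (a, b, c) for the line a*u + b*v + c = 0 in image 2
    num = np.abs(np.sum(lines * x2_h, axis=1))
    den = np.linalg.norm(lines[:, :2], axis=1)
    return num / den
```

A true correspondence has a distance near zero, so a predicted match that lies far from its epipolar line can be penalized without knowing the exact ground-truth pixel. Supervision of this kind needs only camera poses.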


Keywords: Local features, Feature descriptors, Correspondence, Image matching, Camera pose



Acknowledgements

We thank Kai Zhang, Zixin Luo, and Zhengqi Li for helpful discussions and comments. This work was supported in part by a DARPA LwLL grant and in part by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

Supplementary material

Supplementary material 1: 500725_1_En_44_MOESM1_ESM.pdf (PDF, 5442 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Cornell University, Ithaca, USA
  2. Cornell Tech, New York, USA
  3. Zhejiang University, Hangzhou, China
