Predicting Visual Overlap of Images Through Interpretable Non-metric Box Embeddings

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12350)


To what extent are two images picturing the same 3D surfaces? Even for a known scene, answering this typically requires an expensive search across scale space, with matching and geometric verification of large sets of local features. This expense is further multiplied when a query image is evaluated against a gallery, e.g. in visual relocalization. While we don’t obviate the need for geometric verification, we propose an interpretable image-embedding that cuts the search in scale space to essentially a lookup.

Our approach measures the asymmetric relation between two images. The model then learns a scene-specific measure of similarity, from training examples with known 3D visible-surface overlaps. The result is that we can quickly identify, for example, which test image is a close-up version of another, and by what scale factor. Subsequently, local features need only be detected at that scale. We validate our scene-specific model by showing how this embedding yields competitive image-matching results, while being simpler, faster, and also interpretable by humans.
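The asymmetric measure described above can be illustrated with plain axis-aligned box embeddings: if each image is embedded as a box, the intersection volume normalised by each box's own volume gives two different numbers, so "A is a close-up of B" is distinguishable from the reverse, and the ratio of box volumes hints at the relative scale factor. The following is a minimal illustrative sketch, not the authors' trained model; the function name and box layout are assumptions for illustration only.

```python
import numpy as np

def box_overlap(box_a, box_b):
    """Asymmetric overlap between two axis-aligned boxes.

    Each box is (min_corner, max_corner), arrays of shape (d,).
    Returns (overlap_a, overlap_b): the fraction of each box's own
    volume covered by the intersection, which is asymmetric.
    """
    lo = np.maximum(box_a[0], box_b[0])  # intersection lower corner
    hi = np.minimum(box_a[1], box_b[1])  # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # 0 if disjoint
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / vol_a, inter / vol_b

# A small box nested inside a larger one: the "close-up" view is
# fully contained in the wider view, but not vice versa.
wide = (np.array([0.0, 0.0]), np.array([4.0, 4.0]))
close_up = (np.array([1.0, 1.0]), np.array([2.0, 2.0]))
a_in_b, b_in_a = box_overlap(close_up, wide)
print(a_in_b, b_in_a)  # 1.0 0.0625
```

Here the close-up's surfaces are entirely visible in the wide view (overlap 1.0), while the wide view's surfaces are mostly absent from the close-up (overlap 1/16), and the 16:1 volume ratio signals at what scale local features should be sought.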


Keywords: Image embedding · Representation learning · Image localization · Interpretable representation



Thanks to Carl Toft for help with normal estimation, to Michael Firman for comments on paper drafts and to the anonymous reviewers for helpful feedback.

Supplementary material

Supplementary material 1 (PDF, 48472 KB)
Supplementary material 2 (MP4, 53918 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University College London, London, UK
  2. Niantic, San Francisco, USA
