
Single-Image Depth Prediction Makes Feature Matching Easier

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)

Abstract

Good local features improve the robustness of many 3D re-localization and multi-view reconstruction pipelines. The problem is that viewing angle and distance severely impact the recognizability of a local feature. Attempts to improve appearance invariance by choosing better local feature points or by leveraging outside information have come with prerequisites that made some of them impractical. In this paper, we propose a surprisingly effective enhancement to local feature extraction, which improves matching. We show that CNN-based depths inferred from single RGB images are quite helpful, despite their flaws: they allow us to pre-warp images and rectify perspective distortions, significantly enhancing SIFT and BRISK features and enabling more good matches, even when cameras view the same scene from opposite directions.
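The warp-then-extract idea in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal, hedged illustration, not necessarily the paper's exact pipeline: `predict_depth` is a hypothetical stand-in for any off-the-shelf single-image depth CNN, normals are estimated from back-projected depth, a rotation-only homography synthesizes an approximately fronto-parallel view of a dominant plane, and SIFT runs on the warped image.

```python
import cv2
import numpy as np

def normals_from_depth(depth, K):
    """Back-project a depth map to 3D points and estimate per-pixel
    normals from the cross product of local point-cloud gradients."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    P = np.dstack([X, Y, depth])          # (h, w, 3) points in camera frame
    dPdu = np.gradient(P, axis=1)
    dPdv = np.gradient(P, axis=0)
    n = np.cross(dPdu, dPdv)
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-8)

def rectifying_homography(normal, K):
    """Rotation-only homography K R K^-1 that turns the camera to face
    the surface (degenerate if the normal is parallel to the up vector)."""
    z = -normal / np.linalg.norm(normal)  # new optical axis: into the surface
    x = np.cross([0.0, 1.0, 0.0], z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])               # original camera frame -> rectified frame
    return K @ R @ np.linalg.inv(K)

# depth = predict_depth(img)  # hypothetical single-image depth CNN
# n = normals_from_depth(depth, K).mean(axis=(0, 1))  # e.g. a dominant plane
# H = rectifying_homography(n, K)
# warped = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
# kps, descs = cv2.SIFT_create().detectAndCompute(warped, None)
# keypoint locations map back to the original image via the inverse of H
```

Because a pure camera rotation induces a depth-independent homography, the warp is exact for the synthesized viewpoint; the approximation lies only in the CNN depth and the assumption of locally planar structure.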

Keywords

Local feature matching · Image matching

Notes

Acknowledgements

The bulk of this work was performed during an internship at Niantic, and the first author would like to thank them for hosting him during the summer of 2019. This work has also been partially supported by the Swedish Foundation for Strategic Research (Semantic Mapping and Visual Navigation for Smart Robots) and the Chalmers AI Research Centre (CHAIR) (VisLocLearn). We would also like to extend our thanks to Iaroslav Melekhov, who captured some of the footage.

Supplementary material

504471_1_En_28_MOESM1_ESM.pdf (9.3 MB)
Supplementary material 1 (PDF 9473 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Chalmers University of Technology, Gothenburg, Sweden
  2. Niantic, San Francisco, USA
  3. University College London, London, UK
