Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)

Abstract

In this work we target the problem of estimating accurately localized correspondences between a pair of images. We adopt the recent Neighbourhood Consensus Networks, which have demonstrated promising performance on difficult correspondence problems, and propose modifications that overcome their main limitations: large memory consumption, long inference time, and poorly localized correspondences. Our proposed modifications reduce the memory footprint and execution time by more than \(10\times \), with equivalent results. This is achieved by sparsifying the correlation tensor containing tentative matches, and by processing it with a 4D CNN using submanifold sparse convolutions. Localization accuracy is significantly improved by processing the input images at higher resolution, which is made possible by the reduced memory footprint, and by a novel two-stage correspondence relocalization module. The proposed Sparse-NCNet method obtains state-of-the-art results on the HPatches Sequences and InLoc visual localization benchmarks, and competitive results on the Aachen Day-Night benchmark.
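
To make the sparsification step concrete, below is a minimal PyTorch sketch of how the 4D correlation tensor of tentative matches can be reduced to its strongest entries before being processed by a 4D submanifold sparse CNN. This is an illustrative assumption, not the authors' implementation: the function name sparse_correlation, the per-source-location top-k selection, and the default k=10 are hypothetical choices for this sketch, and the paper additionally retains strong matches in both matching directions.

    import torch

    def sparse_correlation(feat_a, feat_b, k=10):
        # feat_a, feat_b: L2-normalised CNN feature maps of shape (C, H, W).
        c, h, w = feat_a.shape
        a = feat_a.reshape(c, h * w)            # (C, HW)
        b = feat_b.reshape(c, h * w)            # (C, HW)
        corr = a.t() @ b                        # (HW, HW): all tentative matches
        vals, idx = corr.topk(k, dim=1)         # keep the k best matches per A-cell

        src = torch.arange(h * w).repeat_interleave(k)  # source cell index, repeated
        tgt = idx.reshape(-1)                           # matched cell index in B
        # 4D coordinates (i, j, k, l) of the surviving correlation entries
        coords = torch.stack([src // w, src % w, tgt // w, tgt % w], dim=1)
        return coords, vals.reshape(-1)

Because submanifold sparse convolutions compute outputs only at the active (nonzero) sites, the subsequent 4D CNN operates on roughly k·HW entries rather than the full (HW)² dense tensor, which is the source of the memory and runtime savings reported above.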

Keywords

Image matching · Neighbourhood consensus · Sparse CNN

Notes

Acknowledgements

This work was partially supported by the European Regional Development Fund under project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468), Louis Vuitton ENS Chair on Artificial Intelligence, and the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

Supplementary material

504446_1_En_35_MOESM1_ESM.pdf (14 MB)
Supplementary material 1 (PDF, 14,381 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. WILLOW, Inria, DI ENS, CNRS, PSL Research University, Paris, France
  2. DeepMind, London, UK
  3. Czech Institute of Informatics, Robotics and Cybernetics, CTU, Prague, Czechia
