
Using scale-equivariant CNN to enhance scale robustness in feature matching

Research article · The Visual Computer

Abstract

Image matching is an important task in computer vision. Detector-free dense matching is a prominent research direction within it because of its high accuracy and robustness. Classical detector-free methods use convolutional neural networks (CNNs) to extract features and then match them. Because CNNs lack scale equivariance, these methods often match poorly when the images to be matched differ significantly in scale, yet large scale variations are common in practice. To address this problem, we propose SeLFM, a method that combines scale equivariance with the global modeling capability of Transformers: a scale-equivariant CNN extracts scale-equivariant features, while the Transformer contributes global context modeling. Experiments show that this design improves matching on image pairs with large scale variations without degrading general matching performance. The code will be open-sourced at: https://github.com/LiaoYun0x0/SeLFM/tree/main
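
To make the mechanism described in the abstract concrete, the sketch below shows one common way to approximate scale equivariance in a CNN layer: a single set of convolution weights is shared across several rescaled copies of the input, and the per-scale responses are pooled over the scale axis. This is an illustrative sketch only; the class name ScaleEquivariantConv, the scale set, and the max-pooling over scales are assumptions for exposition, not the authors' SeLFM implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScaleEquivariantConv(nn.Module):
        """Illustrative sketch: one convolution shared across scales.

        The input is resampled to each scale, convolved with the shared
        weights, resampled back to the original resolution, and the
        per-scale responses are max-pooled over the scale axis.
        """

        def __init__(self, in_ch: int, out_ch: int, scales=(0.5, 1.0, 2.0)):
            super().__init__()
            self.scales = scales
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h, w = x.shape[-2:]
            responses = []
            for s in self.scales:
                # Resample to scale s, apply the shared convolution,
                # then resample the response back to (h, w).
                xs = F.interpolate(x, scale_factor=s, mode="bilinear",
                                   align_corners=False)
                ys = self.conv(xs)
                ys = F.interpolate(ys, size=(h, w), mode="bilinear",
                                   align_corners=False)
                responses.append(ys)
            # Pooling over the scale axis lets the strongest response win
            # regardless of which scale produced it.
            return torch.stack(responses, dim=0).max(dim=0).values

    # Minimal usage check: shared weights make features from an image and
    # its rescaled copy closely related.
    layer = ScaleEquivariantConv(3, 16)
    y = layer(torch.randn(1, 3, 64, 64))
    print(y.shape)  # torch.Size([1, 16, 64, 64])

The design choice illustrated here (weight sharing across a discrete scale group followed by pooling) is the standard construction in the scale-equivariance literature the paper builds on; SeLFM additionally feeds such features into a Transformer for global matching.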



Data availability statement

The datasets analyzed during the current study are available from the following public domain resources: http://www.cs.cornell.edu/projects/megadepth/; http://icvl.ee.ic.ac.uk/vbalnt/hpatches/; https://github.com/abyssgaze/oetr.


Acknowledgements

This work was partially supported by the Open Foundation of the Yunnan Key Laboratory of Software Engineering under Grant No. 2020SE307, and the Scientific Research Foundation of the Education Department of Yunnan Province under Grant No. 2021J0007.

Author information

Corresponding author

Correspondence to Qing Duan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Liao, Y., Liu, P., Wu, X. et al. Using scale-equivariant CNN to enhance scale robustness in feature matching. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03389-0

