
Learning and Aggregating Deep Local Descriptors for Instance-Level Recognition

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (volume 12346)

Abstract

We propose an efficient method to learn deep local descriptors for instance-level recognition. The training only requires examples of positive and negative image pairs and is performed as metric learning of sum-pooled global image descriptors. At inference, the local descriptors are provided by the activations of internal components of the network. We demonstrate why such an approach learns local descriptors that work well for image similarity estimation with classical efficient match kernel methods. The experimental validation studies the trade-off between performance and memory requirements of the state-of-the-art image search approach based on match kernels. Compared to existing local descriptors, the proposed ones perform better in two instance-level recognition tasks and keep memory requirements lower. We experimentally show that global descriptors are not effective enough at large scale and that local descriptors are essential. We achieve state-of-the-art performance, in some cases even with a backbone network as small as ResNet18.
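As a sketch of the training idea described above: local descriptors (here random vectors standing in for internal network activations) are sum-pooled into an L2-normalized global image descriptor, and a pairwise metric-learning loss is applied to positive and negative image pairs. All names, dimensions, and the contrastive margin below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def global_descriptor(local_descs):
    """Sum-pool a set of local descriptors, shape (n, d), into a
    single L2-normalized global image descriptor, shape (d,)."""
    g = local_descs.sum(axis=0)
    return g / np.linalg.norm(g)

def contrastive_loss(g_a, g_b, is_positive, margin=0.7):
    """Pairwise metric learning on global descriptors: pull positive
    pairs together, push negative pairs beyond a margin (the margin
    value here is an assumption, not taken from the paper)."""
    dist = np.linalg.norm(g_a - g_b)
    if is_positive:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

# Random vectors stand in for the network's internal activations.
rng = np.random.default_rng(0)
u = rng.standard_normal((100, 128))             # local descriptors of image A
v = u + 0.01 * rng.standard_normal((100, 128))  # a near-duplicate image B
ga, gb = global_descriptor(u), global_descriptor(v)
print(contrastive_loss(ga, gb, is_positive=True))  # near zero: matching pair
```

The key property exploited at inference is that only the pooled global descriptor is supervised, while the pre-pooling activations serve as the local descriptors.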


Notes

  1.

    https://github.com/gtolias/how.

  2.

    The binarized versions were originally referred to as SMK\(^\star \) and ASMK\(^\star \) [44]. Only the binarized versions are considered in this work, so the asterisk is omitted.

  3.

    To simplify, we use the same notation \(\gamma (\cdot )\) for the normalization of the different similarity measures in the rest of the text. In each case, it ensures unit self-similarity of the corresponding similarity measure.

  4.

    Both f(I) and \(\mathcal {U}\) correspond to the same representation, seen as a 3D tensor and as a set of descriptors, respectively. We write \(\mathcal {U}=f(I)\), implying that the tensor is transformed into a set of vectors. \(\mathcal {U}\) is, in fact, a multi-set, but it is referred to as a set in the paper.

  5.

    The main difference is that we do not follow the two-stage training performed in the original work [29]; DELF is trained in a single stage for our ablations.

  6.

    https://github.com/filipradenovic/cnnimageretrieval-pytorch.
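The binarized SMK/ASMK kernels of footnote 2 compare sign-quantized vectors. A minimal sketch of that binarization step and a Hamming-based similarity (illustrative only, not the full ASMK pipeline):

```python
import numpy as np

def binarize(v):
    """Sign-quantize a descriptor: keep only the sign of each
    component (+1 for non-negative, -1 for negative)."""
    return np.where(v >= 0, 1, -1)

def hamming_sim(a, b):
    """Similarity derived from the normalized Hamming distance of two
    sign vectors: fraction of components that agree."""
    return float((a == b).mean())

a = binarize(np.array([0.3, -2.0, 0.0, 1.5]))   # -> [ 1, -1,  1,  1]
b = binarize(np.array([0.3, 2.0, -1.0, 1.5]))   # -> [ 1,  1, -1,  1]
print(hamming_sim(a, a), hamming_sim(a, b))     # 1.0 0.5
```

Storing only signs is what keeps the memory footprint of the match-kernel index low relative to full floating-point descriptors.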
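The \(\gamma (\cdot )\) normalization of footnote 3 can be sketched as follows; the `dot` kernel here is an assumed stand-in, and the point is only that the factor enforces unit self-similarity:

```python
import numpy as np

def gamma(X, k):
    """gamma(X) = (sum over all pairs x, x' in X of k(x, x'))^(-1/2):
    with this factor, the normalized similarity of a set with itself
    equals 1 by construction."""
    return sum(k(x, y) for x in X for y in X) ** -0.5

def sim(X, Y, k):
    """Normalized match-kernel similarity between two descriptor sets."""
    return gamma(X, k) * gamma(Y, k) * sum(k(x, y) for x in X for y in Y)

dot = lambda x, y: float(x @ y)   # assumed elementary kernel
rng = np.random.default_rng(1)
X = list(rng.standard_normal((5, 8)))
Y = list(rng.standard_normal((5, 8)))
print(round(sim(X, X, dot), 6))   # 1.0: unit self-similarity
```

For the dot-product kernel this reduces to cosine similarity between the summed set vectors, so cross-set similarities also stay in [-1, 1].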
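The tensor-to-set view \(\mathcal {U}=f(I)\) of footnote 4 amounts to reshaping a \(C\times H\times W\) activation tensor into \(H\cdot W\) descriptors of dimension \(C\), one per spatial position. A minimal sketch with assumed shapes:

```python
import numpy as np

def tensor_to_set(f_I):
    """View a CNN activation tensor f(I) of shape (C, H, W) as a
    (multi)set U of H*W local descriptors, each of dimension C."""
    C, H, W = f_I.shape
    # One row per spatial position (h, w), one column per channel.
    return f_I.reshape(C, H * W).T

f_I = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # C=2, H=3, W=4
U = tensor_to_set(f_I)
print(U.shape)  # (12, 2): twelve 2-dimensional descriptors
```

Descriptor `U[h * W + w]` equals the channel vector `f_I[:, h, w]`, which is why the same representation can be treated interchangeably as a tensor or as a set.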

References

  1. Arandjelović, R., Zisserman, A.: All about VLAD. In: CVPR (2013)

  2. Arandjelović, R., Zisserman, A.: DisLocation: scalable descriptor distinctiveness for location recognition. In: Cremers, D., Reid, I., Saito, H., Yang, M.H. (eds.) ACCV 2014. LNCS, vol. 9006, pp. 188–204. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16817-3_13

  3. Arandjelović, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: CVPR (2016)

  4. Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: ICCV (2015)

  5. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 584–599. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_38

  6. Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR (2017)

  7. Barroso-Laguna, A., Riba, E., Ponsa, D., Mikolajczyk, K.: Key.Net: keypoint detection by handcrafted and learned CNN filters. In: ICCV (2019)

  8. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: SURF: speeded up robust features. Comput. Vis. Image Underst. 110(3), 346–359 (2008)

  9. Benbihi, A., Geist, M., Pradalier, C.: ELF: embedded localisation of features in pre-trained CNN. In: CVPR (2019)

  10. Bhowmik, A., Gumhold, S., Rother, C., Brachmann, E.: Reinforced feature points: optimizing feature detection and description for a high-level task. In: CVPR (2020)

  11. Cao, B., Araujo, A., Sim, J.: Unifying deep local and global features for efficient image search. In: arXiv (2020)

  12. DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest point detection and description. In: CVPRW (2018)

  13. Dusmanu, M., et al.: D2-Net: a trainable CNN for joint detection and description of local features. In: CVPR (2019)

  14. Gordo, A., Almazán, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 124(2), 237–254 (2017). https://doi.org/10.1007/s11263-017-1016-8

  15. Gu, Y., Li, C., Jiang, Y.G.: Towards optimal CNN descriptors for large-scale image retrieval. In: ACM Multimedia (2019)

  16. Husain, S., Bober, M.: Improving large-scale image retrieval through robust aggregation of local descriptors. PAMI 39(9), 1783–1796 (2016)

  17. Iscen, A., Tolias, G., Gosselin, P.H., Jégou, H.: A comparison of dense region detectors for image search and fine-grained classification. IEEE Trans. Image Process. 24(8), 2369–2381 (2015)

  18. Jégou, H., Douze, M., Schmid, C.: On the burstiness of visual elements. In: CVPR (2009)

  19. Jégou, H., Chum, O.: Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7573, pp. 774–787. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_55

  20. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. IJCV 87(3), 316–336 (2010)

  21. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)

  22. Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., Schmid, C.: Aggregating local descriptors into compact codes. PAMI 34(9) (2012)

  23. Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggregated deep convolutional features. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 685–701. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46604-0_48

  24. Kim, H.J., Dunn, E., Frahm, J.M.: Learned contextual feature reweighting for image geo-localization. In: CVPR (2017)

  25. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 22(10), 761–767 (2004)

  26. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. PAMI 27(10), 1615–1630 (2005)

  27. Mikolajczyk, K., et al.: A comparison of affine region detectors. IJCV 65(1/2), 43–72 (2005)

  28. Mohedano, E., McGuinness, K., O'Connor, N.E., Salvador, A., Marques, F., Giro-i-Nieto, X.: Bags of local convolutional features for scalable instance search. In: ICMR (2016)

  29. Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: ICCV (2017)

  30. Perronnin, F., Liu, Y., Sanchez, J., Poirier, H.: Large-scale image retrieval with compressed Fisher vectors. In: CVPR (2010)

  31. Perronnin, F., Liu, Y., Renders, J.M.: A family of contextual measures of similarity between distributions with application to image retrieval. In: CVPR, pp. 2358–2365 (2009)

  32. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)

  33. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: improving particular object retrieval in large scale image databases. In: CVPR (2008)

  34. Radenović, F., Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Revisiting Oxford and Paris: large-scale image retrieval benchmarking. In: CVPR (2018)

  35. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. PAMI 41(7), 1655–1668 (2019)

  36. Razavian, A.S., Sullivan, J., Carlsson, S., Maki, A.: Visual instance retrieval with deep convolutional networks. ITE Trans. Media Technol. Appl. 4(3), 251–258 (2016)

  37. Revaud, J., Almazán, J., de Rezende, R.S., de Souza, C.R.: Learning with average precision: training image retrieval with a listwise loss. In: ICCV (2019)

  38. Revaud, J., et al.: R2D2: repeatable and reliable detector and descriptor. In: NeurIPS (2019)

  39. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

  40. Schönberger, J.L., Radenović, F., Chum, O., Frahm, J.M.: From single image query to detailed 3D reconstruction. In: CVPR (2015)

  41. Siméoni, O., Avrithis, Y., Chum, O.: Local features and visual words emerge in activations. In: CVPR (2019)

  42. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV (2003)

  43. Teichmann, M., Araujo, A., Zhu, M., Sim, J.: Detect-to-Retrieve: efficient regional aggregation for image search. In: CVPR (2019)

  44. Tolias, G., Avrithis, Y., Jégou, H.: Image search with selective match kernels: aggregation across single and multiple images. IJCV 116(3), 247–261 (2015). https://doi.org/10.1007/s11263-015-0810-4

  45. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. In: ICLR (2016)

  46. Vo, N., Jacobs, N., Hays, J.: Revisiting IM2GPS in the deep learning era. In: CVPR (2017)

  47. Wang, Q., Zhou, X., Hariharan, B., Snavely, N.: Learning feature descriptors using camera pose supervision. In: arXiv (2020)

  48. Weyand, T., Araujo, A., Cao, B., Sim, J.: Google Landmarks Dataset v2: a large-scale benchmark for instance-level recognition and retrieval. In: CVPR (2020)

  49. Yang, T., Nguyen, D., Heijnen, H., Balntas, V.: UR2KiD: unifying retrieval, keypoint detection, and keypoint description without local correspondence supervision. In: arXiv (2020)

  50. Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: CVPR (2015)

  51. Zhu, C.Z., Jégou, H., Satoh, S.: Query-adaptive asymmetrical dissimilarities for visual object retrieval. In: ICCV (2013)


Acknowledgement

The authors would like to thank Yannis Kalantidis for valuable discussions. This work was supported by the MSMT LL1901 ERC-CZ grant. Tomas Jenicek was supported by CTU student grant SGS20/171/OHK3/3T/13.

Author information

Correspondence to Giorgos Tolias.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Tolias, G., Jenicek, T., Chum, O. (2020). Learning and Aggregating Deep Local Descriptors for Instance-Level Recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_27


  • DOI: https://doi.org/10.1007/978-3-030-58452-8_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58451-1

  • Online ISBN: 978-3-030-58452-8

  • eBook Packages: Computer Science (R0)
