Abstract
We propose an efficient method to learn deep local descriptors for instance-level recognition. The training only requires examples of positive and negative image pairs and is performed as metric learning of sum-pooled global image descriptors. At inference, the local descriptors are provided by the activations of internal components of the network. We demonstrate why such an approach learns local descriptors that work well for image similarity estimation with classical efficient match kernel methods. The experimental validation studies the trade-off between performance and memory requirements of the state-of-the-art image search approach based on match kernels. Compared to existing local descriptors, the proposed ones perform better on two instance-level recognition tasks while keeping memory requirements lower. We experimentally show that global descriptors are not effective enough at large scale and that local descriptors are essential. We achieve state-of-the-art performance, in some cases even with a backbone network as small as ResNet18.
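The core idea of the abstract — one activation tensor yielding both a set of local descriptors and a sum-pooled global descriptor for metric learning — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual architecture; the tensor shape and function names are assumptions.

```python
import numpy as np

def local_descriptors(activations):
    """Treat each spatial position of a C x H x W activation tensor
    as one C-dimensional local descriptor (the view used at inference)."""
    c, h, w = activations.shape
    return activations.reshape(c, h * w).T  # shape (H*W, C)

def global_descriptor(activations):
    """Sum-pool the local descriptors over spatial positions and
    L2-normalize, giving the global vector used for metric learning."""
    g = local_descriptors(activations).sum(axis=0)
    return g / np.linalg.norm(g)

# Hypothetical backbone output for one image.
rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 7, 7))
g = global_descriptor(feat)
print(g.shape)  # (512,)
```

Because the global descriptor is a (normalized) sum of the local ones, a pairwise loss on global descriptors implicitly shapes the local descriptors that the match-kernel stage consumes at inference.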
Notes
- 1.
- 2. The binarized versions are originally referred to in [44] as SMK\(^\star \) and ASMK\(^\star \). Only the binarized versions are considered in this work, and the asterisk is omitted.
- 3. To simplify, we use the same notation, i.e. \(\gamma (\cdot )\), for the normalization of different similarity measures in the rest of the text. In each case, it ensures unit self-similarity of the corresponding similarity measure.
- 4. Both \(f(I)\) and \(\mathcal {U}\) correspond to the same representation, seen as a 3D tensor and as a set of descriptors, respectively. We write \(\mathcal {U}=f(I)\), implying that the tensor is transformed into a set of vectors. \(\mathcal {U}\) is, in fact, a multiset, but it is referred to as a set in the paper.
- 5. The main difference is that we do not follow the two-stage training performed in the original work [29]; DELF is trained in a single stage for our ablations.
- 6.
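Note 3's normalization \(\gamma (\cdot )\) can be illustrated with a toy set-similarity measure; the choice of \(\gamma \) as the reciprocal square root of the raw self-similarity is exactly what makes the normalized self-similarity equal one. The kernel below is a deliberately simple stand-in, not the paper's actual SMK/ASMK similarity.

```python
import numpy as np

def raw_similarity(u_set, v_set):
    """Toy un-normalized match-kernel similarity: the sum of dot
    products between all descriptor pairs of the two sets."""
    return float(np.sum(u_set @ v_set.T))

def gamma(u_set):
    """Normalization factor gamma(.): chosen so that the normalized
    self-similarity of a descriptor set equals one."""
    return 1.0 / np.sqrt(raw_similarity(u_set, u_set))

def similarity(u_set, v_set):
    """gamma-normalized similarity between two descriptor sets."""
    return gamma(u_set) * gamma(v_set) * raw_similarity(u_set, v_set)

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 8))  # a set of 5 hypothetical descriptors
print(similarity(U, U))  # 1.0 by construction
```

The same construction applies to each similarity measure in the text: whatever the raw kernel is, dividing by the square root of each argument's self-similarity yields unit self-similarity.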
References
1. Arandjelović, R., Zisserman, A.: All about VLAD. In: CVPR (2013)
2. Arandjelović, R., Zisserman, A.: DisLocation: scalable descriptor distinctiveness for location recognition. In: Cremers, D., Reid, I., Saito, H., Yang, M.H. (eds.) ACCV 2014. LNCS, vol. 9006, pp. 188–204. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16817-3_13
3. Arandjelović, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: CVPR (2016)
4. Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: ICCV (2015)
5. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 584–599. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_38
6. Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR (2017)
7. Barroso Laguna, A., Riba, E., Ponsa, D., Mikolajczyk, K.: Key.Net: keypoint detection by handcrafted and learned CNN filters. In: ICCV (2019)
8. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: SURF: speeded up robust features. Comput. Vis. Image Underst. 110(3), 346–359 (2008)
9. Benbihi, A., Geist, M., Pradalier, C.: ELF: embedded localisation of features in pre-trained CNN. In: CVPR (2019)
10. Bhowmik, A., Gumhold, S., Rother, C., Brachmann, E.: Reinforced feature points: optimizing feature detection and description for a high-level task. In: CVPR (2020)
11. Cao, B., Araujo, A., Sim, J.: Unifying deep local and global features for efficient image search. In: arXiv (2020)
12. DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperPoint: self-supervised interest point detection and description. In: CVPRW (2018)
13. Dusmanu, M., et al.: D2-Net: a trainable CNN for joint detection and description of local features. In: CVPR (2019)
14. Gordo, A., Almazán, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 124(2), 237–254 (2017). https://doi.org/10.1007/s11263-017-1016-8
15. Gu, Y., Li, C., Jiang, Y.G.: Towards optimal CNN descriptors for large-scale image retrieval. In: ACM Multimedia (2019)
16. Husain, S., Bober, M.: Improving large-scale image retrieval through robust aggregation of local descriptors. PAMI 39(9), 1783–1796 (2016)
17. Iscen, A., Tolias, G., Gosselin, P.H., Jégou, H.: A comparison of dense region detectors for image search and fine-grained classification. IEEE Trans. Image Process. 24(8), 2369–2381 (2015)
18. Jégou, H., Douze, M., Schmid, C.: On the burstiness of visual elements. In: CVPR (2009)
19. Jégou, H., Chum, O.: Negative evidences and co-occurences in image retrieval: the benefit of PCA and whitening. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7573, pp. 774–787. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_55
20. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. IJCV 87(3), 316–336 (2010)
21. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)
22. Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., Schmid, C.: Aggregating local descriptors into compact codes. PAMI 34(9), 1704–1716 (2012)
23. Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggregated deep convolutional features. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 685–701. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46604-0_48
24. Kim, H.J., Dunn, E., Frahm, J.M.: Learned contextual feature reweighting for image geo-localization. In: CVPR (2017)
25. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 22(10), 761–767 (2004)
26. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. PAMI 27(10), 1615–1630 (2005)
27. Mikolajczyk, K., et al.: A comparison of affine region detectors. IJCV 65(1/2), 43–72 (2005)
28. Mohedano, E., McGuinness, K., O'Connor, N.E., Salvador, A., Marques, F., Giró-i-Nieto, X.: Bags of local convolutional features for scalable instance search. In: ICMR (2016)
29. Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: ICCV (2017)
30. Perronnin, F., Liu, Y., Sanchez, J., Poirier, H.: Large-scale image retrieval with compressed Fisher vectors. In: CVPR (2010)
31. Perronnin, F., Liu, Y., Renders, J.M.: A family of contextual measures of similarity between distributions with application to image retrieval. In: CVPR, pp. 2358–2365 (2009)
32. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
33. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: improving particular object retrieval in large scale image databases. In: CVPR (2008)
34. Radenović, F., Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Revisiting Oxford and Paris: large-scale image retrieval benchmarking. In: CVPR (2018)
35. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. PAMI 41(7), 1655–1668 (2019)
36. Razavian, A.S., Sullivan, J., Carlsson, S., Maki, A.: Visual instance retrieval with deep convolutional networks. ITE Trans. Media Technol. Appl. 4(3), 251–258 (2016)
37. Revaud, J., Almazán, J., de Rezende, R.S., de Souza, C.R.: Learning with average precision: training image retrieval with a listwise loss. In: ICCV (2019)
38. Revaud, J., et al.: R2D2: repeatable and reliable detector and descriptor. In: NeurIPS (2019)
39. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
40. Schönberger, J.L., Radenović, F., Chum, O., Frahm, J.M.: From single image query to detailed 3D reconstruction. In: CVPR (2015)
41. Siméoni, O., Avrithis, Y., Chum, O.: Local features and visual words emerge in activations. In: CVPR (2019)
42. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV (2003)
43. Teichmann, M., Araujo, A., Zhu, M., Sim, J.: Detect-to-retrieve: efficient regional aggregation for image search. In: CVPR (2019)
44. Tolias, G., Avrithis, Y., Jégou, H.: Image search with selective match kernels: aggregation across single and multiple images. IJCV 116(3), 247–261 (2015). https://doi.org/10.1007/s11263-015-0810-4
45. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. In: ICLR (2016)
46. Vo, N., Jacobs, N., Hays, J.: Revisiting im2gps in the deep learning era. In: CVPR (2017)
47. Wang, Q., Zhou, X., Hariharan, B., Snavely, N.: Learning feature descriptors using camera pose supervision. In: arXiv (2020)
48. Weyand, T., Araujo, A., Cao, B., Sim, J.: Google Landmarks Dataset v2: a large-scale benchmark for instance-level recognition and retrieval. In: CVPR (2020)
49. Yang, T., Nguyen, D., Heijnen, H., Balntas, V.: UR2KiD: unifying retrieval, keypoint detection, and keypoint description without local correspondence supervision. In: arXiv (2020)
50. Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: CVPR (2015)
51. Zhu, C.Z., Jégou, H., Satoh, S.: Query-adaptive asymmetrical dissimilarities for visual object retrieval. In: ICCV (2013)
Acknowledgement
The authors would like to thank Yannis Kalantidis for valuable discussions. This work was supported by the MSMT LL1901 ERC-CZ grant. Tomas Jenicek was supported by CTU student grant SGS20/171/OHK3/3T/13.
© 2020 Springer Nature Switzerland AG
Cite this paper
Tolias, G., Jenicek, T., Chum, O. (2020). Learning and Aggregating Deep Local Descriptors for Instance-Level Recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_27
DOI: https://doi.org/10.1007/978-3-030-58452-8_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58451-1
Online ISBN: 978-3-030-58452-8
eBook Packages: Computer Science, Computer Science (R0)