Learning and Aggregating Deep Local Descriptors for Instance-Level Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


We propose an efficient method to learn deep local descriptors for instance-level recognition. Training requires only examples of positive and negative image pairs and is performed as metric learning of sum-pooled global image descriptors. At inference, the local descriptors are provided by the activations of internal components of the network. We demonstrate why such an approach learns local descriptors that work well for image similarity estimation with classical efficient match kernel methods. The experimental validation studies the trade-off between performance and memory requirements of the state-of-the-art image search approach based on match kernels. Compared to existing local descriptors, the proposed ones perform better on two instance-level recognition tasks while requiring less memory. We experimentally show that global descriptors are not effective enough at large scale and that local descriptors are essential. We achieve state-of-the-art performance, in some cases even with a backbone network as small as ResNet18.
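The link between the training objective (metric learning on sum-pooled global descriptors) and the inference-time use of local descriptors can be made concrete with a small sketch. The snippet below uses random NumPy arrays as a stand-in for a real backbone's feature map (the function names and shapes are illustrative assumptions, not the paper's implementation): each spatial position of the activation map is treated as one local descriptor, and the global descriptor is their L2-normalized sum, so the dot product of two global descriptors equals a (normalized) sum of all pairwise local similarities — the efficient match kernel view.

```python
import numpy as np

def local_descriptors(fmap):
    """Treat each spatial position of a CNN feature map as a local descriptor.

    fmap: activations of shape (D, H, W); here a random stand-in for the
    output of a real backbone such as ResNet18.
    Returns an (H*W, D) array of L2-normalized local descriptors.
    """
    d, h, w = fmap.shape
    locs = fmap.reshape(d, h * w).T  # one D-dimensional vector per position
    norms = np.linalg.norm(locs, axis=1, keepdims=True)
    return locs / np.maximum(norms, 1e-12)

def global_descriptor(fmap):
    """Sum-pool the local descriptors into one global image vector and
    L2-normalize it. Because (sum_i x_i) . (sum_j y_j) = sum_ij x_i . y_j,
    a dot product of two such global descriptors is a normalized sum of
    pairwise local similarities, i.e. a simple match kernel."""
    g = local_descriptors(fmap).sum(axis=0)
    return g / np.maximum(np.linalg.norm(g), 1e-12)

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 64, 7, 7))  # two fake 64-channel 7x7 feature maps
sim = float(global_descriptor(a) @ global_descriptor(b))
```

Training a pairwise metric-learning loss on `sim` therefore shapes all the local vectors at once, which is why the internal activations become useful local descriptors at inference time.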


Keywords: Deep local descriptors · Deep local features · Efficient match kernel · ASMK · Image retrieval · Instance-level recognition



The authors would like to thank Yannis Kalantidis for valuable discussions. This work was supported by the MSMT LL1901 ERC-CZ grant. Tomas Jenicek was supported by CTU student grant SGS20/171/OHK3/3T/13.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic
