International Journal of Computer Vision, Volume 124, Issue 2, pp 237–254

End-to-End Learning of Deep Visual Representations for Image Retrieval

  • Albert Gordo
  • Jon Almazán
  • Jerome Revaud
  • Diane Larlus


While deep learning has become a key ingredient in the top-performing methods for many computer vision tasks, it has so far failed to bring similar improvements to instance-level image retrieval. In this article, we argue that the reasons for the underwhelming results of deep methods on image retrieval are threefold: (1) noisy training data, (2) inappropriate deep architecture, and (3) suboptimal training procedure. We address all three issues. First, we leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval. Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it. Last, we train this network with a siamese architecture that combines three streams with a triplet loss. At the end of the training process, the proposed architecture produces a global image representation in a single forward pass that is well suited for image retrieval. Extensive experiments show that our approach significantly outperforms previous retrieval approaches, including state-of-the-art methods based on costly local descriptor indexing and spatial verification. On Oxford 5k, Paris 6k, and Holidays, we report mean average precision of 94.7, 96.6, and 94.8, respectively. Our representations can also be heavily compressed using product quantization with little loss in accuracy.
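The three-stream training idea mentioned above can be illustrated with a minimal sketch of a triplet ranking loss on global descriptors. The abstract does not give the exact formulation, so the hinge form below, the margin value of 0.1, and the helper names (`l2_normalize`, `squared_distance`, `triplet_loss`) are assumptions chosen for illustration, not the authors' implementation. Each stream is assumed to output one descriptor vector per image: a query, a relevant (positive) image, and a non-relevant (negative) image.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(query, positive, negative, margin=0.1):
    """Hinge-style triplet loss on L2-normalized descriptors (assumed form).

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin` (in squared distance); otherwise it
    grows linearly with the violation.
    """
    q, p, n = (l2_normalize(v) for v in (query, positive, negative))
    violation = margin + squared_distance(q, p) - squared_distance(q, n)
    return 0.5 * max(0.0, violation)


if __name__ == "__main__":
    # Well-ranked triplet: positive matches the query, negative is orthogonal.
    print(triplet_loss([1.0, 0.0], [2.0, 0.0], [0.0, 1.0]))
    # Badly ranked triplet: the negative matches the query instead.
    print(triplet_loss([1.0, 0.0], [0.0, 1.0], [3.0, 0.0]))
```

In this sketch a correctly ranked triplet contributes no loss, so only triplets that violate the margin would drive a gradient update during training.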


Keywords: Deep learning, Instance-level retrieval, Visual search, Visual representation



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Albert Gordo (1)
  • Jon Almazán (1)
  • Jerome Revaud (1)
  • Diane Larlus (1)
  1. Computer Vision Group, Xerox Research Center Europe, Meylan, France
