Multimedia Tools and Applications, Volume 77, Issue 5, pp 5385–5415

Aggregating binary local descriptors for image retrieval

  • Giuseppe Amato
  • Fabrizio Falchi
  • Lucia Vadicamo


Content-Based Image Retrieval based on local features is computationally expensive because of the complexity of both extracting and matching local features. On the one hand, the cost of extracting, representing, and comparing local visual descriptors has been dramatically reduced by recently proposed binary local features. On the other hand, aggregation techniques summarize all the features extracted from an image into a single descriptor, which allows image search to be sped up and scaled up. Only a few recent works have combined these two research directions by defining aggregation methods for binary local features, in order to leverage the advantages of both approaches. In this paper, we report an extensive comparison among state-of-the-art aggregation methods applied to binary features. Then, we mathematically formalize the application of Fisher Kernels to Bernoulli Mixture Models. Finally, we investigate the combination of the aggregated binary features with the emerging Convolutional Neural Network (CNN) features. Our results show that aggregation methods on binary features are effective and represent a worthwhile alternative to direct matching. Moreover, combining the CNN with the Fisher Vector (FV) built upon binary features yielded a relative improvement over the CNN results that is in line with that recently obtained by combining the CNN with the FV built upon SIFTs. The advantage of the FV built upon binary features is that extracting binary features is about two orders of magnitude faster than extracting SIFTs.
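As a rough illustration of what aggregating binary local features can look like, the sketch below builds a bag-of-words histogram by assigning each binary descriptor to its nearest centroid under the Hamming distance. It is only a minimal example and not the method evaluated in the paper: the function names `hamming_distances` and `bow_histogram`, the packed `uint8` layout of the descriptors, and the L2 normalization are all assumptions made for this example, and the centroids are taken as given rather than learned.

```python
import numpy as np


def hamming_distances(descriptors, centroids):
    """Pairwise Hamming distances between packed binary vectors.

    descriptors: uint8 array of shape (n, n_bytes)
    centroids:   uint8 array of shape (k, n_bytes)
    Returns an (n, k) array of bit-difference counts.
    """
    # XOR sets exactly the bits in which two vectors differ.
    xor = np.bitwise_xor(descriptors[:, None, :], centroids[None, :, :])
    # Expand every byte into its 8 bits and count the set bits.
    return np.unpackbits(xor, axis=2).sum(axis=2)


def bow_histogram(descriptors, centroids):
    """Aggregate the local descriptors of one image into a single vector.

    Each descriptor votes for its nearest centroid (hypothetical visual
    word); the vote counts are L2-normalized (an assumption of this sketch).
    """
    nearest = hamming_distances(descriptors, centroids).argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(centroids)).astype(np.float64)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


if __name__ == "__main__":
    # Toy data: three 8-bit descriptors and two 8-bit centroids.
    descriptors = np.array([[0b00000000], [0b11111111], [0b00000001]], dtype=np.uint8)
    centroids = np.array([[0b00000000], [0b11111111]], dtype=np.uint8)
    print(bow_histogram(descriptors, centroids))
```

Two images can then be compared through a single vector each, for example with the dot product of their histograms, instead of matching every pair of local descriptors directly.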


Keywords: Binary local feature; Fisher vector; VLAD; Bag of words; Convolutional neural network; Content-based image retrieval



This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Giuseppe Amato (1)
  • Fabrizio Falchi (1)
  • Lucia Vadicamo (1)

  1. Institute of Information Science and Technologies (ISTI) - CNR, Pisa, Italy
