Aggregating binary local descriptors for image retrieval

Multimedia Tools and Applications

Abstract

Content-Based Image Retrieval based on local features is computationally expensive because of the complexity of both extraction and matching of local features. On the one hand, the cost of extracting, representing, and comparing local visual descriptors has been dramatically reduced by recently proposed binary local features. On the other hand, aggregation techniques provide a meaningful summarization of all the extracted features of an image into a single descriptor, allowing us to speed up and scale up the image search. Only a few recent works have combined these two research directions, defining aggregation methods for binary local features in order to leverage the advantages of both approaches. In this paper, we report an extensive comparison among state-of-the-art aggregation methods applied to binary features. Then, we mathematically formalize the application of Fisher Kernels to Bernoulli Mixture Models. Finally, we investigate the combination of the aggregated binary features with the emerging Convolutional Neural Network (CNN) features. Our results show that aggregation methods on binary features are effective and represent a worthwhile alternative to direct matching. Moreover, the combination of the CNN with the Fisher Vector (FV) built upon binary features allowed us to obtain a relative improvement over the CNN results that is in line with that recently obtained using the combination of the CNN with the FV built upon SIFTs. The advantage of using the FV built upon binary features is that the extraction of binary features is about two orders of magnitude faster than that of SIFTs.

Notes

  1. With respect to the experimental setting used in our previous work [3], we improved the computation of the local features before the aggregation phase, which allowed us to obtain better performance for BoW and VLAD on the INRIA Holidays dataset than that reported in [3].

  2. A Bernoulli distribution \(p(x) = \mu^{x}(1-\mu)^{1-x}\) of parameter μ can be written in exponential-family form as \(p(x) = \exp\left(\eta x - \log(1 + e^{\eta})\right)\), where \(\eta = \log \left(\frac{\mu}{1-\mu}\right)\) is the natural parameter; indeed \(p(x)=\exp\left(x\log\mu+(1-x)\log(1-\mu)\right)=\exp\left(\eta x + \log(1-\mu)\right)\) and \(\log(1-\mu)=-\log(1+e^{\eta})\). In [62] the score function is computed considering the gradient w.r.t. the natural parameter η, while in this paper we used the gradient w.r.t. the standard parameter μ of the Bernoulli (as also done in [72]).

  3. http://opencv.org/.

  4. https://github.com/ffalchi/it.cnr.isti.vir.

  5. https://github.com/BVLC/caffe/wiki/Model-Zoo.

  6. https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.

  7. To search a database for the objects similar to a query, we can use either a similarity function or a distance function. In the first case, we search for the objects with the greatest similarity to the query; in the latter case, we search for the objects with the lowest distance from the query. A similarity function is said to be equivalent to a distance function if the ranked list of the results to a query is the same. For example, the Euclidean distance between two vectors (\(\ell_2(x_1,x_2)=\|x_1-x_2\|_2\)) is equivalent to the cosine similarity (\(s_{\text{cos}}(x_1,x_2)=(x_1 \cdot x_2)/(\|x_1\|_2\,\|x_2\|_2)\)) whenever the vectors are L2-normalized (i.e. \(\|x_1\|_2=\|x_2\|_2=1\)). In fact, in such a case, \(s_{\text{cos}}(x_{1},x_{2})=1-\frac{1}{2}{\ell_{2}(x_{1},x_{2})}^{2}\), which implies that the ranked list of the results to a query is the same (i.e., \(\ell_2(x_1,x_2)\leq \ell_2(x_1,x_3)\) iff \(s_{\text{cos}}(x_1,x_2)\geq s_{\text{cos}}(x_1,x_3)\) for all \(x_1,x_2,x_3\)). A short numerical check of this identity is sketched after these notes.

  8. An elementary matrix \(E(u,v,\sigma) = I - \sigma u v^{H}\) is non-singular if and only if \(\sigma v^{H} u \neq 1\), and in this case the inverse is \(E(u,v,\sigma)^{-1} = E(u,v,\tau)\), where \(\tau = \sigma/(\sigma v^{H} u - 1)\). More details on this topic can be found in [29].
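
As a sanity check of the equivalence discussed in note 7, here is a minimal NumPy sketch (our illustration, not part of the original paper) that verifies \(s_{\text{cos}}(x_1,x_2)=1-\frac{1}{2}{\ell_2(x_1,x_2)}^{2}\) on a pair of L2-normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, L2-normalized as required in note 7.
x1, x2 = rng.standard_normal((2, 128))
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)

s_cos = float(x1 @ x2)        # cosine similarity of unit vectors
l2 = np.linalg.norm(x1 - x2)  # Euclidean distance

# For unit vectors ||x1 - x2||^2 = 2 - 2 (x1 . x2), hence s_cos = 1 - l2^2 / 2,
# so ranking by greatest similarity and ranking by lowest distance coincide.
assert np.isclose(s_cos, 1 - 0.5 * l2**2)
```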

References

  1. Alcantarilla PF, Nuevo J, Bartoli A (2013) Fast explicit diffusion for accelerated features in nonlinear scale spaces British machine vision conference (BMVC)

  2. Amato G, Falchi F, Gennaro C, Vadicamo L (2016) Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing. Springer International Publishing, Cham, pp 93–106. doi:10.1007/978-3-319-46759-7_7

  3. Amato G, Falchi F, Vadicamo L (2016) How effective are aggregation methods on binary features? Proceedings of the 11th joint conference on computer vision, imaging and computer graphics theory and applications, vol 4, pp 566–573

  4. Amato G, Falchi F, Vadicamo L (2016) Visual Recognition of Ancient Inscriptions Using Convolutional Neural Network and Fisher Vector, J Comput Cult Herit (JOCCH) Article 21 9, 4 (December 2016) 24 pages. doi:10.1145/2964911

  5. Arandjelovic R, Zisserman A (2012) Three things everyone should know to improve object retrieval 2012 IEEE conference on Computer vision and pattern recognition (CVPR), pp 2911–2918

  6. Arandjelovic R, Zisserman A (2013) All about VLAD 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2013.207, pp 1578–1585

  7. Babenko A, Slesarev A, Chigorin A, Lempitsky V (2014) Neural codes for image retrieval Computer Vision–ECCV 2014. doi:10.1007/978-3-319-10590-1_38. Springer, pp 584–599

  8. Bay H, Tuytelaars T, Van Gool L (2006) Surf: Speeded up robust features. In: Leonardis A, Bischof H, Pinz A (eds) Computer Vision - ECCV 2006, Lecture Notes in Computer Science. doi:10.1007/11744023_32, vol 3951. Springer, Berlin, pp 404–417

  9. Bing images. http://www.bing.com/images/

  10. Bishop CM (2006) Pattern recognition and machine learning. Information science and statistics. Springer

  11. Boureau YL, Bach F, LeCun Y, Ponce J (2010) Learning mid-level features for recognition 2010 IEEE conference on Computer vision and pattern recognition (CVPR), pp 2559–2566

  12. Calonder M, Lepetit V, Strecha C, Fua P (2010) Brief: Binary robust independent elementary features. In: Daniilidis K, Maragos P, Paragios N (eds) Computer Vision - ECCV 2010, Lecture Notes in Computer Science, vol 6314. Springer, Berlin Heidelberg, pp 778–792

  13. Chandrasekhar V, Lin J, Morère O, Goh H, Veillard A (2015) A practical guide to CNNs and Fisher vectors for image instance retrieval. arXiv:1508.02496

  14. Chen D, Tsai S, Chandrasekhar V, Takacs G, Chen H, Vedantham R, Grzeszczuk R, Girod B (2011) Residual enhanced visual vectors for on-device image matching 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR). doi:10.1016/j.sigpro.2012.06.005, pp 850–854

  15. Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: Automatic query expansion with a generative feature model for object retrieval IEEE 11th international conference on Computer vision, 2007. ICCV 2007, pp 1–8

  16. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. Workshop on statistical learning in computer vision. ECCV 1(1-22):1–2

  17. Datta R, Li J, Wang JZ (2005) Content-based image retrieval: Approaches and trends of the new age Proceedings of the 7th ACM SIGMM international workshop on multimedia information retrieval, MIR ’05. ACM, New York, pp 253–262

  18. Delhumeau J, Gosselin PH, Jégou H, Pérez P (2013) Revisiting the VLAD image representation Proceedings of the 21st ACM International Conference on Multimedia, MM 2013. doi:10.1145/2502081.2502171. ACM, New York, pp 653–656

  19. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009. doi:10.1109/CVPR.2009.5206848, pp 248–255

  20. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2013) Decaf: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531

  21. Galvez-Lopez D, Tardos J (2011) Real-time loop detection with bags of binary words. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011, pp 51–58

  22. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. http://www.deeplearningbook.org. Book in preparation for MIT Press

  23. Google googles. http://www.google.com/mobile/goggles/

  24. Google images. https://images.google.com/

  25. Grana C, Borghesani D, Manfredi M, Cucchiara R (2013) A fast approach for integrating ORB descriptors in the bag of words model. In: Snoek CGM, Kennedy LS, Creutzburg R, Akopian D, Wüller D, Matherson KJ, Georgiev TG, Lumsdaine A (eds) IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics

  26. Gray RM, Neuhoff DL (1998) Quantization. IEEE Trans Inf Theory 44 (6):2325–2383. doi:10.1109/18.720541

  27. Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160. doi:10.1002/j.1538-7305.1950.tb00463.x

  28. Heinly J, Dunn E, Frahm JM (2012) Comparative evaluation of binary features Computer vision - ECCV 2012, lecture notes in computer science. Springer, Berlin, pp 759–773

  29. Householder A (1964) The Theory of Matrices in Numerical Analysis. A Blaisdell book in pure and applied sciences: introduction to higher mathematics. Blaisdell Publishing Company

  30. Huiskes MJ, Lew MS (2008) The mir flickr retrieval evaluation MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, New York

  31. Jaakkola T, Haussler D (1998) Exploiting generative models in discriminative classifiers In Advances in Neural Information Processing Systems. http://dl.acm.org/citation.cfm?id=340534.340715, vol 11. MIT Press, pp 487–493

  32. Jégou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. In: Forsyth D, Torr P, Zisserman A (eds) European Conference on Computer Vision, LNCS, vol I. Springer, pp 304–317

  33. Jégou H, Douze M, Schmid C (2010) Improving bag-of-features for large scale image search. Int J Comput Vis 87(3):316–336. doi:10.1007/s11263-009-0285-2

  34. Jégou H, Douze M, Schmid C, Pérez P (2010) Aggregating local descriptors into a compact image representation IEEE Conference on Computer Vision & Pattern Recognition. doi:10.1109/CVPR.2010.5540039

  35. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33 (1):117–128. doi:10.1109/TPAMI.2010.57

  36. Jégou H, Perronnin F, Douze M, Sànchez J, Pérez P, Schmid C (2012) Aggregating local image descriptors into compact codes. IEEE Trans Pattern Anal Mach Intell 34(9):1704–1716. doi:10.1109/TPAMI.2011.235

  37. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding Proceedings of the ACM International Conference on Multimedia. doi:10.1145/2647868.2654889. ACM, pp 675–678

  38. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In: Dodge Y (ed) An introduction to l1-norm based statistical data analysis, computational statistics & data analysis, vol 5

  39. Krapac J, Verbeek J, Jurie F (2011) Modeling Spatial Layout with Fisher Vectors for Image Categorization ICCV 2011 - International conference on computer vision. IEEE, Barcelona, pp 1487–1494

  40. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges C, Bottou L, Weinberger K (eds) Advances in neural information processing systems, vol 25. Curran Associates, Inc., pp 1097–1105

  41. Lai H, Pan Y, Liu Y, Yan S (2015) Simultaneous feature learning and hash coding with deep neural networks The IEEE conference on computer vision and pattern recognition (CVPR)

  42. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2

  43. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521 (7553):436–444. doi:10.1038/nature14539

  44. Lee S, Choi S, Yang H (2015) Bag-of-binary-features for fast image representation. Electron Lett 51(7):555–557

  45. Leutenegger S, Chli M, Siegwart R (2011) Brisk: Binary robust invariant scalable keypoints IEEE International Conference on Computer vision (ICCV), 2011, pp 2548–2555

  46. Levi G, Hassner T (2015) LATCH: learned arrangements of three patch codes. CoRR abs/1501.03719

  47. Lin K, Yang HF, Hsiao JH, Chen CS (2015) Deep learning of binary hash codes for fast image retrieval The IEEE conference on computer vision and pattern recognition (CVPR) workshops

  48. Lloyd S (1982) Least squares quantization in pcm. IEEE Trans Inf Theory 28 (2):129–137. doi:10.1109/TIT.1982.1056489

  49. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110. doi:10.1023/B:VISI.0000029664.99615.94

  50. McLachlan G, Peel D (2000) Finite Mixture Models. Wiley series in probability and statistics. Wiley

  51. Miksik O, Mikolajczyk K (2012) Evaluation of local detectors and descriptors for fast feature matching 2012 21st international conference on Pattern recognition (ICPR), pp 2681–2684

  52. Perd’och M, Chum O, Matas J (2009) Efficient representation of local geometry for large scale object retrieval IEEE Conference on Computer vision and pattern recognition, 2009. CVPR 2009, pp 9–16

  53. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR ’07. doi:10.1109/CVPR.2007.383266, pp 1–8

  54. Perronnin F, Larlus D (2015) Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3743–3752

  55. Perronnin F, Liu Y, Sànchez J, Poirier H (2010) Large-scale image retrieval with compressed fisher vectors 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2010.5540009, pp 3384–3391

  56. Perronnin F, Sànchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification Computer Vision - ECCV 2010, Lecture Notes in Computer Science. doi:10.1007/978-3-642-15561-1_11, vol 6314. Springer, Berlin, pp 143–156

  57. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2007.383172, pp 1–8

  58. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: Improving particular object retrieval in large scale image databases IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008. doi:10.1109/CVPR.2008.4587635, pp 1–8

  59. Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). doi:10.1109/CVPRW.2014.131. IEEE, pp 512–519

  60. Rublee E, Rabaud V, Konolige K, Bradski G (2011) Orb: an efficient alternative to sift or surf 2011 IEEE International Conference on Computer vision (ICCV), pp 2564–2571

  61. Salton G, McGill MJ (1986) Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York

  62. Sànchez J, Redolfi J (2015) Exponential family fisher vector for image classification. Pattern Recogn Lett 59:26–32. doi:10.1016/j.patrec.2015.03.010

  63. Sànchez J, Perronnin F, Mensink T, Verbeek J (2013) Image classification with the fisher vector: Theory and practice. Int J Comput Vis 105 (3):222–245. doi:10.1007/s11263-013-0636-x

  64. Simonyan K, Vedaldi A, Zisserman A (2013) Deep fisher networks for large-scale image classification. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems 26. Curran Associates, Inc., pp 163–171

  65. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  66. Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV ’03. doi:10.1109/ICCV.2003.1238663, vol 2. IEEE Computer Society, pp 1470–1477

  67. Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380

  68. Sydorov V, Sakurada M, Lampert CH (2014) Deep fisher kernels - end to end learning of the fisher kernel gmm parameters The IEEE Conference on Computer vision and pattern recognition (CVPR)

  69. Tolias G, Avrithis Y (2011) Speeded-up, relaxed spatial matching 2011 IEEE International Conference on Computer Vision (ICCV). doi:10.1109/ICCV.2011.6126427, pp 1653–1660

  70. Tolias G, Furon T, Jégou H (2014) Orientation covariant aggregation of local descriptors with embeddings. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T (eds) Computer Vision - ECCV 2014, Lecture Notes in Computer Science, vol 8694. Springer International Publishing, pp 382–397

  71. Tolias G, Jégou H (2013) Local visual query expansion: Exploiting an image collection to refine local descriptors. Research Report RR-8325. https://hal.inria.fr/hal-00840721

  72. Uchida Y, Sakazawa S (2013) Image retrieval with fisher vectors of binary features 2013 2nd IAPR asian conference on Pattern recognition (ACPR), pp 23–28

  73. Ullman S (1996) High-Level Vision: Object recognition and visual cognition. MIT Press

  74. Uricchio T, Bertini M, Seidenari L, Del Bimbo A (2015) Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging The IEEE International Conference on Computer Vision (ICCV) Workshops

  75. van Gemert JC, Geusebroek JM, Veenman CJ, Smeulders AW (2008) Kernel codebooks for scene categorization. In: Forsyth D, Torr P, Zisserman A (eds) Computer Vision - ECCV 2008, Lecture Notes in Computer Science, vol 5304. Springer, Berlin, pp 696–709

  76. Van Opdenbosch D, Schroth G, Huitl R, Hilsenbeck S, Garcea A, Steinbach E (2014) Camera-based indoor positioning using scalable streaming of compressed binary image signatures IEEE International Conference on Image Processing

  77. Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for image classification 2010 IEEE conference on Computer vision and pattern recognition (CVPR), pp 3360–3367

  78. Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann

  79. Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification IEEE conference on Computer vision and pattern recognition, 2009. CVPR 2009, pp 1794–1801

  80. Yue-Hei Ng J, Yang F, Davis LS (2015) Exploiting local features from deep networks for image retrieval The IEEE conference on computer vision and pattern recognition (CVPR) workshops

  81. Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity Search: The Metric Space Approach. Advances in Database Systems, vol 32. Springer

  82. Zhang Y, Zhu C, Bres S, Chen L (2013) Encoding local binary descriptors by bag-of-features with hamming distance for visual object categorization. In: Serdyukov P, Braslavski P, Kuznetsov S, Kamps J, Rüger S, Agichtein E, Segalovich I, Yilmaz E (eds) Advances in Information Retrieval, Lecture Notes in Computer Science, vol 7814. Springer, Berlin, pp 630–641

  83. Zhao W, Jégou H, Gravier G (2013) Oriented pooling for dense and non-dense rotation-invariant features BMVC - 24Th british machine vision conference

  84. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger K (eds) Advances in neural information processing systems, vol 27, Curran Associates, Inc., pp 487–495

Acknowledgments

This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.


Corresponding author

Correspondence to Lucia Vadicamo.

Appendices

Appendix A: Score vector computation

In the following, we report the computation of the score function \(G_{\lambda }^{X}\), defined as the gradient of the log-likelihood of the data X with respect to the parameters λ of a Bernoulli Mixture Model. Throughout this appendix we use the notation [[⋅]] to represent the Iverson bracket, which equals one if its argument is true and zero otherwise.

Under the independence assumption, the Fisher score with respect to the generic parameter \(\lambda_k\) is expressed as \(G_{\lambda_k}^{X} ={\sum }_{t=1}^{T} \frac {\partial \log p(x_t|\lambda )}{\partial \lambda_k}= {\sum }_{t=1}^{T} \frac {1} {p(x_t|\lambda )}\frac {\partial }{\partial \lambda_k}\left [{\sum }_{i=1}^{K} w_i p_i(x_t)\right ]\). To compute \(\frac {\partial }{\partial \lambda_k}\left [{\sum }_{i=1}^{K} w_{i} p_{i}(x_t)\right ]\), we first observe that

$$\begin{array}{@{}rcl@{}} \frac{\partial{w_i}}{\partial \alpha_{k}} &=&\frac{\partial}{\partial \alpha_{k}}\left[\frac{\exp(\alpha_i)}{\sum\limits_{j=1}^{K}\exp(\alpha_j)}\right]\\ &=&\frac{\exp(\alpha_k)\left( \sum\limits_{j=1}^{K}\exp(\alpha_{j})\right) {[\kern-2pt[{i=k}]\kern-2pt]}-\exp(\alpha_i)\exp(\alpha_k)}{\left( \sum\limits_{j=1}^K\exp(\alpha_j)\right)^{2}}\\ &=& w_{k}{[\kern-2pt[{i=k}]\kern-2pt]}- w_{k}w_{i} \end{array} $$
(5)

and

$$\begin{array}{@{}rcl@{}} &&\frac{\partial p_{i}(x_{t})} {\partial \mu_{kd}}=\frac{\partial}{\partial \mu_{kd}}\left[{\prod}_{l=1}^{D} \mu_{kl}^{x_{tl}} \left( 1-\mu_{kl}\right)^{1-x_{tl}} \right]{[\kern-2pt[{i=k}]\kern-2pt]}\\ &&=\left( {[\kern-2pt[{x_{td}=1}]\kern-2pt]}- {[\kern-2pt[{x_{td}=0}]\kern-2pt]}\right) \left( {\prod}_{\underset{l\neq d}{l=1}}^{D} \mu_{kl}^{x_{tl}}\left( 1-\mu_{kl}\right)^{1-x_{tl}}\right){[\kern-2pt[{i=k}]\kern-2pt]} \\ &&=\left( {[\kern-2pt[{x_{td}=1}]\kern-2pt]}- {[\kern-2pt[{x_{td}=0}]\kern-2pt]}\right)\left( \frac{p_k(x_t)}{\mu_{kd}^{x_{td}}\left( 1-\mu_{kd}\right)^{1-x_{td}}}\right){[\kern-2pt[{i=k}]\kern-2pt]}\\ &&=p_k(x_t)\left( \frac{(1-\mu_{kd}){[\kern-2pt[{x_{td}=1}]\kern-2pt]}-\mu_{kd}{[\kern-2pt[{x_{td}=0}]\kern-2pt]}}{\mu_{kd}(1-\mu_{kd})}\right) {[\kern-2pt[{i=k}]\kern-2pt]}\\ &&=p_k(x_t)\left( \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}\right) {[\kern-2pt[{i=k}]\kern-2pt]}. \end{array} $$
(6)

Hence, the Fisher score with respect to the parameter \(\alpha_k\) is obtained as

$$\begin{array}{@{}rcl@{}} G_{\alpha_k}^{X} &=&\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{K}\frac{ p_{i}(x_{t})} {p(x_{t}|\lambda)}\frac{\partial w_{i}}{\partial \alpha_{k}}\overset{(5)}{=}\sum\limits_{t=1}^{T} \sum\limits_{i=1}^{K} \frac{ p_{i}(x_{t})} {p(x_{t}|\lambda)}w_k\left( {[\kern-2pt[{i=k}]\kern-2pt]}-w_{i}\right)\\ &=&\sum\limits_{t=1}^{T} \left( \frac{ p_{k}(x_{t})} {p(x_{t}|\lambda)}w_{k}-\sum\limits_{i=1}^{K} \frac{ p_i(x_t)} {p(x_t|\lambda)}w_kw_i\right)=\sum\limits_{t=1}^{T} \left( \gamma_t(k)-w_{k}\sum\limits_{i=1}^{K} \gamma_t(i)\right)\\ &=&\sum\limits_{t=1}^{T} \left( \gamma_t(k)-w_k \right) \end{array} $$
(7)

and the Fisher score related to the parameter \(\mu_{kd}\) is

$$\begin{array}{@{}rcl@{}} G_{\mu_{kd}}^{X} &=&\sum\limits_{t=1}^{T} \frac{\partial\log p(x_t|\lambda)}{\partial \mu_{kd}}=\sum\limits_{t=1}^{T} \frac{1} {p(x_t|\lambda)}\frac{\partial}{\partial \mu_{kd}}\left[\sum\limits_{i=1}^{K} w_{i} p_{i}(x_t)\right] \\ &=&\sum\limits_{t=1}^{T} \frac{w_k} {p(x_t|\lambda)}\frac{\partial p_k(x_t)}{\partial \mu_{kd}} \overset{(6)}{=}\sum\limits_{t=1}^{T} \frac{w_k p_k(x_t)}{p(x_t|\lambda)}\left( \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}\right) \\ &=&\sum\limits_{t=1}^{T} \gamma_t(k)\left( \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}\right). \end{array} $$
(8)
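
As a complement to the derivation above, the following is a minimal NumPy sketch (our own illustration with illustrative names, not an implementation from the paper) of (7) and (8): given a set X of T binary descriptors, it computes the posteriors \(\gamma_t(k)\) of a Bernoulli Mixture Model and the Fisher scores \(G_{\alpha}^{X}\) and \(G_{\mu}^{X}\):

```python
import numpy as np

def bmm_fisher_scores(X, w, mu, eps=1e-10):
    """Fisher scores of a Bernoulli Mixture Model, following (7) and (8).

    X  : (T, D) array of binary descriptors
    w  : (K,)   mixture weights, summing to one
    mu : (K, D) Bernoulli parameters in (0, 1)
    """
    # log p_k(x_t) for every descriptor/component pair: shape (T, K).
    log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
    log_wp = np.log(w + eps) + log_p

    # Posteriors gamma_t(k) = w_k p_k(x_t) / p(x_t|lambda), computed in log space.
    gamma = np.exp(log_wp - np.logaddexp.reduce(log_wp, axis=1, keepdims=True))

    # Equation (7): G_alpha_k = sum_t (gamma_t(k) - w_k).
    G_alpha = gamma.sum(axis=0) - X.shape[0] * w                   # (K,)

    # Equation (8): G_mu_kd = sum_t gamma_t(k) (x_td - mu_kd) / (mu_kd (1 - mu_kd)).
    G_mu = (gamma.T @ X - gamma.sum(axis=0)[:, None] * mu) / (mu * (1 - mu))
    return G_alpha, G_mu
```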

Appendix B: Approximation of the Fisher information matrix

Our derivation of the FIM is based on the assumption (see also [55, 63]) that for each observation \(x=(x_1,\dots,x_D)\in\{0,1\}^D\) the distribution of the occupancy probability \(\gamma(\cdot) = p(\cdot|x,\lambda)\) is sharply peaked, i.e. there is one Bernoulli index k such that \(\gamma_x(k)\approx 1\) and \(\gamma_x(i)\approx 0\) for all \(i\neq k\). This assumption implies that

$$\begin{array}{@{}rcl@{}} &&\gamma_x(k)\gamma_x(i)\approx 0 \quad \forall\,k,i=1\dots, K, i\neq k\\ &&\gamma_x(k)^2\approx \gamma_x(k) \quad \forall\, k=1,\dots,K \end{array} $$

and then

$$ \gamma_x(k)\gamma_x(i)\approx\gamma_x(k) {[\kern-2pt[{i=k}]\kern-2pt]}, $$
(9)

where [[⋅]] is the Iverson bracket. The elements of the FIM are defined as:

$$\begin{array}{@{}rcl@{}} [F_{\lambda}]_{i,j}=\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \lambda_i}\right) \left( \frac{\partial\log p(x|\lambda)}{\partial \lambda_j}\right)\right]. \end{array} $$
(10)

Hence, the FIM \(F_{\lambda}\) is symmetric and can be written as the block matrix

$$\begin{array}{@{}rcl@{}} F_{\lambda}=\left[\begin{array}{ll} F_{\alpha,\alpha} & F_{\mu,\alpha}\\ F_{\mu,\alpha}^{\top} & F_{\mu,\mu} \end{array}\right]. \end{array} $$

By using the definition of the occupancy probability (i.e. \(\gamma_x(k) = w_k p_k(x)/p(x|\lambda)\)) and the fact that \(p_k\) is the distribution of a D-dimensional Bernoulli of mean \(\mu_k\), we have the following useful equalities:

$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_{x}(k)\right]= \sum\limits_{x\in\{0,1\}^{D}}\gamma_{x}(k)p(x|\lambda){=}w_{k} \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_{x}(k)x_{d}\right]{=}w_{k}\mu_{kd} \end{array} $$
(12)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(k)x_{d}x_{l}\right]{=}w_k\mu_{kd}\left( \mu_{kl}{[\kern-2pt[{d\neq l}]\kern-2pt]} +{[\kern-2pt[{d= l}]\kern-2pt]}\right) \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\partial\log p(x|\lambda)}{\partial \alpha_{k}}\right]\overset{(7)}{=}\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(k)-w_{k}\right] {=}0 \end{array} $$
(14)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\partial\log p(x|\lambda)}{\partial \mu_{kd}}\right]\overset{(8)}{=} \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\gamma_x(k)(x_{d}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})}\right] {=}0. \end{array} $$
(15)

It follows that \(F_{\lambda}\) may be approximated by a block-diagonal matrix, because the mixed blocks \(F_{\mu_{kd},\alpha_{i}}\) are close to the zero matrix:

$$\begin{array}{@{}rcl@{}} F_{\mu_{kd},\alpha_{i}} &=& \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \mu_{kd}}\right)\left( \frac{\partial\log p(x|\lambda)}{\partial \alpha_i}\right)\right]\\ &\overset{(7)-(8)}{=}& \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(k)\frac{(x_{d}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})}(\gamma_x(i)-w_i) \right]\\ &\overset{(9)}\approx &\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\gamma_x(k)(x_{d}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})}\right]\left( {[\kern-2pt[{i=k}]\kern-2pt]}-w_i\right)\\ &\overset{(15)}{=}&0. \end{array} $$

The block \(F_{\mu,\mu}\) can be written as a \(KD\times KD\) diagonal matrix; in fact:

$$\begin{array}{@{}rcl@{}} F_{\mu_{id},\mu_{kl}} &\overset{(10)}{=}& \mathbb{E}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \mu_{id} }\right)\left( \frac{\partial\log p(x|\lambda)}{\partial \mu_{kl}}\right)\right]\\ &\overset{(8)}{=}&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(i)\gamma_x(k)\frac{(x_{d}-\mu_{id})}{\mu_{id}(1-\mu_{id})}\frac{(x_{l}-\mu_{kl})}{\mu_{kl}(1-\mu_{kl})} \right]\\ &\overset{(9)}{\approx}&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[ \frac{\gamma_x(k)(x_{d}-\mu_{kd})(x_{l}-\mu_{kl})}{\mu_{kd}\mu_{kl}(1-\mu_{kd})(1-\mu_{kl})}\right]{[\kern-2pt[{i=k}]\kern-2pt]}\\ &\overset{(11)-(13)}{=}& \frac{w_k(\mu_{kd}\mu_{kl}{[\kern-2pt[{d\neq l}]\kern-2pt]} +\mu_{kl}{[\kern-2pt[{d= l}]\kern-2pt]}-\mu_{kd}\mu_{kl})}{\mu_{kd}\mu_{kl}(1-\mu_{kd})(1-\mu_{kl})}{[\kern-2pt[{i=k}]\kern-2pt]}\\ &=& \frac{ w_k(\mu_{kd}{[\kern-2pt[{d\neq l}]\kern-2pt]} +{[\kern-2pt[{d= l}]\kern-2pt]}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})(1-\mu_{kl})}{[\kern-2pt[{i=k}]\kern-2pt]}\\ &=& \frac{ w_k}{\mu_{kd}(1-\mu_{kd})}{[\kern-2pt[{i=k}]\kern-2pt]}{[\kern-2pt[{d=l}]\kern-2pt]}. \end{array} $$
(16)

The relation (16) shows that the diagonal elements of our FIM approximation are \(w_k/\left(\mu_{kd}(1-\mu_{kd})\right)\) and that the corresponding entries in \(L_{\lambda}\) (i.e. the square root of the inverse of the FIM) equal \(\sqrt{{\mu_{kd}(1-\mu_{kd})}/{w_k}}\). The block related to the α parameters is \(F_{\alpha,\alpha}=\text{diag}(w)-ww^{\top}\), where \(w=[w_1,\dots,w_K]^{\top}\); in fact:

$$\begin{array}{@{}rcl@{}} F_{\alpha_{k},\alpha_{i}}&\overset{(10)}{=}& \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \alpha_{k} }\right)\left( \frac{\partial\log p(x|\lambda)}{\partial \alpha_i}\right)\right]\\ &\quad\overset{(7)}{=}&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[(\gamma_x(k)-w_k)(\gamma_x(i)-w_i) \right]\\ &\overset{(9)}{\approx}&\mathbb{E}_{p(\cdot|\lambda)}\left[\gamma_x(k){[\kern-2pt[{i=k}]\kern-2pt]}- \gamma_x(k)w_i-\gamma_x(i)w_k +w_iw_k \right]\\ &\overset{(11)-(12)}{=}&\left( w_k{[\kern-2pt[{i=k}]\kern-2pt]}-w_iw_k \right). \end{array} $$

The matrix \(F_{\alpha,\alpha}\) is not invertible (indeed \(F_{\alpha,\alpha}\mathbf{e}=0\), where \(\mathbf{e}=[1,\dots,1]^{\top}\)) due to the dependence among the mixing weights \(\left({\sum}_{i=1}^{K} w_i=1\right)\). Since there are only K−1 degrees of freedom in the mixing weights, as proposed in [63], we can fix \(\alpha_K\) equal to a constant without loss of generality and work with a reduced set of K−1 parameters: \(\tilde{\alpha}=[\alpha_1,\dots,\alpha_{K-1}]^{\top}\).

Taking into account the Fisher score with respect to \(\tilde {\alpha }\), i.e.

$$G_{\tilde{\alpha}}^{X}= \nabla_{\tilde{\alpha}}\log p(X|\lambda)=[G_{{\alpha_1}}^{X},\dots, G_{{\alpha_{K-1}}}^{X}]^{\top} =\widetilde{G_{\alpha}^X}, $$

the corresponding block of the FIM is \(F_{\tilde{\alpha},\tilde{\alpha}}= \text{diag}(\tilde{w})-\tilde{w}\tilde{w}^{\top}\), where \(\tilde{w}=[w_1,\dots,w_{K-1}]^{\top}\). The matrix \(F_{\tilde{\alpha},\tilde{\alpha}}\) is invertible; indeed, it can be decomposed into the product of the invertible diagonal matrix \(D=\text{diag}(\tilde{w})\) and the invertible elementary matrix (see note 8) \(E(\mathbf{e},\tilde{w},1)= I-\mathbf{e}\tilde{w}^{\top}\). Its inverse is

$$F_{\tilde{\alpha},\tilde{\alpha}}^{-1}=\left( I+\frac{1}{w_K}\mathbf{e}\tilde{w}^{\top} \right)\text{diag}(\tilde{w})^{-1}= \text{diag}(\tilde{w})^{-1}+\frac{1}{w_K}\mathbf{e}\mathbf{e}^{\top}, $$

where, by note 8, \(E(\mathbf{e},\tilde{w},1)^{-1}=E\left(\mathbf{e},\tilde{w},-\frac{1}{w_K}\right)=I+\frac{1}{w_K}\mathbf{e}\tilde{w}^{\top}\), since \(\tilde{w}^{\top}\mathbf{e}={\sum}_{i=1}^{K-1}w_i=1-w_K\).

It follows that

$$K_{\tilde{\alpha}}(X,Y)=\left( G_{\tilde{\alpha}}^{X}\right)^{\top} F_{\tilde{\alpha},\tilde{\alpha}}^{-1}\, G_{\tilde{\alpha}}^{Y}=\left( G_{\tilde{\alpha}}^{X}\right)^{\top} \text{diag}(\tilde{w})^{-1}G_{\tilde{\alpha}}^{Y}+\frac{1}{w_K}\left( \mathbf{e}^{\top} G_{\tilde{\alpha}}^{X}\right)\left( \mathbf{e}^{\top} G_{\tilde{\alpha}}^{Y}\right)=\sum\limits_{k=1}^{K} \frac{G_{{\alpha_k}}^{X}\, G_{{\alpha_k}}^{Y}}{w_{k}} $$

where we used \(\mathbf {e}^{\top } G_{\tilde {\alpha }}^{Z}={\sum }_{k=1}^{K-1}{\sum }_{z\in Z} \left (\gamma _{z}(k)-w_k \right ) =-{\sum }_{z\in Z} \left (\gamma _{z}(K)-w_K\right )=-G_{{\alpha _K}}^{Z}\).

By defining \(\mathcal{G}_{\alpha_k}^{X} =\frac{1}{\sqrt{w_k}}{\sum}_{x\in X} \left(\gamma_x(k)-w_{k}\right)\), we finally obtain \(K_{\tilde{\alpha}}(X,Y)=\left(\mathcal{G}_{\alpha}^{X}\right)^{\top} \mathcal{G}_{\alpha}^{Y}\). Please note that we do not need to explicitly compute the Cholesky decomposition of the matrix \(F_{\tilde{\alpha},\tilde{\alpha}}^{-1}\), because the Fisher Kernel \(K_{\tilde{\alpha}}(X,Y)\) can be rewritten as a dot product between the feature vectors \(\mathcal{G}_{{\alpha}}^{X}\) and \(\mathcal{G}_{{\alpha}}^{Y}\).
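
Combining the two appendices: the whitened FV components are \(\mathcal{G}_{\alpha_k}^{X}=G_{\alpha_k}^{X}/\sqrt{w_k}\) and \(\mathcal{G}_{\mu_{kd}}^{X}=\sqrt{\mu_{kd}(1-\mu_{kd})/w_k}\,G_{\mu_{kd}}^{X}\). The sketch below (again our illustration, reusing the hypothetical bmm_fisher_scores helper from Appendix A; the final power- and L2-normalization follows common FV practice [56, 63] rather than a prescription of this appendix) assembles the Fisher Vector of a set of binary descriptors:

```python
import numpy as np

def binary_fisher_vector(X, w, mu):
    """Fisher Vector of binary descriptors under the diagonal FIM approximation.

    The alpha block is scaled by 1/sqrt(w_k) and the mu block by
    sqrt(mu_kd (1 - mu_kd) / w_k), i.e. the entries of L_lambda derived above.
    """
    G_alpha, G_mu = bmm_fisher_scores(X, w, mu)         # sketch from Appendix A
    fv_alpha = G_alpha / np.sqrt(w)                     # (K,)
    fv_mu = G_mu * np.sqrt(mu * (1 - mu) / w[:, None])  # (K, D)
    fv = np.concatenate([fv_alpha, fv_mu.ravel()])
    # Power- and L2-normalization, customary for Fisher Vectors [56, 63].
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```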


Cite this article

Amato, G., Falchi, F. & Vadicamo, L. Aggregating binary local descriptors for image retrieval. Multimed Tools Appl 77, 5385–5415 (2018). https://doi.org/10.1007/s11042-017-4450-2
