Aggregating binary local descriptors for image retrieval

Multimedia Tools and Applications

Abstract

Content-Based Image Retrieval based on local features is computationally expensive because of the complexity of both extraction and matching of local features. On the one hand, the cost of extracting, representing, and comparing local visual descriptors has been dramatically reduced by recently proposed binary local features. On the other hand, aggregation techniques provide a meaningful summarization of all the extracted features of an image into a single descriptor, allowing us to speed up and scale up the image search. Only a few recent works have combined these two research directions, defining aggregation methods for binary local features in order to leverage the advantages of both approaches. In this paper, we report an extensive comparison among state-of-the-art aggregation methods applied to binary features. Then, we mathematically formalize the application of Fisher Kernels to Bernoulli Mixture Models. Finally, we investigate the combination of the aggregated binary features with the emerging Convolutional Neural Network (CNN) features. Our results show that aggregation methods on binary features are effective and represent a worthwhile alternative to direct matching. Moreover, the combination of the CNN with the Fisher Vector (FV) built upon binary features allowed us to obtain a relative improvement over the CNN results that is in line with that recently obtained using the combination of the CNN with the FV built upon SIFTs. The advantage of using the FV built upon binary features is that the extraction of binary features is about two orders of magnitude faster than that of SIFTs.

Notes

  1. With respect to the experimental setting used in our previous work [3], we improved the computation of the local features before the aggregation phase, which allowed us to obtain better performance for BoW and VLAD on the INRIA Holidays dataset than that reported in [3].

  2. A Bernoulli distribution \(p(x) = \mu^{x}(1-\mu)^{1-x}\) of parameter μ can be written in exponential-family form as \(p(x) = \exp\left(\eta x - \log(1 + e^{\eta})\right)\), where \(\eta = \log \left(\frac{\mu}{1-\mu}\right)\) is the natural parameter; indeed \(p(x)=\exp\left(x\log\mu+(1-x)\log(1-\mu)\right)=\exp\left(\eta x + \log(1-\mu)\right)\) and \(\log(1-\mu)=-\log(1+e^{\eta})\). In [62] the score function is computed considering the gradient w.r.t. the natural parameter η, while in this paper we used the gradient w.r.t. the standard parameter μ of the Bernoulli (as also done in [72]).

  3. http://opencv.org/.

  4. https://github.com/ffalchi/it.cnr.isti.vir.

  5. https://github.com/BVLC/caffe/wiki/Model-Zoo.

  6. https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.

  7. To search a database for the objects similar to a query, we can use either a similarity function or a distance function. In the first case, we search for the objects with the greatest similarity to the query; in the latter case, we search for the objects with the lowest distance from the query. A similarity function is said to be equivalent to a distance function if the ranked list of the results to a query is the same. For example, the Euclidean distance between two vectors (\(\ell_2(x_1,x_2)=\|x_1-x_2\|_2\)) is equivalent to the cosine similarity (\(s_{\text{cos}}(x_1,x_2)=(x_1 \cdot x_2)/(\|x_1\|_2\,\|x_2\|_2)\)) whenever the vectors are L2-normalized (i.e. \(\|x_1\|_2=\|x_2\|_2=1\)). In fact, in such a case, \(s_{\text{cos}}(x_{1},x_{2})=1-\frac{1}{2}{\ell_{2}(x_{1},x_{2})}^{2}\), which implies that the ranked list of the results to a query is the same (i.e., \(\ell_2(x_1,x_2)\leq \ell_2(x_1,x_3)\) iff \(s_{\text{cos}}(x_1,x_2)\geq s_{\text{cos}}(x_1,x_3)\) for all \(x_1,x_2,x_3\)). A short numerical check of this identity is sketched after these notes.

  8. An elementary matrix \(E(u,v,\sigma) = I - \sigma u v^{H}\) is non-singular if and only if \(\sigma v^{H} u \neq 1\), and in this case the inverse is \(E(u,v,\sigma)^{-1} = E(u,v,\tau)\), where \(\tau = \sigma/(\sigma v^{H} u - 1)\). More details on this topic can be found in [29].
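
As a sanity check of the equivalence discussed in note 7, here is a minimal NumPy sketch (our illustration, not part of the original paper) that verifies \(s_{\text{cos}}(x_1,x_2)=1-\frac{1}{2}{\ell_2(x_1,x_2)}^{2}\) on a pair of L2-normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, L2-normalized as required in note 7.
x1, x2 = rng.standard_normal((2, 128))
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)

s_cos = float(x1 @ x2)        # cosine similarity of unit vectors
l2 = np.linalg.norm(x1 - x2)  # Euclidean distance

# For unit vectors ||x1 - x2||^2 = 2 - 2 (x1 . x2), hence s_cos = 1 - l2^2 / 2,
# so ranking by greatest similarity and ranking by lowest distance coincide.
assert np.isclose(s_cos, 1 - 0.5 * l2**2)
```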

References

  1. Alcantarilla PF, Nuevo J, Bartoli A (2013) Fast explicit diffusion for accelerated features in nonlinear scale spaces British machine vision conference (BMVC)

  2. Amato G, Falchi F, Gennaro C, Vadicamo L (2016) Deep Permutations: Deep Convolutional Neural Networks and Permutation-Based Indexing. Springer International Publishing, Cham, pp 93–106. doi:10.1007/978-3-319-46759-7_7

  3. Amato G, Falchi F, Vadicamo L (2016) How effective are aggregation methods on binary features? Proceedings of the 11th joint conference on computer vision, imaging and computer graphics theory and applications, vol 4, pp 566–573

  4. Amato G, Falchi F, Vadicamo L (2016) Visual Recognition of Ancient Inscriptions Using Convolutional Neural Network and Fisher Vector, J Comput Cult Herit (JOCCH) Article 21 9, 4 (December 2016) 24 pages. doi:10.1145/2964911

  5. Arandjelovic R, Zisserman A (2012) Three things everyone should know to improve object retrieval 2012 IEEE conference on Computer vision and pattern recognition (CVPR), pp 2911–2918

  6. Arandjelovic R, Zisserman A (2013) All about VLAD 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2013.207, pp 1578–1585

  7. Babenko A, Slesarev A, Chigorin A, Lempitsky V (2014) Neural codes for image retrieval Computer Vision–ECCV 2014. doi:10.1007/978-3-319-10590-1_38. Springer, pp 584–599

  8. Bay H, Tuytelaars T, Van Gool L (2006) Surf: Speeded up robust features. In: Leonardis A, Bischof H, Pinz A (eds) Computer Vision - ECCV 2006, Lecture Notes in Computer Science. doi:10.1007/11744023_32, vol 3951. Springer, Berlin, pp 404–417

  9. Bing images. http://www.bing.com/images/

  10. Bishop CM (2006) Pattern recognition and machine learning. Information science and statistics. Springer

  11. Boureau YL, Bach F, LeCun Y, Ponce J (2010) Learning mid-level features for recognition 2010 IEEE conference on Computer vision and pattern recognition (CVPR), pp 2559–2566

  12. Calonder M, Lepetit V, Strecha C, Fua P (2010) Brief: Binary robust independent elementary features. In: Daniilidis K, Maragos P, Paragios N (eds) Computer Vision - ECCV 2010, Lecture Notes in Computer Science, vol 6314. Springer, Berlin Heidelberg, pp 778–792

  13. Chandrasekhar V, Lin J, Morère O, Goh H, Veillard A (2015) A practical guide to CNNs and Fisher vectors for image instance retrieval. arXiv:1508.02496

  14. Chen D, Tsai S, Chandrasekhar V, Takacs G, Chen H, Vedantham R, Grzeszczuk R, Girod B (2011) Residual enhanced visual vectors for on-device image matching 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR). doi:10.1016/j.sigpro.2012.06.005, pp 850–854

  15. Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: Automatic query expansion with a generative feature model for object retrieval IEEE 11th international conference on Computer vision, 2007. ICCV 2007, pp 1–8

  16. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. Workshop on statistical learning in computer vision. ECCV 1(1-22):1–2

  17. Datta R, Li J, Wang JZ (2005) Content-based image retrieval: Approaches and trends of the new age Proceedings of the 7th ACM SIGMM international workshop on multimedia information retrieval, MIR ’05. ACM, New York, pp 253–262

  18. Delhumeau J, Gosselin PH, Jégou H, Pérez P (2013) Revisiting the VLAD image representation Proceedings of the 21st ACM International Conference on Multimedia, MM 2013. doi:10.1145/2502081.2502171. ACM, New York, pp 653–656

  19. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009. doi:10.1109/CVPR.2009.5206848, pp 248–255

  20. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2013) Decaf: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531

  21. Galvez-Lopez D, Tardos J (2011) Real-time loop detection with bags of binary words. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011, pp 51–58

  22. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. http://www.deeplearningbook.org. Book in preparation for MIT Press

  23. Google googles. http://www.google.com/mobile/goggles/

  24. Google images. https://images.google.com/

  25. Grana C, Borghesani D, Manfredi M, Cucchiara R (2013) A fast approach for integrating ORB descriptors in the bag of words model. In: Snoek CGM, Kennedy LS, Creutzburg R, Akopian D, Wüller D, Matherson KJ, Georgiev TG, Lumsdaine A (eds) IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics

  26. Gray RM, Neuhoff DL (1998) Quantization. IEEE Trans Inf Theory 44 (6):2325–2383. doi:10.1109/18.720541

  27. Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160. doi:10.1002/j.1538-7305.1950.tb00463.x

  28. Heinly J, Dunn E, Frahm JM (2012) Comparative evaluation of binary features Computer vision - ECCV 2012, lecture notes in computer science. Springer, Berlin, pp 759–773

  29. Householder A (1964) The Theory of Matrices in Numerical Analysis. A Blaisdell book in pure and applied sciences: introduction to higher mathematics. Blaisdell Publishing Company

  30. Huiskes MJ, Lew MS (2008) The mir flickr retrieval evaluation MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, New York

  31. Jaakkola T, Haussler D (1998) Exploiting generative models in discriminative classifiers In Advances in Neural Information Processing Systems. http://dl.acm.org/citation.cfm?id=340534.340715, vol 11. MIT Press, pp 487–493

  32. Jégou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. In: Forsyth D, Torr P, Zisserman A (eds) European Conference on Computer Vision, LNCS, vol I. Springer, pp 304–317

  33. Jégou H, Douze M, Schmid C (2010) Improving bag-of-features for large scale image search. Int J Comput Vis 87(3):316–336. doi:10.1007/s11263-009-0285-2

  34. Jégou H, Douze M, Schmid C, Pérez P (2010) Aggregating local descriptors into a compact image representation IEEE Conference on Computer Vision & Pattern Recognition. doi:10.1109/CVPR.2010.5540039

  35. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33 (1):117–128. doi:10.1109/TPAMI.2010.57

  36. Jégou H, Perronnin F, Douze M, Sànchez J, Pérez P, Schmid C (2012) Aggregating local image descriptors into compact codes. IEEE Trans Pattern Anal Mach Intell 34(9):1704–1716. doi:10.1109/TPAMI.2011.235

  37. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding Proceedings of the ACM International Conference on Multimedia. doi:10.1145/2647868.2654889. ACM, pp 675–678

  38. Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. In: Dodge Y (ed) An introduction to l1-norm based statistical data analysis, computational statistics & data analysis, vol 5

  39. Krapac J, Verbeek J, Jurie F (2011) Modeling Spatial Layout with Fisher Vectors for Image Categorization ICCV 2011 - International conference on computer vision. IEEE, Barcelona, pp 1487–1494

  40. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges C, Bottou L, Weinberger K (eds) Advances in neural information processing systems, vol 25. Curran Associates, Inc., pp 1097–1105

  41. Lai H, Pan Y, Liu Y, Yan S (2015) Simultaneous feature learning and hash coding with deep neural networks The IEEE conference on computer vision and pattern recognition (CVPR)

  42. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2

  43. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521 (7553):436–444. doi:10.1038/nature14539

  44. Lee S, Choi S, Yang H (2015) Bag-of-binary-features for fast image representation. Electron Lett 51(7):555–557

  45. Leutenegger S, Chli M, Siegwart R (2011) Brisk: Binary robust invariant scalable keypoints IEEE International Conference on Computer vision (ICCV), 2011, pp 2548–2555

  46. Levi G, Hassner T (2015) LATCH: learned arrangements of three patch codes. CoRR abs/1501.03719

  47. Lin K, Yang HF, Hsiao JH, Chen CS (2015) Deep learning of binary hash codes for fast image retrieval The IEEE conference on computer vision and pattern recognition (CVPR) workshops

  48. Lloyd S (1982) Least squares quantization in pcm. IEEE Trans Inf Theory 28 (2):129–137. doi:10.1109/TIT.1982.1056489

  49. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110. doi:10.1023/B:VISI.0000029664.99615.94

  50. McLachlan G, Peel D (2000) Finite Mixture Models. Wiley series in probability and statistics. Wiley

  51. Miksik O, Mikolajczyk K (2012) Evaluation of local detectors and descriptors for fast feature matching 2012 21st international conference on Pattern recognition (ICPR), pp 2681–2684

  52. Perd’och M, Chum O, Matas J (2009) Efficient representation of local geometry for large scale object retrieval IEEE Conference on Computer vision and pattern recognition, 2009. CVPR 2009, pp 9–16

  53. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR ’07. doi:10.1109/CVPR.2007.383266, pp 1–8

  54. Perronnin F, Larlus D (2015) Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3743–3752

  55. Perronnin F, Liu Y, Sànchez J, Poirier H (2010) Large-scale image retrieval with compressed fisher vectors 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2010.5540009, pp 3384–3391

  56. Perronnin F, Sànchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification Computer Vision - ECCV 2010, Lecture Notes in Computer Science. doi:10.1007/978-3-642-15561-1_11, vol 6314. Springer, Berlin, pp 143–156

  57. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/CVPR.2007.383172, pp 1–8

  58. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: Improving particular object retrieval in large scale image databases IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008. doi:10.1109/CVPR.2008.4587635, pp 1–8

  59. Razavian AS, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). doi:10.1109/CVPRW.2014.131. IEEE, pp 512–519

  60. Rublee E, Rabaud V, Konolige K, Bradski G (2011) Orb: an efficient alternative to sift or surf 2011 IEEE International Conference on Computer vision (ICCV), pp 2564–2571

  61. Salton G, McGill MJ (1986) Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York

  62. Sànchez J, Redolfi J (2015) Exponential family fisher vector for image classification. Pattern Recogn Lett 59:26–32. doi:10.1016/j.patrec.2015.03.010

  63. Sànchez J, Perronnin F, Mensink T, Verbeek J (2013) Image classification with the fisher vector: Theory and practice. Int J Comput Vis 105 (3):222–245. doi:10.1007/s11263-013-0636-x

  64. Simonyan K, Vedaldi A, Zisserman A (2013) Deep fisher networks for large-scale image classification. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems 26. Curran Associates, Inc., pp 163–171

  65. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  66. Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV ’03. doi:10.1109/ICCV.2003.1238663, vol 2. IEEE Computer Society, pp 1470–1477

  67. Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380

  68. Sydorov V, Sakurada M, Lampert CH (2014) Deep fisher kernels - end to end learning of the fisher kernel gmm parameters The IEEE Conference on Computer vision and pattern recognition (CVPR)

  69. Tolias G, Avrithis Y (2011) Speeded-up, relaxed spatial matching 2011 IEEE International Conference on Computer Vision (ICCV). doi:10.1109/ICCV.2011.6126427, pp 1653–1660

  70. Tolias G, Furon T, Jégou H (2014) Orientation covariant aggregation of local descriptors with embeddings. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T (eds) Computer Vision - ECCV 2014, Lecture Notes in Computer Science, vol 8694. Springer International Publishing, pp 382–397

  71. Tolias G, Jégou H (2013) Local visual query expansion: Exploiting an image collection to refine local descriptors. Research Report RR-8325. https://hal.inria.fr/hal-00840721

  72. Uchida Y, Sakazawa S (2013) Image retrieval with fisher vectors of binary features 2013 2nd IAPR asian conference on Pattern recognition (ACPR), pp 23–28

  73. Ullman S (1996) High-Level Vision: Object recognition and visual cognition. MIT Press

  74. Uricchio T, Bertini M, Seidenari L, Del Bimbo A (2015) Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging The IEEE International Conference on Computer Vision (ICCV) Workshops

  75. van Gemert JC, Geusebroek JM, Veenman CJ, Smeulders AW (2008) Kernel codebooks for scene categorization. In: Forsyth D, Torr P, Zisserman A (eds) Computer Vision - ECCV 2008, Lecture Notes in Computer Science, vol 5304. Springer, Berlin, pp 696–709

  76. Van Opdenbosch D, Schroth G, Huitl R, Hilsenbeck S, Garcea A, Steinbach E (2014) Camera-based indoor positioning using scalable streaming of compressed binary image signatures IEEE International Conference on Image Processing

  77. Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for image classification 2010 IEEE conference on Computer vision and pattern recognition (CVPR), pp 3360–3367

  78. Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann

  79. Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification IEEE conference on Computer vision and pattern recognition, 2009. CVPR 2009, pp 1794–1801

  80. Yue-Hei Ng J, Yang F, Davis LS (2015) Exploiting local features from deep networks for image retrieval The IEEE conference on computer vision and pattern recognition (CVPR) workshops

  81. Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity Search: The Metric Space Approach. Advances in Database Systems, vol 32. Springer

  82. Zhang Y, Zhu C, Bres S, Chen L (2013) Encoding local binary descriptors by bag-of-features with hamming distance for visual object categorization. In: Serdyukov P, Braslavski P, Kuznetsov S, Kamps J, Rüger S, Agichtein E, Segalovich I, Yilmaz E (eds) Advances in Information Retrieval, Lecture Notes in Computer Science, vol 7814. Springer, Berlin, pp 630–641

  83. Zhao W, Jégou H, Gravier G (2013) Oriented pooling for dense and non-dense rotation-invariant features BMVC - 24Th british machine vision conference

  84. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger K (eds) Advances in neural information processing systems, vol 27, Curran Associates, Inc., pp 487–495

Acknowledgments

This work was partially funded by: EAGLE, Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 - Europeana and creativity, Grant Agreement n. 325122; and Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.


Corresponding author

Correspondence to Lucia Vadicamo.

Appendices

Appendix A: Score vector computation

In the following, we report the computation of the score function \(G_{\lambda }^{X}\), defined as the gradient of the log-likelihood of the data X with respect to the parameters λ of a Bernoulli Mixture Model. Throughout this appendix we use the notation [[⋅]] to represent the Iverson bracket, which equals one if its argument is true and zero otherwise.

Under the independence assumption, the Fisher score with respect to the generic parameter \(\lambda_k\) is expressed as \(G_{\lambda_k}^{X} ={\sum }_{t=1}^{T} \frac {\partial \log p(x_t|\lambda )}{\partial \lambda_k}= {\sum }_{t=1}^{T} \frac {1} {p(x_t|\lambda )}\frac {\partial }{\partial \lambda_k}\left [{\sum }_{i=1}^{K} w_i p_i(x_t)\right ]\). To compute \(\frac {\partial }{\partial \lambda_k}\left [{\sum }_{i=1}^{K} w_{i} p_{i}(x_t)\right ]\), we first observe that

$$\begin{array}{@{}rcl@{}} \frac{\partial{w_i}}{\partial \alpha_{k}} &=&\frac{\partial}{\partial \alpha_{k}}\left[\frac{\exp(\alpha_i)}{\sum\limits_{j=1}^{K}\exp(\alpha_j)}\right]\\ &=&\frac{\exp(\alpha_k)\left( \sum\limits_{j=1}^{K}\exp(\alpha_{j})\right) {[\kern-2pt[{i=k}]\kern-2pt]}-\exp(\alpha_i)\exp(\alpha_k)}{\left( \sum\limits_{j=1}^K\exp(\alpha_j)\right)^{2}}\\ &=& w_{k}{[\kern-2pt[{i=k}]\kern-2pt]}- w_{k}w_{i} \end{array} $$
(5)

and

$$\begin{array}{@{}rcl@{}} &&\frac{\partial p_{i}(x_{t})} {\partial \mu_{kd}}=\frac{\partial}{\partial \mu_{kd}}\left[{\prod}_{l=1}^{D} \mu_{kl}^{x_{tl}} \left( 1-\mu_{kl}\right)^{1-x_{tl}} \right]{[\kern-2pt[{i=k}]\kern-2pt]}\\ &&=\left( {[\kern-2pt[{x_{td}=1}]\kern-2pt]}- {[\kern-2pt[{x_{td}=0}]\kern-2pt]}\right) \left( {\prod}_{\underset{l\neq d}{l=1}}^{D} \mu_{kl}^{x_{tl}}\left( 1-\mu_{kl}\right)^{1-x_{tl}}\right){[\kern-2pt[{i=k}]\kern-2pt]} \\ &&=\left( {[\kern-2pt[{x_{td}=1}]\kern-2pt]}- {[\kern-2pt[{x_{td}=0}]\kern-2pt]}\right)\left( \frac{p_k(x_t)}{\mu_{kd}^{x_{td}}\left( 1-\mu_{kd}\right)^{1-x_{td}}}\right){[\kern-2pt[{i=k}]\kern-2pt]}\\ &&=p_k(x_t)\left( \frac{(1-\mu_{kd}){[\kern-2pt[{x_{td}=1}]\kern-2pt]}-\mu_{kd}{[\kern-2pt[{x_{td}=0}]\kern-2pt]}}{\mu_{kd}(1-\mu_{kd})}\right) {[\kern-2pt[{i=k}]\kern-2pt]}\\ &&=p_k(x_t)\left( \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}\right) {[\kern-2pt[{i=k}]\kern-2pt]}. \end{array} $$
(6)

Hence, the Fisher score with respect to the parameter \(\alpha_k\) is obtained as

$$\begin{array}{@{}rcl@{}} G_{\alpha_k}^{X} &=&\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{K}\frac{ p_{i}(x_{t})} {p(x_{t}|\lambda)}\frac{\partial w_{i}}{\partial \alpha_{k}}\overset{(5)}{=}\sum\limits_{t=1}^{T} \sum\limits_{i=1}^{K} \frac{ p_{i}(x_{t})} {p(x_{t}|\lambda)}w_k\left( {[\kern-2pt[{i=k}]\kern-2pt]}-w_{i}\right)\\ &=&\sum\limits_{t=1}^{T} \left( \frac{ p_{k}(x_{t})} {p(x_{t}|\lambda)}w_{k}-\sum\limits_{i=1}^{K} \frac{ p_i(x_t)} {p(x_t|\lambda)}w_kw_i\right)=\sum\limits_{t=1}^{T} \left( \gamma_t(k)-w_{k}\sum\limits_{i=1}^{K} \gamma_t(i)\right)\\ &=&\sum\limits_{t=1}^{T} \left( \gamma_t(k)-w_k \right) \end{array} $$
(7)

and the Fisher score related to the parameter \(\mu_{kd}\) is

$$\begin{array}{@{}rcl@{}} G_{\mu_{kd}}^{X} &=&\sum\limits_{t=1}^{T} \frac{\partial\log p(x_t|\lambda)}{\partial \mu_{kd}}=\sum\limits_{t=1}^{T} \frac{1} {p(x_t|\lambda)}\frac{\partial}{\partial \mu_{kd}}\left[\sum\limits_{i=1}^{K} w_{i} p_{i}(x_t)\right] \\ &=&\sum\limits_{t=1}^{T} \frac{w_k} {p(x_t|\lambda)}\frac{\partial p_k(x_t)}{\partial \mu_{kd}} \overset{(6)}{=}\sum\limits_{t=1}^{T} \frac{w_k p_k(x_t)}{p(x_t|\lambda)}\left( \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}\right) \\ &=&\sum\limits_{t=1}^{T} \gamma_t(k)\left( \frac{x_{td}-\mu_{kd}}{\mu_{kd}(1-\mu_{kd})}\right). \end{array} $$
(8)
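
As a complement to the derivation above, the following is a minimal NumPy sketch (our own illustration with illustrative names, not an implementation from the paper) of (7) and (8): given a set X of T binary descriptors, it computes the posteriors \(\gamma_t(k)\) of a Bernoulli Mixture Model and the Fisher scores \(G_{\alpha}^{X}\) and \(G_{\mu}^{X}\):

```python
import numpy as np

def bmm_fisher_scores(X, w, mu, eps=1e-10):
    """Fisher scores of a Bernoulli Mixture Model, following (7) and (8).

    X  : (T, D) array of binary descriptors
    w  : (K,)   mixture weights, summing to one
    mu : (K, D) Bernoulli parameters in (0, 1)
    """
    # log p_k(x_t) for every descriptor/component pair: shape (T, K).
    log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
    log_wp = np.log(w + eps) + log_p

    # Posteriors gamma_t(k) = w_k p_k(x_t) / p(x_t|lambda), computed in log space.
    gamma = np.exp(log_wp - np.logaddexp.reduce(log_wp, axis=1, keepdims=True))

    # Equation (7): G_alpha_k = sum_t (gamma_t(k) - w_k).
    G_alpha = gamma.sum(axis=0) - X.shape[0] * w                   # (K,)

    # Equation (8): G_mu_kd = sum_t gamma_t(k) (x_td - mu_kd) / (mu_kd (1 - mu_kd)).
    G_mu = (gamma.T @ X - gamma.sum(axis=0)[:, None] * mu) / (mu * (1 - mu))
    return G_alpha, G_mu
```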

Appendix B: Approximation of the Fisher information matrix

Our derivation of the FIM is based on the assumption (see also [55, 63]) that for each observation \(x=(x_1,\dots,x_D)\in\{0,1\}^D\) the distribution of the occupancy probability \(\gamma(\cdot) = p(\cdot|x,\lambda)\) is sharply peaked, i.e. there is one Bernoulli index k such that \(\gamma_x(k)\approx 1\) and \(\gamma_x(i)\approx 0\) for all \(i\neq k\). This assumption implies that

$$\begin{array}{@{}rcl@{}} &&\gamma_x(k)\gamma_x(i)\approx 0 \quad \forall\,k,i=1\dots, K, i\neq k\\ &&\gamma_x(k)^2\approx \gamma_x(k) \quad \forall\, k=1,\dots,K \end{array} $$

and then

$$ \gamma_x(k)\gamma_x(i)\approx\gamma_x(k) {[\kern-2pt[{i=k}]\kern-2pt]}, $$
(9)

where [[⋅]] is the Iverson bracket. The elements of the FIM are defined as:

$$\begin{array}{@{}rcl@{}} [F_{\lambda}]_{i,j}=\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \lambda_i}\right) \left( \frac{\partial\log p(x|\lambda)}{\partial \lambda_j}\right)\right]. \end{array} $$
(10)

Hence, the FIM \(F_{\lambda}\) is symmetric and can be written as the block matrix

$$\begin{array}{@{}rcl@{}} F_{\lambda}=\left[\begin{array}{ll} F_{\alpha,\alpha} & F_{\mu,\alpha}\\ F_{\mu,\alpha}^{\top} & F_{\mu,\mu} \end{array}\right]. \end{array} $$

By using the definition of the occupancy probability (i.e. \(\gamma_x(k) = w_k p_k(x)/p(x|\lambda)\)) and the fact that \(p_k\) is the distribution of a D-dimensional Bernoulli of mean \(\mu_k\), we have the following useful equalities:

$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_{x}(k)\right]= \sum\limits_{x\in\{0,1\}^{D}}\gamma_{x}(k)p(x|\lambda){=}w_{k} \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_{x}(k)x_{d}\right]{=}w_{k}\mu_{kd} \end{array} $$
(12)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(k)x_{d}x_{l}\right]{=}w_k\mu_{kd}\left( \mu_{kl}{[\kern-2pt[{d\neq l}]\kern-2pt]} +{[\kern-2pt[{d= l}]\kern-2pt]}\right) \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\partial\log p(x|\lambda)}{\partial \alpha_{k}}\right]\overset{(7)}{=}\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(k)-w_{k}\right] {=}0 \end{array} $$
(14)
$$\begin{array}{@{}rcl@{}} &&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\partial\log p(x|\lambda)}{\partial \mu_{kd}}\right]\overset{(8)}{=} \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\gamma_x(k)(x_{d}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})}\right] {=}0. \end{array} $$
(15)

It follows that \(F_{\lambda}\) may be approximated by a block-diagonal matrix, because the mixed blocks \(F_{\mu_{kd},\alpha_{i}}\) are close to the zero matrix:

$$\begin{array}{@{}rcl@{}} F_{\mu_{kd},\alpha_{i}} &=& \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \mu_{kd}}\right)\left( \frac{\partial\log p(x|\lambda)}{\partial \alpha_i}\right)\right]\\ &\overset{(7)-(8)}{=}& \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(k)\frac{(x_{d}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})}(\gamma_x(i)-w_i) \right]\\ &\overset{(9)}\approx &\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\frac{\gamma_x(k)(x_{d}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})}\right]\left( {[\kern-2pt[{i=k}]\kern-2pt]}-w_i\right)\\ &\overset{(15)}{=}&0. \end{array} $$

The block \(F_{\mu,\mu}\) can be written as a \(KD\times KD\) diagonal matrix; in fact:

$$\begin{array}{@{}rcl@{}} F_{\mu_{id},\mu_{kl}} &\overset{(10)}{=}& \mathbb{E}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \mu_{id} }\right)\left( \frac{\partial\log p(x|\lambda)}{\partial \mu_{kl}}\right)\right]\\ &\overset{(8)}{=}&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\gamma_x(i)\gamma_x(k)\frac{(x_{d}-\mu_{id})}{\mu_{id}(1-\mu_{id})}\frac{(x_{l}-\mu_{kl})}{\mu_{kl}(1-\mu_{kl})} \right]\\ &\overset{(9)}{\approx}&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[ \frac{\gamma_x(k)(x_{d}-\mu_{kd})(x_{l}-\mu_{kl})}{\mu_{kd}\mu_{kl}(1-\mu_{kd})(1-\mu_{kl})}\right]{[\kern-2pt[{i=k}]\kern-2pt]}\\ &\overset{(11)-(13)}{=}& \frac{w_k(\mu_{kd}\mu_{kl}{[\kern-2pt[{d\neq l}]\kern-2pt]} +\mu_{kl}{[\kern-2pt[{d= l}]\kern-2pt]}-\mu_{kd}\mu_{kl})}{\mu_{kd}\mu_{kl}(1-\mu_{kd})(1-\mu_{kl})}{[\kern-2pt[{i=k}]\kern-2pt]}\\ &=& \frac{ w_k(\mu_{kd}{[\kern-2pt[{d\neq l}]\kern-2pt]} +{[\kern-2pt[{d= l}]\kern-2pt]}-\mu_{kd})}{\mu_{kd}(1-\mu_{kd})(1-\mu_{kl})}{[\kern-2pt[{i=k}]\kern-2pt]}\\ &=& \frac{ w_k}{\mu_{kd}(1-\mu_{kd})}{[\kern-2pt[{i=k}]\kern-2pt]}{[\kern-2pt[{d=l}]\kern-2pt]}. \end{array} $$
(16)

The relation (16) shows that the diagonal elements of our FIM approximation are \(w_k/\left(\mu_{kd}(1-\mu_{kd})\right)\) and that the corresponding entries in \(L_{\lambda}\) (i.e. the square root of the inverse of the FIM) equal \(\sqrt{{\mu_{kd}(1-\mu_{kd})}/{w_k}}\). The block related to the α parameters is \(F_{\alpha,\alpha}=\text{diag}(w)-ww^{\top}\), where \(w=[w_1,\dots,w_K]^{\top}\); in fact:

$$\begin{array}{@{}rcl@{}} F_{\alpha_{k},\alpha_{i}}&\overset{(10)}{=}& \mathbb{E}_{x\sim p(\cdot|\lambda)}\left[\left( \frac{\partial\log p(x|\lambda)}{\partial \alpha_{k} }\right)\left( \frac{\partial\log p(x|\lambda)}{\partial \alpha_i}\right)\right]\\ &\quad\overset{(7)}{=}&\mathbb{E}_{x\sim p(\cdot|\lambda)}\left[(\gamma_x(k)-w_k)(\gamma_x(i)-w_i) \right]\\ &\overset{(9)}{\approx}&\mathbb{E}_{p(\cdot|\lambda)}\left[\gamma_x(k){[\kern-2pt[{i=k}]\kern-2pt]}- \gamma_x(k)w_i-\gamma_x(i)w_k +w_iw_k \right]\\ &\overset{(11)-(12)}{=}&\left( w_k{[\kern-2pt[{i=k}]\kern-2pt]}-w_iw_k \right). \end{array} $$

The matrix \(F_{\alpha,\alpha}\) is not invertible (indeed \(F_{\alpha,\alpha}\mathbf{e}=0\), where \(\mathbf{e}=[1,\dots,1]^{\top}\)) due to the dependence among the mixing weights \(\left({\sum}_{i=1}^{K} w_i=1\right)\). Since there are only K−1 degrees of freedom in the mixing weights, as proposed in [63], we can fix \(\alpha_K\) equal to a constant without loss of generality and work with a reduced set of K−1 parameters: \(\tilde{\alpha}=[\alpha_1,\dots,\alpha_{K-1}]^{\top}\).

Taking into account the Fisher score with respect to \(\tilde {\alpha }\), i.e.

$$G_{\tilde{\alpha}}^{X}= \nabla_{\tilde{\alpha}}\log p(X|\lambda)=[G_{{\alpha_1}}^{X},\dots, G_{{\alpha_{K-1}}}^{X}]^{\top} =\widetilde{G_{\alpha}^X}, $$

the corresponding block of the FIM is \(F_{\tilde{\alpha},\tilde{\alpha}}= \text{diag}(\tilde{w})-\tilde{w}\tilde{w}^{\top}\), where \(\tilde{w}=[w_1,\dots,w_{K-1}]^{\top}\). The matrix \(F_{\tilde{\alpha},\tilde{\alpha}}\) is invertible; indeed, it can be decomposed into the product of the invertible diagonal matrix \(D=\text{diag}(\tilde{w})\) and the invertible elementary matrix (see note 8) \(E(\mathbf{e},\tilde{w},1)= I-\mathbf{e}\tilde{w}^{\top}\). Its inverse is

$$F_{\tilde{\alpha},\tilde{\alpha}}^{-1}=\left( I+\frac{1}{w_K}\mathbf{e}\tilde{w}^{\top} \right)\text{diag}(\tilde{w})^{-1}= \text{diag}(\tilde{w})^{-1}+\frac{1}{w_K}\mathbf{e}\mathbf{e}^{\top}, $$

where, by note 8, \(E(\mathbf{e},\tilde{w},1)^{-1}=E\left(\mathbf{e},\tilde{w},-\frac{1}{w_K}\right)=I+\frac{1}{w_K}\mathbf{e}\tilde{w}^{\top}\), since \(\tilde{w}^{\top}\mathbf{e}={\sum}_{i=1}^{K-1}w_i=1-w_K\).

It follows that

$$K_{\tilde{\alpha}}(X,Y)=\left( G_{\tilde{\alpha}}^{X}\right)^{\top} F_{\tilde{\alpha},\tilde{\alpha}}^{-1}\, G_{\tilde{\alpha}}^{Y}=\left( G_{\tilde{\alpha}}^{X}\right)^{\top} \text{diag}(\tilde{w})^{-1}G_{\tilde{\alpha}}^{Y}+\frac{1}{w_K}\left( \mathbf{e}^{\top} G_{\tilde{\alpha}}^{X}\right)\left( \mathbf{e}^{\top} G_{\tilde{\alpha}}^{Y}\right)=\sum\limits_{k=1}^{K} \frac{G_{{\alpha_k}}^{X}\, G_{{\alpha_k}}^{Y}}{w_{k}} $$

where we used \(\mathbf {e}^{\top } G_{\tilde {\alpha }}^{Z}={\sum }_{k=1}^{K-1}{\sum }_{z\in Z} \left (\gamma _{z}(k)-w_k \right ) =-{\sum }_{z\in Z} \left (\gamma _{z}(K)-w_K\right )=-G_{{\alpha _K}}^{Z}\).

By defining \(\mathcal{G}_{\alpha_k}^{X} =\frac{1}{\sqrt{w_k}}{\sum}_{x\in X} \left(\gamma_x(k)-w_{k}\right)\), we finally obtain \(K_{\tilde{\alpha}}(X,Y)=\left(\mathcal{G}_{\alpha}^{X}\right)^{\top} \mathcal{G}_{\alpha}^{Y}\). Please note that we do not need to explicitly compute the Cholesky decomposition of the matrix \(F_{\tilde{\alpha},\tilde{\alpha}}^{-1}\), because the Fisher Kernel \(K_{\tilde{\alpha}}(X,Y)\) can be rewritten as a dot product between the feature vectors \(\mathcal{G}_{{\alpha}}^{X}\) and \(\mathcal{G}_{{\alpha}}^{Y}\).
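
Combining the two appendices: the whitened FV components are \(\mathcal{G}_{\alpha_k}^{X}=G_{\alpha_k}^{X}/\sqrt{w_k}\) and \(\mathcal{G}_{\mu_{kd}}^{X}=\sqrt{\mu_{kd}(1-\mu_{kd})/w_k}\,G_{\mu_{kd}}^{X}\). The sketch below (again our illustration, reusing the hypothetical bmm_fisher_scores helper from Appendix A; the final power- and L2-normalization follows common FV practice [56, 63] rather than a prescription of this appendix) assembles the Fisher Vector of a set of binary descriptors:

```python
import numpy as np

def binary_fisher_vector(X, w, mu):
    """Fisher Vector of binary descriptors under the diagonal FIM approximation.

    The alpha block is scaled by 1/sqrt(w_k) and the mu block by
    sqrt(mu_kd (1 - mu_kd) / w_k), i.e. the entries of L_lambda derived above.
    """
    G_alpha, G_mu = bmm_fisher_scores(X, w, mu)         # sketch from Appendix A
    fv_alpha = G_alpha / np.sqrt(w)                     # (K,)
    fv_mu = G_mu * np.sqrt(mu * (1 - mu) / w[:, None])  # (K, D)
    fv = np.concatenate([fv_alpha, fv_mu.ravel()])
    # Power- and L2-normalization, customary for Fisher Vectors [56, 63].
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```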


Cite this article

Amato, G., Falchi, F. & Vadicamo, L. Aggregating binary local descriptors for image retrieval. Multimed Tools Appl 77, 5385–5415 (2018). https://doi.org/10.1007/s11042-017-4450-2
