FhVLAD: Fine-grained quantization and encoding high-order descriptor statistics for scalable image retrieval

  • 1166: Advances of machine learning in data analytics and visual information processing
Multimedia Tools and Applications

Abstract

We are interested in encoding the local descriptors of an image (e.g., SIFT) into a compact representation vector, and thereby in addressing scalable image retrieval. We revisit the implicit design choices in the popular vector of locally aggregated descriptors (VLAD), which aggregates the residuals of descriptors with respect to their codewords. Because VLAD uses a coarse codebook and only first-order descriptor statistics when computing residuals, the resulting residuals are less discriminative. To address this problem, we propose dividing the codebook feature space with a novel fine-grained quantization strategy. After quantization, we embed the resulting residuals with high-order statistics of the descriptor distribution. Experiments on three challenging image retrieval datasets (INRIA Holidays, UKBench, Oxford 5k) confirm the improved discriminative power of our novel encoding method, called FhVLAD. We observe higher accuracy than the baseline and performance competitive with state-of-the-art techniques, with only a limited increase in dimension.
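The baseline that the abstract revisits can be sketched in a few lines of NumPy. The first function below is standard VLAD: each descriptor is hard-assigned to its nearest codeword, residuals are summed per codeword, and the result is L2-normalized. The second function is only an illustration of what "adding higher-order statistics" can look like: it appends the per-cell variance of the residuals. It is an assumption made for this example and is not the FhVLAD encoding, whose fine-grained quantization and exact statistics are not specified in the abstract. All function names are hypothetical.

```python
import numpy as np


def assign(descriptors, codebook):
    """Index of the nearest codeword (squared Euclidean) for each descriptor."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def vlad_encode(descriptors, codebook):
    """Baseline VLAD: per-codeword sum of residuals, flattened and normalized."""
    nearest = assign(descriptors, codebook)
    k, d = codebook.shape
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(axis=0)
    return l2_normalize(v.ravel())


def vlad_with_variance(descriptors, codebook):
    """Illustrative only (not FhVLAD): residual sums plus per-cell residual variance."""
    nearest = assign(descriptors, codebook)
    k, d = codebook.shape
    first = np.zeros((k, d))
    second = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            residuals = members - codebook[i]
            first[i] = residuals.sum(axis=0)   # first-order statistic
            second[i] = residuals.var(axis=0)  # second-order statistic
    return l2_normalize(np.concatenate([first.ravel(), second.ravel()]))
```

With a codebook of k codewords and d-dimensional descriptors, `vlad_encode` returns a k·d vector, while the illustrative variant returns 2·k·d. This shows the general trade-off between richer statistics and vector dimension that the abstract alludes to.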

Author information

Corresponding author

Correspondence to Alexy Bhowmick.

About this article

Cite this article

Bhowmick, A., Saharia, S. & Hazarika, S.M. FhVLAD: Fine-grained quantization and encoding high-order descriptor statistics for scalable image retrieval. Multimed Tools Appl 80, 35495–35520 (2021). https://doi.org/10.1007/s11042-020-10491-7

  • DOI: https://doi.org/10.1007/s11042-020-10491-7
