Abstract
We study how to encode the local descriptors of an image (e.g., SIFT) into a compact representation vector for scalable image retrieval. We revisit the implicit design choices in the popular vector of locally aggregated descriptors (VLAD), which aggregates the residuals between descriptors and their codewords. Because VLAD relies on a coarse codebook and first-order descriptor statistics when computing residuals, the residuals it produces are less discriminative. To address this problem, we partition the codebook feature space with a novel fine-grained quantization strategy. After quantization, we augment the resulting residuals with high-order statistics of the descriptor distribution. Experiments on three challenging image retrieval datasets (INRIA Holidays, UKBench, Oxford 5k) confirm the improved discriminative power of our encoding method, called FhVLAD. It is more accurate than the baseline and competitive with state-of-the-art techniques, with only a limited increase in dimension.
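For orientation, the baseline that the abstract revisits can be sketched as follows. This is a minimal illustration of standard VLAD only (hard assignment to the nearest codeword, summed first-order residuals, global L2 normalization); it does not reproduce FhVLAD's fine-grained quantization or its high-order statistics, whose details are not given here. The function name `vlad_encode`, the toy codebook, and the random descriptors are assumptions made for the example.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Baseline VLAD sketch: sum the residuals of descriptors to their nearest codeword.

    descriptors: (n, d) array of local descriptors (e.g., SIFT would have d = 128)
    codebook:    (k, d) array of codewords (e.g., k-means centroids)
    Returns an L2-normalized vector of length k * d.
    """
    # Squared Euclidean distance from every descriptor to every codeword: shape (n, k)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # hard assignment to the closest codeword

    k, d = codebook.shape
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            # First-order statistic: sum of residuals (descriptor minus codeword)
            v[i] = (members - codebook[i]).sum(axis=0)

    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy data (hypothetical): 50 descriptors of dimension 8, codebook of 4 codewords
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))
descriptors = rng.normal(size=(50, 8))
vec = vlad_encode(descriptors, codebook)
print(vec.shape)  # (32,)
```

The output dimension is the number of codewords times the descriptor dimension, which is why the abstract treats any increase in dimension as a cost to be kept small.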
Cite this article
Bhowmick, A., Saharia, S. & Hazarika, S.M. FhVLAD: Fine-grained quantization and encoding high-order descriptor statistics for scalable image retrieval. Multimed Tools Appl 80, 35495–35520 (2021). https://doi.org/10.1007/s11042-020-10491-7
DOI: https://doi.org/10.1007/s11042-020-10491-7