Multimedia Tools and Applications, Volume 51, Issue 2, pp 441–477

Building descriptive and discriminative visual codebook for large-scale image applications

  • Qi Tian
  • Shiliang Zhang
  • Wengang Zhou
  • Rongrong Ji
  • Bingbing Ni
  • Nicu Sebe


Inspired by the success of textual words in large-scale textual information processing, researchers are trying to extract visual words from images that function similarly to textual words. Visual words are commonly generated by clustering a large number of local image features, with the cluster centers taken as visual words. This approach is simple and scalable, but it results in noisy visual words. Many works have been reported that try to improve the descriptive and discriminative ability of visual words. This paper gives a comprehensive survey on visual vocabulary and details several state-of-the-art algorithms. A comprehensive review and summary of the related work on visual vocabulary is first presented. Then, we introduce our recent algorithms for descriptive and discriminative visual word generation, i.e., latent visual context analysis for descriptive visual word identification [74], descriptive visual words and visual phrases generation [68], contextual visual vocabulary, which combines both semantic contexts and spatial contexts [69], and visual vocabulary hierarchy optimization [18]. Additionally, we introduce two interesting post-processing strategies to further improve the performance of visual vocabulary: spatial coding [73], which efficiently removes mismatched visual words between images for more reasonable image similarity computation, and user preference based visual word weighting [44], which makes the image similarity computed from visual words more consistent with users’ preferences or habits.
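The baseline pipeline mentioned above (cluster local features, take the cluster centers as visual words, then describe an image by its word counts) can be sketched as follows. This is only a minimal illustration under stated assumptions, not any of the surveyed algorithms: the function names, the toy 2-D "descriptors", and the evenly spaced initialization are all hypothetical choices made for the example, and real systems typically cluster high-dimensional descriptors such as SIFT with hierarchical or approximate k-means.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor: Vector, codebook: List[List[float]]) -> int:
    """Index of the visual word (cluster center) closest to a descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def build_codebook(descriptors: List[Vector], k: int,
                   iterations: int = 20) -> List[List[float]]:
    """Plain k-means (Lloyd's algorithm); the k centers act as visual words.

    Assumption: centers are initialized from evenly spaced input descriptors,
    a simplistic deterministic choice made only for this sketch.
    """
    codebook = [list(descriptors[i * len(descriptors) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, codebook)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bag_of_words(descriptors: List[Vector],
                 codebook: List[List[float]]) -> List[int]:
    """Histogram of visual-word occurrences for the descriptors of one image."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        histogram[nearest_word(d, codebook)] += 1
    return histogram


# Hypothetical toy data: six 2-D "descriptors" in two well-separated groups.
toy_descriptors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
                   (5.0, 5.1), (5.1, 5.0), (4.9, 5.2)]
toy_codebook = build_codebook(toy_descriptors, k=2)
toy_histogram = bag_of_words(toy_descriptors, toy_codebook)
```

On the toy data each group collapses to one visual word, so the histogram assigns three descriptors to each word. Because every descriptor is hard-assigned to its nearest center, descriptors near a cluster boundary can end up in different words despite being similar, which is one source of the noise that the abstract attributes to this approach.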


  • Visual vocabulary
  • Large-scale image retrieval
  • Image search re-ranking
  • Feature space quantization



This work is supported in part by NSF IIS 1052851 and by Akiira Media Systems, Inc. The work of Nicu Sebe has been supported by the FP7 IP GLOCAL European project and by the FIRB S-PATTERN project.


  1. Agarwal S, Roth D (2002) Learning a sparse representation for object detection. ECCV
  2. Battiato S, Farinella G, Gallo G, Ravi D (2009) Spatial hierarchy of textons distribution for scene classification. Proc. Eurocom Multimedia Modeling, pp 333–342
  3. Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J MLR 3:993–1022
  4. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. WWW
  5. Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. ICCV
  6. Deerwester S, Dumais S, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
  7. Duygulu P, Barnard K, Freitas J, Forsyth D (2002) Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. ECCV
  8. Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 24:381–395
  9. Gemert V, Veenman C, Smeulders A, Geusebroek J (2010) Visual word ambiguity. T-PAMI 32(7):1271–1283
  10. Grauman K, Darrell T (2007) Approximate correspondences in high dimensions. NIPS
  11. Globerson A, Roweis S (2006) Metric learning by collapsing classes. Adv In Neu Info Proce Sys 18:451–458
  12. Hofmann T (1999) Probabilistic latent semantic indexing. ACM SIGIR
  13. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. ML 41:177–196
  14. Indyk P, Thaper N (1998) Fast image retrieval via embeddings. Symposium on Theory of Computing
  15. Jegou H, Harzallah H, Schmid C (2007) A contextual dissimilarity measure for accurate and efficient image search. CVPR
  16. Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. ECCV
  17. Ji R, Xie X, Yao H, Wu Y, Ma W (2008) Incremental indexing of visual vocabulary for scalable retrieval. ICME
  18. Ji R, Xie X, Yao H, Ma W (2009) Vocabulary hierarchy optimization for effective and transferable retrieval. CVPR
  19. Ji R, Yao H, Sun X, Zhong B, Gao W (2010) Towards semantic embedding in visual vocabulary. CVPR
  20. Jing Y, Baluja S (2008) VisualRank: applying pagerank to large-scale image search. IEEE Trans on PAMI 30:1877–1890
  21. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. IJCV, pp 604–610
  22. Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling of object categories using link analysis techniques. CVPR
  23. Kim G, Faloutsos C, Hebert M (2008) Unsupervised modeling and recognition of object categories with combination of visual contents and geometric similarity links. ACM MIR
  24. Kohonen T (1986) Learning vector quantization for pattern recognition. Tech. Rep. TKK-F-A601, Helsinki Institute of Technology
  25. Kohonen T (2000) Self-organizing maps, 3rd edition, Springer-Verlag
  26. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebook by information loss minimization. PAMI 31(7):1294–1309
  27. Leibe B, Leonardis A, Schiele B (2004) Combined object categorization and segmentation with an implicit shape model. ECCV
  28. Leordeanu M, Hebert M (2005) A spectral technique for correspondence problems using pairwise constraints. ICCV
  29. Leung T, Malik J (2001) Representing and recognizing the visual appearance of materials using 3-d textons. IJCV
  30. Li F, Pietro P (2007) A bayesian hierarchical model for learning natural scene categories. ICCV
  31. Li T, Mei T, Kweon I, Hua X (2010) Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology
  32. Liu D, Hua G, Viola P, Chen T (2008) Integrated feature selection and higher-order spatial feature extraction for object categorization. CVPR, pp 1–8
  33. Liu C, Yuen J, Torralba A (2009) Dense scene alignment using SIFT flow for object recognition. CVPR
  34. Liu J, Yang Y, Shah M (2009) Learning semantic visual vocabularies using diffusion distance. CVPR
  35. Liu D, Hua X, Yang L, Wang M, Zhang H (2009) Tag ranking. WWW
  36. Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
  37. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297
  38. Marszalek M, Schmid C (2006) Spatial weighting for bag-of-features. CVPR, pp 2118–2125
  39. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. NIPS
  40. Marszalek M, Schmid C (2007) Semantic hierarchies for visual object recognition. CVPR
  41. Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. BMVC
  42. Moosmann F, Triggs B, Jurie F (2006) Fast discriminative visual codebooks using randomized clustering forests. NIPS
  43. Moosmann F, Nowak E, Jurie F (2008) Randomized clustering forests for image classification. PAMI 30(9):1632–1646
  44. Ni B, Tian Q, Yang L, Yan S (2010) Query-log aware content based image retrieval. To be submitted
  45. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. CVPR, pp 2161–2168
  46. Perronnin F (2008) Universal and adapted vocabularies for generic visual categorization. PAMI 30(7):1243–1256
  47. Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. ECCV
  48. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. CVPR
  49. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. CVPR
  50. Rao A, Miller D, Rose K, Gersho A (1996) A generalized VQ method for combined compression and estimation. ICASSP
  51. Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York
  52. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24(5):513–523
  53. Savarese S, Winn J, Criminisi A (2006) Discriminative object class models of appearance and shape by correlatons. CVPR, pp 2033–2040
  54. Schindler G, Brown M (2007) City-scale location recognition. CVPR
  55. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. ICCV, pp 1470–1477
  56. Viola P, Jones M (2001) Robust real-time face detection. ICCV, pp 7–14
  57. Wang L (2007) Toward a discriminative codebook: codeword selection across multi-resolution. CVPR
  58. Wang F, Jiang Y, Ngo C (2008) Video event detection using motion relativity and visual relatedness. ACM Multimedia, pp 239–248
  59. Wang S, Huang Q, Jiang S, Qin L, Tian Q (2009) Visual context rank for web image re-ranking. ACM workshop on LSMRM
  60. Wu Z, Ke Q, Sun J (2009) Bundling features for large-scale partial-duplicate web image search. CVPR
  61. Wu L, Hoi S, Yu N (2009) Semantic-preserving bag-of-words models for efficient image annotation. ACM workshop on LSMRM, pp 19–26
  62. Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. PAMI 30(11):1985–1997
  63. Yang J (2007) Evaluating bag-of-visual-words representations in scene classification. ACM Multimedia
  64. Yang L, Meer P, Foran D (2007) Multiple class segmentation using a unified framework over mean-shift patches. CVPR, pp 1–8
  65. Yates R, Neto B (1999) Modern information retrieval, Addison Wesley Longman Publishing Co. Inc
  66. Yuan J, Wu Y, Yang M (2007) Discovery of collocation patterns: from visual words to visual phrases. CVPR, pp 1–8
  67. Zhang J, Marszalek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive review. IJCV
  68. Zhang S, Tian Q, Hua G, Huang Q, Li S (2009) Descriptive visual words and visual phrases for image applications. ACM Multimedia
  69. Zhang S, Huang Q, Hua G, Jiang S, Gao W, Tian Q (2010) Building contextual visual vocabulary for large-scale image applications. ACM Multimedia
  70. Zhang S, Huang Q, Lu Y, Gao W, Tian Q (2010) Building pair-wise visual word tree for efficient image re-ranking. ICASSP
  71. Zheng Y, Zhao M, Neo S, Chua T, Tian Q (2008) Visual synset: a higher-level visual representation. CVPR, pp 1–8
  72. Zhou W, Li H, Lu Y, Tian Q (2010) Large scale partial-duplicate image retrieval with bi-space quantization and geometric consistency. ICASSP
  73. Zhou W, Lu Y, Song Y, Li H, Tian Q (2010) Spatial coding for large-scale partial-duplicate web image search. ACM Multimedia
  74. Zhou W, Tian Q, Yang L, Li H (2010) Latent visual context analysis for image re-ranking. ACM International Conference on Image and Video Retrieval (CIVR), Xi’an, China

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Qi Tian
    • 1
  • Shiliang Zhang
    • 2
  • Wengang Zhou
    • 3
  • Rongrong Ji
    • 4
  • Bingbing Ni
    • 5
  • Nicu Sebe
    • 6
  1. Computer Science Department, University of Texas at San Antonio, San Antonio, USA
  2. Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  3. EEIS Department, University of Science and Technology of China, Hefei, China
  4. Harbin Institute of Technology, Harbin, China
  5. National University of Singapore, Singapore, Singapore
  6. Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
