Learning Vocabularies over a Fine Quantization

An Erratum to this article was published on 01 December 2013


A novel similarity measure for bag-of-words type large-scale image retrieval is presented. The similarity function is learned in an unsupervised manner, requires no extra space over the standard bag-of-words method and is more discriminative than both L2-based soft assignment and Hamming embedding. The novel similarity function achieves mean average precision that is superior to any result published in the literature on the standard Oxford 5k, Oxford 105k and Paris datasets/protocols. We study the effect of a fine quantization and very large vocabularies (up to 64 million words) and show that the performance of specific object retrieval increases with the size of the vocabulary. This observation contradicts previously published results. We further demonstrate that large vocabularies increase the speed of the tf-idf scoring step.
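The claim about scoring speed can be illustrated with a minimal sketch of standard tf-idf scoring over an inverted file. This is not the similarity function learned in the article; it is the plain bag-of-words baseline, and the toy images, the `w // 4` merging of fine words into coarse ones, and all function names are assumptions made for illustration. With a finer vocabulary each visual word occurs in fewer images, so the posting lists visited for a query are shorter.

```python
import math
from collections import Counter, defaultdict


def build_index(images):
    """Build an inverted file: visual word id -> list of (image id, tf)."""
    index = defaultdict(list)
    for image_id, words in enumerate(images):
        for word, tf in Counter(words).items():
            index[word].append((image_id, tf))
    return index


def idf_weights(index, num_images):
    """idf of a word = log(N / number of images containing the word)."""
    return {w: math.log(num_images / len(p)) for w, p in index.items()}


def document_norms(images, idf):
    """L2 norm of each image's tf-idf vector (1.0 if the vector is zero)."""
    norms = []
    for words in images:
        n = math.sqrt(sum((tf * idf[w]) ** 2 for w, tf in Counter(words).items()))
        norms.append(n or 1.0)
    return norms


def score(query_words, index, idf, norms):
    """Return ({image id: cosine tf-idf similarity}, posting entries visited)."""
    q = {w: tf * idf[w] for w, tf in Counter(query_words).items() if w in index}
    q_norm = math.sqrt(sum(v * v for v in q.values())) or 1.0
    scores = defaultdict(float)
    touched = 0
    for w, q_val in q.items():
        for image_id, tf in index[w]:  # only images containing word w
            scores[image_id] += q_val * tf * idf[w]
            touched += 1
    return {i: s / (q_norm * norms[i]) for i, s in scores.items()}, touched


def run(images, query_words):
    index = build_index(images)
    idf = idf_weights(index, len(images))
    return score(query_words, index, idf, document_norms(images, idf))


# Toy data (hypothetical): four images described by fine visual word ids.
images_fine = [
    [0, 5, 10, 15],
    [1, 5, 11, 14],
    [2, 6, 10, 13],
    [3, 7, 9, 12],
]
# A coarser vocabulary, simulated by merging every four fine words into one.
images_coarse = [[w // 4 for w in words] for words in images_fine]

scores_fine, touched_fine = run(images_fine, images_fine[0])
scores_coarse, touched_coarse = run(images_coarse, images_coarse[0])
print(touched_fine, touched_coarse)  # prints: 6 16
```

In this toy setting the coarse vocabulary maps all four images to the same words, so every posting list has full length and every idf weight is zero, while the fine vocabulary visits fewer entries and still ranks image 0 first. Real collections are far less extreme, but the number of visited posting entries shrinks in the same way as words become rarer.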



  1. We only consider and compare with methods that support queries that cover only a (small) part of the test image. Global methods like GIST (Oliva and Torralba 2006) achieve a much smaller memory footprint at the cost of allowing whole image queries only.


  5. About 5–10% of the images in the Holidays dataset presented in (Jégou et al. 2008) are rotated unnaturally for a human observer. Because a rotation-variant feature descriptor was used in our experiment, we report the performance on a version of the dataset in which the orientation of the images is corrected according to EXIF, or manually (by 90°, 180° or 270°) where the EXIF information is missing and the correct (sky-is-up) orientation is obvious.


  1. Agarwal, S., Snavely, N., Simon, I., Seitz, S., & Szeliski, R. (2009). Building Rome in a day. In Proceedings of ICCV, Kyoto.

  2. Avrithis, Y., & Kalantidis, Y. (2012). Approximate Gaussian mixtures for large scale vocabularies. In Proceedings of ECCV, Florence.

  3. Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: ACM Press (ISBN: 020139829).

  4. Cech, J., Matas, J., & Perdoch, M. (2008). Efficient sequential correspondence selection by cosegmentation. In Proceedings of CVPR, Anchorage.

  5. Chum, O., & Matas, J. (2010). Large-scale discovery of spatially related images. IEEE PAMI, 32, 371–377.

  6. Chum, O., Perdoch, M., & Matas, J. (2009). Geometric min-hashing: Finding a (thick) needle in a haystack. In Proceedings of CVPR, Miami.

  7. Chum, O., Philbin, J., Sivic, J., Isard, M., & Zisserman, A. (2007). Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of ICCV, Rio de Janeiro.

  8. Duda, R., Hart, P., & Stork, D. (1995). Pattern classification and scene analysis (2nd ed.). New York: Wiley.

  9. Ferrari, V., Tuytelaars, T., & Van Gool, L. (2004). Simultaneous object recognition and segmentation by image exploration. In Proceedings of ECCV, Prague.

  10. Fraundorfer, F., Stewénius, H., & Nistér, D. (2007). A binning scheme for fast hard drive based image search. In Proceedings of CVPR, Minneapolis.

  11. Godsil, C., & Royle, G. (2001). Algebraic graph theory. New York: Springer.

  12. Hua, G., Brown, M., & Winder, S. (2007). Discriminant embedding for local image descriptors. In Proceedings of ICCV, Rio de Janeiro.

  13. Jégou, H., Douze, M., & Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of ECCV, Marseille.

  14. Jégou, H., Douze, M., & Schmid, C. (2009). On the burstiness of visual elements. In Proceedings of CVPR, Miami.

  15. Jégou, H., Douze, M., & Schmid, C. (2010). Improving bag-of-features for large scale image search. IJCV, 87(3), 316–336.

  16. Li, X., Wu, C., Zach, C., Lazebnik, S., & Frahm, J.-M. (2008). Modeling and recognition of landmark image collections using iconic scene graphs. In Proceedings of ECCV, Marseille.

  17. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 91–110.

  18. Makadia, A. (2010). Feature tracking for wide-baseline image retrieval. Berlin: Springer.

  19. Mikolajczyk, K., & Matas, J. (2007). Improving SIFT for fast tree matching by optimal linear projection. In Proceedings of ICCV, Rio de Janeiro.

  20. Mikulik, A., Perdoch, M., Chum, O., & Matas, J. (2010). Learning a fine vocabulary. In K. Daniilidis, P. Maragos, & N. Paragios (Eds.), Proceedings of ECCV, Lecture Notes in Computer Science (Vol. 6313, pp. 1–14). Heidelberg, Germany: Springer (Foundation for Research and Technology-Hellas (FORTH), CD-ROM).

  21. Muja, M., & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. In Proceedings of VISAPP.

  22. Nistér, D., & Stewénius, H. (2006). Scalable recognition with a vocabulary tree. In Proceedings of CVPR, New York.

  23. Oliva, A., & Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research: Visual Perception, 155, 23–36.

  24. Perdoch, M., Chum, O., & Matas, J. (2009). Efficient representation of local geometry for large scale object retrieval. In Proceedings of CVPR, Miami.

  25. Perronnin, F. (2008). Universal and adapted vocabularies for generic visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1243–1256.

  26. Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In Proceedings of CVPR, Minneapolis.

  27. Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2008). Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of CVPR, Anchorage.

  28. Project page (2012). Data, binaries, and source codes released with the paper. http://cmp.felk.cvut.cz/qqmikula/publications/ijcv2012/index.html.

  29. Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of ICCV, Nice (pp. 1470–1477).

  30. Tavenard, R., Amsaleg, L., & Jégou, H. (2010). Balancing clusters to reduce response time variability in large scale image search. Research Report RR-7387, INRIA.



Corresponding author

Correspondence to Andrej Mikulik.

About this article

Cite this article

Mikulik, A., Perdoch, M., Chum, O. et al. Learning Vocabularies over a Fine Quantization. Int J Comput Vis 103, 163–175 (2013). https://doi.org/10.1007/s11263-012-0600-1


  • Image retrieval
  • Vocabulary
  • Feature track