International Journal of Computer Vision, Volume 121, Issue 1, pp 126–148

Image Annotation by Propagating Labels from Semantic Neighbourhoods

  • Yashaswi Verma
  • C. V. Jawahar


Abstract

Automatic image annotation aims at predicting a set of semantic labels for an image. Because of the large annotation vocabulary, there exist large variations in the number of images corresponding to different labels ("class-imbalance"). Additionally, due to the limitations of human annotation, several images are not annotated with all the relevant labels ("incomplete-labelling"). These two issues affect the performance of most existing image annotation models. In this work, we propose the 2-pass k-nearest neighbour (2PKNN) algorithm, a two-step variant of the classical k-nearest neighbour algorithm that addresses these issues in the image annotation task. The first step of 2PKNN uses "image-to-label" similarities, while the second step uses "image-to-image" similarities, thus combining the benefits of both. We also propose a metric learning framework over 2PKNN. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label data. In addition to the features provided by Guillaumin et al. (2009) that are used by almost all the recent image annotation methods, we benchmark using new features that include features extracted from a generic convolutional neural network model and those computed using modern encoding techniques. We also learn linear and kernelized cross-modal embeddings over different feature combinations to reduce the semantic gap between visual features and textual labels. Extensive evaluations on four image annotation datasets (Corel-5K, ESP-Game, IAPR-TC12 and MIRFlickr-25K) demonstrate that our method achieves promising results and establishes a new state-of-the-art on the prevailing image annotation datasets.
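The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the actual method learns a distance metric and per-label weights, whereas here the pass-2 weighting is a simple exponential decay over Euclidean distance, and all names (`two_pass_knn`, `k1`) are illustrative.

```python
import numpy as np

def two_pass_knn(train_feats, train_labels, query, k1=3):
    """Illustrative sketch of the 2PKNN idea.

    train_feats:  (n_images, d) feature matrix.
    train_labels: (n_images, n_labels) binary annotation matrix.
    Pass 1 ("image-to-label"): for each label, keep the k1 training
    images closest to the query among those annotated with it, giving
    a class-balanced pool of "semantic neighbours".
    Pass 2 ("image-to-image"): propagate labels from this pool with
    weights that decay with the image-to-image distance.
    """
    dists = np.linalg.norm(train_feats - query, axis=1)
    pool = set()
    for l in range(train_labels.shape[1]):
        idx = np.where(train_labels[:, l] == 1)[0]
        if idx.size == 0:
            continue
        # k1 nearest training images that carry label l
        pool.update(idx[np.argsort(dists[idx])[:k1]].tolist())
    pool = np.array(sorted(pool))
    # Pass 2: each pooled neighbour votes for its labels,
    # weighted by its similarity to the query (toy weighting here).
    w = np.exp(-dists[pool])
    return w @ train_labels[pool]  # one relevance score per label
```

Because pass 1 draws the same number of neighbours per label, rare labels contribute to the pool on equal footing with frequent ones, which is how 2PKNN counters class-imbalance before the usual neighbour-voting of pass 2.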


Keywords: Image annotation · Nearest neighbour · Metric learning · Cross-media analysis



Acknowledgements

We thank Prof. Raghavan Manmatha for sharing the Corel-5K dataset, and the anonymous reviewers for their helpful comments. Yashaswi Verma is partially supported by the Microsoft Research India PhD fellowship 2013.


References

  1. Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hyperion.
  2. Ballan, L., Uricchio, T., Seidenari, L., & Bimbo, A. D. (2014). A cross-media model for automatic image annotation. In Proceedings of the ICMR.
  3. Carneiro, G., Chan, A. B., Moreno, P. J., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 394–410.
  4. Chen, M., Zheng, A., & Weinberger, K. Q. (2013). Fast image tagging. In Proceedings of the ICML.
  5. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the ICML.
  6. Duygulu, P., Barnard, K., de Freitas, J. F., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the ECCV (pp. 97–112).
  7. Feng, S. L., Manmatha, R., & Lavrenko, V. (2004). Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the CVPR (pp. 1002–1009).
  8. Fu, H., Zhang, Q., & Qiu, G. (2012). Random forest for image annotation. In Proceedings of the ECCV (pp. 86–99).
  9. Grubinger, M. (2007). Analysis and evaluation of visual information systems performance. PhD thesis, Victoria University, Melbourne, Australia.
  10. Guillaumin, M., Mensink, T., Verbeek, J. J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Proceedings of the ICCV (pp. 309–316).
  11. Gupta, A., Verma, Y., & Jawahar, C. V. (2012). Choosing linguistics over vision to describe images. In Proceedings of the AAAI.
  12. Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.
  13. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
  14. Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In MIR.
  15. Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In Proceedings of the CVPR (pp. 3304–3311).
  16. Jeon, J., Lavrenko, V., & Manmatha, R. (2003). Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the ACM SIGIR (pp. 119–126).
  17. Jin, R., Wang, S., & Zhou, Z. H. (2009). Learning a distance metric from multi-instance multi-label data. In Proceedings of the CVPR (pp. 896–902).
  18. Kalayeh, M. M., Idrees, H., & Shah, M. (2014). NMF-KNN: Image annotation using weighted multi-view non-negative matrix factorization. In Proceedings of the CVPR.
  19. Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learning the semantics of pictures. In NIPS.
  20. Li, X., Snoek, C. G. M., & Worring, M. (2009). Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7), 1310–1322.
  21. Liu, J., Li, M., Liu, Q., Lu, H., & Ma, S. (2009). Image annotation via graph learning. Pattern Recognition, 42(2), 218–228.
  22. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
  23. Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In Proceedings of the ECCV (pp. 316–329).
  24. Makadia, A., Pavlovic, V., & Kumar, S. (2010). Baselines for image annotation. International Journal of Computer Vision, 90(1), 88–105.
  25. Metzler, D., & Manmatha, R. (2004). An inference network approach to image retrieval. In Proceedings of the CIVR (pp. 42–50).
  26. Moran, S., & Lavrenko, V. (2011). Optimal tag sets for automatic image annotation. In Proceedings of the BMVC (pp. 1.1–1.11).
  27. Moran, S., & Lavrenko, V. (2014). A sparse kernel relevance model for automatic image annotation. International Journal of Multimedia Information Retrieval, 3(4), 209–229.
  28. Mori, Y., Takahashi, H., & Oka, R. (1999). Image-to-word transformation based on dividing and vector quantizing images with words. In MISRM'99 first international workshop on multimedia intelligent storage and retrieval management.
  29. Murthy, V. N., Can, E. F., & Manmatha, R. (2014). A hybrid model for automatic image annotation. In Proceedings of the ICMR.
  30. Nakayama, H. (2011). Linear distance metric learning for large-scale generic image recognition. PhD thesis, The University of Tokyo, Japan.
  31. Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
  32. Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In Proceedings of the ECCV (pp. 143–156).
  33. Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the ICML (pp. 807–814).
  34. van de Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In Proceedings of the ECCV (pp. 334–348).
  35. Verbeek, J., Guillaumin, M., Mensink, T., & Schmid, C. (2010). Image annotation with TagProp on the MIRFLICKR set. In MIR.
  36. Verma, Y., & Jawahar, C. V. (2012). Image annotation using metric learning in semantic neighbourhoods. In Proceedings of the ECCV (pp. 836–849).
  37. Verma, Y., & Jawahar, C. V. (2013). Exploring SVM for image annotation in presence of confusing labels. In Proceedings of the BMVC.
  38. von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In SIGCHI conference on human factors in computing systems (pp. 319–326).
  39. Wang, C., Blei, D., & Fei-Fei, L. (2009). Simultaneous image classification and annotation. In Proceedings of the CVPR.
  40. Wang, H., Huang, H., & Ding, C. H. Q. (2011). Image annotation using bi-relational graph of images and semantic labels. In Proceedings of the CVPR (pp. 793–800).
  41. Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207–244.
  42. Xiang, Y., Zhou, X., Chua, T. S., & Ngo, C. W. (2009). A revisit of generative model for automatic image annotation using Markov random fields. In Proceedings of the CVPR (pp. 1153–1160).
  43. Yavlinsky, A., Schofield, E., & Rüger, S. (2005). Automated image annotation using global features and robust nonparametric density estimation. In Proceedings of the CIVR (pp. 507–517).
  44. Zhang, S., Huang, J., Huang, Y., Yu, Y., Li, H., & Metaxas, D. N. (2010). Automatic image annotation using group sparsity. In Proceedings of the CVPR (pp. 3312–3319).

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Center for Visual Information Technology, IIIT Hyderabad, India
