Multimedia Tools and Applications, Volume 77, Issue 24, pp 31991–32011

Automatic image annotation: the quirks and what works

  • Ayushi Dutta
  • Yashaswi Verma
  • C. V. Jawahar


Automatic image annotation is one of the fundamental problems in computer vision and machine learning. Given an image, the goal is to predict a set of textual labels that describe its semantics. Over the last decade, a large number of image annotation techniques have been proposed and shown to achieve encouraging results on various annotation datasets. However, their evaluation has mostly been restricted to quantitative results on the test data, ignoring key aspects of dataset properties and evaluation metrics that inherently affect performance to a considerable extent. In this paper, we first evaluate ten state-of-the-art approaches for image annotation (both deep-learning based and non-deep-learning based) using the same baseline CNN features. We then propose new quantitative measures to examine various issues in the image annotation domain, such as dataset-specific biases, per-label versus per-image evaluation criteria, and the impact of changing the number and type of predicted labels. We believe the conclusions derived in this paper through thorough empirical analyses will be helpful in making systematic advancements in this domain.
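The distinction between per-label and per-image evaluation can be illustrated with a small sketch. It is not the paper's evaluation code: the function names, the toy data, and the convention of scoring an undefined precision or recall as zero are all assumptions made for illustration. It uses the common definitions in which per-label scores average precision and recall over the vocabulary, while per-image scores average them over the test images.

```python
from typing import Dict, List, Sequence, Set, Tuple


def top_k_labels(scores: Dict[str, float], k: int) -> Set[str]:
    """Keep the k highest-scoring labels (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return set(ranked[:k])


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption: an undefined ratio (empty denominator) is scored as 0.
    return numerator / denominator if denominator else 0.0


def per_label_scores(
    truth: Sequence[Set[str]], pred: Sequence[Set[str]], vocab: List[str]
) -> Tuple[float, float]:
    """Mean precision and recall, averaged over the labels in vocab."""
    precisions, recalls = [], []
    for label in vocab:
        tp = sum(1 for t, p in zip(truth, pred) if label in t and label in p)
        n_pred = sum(1 for p in pred if label in p)
        n_true = sum(1 for t in truth if label in t)
        precisions.append(_ratio(tp, n_pred))
        recalls.append(_ratio(tp, n_true))
    return sum(precisions) / len(vocab), sum(recalls) / len(vocab)


def per_image_scores(
    truth: Sequence[Set[str]], pred: Sequence[Set[str]]
) -> Tuple[float, float]:
    """Mean precision and recall, averaged over the images."""
    precisions, recalls = [], []
    for t, p in zip(truth, pred):
        tp = len(t & p)
        precisions.append(_ratio(tp, len(p)))
        recalls.append(_ratio(tp, len(t)))
    return sum(precisions) / len(truth), sum(recalls) / len(truth)


if __name__ == "__main__":
    # Hypothetical toy data: a model that always predicts the frequent label.
    vocab = ["sky", "tree", "car", "boat"]
    truth = [{"sky", "tree"}, {"sky"}, {"sky", "car"}, {"boat"}]
    pred = [{"sky"}, {"sky"}, {"sky"}, {"sky"}]
    print("per-image (P, R):", per_image_scores(truth, pred))
    print("per-label (P, R):", per_label_scores(truth, pred, vocab))
```

On this toy data, always predicting the most frequent label looks reasonable under per-image averaging but poor under per-label averaging, because rare labels that are never predicted pull the per-label mean down. Varying `k` in `top_k_labels` is one simple way to explore how the number of predicted labels shifts both sets of scores.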


Image tagging, Empirical study, Evaluation metrics, Dataset analysis



Yashaswi Verma would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award 2017.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. CVIT, IIIT Hyderabad, India
  2. CDS, IISc Bangalore, India
