Abstract
The presence of non-relevant tags in image folksonomies hampers the effective organization and retrieval of user-contributed images. In this paper, we propose to learn the relevance of user-supplied tags by means of visually weighted neighbor voting, a variant of the popular baseline neighbor voting algorithm proposed by Li et al. (IEEE Trans Multimedia 11(7):1310–1322, 2009). To gain insight into the effectiveness of baseline and visually weighted neighbor voting, we qualitatively analyze the difference in tag relevance when using a different number of neighbors, for both tags relevant and tags not relevant to the content of a seed image. Our qualitative analysis shows that tag relevance values computed by means of visually weighted neighbor voting are more stable and representative than tag relevance values computed by means of baseline neighbor voting. This is quantitatively confirmed through extensive experimentation with MIRFLICKR-25000, studying the variation of tag relevance values as a function of the number of neighbors used (for both tags relevant and tags not relevant with respect to the content of a seed image), as well as the influence of tag relevance learning on the effectiveness of image tag refinement, tag-based image retrieval, and image tag recommendation.
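The neighbor-voting idea underlying both schemes can be sketched in a few lines. The function below is an illustrative assumption, not the authors' implementation: a tag's relevance to a seed image is the (optionally similarity-weighted) number of votes it receives from the image's k visual neighbors, minus the vote count expected if those k images had been sampled at random from the collection. All names and the toy data are hypothetical.

```python
def tag_relevance(tag, neighbor_tags, collection_tags, weights=None):
    """Score `tag` for a seed image by neighbor voting (illustrative sketch).

    neighbor_tags:   list of tag sets, one per visual neighbor of the seed image
    collection_tags: list of tag sets for every image in the collection
    weights:         optional visual-similarity weight per neighbor;
                     None reduces to the baseline (unweighted) scheme
    """
    k = len(neighbor_tags)
    if weights is None:
        weights = [1.0] * k  # baseline: every neighbor votes with weight 1
    # (weighted) votes from neighbors that carry the tag
    votes = sum(w for tags, w in zip(neighbor_tags, weights) if tag in tags)
    # expected votes under random sampling: total vote mass
    # times the tag's frequency in the whole collection
    tag_freq = sum(1 for tags in collection_tags if tag in tags) / len(collection_tags)
    prior = sum(weights) * tag_freq
    return votes - prior

# toy data (hypothetical): 3 visual neighbors, 4-image collection
neighbors = [{"dog"}, {"dog", "grass"}, {"cat"}]
collection = [{"dog"}, {"cat"}, {"tree"}, {"dog"}]
baseline_dog = tag_relevance("dog", neighbors, collection)
weighted_dog = tag_relevance("dog", neighbors, collection, weights=[1.0, 0.5, 0.25])
```

A positive score means the tag occurs among the visual neighbors more often than chance predicts; the visually weighted variant simply replaces the unit votes with visual-similarity weights.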
Notes
The subsequent qualitative analysis does not assume that visual search is perfect.
References
Agrawal G (2011) Relevancy tag ranking. In: International conference on computer and communication technology, pp 169–173
Ahn L, Dabbish L (2004) Labeling images with a computer game. In: SIGCHI conference on human factors in computing systems, pp 319–326
Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM international conference on image and video retrieval (CIVR), pp 1–9
Feng S, Hong B, Lang C, Xu D (2011) Combining visual attention model with multi-instance learning for tag ranking. Neurocomputing 74(17):3619–3627
Ferreira J, Silva A, Delgado J (2004) How to improve retrieval effectiveness on the web. In: IADIS E-society conference, pp 1–9
Flickr’s Photostream (2012) Trend report—summer’12. http://www.flickr.com/photos/flickr/. Accessed 24 Aug 2012
Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: ACM international conference on multimedia information retrieval, pp 39–43
Jin Y, Khan L, Wang L, Awad M (2005) Image annotation by combining multiple evidence & WordNet. In: 13th ACM international conference on multimedia, pp 706–715
Kennedy L, Slaney M, Weinberger K (2009) Reliable tags using image similarity: mining specificity and expertise from large-scale multimedia databases. In: 17th ACM international conference on multimedia, pp 17–24
Lee S, De Neve W, Ro YM (2010) Tag refinement in an image folksonomy using visual similarity and tag co-occurrence statistics. Signal Process Image Commun 25(10):761–773
Li X, Snoek CGM, Worring M (2009) Learning social tag relevance by neighbor voting. IEEE Trans Multimedia 11(7):1310–1322
Li X, Snoek CGM, Worring M (2010) Unsupervised multi-feature tag relevance learning for social image retrieval. In: ACM international conference on image and video retrieval (CIVR), pp 10–17
Lindstaedt S, Morzinger R, Sorschag R, Pammer V, Thallinger G (2009) Automatic image annotation using visual content and folksonomies. Multimed Tools Appl 42(1):97–113
Liu D, Hua XS, Yan L, Wang M, Zhang HJ (2009) Tag ranking. In: 18th international conference on world wide web (WWW), pp 351–360
Liu D, Wang M, Yang L, Hua XS, Zhang HJ (2009) Tag quality improvement for social images. In: IEEE international conference on multimedia & expo (ICME), pp 350–353
Manjunath B, Salembier P, Sikora T (2003) Introduction to MPEG-7: multimedia content description interface. Wiley, New Jersey
OECD (2007) OECD study on the participative web: user generated content. http://www.oecd.org/dataoecd/57/14/38393115.pdf. Accessed 24 Aug 2012
PlanetTech (2012) Facebook reveals staggering new stats. http://www.planettechnews.com/business/item1094. Accessed 24 Aug 2012
Singh K, Ma M, Park D, An S (2005) Image indexing based on MPEG-7 scalable color descriptor. Key Eng Mater 277:375–382
Sun A, Bhowmick SS (2010) Quantifying tag representativeness of visual content of social images. In: 18th ACM international conference on multimedia, pp 471–480
Vander Wal T (2007) Folksonomy coinage and definition. http://www.vanderwal.net/folksonomy.html. Accessed 24 Aug 2012
van de Sande KEA, Gevers T, Snoek CGM (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell 32(9):1582–1596
Wang X, Yang M, Cour T, Zhu S, Yu K, Han TX (2011) Contextual weighting for vocabulary tree based image retrieval. In: IEEE international conference on computer vision, pp 6–13
Wu L, Yang L, Yu N, Hua XS (2009) Learning to tag. In: 18th international conference on world wide web (WWW), pp 361–370
Zhuang J, Hoi SCH (2011) A two-view learning approach for image tag ranking. In: ACM international conference on web search and data mining, pp 625–634
Acknowledgements
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012K2A1A2033054).
Appendix
This appendix details the derivation of the difference in accuracy of visual search over random sampling. To that end, given a seed image I, we make a distinction between a tag \(w_1\) relevant to the content of I and a tag \(w_2\) not relevant to the content of I.
Difference in search accuracy for \(w_1\). We make use of \(V_{I,w_1}(k)\) to represent the number of images relevant to \(w_1\) among the k visual neighbors of I. We assume that the value of \(V_{I,w_1}(k)\) is (1) upper-bounded by the number of images relevant to \(w_1\) when visual search works perfectly and (2) lower-bounded by the number of images relevant to \(w_1\) when making use of random sampling. This is conceptually illustrated by Fig. 6.
When visual search works perfectly, \(V_{I,w_1}(k)\) increases linearly from zero to \(|R_{w_1}|\) as k varies from zero to \(|R_{w_1}|\). Indeed, all images in the set of visual neighbors then belong to \(R_{w_1}\). For \(k>|R_{w_1}|\), \(V_{I,w_1}(k)=|R_{w_1}|\) because Φ only contains \(|R_{w_1}|\) images related to \(w_1\). This is denoted in Fig. 6 by “ideal”. When making use of random sampling, we assume that \(V_{I,w_1}(k)\) increases linearly and that all images of \(R_{w_1}\) can only be found in the set of visual neighbors when this set is identical to Φ (that is, when k is equal to |Φ|). This is denoted in Fig. 6 by “random”. In practice, we also assume that \(V_{I,w_1}(k)\) increases linearly, until its value is equal to \(|R_{w_1}|\). This is denoted in Fig. 6 by “real”. When visual search is effective, the dashed line will be close to “ideal”; otherwise, it will be close to “random”. In Fig. 6, k′ represents the minimal value of k for which all images of \(R_{w_1}\) can be found in the set of visual neighbors of I.
In general, given a tag \(w_1\), the accuracy of visual search \(A_{I,w_{1},k}\) can be written as \(V_{I,w_1}(k)/k\). Given the above observations made for \(V_{I,w_1}(k)\), \(A_{I,w_{1},k}\) can also be expressed as follows:
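Spelling out the “real” case described above — linear growth of \(V_{I,w_1}(k)\) with slope \(|R_{w_1}|/k'\) up to \(k = k'\), saturation at \(|R_{w_1}|\) afterwards — yields the piecewise form (a reconstruction under the stated linearity assumption):

\[
A_{I,w_1,k} = \frac{V_{I,w_1}(k)}{k} =
\begin{cases}
\dfrac{|R_{w_1}|}{k'}, & 0 < k \le k',\\[6pt]
\dfrac{|R_{w_1}|}{k}, & k' < k \le |\Phi|.
\end{cases}
\]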
The difference in accuracy of visual search over random sampling for w 1 can then be expressed as follows:
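With random sampling, the expected accuracy is constant at \(|R_{w_1}|/|\Phi|\), so under the same linearity assumption the difference takes the form (a reconstruction):

\[
A_{I,w_1,k} - \frac{|R_{w_1}|}{|\Phi|} =
\begin{cases}
|R_{w_1}|\left(\dfrac{1}{k'} - \dfrac{1}{|\Phi|}\right), & 0 < k \le k',\\[6pt]
|R_{w_1}|\left(\dfrac{1}{k} - \dfrac{1}{|\Phi|}\right), & k' < k \le |\Phi|.
\end{cases}
\]

Since \(k' \le |\Phi|\), this difference is non-negative: a relevant tag accumulates at least as many neighbor votes as expected under random sampling.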
Difference in search accuracy for \(w_2\). We make use of \(V_{I,w_2}(k)\) to represent the number of images relevant to \(w_2\) among the k visual neighbors of I. Further, we assume that the value of \(V_{I,w_2}(k)\) is (1) lower-bounded by the number of images relevant to \(w_2\) when visual search works perfectly and (2) upper-bounded by the number of images relevant to \(w_2\) when making use of random sampling. This is conceptually illustrated by Fig. 7.
When visual search works perfectly (in this case, when visual search finds all images relevant to I in Φ), the images in \(R_{w_2}\) should not be among the visual neighbors of I when \(k \le |R_I|\), where \(R_I\) represents the set of images relevant to I. Here, we assume that images are relevant to each other when they have semantic concepts in common (for the sake of simplicity, we also assume that images relevant to I are not relevant to \(w_2\)). However, for \(k > |R_I|\), the set of visual neighbors of I will start to contain images belonging to \(R_{w_2}\). This is denoted in Fig. 7 by “ideal”. When making use of random sampling, we assume that the number of images of \(R_{w_2}\) in the set of visual neighbors increases linearly as k varies from zero to |Φ|. This is denoted in Fig. 7 by “random”. In practice, we can find a k′ at which images of \(R_{w_2}\) start to appear in the set of visual neighbors, and we assume that their number increases linearly from that point on. This is denoted in Fig. 7 by “real”. The accuracy of visual search for \(w_2\), \(A_{I,w_2,k}\), is calculated by dividing \(V_{I,w_2}(k)\) by k:
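Under the assumption that \(V_{I,w_2}(k)\) is zero up to \(k = k'\) and then grows linearly, reaching \(|R_{w_2}|\) at \(k = |\Phi|\), this gives (a reconstruction under the stated assumptions):

\[
A_{I,w_2,k} = \frac{V_{I,w_2}(k)}{k} =
\begin{cases}
0, & 0 < k \le k',\\[6pt]
\dfrac{(k - k')\,|R_{w_2}|}{(|\Phi| - k')\,k}, & k' < k \le |\Phi|.
\end{cases}
\]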
The difference in accuracy of visual search over random sampling for w 2 can then be expressed as follows:
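With random sampling, the expected accuracy for \(w_2\) is constant at \(|R_{w_2}|/|\Phi|\); under the same linearity assumption, the difference takes the form (a reconstruction):

\[
A_{I,w_2,k} - \frac{|R_{w_2}|}{|\Phi|} =
\begin{cases}
-\dfrac{|R_{w_2}|}{|\Phi|}, & 0 < k \le k',\\[6pt]
|R_{w_2}|\left(\dfrac{k - k'}{(|\Phi| - k')\,k} - \dfrac{1}{|\Phi|}\right), & k' < k \le |\Phi|.
\end{cases}
\]

This difference is non-positive (it reaches zero only at \(k = |\Phi|\)), so a non-relevant tag accumulates fewer neighbor votes than expected under random sampling.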
Lee, S., De Neve, W. & Ro, Y.M. Visually weighted neighbor voting for image tag relevance learning. Multimed Tools Appl 72, 1363–1386 (2014). https://doi.org/10.1007/s11042-013-1439-3