
Visually weighted neighbor voting for image tag relevance learning


Abstract

The presence of non-relevant tags in image folksonomies hampers the effective organization and retrieval of user-contributed images. In this paper, we propose to learn the relevance of user-supplied tags by means of visually weighted neighbor voting, a variant of the popular baseline neighbor voting algorithm proposed by Li et al. (IEEE Trans Multimedia 11(7):1310–1322, 2009). To gain insight into the effectiveness of baseline and visually weighted neighbor voting, we qualitatively analyze how tag relevance values change as the number of neighbors varies, for both tags relevant and tags not relevant to the content of a seed image. Our qualitative analysis shows that tag relevance values computed by means of visually weighted neighbor voting are more stable and representative than those computed by means of baseline neighbor voting. This is quantitatively confirmed through extensive experimentation with MIRFLICKR-25000, studying the variation of tag relevance values as a function of the number of neighbors used (for both tags relevant and tags not relevant with respect to the content of a seed image), as well as the influence of tag relevance learning on the effectiveness of image tag refinement, tag-based image retrieval, and image tag recommendation.
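To make the two voting schemes concrete, the following Python sketch contrasts them. The data layout (tag sets, similarity scores), the prior-based vote correction, and all function names are illustrative assumptions based on the abstract, not the authors' exact formulation.

```python
# Minimal sketch of neighbor voting for tag relevance (illustrative only).
# neighbors: the k images visually most similar to the seed image, each a
# dict with a set of user-supplied tags and a visual similarity in [0, 1].
# tag_prior[tag]: fraction of images in the collection carrying the tag.

def tag_relevance_baseline(tag, neighbors, tag_prior):
    """Baseline neighbor voting in the spirit of Li et al. [11]:
    count the neighbors carrying the tag, minus the count expected
    if the k neighbors had been drawn at random from the collection."""
    k = len(neighbors)
    votes = sum(1 for n in neighbors if tag in n["tags"])
    return votes - k * tag_prior[tag]

def tag_relevance_visually_weighted(tag, neighbors, tag_prior):
    """Visually weighted variant (sketch): each neighbor's vote is
    weighted by its visual similarity to the seed image, so near
    neighbors influence the relevance value more than far ones."""
    votes = sum(n["similarity"] for n in neighbors if tag in n["tags"])
    expected = tag_prior[tag] * sum(n["similarity"] for n in neighbors)
    return votes - expected

# Toy usage with two hypothetical neighbors:
neighbors = [{"tags": {"beach", "sky"}, "similarity": 0.9},
             {"tags": {"beach"}, "similarity": 0.4}]
prior = {"beach": 0.05, "sky": 0.10}
print(tag_relevance_visually_weighted("beach", neighbors, prior))
```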


Notes

  1. http://www.flickr.com/

  2. http://www.facebook.com/

  3. The subsequent qualitative analysis does not assume that visual search is perfect.

References

  1. Agrawal G (2011) Relevancy tag ranking. In: International conference on computer and communication technology, pp 169–173

  2. von Ahn L, Dabbish L (2004) Labeling images with a computer game. In: SIGCHI conference on human factors in computing systems, pp 319–326

  3. Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: ACM international conference on image and video retrieval (CIVR), pp 1–9

  4. Feng S, Hong B, Lang C, Xu D (2011) Combining visual attention model with multi-instance learning for tag ranking. Neurocomputing 74(17):3619–3627


  5. Ferreira J, Silva A, Delgado J (2004) How to improve retrieval effectiveness on the web. In: IADIS E-society conference, pp 1–9

  6. Flickr’s Photostream (2012) Trend report—summer’12. http://www.flickr.com/photos/flickr/. Accessed 24 Aug 2012

  7. Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: ACM international conference on multimedia information retrieval, pp 39–43

  8. Jin Y, Khan L, Wang L, Awad M (2005) Image annotation by combining multiple evidence & WordNet. In: 13th ACM international conference on multimedia, pp 706–715

  9. Kennedy L, Slaney M, Weinberger K (2009) Reliable tags using image similarity: mining specificity and expertise from large-scale multimedia databases. In: 17th ACM international conference on multimedia, pp 17–24

  10. Lee S, De Neve W, Ro YM (2010) Tag refinement in an image folksonomy using visual similarity and tag co-occurrence statistics. Signal Processing: Image Communication 25(10):761–773


  11. Li X, Snoek CGM, Worring M (2009) Learning social tag relevance by neighbor voting. IEEE Trans Multimedia 11(7):1310–1322


  12. Li X, Snoek CGM, Worring M (2010) Unsupervised multi-feature tag relevance learning for social image retrieval. In: ACM international conference on image and video retrieval (CIVR), pp 10–17

  13. Lindstaedt S, Morzinger R, Sorschag R, Pammer V, Thallinger G (2009) Automatic image annotation using visual content and folksonomies. Multimed Tools Appl 42(1):97–113


  14. Liu D, Hua XS, Yang L, Wang M, Zhang HJ (2009) Tag ranking. In: 18th international conference on world wide web (WWW), pp 351–360

  15. Liu D, Wang M, Yang L, Hua XS, Zhang HJ (2009) Tag quality improvement for social images. In: IEEE international conference on multimedia & expo (ICME), pp 350–353

  16. Manjunath B, Salembier P, Sikora T (2003) Introduction to MPEG-7: multimedia content description interface. Wiley, New Jersey


  17. OECD (2007) OECD study on the participative web: user generated content. http://www.oecd.org/dataoecd/57/14/38393115.pdf. Accessed 24 Aug 2012

  18. PlanetTech (2012) Facebook reveals staggering new stats. http://www.planettechnews.com/business/item1094. Accessed 24 Aug 2012

  19. Singh K, Ma M, Park D, An S (2005) Image indexing based on MPEG-7 scalable color descriptor. Key Eng Mater 277:375–382


  20. Sun A, Bhowmick SS (2010) Quantifying tag representativeness of visual content of social images. In: 18th ACM international conference on multimedia, pp 471–480

  21. Vander Wal T (2007) Folksonomy coinage and definition. http://www.vanderwal.net/folksonomy.html. Accessed 24 Aug 2012

  22. van de Sande KEA, Gevers T, Snoek CGM (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell 32(9):1582–1596


  23. Wang X, Yang M, Cour T, Zhu S, Yu K, Han TX (2011) Contextual weighting for vocabulary tree based image retrieval. In: IEEE international conference on computer vision, pp 6–13

  24. Wu L, Yang L, Yu N, Hua XS (2009) Learning to tag. In: 18th international conference on world wide web (WWW), pp 361–370

  25. Zhuang J, Hoi SCH (2011) A two-view learning approach for image tag ranking. In: ACM international conference on web search and data mining, pp 625–634


Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2012K2A1A2033054).

Author information


Correspondence to Yong Man Ro.

Appendix

This appendix details the derivation of the difference in accuracy of visual search over random sampling. To that end, given a seed image I, we make a distinction between a tag \(w_1\) relevant to the content of I and a tag \(w_2\) not relevant to the content of I.

Difference in search accuracy for \(w_1\)   We make use of \(V_{I,w_1}(k)\) to represent the number of images relevant to \(w_1\) in the set of k visual neighbors of I. We assume that the value of \(V_{I,w_1}(k)\) is (1) upper-bounded by the number of images relevant to \(w_1\) when making use of perfectly working visual search and (2) lower-bounded by the number of images relevant to \(w_1\) when making use of random sampling. This is conceptually illustrated by Fig. 6.
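For concreteness, these two bounds can be stated as a single inequality (our reading of the assumption, with the lower bound equal to the count expected under uniform random sampling from the collection Φ and the upper bound attained by perfect visual search):

$$ \frac{|R_{w_1}|}{|\Phi|}\,k \;\leq\; V_{I,w_1}(k) \;\leq\; \min\left(k,\,|R_{w_1}|\right). $$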

Fig. 6 The number of images relevant to \(w_1\) in the set of k visual neighbors of I

When visual search works perfectly, \(V_{I,w_1}(k)\) increases linearly from zero to \(|R_{w_1}|\) as k varies from zero to \(|R_{w_1}|\). Indeed, all images in the set of visual neighbors then belong to \(R_{w_1}\). For \(k>|R_{w_1}|\), \(V_{I,w_1}(k)=|R_{w_1}|\) because Φ only contains \(|R_{w_1}|\) images relevant to \(w_1\). This case is denoted in Fig. 6 by “ideal”. When making use of random sampling, we assume that \(V_{I,w_1}(k)\) increases linearly and that all images of \(R_{w_1}\) can only be found in the set of visual neighbors when this set is identical to Φ (that is, when k is equal to |Φ|). This case is denoted in Fig. 6 by “random”. In practice, we likewise assume that \(V_{I,w_1}(k)\) increases linearly until it reaches \(|R_{w_1}|\). This case is denoted in Fig. 6 by “real”. When visual search is effective, the dashed line will be close to “ideal”; otherwise, it will be close to “random”. In Fig. 6, k′ represents the minimal value of k for which all images of \(R_{w_1}\) can be found in the set of visual neighbors of I.
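Under the “real” assumption, \(V_{I,w_1}(k)\) can thus be written piecewise (our explicit restatement of the linear-growth assumption above, and the step that yields (15) below):

$$ V_{I,w_1}(k) = \begin{cases} \dfrac{|R_{w_1}|}{k'}\,k, & k \leq k'\\[6pt] |R_{w_1}|, & k > k'. \end{cases} $$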

In general, given a tag \(w_1\), the accuracy of visual search \(A_{I,w_1,k}\) can be written as \(V_{I,w_1}(k)/k\). Given the above observations made for \(V_{I,w_1}(k)\), \(A_{I,w_1,k}\) can also be expressed as follows:

$$ \label{eq:15} A_{I,w_1,k} = \begin{cases} \dfrac{|R_{w_1}|}{k'}, & k \leq k'\\[6pt] \dfrac{|R_{w_1}|}{k}, & k > k'. \end{cases} \tag{15} $$

The difference in accuracy of visual search over random sampling for \(w_1\) can then be expressed as follows:

$$ \label{eq:16} \epsilon_{I,w_1,k} = \begin{cases} \dfrac{|R_{w_1}|}{k'}-P(R_{w_1}), & k \leq k'\\[6pt] \dfrac{|R_{w_1}|}{k}-P(R_{w_1}), & k > k'. \end{cases} \tag{16} $$

Difference in search accuracy for \(w_2\)   We make use of \(V_{I,w_2}(k)\) to represent the number of images relevant to \(w_2\) in the set of k visual neighbors of I. Further, we assume that the value of \(V_{I,w_2}(k)\) is (1) lower-bounded by the number of images relevant to \(w_2\) when visual search works perfectly and (2) upper-bounded by the number of images relevant to \(w_2\) when making use of random sampling. This is conceptually illustrated by Fig. 7.
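Analogously to the case of \(w_1\), these bounds can be made explicit (again our formalization): the count is at least zero, which is what perfect visual search initially yields, and at most the count expected under uniform random sampling:

$$ 0 \;\leq\; V_{I,w_2}(k) \;\leq\; \frac{|R_{w_2}|}{|\Phi|}\,k. $$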

Fig. 7 The number of images relevant to \(w_2\) in the set of k visual neighbors of I

When visual search works perfectly (in this case, when visual search finds all images relevant to I in Φ), the images in \(R_{w_2}\) should not be among the visual neighbors of I when \(k \leq |R_I|\), where \(R_I\) represents the set of images relevant to I. Here, we assume that images are relevant to each other when they have semantic concepts in common (for the sake of simplicity, we also assume that images relevant to I are not relevant to \(w_2\)). However, for \(k > |R_I|\), the set of visual neighbors of I will start to contain images belonging to \(R_{w_2}\). This case is denoted in Fig. 7 by “ideal”. When making use of random sampling, we assume that the number of images of \(R_{w_2}\) in the set of visual neighbors increases linearly as k varies from zero to |Φ|. This case is denoted in Fig. 7 by “random”. In practice, there exists a value k′ at which images of \(R_{w_2}\) start to appear in the set of visual neighbors, and we assume that their number increases linearly thereafter. This case is denoted in Fig. 7 by “real”.
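Making the “real” assumption explicit (our restatement of the linear growth between k′ and |Φ|), \(V_{I,w_2}(k)\) can be written as:

$$ V_{I,w_2}(k) = \begin{cases} 0, & k \leq k'\\[6pt] \dfrac{|R_{w_2}|\,(k-k')}{|\Phi|-k'}, & k > k'. \end{cases} $$

The accuracy of visual search for \(w_2\), \(A_{I,w_2,k}\), is then obtained by dividing \(V_{I,w_2}(k)\) by k: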

$$ \label{eq:17} A_{I,w_2,k} = \begin{cases} 0, & k \leq k'\\[6pt] \dfrac{|R_{w_2}|-|R_{w_2}| \cdot \frac{k'}{k}}{|\Phi|-k'}, & k > k'. \end{cases} \tag{17} $$

The difference in accuracy of visual search over random sampling for \(w_2\) can then be expressed as follows:

$$ \label{eq:18} \epsilon_{I,w_2,k} = \begin{cases} -P(R_{w_2}), & k \leq k'\\[6pt] \dfrac{|R_{w_2}|-|R_{w_2}| \cdot \frac{k'}{k}}{|\Phi|-k'}-P(R_{w_2}), & k > k'. \end{cases} \tag{18} $$
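As a quick numerical illustration of (16) and (18), the following sketch evaluates both differences for toy values. The collection size, relevant-set size, and k′ below are arbitrary assumptions, and \(P(R_w)\) is taken as \(|R_w|/|\Phi|\), its value under uniform random sampling.

```python
# Numerical illustration of Eqs. (16) and (18). All constants are
# arbitrary toy values; P(R_w) is taken as |R_w| / |Phi|.

def eps_relevant(k, n_rel, k_prime, phi):
    """Eq. (16): accuracy difference for a tag w1 relevant to I."""
    p_random = n_rel / phi
    return (n_rel / k_prime if k <= k_prime else n_rel / k) - p_random

def eps_irrelevant(k, n_rel, k_prime, phi):
    """Eq. (18): accuracy difference for a tag w2 not relevant to I."""
    p_random = n_rel / phi
    if k <= k_prime:
        return -p_random
    return (n_rel - n_rel * k_prime / k) / (phi - k_prime) - p_random

if __name__ == "__main__":
    PHI = 25000                  # collection size (as in MIRFLICKR-25000)
    N_REL, K_PRIME = 500, 800    # toy values for |R_w| and k'
    for k in (100, 800, 2000, 10000, 25000):
        print(k, round(eps_relevant(k, N_REL, K_PRIME, PHI), 4),
                 round(eps_irrelevant(k, N_REL, K_PRIME, PHI), 4))
```

Both differences vanish as k approaches |Φ|, reflecting that visual search degenerates to random sampling once the neighbor set spans the entire collection.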


Cite this article

Lee, S., De Neve, W. & Ro, Y.M. Visually weighted neighbor voting for image tag relevance learning. Multimed Tools Appl 72, 1363–1386 (2014). https://doi.org/10.1007/s11042-013-1439-3
