Multimedia Systems

Volume 22, Issue 4, pp 405–412

On the tag localization of web video

  • Haojie Li
  • Bin Liu
  • Lei Yi
  • Yue Guan
  • Zhong-Xuan Luo
Special Issue Paper


Nowadays, numerous social videos pervade the web. Social web videos are characterized by rich accompanying contextual information that describes the content of the videos and thus greatly facilitates video search and browsing. Generally, contextual data such as tags are provided at the whole-video level, without temporal indication of when they actually appear in the video, let alone spatial annotation of object-related tags in the video frames. However, many tags describe only parts of the video content. Therefore, tag localization, the process of assigning tags to the relevant underlying video segments, frames, or even regions within frames, is attracting increasing research interest, and a benchmark dataset for the fair evaluation of tag localization algorithms is highly desirable. In this paper, we describe and release a dataset called DUT-WEBV, which contains about 4,000 videos collected from the YouTube portal by issuing 50 concepts as queries. These concepts cover a wide range of semantic aspects, including scenes like “mountain”, events like “flood”, objects like “cows”, sites like “gas station”, and activities like “handshaking”, offering great challenges to the tag (i.e., concept) localization task. For each video of a tag, we carefully annotate the time durations during which the tag appears in the video, and for object-related tags we also label the spatial location of the object with a mask in the frames. Besides the videos themselves, contextual information such as thumbnail images, titles, and YouTube categories is also provided. Together with this benchmark dataset, we present a baseline for tag localization using a multiple instance learning approach. Finally, we discuss some open research issues for tag localization in web videos.
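The baseline casts tag localization as multiple instance learning (MIL): each video is a bag of frames, the video-level tag labels the bag, and at least one frame in a positive bag is assumed relevant. The sketch below is a minimal illustration of that idea only, not the paper's actual baseline; the nearest-centroid instance scorer and the iterative label refinement are assumptions chosen for brevity:

```python
import numpy as np

def mil_tag_localization(bags, bag_labels, n_iters=10):
    """Minimal MIL sketch (illustrative, not the paper's baseline):
    iteratively refine frame-level (instance) labels starting from
    video-level (bag) labels, using a nearest-centroid frame scorer.

    bags: list of (n_frames_i, d) arrays of frame features.
    bag_labels: 1 if the tag is attached to the video, else 0.
    Returns a list of per-bag 0/1 frame relevance arrays."""
    # Initialize: every frame inherits its video-level label.
    inst_labels = [np.full(len(b), y, dtype=float)
                   for b, y in zip(bags, bag_labels)]
    X = np.vstack(bags)
    for _ in range(n_iters):
        y = np.concatenate(inst_labels)
        pos, neg = X[y >= 0.5], X[y < 0.5]
        if len(pos) == 0 or len(neg) == 0:
            break
        c_pos, c_neg = pos.mean(axis=0), neg.mean(axis=0)
        new = []
        for b, yb in zip(bags, bag_labels):
            # A frame is relevant if it lies closer to the positive
            # centroid than to the negative one.
            s = (np.linalg.norm(b - c_neg, axis=1)
                 - np.linalg.norm(b - c_pos, axis=1))
            lab = (s > 0).astype(float)
            if yb == 1 and lab.max() == 0:
                lab[np.argmax(s)] = 1.0  # a positive bag keeps >= 1 positive frame
            if yb == 0:
                lab[:] = 0.0             # negative bags contain no relevant frames
            new.append(lab)
        inst_labels = new
    return inst_labels
```

On a toy example where only the first two frames of a tagged video actually show the concept, the refinement separates relevant frames from background frames, which is exactly the temporal localization the dataset's annotations are meant to evaluate.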


Keywords: Video annotation · Tag localization · Video retrieval



This work was supported by National Natural Science Funds of China (61033012, 61173104, 61300085) and the Fundamental Research Funds for the Central Universities (DUT13JR03, DUT14QY03).



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Haojie Li
  • Bin Liu
  • Lei Yi
  • Yue Guan
  • Zhong-Xuan Luo

School of Software, Dalian University of Technology, Dalian, China
