Multimedia Tools and Applications

Volume 74, Issue 4, pp 1291–1315

Best practices for learning video concept detectors from social media examples

  • Svetlana Kordumova
  • Xirong Li
  • Cees G. M. Snoek


Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that key questions crucial to knowing how to learn effective video concept detectors from social media examples remain open. As an initial attempt to answer these questions, we conduct an experimental study using a video search engine that is capable of learning concept detectors from social media examples, be they socially tagged videos or socially tagged images. Within the video search engine we investigate three strategies for positive example selection, three strategies for negative example selection, and three learning strategies. The performance is evaluated on the challenging TRECVID 2012 benchmark consisting of 600 h of Internet video. From the experiments we derive four best practices: (1) tagged images are a better source for learning video concepts than tagged videos, (2) selecting tag-relevant positive training examples is always beneficial, (3) selecting relevant negative examples is advantageous and should be treated differently for video and image sources, and (4) selecting relevant training data before learning concept detectors is better than incorporating the relevance during the learning process. The best practices within our video search engine lead to state-of-the-art performance in the TRECVID 2013 benchmark for concept detection without manually provided annotations.


Video retrieval, Concept detection, Social media



This research is supported by the STW STORY project, the Dutch national program COMMIT, the Chinese NSFC (No. 61303184), SRFDP (No. 20130004120006), the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 14XNLQ01), and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20067. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Svetlana Kordumova (1)
  • Xirong Li (2)
  • Cees G. M. Snoek (1)
  1. Intelligent Systems Lab Amsterdam, University of Amsterdam, Amsterdam, The Netherlands
  2. Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing, China
