Visual Concept Learning from Weakly Labeled Web Videos

  • Adrian Ulges
  • Damian Borth
  • Thomas M. Breuel
Part of the Studies in Computational Intelligence book series (SCI, volume 287)


Concept detection is a core component of video database search, concerned with the automatic recognition of visually diverse categories of objects (“airplane”), locations (“desert”), or activities (“interview”). The task poses a difficult challenge as the amount of accurately labeled data available for supervised training is limited and coverage of concept classes is poor. In order to overcome these problems, we describe the use of videos found on the web as training data for concept detectors, using tagging and folksonomies as annotation sources. This permits us to scale up training to very large data sets and concept vocabularies.

To take advantage of user-supplied tags on the web, we need to overcome problems of label weakness: web tags are context-dependent, unreliable, and coarse. Our approach to addressing this problem is to automatically identify and filter non-relevant material. We demonstrate on a large database of videos retrieved from the web that this approach, called relevance filtering, leads to significant improvements over supervised learning techniques for categorization. In addition, we show how the approach can be combined with active learning to achieve additional performance improvements at moderate annotation cost.







Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Adrian Ulges (1)
  • Damian Borth (2)
  • Thomas M. Breuel (2)

  1. German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
  2. University of Kaiserslautern, Kaiserslautern, Germany
