Abstract
Concept detection is a core component of video database search, concerned with the automatic recognition of visually diverse categories of objects (“airplane”), locations (“desert”), or activities (“interview”). The task poses a difficult challenge as the amount of accurately labeled data available for supervised training is limited and coverage of concept classes is poor. In order to overcome these problems, we describe the use of videos found on the web as training data for concept detectors, using tagging and folksonomies as annotation sources. This permits us to scale up training to very large data sets and concept vocabularies.
In order to take advantage of user-supplied tags on the web, we need to overcome problems of label weakness; web tags are context-dependent, unreliable and coarse. Our approach to addressing this problem is to automatically identify and filter non-relevant material. We demonstrate on a large database of videos retrieved from the web that this approach - called relevance filtering - leads to significant improvements over supervised learning techniques for categorization. In addition, we show how the approach can be combined with active learning to achieve additional performance improvements at moderate annotation cost.
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Ayache, S., Quenot, G.: Evaluation of active learning strategies for video indexing. Signal Processing: Image Communication 22(7-8), 692–704 (2007)
Ayache, S., Quenot, G.: Video Corpus Annotation using Active Learning. In: Proc. Europ. Conf. on Information Retrieval, pp. 187–198 (March 2008)
Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D., Jordan, M.: Matching Words and Pictures. J. Mach. Learn. Res. 3, 1107–1135 (2003)
Berg, T., Forsyth, D.: Animals on the Web. In: Proc. Int. Conf. Computer Vision and Pattern Recognition, pp. 1463–1470 (June 2006)
Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-training. In: Proc. Ann. Conf. on Computational Learning Theory, pp. 92–100 (July 1998)
Snoek, C., et al.: The MediaMill TRECVID 2007 Semantic Video Search Engine. In: Proc. TRECVID Workshop (unreviewed workshop paper) (November 2007)
Campbell, M., Haubold, A., Liu, M., Natsev, A., Smith, J., Tesic, J., Xie, L., Yan, R., Yang, J.: IBM Research TRECVID-2007 Video Retrieval System. In: Proc. TRECVID Workshop (unreviewed workshop paper) (November 2007)
Chang, C.-C., Lin, C.-J. (LIBSVM): A Library for Support Vector Machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-supervised Learning. MIT Press, Cambridge (2006)
Chen, M., Christel, M., Hauptmann, A., Wactlar, H.: Putting Active Learning into Multimedia Applications: Dynamic Definition and Refinement of Concept Classifiers. In: Proc. Int. Conf. on Multimedia, pp. 902–911 ( November 2005)
Dempster, A., Laird, N., Rubin, D.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)
Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley Interscience, Hoboken (2000)
Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results (October 2008)
Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning Object Categories from Google’s Image Search. Computer Vision 2, 1816–1823 (2005)
Gargi, U., Yagnik, J.: Solving the Label Resolution Problem in Supervised Video Content Classification. In: Proc. Int. Conf. on Multimedia Retrieval, pp. 276–282 (October 2008)
Gu, Z., Mei, T., Hua, X.-S., Tang, J., Wu, X.: Multi-layer Multi-instance Kernel for Video Concept Detection. In: Proc. Int. Conf. on Multimedia, pp. 349–352 (September 2007)
Hauptmann, A., Yan, R., Lin, W.: How many High-Level Concepts will Fill the Semantic Gap in News Video Retrieval? In: Proc. Int. Conf. Image and Video Retrieval, pp. 627–634 (July 2007)
Hofmann, T.: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42, 177–196 (2001)
Yuan, J., et al.: THU and ICRC at TRECVID 2007. In: Proc. TRECVID Workshop (unreviewed workshop paper) (November 2007)
Joachims, T.: Transductive Inference for Text Classification using Support Vector Machines. In: Int. Conf. Machine Learning, pp. 200–209 (June 1999)
Kennedy, L., Chang, S.-F., Kozintsev, I.: To Search or to Label?: Predicting the Performance of Search-based Automatic Image Classifiers. In: Int. Workshop Multimedia Information Retrieval, pp. 249–258 (October 2006)
Kraaij, W., Over, P.: TRECVID-2007 High-Level Feature Task: Overview. In: Proc. TRECVID Workshop (November 2007)
Lewis, D., Gale, W.: A Sequential Algorithm for Training Text Classifiers. In: Proc. Int. Conf. Research and Development in Information Retrieval, pp. 3–12 (July 1994)
Li, L.-J., Wang, G., Fei-Fei, L.: OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. In: Proc. Int. Conf. Computer Vision and Pattern Recognition, pp. 57–64 (June 2007)
Lowe, D.: Object Recognition from Local Scale-Invariant Features. In: Int. Conf. Computer Vision, pp. 1150–1157 (September 1999)
Morsillo, N., Pal, C., Nelson, R.: Semi-supervised Visual Scene and Object Analysis from Web Images and Text. In: Scene Understanding Symposium (February 2008)
Naphade, M., Smith, J., Tesic, J., Chang, S., Hsu, W., Kennedy, L., Hauptmann, A., Curtis, J.: Large-Scale Concept Ontology for Multimedia. IEEE MultiMedia 13(3), 86–91 (2006)
Paredes, R., Perez-Cortes, A.: Local Representations and a Direct Voting Scheme for Face Recognition. In: Proc. Workshop on Pattern Rec. and Inf. Systems, pp. 71–79 (July 2001)
Salton, G., Buckley, C.: Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41(4), 288–297 (1990)
Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
Schroff, F., Criminisi, A., Zisserman, A.: Harvesting Image Databases from the Web. In: Proc. Int. Conf. Computer Vision, pp. 1–8 (October 2007)
Settles, B.: Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009)
Sivic, J., Zisserman, A.: Video Google: Efficient Visual Search of Videos. In: Toward Category-Level Object Recognition, pp. 127–144. Springer, New York, Inc. (2006)
Smeaton, A.: Techniques Used and Open Challenges to the Analysis, Indexing and Retrieval of Digital Video. Inf. Syst. 32(4), 545–559 (2007)
Smeaton, A., Over, P., Kraaij, W.: Evaluation Campaigns and TRECVID. In: Int. Workshop Multimedia Information Retrieval, pp. 321–330 (October 2006)
Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000)
Snoek, C., Worring, M.: Concept-based Video Retrieval. Foundations and Trends in Information Retrieval 4(2), 215–322 (2009)
Snoek, C., Worring, M., de Rooij, O., van de Sande, K., Yan, R., Hauptmann, A.: VideOlympics: Real-Time Evaluation of Multimedia Retrieval Systems. IEEE MultiMedia 15(1), 86–91 (2008)
Snoek, C., Worring, M., Huurnink, B., van Gemert, J., van de Sande, K., Koelma, D., de Rooij, O.: MediaMill: Video Search using a Thesaurus of 500 Machine Learned Concepts. In: 1st Int. Conf. Sem. Dig. Media Techn (Posters and Demos.) (2006)
Sun, Y., Shimada, S., Taniguchi, Y., Kojima, A.: A Novel Region-based Approach to Visual Concept Modeling using Web Images. In: Int. Conf. Multimedia, pp. 635–638 (2008)
Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval. In: Proc. Int. Conf. on Multimedia, pp. 107–118 (September 2001)
Turlach, B.: Bandwidth Selection in Kernel Density Estimation: A Review. In: CORE and Institut de Statistique, pp. 23–49 (1993)
Ulges, A., Schulze, C., Keysers, D., Breuel, T.: Identifying Relevant Frames in Weakly Labeled Videos for Training Concept Detectors. In: Proc. Int. Conf. Image and Video Retrieval, pp. 9–16 (July 2008)
YouTube Serves up 100 Million Videos a Day Online. USA Today (Garnnett Company, Inc.) (July 2006), http://www.usatoday.com/tech/news/2006-07-16-youtube-views_x.htm (retrieved, September 2008)
van de Sande, K., Gevers, T., Snoek, C.: A Comparison of Color Features for Visual Concept Classification. In: Proc. Int. Conf. Image and Video Retrieval, pp. 141–150 (July 2008)
Wang, D., Liu, X., Luo, L., Li, J., Zhang, B.: Video Diver: Generic Video Indexing with Diverse Features. In: Proc. Int. Workshop Multimedia Information Retrieval, pp. 61–70 (September 2007)
Wang, M., Hua, X.-S., Song, Y., Yuan, X., Li, S., Zhang, H.-J.: Automatic Video Annotation by Semi-supervised Learning with Kernel Density Estimation. In: Proc. Int. Conf. on Multimedia, October 2006, pp. 967–976 (2006)
Wnuk, K., Soatto, S.: Filtering Internet Image Search Results Towards Keyword Based Category Recognition. In: Proc. Int. Conf. Computer Vision and Pattern Recognition, pp. 1–8 (June 2008)
Yanagawa, A., Chang, S.-F., Kennedy, L., Hsu, W.: Columbia University’s Baseline Detectors for 374 LSCOM Semantic Visual Concepts. Technical report, Columbia University (2007)
Yanai, K., Barnard, K.: Probabilistic Web Image Gathering. In: Int. Workshop on Multimedia Inf. Retrieval, November 2005, pp. 57–64 (2005)
Yang, J., Hauptmann, A.: (Un)Reliability of Video Concept Detection. In: Proc. Int. Conf. Image and Video Retrieval, July 2008, pp. 85–94 (2008)
Zhu, X.: Semi-supervised Learning Literature Survey. Technical Report 1530, Computer Sciences, University of Wisconsin, Madison (2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ulges, A., Borth, D., Breuel, T.M. (2010). Visual Concept Learning from Weakly Labeled Web Videos. In: Schonfeld, D., Shan, C., Tao, D., Wang, L. (eds) Video Search and Mining. Studies in Computational Intelligence, vol 287. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12900-1_8
Download citation
DOI: https://doi.org/10.1007/978-3-642-12900-1_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12899-8
Online ISBN: 978-3-642-12900-1
eBook Packages: EngineeringEngineering (R0)