Abstract
We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatiotemporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
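The weak-supervision idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the 2-D features, the synthetic videos, the tag names, and the plain logistic-regression trainer are all assumptions made for illustration, and the graph-cut refinement step is omitted. Every segment inherits the tag of its video as a noisy label, a linear classifier is trained on those labels, and the highest-scoring segment in each tagged video is taken as an object seed.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, steps=300, lr=0.1):
    """Batch gradient descent for logistic regression (hypothetical stand-in
    for the segment-level classifier; the paper's learner may differ)."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def make_video(rng, tag, has_object):
    """Synthetic video: 5 segments, each a (feature, is_object) pair.
    Object segments cluster near (2, 2), background near (0, 0)."""
    segments = []
    for k in range(5):
        is_obj = has_object and k < 2
        center = 2.0 if is_obj else 0.0
        feature = [rng.gauss(center, 0.3), rng.gauss(center, 0.3)]
        segments.append((feature, is_obj))
    return tag, segments


rng = random.Random(0)
videos = [make_video(rng, "dog", True) for _ in range(6)]
videos += [make_video(rng, "other", False) for _ in range(6)]

# Weak labels: each segment gets its video's tag, so background segments
# inside "dog" videos are (wrongly) labeled positive. is_object is never
# used for training, only for checking the result afterwards.
features = [f for _, segs in videos for f, _ in segs]
labels = [1.0 if tag == "dog" else 0.0 for tag, segs in videos for _ in segs]
w, b = train_logistic(features, labels)

object_scores, background_scores, seeds = [], [], []
for tag, segs in videos:
    if tag != "dog":
        continue
    scored = [(score(w, b, f), is_obj) for f, is_obj in segs]
    seeds.append(max(scored))  # top-scoring segment serves as the object seed
    for s, is_obj in scored:
        (object_scores if is_obj else background_scores).append(s)

print("mean object score:    ", sum(object_scores) / len(object_scores))
print("mean background score:", sum(background_scores) / len(background_scores))
```

Because background segments occur in both tagged and untagged videos, while object segments occur only in tagged ones, the classifier tends to score object segments higher despite the label noise. In a full pipeline the seeds would then initialize a graph-cut refinement.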
Keywords
- Object Segmentation
- Visual Concept
- Video Segmentation
- Multiple Instance Learning
- Video Stabilization
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hartmann, G. et al. (2012). Weakly Supervised Learning of Object Segmentations from Web-Scale Video. In: Fusiello, A., Murino, V., Cucchiara, R. (eds) Computer Vision – ECCV 2012. Workshops and Demonstrations. ECCV 2012. Lecture Notes in Computer Science, vol 7583. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33863-2_20
DOI: https://doi.org/10.1007/978-3-642-33863-2_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33862-5
Online ISBN: 978-3-642-33863-2
eBook Packages: Computer Science, Computer Science (R0)